<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>13. Harvester and Harvest List Editor &mdash; Metacat 2.10.4 documentation</title>
    <link rel="stylesheet" href="_static/bootstrap.min.css" type="text/css" />
    <link rel="stylesheet" href="_static/font-awesome/css/font-awesome.min.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <link rel="stylesheet" href="_static/metacatui.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '2.10.4',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="top" title="Metacat 2.10.4 documentation" href="index.html" />
    <link rel="prev" title="12. Replication" href="replication.html" />
    <link rel="next" title="14. OAI Protocol for Metadata Harvesting" href="oaipmh.html" /> 
  </head>
  <body>
  <div id="metacatDocs">
	  <div class="banner">
	      <a href="index.html"><img class="logo" src="_static/metacat-logo-white.png" /></a>
	      <a href="index.html"><h1 class="title">Metacat: Metadata and Data Management Server</h1></a>
	      <img class="logo-right" src="_static/nceas-logo-white.png" />
	  </div>

	
    <div class="document">
     	 <div class="documentwrapper">
	        <div class="bodywrapper">
	          <div class="body">
	            
  <div class="section" id="harvester-and-harvest-list-editor">
<h1>13. Harvester and Harvest List Editor<a class="headerlink" href="#harvester-and-harvest-list-editor" title="Permalink to this headline">¶</a></h1>
<p>Metacat&#8217;s Harvester is an optional feature that can be used to automatically
retrieve EML documents from one or more custom data management systems (e.g.,
SRB or PostgreSQL) and to insert (or update) those documents in the home
repository. The local sites control when they are harvested and which documents
are harvested.</p>
<p>For example, the Long Term Ecological Research Network (LTER) uses the Metacat
Harvester to create a centralized repository of data stored at twenty-six
different sites that all store EML metadata but use different data management
systems. Once the data have been harvested and placed into a centralized
repository, they are replicated to the KNB network, exposing the information
to an even larger scientific community.</p>
<p>Once the Harvester is properly configured, listed documents are retrieved and
uploaded on a regularly scheduled basis. You must configure both the home
Metacat and the remote sites (aka the &#8220;harvest sites&#8221;) before using this
feature. Local sites must also provide the Metacat server with a list of
documents that should be harvested.</p>
<div class="section" id="configuring-harvester">
<h2>13.1. Configuring Harvester<a class="headerlink" href="#configuring-harvester" title="Permalink to this headline">¶</a></h2>
<p>Before you can use the Harvester to retrieve documents, you must configure the
feature using the settings in the metacat.properties file. Note that you must
also configure each site that the Harvester will connect to and retrieve
documents from (see section 13.3 for details).</p>
<p>The Harvester configuration information is managed in the metacat.properties
file, which is located at:</p>
<div class="highlight-python"><div class="highlight"><pre>&lt;CONTEXT_DIR&gt;/WEB-INF/metacat.properties
</pre></div>
</div>
<p>The Harvester properties are grouped together and begin after the comment line:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Harvester properties</span>
</pre></div>
</div>
<p>To configure Harvester, edit the metacat.properties file and set appropriate values
for the harvesterAdministrator and smtpServer properties. You may also wish to
customize the other Harvester parameters, each discussed in the table below.</p>
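<p>As a quick sanity check before starting Harvester, the two required settings can
be verified with a short script. The following is a minimal sketch, not part of
Metacat: it assumes plain <code class="docutils literal"><span class="pre">key=value</span></code> lines and uses the property names
exactly as they appear in this chapter (the keys in an installed
metacat.properties file may differ, for example by carrying a prefix). The
helper names <code class="docutils literal"><span class="pre">parse_properties</span></code> and <code class="docutils literal"><span class="pre">missing_required</span></code> are hypothetical.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
# Property names follow the table in this chapter; the exact keys in a given
# metacat.properties file may differ (for example, they may carry a prefix).
REQUIRED = ("harvesterAdministrator", "smtpServer")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_required(props):
    """Return the required Harvester properties that are absent or empty."""
    return [name for name in REQUIRED if not props.get(name)]


sample = """
# Harvester properties
connectToMetacat=true
harvesterAdministrator=name1@abc.edu
smtpServer=
"""
print(missing_required(parse_properties(sample)))
```
</pre></div>
</div>
<p>For the sample text above, the script reports smtpServer as missing because its
value is empty.</p>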
</div>
<div class="section" id="harvester-properties-and-their-functions">
<h2>13.2. Harvester Properties and their Functions<a class="headerlink" href="#harvester-properties-and-their-functions" title="Permalink to this headline">¶</a></h2>
<table border="1" class="docutils">
<colgroup>
<col width="27%" />
<col width="72%" />
<col width="1%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property</th>
<th class="head">Description and Values</th>
<th class="head">&nbsp;</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>connectToMetacat</td>
<td><p class="first">Determines whether Harvester should connect to Metacat to upload retrieved documents.
Set to true (the default) under most circumstances. To test whether Harvester can
retrieve documents from a site without actually connecting to Metacat
to upload the documents, set the value to false.</p>
<p class="last">Values: true/false</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>delay</td>
<td><p class="first">The number of hours that Harvester will wait before beginning its first harvest.
For example, if Harvester is run at 1:00 p.m., and the delay is set to 12,
Harvester will begin its first harvest at 1:00 a.m.</p>
<p class="last">Default: 0</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td>harvesterAdministrator</td>
<td><p class="first">The email address of the Harvester Administrator. Harvester will send
email reports to this address after every harvest. Enter multiple email addresses by separating
each address with a comma or semicolon (e.g., <a class="reference external" href="mailto:name1&#37;&#52;&#48;abc&#46;edu">name1<span>&#64;</span>abc<span>&#46;</span>edu</a>,name2&#64;abc.edu).</p>
<p class="last">Values: An email address, or multiple email addresses separated by commas or semi-colons</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>logPeriod</td>
<td><p class="first">The number of days to retain Harvester log entries. Harvester log entries
record information such as which documents were harvested, from which sites,
and whether any errors were encountered during the harvest. Log entries older
than logPeriod number of days are purged from the database at the end of each harvest.</p>
<p class="last">Default: 90</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td>maxHarvests</td>
<td><p class="first">The maximum number of harvests that Harvester should execute before
shutting down. If the value of maxHarvests is set to 0 or a
negative number, Harvester will execute indefinitely.</p>
<p class="last">Default: 0</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>period</td>
<td><p class="first">The number of hours between harvests. Harvester will run a new harvest
every specified period of hours (either indefinitely or until the maximum
number of harvests have run, depending on the value of maxHarvests).</p>
<p class="last">Default: 24</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td>smtpServer</td>
<td><p class="first">The SMTP server (e.g., somehost.institution.edu) that Harvester uses for
sending email messages to the Harvester Administrator and Site Contacts.
Note that the default value only works
if the Harvester host machine is configured as an SMTP server.</p>
<p class="last">Default: localhost</p>
</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>Harvester Operation Properties
(GetDocError, GetDocSuccess, etc.)</td>
<td>The Harvester Operation properties are used by Harvester to report information
about performed operations for inclusion in log entries and email messages.
Under most circumstances the values of these properties should not be modified.</td>
<td>&nbsp;</td>
</tr>
</tbody>
</table>
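<p>The interaction of delay, period, and maxHarvests can be illustrated with a small
calculation. This is only a sketch of the schedule as described in the table, not
Harvester&#8217;s own scheduling code; the function name <code class="docutils literal"><span class="pre">harvest_times</span></code>, the
<code class="docutils literal"><span class="pre">preview</span></code> parameter, and the start date are assumptions made for the example.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
from datetime import datetime, timedelta


def harvest_times(start, delay, period, max_harvests, preview=5):
    """Return the first few harvest times implied by the three properties.

    delay and period are in hours. A max_harvests of 0 or a negative number
    means Harvester runs indefinitely, so only `preview` times are listed.
    """
    unlimited = max(max_harvests, 0) == 0
    count = preview if unlimited else min(max_harvests, preview)
    first = start + timedelta(hours=delay)
    return [first + timedelta(hours=period * i) for i in range(count)]


# Harvester started at 1:00 p.m. with delay=12, period=24, maxHarvests=0.
times = harvest_times(datetime(2024, 1, 1, 13, 0), 12, 24, 0)
print(times[0].strftime("%Y-%m-%d %H:%M"))
```
</pre></div>
</div>
<p>With these values the first harvest falls at 1:00 a.m. on the following day,
matching the example given for the delay property, and later harvests follow
every 24 hours.</p>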
</div>
<div class="section" id="configuring-a-harvest-site-instructions-for-site-contact">
<h2>13.3. Configuring a Harvest Site (Instructions for Site Contact)<a class="headerlink" href="#configuring-a-harvest-site-instructions-for-site-contact" title="Permalink to this headline">¶</a></h2>
<p>After Metacat&#8217;s Harvester has been configured, remote sites can register and
send information about which files should be retrieved. Each remote site must
have a site contact who is responsible for registering the site and creating a
list of EML files to harvest (the &#8220;Harvest List&#8221;), as well as for reviewing
harvest reports. The site contact can unregister the site from the Harvester
at any time.</p>
<p>To use Harvester:</p>
<ol class="arabic simple">
<li>Register with Harvester</li>
<li>Compose a Harvest List (you will likely wish to use the Harvest List Editor)</li>
<li>Prepare your EML Documents for Harvest</li>
<li>Review the Harvester Reports</li>
</ol>
<div class="section" id="register-with-harvester">
<h3>13.3.1. Register with Harvester<a class="headerlink" href="#register-with-harvester" title="Permalink to this headline">¶</a></h3>
<p>To register a remote site with Harvester, the Site Contact should log in to
Metacat&#8217;s Harvester Registration page and enter information about the site and
how it should be harvested.</p>
<ol class="arabic">
<li><p class="first">Using a Web browser, log in to Metacat&#8217;s Harvester Registration page.
The Harvester Registration page is inside the skins directory. For example,
if the Metacat server that you wish to register with resides at the following URL:</p>
<div class="highlight-python"><div class="highlight"><pre>http://somehost.somelocation.edu:8080/metacat/index.jsp
</pre></div>
</div>
<p>then the Harvester Registration page would be accessed at:</p>
<div class="highlight-python"><div class="highlight"><pre>http://somehost.somelocation.edu:8080/metacat/style/skins/default/harvesterRegistrationLogin.jsp
</pre></div>
</div>
</li>
</ol>
<div class="figure align-center" id="id1">
<img alt="_images/image065.jpg" src="_images/image065.jpg" />
<p class="caption"><span class="caption-text">Metacat&#8217;s Harvester Registration page.</span></p>
</div>
<ol class="arabic" start="2">
<li><p class="first">Enter your Metacat account information and click Submit to log in to
Metacat from the Harvester Registration page.</p>
<p>Note: In some cases, you may need to log in to an anonymous &#8220;site&#8221; account
rather than your personal account so that the registered data will not appear
to have been registered by a single user. For example, an information
manager (jones) who is registering data created by a team of scientists
(jones, smith, and barney) from the Georgia Coastal Ecosystems site might
log in to a dedicated account (named with the site&#8217;s acronym, &#8220;GCE&#8221;) to
indicate that the registered data is from the entire site rather than from &#8220;jones&#8221;.</p>
</li>
<li><p class="first">Enter information about your site and how often you want to schedule harvests
and then click the Register button (see the figure below). The Harvest List URL should
point to the location of the Harvest List, which is an XML file that lists
the documents to harvest. If you do not yet have a Harvest List, please see
the next section for more information about creating one.</p>
</li>
</ol>
<div class="figure align-center" id="id2">
<img alt="_images/image067.jpg" src="_images/image067.jpg" />
<p class="caption"><span class="caption-text">Enter information about your site and how often you want to schedule harvests.</span></p>
</div>
<p>The example settings in the previous figure instruct Harvester to harvest
documents from the site once every two weeks. The Harvester will access the
site&#8217;s Harvest List at URL &#8220;<a class="reference external" href="http://somehost.institution.edu/~myname/harvestList.xml">http://somehost.institution.edu/~myname/harvestList.xml</a>&#8221;,
and will send email reports to the Site Contact at email address
&#8220;<a class="reference external" href="mailto:myname&#37;&#52;&#48;institution&#46;edu">myname<span>&#64;</span>institution<span>&#46;</span>edu</a>&#8221;. Note that you can enter multiple email addresses by
separating each address with a comma or a semi-colon. For example,
&#8220;<a class="reference external" href="mailto:myname&#37;&#52;&#48;institution&#46;edu">myname<span>&#64;</span>institution<span>&#46;</span>edu</a>,anothername&#64;institution.edu&#8221;</p>
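<p>To preview how a multi-address entry breaks down before submitting it, a field
value can be split on commas and semicolons. The sketch below only illustrates
the separator rule stated above; how Harvester itself parses the field is not
documented here, and the function name <code class="docutils literal"><span class="pre">split_addresses</span></code> is hypothetical.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
import re


def split_addresses(value):
    """Split a comma- or semicolon-separated address list into clean entries."""
    parts = re.split(r"[,;]", value)
    return [p.strip() for p in parts if p.strip()]


print(split_addresses("myname@institution.edu,anothername@institution.edu"))
```
</pre></div>
</div>
<p>Surrounding whitespace and empty entries (for example, from a trailing
separator) are dropped in this sketch.</p>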
</div>
<div class="section" id="compose-a-harvest-list-the-harvest-list-editor">
<h3>13.3.2. Compose a Harvest List (The Harvest List Editor)<a class="headerlink" href="#compose-a-harvest-list-the-harvest-list-editor" title="Permalink to this headline">¶</a></h3>
<p>The Harvest List is an XML file that contains a list of documents to be harvested.
The list is created by the site contact and stored on the site contact&#8217;s site
at the location specified during the Harvester registration process (see
previous section for details). The list can be generated by hand, or you can
use Metacat&#8217;s Harvest List Editor to automatically generate and structure the
list to conform to the required XML schema (displayed at the end of
this section). In this section we will look at what information is required when
building a Harvest List, and how to configure and use the Harvest List Editor.
Note that you must have a source distribution of Metacat in order to use the
Harvest List Editor.</p>
<p>The Harvest List contains information that helps Metacat identify and retrieve
each specified EML file. Each document in the list must be described with a
docid, documentType, and documentURL (see table).</p>
<p>Information that must be included in the Harvest List about each EML file</p>
<table border="1" class="docutils">
<colgroup>
<col width="18%" />
<col width="82%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Item</th>
<th class="head">Description</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>docid</td>
<td><p class="first">The docid uniquely identifies each EML document. Each docid consists of three elements:</p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">scope</span></code>: the document group to which the document belongs.</li>
<li><code class="docutils literal"><span class="pre">identifier</span></code>: a number that uniquely identifies the document within the scope.</li>
<li><code class="docutils literal"><span class="pre">revision</span></code>: a number that indicates the current revision.</li>
</ul>
<p class="last">For example, a valid docid could be demoDocument.1.5, where demoDocument represents
the scope, 1 the identifier, and 5 the revision number.</p>
</td>
</tr>
<tr class="row-odd"><td>documentType</td>
<td>The documentType identifies the type of document as EML,
e.g., &#8220;eml://ecoinformatics.org/eml-2.0.0&#8221;.</td>
</tr>
<tr class="row-even"><td>documentURL</td>
<td>The documentURL specifies a place where Harvester can locate and retrieve the
document via HTTP, e.g., &#8220;http://www.lternet.edu/~dcosta/document1.xml&#8221;.
The Metacat Harvester must be given read access to the contents at this URL.</td>
</tr>
</tbody>
</table>
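<p>The three-part docid format can be taken apart programmatically, which is handy
when generating Harvest List entries from existing identifiers. This is a
minimal sketch; the <code class="docutils literal"><span class="pre">DocId</span></code> type and <code class="docutils literal"><span class="pre">parse_docid</span></code> function are hypothetical
names, not part of Metacat.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
from collections import namedtuple

DocId = namedtuple("DocId", ["scope", "identifier", "revision"])


def parse_docid(docid):
    """Split 'scope.identifier.revision' into its three parts."""
    # Split from the right so that only the last two dots act as separators.
    scope, identifier, revision = docid.rsplit(".", 2)
    return DocId(scope, int(identifier), int(revision))


print(parse_docid("demoDocument.1.5"))
```
</pre></div>
</div>
<p>A docid with fewer than three parts, or with a non-numeric identifier or
revision, raises a ValueError in this sketch.</p>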
<p>The example Harvest List below contains two &lt;document&gt; elements that specify the
information that Harvester needs to retrieve a pair of EML documents and
upload them to Metacat.</p>
<div class="highlight-python"><div class="highlight"><pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt;
&lt;!-- Example Harvest List --&gt;
&lt;hrv:harvestList xmlns:hrv=&quot;eml://ecoinformatics.org/harvestList&quot; &gt;
  &lt;document&gt;
      &lt;docid&gt;
          &lt;scope&gt;demoDocument&lt;/scope&gt;
          &lt;identifier&gt;1&lt;/identifier&gt;
          &lt;revision&gt;5&lt;/revision&gt;
      &lt;/docid&gt;
      &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
      &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document1.xml&lt;/documentURL&gt;
  &lt;/document&gt;
  &lt;document&gt;
      &lt;docid&gt;
          &lt;scope&gt;demoDocument&lt;/scope&gt;
          &lt;identifier&gt;2&lt;/identifier&gt;
          &lt;revision&gt;1&lt;/revision&gt;
      &lt;/docid&gt;
      &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
      &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document2.xml&lt;/documentURL&gt;
  &lt;/document&gt;
&lt;/hrv:harvestList&gt;
</pre></div>
</div>
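<p>Sites that keep their document inventory in a database or spreadsheet may prefer
to generate the Harvest List with a script. The sketch below builds the same
structure as the example above using Python&#8217;s standard
<code class="docutils literal"><span class="pre">xml.etree.ElementTree</span></code> module. The function name <code class="docutils literal"><span class="pre">build_harvest_list</span></code> and the
tuple layout are assumptions made for the example.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
import xml.etree.ElementTree as ET

HRV_NS = "eml://ecoinformatics.org/harvestList"


def build_harvest_list(documents):
    """Build a Harvest List from (scope, identifier, revision, type, url) tuples."""
    ET.register_namespace("hrv", HRV_NS)
    # Only the root element is namespace-qualified; child elements are not.
    root = ET.Element("{%s}harvestList" % HRV_NS)
    for scope, identifier, revision, doc_type, url in documents:
        document = ET.SubElement(root, "document")
        docid = ET.SubElement(document, "docid")
        ET.SubElement(docid, "scope").text = scope
        ET.SubElement(docid, "identifier").text = str(identifier)
        ET.SubElement(docid, "revision").text = str(revision)
        ET.SubElement(document, "documentType").text = doc_type
        ET.SubElement(document, "documentURL").text = url
    return ET.tostring(root, encoding="unicode")


EML = "eml://ecoinformatics.org/eml-2.0.0"
xml_text = build_harvest_list([
    ("demoDocument", 1, 5, EML, "http://www.lternet.edu/~dcosta/document1.xml"),
    ("demoDocument", 2, 1, EML, "http://www.lternet.edu/~dcosta/document2.xml"),
])
print(xml_text)
```
</pre></div>
</div>
<p>The output is a single line without an XML declaration; add the declaration and
indentation when writing the file if you want it to match the hand-written
example.</p>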
<p>Rather than formatting the list by hand, you may wish to use Metacat&#8217;s Harvest
List Editor to compose and edit it. The Harvest List Editor displays a Harvest
List as a table of rows and fields. Each table row corresponds to
a single &lt;document&gt; element in the corresponding Harvest List file (i.e., one
EML document). The row numbers are used only for visual reference and are
not editable.</p>
<p>To add a new document to the Harvest List, enter values for all five editable
fields (all fields except the &#8220;Row #&#8221; field). Partially filled-in rows will
cause errors that will result in an invalid Harvest List.</p>
<p>The buttons at the bottom of the Editor can be used to Cut, Copy, and Paste
rows from one location to another. Select a row and click the desired button,
or paste the default values (which are specified in the Editor&#8217;s configuration
file, discussed later in this section) into the currently selected row by
clicking the Paste Defaults button. Note: Only one row can be selected at any
given time: all cut, copy, and paste operations work on only a single row
rather than on a range of rows.</p>
<p>To run the Harvest List Editor on the machine where the Metacat
source code is installed:</p>
<ol class="arabic">
<li><p class="first">Open a system command window or terminal window.</p>
</li>
<li><p class="first">Set the METACAT_HOME environment variable to the value of the Metacat
installation directory. Some examples follow:</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre>set METACAT_HOME=C:\somePath\metacat
</pre></div>
</div>
<p>On Linux/Unix (bash shell):</p>
<div class="highlight-python"><div class="highlight"><pre>export METACAT_HOME=/home/somePath/metacat
</pre></div>
</div>
</li>
<li><p class="first">cd to the following directory:</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre>cd %METACAT_HOME%\lib\harvester
</pre></div>
</div>
<p>On Linux/Unix:</p>
<div class="highlight-python"><div class="highlight"><pre>cd $METACAT_HOME/lib/harvester
</pre></div>
</div>
</li>
<li><p class="first">Run the appropriate Harvester shell script, as determined by the operating system:</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">runHarvestListEditor</span><span class="o">.</span><span class="n">bat</span>
</pre></div>
</div>
<p>On Linux/Unix:</p>
<div class="highlight-python"><div class="highlight"><pre>sh runHarvestListEditor.sh
</pre></div>
</div>
<p>The Harvest List Editor will open.</p>
</li>
</ol>
<p>If you would like to customize the Harvest List Editor (e.g., specify a
default list to open automatically whenever the editor is opened and/or
default values), create a file called .harvestListEditor (note the leading
dot character). Use a plain text editor to create the file and place the file
in the Site Contact&#8217;s home directory. To determine the home directory, open a
system command window or terminal window and type the following:</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre>echo %USERPROFILE%
</pre></div>
</div>
<p>On Linux/Unix:</p>
<div class="highlight-python"><div class="highlight"><pre>echo $HOME
</pre></div>
</div>
<p>The configuration file contains a number of optional properties that can make
using the Editor more convenient. A sample configuration file is displayed below, and
more information about each configuration property is contained in the table.</p>
<p>A sample .harvestListEditor configuration file</p>
<div class="highlight-python"><div class="highlight"><pre>defaultHarvestList=C:/temp/harvestList.xml
defaultScope=demo_document
defaultIdentifier=1
defaultRevision=1
defaultDocumentURL=http://www.lternet.edu/~dcosta/
defaultDocumentType=eml://ecoinformatics.org/eml-2.0.0
</pre></div>
</div>
<p>Harvest List Editor Configuration Properties</p>
<table border="1" class="docutils">
<colgroup>
<col width="18%" />
<col width="82%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property</th>
<th class="head">Description</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>defaultHarvestList</td>
<td><p class="first">The location of a Harvest List file that the Editor will
automatically open for editing on startup. Set this property
to the path to the Harvest List file that you expect to edit most frequently.</p>
<p class="last">Examples:
<code class="docutils literal"><span class="pre">/home/jdoe/public_html/harvestList.xml</span></code>
<code class="docutils literal"><span class="pre">C:/temp/harvestList.xml</span></code></p>
</td>
</tr>
<tr class="row-odd"><td>defaultScope</td>
<td><p class="first">The value pasted into the Editor&#8217;s Scope field when the Paste
Defaults button is clicked. The Scope field should contain
a symbolic identifier that indicates the family of documents
to which the EML document belongs.</p>
<p class="last">Example:   xyz_dataset
Default:    dataset</p>
</td>
</tr>
<tr class="row-even"><td>defaultIdentifier</td>
<td>The value pasted into the Editor&#8217;s Identifier field when the
Paste Defaults button is clicked. The Identifier field should contain
a numeric value indicating the identifier for this particular EML document within the Scope.</td>
</tr>
<tr class="row-odd"><td>defaultRevision</td>
<td><p class="first">The value pasted into the Editor&#8217;s Revision field when the Paste Defaults button
is clicked. The Revision field should contain a numeric value indicating the
revision number of this EML document within the Scope and Identifier.</p>
<p class="last">Example:   2
Default:    1</p>
</td>
</tr>
<tr class="row-even"><td>defaultDocumentType</td>
<td><p class="first">The document type specification pasted into the
Editor&#8217;s DocumentType field when the Paste Defaults button is clicked.</p>
<p class="last">Default: <code class="docutils literal"><span class="pre">eml://ecoinformatics.org/eml-2.0.0</span></code></p>
</td>
</tr>
<tr class="row-odd"><td>defaultDocumentURL</td>
<td><p class="first">The URL or partial URL pasted into the Editor&#8217;s URL field
when the Paste Defaults button is clicked. Typically, this
value is set to the portion of the URL shared by all harvested EML documents.</p>
<p class="last">Example:
<code class="docutils literal"><span class="pre">http://somehost.institution.edu/somepath/</span></code>
Default: <code class="docutils literal"><span class="pre">http://</span></code></p>
</td>
</tr>
</tbody>
</table>
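<p>To see which values the Paste Defaults button would be expected to use, the
configuration file can be read and combined with the defaults listed in the
table. This is a rough sketch and not the Editor&#8217;s own loader: it assumes simple
<code class="docutils literal"><span class="pre">key=value</span></code> lines, and <code class="docutils literal"><span class="pre">load_editor_config</span></code> and
<code class="docutils literal"><span class="pre">DOCUMENTED_DEFAULTS</span></code> are hypothetical names.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
from pathlib import Path

# Defaults as listed in the table above; properties without a documented
# default (defaultHarvestList, defaultIdentifier) are left unset here.
DOCUMENTED_DEFAULTS = {
    "defaultScope": "dataset",
    "defaultRevision": "1",
    "defaultDocumentType": "eml://ecoinformatics.org/eml-2.0.0",
    "defaultDocumentURL": "http://",
}


def load_editor_config(path=None):
    """Read key=value pairs from the file, falling back to documented defaults."""
    if path is None:
        path = Path.home() / ".harvestListEditor"
    config = dict(DOCUMENTED_DEFAULTS)
    if path.is_file():
        for line in path.read_text().splitlines():
            key, sep, value = line.partition("=")
            key = key.strip()
            if sep and key and not key.startswith("#"):
                config[key] = value.strip()
    return config
```
</pre></div>
</div>
<p>Calling <code class="docutils literal"><span class="pre">load_editor_config()</span></code> with no argument looks for the file in the
current user&#8217;s home directory; if the file does not exist, only the documented
defaults are returned.</p>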
<p>XML Schema for Harvest Lists</p>
<div class="highlight-python"><div class="highlight"><pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;!-- edited with XMLSPY v5 rel. 4 U (http://www.xmlspy.com) by Matt Jones (NCEAS) --&gt;
&lt;xs:schema xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot; xmlns:hrv=&quot;eml://ecoinformatics.org/harvestList&quot; xmlns=&quot;eml://ecoinformatics.org/harvestList&quot; targetNamespace=&quot;eml://ecoinformatics.org/harvestList&quot; elementFormDefault=&quot;unqualified&quot; attributeFormDefault=&quot;unqualified&quot;&gt;
&lt;xs:annotation&gt;
  &lt;xs:documentation&gt;This module defines the required information for the harvester to collect documents from the local site. The local system containing this document must give the Metacat Harvester read access to this document.&lt;/xs:documentation&gt;
&lt;/xs:annotation&gt;
&lt;xs:annotation&gt;
  &lt;xs:appinfo&gt;
    &lt;tooltip/&gt;
    &lt;summary/&gt;
    &lt;description/&gt;
  &lt;/xs:appinfo&gt;
&lt;/xs:annotation&gt;
&lt;xs:element name=&quot;harvestList&quot;&gt;
  &lt;xs:annotation&gt;
    &lt;xs:documentation&gt;This represents the local document information that is used to inform the Harvester of the docid, document type, and location of the document to be harvested.&lt;/xs:documentation&gt;
  &lt;/xs:annotation&gt;
  &lt;xs:complexType&gt;
    &lt;xs:sequence&gt;
      &lt;xs:element name=&quot;document&quot; maxOccurs=&quot;unbounded&quot;&gt;
        &lt;xs:complexType&gt;
          &lt;xs:sequence&gt;
            &lt;xs:element name=&quot;docid&quot;&gt;
              &lt;xs:annotation&gt;
                &lt;xs:documentation&gt;The complete document identifier to be used by metacat.  The docid is a compound element that gives a scope for the identifier, an integer local identifer that is unique within that scope, and a revision.  Each revision is assumed to specify a unique, non-changing document, so once a particular revision is harvested, there is no need for it to be harvested again.  To trigger a harvest of a document that has been updated, increment the revision number for that identifier.&lt;/xs:documentation&gt;
              &lt;/xs:annotation&gt;
              &lt;xs:complexType&gt;
                &lt;xs:sequence&gt;
                  &lt;xs:element name=&quot;scope&quot; type=&quot;xs:string&quot;&gt;
                    &lt;xs:annotation&gt;
                      &lt;xs:documentation&gt;The system prefix of a metacat docid that defines the scope within which the identifier is unique.&lt;/xs:documentation&gt;
                    &lt;/xs:annotation&gt;
                  &lt;/xs:element&gt;
                  &lt;xs:element name=&quot;identifier&quot; type=&quot;xs:long&quot;&gt;
                    &lt;xs:annotation&gt;
                      &lt;xs:documentation&gt;The local (site specific) portion of the identifier (docid) that is unique within the context of the scope.&lt;/xs:documentation&gt;
                    &lt;/xs:annotation&gt;
                  &lt;/xs:element&gt;
                  &lt;xs:element name=&quot;revision&quot; type=&quot;xs:long&quot;&gt;
                    &lt;xs:annotation&gt;
                      &lt;xs:documentation&gt;The revision identifier for this document, indicating a unique document version.&lt;/xs:documentation&gt;
                    &lt;/xs:annotation&gt;
                  &lt;/xs:element&gt;
                &lt;/xs:sequence&gt;
              &lt;/xs:complexType&gt;
            &lt;/xs:element&gt;
            &lt;xs:element name=&quot;documentType&quot; type=&quot;xs:string&quot;&gt;
              &lt;xs:annotation&gt;
                &lt;xs:documentation&gt;The type of document to be harvested, indicated by a namespace string, formal public identifier, mime type, or other type indicator.   &lt;/xs:documentation&gt;
              &lt;/xs:annotation&gt;
            &lt;/xs:element&gt;
            &lt;xs:element name=&quot;documentURL&quot; type=&quot;xs:anyURI&quot;&gt;
              &lt;xs:annotation&gt;
                &lt;xs:documentation&gt;The documentURL field contains the URL of the document to be harvested. The Metacat Harvester must be given read access to the contents at this URL.&lt;/xs:documentation&gt;
              &lt;/xs:annotation&gt;
            &lt;/xs:element&gt;
          &lt;/xs:sequence&gt;
        &lt;/xs:complexType&gt;
      &lt;/xs:element&gt;
    &lt;/xs:sequence&gt;
  &lt;/xs:complexType&gt;
&lt;/xs:element&gt;
&lt;/xs:schema&gt;
</pre></div>
</div>
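<p>Before publishing a Harvest List, it can be useful to check it against the
structure the schema describes. Python&#8217;s standard library does not perform XML
Schema validation, so the sketch below is only an approximate structural check
written with <code class="docutils literal"><span class="pre">xml.etree.ElementTree</span></code>; it is not a substitute for validating
against harvestList.xsd with a schema-aware tool. The function name
<code class="docutils literal"><span class="pre">check_harvest_list</span></code> and the wording of its messages are assumptions.</p>
<div class="highlight-python"><div class="highlight"><pre>
```python
import xml.etree.ElementTree as ET

HRV_ROOT = "{eml://ecoinformatics.org/harvestList}harvestList"


def check_harvest_list(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    if root.tag != HRV_ROOT:
        return ["root element is not hrv:harvestList"]
    problems = []
    documents = root.findall("document")
    if not documents:
        problems.append("no document elements")
    for number, document in enumerate(documents, start=1):
        # String-valued elements must be present and non-empty.
        for path in ("docid/scope", "documentType", "documentURL"):
            if not (document.findtext(path) or "").strip():
                problems.append("document %d: missing %s" % (number, path))
        # identifier and revision are declared as xs:long in the schema.
        for path in ("docid/identifier", "docid/revision"):
            text = (document.findtext(path) or "").strip()
            try:
                int(text)
            except ValueError:
                problems.append("document %d: %s is not an integer" % (number, path))
    return problems
```
</pre></div>
</div>
<p>Pass the function the text of a Harvest List file; it raises a parse error if
the text is not well-formed XML, and otherwise returns one message per problem
it detects.</p>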
</div>
<div class="section" id="prepare-eml-documents-for-harvest">
<h3>13.3.3. Prepare EML Documents for Harvest<a class="headerlink" href="#prepare-eml-documents-for-harvest" title="Permalink to this headline">¶</a></h3>
<p>To prepare a set of EML documents for harvest, ensure that the following is true for each document:</p>
<ul class="simple">
<li>The document contains valid EML</li>
<li>The document is specified in a <code class="docutils literal"><span class="pre">&lt;document&gt;</span></code> element in the site&#8217;s Harvest List</li>
<li>The file resides at the location specified by its URL in the Harvest List</li>
</ul>
</div>
<div class="section" id="review-harvester-reports">
<h3>13.3.4. Review Harvester Reports<a class="headerlink" href="#review-harvester-reports" title="Permalink to this headline">¶</a></h3>
<p>Harvester sends an email report to the Site Contact after every scheduled site
harvest. The report contains information about the performed operations, such
as which EML documents were harvested and whether any errors were encountered.
Errors are indicated by operations that display a status value of 1; a status
value of 0 indicates that the operation completed successfully.</p>
<p>When errors are reported, the Site Contact should try to determine whether the
source of the error is something that can be corrected at the site. Common
causes of errors include:</p>
<ul class="simple">
<li>a document URL specified in the Harvest List does not match the location of the actual EML file on the disk</li>
<li>the Harvest List does not contain valid XML as specified in the harvestList.xsd schema</li>
<li>the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk</li>
<li>an EML document that Harvester attempted to upload to Metacat does not contain valid EML</li>
</ul>
<p>If the Site Contact is unable to determine the cause of the error and its
resolution, he or she should contact the Harvester Administrator for assistance.</p>
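<p>Two of the causes listed above involve XML that cannot be parsed. As a first diagnostic, a Site Contact could run a well-formedness check such as the sketch below on the Harvest List or on an EML file. The function name <code class="docutils literal"><span class="pre">check_well_formed</span></code> is hypothetical. The check only confirms that the text parses as XML; it does not validate against harvestList.xsd or the EML schema.</p>

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser message includes the line and column of the problem.
        return False, str(err)
    return True, ""
```

<p>A document that passes this check can still be rejected by Harvester if it does not conform to its schema, so a passing result narrows the search rather than ruling the document out as the cause.</p>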
</div>
<div class="section" id="unregister-with-harvester">
<h3>13.3.5. Unregister with Harvester<a class="headerlink" href="#unregister-with-harvester" title="Permalink to this headline">¶</a></h3>
<p>To discontinue harvests, the Site Contact must unregister with Harvester.
To unregister:</p>
<ol class="arabic">
<li><p class="first">Using a Web browser, log in to Metacat&#8217;s Harvester Registration page.
The Harvester Registration page is inside the skins directory. For example,
if the Metacat server that you registered with resides at the
following URL:</p>
<div class="highlight-python"><div class="highlight"><pre>http://somehost.somelocation.edu:8080/metacat/index.jsp
</pre></div>
</div>
<p>then the Harvester Registration page would be accessed at:</p>
<div class="highlight-python"><div class="highlight"><pre>http://somehost.somelocation.edu:8080/metacat/style/skins/default/harvesterRegistrationLogin.jsp
</pre></div>
</div>
</li>
<li><p class="first">Enter and submit your Metacat account information. On the subsequent screen,
click Unregister to remove your site and discontinue harvests.</p>
</li>
</ol>
</div>
</div>
<div class="section" id="running-harvester">
<h2>13.4. Running Harvester<a class="headerlink" href="#running-harvester" title="Permalink to this headline">¶</a></h2>
<p>The Harvester can be run as a servlet or in a command window. Under most
circumstances, Harvester is best run continuously as a background servlet
process. However, if you expect to use Harvester infrequently, or if you wish
only to test that Harvester is functioning, it may be desirable to run it from
a command window.</p>
<div class="section" id="running-harvester-as-a-servlet">
<h3>13.4.1. Running Harvester as a Servlet<a class="headerlink" href="#running-harvester-as-a-servlet" title="Permalink to this headline">¶</a></h3>
<p>To run Harvester as a servlet:</p>
<ol class="arabic">
<li><p class="first">Remove the comment symbols around the HarvesterServlet entry in the
deployed Metacat web.xml (in $TOMCAT_HOME/webapps/&lt;context&gt;/WEB-INF).</p>
<div class="highlight-python"><div class="highlight"><pre>&lt;!--
&lt;servlet&gt;
  &lt;servlet-name&gt;HarvesterServlet&lt;/servlet-name&gt;
  &lt;servlet-class&gt;edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class&gt;
  &lt;init-param&gt;
  &lt;param-name&gt;debug&lt;/param-name&gt;
  &lt;param-value&gt;1&lt;/param-value&gt;
  &lt;/init-param&gt;
  &lt;init-param&gt;
  &lt;param-name&gt;listings&lt;/param-name&gt;
  &lt;param-value&gt;true&lt;/param-value&gt;
  &lt;/init-param&gt;
  &lt;load-on-startup&gt;1&lt;/load-on-startup&gt;
&lt;/servlet&gt;
--&gt;
</pre></div>
</div>
</li>
<li><p class="first">Save the edited file.</p>
</li>
<li><p class="first">Restart Tomcat.</p>
</li>
</ol>
<p>About thirty seconds after you restart Tomcat, the Harvester servlet will
start executing. The first harvest will occur after the number of hours
specified in the metacat.properties file. The servlet will continue running
new harvests until the maximum number of harvests has been completed, or until
Tomcat shuts down (harvest frequency and maximum number of harvests are also
set in the Harvester properties).</p>
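<p>Before restarting Tomcat, it can be useful to confirm that the entry is really uncommented. The sketch below is not part of Metacat, and the function name <code class="docutils literal"><span class="pre">harvester_servlet_enabled</span></code> is hypothetical. It relies on the fact that Python's default ElementTree parser discards XML comments, so a servlet entry that is still commented out is not found.</p>

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    tag = element.tag
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def harvester_servlet_enabled(web_xml_text):
    """Return True if web.xml has an active HarvesterServlet servlet entry."""
    root = ET.fromstring(web_xml_text)
    for servlet in root.iter():
        # Only look inside servlet elements, not servlet-mapping elements.
        if local_name(servlet) != "servlet":
            continue
        for child in servlet:
            name = (child.text or "").strip()
            if local_name(child) == "servlet-name" and name == "HarvesterServlet":
                return True
    return False
```

<p>Pass the function the contents of the deployed web.xml file. A result of <code class="docutils literal"><span class="pre">False</span></code> suggests that the comment symbols are still in place or that the entry is missing.</p>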
</div>
<div class="section" id="running-harvester-in-a-command-window">
<h3>13.4.2. Running Harvester in a Command Window<a class="headerlink" href="#running-harvester-in-a-command-window" title="Permalink to this headline">¶</a></h3>
<p>To run Harvester in a command window:</p>
<ol class="arabic">
<li><p class="first">Open a system command window or terminal window.</p>
</li>
<li><p class="first">Set the <code class="docutils literal"><span class="pre">METACAT_HOME</span></code> environment variable to the value of the
Metacat webapp deployment directory.</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre>set METACAT_HOME=C:\somePath\metacat
</pre></div>
</div>
<p>On Linux/Unix (bash shell):</p>
<div class="highlight-python"><div class="highlight"><pre>export METACAT_HOME=/home/somePath/metacat
</pre></div>
</div>
</li>
<li><p class="first">cd to the following directory:</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre>cd %METACAT_HOME%\lib\harvester
</pre></div>
</div>
<p>On Linux/Unix:</p>
<div class="highlight-python"><div class="highlight"><pre>cd $METACAT_HOME/lib/harvester
</pre></div>
</div>
</li>
<li><p class="first">Run the appropriate Harvester shell script, as determined by the operating system:</p>
<p>On Windows:</p>
<div class="highlight-python"><div class="highlight"><pre>runHarvester.bat %METACAT_HOME%
</pre></div>
</div>
<p>On Linux/Unix:</p>
<div class="highlight-python"><div class="highlight"><pre>sh runHarvester.sh $METACAT_HOME
</pre></div>
</div>
</li>
</ol>
<p>The Harvester application will start executing. The first harvest will occur
after the number of hours specified in the <code class="docutils literal"><span class="pre">metacat.properties</span></code> file. The
application will continue running new harvests until the maximum number of
harvests has been completed, or until you interrupt the process by pressing
Ctrl+C in the command window (harvest frequency and maximum number of harvests
are also set in the Harvester properties).</p>
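<p>If you launch Harvester from a script rather than by hand, steps 2 through 4 can be collected in one place. The sketch below only assembles the working directory and command line described above; the function name <code class="docutils literal"><span class="pre">build_harvester_command</span></code> is hypothetical, and invoking the batch file through <code class="docutils literal"><span class="pre">cmd</span> <span class="pre">/c</span></code> on Windows is an assumption rather than something this guide prescribes.</p>

```python
import ntpath
import posixpath


def build_harvester_command(metacat_home, windows):
    """Return (working_directory, argv) matching the documented steps."""
    if windows:
        # Step 3: cd %METACAT_HOME%\lib\harvester
        cwd = ntpath.join(metacat_home, "lib", "harvester")
        # Step 4: runHarvester.bat %METACAT_HOME% (run via cmd, an assumption)
        argv = ["cmd", "/c", "runHarvester.bat", metacat_home]
    else:
        # Step 3: cd $METACAT_HOME/lib/harvester
        cwd = posixpath.join(metacat_home, "lib", "harvester")
        # Step 4: sh runHarvester.sh $METACAT_HOME
        argv = ["sh", "runHarvester.sh", metacat_home]
    return cwd, argv
```

<p>The returned values could then be passed to <code class="docutils literal"><span class="pre">subprocess.run(argv,</span> <span class="pre">cwd=cwd)</span></code>, with <code class="docutils literal"><span class="pre">METACAT_HOME</span></code> set in the environment of the child process as in step 2.</p>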
</div>
</div>
<div class="section" id="reviewing-harvest-reports">
<h2>13.5. Reviewing Harvest Reports<a class="headerlink" href="#reviewing-harvest-reports" title="Permalink to this headline">¶</a></h2>
<p>Harvester sends an email report to the Harvester Administrator after every
harvest. The report contains information about the performed operations, such
as which sites were harvested as well as which EML documents were harvested
and whether any errors were encountered. Errors are indicated by operations
that display a status value of 1; a status value of 0 indicates that the
operation completed successfully.</p>
<p>The Harvester Administrator should review the report, paying particularly
close attention to any reported errors and accompanying error messages. When
errors are reported at a particular site, the Harvester Administrator should
contact the Site Contact to determine the source of the error and its
resolution. Common causes of errors include:</p>
<ul class="simple">
<li>a document URL specified in the Harvest List does not match the location of the actual EML file on the disk</li>
<li>the Harvest List does not contain valid XML as specified in the harvestList.xsd schema</li>
<li>the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk</li>
<li>an EML document that Harvester attempted to upload to Metacat does not contain valid EML</li>
</ul>
<p>Errors that are independent of a particular site may indicate a problem with
Harvester itself, Metacat, or the database connection. Refer to the error
message to determine the source of the error and its resolution.</p>
</div>
</div>


	          </div>
	        </div>
      	</div>

	
	      <div class="clearer"></div>
	    </div>
	    <div class="footer">
	    	<div class="footerNav">
				
			</div>
	    	<div class="small-print">
			      &copy; Copyright 2012, Regents of the University of California.
			      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.3.2.
			</div>
	    </div>
	</div>
  </body>
</html>