CN Search Index

Interactive search indexing has two main components:

Index Task Generator
Listens to CN cluster messages for creates and updates to the system metadata data store (hazelcast).  
Reacts to messages by creating 'index tasks' which represent an item to be added/updated in the search index.  
Index tasks are stored in a postgres data store for processing.

Index Task Processor
Periodically reads the 'index task' data store and grabs the items that are ready to be processed.  
Processing converts each index task object into an update to the search index (solr).  
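
The hand-off between the two components can be sketched in plain Java. This is a minimal illustration only: the class names (IndexTaskSketch, IndexTask, TaskStore), the 'readyAtMillis' field and the in-memory list standing in for the postgres data store are all assumptions made for the example, not the actual DataONE classes.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexTaskSketch {

    /** Hypothetical index task: one item to add or update in the search index. */
    static final class IndexTask {
        final String pid;
        final long readyAtMillis; // assumed notion of "ready to be processed"
        boolean processed;

        IndexTask(String pid, long readyAtMillis) {
            this.pid = pid;
            this.readyAtMillis = readyAtMillis;
        }
    }

    /** In-memory stand-in for the postgres-backed index task store. */
    static final class TaskStore {
        private final List<IndexTask> tasks = new ArrayList<>();

        synchronized void save(IndexTask task) {
            tasks.add(task);
        }

        synchronized List<IndexTask> findReady(long nowMillis) {
            List<IndexTask> ready = new ArrayList<>();
            for (IndexTask task : tasks) {
                if (!task.processed && task.readyAtMillis <= nowMillis) {
                    ready.add(task);
                }
            }
            return ready;
        }
    }

    /** Generator side: called once per create/update message. */
    static void onSystemMetadataChange(TaskStore store, String pid, long readyAtMillis) {
        store.save(new IndexTask(pid, readyAtMillis));
    }

    /** Processor side: one polling pass; returns the pids that were handled. */
    static List<String> processReadyTasks(TaskStore store, long nowMillis) {
        List<String> handled = new ArrayList<>();
        for (IndexTask task : store.findReady(nowMillis)) {
            // A real processor would build and send the solr update here.
            task.processed = true;
            handled.add(task.pid);
        }
        return handled;
    }
}
```

Keeping the generator and processor decoupled through a task store means the processor can be polled on its own schedule and only ever sees tasks that are ready.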

Processing uses the system metadata document to generate the search index record.  
Fields such as 'id', 'size', 'datasource' are pulled from the system metadata record primarily 
using XPath selector expressions.  

These XPath rules are matched with search index fields through configuration rules 
written in spring context files.  Each rule defines the search index field name, the XPath 
expression that locates the data in the xml document, and any additional processing instructions.  
For complex search field constructs, more complicated java based data mining objects are also 
used in conjunction with XPath expressions.  For example, constructing some search fields 
requires "AND/OR" logic and other data manipulation operations that are not possible with XPath alone.
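
As a rough illustration of pairing search field names with XPath expressions, the sketch below uses the JDK's javax.xml.xpath API. The class XPathFieldSketch, its method names, and the un-namespaced sample document used with it are assumptions for the example; the real rules live in spring context files and DataONE's own classes. The 'firstNonEmpty' helper shows the kind of simple "OR" fallback logic that is easier to express in Java than in a single expression.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathFieldSketch {

    /** Evaluates one XPath expression and returns its trimmed string value ("" if nothing matches). */
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath rule: " + expression, e);
        }
    }

    /** Applies a map of (search field name -> XPath expression) rules to a document. */
    static Map<String, String> extract(String xml, Map<String, String> rules) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            fields.put(rule.getKey(), evaluate(xml, rule.getValue()));
        }
        return fields;
    }

    /** Java-side "OR" logic: returns the value of the first expression that yields data. */
    static String firstNonEmpty(String xml, String... expressions) {
        for (String expression : expressions) {
            String value = evaluate(xml, expression);
            if (!value.isEmpty()) {
                return value;
            }
        }
        return "";
    }
}
```

Given a sample document such as `<systemMetadata><identifier>doc.1</identifier><size>1024</size></systemMetadata>`, a rule mapping 'id' to `/systemMetadata/identifier/text()` would yield the value doc.1.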

Science metadata document formats are also assigned special index parsing rules.  
These allow the descriptive information found in science metadata documents (which describe data files) 
to be placed in the search index to support search use cases.  These additional rules are 
configured using the same techniques as the rules used to mine system metadata information.  
The rules are configured on a 'formatId' basis, creating configurations that are specific to 
particular science metadata formats and even to specific versions of a format.

Index Processor Configuration
Index processing is configured using spring context files.  Typically each science metadata 
format family defines a 'base' context file which contains the field rules shared by 
the format family.  For example, the 'eml' science metadata 
format has several versions, which share most or all of their field rules.  Placing the field 
definitions in a common 'base' config file lets them be re-used across the configurations for the 
various versions of the eml family of science metadata documents, while still allowing for variation where 
a particular version needs it.  For example:

The above file contains spring bean definitions, each one representing a field in the index paired 
with an expression describing how to find the data in the science metadata document.  Each of these bean 
definitions is implemented by a class in DataONE's SolrField class hierarchy.  
Javadoc for the SolrField and associates can be found here:

In particular:

Example field definition bean::

	<bean id="eml.keywords" class="">
		<constructor-arg name="name" value="keywords" />
		<constructor-arg name="xpath" value="//dataset/keywordSet/keyword/text()" />
		<property name="multivalue" value="true" />
		<property name="dedupe" value="true" />
	</bean>

The bean's name is "eml.keywords" and it is responsible for mining the keyword search field from 
eml science metadata documents.  This bean is an instance of the DataONE SolrField class.  
Its first constructor argument defines the search field name this data will be mapped to: "keywords".  
Keywords is a multi-value field in the solr search index, meaning that the 'keywords' field is 
an array or list of values (each value a different keyword).  The second constructor argument 
is 'xpath', which contains an XPath expression.  This expression defines 
the path to the desired data in an eml document.  In this case the 'dataset' element contains 
a 'keywordSet' element which in turn contains the 'keyword' elements.  We are interested in the 
text value of these nodes, which is expressed by: //dataset/keywordSet/keyword/text().  The next 
property defined for this bean is 'multivalue', which tells SolrField that this search 
field is multivalued and to place each keyword in its own position in the multivalued field.  
The final property defined for this bean is 'dedupe'.  It tells SolrField to remove any 
duplicate keywords found in the current eml document.  When this 
bean is run over an eml document, the search field 'keywords' will be filled with values from 
the eml document's 'keywordSet' collection.  This is an example of a simple search field 
definition, but the majority of search fields do not get much more complicated than this.
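
To make the 'multivalue' and 'dedupe' behaviour concrete, here is a small stand-alone sketch built on the JDK XPath API. It is not the SolrField implementation: the class KeywordFieldSketch, its 'extract' method and the treatment of a non-multivalued field (keeping only the first value) are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class KeywordFieldSketch {

    /** Collects the values selected by an XPath expression, honouring multivalue and dedupe flags. */
    static List<String> extract(String xml, String expression, boolean multivalue, boolean dedupe) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String value = nodes.item(i).getNodeValue().trim();
                if (!value.isEmpty()) {
                    values.add(value);
                }
                // Assumption: a single-valued field keeps only the first value found.
                if (!multivalue && !values.isEmpty()) {
                    break;
                }
            }
            // LinkedHashSet drops duplicates while keeping document order.
            return dedupe ? new ArrayList<>(new LinkedHashSet<>(values)) : values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath rule: " + expression, e);
        }
    }
}
```

Run with the expression //dataset/keywordSet/keyword/text() over a document whose keywordSet holds "soil", "carbon" and "soil" again, the deduplicated multivalued result would contain "soil" and "carbon" once each.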

The configuration for a specific version of a science metadata format defines a bean which is 
the document processor class.  This bean defines which DataONE formatId this instance of the 
document processor operates on (it will be matched against systemMetadata.formatId).  This bean 
also defines a property called 'fieldList' which is a list of field definition bean names.  
For example:

In this file you will notice the fieldList property contains many "ref" elements.  
The 'bean' attribute of each 'ref' element refers to a search field definition 
bean provided in the eml base context file as shown above (application-context-eml-base.xml).

Adding new science metadata format
Adding a new science metadata format to the index processor is mostly a matter of providing new 
spring configuration that tells the index processor how to act when it encounters the new science 
metadata formatId during normal index processing.  To begin, create a new 'base' configuration file of 
search field definition rules to map science metadata data values to search 
fields.  See the previous section for more detail on creating search field definition configuration.  
Examples of these 'base' configurations can be found here:

See the application context files that end in 'base' for examples of search field definitions by 
science metadata format.

The next step is to define the document processor bean for the new science metadata formatId.  
Examples of this configuration can be found in the same directory - but this time looking at 
the specific metadata version files.  
For example - application-context-fgdc-std-0012-1999.xml.

Each of these configuration files contains one bean, which is the 'document processor bean' 
for the science metadata format::

	<bean id="fgdcstd00121999Subprocessor" class="">
		<property name="matchDocument" value="/d100:systemMetadata/formatId[text() = 'FGDC-STD-001.2-1999']"></property>
		<property name="fieldList">
			<ref bean="fgdc.abstract" />
			<ref bean="fgdc.beginDate" />
			<ref bean="fgdc.attributeText" />
		</property>
	</bean>

In this example, the name of the 'document processor bean' is: 'fgdcstd00121999Subprocessor'.
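
The 'matchDocument' expression above uses a namespace prefix ('d100'), so evaluating it requires a namespace-aware parse and a prefix binding. The sketch below shows one way to do that with the JDK XML APIs. The class MatchDocumentSketch is hypothetical, and the namespace URI bound to 'd100' is an assumption; the real binding comes from the indexer's own xml namespace configuration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class MatchDocumentSketch {

    // Assumed namespace URI for the 'd100' prefix used in the matchDocument rule.
    static final String D100_NS = "http://ns.dataone.org/service/types/v1";

    /** Returns true when the matchDocument XPath selects at least one node in the system metadata. */
    static boolean matches(String systemMetadataXml, String matchDocument) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(systemMetadataXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new D100Context());
            return (Boolean) xpath.evaluate(matchDocument, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate matchDocument rule", e);
        }
    }

    /** Minimal prefix binding: only 'd100' is known. */
    private static final class D100Context implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            return "d100".equals(prefix) ? D100_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return D100_NS.equals(uri) ? "d100" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    }
}
```

The prefix used inside the document does not have to be 'd100'; XPath matching is done on the namespace URI, so only the binding in the namespace context matters.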

After this, the new context files need to be registered with the index process daemon's configuration:
Simply add new <import> elements for the new context files.

The next step is to register the document processor bean with the uber document processor bean.  
This is a bean which contains references to all the 'document processor beans':

A new 'ref' element should be created for the new 'document processor' bean::

 <bean id="documentParsers" class="java.util.ArrayList" autowire="byName">
    <bean class="">
     <constructor-arg name="fields" ref="xpath_system_metadata_100" />
     <constructor-arg name="xmlNamespaceConfig" ref="xmlNamespaceConfig" />
     <property name="solrBaseUri" value="${solr.base.uri}" />
     <property name="httpService" ref="httpService" />
     <property name="subprocessors">
       <ref bean="eml200Subprocessor" />
       <ref bean="eml201Subprocessor" />
       <ref bean="eml210Subprocessor" />
       <ref bean="eml211Subprocessor" />
       <ref bean="resourceMapSubprocessor" />
       <ref bean="fgdcstd0011998Subprocessor" />
       <ref bean="fgdcstd00111999Subprocessor" />
       <ref bean="fgdcstd00121999Subprocessor" />
       <ref bean="fgdcEsri80Subprocessor" />
       <ref bean="dryad30Subprocessor" />
       <ref bean="dryad31Subprocessor" />
     </property>
    </bean>
 </bean>
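
Why registration matters can be sketched with a small registry: only subprocessors present in the list are ever consulted for a given formatId. The class SubprocessorRegistrySketch and its members are hypothetical stand-ins, and matching on a plain formatId string is a simplification of the 'matchDocument' rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SubprocessorRegistrySketch {

    /** Hypothetical stand-in for a 'document processor bean'. */
    static final class Subprocessor {
        final String beanName;
        final String formatId;
        final List<String> fieldList;

        Subprocessor(String beanName, String formatId, String... fieldBeanNames) {
            this.beanName = beanName;
            this.formatId = formatId;
            this.fieldList = Arrays.asList(fieldBeanNames);
        }
    }

    private final List<Subprocessor> subprocessors = new ArrayList<>();

    /** Equivalent in spirit to adding a 'ref' element to the subprocessors list. */
    void register(Subprocessor subprocessor) {
        subprocessors.add(subprocessor);
    }

    /** Returns the field definition bean names that apply to a document of the given formatId. */
    List<String> fieldsFor(String formatId) {
        List<String> fields = new ArrayList<>();
        for (Subprocessor subprocessor : subprocessors) {
            if (subprocessor.formatId.equals(formatId)) {
                fields.addAll(subprocessor.fieldList);
            }
        }
        return fields;
    }
}
```

A format whose subprocessor was never registered simply yields no science metadata fields, which is the symptom to look for if a new format does not show up in the index as expected.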

From here a unit test can be developed to ensure the search field definitions and mappings 
are working as expected.  For an example of unit tests for science metadata see:

Once the configuration and unit test are complete, the new science metadata format is ready 
to be processed for the search index.  The final step is to copy the new context files 
and the two modified context files to the associated Debian buildout project.  
Currently the trunk location for these files is here:

Adding a new science metadata rule
First, configure the new search field definition bean in the 'base' context file for the
appropriate science metadata format.  Next, add the new field bean to the fieldList property of the specific science 
metadata document processor bean.