CN Search Index
====================
Interactive search indexing has two main components: the index task generator and the
index task processor.

Index Task Generator
-------------------------

The index task generator listens to CN cluster messages for creates and updates to the system
metadata data store (Hazelcast). It reacts to each message by creating an 'index task', which
represents an item to be added to or updated in the search index. Index tasks are stored in a
Postgres data store for processing.

Index Task Processor
---------------------------
The index task processor periodically reads the 'index task' data store and picks up the tasks
that are ready to be processed. Processing converts each index task into an update to the
search index (Solr), using the system metadata document to generate the search index record.
Fields such as 'id', 'size', and 'datasource' are pulled from the system metadata record,
primarily using XPath selector expressions.

These XPath expressions are matched with search index fields through configuration rules
written in Spring context files. Each rule defines the search index field name, the XPath
expression that locates the data in the XML document, and any additional processing instructions.

For complex search fields, Java-based data mining objects are used in conjunction with the
XPath expressions. For example, constructing some search fields requires "AND/OR" logic and
other data manipulation operations that are not possible with XPath alone.

Science metadata document formats are also assigned their own index parsing rules. These allow
the descriptive information found in science metadata documents (which describe data files)
to be placed in the search index to support search use cases. These additional rules are
configured using the same techniques as the rules used to mine system metadata information.
The rules are configured per 'formatId', which creates configurations that are specific to
particular science metadata formats and even to specific versions of a format.

Index Processor Configuration
----------------------------------
Index processing is configured using Spring context files. Typically each science metadata
format family defines a 'base' context file which contains the field rules shared by that
family. For example, the 'eml' science metadata format has several versions that share most
or all of their field rules. Placing the field definitions in a common 'base' config file
lets them be re-used across the configurations for the various versions of the eml family,
while still allowing for variation where a particular version needs it. For example:

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml-base.xml

The above file contains Spring bean definitions, each one representing a field in the index
paired with an expression of how to find the data in the science metadata document. Each of
these bean definitions is implemented by a class in DataONE's SolrField class hierarchy.
Javadoc for SolrField and its associated classes can be found here:

http://dev-testing.dataone.org:8080/hudson/view/CN%20Snapshot%20Jobs/job/d1_cn_index_processor/ws/target/site/apidocs/index.html

In particular:

http://dev-testing.dataone.org:8080/hudson/view/CN%20Snapshot%20Jobs/job/d1_cn_index_processor/ws/target/site/apidocs/org/dataone/cn/indexer/parser/package-summary.html

Example field definition bean: the bean named "eml.keywords" is responsible for mining the
keyword search field from eml science metadata documents. This bean is an instance of the
DataONE SolrField class. Its first constructor argument defines the search field name this
data will be mapped to, "keywords". Keywords is a multi-value field in the Solr search index,
meaning that the 'keywords' field is an array or list of values (each value a different
keyword). The second argument to this bean is the 'xpath' variable, which contains an XPath
expression. This expression defines the path to the desired data in an eml document. In this
case the 'dataset' element contains a 'keywordSet' element which in turn contains the
'keyword' element. We are interested in the text value of these nodes, which is expressed by:
//dataset/keywordSet/keyword/text().

The next property defined for this bean is 'multivalue', which tells SolrField that this
search field is multivalued and to place each keyword in its own position in the multivalued
field. The final property is 'dedupe', which tells SolrField to remove any duplicate keywords
found in the current eml document. When this bean is run over an eml document, the search
field 'keywords' is filled with values from the document's 'keywordSet' elements. This is a
simple search field definition, but the majority of search fields do not get much more
complicated than this.
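
Based on the description above, the "eml.keywords" bean might be written roughly as follows.
This is a sketch reconstructed from the prose, not a copy of the actual context file: the
fully qualified class name (taken from the Javadoc package path) and the constructor argument
names 'name' and 'xpath' are assumptions, so check application-context-eml-base.xml for the
authoritative definition::

    <!-- Sketch only: the class package and constructor-arg names are assumed. -->
    <bean id="eml.keywords" class="org.dataone.cn.indexer.parser.SolrField">
        <!-- Search index field that receives the values. -->
        <constructor-arg name="name" value="keywords" />
        <!-- XPath expression selecting the text of every keyword element. -->
        <constructor-arg name="xpath" value="//dataset/keywordSet/keyword/text()" />
        <!-- Store each keyword as its own entry in the multivalued field. -->
        <property name="multivalue" value="true" />
        <!-- Drop keywords that repeat within the same document. -->
        <property name="dedupe" value="true" />
    </bean>

A single-valued field would presumably follow the same shape without the 'multivalue' and
'dedupe' properties.
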

The configuration for a specific version of a science metadata format defines a bean which is
the document processor. This bean defines which DataONE formatId this instance of the
document processor operates on (it is matched against systemMetadata.formatId). The bean
also defines a property called 'fieldList', which is a list of field definition bean names.
For example:

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml200.xml

In this file the fieldList property contains many 'ref' elements. The 'bean' attribute of each
'ref' element refers to a search field definition bean provided in the eml base context file
shown above (application-context-eml-base.xml).
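
A document processor bean of this kind might look roughly like the sketch below. Only the
'fieldList' property and its 'ref' elements come from the description above. The bean id, the
class name, the 'matchDocuments' property, the formatId value, and the second field bean are
hypothetical placeholders chosen for illustration; see application-context-eml200.xml for the
real definition::

    <!-- Sketch only: id, class and matchDocuments are hypothetical names. -->
    <bean id="eml200Subprocessor" class="org.example.HypotheticalDocumentSubprocessor">
        <!-- formatId handled by this processor (compared with systemMetadata.formatId). -->
        <property name="matchDocuments">
            <list>
                <value>eml://ecoinformatics.org/eml-2.0.0</value>
            </list>
        </property>
        <!-- Field definition beans from the 'base' context file, referenced by name. -->
        <property name="fieldList">
            <list>
                <ref bean="eml.keywords" />
                <ref bean="eml.hypotheticalField" />
            </list>
        </property>
    </bean>

Because the field beans are only referenced by name, a context file for another version of
the same format family can reuse the same 'base' definitions with a different list.
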

Adding a new science metadata format
--------------------------------------

Adding a new science metadata format to the index processor is mostly a matter of providing new
Spring configuration that tells the index processor how to act when it encounters the new
science metadata formatId during normal index processing. To begin, create a new 'base'
configuration file of search field definition rules that maps science metadata values to
search fields. See the previous section for more detail on creating search field definitions.
Examples of these 'base' configuration files can be found here:

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/

See the application context files whose names end in 'base' for examples of search field
definitions by science metadata format.

The next step is to define the document processor bean for the new science metadata formatId.
Examples of this configuration can be found in the same directory, this time in the context
files for specific metadata versions, for example application-context-fgdc-std-0012-1999.xml:

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-fgdc-std-0012-1999.xml

Each of these configuration files contains one bean, the 'document processor bean' for the
science metadata format. In this example, the name of the 'document processor bean' is
'fgdcstd00121999Subprocessor'.

After this, the new context files need to be registered in the index processor daemon's
configuration:

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/processor-daemon-context.xml

Simply add new elements for the new context files.
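
As an illustration, if processor-daemon-context.xml pulls in the other context files with
Spring 'import' elements (an assumption; follow whatever pattern the existing entries in that
file use), the additions might look like this. Both file names are hypothetical::

    <!-- Sketch only: hypothetical file names for a new format called 'newformat'. -->
    <import resource="application-context-newformat-base.xml" />
    <import resource="application-context-newformat-10.xml" />
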

The last configuration step is to register the document processor bean with the uber document
processor bean, which contains references to all of the 'document processor beans':

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/index-processor-context.xml

A new 'ref' element should be created for the new 'document processor' bean.
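
For instance, the new entry might look like the fragment below. This is a sketch: the
enclosing 'list' element is an assumption about how the references are grouped in
index-processor-context.xml, and 'newformat10Subprocessor' is a hypothetical bean name::

    <list>
        <!-- Existing document processor references stay as they are, for example: -->
        <ref bean="fgdcstd00121999Subprocessor" />
        <!-- Sketch only: hypothetical bean name for the new format. -->
        <ref bean="newformat10Subprocessor" />
    </list>
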

From here a unit test can be developed to ensure the search field definitions and mappings
are working as expected. For examples of unit tests for science metadata see:

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/test/java/org/dataone/cn/index/SolrFieldXPathFgdcTest.java

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/test/java/org/dataone/cn/index/SolrFieldXPathEmlTest.java

https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/test/java/org/dataone/cn/index/SolrFieldXPathDryad31Test.java

Once the configuration and unit tests are complete, the new science metadata format is ready
to be processed for the search index. The final step is to copy the new context files
and the two modified context files to the associated Debian buildout project.
Currently the trunk location for these files is:

https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-index/usr/share/dataone-cn-index/debian/index-generation-context/

Adding a new science metadata rule
-----------------------------------------

First configure the new search field definition bean in the 'base' context file for the
appropriate science metadata format. Next, add a reference to the new bean to the fieldList
property of the specific science metadata document processor bean.