There are two main components for interactive search indexing:
The first component listens to CN cluster messages for creates and updates to the system metadata data store (Hazelcast). It reacts to these messages by creating 'index tasks', each representing an item to be added to or updated in the search index. Index tasks are stored in a PostgreSQL data store for processing.
The second component periodically reads the 'index task' data store and grabs items that are ready to be processed. Processing converts each index task object into an update to the search index (Solr).
Processing uses the system metadata document to generate the search index record. Fields such as 'id', 'size', and 'datasource' are pulled from the system metadata record, primarily using XPath selector expressions.
These XPath rules are matched with search index fields through configuration rules written in Spring context files. Each rule defines the search index field name, the XPath expression to the data in the XML document, and any additional processing instructions. For complex search field constructs, more complicated Java-based data mining objects are also used in conjunction with XPath expressions. For example, constructing some search fields requires "AND/OR" logic and other data manipulation operations that are not possible with XPath alone.
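As a minimal sketch of what such a rule looks like for a system metadata field (the bean id and XPath expression here are illustrative, not copied from the actual configuration):
<!-- Illustrative only: maps the system metadata 'size' element to the 'size' search field. -->
<bean id="example.systemMetadata.size" class="org.dataone.cn.indexer.parser.SolrField">
    <constructor-arg name="name" value="size" />
    <constructor-arg name="xpath" value="//systemMetadata/size/text()" />
</bean>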
Science metadata document formats are also assigned special index parsing rules. These allow the descriptive information found in science metadata documents (which describe data files) to be placed in the search index to support search use cases. These additional rules are configured using the same techniques as the rules used to mine system metadata information. The rules are configured on a 'formatId' basis, creating configurations that are specific to particular science metadata formats and even to specific versions of a format.
Index processing is configured using Spring context files. Typically, each science metadata format family defines a 'base' context file which contains the field rules for that format family. For example, the 'eml' science metadata format has several versions, but they share most or all of their field rules. By placing the field definitions in a common 'base' config file they can be re-used across the configurations for the various versions of the EML family of science metadata documents, allowing for variation where a particular version needs it. For example: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml-base.xml
The above file contains Spring bean definitions, each one representing a field in the index paired with an expression describing how to find the data in the science metadata document. Each of these bean definitions is implemented by DataONE's SolrField class hierarchy. Javadoc for SolrField and its associates can be found here: http://dev-testing.dataone.org:8080/hudson/view/CN%20Snapshot%20Jobs/job/d1_cn_index_processor/ws/target/site/apidocs/index.html
Example field definition bean:
<bean id="eml.keywords" class="org.dataone.cn.indexer.parser.SolrField">
<constructor-arg name="name" value="keywords" />
<constructor-arg name="xpath" value="//dataset/keywordSet/keyword/text()" />
<property name="multivalue" value="true" />
<property name="dedupe" value="true" />
</bean>
The bean's name is "eml.keywords" and it is responsible for mining the keyword search field from EML science metadata documents. This bean is an instance of the DataONE SolrField class. Its first constructor argument defines the search field name this data will be mapped to - "keywords". Keywords is a multi-value field in the Solr search index, meaning that the 'keywords' field is an array or list of values (each value a different keyword). The second argument to this bean is the 'xpath' variable, which contains an XPath expression. This expression defines the path to the desired data in an EML document. In this case the 'dataset' element contains a 'keywordSet' element, which in turn contains the 'keyword' element. We are interested in the text value of these nodes, which is expressed by: //dataset/keywordSet/keyword/text().
The next property defined for this bean is 'multivalue', which tells SolrField that this search field is multivalued and to place each keyword in its own position in the multivalued field. The final property defined for this bean is 'dedupe'. It simply tells SolrField to remove any duplicate keywords found in the current EML document. When this bean is run over an EML document, the search field 'keywords' will be filled with values from the EML document's 'keywordSet' collection. This is an example of a simple search field definition, but the majority of search fields do not get much more complicated than this.
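On the Solr side, the target field must itself be declared to accept multiple values. The schema entry below is only an illustration of that requirement; the field type and attribute values are assumptions, not copied from the DataONE Solr schema:
<!-- Illustrative only: 'keywords' is declared multiValued so each keyword is stored as a separate value. -->
<field name="keywords" type="string" indexed="true" stored="true" multiValued="true" />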
The configuration for a specific version of a science metadata format defines a bean which is the document processor class. This bean defines which DataONE formatId this instance of the document processor operates on (it will be matched against systemMetadata.formatId). The bean also defines a property called 'fieldList', which is a list of field definition bean names. For example: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml200.xml
In this file you will notice that the 'fieldList' property contains many "ref" elements. Each of the 'bean' property values in the 'ref' elements refers to a search field definition bean provided in the EML base context file shown above (application-context-eml-base.xml).
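For illustration, a version-specific document processor bean for EML 2.0.0 follows the pattern below. This is a sketch only; the exact bean id, matchDocument expression, formatId value, and field references should be verified against the actual application-context-eml200.xml:
<!-- Sketch only: verify all values against application-context-eml200.xml. -->
<bean id="eml200Subprocessor" class="org.dataone.cn.indexer.parser.ScienceMetadataDocumentSubprocessor">
    <property name="matchDocument" value="/d100:systemMetadata/formatId[text() = 'eml://ecoinformatics.org/eml-2.0.0']" />
    <property name="fieldList">
        <list>
            <ref bean="eml.keywords" />
            <!-- ...additional refs to field definition beans from the base file... -->
        </list>
    </property>
</bean>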
Adding a new science metadata format to the index processor is mostly a matter of providing new Spring configuration to direct the index processor how to act when it encounters the new science metadata formatId during normal index processing. To begin, a new 'base' configuration file of search field definition rules should be created to map science metadata data values to search fields. See the previous section for more detail on creating search field definition configuration. Examples of these 'base' configurations can be found here: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/
See the application context files that end in ‘base’ for examples of search field definitions by science metadata format.
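As a rough sketch, a 'base' file for a hypothetical new format would contain a set of SolrField beans like the ones shown earlier. All bean ids, field names, and XPath expressions below are made up for illustration:
<!-- Hypothetical 'base' context file for a new science metadata format.
     Bean ids, field names, and XPath expressions are illustrative only. -->
<bean id="newformat.title" class="org.dataone.cn.indexer.parser.SolrField">
    <constructor-arg name="name" value="title" />
    <constructor-arg name="xpath" value="//metadata/title/text()" />
</bean>

<bean id="newformat.keywords" class="org.dataone.cn.indexer.parser.SolrField">
    <constructor-arg name="name" value="keywords" />
    <constructor-arg name="xpath" value="//metadata/keywords/keyword/text()" />
    <property name="multivalue" value="true" />
</bean>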
The next step is to define the document processor bean for the new science metadata formatId. Examples of this configuration can be found in the same directory, this time in the version-specific context files - for example, application-context-fgdc-std-0012-1999.xml: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-fgdc-std-0012-1999.xml
Each of these configuration files contains one bean, which is the 'document processor bean' for the science metadata format:
<bean id="fgdcstd00121999Subprocessor" class="org.dataone.cn.indexer.parser.ScienceMetadataDocumentSubprocessor">
<property name=”matchDocument” value=”/d100:systemMetadata/formatId[text() = ‘FGDC-STD-001.2-1999’]”></property> <property name=”fieldList”>
<list> <ref bean=”fgdc.abstract” /> <ref bean=”fgdc.beginDate”/> <ref bean=”fgdc.attributeText” /> </list></property>
</bean>
In this example, the name of the ‘document processor bean’ is: ‘fgdcstd00121999Subprocessor’.
After this, the new context files need to be registered with the index processing daemon's configuration: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/processor-daemon-context.xml Simply add new <import> elements for the new context files.
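For example, assuming the new context files are named as below (the file names are hypothetical; use the actual names of the new files), the imports would look like:
<!-- Hypothetical file names for the new format's context files. -->
<import resource="application-context-newformat-base.xml" />
<import resource="application-context-newformat-1.0.xml" />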
The final step is to register the document processor bean with the uber document processor bean. This is a bean which contains references to all the ‘document processor beans’: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/index-processor-context.xml
A new 'ref' element for the new 'document processor' bean should be added to the 'subprocessors' list of this bean; the current configuration is shown below, and a sketch of the new entry follows the listing:
<bean id="documentParsers" class="java.util.ArrayList" autowire="byName">
<constructor-arg>
<list>
<bean class="org.dataone.cn.indexer.XPathDocumentParser">
<constructor-arg name="fields" ref="xpath_system_metadata_100">
</constructor-arg>
<constructor-arg name="xmlNamespaceConfig" ref="xmlNamespaceConfig" />
<property name="solrBaseUri" value="${solr.base.uri}" />
<property name="httpService" ref="httpService" />
<property name="subprocessors">
<list>
<ref bean="eml200Subprocessor" />
<ref bean="eml201Subprocessor" />
<ref bean="eml210Subprocessor" />
<ref bean="eml211Subprocessor" />
<ref bean="resourceMapSubprocessor" />
<ref bean="fgdcstd0011998Subprocessor" />
<ref bean="fgdcstd00111999Subprocessor" />
<ref bean="fgdcstd00121999Subprocessor" />
<ref bean="fgdcEsri80Subprocessor" />
<ref bean="dryad30Subprocessor" />
<ref bean="dryad31Subprocessor" />
</list>
</property>
</bean>
</list>
</constructor-arg>
</bean>
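To register a new science metadata format, a reference to its document processor bean is added alongside the existing entries in the 'subprocessors' list above. A minimal sketch, assuming a hypothetical bean id for the new format:
<!-- Hypothetical bean id; use the id of the new format's document processor bean. -->
<ref bean="newformatSubprocessor" />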
From here a unit test can be developed to ensure the search field definitions and mappings are working as expected. For an example of unit tests for science metadata see: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/test/java/org/dataone/cn/index/SolrFieldXPathFgdcTest.java and https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/test/java/org/dataone/cn/index/SolrFieldXPathEmlTest.java and https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/test/java/org/dataone/cn/index/SolrFieldXPathDryad31Test.java
Once the configuration and unit test are complete, the new science metadata format is ready to be processed for the search index. The final step is to copy the new context files and the two modified context files to the associated Debian buildout project. Currently the trunk location for these files is here: https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-index/usr/share/dataone-cn-index/debian/index-generation-context/