Content Discovery
=================

.. module:: SearchMetadata

.. currentmodule:: SearchMetadata

.. contents::

Searching for content is a functionality that is primarily supported by
Coordinating Nodes, programmatically through the CN_read.search method, and
through a web browser interface that connects through a user interface
implemented by Mercury_ which utilizes a SOLR_ index to support search
operations.

This document describes the properties of the SOLR index, how it is populated,
and how it can be used to support programmatic search and introspection into the
DataONE holdings.

The SOLR index is populated from content that is copied to Coordinating Nodes,
that is :term:`System Metadata`, :term:`Science Metadata` and :term:`Resource
Maps`. Each index entry represents information about the information referenced
by a single identifier (PID). For PIDs that refer to science data objects the
index entry will be constructed from system metadata and zero or more resource
maps that reference the object. If a PID identifies a science metadata object,
the index entry will have fields populated with content extracted from the
system metadata for the object, the science metadata document and zero or more
resource maps that reference the science metadata. Similarly, if the PID
identifies a resource map, then the index entry fields for that PID will be
populated by the System Metadata and resource map values for that PID.

The general sequence for indexing content is illustrated in Figure 1 below.

The general process of indexing content involves retrieving a System Metadata
document, parsing system metadata properties such as permission rules to
generate a SOLR index document, then optionally adding to the document if the
PID refers to a science metadata or resource map object. The general sequence
of operations is outlined in Figure 1. If the PID refers to a resource map,
then it is necessary to update the :attr:`~SearchMetadata.documents` and
:attr:`~SearchMetadata.isDocumentedBy` properties of the Science Metadata and
data objects that appear in the data package defined by the Resource Map.

.. figure:: images/index_sequence.png

.. 
   @startuml images/index_sequence.png
   
   (*) -right-> "System metadata updated"
   --> "Index Task"
   --> "Extract system metadata"
   if "Science\nMetadata?" then
     -right-> [true] "Parse science metadata"
     --> "Update SOLR"
   else
     if "Resource\nMap?" then
       -right-> [true] "Parse resource map"
       --> "Update data and metadata\n resource map index fields"
       --> "Update SOLR"
     else
       --> "Update SOLR"
       --> (*)
     endif
   endif

   @enduml


A full list of index fields is provided in Table 4 below.


Querying the SOLR Index
-----------------------

The `SOLR query syntax`_ is based on the `Lucene query syntax`_ and adds a
couple of extensions. The SOLR search index is exposed directly as a SOLR search
service with the base url of::

  http(s)://server.name/solr/


System Metadata Index Properties
--------------------------------

The main goals for indexing system metadata include:

1. Support access control rules for search results

2. Support efficient discovery of objects by their properties

Table 1 provides a list of SOLR attributes that are populated by values taken
from System Metadata.

**Table 1.** Elements derived from System Metadata and available for all
objects. "Multi" fields may be populated with more than a single value per
document (PID) indexed.

.. sqltable:: System Metadata Properties
   :header: Attribute,Type,Multi?,Content Origin
   :widths: 3 2 1 10
   :driver: xlsx
   :source: source/design/data/SearchFields.xlsx
   :sql: SELECT ':attr:`~SearchMetadata.'||SOLR||'`', FieldType, Multi, Reference FROM SystemMetadata;


Populating Permission Fields
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This process is fairly straight forward, and requires simply de-duping entries
so that each of the permission fields contains a distinct list of subjects
granted that permission. If the DataONE :term:`public user` is present in the
read permissions, then the *isPublic* field is set to True.


Example Queries
~~~~~~~~~~~~~~~

The following provide some example queries that may be issued directly against
the SOLR index operating on the Coordinating Nodes. Note that search terms must
be properly escaped for the SOLR query syntax, and additionally if submitted
directly as part of a URL, must also be properly URL escaped.

1. Index entries that match PID (should be at most one match)::

     id:"PID"

   where PID = identifier of object being located

2. Index entries with PID that starts with "some_prefix"::

     id:"some_prefix*"

3. Index entries that match a particular format id::

     formatId:"format_a"

   where format_a = the formatId being located

4. Entries with formatId matching "fmtid_1" or "fmtid_2"::

     formatId:"fmtid_1" || formatId:"fmtid_2"

5. Entries with byte size less than or equal to 10000::

     size:[* TO 10000]

6. Entries with byte size less than 10000::

     size:{* TO 10000}

7. Objects modified before 09:56:04 on 2012-01-03 UTC::

     datemodified:{* TO 2012-01-03T09:56:04.000Z}

8. Objects modified less than 10 minutes ago::

     datemodified:[NOW-10MINUTE TO *]

9. Objects of a formatId equal to "format_a" and modified within one day::

     formatId:"format_a" AND datemodified:[NOW-1DAY TO *]

10. Metadata associated with data objects of relevance to Photosynthesis::

      photosynthesis AND documents:[* TO *]


Properties of the Index Derived from Resource Maps
--------------------------------------------------

Table 2 provides a list of SOLR attributes that are populated by values obtained
by parsing resource map objects.

**Table 2.** Elements derived from Resource Map documents. "Multi" fields may be
populated with more than a single value per document (PID) indexed.

.. sqltable:: Resource Map Properties
   :header: Attribute,Type,Multi?,Content Origin
   :widths: 3 2 1 10
   :driver: xlsx
   :source: source/design/data/SearchFields.xlsx
   :sql: SELECT ':attr:`SearchMetadata.'||SOLR||'`', FieldType, Multi, Reference FROM ResourceMap;


Populating Object Relation Fields
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fields *resourceMap*, *documents*, and *isDocumentedBy* contain
identifiers of other objects in the DataONE system related to the subject of
the current record. These relationships are recorded in OAI-ORE resource map
documents as described in the section :doc:`DataPackage`. Since the values for
these fields are contained in resource maps, it is necessary to build these
fields in progression as additional resource maps are synchronized with the
coordinating nodes.

For each object, a query is issued against the index to retrieve the list of
resource maps in which the object identifier is present, enabling population of
the *resourceMaps* field for the object. Each of those resource maps is examined
to obtain the list of identifiers for the *documents* field if the object is
science metadata, and the *isDocumentedBy* field if the object is data.

**Example Sequence for Index Population**

The following sequence steps through the generation of index entries for three
data packages, providing an example of how the index entries are updated
depending on relationships between the objects being added.


Adding Package 1
................

  .. graphviz::
                            
         digraph RM1 {        
         graph [rankdir="LR", label="Package 1"];
         A [shape=folder];    
         B [shape=note];      
         C [shape=oval];      
         A -> B;              
         A -> C;              
       }                      

A simple data package with a Resource Map *A* indicating that the metadata
document *B* and the data object *C* together form a data package, and that
*B* documents *C* and that *C* isDocumentedBy *B*.

Index entries after Package 1 is added.

  .. list-table::
     :header-rows: 1
     :widths: 3 3 3 3
   
     * - PID
       - resourceMaps
       - documents
       - isDocumentedBy
     * - A
       - 
       -
       -
     * - B
       - A
       - C
       -
     * - C
       - A
       -
       - B


Adding Package 2
................

  .. graphviz::

         digraph RM1 {
         graph [rankdir="LR", label="Package 2"];
         D [shape=folder];
         B [shape=note];
         E [shape=oval];
         D -> B;
         D -> E;
       }

A simple data package with a Resource Map *D* indicating that the metadata
document *B* and the data object *E* together form a data package, and that
*B* documents *E* and that *E* isDocumentedBy *B*. Note that the metadata
document *B* is also used in Package 1.

Index entries after package 2 is added. New entries and changes to existing
are indicated in **bold**.

  .. list-table::
     :header-rows: 1
     :widths: 3 3 3 3
   
     * - PID
       - resourceMaps
       - documents
       - isDocumentedBy
     * - A
       - 
       -
       -
     * - B
       - A, **D**
       - C, **E**
       -
     * - C
       - A
       -
       - B
     * - D
       - 
       -
       - 
     * - E
       - **D**
       -
       - **B**


Adding Package 3
................

  .. graphviz::

         digraph RM1 {
         graph [rankdir="LR", label="Package 3"];
         F [shape=folder];
         G [shape=note];
         D [shape=folder];
         F -> G;
         F -> D;
       }

A data package that references another, already indexed data package
(represented by the Resource Map *D*) as its data object.

Index entries after package 3 is added. New entries and changes to existing
are indicated in **bold**.

  .. list-table::
     :header-rows: 1
     :widths: 3 3 3 3
   
     * - PID
       - resourceMaps
       - documents
       - isDocumentedBy
     * - A
       - 
       -
       -
     * - B
       - A, D
       - C, E
       -
     * - C
       - A
       -
       - B
     * - D
       - **F**
       -
       - **G**
     * - E
       - D
       -
       - B
     * - F
       - 
       -
       - 
     * - G
       - **F**
       - **D**
       - 


Limitations of Multi-Valued Fields in SOLR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Multi-valued fields in SOLR have an internal representation that can be
likened to a delimited concatenation of all the values for the field. The
fields have a configurable limit to the maximum field size, and this limit may
be encountered in the various entries such as the permission fields or object
relation fields where a potentially large number of subjects or identifiers
may be stored. Further research has indicated this problem is unlikely since
simply increasing the SOLR *maxFieldLength* configuration value will enable
larger field content, with examples in the wild of several thousand entries
equivalent to DataONE identifiers in a single field.


.. _Mercury: http://mercury.ornl.gov/

.. _OAI-ORE: http://www.openarchives.org/ore/


Values Extracted from Science Metadata
--------------------------------------

Indexing the system metadata entries provides a mechanism for selection of
content using low level attributes of the objects such as the type, size and
relationships, however is not useful for locating content relevant to a
particular topic or purpose. The :term:`science metadata` contains such
information and the index records associated with these documents contain
additional information described below.

The initial web user interface to DataONE utilizes the Mercury_ system. This
in turn relies upon a SOLR index to support the search, faceting, and
sub-setting operations that are exposed through Mercury. The underlying SOLR
index is also exposed through as an API, so other clients (such as the file
system driver) can leverage the search capabilities thus provided.

The remainder of this document describes the search properties that are
extracted from the various formats and syntaxes of of science metadata that
are supported by the DataONE indexing system.


**Table 3.** Index fields populated from science metadata documents. "Multi"
fields may be populated with more than a single value per document (PID)
indexed.

.. sqltable:: Resource Map Properties
   :header: Field,Type,Multi?,Description
   :widths: 2 1 1 10
   :driver: xlsx
   :source: source/design/data/SearchFields.xlsx
   :sql: SELECT ':attr:`SearchMetadata.'||SOLR||'`', FieldType, Multi, Description FROM Fields WHERE Source='S';


Date Representations in Science Metadata Documents
--------------------------------------------------

The following are examples of date values extracted from FGDC, ISO-19115, and
EML science metadata documents currently in use. The literal value appearing in
the document and the interpreted date value are shown.

Examples of date values appearing in FGDC *<pubdate>*:

================================== ====================
Value                              Interpreted Value
================================== ====================
Unknown                            Null
unknown                            Null
Unpublished material               Null
unpublished material               Null
1993                               1993-01-01 00:00:00Z
199607                             1996-07-01 00:00:00Z
20000101                           2000-01-01 00:00:00Z
19981231                           1998-12-31 00:00:00Z
196820405                          1968-01-01 00:00:00Z
1992 onwards                       1992-01-01 00:00:00Z
1989 and 1990                      1989-01-01 00:00:00Z
varies                             Null
Present                            Null
1995/1996                          1995-01-01 00:00:00Z
1991-1992                          1991-01-01 00:00:00Z
variouis                           Null
April 1999                         1999-04-01 00:00:00Z
1980 on                            1980-01-01 00:00:00Z
2005-06-24                         2005-06-24 00:00:00Z
NA                                 Null
1990- [unpublished annual reports] 1990-01-01 00:00:00Z
November, 1994                     1994-11-01 00:00:00Z
================================== ====================

Examples of date values appearing in ISO-19115 *<gco:Date>*:

================================== ====================
Value                              Interpreted Value
================================== ====================
1999                               1999-01-01 00:00:00Z  
2010-03-03                         2010-03-03 00:00:00Z
================================== ====================


Examples of date values appearing in EML *<calendarDate>*: 

================================== ====================
Value                              Interpreted Value
================================== ====================
2002-06-20                         2002-06-20 00:00:00Z
1998                               1998-01-01 00:00:00Z
2004-02-13                         2004-02-13 00:00:00Z
================================== ====================


Science metadata examples
-------------------------

Examples for integration testing:

  https://repository.dataone.org/software/cicore/trunk/d1_integration/src/test/resources/d1_testdocs/

FGDC:

 http://mercury.ornl.gov/metadata/nbii/xml/nbii/

ISO 19115:

 http://mercury.ornl.gov/metadata/ornldaac/xml/daac/iso/

EML:

 2.0.0: https://repository.dataone.org/software/cicore/trunk/d1_integration/src/test/resources/d1_testdocs/eml200/

 2.0.1: https://repository.dataone.org/software/cicore/trunk/d1_integration/src/test/resources/d1_testdocs/eml201/

 2.1.0: https://repository.dataone.org/software/cicore/trunk/d1_integration/src/test/resources/d1_testdocs/eml210/


Standard Specific Metadata Extraction Notes
-------------------------------------------


.. toctree::
   :maxdepth: 1

   SearchMetadata_eml
   SearchMetadata_fgdc
   SearchMetadata_dryad


Attribute Descriptions and Notes
--------------------------------

A list of all fields in search index, their origin and brief description. Source
column indicates which DataONE content the values are retrieved from: Y = System
Metadata, S = Science Metadata, R = Resource Map, and C = copied internally by
the search index.

.. sqltable:: . . . 
   :header: Index Field,Multi?,Source,Description
   :widths: 1 1 1 10
   :driver: xlsx
   :source: source/design/data/SearchFields.xlsx
   :sql: SELECT '.. attribute:: '||SOLR, Multi, Source, Description FROM Fields ORDER BY LOWER(SOLR) ASC;

----


Creating Citations from Index Fields
------------------------------------

**Table 3.** Cross reference of index fields and elements useful in citation
best practices as exemplified in the `interagency data stewardship`_ wiki page
of ESIP. * = Mandatory (if applicable). COinS tags provide a mechanism of
embedding `Z39.88-2004`_ elements in HTML for consumption by citation manager
tools.

.. list-table::
   :header-rows: 1
   :widths: 5 5 5 5 5

   * - Citation Element
     - FGDC CSDGM field
     - DataCite Metadata Scheme ID and Property
     - DataONE Index Field
     - COinS_ Tag
   * - Author or Creator*
     - idinfo > citation > citeinfo > "origin"
     - 2 Creator*
     - :attr:`author`
     - rft.creator
   * - Release Date*
     - idinfo > citation > citeinfo > "pubdate" and sometimes "othercit"
     - 5 PublicationYear*
     - :attr:`datePublished`
     - rft.date
   * - Title*
     - idinfo > citation > citeinfo > "title" and possibly "edition"
     - 3 Title*
     - :attr:`title`
     - rft.title
   * - Version*
     - 
     - 15 Version
     - See notes below.
     - 
   * - Archive and/or Distributor*
     - idinfo > citation > citeinfo > "publish"
     - 4 Publisher*
     - See notes below.
     - rft.publisher
   * - Locator, Identifier, or Distribution Medium*
     - idinfo > citation > citeinfo > "othercit" or "onlink"
     - 1 Identifier*
     - :attr:`pid`
     - rft.identifier
   * - Access Date and Time*
     - not applicable
     - 8 Date
     - perhaps :attr:`dateUploaded`
     - 
   * - Subset Used
     - not applicable
     - 12 RelatedIdentifier DataCite recommends obtaining an identifier for
       any subset that needs to be cited as well as an identifier for the
       larger whole.
     - perhaps a reference to the resource map?
     - 
   * - Editor or Other Important Role
     - idinfo > citation > citeinfo > "origin"
     - 7 Contributor
     - See notes below.
     - rft.source
   * - Publication Place
     - idinfo > citation > citeinfo > "pubplace"
     - 17 Description
     - See notes below.
     - 
   * - Distributor, Associate Archive, or other Institutional Role
     - idinfo > citation > citeinfo > "othercit"
     - 7 Contributor or possibly 4 Publisher
     - "http://dataone.org"
     - 
   * - Data Within a Larger Work
     - idinfo > citation > citeinfo > "othercit" or "lworkcit"
     - 12 RelatedIdentifier
     - perhaps a reference to the resource map?
     - 
   * - 
     - 
     - 
     - :attr:`keyWord`
     - rft.subject
   * - 
     - 
     - 
     - :attr:`abstract`
     - rft.description
   * - 
     - 
     - 
     - object format name, or "data" | "metadata" | "resource map"
     - rft.type
   * - 
     - 
     - 
     - object format type
     - rft.format
   * - 
     - 
     -
     - constructed from bounding box
     - rft.coverage


**Notes**

:Version:

  There is no clear recommendation on how to express version. For an object
  with a linear revision history (A is obsoleted by B is obsoleted by C),
  object C might be considered to have a version of "2". However, it is
  possible for an object to have a tree lineage (A1 and A2 are obsoleted by
  B1, and B1, A3 and Z are obsoleted by C). A suggestion for a version
  indicator in this case may be the total number of parent objects, 5 in this
  case. Another option may be to indicate the number of contributing objects
  at each revision, 1.1 in the first case, and 2.3 in the second.


:Publisher:

  The publisher in DataONE may in some cases be the origin member node,
  however this will not always be the case. For example, many data sets in the
  KNB are published by LTER, though KNB would be the member node. Where
  possible the publisher information should be pulled from the science
  metadata, though this may be complicated where multiple data and metadata
  entries occur within a data package and there are several publishers listed.
  In this case, a simple concatenation of the publishers (semi-colon + space
  delimited) may be sufficient.


:Date:

  In the DataCite 2.2 document, a mapping between Date and the dcterms:date is
  suggested. The description of dcterms:date is "A point or period of time
  associated with an event in the lifecycle of the resource." In DataONE there
  are two main dates related to an object, when the content is added to the
  system, and when the properties of the object (access control, and so forth)
  are modified. Since the content of the object does not change once submitted
  to DataONE, the :attr:`SystemMetadata.dateUploaded` value seems appropriate
  here, expressed as :attr:`dateUploaded` in the search index.


:Subset Used:

  DataONE does not provide a mechanism for identifying sets of information
  other than through resource maps which describe data packages. It is
  possible that a resource map may contain references to additional resource
  maps, conceptually equivalent to subsets of content. The simplest option
  here would be to use the identifier of the resource map object describing
  the package in question.


:Contributor:

  Described in DataCite 2.2 as "The institution or person responsible for
  collecting, creating, or otherwise contributing to the development of the
  dataset", this optional element can be populated by the citeinfo/origin
  element from FGDC.


:Publication Place:

  In CSDGM, "the name of the city (and state or province, and country, if
  needed to identify the city) where the data set was published or released".


:Data Within a Larger Work:

  This is similar to the "Subset Used" entry except in the other direction of
  containment. The identifier of the containing resource map relevant to the
  current package should be used.


.. _interagency data stewardship: http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines
.. _DataCite terms: http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines
.. _Z39.88-2004: http://www.niso.org/kst/reports/standards/kfile_download?id%3Austring%3Aiso-8859-1=Z39-88-2004.pdf&pt=RkGKiXzW643YeUaYUqZ1BFwDhIG4-24RJbcZBWg8uE4vWdpZsJDs4RjLz0t90_d5_ymGsj_IKVaGZww13HuDlZQ8NBt1sTxP_v4iiGqH7rSaAeVDnMfeKJrrJ_JSEGPB
.. _COinS: http://ocoins.info/
.. _Well Known Text: http://en.wikipedia.org/wiki/Well-known_text
.. _SOLR: http://lucene.apache.org/solr/
.. _SOLR query syntax: http://wiki.apache.org/solr/SolrQuerySyntax
.. _Lucene: http://lucene.apache.org/java/docs/index.html
.. _Lucene query syntax: http://lucene.apache.org/java/2_9_1/queryparsersyntax.html


..
  NOTES etc (not rendered below)

  Searching content in DataONE involves the process of matching a user provided
  value with some value extracted from the science metadata.

  Values can be simple (e.g. string, integer, floating point, date) or compound
  (e.g. spatial coordinate constructed from two floats - the latitude and
  longitude and a string - the coordinate system identifier).

  Expression in science metadata

  Expression in extracted values that are indexed

  Expression in a query

  * Integer

  * Float

  * String

  * DateTime

  * SpatialFeature (use `Well Known Text`_ )

  * Publication (A structured type, perhaps search by decomposing to elements)

  * ScientificName (A structured type, perhaps search by decomposing to elements)