Register an Object Format

While DataONE recognizes many common object formats, others will need to be registered over time. Object formats fall into three types: DATA, METADATA, and RESOURCE, representing data objects, metadata objects, and resource maps, respectively. DataONE is responsible for maintaining the extent and categorization of all individual object formats.

All format identifiers are registered in each Coordinating Node environment via a manual process by CN operators.

  • RESOURCE format registration

    Currently, DataONE only reads one type of object format for recording data package relationships (http://www.openarchives.org/ore/terms). New formats also require development, testing and deployment of parsers before they can be considered fully registered.

Manually Adding Object Formats

DataONE is primarily concerned with the proper scoping and MIME type associations of new object formats that represent data objects. Registration is a straightforward process that requires little testing. Once formats are registered into the object format list, additional work may have to occur for further processing of metadata formats.

The DataONE Object Format list is maintained on the Coordinating Nodes for each environment. For a given environment, the object format list needs to be added to a single CN during a fresh install of the CN, and the Metacat application on each CN handles the replication of the list to the other CNs in the environment. The production list is maintained in the dataone-cn-metacat buildout package and is named objectFormatList.xml. The insertOrUpdateObjectFormatList.sh script is also maintained in the same directory, and provides a convenient way to insert or update the document in Metacat.

First time inserts in a new CN environment

When a Coordinating Node is first installed, the object format list needs to be inserted into the Metacat database. To do so, on one of the CNs in the environment, issue the following commands:

$ cd /usr/share/metacat/debian
$ sudo chmod +x insertOrUpdateObjectFormatList.sh
$ sudo ./insertOrUpdateObjectFormatList.sh objectFormatList.xml

When prompted for the password, enter the password for the uid=dataone_cn_metacat,o=DATAONE,dc=ecoinformatics,dc=org user, which is stored in the SystemPW.txt.gpg file in subversion.

Note: We’ve changed the above DN in the production environment to cn=dataone_cn_metacat,dc=dataone,dc=org. Because of this, before executing the script, change the script to have:

username="cn=dataone_cn_metacat,dc=dataone,dc=org";

Use the password for this DN found in the ProductionPW.txt.gpg file in subversion.
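
The edit described in the note above can be scripted. The following is a minimal sketch, assuming the script assigns the DN on a single line that begins with username= (inspect the script first to confirm this). The helper name set_script_username is hypothetical and is not part of the buildout package.

```shell
# Hypothetical helper: rewrite the username= line of a script in place.
# Assumes the DN is assigned on one line that starts with "username=".
set_script_username() {
    script="$1"
    dn="$2"
    tmp=$(mktemp) || return 1
    # Replace the whole assignment line; all other lines pass through unchanged.
    sed "s|^username=.*|username=\"$dn\";|" "$script" > "$tmp" || return 1
    # Copy the contents back so the script keeps its owner and mode.
    cat "$tmp" > "$script" && rm -f "$tmp"
}
```

It could be called as set_script_username insertOrUpdateObjectFormatList.sh "cn=dataone_cn_metacat,dc=dataone,dc=org" by a user with write access to the script. The sketch does not handle a DN containing the characters |, & or \.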

Updating the object format list

Before updating the list, consult the Unified Digital Format Registry and search for the file format there to help decide what the DataONE formatId should be. It’s important to ensure that the formatId is unique, and that it is versioned in some manner in order to accommodate future iterations of the format. Also look through the existing objectFormatList.xml to ensure the format doesn’t already exist, perhaps under a different formatId.
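
The check against the existing list can be sketched as a small shell function. This is an illustration only: format_id_exists is a hypothetical helper, and it assumes each identifier is written as <formatId>VALUE</formatId> with no whitespace inside the element.

```shell
# Hypothetical helper: succeed if the given formatId is already in the list.
# Assumes each identifier appears as <formatId>VALUE</formatId> without
# surrounding whitespace inside the element.
format_id_exists() {
    list_file="$1"
    format_id="$2"
    # -F treats the identifier literally, so dots and slashes are not regex syntax.
    grep -F -q "<formatId>${format_id}</formatId>" "$list_file"
}
```

An exact match only catches an identical formatId. To look for the same format registered under a different identifier, a case-insensitive search on part of the name, such as grep -i netcdf objectFormatList.xml, is a useful complement.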

To update the list, do an svn checkout of the dataone-cn-metacat package:

$ svn co https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-metacat

Modify the objectFormatList.xml file by adding new formats according to the ObjectFormat Type. Never modify an existing format, and never delete an existing format. Update the total and count attributes of the ObjectFormatList element. It can be helpful to use xmlstarlet to count the total as a cross check:

$ xmlstarlet sel -t -v "count(//objectFormat/formatId)" objectFormatList.xml
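
The edit and the cross check can also be sketched with standard tools, for hosts where xmlstarlet is not installed. Both helper names below are hypothetical. The formatName and formatType child elements are assumptions that should be confirmed against the existing entries in objectFormatList.xml, and the count check assumes the total and count attributes appear only on the list's root element.

```shell
# Hypothetical helper: print one <objectFormat> entry for pasting into the list.
# formatName and formatType are assumed child elements; confirm against
# the existing entries in objectFormatList.xml. Values are not XML-escaped.
print_format_entry() {
    cat <<EOF
<objectFormat>
    <formatId>$1</formatId>
    <formatName>$2</formatName>
    <formatType>$3</formatType>
</objectFormat>
EOF
}

# Hypothetical helper: compare the declared total and count attributes
# with the number of <formatId> elements actually present in the file.
check_format_counts() {
    list_file="$1"
    actual=$(grep -o '<formatId>' "$list_file" | wc -l | tr -d '[:space:]')
    total=$(grep -o 'total="[0-9]*"' "$list_file" | head -n 1 | tr -dc '0-9')
    count=$(grep -o 'count="[0-9]*"' "$list_file" | head -n 1 | tr -dc '0-9')
    echo "formatId elements: $actual, total: $total, count: $count"
    # Succeed only when both attributes agree with the element count.
    [ "$actual" = "$total" ] && [ "$actual" = "$count" ]
}
```

Running check_format_counts objectFormatList.xml before the commit prints the three numbers and returns a non-zero status if they disagree.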

Commit the changes:

$ svn commit objectFormatList.xml

Copy the new list to the CN you are modifying, and replace the file in /usr/share/metacat/debian/objectFormatList.xml:

$ scp objectFormatList.xml cn-dev-ucsb-1.test.dataone.org:
$ ssh cn-dev-ucsb-1.test.dataone.org
$ sudo cp objectFormatList.xml /usr/share/metacat/debian/objectFormatList.xml

Lastly, run the update script against the new format list document:

$ cd /usr/share/metacat/debian
$ sudo chmod +x insertOrUpdateObjectFormatList.sh
$ sudo ./insertOrUpdateObjectFormatList.sh objectFormatList.xml

After being prompted for the password, the list should be updated in Metacat.

You can verify that each CN has the updated list by visiting the CN’s formats REST endpoint:

https://cn-dev-ucsb-1.test.dataone.org/cn/v1/formats
https://cn-dev-unm-1.test.dataone.org/cn/v1/formats
https://cn-dev-orc-1.test.dataone.org/cn/v1/formats
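
Checking the three endpoints can be automated. The following is a sketch, assuming curl is available and that each endpoint returns the complete list in a single XML response containing <formatId> elements. The function names are hypothetical; the host names are the DEV hosts listed above.

```shell
# Hosts taken from the DEV URLs listed above.
CN_HOSTS="cn-dev-ucsb-1.test.dataone.org cn-dev-unm-1.test.dataone.org cn-dev-orc-1.test.dataone.org"

# Build the formats endpoint URL for one CN host.
formats_url() {
    echo "https://$1/cn/v1/formats"
}

# Hypothetical helper: report whether each CN serves the given formatId.
verify_format_on_cns() {
    format_id="$1"
    for host in $CN_HOSTS; do
        # -f fails on HTTP errors, -sS hides progress but keeps error messages.
        if curl -fsS "$(formats_url "$host")" \
            | grep -F -q "<formatId>${format_id}</formatId>"; then
            echo "$host: found"
        else
            echo "$host: MISSING"
        fi
    done
}
```

Calling verify_format_on_cns with the newly added formatId prints one line per CN. A MISSING result shortly after the update may only mean that replication between the CNs has not finished yet.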

Maintenance of all format lists

When updating the object format list, it’s best to do so in all environments at once because the list only gets initially added when first installing the CN. So, perform the above steps for DEV, SANDBOX, SANDBOX2, STAGE, STAGE2, and PRODUCTION environments.

Note that the identifier for the object format list XML document may differ across environments because some environments get wiped clean and re-installed. For instance, in DEV it might be OBJECT_FORMAT_LIST.1.1 whereas in PRODUCTION it might be OBJECT_FORMAT_LIST.1.8.

METADATA format registration

In addition to the concerns of data format registration, DataONE Coordinating Nodes must parse metadata objects in order to add their information to the search index, and so the format needs to be tested, and parsers built and deployed before it can be considered fully registered.

Overview

While DataONE’s architecture is designed to accommodate any metadata format Member Nodes make use of, each new metadata format requires some development to enable DataONE’s discovery mechanisms for those metadata documents. Effort is required from both the Content Curator (usually a Member Node administrator) and DataONE developers and, more significantly, a patch-level release of the CN software stack must be performed so that content of the new format can be synchronized, indexed, and ultimately discovered. Building, testing, and deploying the necessary items to the CNs introduces a lag between when the new format is published and when content using it can be successfully created. Accordingly, content curators making use of a new format, or a new version of an existing format, need to account for that lag in their own planning.

The process of registering a new metadata format involves the creation and testing of the following items:

  1. a published schema or DTD (done by Content Curator)
  2. an indexing parser (a DataONE developer responsibility)
  3. an XSLT template (built by either, depending on time and ability) // TODO: verify who’s responsible

Once all are available and tested, the format can be fully registered into DataONE as a new object format.

When done as part of a new Member Node deployment, it is good to plan for this work to be done early on, as final testing of the node requires that all objects use a registered format.

Metadata Format Registration

Irrespective of Member Node deployment, registering a metadata format follows the same steps:

Content Curators:

  1. develop and test their schema or DTD. The schema or DTD needs to pass standard schema validation tests that can be found at numerous testing services online (search for “online XML schema validation”).
  2. publish the schema such that the namespace and schemaLocation of the metadata documents point to an immutable copy of the schema, where it can continue to be resolved consistently indefinitely.
  3. contact DataONE via support@dataone.org, attaching example metadata documents, or providing a link to a test instance of the Member Node that contains them.
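
The curator-side checks can also be run locally before contacting DataONE. The sketch below assumes xmllint (from libxml2) is installed, that the schema is a W3C XML Schema, and that each document declares xsi:schemaLocation on one line as a single "namespace location" pair. Both function names are hypothetical.

```shell
# Hypothetical helper: print the schema location URL declared by a document.
# Assumes xsi:schemaLocation holds a single "namespace location" pair.
schema_location_url() {
    grep -o 'schemaLocation="[^"]*"' "$1" | head -n 1 \
        | sed 's/^schemaLocation="//; s/"$//' | awk '{print $2}'
}

# Hypothetical helper: validate example documents against a local schema copy.
# Requires xmllint (from libxml2) to be installed.
validate_examples() {
    schema="$1"
    shift
    for doc in "$@"; do
        # --noout suppresses the parsed document; errors are still reported.
        xmllint --noout --schema "$schema" "$doc" || return 1
    done
}
```

The URL printed by schema_location_url is the one that must keep resolving to the immutable schema copy. For a DTD rather than an XML Schema, xmllint’s --dtdvalid option takes the place of --schema.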

DataONE developers:

  1. test the schema format via the examples, iterating with the content curator on any bug fixes.
  2. write an indexing parser and / or XSLT template.
  3. test the indexing parser and XSLT template (in the DEV environment).
  4. review test results with the content curator (show search results and metadata visualizations).
  5. deploy the indexing parser, XSLT templates, and new object format record to additional environments (STAGE and/or PRODUCTION). Currently the XSLT template is handed off to the ONEMercury maintainers.
  6. notify the content curator when the work is done.

The Content Curator can then start submitting metadata objects using the new format.

// TODO: who names the object format (gives the identifier?)

As part of Member Node deployment

Deployment-phase testing of Member Nodes requires all metadata formats used by the prospective Member Node to be registered, so that the processes under test (synchronization, indexing, ONEMercury presentation) can be run. Keeping in mind that DataONE will need to build, test, and deploy items to the Coordinating Nodes, format registration would ideally be started during the implementation / development phase of the Member Node on-boarding process. Specifically, the first item (the published schema) needs to be published and tested, and the object format registered to the target testing environment before the Member Node itself can be tested. Absent these things, synchronization will fail, and the indexing and ONEMercury tests cannot be run.

Typically, the indexing parser and XSLT template are deployed to the Coordinating Nodes of the DEV environment for testing by DataONE developers and then, if successful, deployed to the STAGE environment in preparation for registration of the prospective Member Node in that environment.

Member Node implementers should work out specific timings and placements with their primary DataONE contact to optimize their development cycles.

Notes:

What information is pulled from metadata into the search index:

http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html#values-extracted-from-science-metadata

current effort estimation:

  • 2 days dev, 2 days testing (sandbox, staging), 1 day for the release, 1 day ONEMercury upgrade.
  • new versions of existing formats require less development and result in quicker testing
  • what is the process for registering a data format?

Remaining issue

Because of the difficulty of re-synchronizing failed objects, a Member Node depends on DataONE to register the data format before it can even start entering data onto its node. This seems like a backwards dependency that puts DataONE resources on the critical path of external projects.

  1. Is there a more graceful way to handle this situation?