KNB Developer Tools

The KNB supports the DataONE REST API, and Java, Python, and R libraries for easily creating client tools.

DataONE REST API

A REST API for accessing and contributing data.

The KNB supports the DataONE REST API for automating the process of uploading, downloading, and searching for data on the KNB using scripting languages such as shell, R, MATLAB, and Python, among others. This guide is a brief synopsis of the DataONE API, which is documented more comprehensively in the DataONE Architecture Documentation (also see the development version of the architecture guide for future releases). Any software tool that supports the DataONE API (such as the rDataONE R package) can therefore interact seamlessly with KNB data. While DataONE maintains the full technical documentation on the API, here is a brief overview of commonly accessed services on the KNB.

Summary

DataONE distinguishes three classes of objects that it will store and manage: data objects, science metadata objects, and resource map documents. Each of these is uniquely identifiable by its persistent identifier (PID), and each has associated SystemMetadata that describes the object type, size, access rules, etc.

  • Data objects

    are treated as opaque blobs, and are retrievable via the get method given a persistent identifier (PID). Data objects can be represented in any format, but the repository encourages the use of non-proprietary, open formats such as CSV and netCDF.

  • Science metadata objects

    are metadata documents such as EML, FGDC, ISO19115, and so forth that provide metadata describing some data object(s). These are represented in XML according to their respective schema.

  • Resource Map objects

    are OAI-ORE documents that describe the aggregations of data and metadata into data packages. Individual data and metadata files can be uploaded to the repository, but to indicate that a set of files is part of an aggregated data package, you must provide an OAI-ORE resource map linking the objects.

    In addition to aggregation, Resource Maps can describe the origin of objects by asserting provenance relationships. These relationships will be displayed on the KNB.

All API access is over HTTPS, via the https://knb.ecoinformatics.org/knb/d1/mn/v2/ endpoint. The relative path prefix /v2/ indicates that we are currently using version 2 of the DataONE API.

The examples below show calls to the production KNB data repository REST endpoint (https://knb.ecoinformatics.org/knb/d1/mn/v2), but users should not create test data on the production KNB. Instead, please use a test Metacat server to explore the API and create test data (e.g., https://dev.nceas.ucsb.edu/knb/d1/mn/v2).

Quick Reference

URL                  Method  Example
/object/<pid>        GET     Get an Object
/object              POST    Create an Object
/object/<pid>        PUT     Update an Object
/archive/<pid>       PUT     Archive an Object
/meta/<pid>          GET     Get System Metadata for an Object
/generate            POST    Generate an Identifier
/query/solr/<query>  GET     Search the metadata index
/object              GET     List Objects
/object/<pid>        DELETE  Delete an Object

Request Format

  • GET, HEAD, and DELETE requests only pass parameters as part of the URL. The parameter values must be converted to UTF-8 and appropriately escaped for incorporating into the URL.
  • Message bodies (e.g., for POST and PUT requests) are encoded as MIME multipart/mixed messages (RFC 2046). All information for creating the new object or resource is transmitted in the message body. Two types of content are used in these messages: parameters and files. Parameters are used for all simple types (such as a String value). Files are used for all complex types (such as an XML structure) and for octet streams.

Response Format

Version 1.0 of the DataONE services only supports XML serialization, and this format MUST be used when communicating with the KNB. Request and response documents MUST also be encoded using UTF-8.

Authentication and Authorization

Two mechanisms are supported for authentication:

  • Authentication Tokens passed in the HTTP "Authorization" header
  • Client-side SSL certificates

Using Authentication Tokens. In this scenario, users sign in to the KNB and copy an authentication token from their profile page, which can then be included in HTTPS requests in the HTTP "Authorization" header.


The authentication token is a long base64-encoded string that encodes the user's credentials in a signed, verifiable JSON Web Token (JWT). Each language or tool has its own mechanism for setting HTTP headers. For example, with curl an authenticated request can be made using the -H command-line option:

    $ export TOKEN='eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJodHRwOlwvXC9vcmNp...'
    $ curl -H "Authorization: Bearer $TOKEN" https://knb.ecoinformatics.org/knb/d1/mn/v2/object

The token expires after 18 hours, so retrieve a new token before the current one expires.

The authentication token should be carefully protected just as you would an account password, as it gives the holder full rights to the account. Do not save tokens in code, and don't check them into version control systems, or otherwise make them available to other people.
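One way to follow this advice from a shell script is to keep the token in a file that only its owner can read and to load it at run time. The following is a minimal sketch under stated assumptions: the helper name read_token and the path ~/.knb_token are hypothetical and are not part of any KNB or DataONE tool.

```shell
#!/bin/sh
# read_token (hypothetical helper): print the token stored in a file,
# refusing files that group or others could read.
# The parenthesized body runs in a subshell, so its variables stay local.
read_token() (
  file=$1
  [ -r "$file" ] || { echo "token file not readable: $file" >&2; exit 1; }
  # GNU stat takes -c '%a'; BSD/macOS stat takes -f '%Lp'. Try both.
  perms=$(stat -c '%a' "$file" 2>/dev/null || stat -f '%Lp' "$file")
  case $perms in
    600|400) ;;
    *) echo "refusing $file with mode $perms; try: chmod 600 $file" >&2
       exit 1 ;;
  esac
  # Strip whitespace and newlines so the value is safe to put in a header.
  tr -d '[:space:]' < "$file"
)

# Intended use (commented out because it contacts the network):
# TOKEN=$(read_token "$HOME/.knb_token") &&
#   curl -H "Authorization: Bearer $TOKEN" "${ENDPOINT}/object"
```

Keeping the token outside the script means the script itself can be shared or committed without exposing credentials.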

Using client-side certificates over SSL. Users can log into CILogon to download a client certificate, which can then be included in requests as part of the SSL session with the host. The Subject of the provided certificate will be used by the KNB to determine all access control decisions for accessing, creating, updating, archiving, and deleting objects. If a client-side certificate is not provided, the user will be considered an anonymous public user and will only be able to access public content.

Each language or submission tool has a different mechanism for setting the client certificate in the SSL session. For example, with curl the certificate filename is passed on the command line:

    curl -X POST --cert /tmp/x509up_u502 ....

The version of curl shipped by Apple on Mac OS X 10.9 and later does not support providing PEM certificates via the command line. Instead, it uses certificates registered in the system keychain, as described on the curl mailing list. Calls to the KNB that require a certificate will therefore fail with the standard Mac version of curl. This can be fixed by replacing it with the MacPorts version of curl, or by using a certificate converted to PKCS12 format. A workaround for these issues is being explored, as the behavior differs in Mavericks and Yosemite.

Method Details and Examples

Get an Object

Each object on DataONE has a persistent identifier (PID), which can be used to get the bytes of the object. Note that PIDs must be escaped using URL escaping conventions if they contain characters that are normally reserved in URLs. For example, a DOI such as doi:10.5063/F1HT2M7Q is a PID that would need to be escaped to doi:10.5063%2FF1HT2M7Q when used in a URL.

    ENDPOINT="https://knb.ecoinformatics.org/knb/d1/mn/v2"
    curl -X GET \
      -H "Accept: text/xml" \
      "${ENDPOINT}/object/doi:10.5063%2FF1HT2M7Q"

If a certificate is not provided in the request, then the results will only include publicly accessible content. To view private content, be sure to include a valid X.509 certificate in the request (e.g., in curl, use the --cert argument to provide the path to a certificate that was previously downloaded from CILogon).
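The escaping step can be scripted rather than done by hand. The following is a minimal sketch, assuming ASCII-only PIDs: urlencode is a hypothetical helper (not part of curl or any DataONE tool), and the PID shown is made up for illustration. It leaves ':' unescaped, as in the DOI example, and percent-encodes everything else outside the unreserved set, including '/'.

```shell
#!/bin/sh
# urlencode (hypothetical helper): percent-encode a string for use as one
# URL path segment. Unreserved characters and ':' are left as-is; anything
# else, including '/', becomes %XX. Assumes ASCII input.
# The parenthesized body runs in a subshell, so its variables stay local.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character
    s=$rest
    case $c in
      [a-zA-Z0-9.~_:-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

ENDPOINT="https://knb.ecoinformatics.org/knb/d1/mn/v2"
pid='doi:10.1234/AB CD'   # hypothetical PID containing a slash and a space
echo "${ENDPOINT}/object/$(urlencode "$pid")"
```

For the hypothetical PID, the slash becomes %2F and the space becomes %20, so the printed URL ends in /object/doi:10.1234%2FAB%20CD and can be passed to curl in double quotes.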

Create an Object

An object can be inserted into the repository using the create API call, which involves POSTing the object to the object collection. Required parameters include the pid to be used for the object, the bytes of the object itself, and an XML SystemMetadata (sysmeta) document describing core metadata properties about the object, including who owns it, its format, etc.

    curl -X POST \
      --cert /tmp/x509up_u501 \
      -H "Charset: utf-8" \
      -H "Content-Type: multipart/mixed; boundary=----------4A2D135C-52CC-017FC-B269-B711ED211576_$" \
      -H "Accept: text/xml" \
      -F pid=urn:uuid:56eafcec-8b0a-11e3-a5e8-00334b2a1a0a \
      -F object=@mydatafile.csv \
      -F sysmeta=@mysystemmetadata.xml \
      "${ENDPOINT}/object"
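The sysmeta file has to agree with the object being uploaded (identifier, size, checksum), so it can help to generate it from the data file. The following sketch is an illustration, not a definitive template: make_sysmeta is a hypothetical helper, the element names and the v2.0 namespace reflect an assumed reading of the DataONE SystemMetadata schema and should be checked against the DataONE Architecture Documentation, and the values are inserted without XML escaping.

```shell
#!/bin/sh
# make_sysmeta (hypothetical helper): print a minimal SystemMetadata
# document for a file. Arguments: PID, file path, format ID, subject.
# Element names and namespace are assumptions; verify against the schema.
# The parenthesized body runs in a subshell, so its variables stay local.
make_sysmeta() (
  pid=$1 file=$2 format_id=$3 subject=$4
  # Size in bytes; tr removes the padding some wc implementations add.
  size=$(wc -c < "$file" | tr -d '[:space:]')
  # SHA-1 checksum via coreutils sha1sum, falling back to shasum.
  if command -v sha1sum >/dev/null 2>&1; then
    checksum=$(sha1sum "$file" | cut -d' ' -f1)
  else
    checksum=$(shasum -a 1 "$file" | cut -d' ' -f1)
  fi
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<d1:systemMetadata xmlns:d1="http://ns.dataone.org/service/types/v2.0">
  <serialVersion>1</serialVersion>
  <identifier>${pid}</identifier>
  <formatId>${format_id}</formatId>
  <size>${size}</size>
  <checksum algorithm="SHA-1">${checksum}</checksum>
  <submitter>${subject}</submitter>
  <rightsHolder>${subject}</rightsHolder>
  <accessPolicy>
    <allow>
      <subject>public</subject>
      <permission>read</permission>
    </allow>
  </accessPolicy>
</d1:systemMetadata>
EOF
)

# Intended use (SUBJECT is a placeholder for your own identity string):
# make_sysmeta "urn:uuid:56eafcec-8b0a-11e3-a5e8-00334b2a1a0a" \
#   mydatafile.csv text/csv "$SUBJECT" > mysystemmetadata.xml
```

The access policy in this sketch grants public read access; omit or change the allow rule if the object should stay private.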

Update an Object

An object can be updated in the repository using the update API call, which involves PUTting the object to the object collection. Required parameters include the newPid to be used for the object, the bytes of the object itself, and an XML SystemMetadata (sysmeta) document describing core metadata properties about the object, including who owns it, its format, etc. Note that this operation occurs against the original object by including its pid in the REST URL.

    curl -X PUT \
      --cert /tmp/x509up_u501 \
      -H "Charset: utf-8" \
      -H "Content-Type: multipart/mixed; boundary=----------4A2D135C-52CC-017FC-B269-B711ED211576_$" \
      -H "Accept: text/xml" \
      -F newPid=urn:uuid:21865616-8b0d-11e3-a31f-00334b2a1a0a \
      -F object=@mydatafile.csv \
      -F sysmeta=@mysystemmetadata.xml \
      "${ENDPOINT}/object/urn:uuid:56eafcec-8b0a-11e3-a5e8-00334b2a1a0a"

Archive an Object

An object can be archived, which moves it out of the search path so it won't be discovered but leaves it accessible to users who know its pid, so that citations remain viable. To archive an object, call the archive service using an HTTP PUT with the pid in the service endpoint.

    curl -X PUT \
      --cert /tmp/x509up_u501 \
      -H "Accept: text/xml" \
      "${ENDPOINT}/archive/urn:uuid:56eafcec-8b0a-11e3-a5e8-00334b2a1a0a"

Get System Metadata for an Object

Use the getSystemMetadata service to access the SystemMetadata for an object, which records critical information about each object in the repository, including its identifier, type, access control policies, and replication policies, as well as other details such as size and checksum.

    curl -X GET \
      -H "Accept: text/xml" \
      "${ENDPOINT}/meta/urn:uuid:56eafcec-8b0a-11e3-a5e8-00334b2a1a0a"

Generate an Identifier

Creating an object on the repository requires submitting it with a globally unique identifier, which can be generated by calling the generateIdentifier service. This service can generate UUIDs or DOIs, and potentially identifiers that follow other syntax conventions. The scheme parameter controls which type of identifier is generated. Generally, the use of UUIDs is encouraged for fine-grained identification of individual files within a data package, and the use of DOIs for the identifier of the metadata record for an overall data package.

    curl -X POST \
      --cert /tmp/x509up_u501 \
      -H "Accept: text/xml" \
      -F scheme=UUID \
      "${ENDPOINT}/generate"
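In a script, the generated identifier usually needs to be pulled out of the XML response before it can be passed as the pid for a create call. The following sketch assumes the response is a single identifier element on one line, optionally namespace-prefixed; that response shape is an assumption and should be checked against a real response. extract_identifier is a hypothetical helper name.

```shell
#!/bin/sh
# extract_identifier (hypothetical helper): read XML on stdin and print the
# text content of the first <identifier> element, with or without a
# namespace prefix. Assumes the element sits on a single line.
extract_identifier() {
  sed -n 's#.*<\([A-Za-z0-9_.-]*:\)\{0,1\}identifier[^>]*>\([^<]*\)<.*#\2#p' |
    head -n 1
}

# Assumed shape of a generateIdentifier response, for illustration only.
sample='<d1:identifier xmlns:d1="http://ns.dataone.org/service/types/v1">urn:uuid:21865616-8b0d-11e3-a31f-00334b2a1a0a</d1:identifier>'
printf '%s\n' "$sample" | extract_identifier

# Intended use (commented out because it contacts the network):
# NEW_PID=$(curl -s -X POST --cert /tmp/x509up_u501 -F scheme=UUID \
#   "${ENDPOINT}/generate" | extract_identifier)
```

A sed pattern is enough for a response this small; for anything more structured, an XML-aware tool would be a safer choice.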

Search the metadata index

To search across all of the metadata in the repository, use the query service to configure a SOLR query. The full SOLR syntax is supported, providing the means to create complex logical query conditions and to customize the metadata fields returned. Query results can be returned in XML and JSON formats. Paging through results is supported using the rows and start parameters. To search only the most recent version of the metadata, include the -obsoletedBy:* constraint in the SOLR query. Note that all SOLR queries must be properly URL-escaped and SOLR-escaped to be processed correctly (e.g., spaces in the SOLR query need to be escaped with a '+' or '%20', and colons in a SOLR query value need to be preceded by a backslash). In addition, to run these commands from curl, shell escapes will also need to be added as appropriate (e.g., by quoting strings).

    curl -X GET \
      -H "Accept: text/xml" \
      "${ENDPOINT}/query/solr/q=title:soil+AND+-obsoletedBy:*&fl=identifier,title,origin&rows=30&start=0&wt=xml"
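Building the query URL in a small function keeps the URL escaping in one place. The following is a minimal sketch, assuming ASCII input: urlencode_value and solr_url are hypothetical helpers, spaces are escaped as %20, and the characters ':', ',' and '*' are left unescaped to mirror the example URL. Any SOLR-level escaping (such as a backslash before a colon inside a value) is left to the caller.

```shell
#!/bin/sh
# urlencode_value (hypothetical helper): percent-encode a parameter value,
# leaving unreserved characters plus ':', ',' and '*' as-is. ASCII only.
# The parenthesized body runs in a subshell, so its variables stay local.
urlencode_value() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character
    s=$rest
    case $c in
      [a-zA-Z0-9.~_:,*-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

# solr_url (hypothetical helper): endpoint, query, field list,
# rows (default 30), start (default 0). Prints the full query URL.
solr_url() {
  printf '%s/query/solr/q=%s&fl=%s&rows=%s&start=%s&wt=xml\n' \
    "$1" "$(urlencode_value "$2")" "$(urlencode_value "$3")" \
    "${4:-30}" "${5:-0}"
}

ENDPOINT="https://knb.ecoinformatics.org/knb/d1/mn/v2"
solr_url "$ENDPOINT" 'title:soil AND -obsoletedBy:*' 'identifier,title,origin'
```

The printed URL can be passed to curl in double quotes; to fetch the next page, call solr_url again with a larger start value.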

The searchable SOLR fields that can be used to compose queries are accessible from the query service as well by accessing the endpoint without any query constraints.

    curl -X GET \
      -H "Accept: text/xml" \
      "${ENDPOINT}/query/solr"

Example: To retrieve the download/view counts of a particular object in the KNB, use this Solr query, replacing {OBJECT_PID} with the escaped PID of the object:

    curl -X GET \
      -H "Accept: text/xml" \
      "${ENDPOINT}/query/solr/q=id:{OBJECT_PID}&fl=read_count_i"

List Objects

The listObjects service provides a sequential list of objects on a node, and is minimally filterable. The query service generally contains more information and is preferred, but the object list can be useful to see recent activity on the repository.

    curl -X GET -H "Accept: text/xml" "${ENDPOINT}/object?start=0&count=100"
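Because the list is returned one slice at a time, walking the whole list means repeating the request with an increasing start value. The following sketch only computes the page URLs: page_urls is a hypothetical helper, and the total number of objects is assumed to be known already (for example, from a total attribute on the first response, which is an assumption to verify).

```shell
#!/bin/sh
# page_urls (hypothetical helper): print one listObjects URL per page.
# Arguments: endpoint, total number of objects, page size (default 100).
# The parenthesized body runs in a subshell, so its variables stay local.
page_urls() (
  endpoint=$1 total=$2 count=${3:-100}
  start=0
  while [ "$start" -lt "$total" ]; do
    printf '%s/object?start=%d&count=%d\n' "$endpoint" "$start" "$count"
    start=$(( start + count ))
  done
)

ENDPOINT="https://knb.ecoinformatics.org/knb/d1/mn/v2"
# With 250 objects and 100 per page this prints three URLs.
page_urls "$ENDPOINT" 250 100

# Intended use (commented out because it contacts the network):
# page_urls "$ENDPOINT" 250 100 | while read -r url; do
#   curl -s -H "Accept: text/xml" "$url"
# done
```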

Delete an Object

Delete is an administrative service that cannot be called by users. Contact an administrator for appropriate credentials. The delete service is provided to fully remove content from the repository, particularly when that content violates a law or ethical standard. When removing content for scientific reasons, archive is the proper method, as it preserves citable links while still hiding content from search.

    curl -X DELETE \
      --cert /tmp/x509up_u501 \
      -H "Accept: text/xml" \
      "${ENDPOINT}/object/urn:uuid:56eafcec-8b0a-11e3-a5e8-00334b2a1a0a"

DataONE Java Client Library

A helper library for calling the REST API using Java.

DataONE Python Client Library

A helper library for calling the REST API using Python.

DataONE R Package

An R package providing classes and methods for calling the API within R.

DataONE MATLAB Library

A MATLAB package providing classes and methods for calling the API within MATLAB.