Identifiers in DataONE
======================

Identifiers (PIDs, Persistent IDentifiers) are handles that uniquely identify
objects within the DataONE system.

* All data, metadata, and resource map objects in DataONE have a unique
  identifier.

* PIDs will always refer to the same set of bytes accessed through the DataONE
  API methods such as :func:`MNRead.get`.

* The location of content identified by a PID is determined by calling the
  :func:`CNCore.resolve` method.

* PIDs are persistent. Once content is registered with DataONE, the identifier
  for that content will remain in the DataONE system.

* PIDs are unique, and cannot be reused once assigned.

* PIDs are generally controlled by Member Nodes; however, their uniqueness and
  immutability are enforced primarily by the Coordinating Nodes.


Uniqueness
~~~~~~~~~~

Generation of identifiers in DataONE is largely under the control of the
Member Nodes (i.e. the data providers), with the requirement that an existing
identifier (i.e. one that is already registered in the DataONE system) cannot
be reused. This rule is enforced for new content by checking the uniqueness of
a proposed identifier in the :func:`MNStorage.create` method, and for existing
content by ignoring content with identifiers that are already in use.

The :func:`CNCore.reserveIdentifier` method may be used to reserve an
identifier, so that a client may, for example, compose a composite object
prior to committing the new content to storage on the Member Node. Similarly,
Tier 3 and above Member Nodes may support the
:func:`MNStorage.generateIdentifier` method, which will typically delegate to
a third party persistent identifier service such as EZID [1]_ to return an
identifier guaranteed to be unique within the DataONE system.


Authority
~~~~~~~~~

DataONE treats the original identifier (i.e. the first assignment of the
identifier to an object that becomes known to DataONE) as the authoritative
identifier for an object.
Although generally not encouraged, multiple identifiers may refer to a
particular object, and in such cases DataONE will attempt to use the original
identifier for all communications about the object.


Opacity
~~~~~~~

Identifiers used by Member Nodes can take many different forms, from
automatically generated sequential or random character strings to strings
that conform to schemes such as the LSID [2]_ and DOI [3]_ specifications.
DataONE does not directly use the implied functionality and services that
might be available for some of the identifier schemes. This is not to say that
mechanisms such as metadata retrieval for LSIDs are not used by any components
of the DataONE infrastructure, but rather that the DataONE infrastructure and
services have no functional dependency on such external services.

Identifiers are treated as opaque strings in the DataONE system, with no
meaning inferred from any structure or pattern that may be present in them.

The rules for identifier construction in DataONE are minimal and intended to
ensure the practical utility of identifiers. There is a set of characters that
cannot be used within an identifier string (non-printing and whitespace
characters), and a maximum number of characters that such a string may contain
(800 characters, #577). Leading and trailing whitespace is not allowed.


Immutability
~~~~~~~~~~~~

Once assigned and registered in the DataONE infrastructure, an identifier will
always refer to the same sequence of bytes. Generation of other
representations of objects may be supported by services (e.g. an image may be
transformed from TIFF to JPEG), but the identifier will always refer to the
original form.


Resolvability
~~~~~~~~~~~~~

A fundamental goal of DataONE is to ensure that any identifier used in the
system is resolvable, that is, DataONE provides a mechanism that will enable
the location of the object to be determined.
Resolution is handled by the Coordinating Nodes through the
:func:`CNCore.resolve` method, which returns a list of nodes from which the
object may be retrieved.

A guarantee of identifier resolvability is an important, core function of the
DataONE infrastructure upon which many other services may be constructed, both
within DataONE and by third party systems.


Granularity
~~~~~~~~~~~

Identifiers refer to managed objects in DataONE. Initially data, science
metadata documents, and resource maps have identifiers. The definition of
"data" is somewhat arbitrary though, and a single data object may be a single
record within some larger collection, or may refer to an entire set of records
contained within some package.


Structure
~~~~~~~~~

The characters that may appear in an identifier string acceptable to the
DataONE system are constrained by the XMLSchema definition
(:class:`Types.Identifier`), which is essentially a string of length greater
than zero but less than 800 characters with no whitespace (spaces, tabs,
non-printing characters, carriage returns, new lines). Identifiers may be
Unicode provided they conform to the fairly liberal restrictions imposed by
the XML specification [4]_.

Examples of valid identifiers in DataONE are shown in the section
*Serializing* below.


Serializing
~~~~~~~~~~~

When identifiers appear in text, the full identifier should be presented
unmodified. Identifiers appearing in URLs or other representations that have
reserved characters should be escaped according to the rules of the targeted
serialization format.
For example, the identifiers::

  10.1000/182
  urn:lsid:ubio.org:namebank:11815
  http://example.com/data/mydata?row=24
  ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
  ฉันกินกระจกได้
  Is_féidir_liom_ithe_gloine

would be serialized in DataONE :func:`MNRead.get` URLs (or any other URL path)
according to the RFC3986 [5]_ encoding guidelines for URI path segments::

  http://mn.example.com/mn/object/10.1000%2F182
  http://mn.example.com/mn/object/urn:lsid:ubio.org:namebank:11815
  http://mn.example.com/mn/object/http:%2F%2Fexample.com%2Fdata%2Fmydata%3Frow=24
  http://mn.example.com/mn/object/ldap:%2F%2Fldap1.example.net:6666%2Fo=University%2520of%2520Michigan,c=US%3F%3Fsub%3F(cn=Babs%2520Jensen)
  http://mn.example.com/mn/object/%E0%B8%89%E0%B8%B1%E0%B8%99%E0%B8%81%E0%B8%B4%E0%B8%99%E0%B8%81%E0%B8%A3%E0%B8%B0%E0%B8%88%E0%B8%81%E0%B9%84%E0%B8%94%E0%B9%89
  http://mn.example.com/mn/object/Is_f%C3%A9idir_liom_ithe_gloine

.. Note::
   The "+" (plus) character is a special case since it was once treated as a
   space character in URLs, and was changed in RFC3986 [5]_ such that the "+"
   would not be treated as a space. To minimize confusion when the plus
   character appears in an identifier, DataONE recommends that the character
   be percent escaped (``%2B``) when it appears in DataONE service URLs. All
   DataONE libraries and services operate in this manner.

The necessary encoding of URLs can usually be achieved through standard
libraries available in many languages, with the caveat that the encoding must
follow the RFC3986 encoding rules. Many packages over-escape, keeping only the
unreserved character set unescaped. For its client libraries, DataONE takes a
minimal escaping approach within the latitude RFC3986 allows. Specifically,
the client libraries use ``[pchar] - ['+']`` as the set of unescaped
characters for identifiers in path segments, and
``[pchar] - ['+', '&', '='] + ['/', '?']`` for identifiers in query segments
(a segment in both cases meaning the characters between delimiters).
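As a rough illustration only, the two unescaped character sets described above
could be expressed with the Python 3 standard library function
``urllib.parse.quote``. The names ``PATH_SAFE``, ``QUERY_SAFE``,
``encode_path_segment``, ``encode_query_segment`` and ``decode_segment`` are
hypothetical and are not part of any DataONE library; this is a sketch under
those assumptions, not the DataONE client implementation.

```python
from urllib.parse import quote, unquote

# RFC 3986 character classes. Letters and digits are always left
# unescaped by quote(), so only the punctuation is listed here.
UNRESERVED_PUNCT = "-._~"
SUB_DELIMS = "!$&'()*+,;="
PCHAR = UNRESERVED_PUNCT + SUB_DELIMS + ":@"

# Assumed safe sets, following the description in the text:
#   path segments : pchar minus '+'
#   query segments: pchar minus '+', '&', '=' plus '/', '?'
PATH_SAFE = PCHAR.replace("+", "")
QUERY_SAFE = PATH_SAFE.replace("&", "").replace("=", "") + "/?"


def encode_path_segment(identifier: str) -> str:
    """Percent encode an identifier for use as a URL path segment."""
    return quote(identifier, safe=PATH_SAFE, encoding="utf-8")


def encode_query_segment(identifier: str) -> str:
    """Percent encode an identifier for use as a URL query value."""
    return quote(identifier, safe=QUERY_SAFE, encoding="utf-8")


def decode_segment(encoded: str) -> str:
    """Reverse either encoding; unquote() leaves a literal '+' untouched."""
    return unquote(encoded, encoding="utf-8")


if __name__ == "__main__":
    for pid in ("10.1000/182", "a+b", "Is_féidir_liom_ithe_gloine"):
        print(encode_path_segment(pid), encode_query_segment(pid))
```

Because ``%`` is in neither safe set, an identifier that already contains a
percent sequence is escaped again (``%20`` becomes ``%2520``), so decoding
returns the original identifier unchanged.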
For example::

  example-location-dependent-__/__?__&__=__
  example-common-unescaped-;:@$-_.!*()',~

will be encoded in paths to::

  example-location-dependent-__%2F__%3F__&__=__
  example-common-unescaped-;:@$-_.!*()',~

and encoded in the query section to::

  example-location-dependent-__/__?__%26__%3D__
  example-common-unescaped-;:@$-_.!*()',~

Note that RFC3986 [5]_ treats the query section of the URI as a black box, so
'&' and '=' are unescaped (to be used as sub-delimiters). For the purpose of
encoding content, DataONE takes the approach of encoding at the segment level,
so those characters need to be escaped. Implementations that rely on standard
encoding routines should check how the chosen package treats these characters.

The following examples in Python and Java illustrate percent encoding of data
such as an identifier appropriate for appending to a URL. Each processes UTF-8
encoded input through *stdin* and outputs percent encoded or decoded
responses.

In Java pseudo-code the general process is as follows.

.. code-block:: java

  // pseudo-code: this will not compile!

  CharacterSet PATH_SAFE = RFC3986_PCHAR and not ['+'];
  CharacterSet QUERY_SAFE = PATH_SAFE and not ['&','='] or ['?','/'];

  String encodeUtf8_pathSegment(identifier) {
    String utf8ID = identifier.translate("UTF-8");
    return encodedID = percentEscape(utf8ID,PATH_SAFE);
  }

  String encodeUtf8_querySegment(identifier) {
    String utf8ID = identifier.translate("UTF-8");
    return encodedID = percentEscape(utf8ID,QUERY_SAFE);
  }

  String decodeString(string) {
    // older clients may encode spaces with '+'
    // so if we see them in the input, it is due to that
    // and we need to decode them, too.
    String correctedString = string.replace("+","%2B");
    return decodePercentEscaped(correctedString);
  }


.. code-block:: python

  # Written for Python 2 (urllib.quote, print statement, unicode()).
  import sys
  import codecs
  import urllib

  def pctEncode(data):
      '''Encode the unicode string data as utf-8 then percent encode that
      ready for appending as a path element to a URL.
      '''
      response = urllib.quote(data.encode("utf-8"), safe=":")
      return response

  def pctDecode(data):
      '''Decode a percent encoded string and return the unicode object,
      but first handle any mistaken '+' in the data string.
      '''
      data = data.replace("+","%2B")
      response = urllib.unquote(data)
      return response

  if __name__ == "__main__":
      '''Read utf-8 encoded input from stdin and percent encode or decode
      (with command line argument -d).

      e.g. given test_ids.txt, a UTF-8 encoded file with identifiers
      appearing one per line:

        cat test_ids.txt | python PctEncode.py | python PctEncode.py -d

      should output equivalent to:

        cat test_ids.txt
      '''
      doEncode = True
      try:
          if sys.argv[1] == "-d":
              doEncode = False
      except IndexError:
          # no command line argument supplied, so encode
          pass
      id = unicode(sys.stdin.readline(), "utf-8").strip()
      while len(id) > 0:
          if doEncode:
              print pctEncode(id)
          else:
              print pctDecode(id)
          id = unicode(sys.stdin.readline(), "utf-8").strip()


.. code-block:: java

  import java.io.*;
  import java.net.*;

  class PctEncode {
    /**
    Simple example of URL path encoding of UTF-8 strings for including as
    path elements in URLs as per RFC3986.

    e.g. given test_ids.txt, a UTF-8 encoded file with identifiers appearing
    one per line:

      cat test_ids.txt | java PctEncode | java PctEncode -d

    should output equivalent to:

      cat test_ids.txt
    */

    public static String pctDecode(String data) {
      /**
      Decode a percent encoded string, returning a Java Unicode string
      */
      String response = null;
      try {
        data = data.replace("+","%2B");
        response = URLDecoder.decode( data, "UTF-8");
      } catch (java.io.UnsupportedEncodingException e) {
        System.out.println("Error pctDecode : " + e.getMessage());
      }
      return response;
    }

    public static String pctEncodePathSegment(String data) {
      /**
      Encode a Java string according to the path encoding rules in RFC3986.
      Note that this does not encode properly for data that is to be the
      root of the path, it is assumed that the data will be appended to the
      end of a URL path.
      */
      String response = null;
      try {
        response = URLEncoder.encode( data, "UTF-8" );
        // fix outdated space-to-+ convention
        response = response.replace("+","%20");
        // now un-escape for minimally escaped result
        response = response.replace("%3A",":").replace("%28","(");
        response = response.replace("%3B",";").replace("%29",")");
        response = response.replace("%40","@").replace("%27","'");
        response = response.replace("%24","$").replace("%2C",",");
        response = response.replace("%21","!").replace("%7E","~");
      } catch (java.io.UnsupportedEncodingException e) {
        System.out.println("Error pctEncode: " + e.getMessage());
      }
      return response;
    }

    public static void main( String[] args ) {
      try {
        boolean doEncode = true;
        try {
          if (args[0].equals( "-d" ))
            doEncode = false;
        } catch(ArrayIndexOutOfBoundsException e) {
          // no command line argument supplied, so encode
        }
        PrintStream outs = new PrintStream( System.out, true, "UTF-8" );
        InputStreamReader isr = new InputStreamReader( System.in, "UTF-8" );
        BufferedReader reader = new BufferedReader( isr );
        String id = null;
        String data = null;
        while ( (id = reader.readLine()) != null ) {
          if (doEncode) {
            // call the encoder defined above
            data = pctEncodePathSegment( id );
          } else {
            data = pctDecode( id );
          }
          outs.println( data );
        }
      } catch(java.io.IOException e) {
        System.out.println("Error main: " + e.getMessage());
      }
    }
  }

Given this code and a UTF-8 encoded source file *test_ids.txt* such as::

  ö
  10.1000/182
  urn:lsid:ubio.org:namebank:11815
  http://example.com/data/mydata?row=24
  ldap://ldap1.example.net:6666/o=University%20of%20Michigan,%20c=US??sub?(cn=Babs%20Jensen)",
  ฉันกินกระจกได้
  Is_féidir_liom_ithe_gloine

the following commands should output the same as ``cat test_ids.txt``::

  cat test_ids.txt | java PctEncode | python PctEncode.py -d

  cat test_ids.txt | python PctEncode.py | java PctEncode -d


.. _guid: http://en.wikipedia.org/wiki/Globally_unique_identifier#Algorithm

.. _OGC WKT: http://en.wikipedia.org/wiki/Well-known_text

.. [1] http://n2t.net/ezid/

.. [2] http://lsids.sourceforge.net/

.. [3] http://www.doi.org/
.. [4] http://www.w3.org/TR/xml11/#charsets

.. [5] http://tools.ietf.org/html/rfc3986

.. OLD Notes follow, preserved here for now but likely to be removed

Suggested Strategy
------------------

1. DataONE supports all identifier schemes where the PID can be represented
   as a Unicode string (this should be any identifier).

2. The original identifier first assigned by a Member Node is the identifier
   promoted as the authoritative identifier for that content. Other
   identifiers that may be assigned by MNs that don't support the original
   scheme will be mapped to the original.

3. If the original MN discontinues participation in DataONE, then the
   identifier originally used remains as the authoritative identifier.

4. Any identifiers in use by the DataONE system can be resolved at any node
   (CN or MN). A caching system (e.g. memcached) should be used to improve
   resolution performance (can be primed with existing IDs).

This strategy will enable the use of any identifier that is represented by a
string, and will persist the original identifier for the object regardless of
what happens to the originating Member Node.

An obvious concern with this strategy is that a single object may have
multiple identifiers associated with it. Since the original identifier is
persisted, however, it will be the primary identifier by which that content
will be referenced, regardless of which node the object is located on.
..
  @startuml images/resolve.png
  title Resolve PID
  actor User
  participant "CRUD API" as m_crud << Member Node >>
  participant "Cache" as m_cache << Member Node >>
  participant "CRUD API" as cn_crud << Coordinating Node >>
  participant "Directory" as cn_dir << Coordinating Node >>
  User -> m_crud: resolve(token, "A5548D")
  m_crud -> m_cache: cache_lookup("A5548D")
  m_cache --> m_crud: FAIL
  m_crud -> cn_crud: resolve(token, "A5548D")
  cn_crud -> cn_dir: lookup("A5548D")
  cn_dir --> cn_crud: metadata
  cn_crud --> m_crud: metadata
  m_crud --> m_cache: addEntry("A5548D", metadata)
  m_crud --> User: metadata
  @enduml

.. image:: images/resolve.png

*Figure 1.* Resolving a PID. In this scenario a user is trying to determine
what the ID "A5548D" refers to, and uses the resolution service of a Member
Node to that effect.

..
  @startuml images/resolve-detail.png
  title Resolve PID Detail
  actor User
  participant "CRUD API" as m_crud << Member Node >>
  participant "Cache" as m_cache << Member Node >>
  participant "CRUD API" as cn_crud << Coordinating Node >>
  participant "Directory" as cn_dir << Coordinating Node >>
  participant "CRUD API" as m_crud2 << Member Node 2 >>
  User -> m_crud: get(token, "A5548D")
  m_crud -> m_cache: lookup("A5548D")
  note right
    Local resolve failed, defer to CN
  endnote
  m_cache --> m_crud: FAIL
  m_crud -> cn_crud: resolve(token, "A5548D")
  cn_crud -> cn_dir: lookup("A5548D")
  cn_dir --> cn_crud: metadata
  cn_crud --> m_crud: metadata
  m_crud --> m_cache: addEntry(GUID, metadata)
  m_crud -> m_crud: parseMetadata(metadata)
  note right
    Found data URL = http://mn2.dataone.org/objects/A5548D
  endnote
  m_crud --> User: HTTP 302: http://mn2.dataone.org/objects/A5548D
  note right
    Return a redirect to the MN 2 get object
    interface for the specified object.
  endnote
  User -> m_crud2: GET "http://mn2.dataone.org/objects/A5548D"
  m_crud2 --> User: bytes
  @enduml

.. image:: images/resolve-detail.png

*Figure 2.* Detail for object retrieval of an object identified by a PID.
In this case, the User is requesting a data object from MN 1, though the data
is actually located on MN 2.

..
  @startuml images/resolve-conflict.png
  title Conflicting IDs
  participant "MN_A" as mn_a
  participant "MN_B" as mn_b
  participant "CN" as cn
  participant "CN OStore" as cn_os
  mn_a -> cn: registerID("435")
  cn -> cn_os: store("mn_a:435")
  cn_os <-- cn: ACK
  mn_a <-- cn: ACK
  mn_b -> cn: registerID("435")
  cn -> cn_os: store("mn_b:435")
  cn_os <-- cn: ACK
  mn_b <-- cn: ACK
  actor user
  user -> cn: resolve("435")
  user <-- cn: "mn_a:435", "mn_b:435"
  @enduml

.. image:: images/resolve-conflict.png

*Figure 3.* A scenario where two MNs happen to add different content to the
system with the same identifier. Resolving the identifier without including
the namespace results in two matches that must be interpreted by the client.
The likelihood of such a scenario should be low, given that MNs should be
using identifier schemes that under ideal circumstances do not generate
duplicate identifiers.


Notes from the 20090602 Albuquerque Meeting
-------------------------------------------

These lightly edited notes were taken by Bruce Wilson of the group discussion
about identifiers during the VDC-TWG 20090602 Albuquerque Meeting. Original
notes are located in subversion at:
/documents/Projects/VDC/docs/20090602_04_ABQ_Meeting


Design Goals
~~~~~~~~~~~~

From the DataONE perspective, an identifier is opaque. DataONE does not attach
any meaning or resolution protocol based on the identifier.

A call to return the object associated with a particular identifier should
always return either the identical object or n/a if that object is no longer
available. This raises a number of implementation issues, noted below.
Particular issues include how to handle data which is regularly updated and
things like status changes.

A Member Node may use its own internal identification scheme, but must be able
to retrieve an object based on its DataONE globally unique identifier.
Member Nodes may generate their own unique identifiers, such as DOIs_,
Handles_, PURLs_, or UUIDs_. The only requirement is that the identifier is
unique across the space of DataONE. This implies that CNs must have
functionality to (a) check that an identifier is unique and (b) "reserve" or
stub out an identifier while the MN goes through the process of assembling the
package to submit the object into DataONE.

.. _DOIs: http://www.doi.org/
.. _Handles: http://www.handle.net/
.. _PURLs: http://purl.org/docs/index.html
.. _UUIDs: http://en.wikipedia.org/wiki/UUID

When an object is replicated from one MN to another MN, the receiving MN must
be able to accept and resolve the supplied DataONE identifier. That is, an
object must be retrievable by its DataONE identifier no matter where it is
located within the DataONE network. There was a lot of discussion on this
point, and this is my interpretation of the conclusion. I believe we came out
with the point that if a receiving Member Node assigns its own permanent
identifier, then that creates more confusion, requires the MN to register that
second ID with the CNs, and can cause confusion regarding the citation (for
example) of the piece of data. It also makes tracking things like metrics
harder, since the originating MN must then find out all other identifiers for
the data and search for all of those. And while it can be argued that nobody
"owns" the data, there is (currently) a culture and need for the original
archive to feel like it still can receive credit for that investment.

A system doesn't need to maintain every version, but it does need to be able
to identify every version.

Identifiers apply to metadata as well as data.


Questions for Further Consideration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a MN uses a DOI for a data set identifier, is it appropriate to include
doi: in the identifier? For example, 10.3334/ORNLDAAC/840 is the DOI for a
particular data set at the ORNL DAAC.
Both "doi:10.3334/ORNLDAAC/840" and "10.3334/ORNLDAAC/840" can be presumed to
be unique identifiers. Which should be used? BEW: My personal preference is to
use the one with the resolution protocol included. That does, however, make
the identifier more of a "smart" identifier, which is generally problematic.

Where an identifier has a mechanism to resolve to multiple locations (such as
is possible with an LSID and some DOI mechanisms) and that object is
replicated from one MN to another MN, this would suggest that the originating
MN needs to be notified of the additional location and has the option of
registering the new location with the handle registration authority. This also
means that if a replica is removed, the original MN should have the option of
being notified, so that the resolution points are updated. Ideally, this
should happen before the replica is removed (where possible), so that we
eliminate (or at least minimize) the amount of time that an invalid resolution
point is in someone else's system.

Where an identifier (such as a Handle) has a URL resolution, what should that
resolution be? ORNL DAAC DOIs resolve to a web page where a user (after
logging in) can see and download the components of the data set. Our opinion
is that the DOI resolving to a human interpretable description of the object
is more important than a machine interpretable resolution point. Some thought
and guidance on this point for the overall DataONE community of practice is
desirable.

Do we want/need a registry of name spaces? Where a MN uses a UUID (for
example), there may not be a way to describe the name space for identifiers,
unless the MN prefixes the UUID with some descriptor, which violates the
general admonition about smart identifiers.
It might, however, be helpful to have something like a set of regexps that
describe the name space for a MN's identifiers, particularly if an automated
way could be developed to look for potential collisions (non-null overlaps)
between name spaces. BEW: My thought is that this is far from an initial
feature, but the desirability of this as a possible future feature could have
implications on the way we do things from the start.

Can the metadata standards support multiple globally unique identifiers? For
example, what happens if a MN starts down the DOI path, then switches to
LSIDs because of economic costs, and goes back and assigns an LSID to
historical data sets? Those data sets now have both an LSID and a DOI. Where
is this in the metadata? Is there a mechanism for indicating the preferred ID
and the alternate IDs?

Likewise, how should things be handled when a MN decides to register an object
with e.g. GCMD and the namespace that GCMD allows for identifiers does not
allow for the MN's preferred identifier? Can a MN update the metadata to show
an alternate key with the GCMD identifier (data set is also known as)? What is
the implication for the metadata identifier in such a case? This is an update
operation to the metadata, which implies that the metadata identifier is
changed. How would one update the old metadata record to indicate (a) that it
is deprecated and (b) the id of the new metadata record?

The above also relates to the issue of establishing predecessor-successor
relationships between identifiers. How should this be done across the system?

How do versions enter into the identifier scheme? The general concept is that
different versions of an object have different identifiers. What about having
some type of identifier that aggregates all versions of an object and which
always points to the latest version of that object?

How does D1 know that an object is a new version of an existing object?
The update operation should take the old identifier and the new identifier.
That would allow for the tracking of updates. A Member Node may track
versions. We could create an interface specification for "latest version"
where the CN calls the authoritative MN for the DS and asks for the identifier
of the latest version of a particular identifier. This points back to the need
for what amounts to meta-metadata, where the metadata object can be updated to
indicate the status level of the data set (e.g. deprecated).

Where is the identifier for something like the World Ocean Data Base, which
gets updated quarterly? They think of the fundamental unit as an observation
point, which is either a location (e.g. buoy, possibly with different
identifiers for different depths) or a leg of a trip, with multiple
observations along a path.

For identifiers, we may need to specify the character space. What happens when
a MN stores unique identifiers in a database field that supports just ASCII,
but a different MN does its unique identifiers in some other character set?
PURL is a possible unique identifier, but we can get into cases now where URLs
have characters from other language character sets (such as Arabic, Kanji, Ö).

What happens when a request for a replicated version of a data set comes to
the replica MN and the data set has been updated and the originating MN has
not supplied the information about the update (e.g. they did an insert for the
new version)?

How do we assign IDs for a continuous data stream or for a subset calculated
on the fly? Does this mean that every request for a continuous data stream
gets its own data set identifier, which then gets stored in the D1 system
someplace? What is the value to the overall enterprise for storing the data
set identifiers for each request, particularly in the context of something
like a stream, where the on-the-fly processing is used to get a dynamic subset
or dynamic reprojection?
Examples of this sort of situation include the stream gauge data or the
Atmospheric Radiation Measurement (ARM) archive. Ameriflux Flux tower data is
a simpler case, in that they work on the basis of a site-year as a unit of
data. The World Oceanic DataBase (WODB), however, operates on a location (and
possibly depth) as a unit of data. Many of these are updated quarterly. Each
unit of data has an identifier, unique within WODB, and WODB publishes a data
stream that indicates what data packages were updated at what point in time.
It is possible to determine whether a particular data package changed between
two points in time. The differences are human interpretable, but it is not
possible (in any generally automated fashion) to recreate the data stream for
a particular data package at an arbitrary point in prior time.

Do the CNs need a method to determine the object type for an identifier?

Do identifiers need to be unique across all types of identified objects?