Warning: These documents are under active
development and subject to change (version 2.1.0-beta).
The latest release documents are at:
https://purl.dataone.org/architecture
Identifiers (PIDs, Persistent IDentifiers) are handles that uniquely identify objects within the DataONE system.
Objects are retrieved by identifier through the MNRead.get() method and
located through the CNCore.resolve() method.
Generation of identifiers in DataONE is largely under the control of the Member
Nodes (i.e. the data providers), with the requirement that an existing
identifier (i.e. one that is already registered in the DataONE system) cannot
be reused. This rule is enforced for new content by checking the uniqueness of a
proposed identifier in the MNStorage.create() method, and for existing
content by ignoring content with identifiers that are already in use. The
CNCore.reserveIdentifier() method may be used to reserve an identifier, so
that a client may, for example, assemble a composite object prior to committing
the new content to storage on the Member Node. Similarly, Member Nodes at Tier 3
and above may support the MNStorage.generateIdentifier() method, which will
typically delegate to a third party persistent identifier service such as
EZID [1] to return an identifier guaranteed to be unique within the DataONE
system.
DataONE treats the original identifier (i.e. the first assignment of the identifier to an object that becomes known to DataONE) as the authoritative identifier for an object. Although generally not encouraged, multiple identifiers may refer to a particular object and in such cases, DataONE will attempt to utilize the original identifier for all communications about the object.
Identifiers utilized by Member Nodes can take many different forms, from automatically generated sequential or random character strings to strings that conform to schemes such as the LSID [2] and DOI [3] specifications. DataONE does not directly utilize implied functionality and services that might be available for some of the identifier schemes. This is not to say that mechanisms such as metadata retrieval for LSIDs are not used by any components of the DataONE infrastructure, but rather that the DataONE infrastructure and services have no functional dependency on such external services.
Identifiers are treated as opaque strings in the DataONE system, with no meaning inferred from any structure or pattern that may be present in them. The rules for identifier construction in DataONE are minimal and intended to ensure practical utility of identifiers. Non-printing and whitespace characters cannot be used within an identifier string, and the string may contain at most 800 characters (#577). Leading and trailing whitespace is not allowed.
Once assigned and registered in the DataONE infrastructure, an identifier will always refer to the same sequence of bytes. Generation of other representations of objects may be supported by services (e.g. an image may be transformed from TIFF to JPEG), but the identifier will always refer to the original form.
A fundamental goal of DataONE is to ensure that any identifier utilized in the
system is resolvable, that is, DataONE provides a mechanism that will enable the
location of the object to be determined. Resolution is handled by the
Coordinating Nodes through the CNCore.resolve() method, which returns a list of
nodes from which the object may be retrieved.
A guarantee of identifier resolvability is an important, core function of the DataONE infrastructure upon which many other services may be constructed, both within DataONE and by third party systems.
Identifiers refer to managed objects in DataONE. Initially data, science metadata documents, and resource maps have identifiers. The definition of “data” is somewhat arbitrary though, and a single data object may be a single record within some larger collection, or may refer to an entire set of records contained within some package.
The characters that may appear in an identifier string acceptable to the
DataONE system are constrained by the XML Schema definition
(Types.Identifier), which is essentially a string of at least one and no more
than 800 characters containing no whitespace (spaces, tabs, carriage returns,
new lines) and no non-printing characters. Identifiers may be
Unicode provided they conform to the fairly liberal restrictions imposed by
the XML specification [4]. Examples of valid identifiers in DataONE are shown
in the section Serializing below.
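The constraints above can be sketched as a small check in Python. This is an illustration only, not the DataONE implementation: the names is_valid_pid and MAX_PID_LENGTH are hypothetical, 800 is assumed to be the inclusive maximum, and length is assumed to be counted in Unicode characters rather than bytes.

```python
import re

# Assumption: 800 is the inclusive maximum, counted in Unicode characters.
MAX_PID_LENGTH = 800

# Matches any Unicode whitespace (spaces, tabs, carriage returns, new lines).
_WHITESPACE = re.compile(r"\s")


def is_valid_pid(pid: str) -> bool:
    """Return True if pid is non-empty, within the length limit, and free
    of whitespace and non-printing characters."""
    if not 0 < len(pid) <= MAX_PID_LENGTH:
        return False
    if _WHITESPACE.search(pid):
        return False
    # str.isprintable() rejects control and other non-printing characters.
    return pid.isprintable()


if __name__ == "__main__":
    for candidate in ["10.1000/182", "ฉันกินกระจกได้", "has space", ""]:
        print(repr(candidate), is_valid_pid(candidate))
```

The check deliberately ignores the structure of the identifier, in keeping with the treatment of identifiers as opaque strings.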
When identifiers appear in text, the full identifier should be presented unmodified.
Identifiers appearing in URLs or other representations that have reserved characters should be escaped according to the rules of the targeted serialization format. For example, the identifiers:
10.1000/182
urn:lsid:ubio.org:namebank:11815
http://example.com/data/mydata?row=24
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
ฉันกินกระจกได้
Is_féidir_liom_ithe_gloine
would be serialized in DataONE MNRead.get() URLs (or any other URL path)
according to the RFC3986 encoding guidelines for URI path segments:
http://mn.example.com/mn/object/10.1000%2F182
http://mn.example.com/mn/object/urn:lsid:ubio.org:namebank:11815
http://mn.example.com/mn/object/http:%2F%2Fexample.com%2Fdata%2Fmydata%3Frow=24
http://mn.example.com/mn/object/ldap:%2F%2Fldap1.example.net:6666%2Fo=University%2520of%2520Michigan,c=US%3F%3Fsub%3F(cn=Babs%2520Jensen)
http://mn.example.com/mn/object/%E0%B8%89%E0%B8%B1%E0%B8%99%E0%B8%81%E0%B8%B4%E0%B8%99%E0%B8%81%E0%B8%A3%E0%B8%B0%E0%B8%88%E0%B8%81%E0%B9%84%E0%B8%94%E0%B9%89
http://mn.example.com/mn/object/Is_f%C3%A9idir_liom_ithe_gloine
Note
The “+” (plus) character is a special case since it was once treated as a
space character in URLs, and was changed in RFC3986 [5] such that the “+”
would not be treated as a space. To minimize confusion when the plus
character appears in an identifier, DataONE recommends that the character
is percent escaped (%2B) when it appears in DataONE service URLs. All
DataONE libraries and services operate in this manner.
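The effect of this recommendation can be illustrated with Python's standard urllib.parse module, which offers both an RFC3986-style decoder (unquote) and a form-style decoder that reads "+" as a space (unquote_plus). The function name escape_plus_demo and the identifier "cruise+leg2" are made up for this sketch.

```python
from urllib.parse import quote, unquote, unquote_plus


def escape_plus_demo(pid: str) -> dict:
    """Show how a raw and an escaped '+' fare under two common decoders."""
    escaped = quote(pid, safe="")  # '+' becomes %2B
    return {
        "escaped": escaped,
        "raw_rfc3986": unquote(pid),             # raw '+' is kept
        "raw_form_style": unquote_plus(pid),     # raw '+' is read as a space
        "escaped_rfc3986": unquote(escaped),
        "escaped_form_style": unquote_plus(escaped),
    }


if __name__ == "__main__":
    for name, value in escape_plus_demo("cruise+leg2").items():
        print(name, value)
```

Only the escaped form decodes to the original identifier under both decoders, which is why escaping the plus character avoids ambiguity.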
The necessary encoding of URLs can usually be achieved through standard libraries available in many languages, with the caveat that the encoding must follow the RFC3986 encoding rules. Many packages over-escape, leaving only the unreserved character set unescaped. For its client libraries, DataONE takes a minimal escaping approach within the latitude RFC3986 allows. Specifically, it uses [pchar] - [‘+’] as the set of unescaped characters for identifiers in path segments, and [pchar] - [‘+’, ‘&’, ‘=’] + [‘/’, ‘?’] for identifiers in query segments (segments in both cases meaning the characters between delimiters). For example:
example-location-dependent-__/__?__&__=__
example-common-unescaped-;:@$-_.!*()',~
will be encoded in paths to:
example-location-dependent-__%2F__%3F__&__=__
example-common-unescaped-;:@$-_.!*()',~
and encoded in the query section to:
example-location-dependent-__/__?__%26__%3D__
example-common-unescaped-;:@$-_.!*()',~
Note that RFC3986 [5] treats the query section of the URI as a black box, so ‘&’ and ‘=’ are left unescaped there (to be used as sub-delimiters). For the purpose of encoding content, we encode at the segment level, so those characters need to be escaped. Implementations that rely on standard encoding routines should check how the chosen package treats these characters.
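As a rough sketch of the two character sets described above, the following Python uses urllib.parse.quote with an explicit safe set for each context. The constant and function names are hypothetical and this is not the DataONE client library code; it only reproduces the path and query examples given in this section.

```python
from urllib.parse import quote, unquote

# RFC3986 pchar = unreserved / pct-encoded / sub-delims / ":" / "@".
# quote() never escapes letters, digits, "-", "." or "_", so only the
# remaining punctuation has to be listed as safe.
PATH_SAFE = "~!$&'()*,;=:@"    # pchar minus "+"
QUERY_SAFE = "~!$'()*,;:@/?"   # PATH_SAFE minus "&" and "=", plus "/" and "?"


def encode_path_segment(pid: str) -> str:
    """Percent encode pid (as UTF-8) for use as a URL path segment."""
    return quote(pid, safe=PATH_SAFE)


def encode_query_segment(pid: str) -> str:
    """Percent encode pid (as UTF-8) for use as a query key or value."""
    return quote(pid, safe=QUERY_SAFE)


def decode_segment(text: str) -> str:
    """Decode a percent encoded segment, keeping any raw '+' as a plus."""
    return unquote(text.replace("+", "%2B"))


if __name__ == "__main__":
    pid = "http://example.com/data/mydata?row=24"
    print("http://mn.example.com/mn/object/" + encode_path_segment(pid))
    print(encode_query_segment(pid))
```

Listing the safe characters explicitly, rather than relying on a library default, keeps the result independent of how a particular package chooses to over-escape.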
The following examples in Python and Java illustrate percent encoding of data, such as an identifier, for appending to a URL. Each reads UTF-8 encoded input from stdin and writes percent encoded or decoded output. In Java-like pseudo-code, the general process is as follows.
// pseudo-code: this will not compile!
CharacterSet PATH_SAFE = RFC3986_PCHAR and not ['+'];
CharacterSet QUERY_SAFE = PATH_SAFE and not ['&','='] or ['?','/'];

String encodeUtf8_pathSegment(identifier) {
    String utf8ID = identifier.translate("UTF-8");
    return percentEscape(utf8ID, PATH_SAFE);
}

String encodeUtf8_querySegment(identifier) {
    String utf8ID = identifier.translate("UTF-8");
    return percentEscape(utf8ID, QUERY_SAFE);
}

String decodeString(string) {
    // A literal '+' should have been escaped as %2B. Some decoders still
    // read a raw '+' as a space, so escape any '+' found in the input
    // before decoding to make sure it survives as a plus.
    String correctedString = string.replace("+", "%2B");
    return decodePercentEscaped(correctedString);
}
# PctEncode.py (Python 3)
import io
import sys
import urllib.parse


def pctEncode(data):
    '''Encode the unicode string data as utf-8 then percent encode it,
    ready for appending as a path element to a URL.
    '''
    return urllib.parse.quote(data.encode("utf-8"), safe=":")


def pctDecode(data):
    '''Decode a percent encoded string and return the unicode string,
    but first handle any mistaken '+' in the data string.
    '''
    data = data.replace("+", "%2B")
    return urllib.parse.unquote(data, encoding="utf-8")


if __name__ == "__main__":
    # Read utf-8 encoded input from stdin and percent encode or
    # decode (with command line argument -d).
    # e.g. given test_ids.txt, a UTF-8 encoded file with identifiers
    # appearing one per line:
    #   cat test_ids.txt | python PctEncode.py | python PctEncode.py -d
    # should output equivalent to:
    #   cat test_ids.txt
    doEncode = not (len(sys.argv) > 1 and sys.argv[1] == "-d")
    # Force UTF-8 on both streams regardless of the platform locale.
    stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
    stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    for line in stdin:
        pid = line.strip()
        if doEncode:
            stdout.write(pctEncode(pid) + "\n")
        else:
            stdout.write(pctDecode(pid) + "\n")
    stdout.flush()
import java.io.*;
import java.net.*;

/**
 * Simple example of URL path encoding of UTF-8 strings for including as
 * path elements in URLs as per RFC3986.
 *
 * e.g. given test_ids.txt, a UTF-8 encoded file with identifiers
 * appearing one per line:
 *
 *   cat test_ids.txt | java PctEncode | java PctEncode -d
 *
 * should output equivalent to:
 *
 *   cat test_ids.txt
 */
class PctEncode
{
    /**
     * Decode a percent encoded string, returning a Java Unicode string.
     */
    public static String pctDecode(String data) {
        String response = null;
        try {
            // URLDecoder reads '+' as a space; escape it so it stays a plus
            data = data.replace("+", "%2B");
            response = URLDecoder.decode( data, "UTF-8" );
        } catch (java.io.UnsupportedEncodingException e) {
            System.err.println("Error pctDecode: " + e.getMessage());
        }
        return response;
    }

    /**
     * Encode a Java string according to the path encoding rules in
     * RFC3986. Note that this does not encode properly for data that
     * is to be the root of the path; it is assumed that the data will
     * be appended to the end of a URL path.
     */
    public static String pctEncodePathSegment(String data) {
        String response = null;
        try {
            response = URLEncoder.encode( data, "UTF-8" );
            // fix outdated space-to-+ convention
            response = response.replace("+", "%20");
            // now un-escape for minimally escaped result
            response = response.replace("%3A", ":").replace("%28", "(");
            response = response.replace("%3B", ";").replace("%29", ")");
            response = response.replace("%40", "@").replace("%27", "'");
            response = response.replace("%24", "$").replace("%2C", ",");
            response = response.replace("%21", "!").replace("%7E", "~");
            // '&' and '=' are also left unescaped in path segments
            response = response.replace("%26", "&").replace("%3D", "=");
        } catch (java.io.UnsupportedEncodingException e) {
            System.err.println("Error pctEncode: " + e.getMessage());
        }
        return response;
    }

    public static void main( String[] args ) {
        try {
            // decode when called with -d, otherwise encode
            boolean doEncode = !(args.length > 0 && args[0].equals( "-d" ));
            PrintStream outs = new PrintStream( System.out, true, "UTF-8" );
            InputStreamReader isr = new InputStreamReader( System.in, "UTF-8" );
            BufferedReader reader = new BufferedReader( isr );
            String id = null;
            String data = null;
            while ( (id = reader.readLine()) != null ) {
                if (doEncode) {
                    data = pctEncodePathSegment( id );
                } else {
                    data = pctDecode( id );
                }
                outs.println( data );
            }
        } catch(java.io.IOException e) {
            System.err.println("Error main: " + e.getMessage());
        }
    }
}
Given this code and a utf-8 encoded source file test_ids.txt such as:
ö
10.1000/182
urn:lsid:ubio.org:namebank:11815
http://example.com/data/mydata?row=24
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,%20c=US??sub?(cn=Babs%20Jensen)
ฉันกินกระจกได้
Is_féidir_liom_ithe_gloine
The following commands should output the same as cat test_ids.txt:
cat test_ids.txt | java PctEncode | python PctEncode.py -d
cat test_ids.txt | python PctEncode.py | java PctEncode -d
[1] http://n2t.net/ezid/
[2] http://lsids.sourceforge.net/
[3] http://www.doi.org/
[4] http://www.w3.org/TR/xml11/#charsets
[5] http://tools.ietf.org/html/rfc3986