Warning: These documents are under active
development and subject to change (version 2.1.0-beta).
The latest release documents are at:
https://purl.dataone.org/architecture
To support the goals of preservation and scientific reproducibility, all registered objects in DataONE are considered immutable, with each object representing a published snapshot of data or metadata associated with a specific time. DataONE manages the registration, indexing, and replication of these snapshots throughout the DataONE network of Member Nodes. Upon this foundation, DataONE can guarantee that the exact byte array returned through the DataONE Read APIs (MNRead and CNRead) is the one submitted and registered.
Any repository that provides unique identifiers for snapshots (or revisions) can participate as a DataONE Member Node, irrespective of whether it retains past snapshots. This is accomplished by the use of two identifiers, one representing the revision, and the other representing the changing (or mutable) entity. For those Member Nodes only managing the mutable entity, as long as a unique revision-level identifier is generated upon each update to the entity, DataONE will not reject the update. In situations where the rate of change is faster than DataONE’s Member Node synchronization, it is possible that some snapshots will fail to be registered. However, since that revision’s unique identifier is never indexed or otherwise made available, the chance of needing to retrieve that snapshot in the future (and not finding it) is very small.
The two identifiers are known as:
PID: held in the systemMetadata.identifier field. This identifier represents the snapshot or revision DataONE replicates among the DataONE federation.
SID: held in the systemMetadata.seriesId field. This identifier represents the mutable content, and resolves to the latest revision among all registered revisions when used in the DataONE Read APIs.
DataONE relies on content originators to generate the identifiers for each snapshot being registered (the series identifier is optional), and to determine which field will hold the “citable” identifier.
DataONE considers any change that results in a different byte array of content to be a new snapshot, and thus a new object to be registered. Subtle changes, such as whitespace differences, although potentially meaningless, therefore constitute a new object. If changed content is not identified with a new PID, the content held on that Member Node is invalid. Member Nodes that periodically regenerate their stored content or manipulate it upon retrieval will need to take extra care to validate checksums after regeneration or manipulation and resolve any discrepancies in content they may encounter.
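The checksum validation described above can be sketched as follows. This is a minimal illustration, assuming the registered checksum is a SHA-256 hex digest; in practice the algorithm is whichever one was recorded in the system metadata. The function names are hypothetical and not part of any DataONE library.

```python
import hashlib


def checksum(content: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the exact byte array."""
    return hashlib.new(algorithm, content).hexdigest()


def is_same_snapshot(content: bytes, registered_checksum: str,
                     algorithm: str = "sha256") -> bool:
    """True only if the bytes match the checksum registered for the PID."""
    return checksum(content, algorithm) == registered_checksum.lower()


# Hypothetical content: the regenerated copy differs by one trailing space.
original = b"site,temp\nA,12.5\n"
registered = checksum(original)
regenerated = b"site,temp\nA,12.5 \n"

if not is_same_snapshot(regenerated, registered):
    print("Content changed: register a new PID or restore the original bytes")
```

Even a whitespace-only difference fails the comparison, which is the behavior the paragraph above calls for.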
The SID is provided expressly to group the snapshots of a single entity that is stable over time. It was not intended to represent highly volatile entities, or those that significantly “drift” over time. Member Nodes are encouraged to instead use registered services for volatile content, and create new entities when significant change in the scope of an entity occurs.
It is not clear that individual contributors will always have the means to register a service, and so they may have entities organized for input, such as a file that accumulates observations. If items such as these are registered as objects, the rightsHolders should be mindful to apply reasonable temporal bounds.
When the scope of an item has changed significantly, it is permissible to supply a new seriesId to the next snapshot while still relating the two items by obsoletes and obsoletedBy. However, it is not necessary, and may be more straightforward to simply save the new entity without relating the snapshots through system metadata, but through provenance mechanisms instead.
DataONE anticipates data consumers using other contributors’ data will prefer to cite using the PID, for the certainty that provides, and will prefer using the SID when citing indirectly (when citing metadata). Similarly, we anticipate content originators will wish to promote one identifier for citation. For those content providers using both identifiers, it is recommended to assign the preferred identifier according to anticipated data consumer preference.
To aid in aggregating download statistics for data providers, DataONE provides the cn/v2/query/logsolr query endpoint. Both the PID and SID fields are included in the index records to allow straightforward retrieval of download statistics by either field.
All DataONE APIs accepting an Identifier must treat PIDs as requests for the exact snapshot, and SIDs as requests for the latest snapshot that the DataONE Node has knowledge of. Due to the distributed nature of snapshot replication, it is possible that a replica Member Node does not know about the latest snapshot, in which case a request by SID to that node should return a previous snapshot. For all nodes, even the authoritative Member Node, a request by PID for a snapshot that it doesn’t host must return a NotFound exception.
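The node-level rule above can be sketched with a small in-memory model. This is an illustration only: the Node and Snapshot classes are hypothetical, NotFound is a stand-in for the DataONE exception, and "latest" is simplified here to the greatest dateUploaded among the snapshots the node holds.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


class NotFound(Exception):
    """Stand-in for the DataONE NotFound exception."""


@dataclass(frozen=True)
class Snapshot:
    pid: str
    series_id: Optional[str]
    date_uploaded: datetime
    content: bytes


class Node:
    """Hypothetical node that only knows about the snapshots it hosts."""

    def __init__(self, snapshots: Iterable[Snapshot]):
        self._by_pid = {s.pid: s for s in snapshots}

    def get(self, identifier: str) -> bytes:
        # A PID is a request for the exact snapshot.
        if identifier in self._by_pid:
            return self._by_pid[identifier].content
        # A SID is a request for the latest snapshot this node knows about.
        series = [s for s in self._by_pid.values()
                  if s.series_id == identifier]
        if not series:
            raise NotFound(identifier)
        return max(series, key=lambda s: s.date_uploaded).content


p1 = Snapshot("P1", "S", datetime(2020, 1, 1), b"v1")
p2 = Snapshot("P2", "S", datetime(2020, 2, 1), b"v2")
replica = Node([p1])        # never received the newer snapshot
authoritative = Node([p2])  # holds only the current snapshot
```

Under these assumptions, a request by SID to the replica yields the older snapshot, while a request by PID for a snapshot the node does not host raises NotFound.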
End users should therefore rely on v2.cn.resolve for object retrieval. If the Identifier used for the resolve is a SID, this method will return an ObjectLocationList for the latest known snapshot.
The primary way of determining the head of a series is via the obsoletedBy field in the system metadata, which links to the next snapshot in the chain. With all of the snapshots synchronized, there should be only one snapshot in the series that is not obsoleted, and that is the head. With incomplete synchronization, there may be more than one snapshot that is not obsoleted, and in these cases the one with the latest dateUploaded value will be chosen as the head.
Use of the obsoletes and obsoletedBy fields as the primary indicator for the head of the series is preferred because it is a direct reflection of the rightsHolder’s intentions, whereas the dateUploaded value is only a reflection of the order in which the Member Node processes uploaded content.
Member Nodes that manage by mutable entity (that is, they don’t preserve prior snapshots) should still populate the obsoletes and obsoletedBy fields. Replica nodes and the DataONE Coordinating Nodes can use these fields to optimize queries for finding the head of the series.
Question: should mutable Member Nodes keep systemMetadata documents for snapshots they no longer have? It would allow the obsoletedBy fields to be synchronized, but would it conflict with the behavior of deleted items, and would it matter?
To illustrate by way of example, author A uploads an item with an identifier S to Member Node M, using M’s primary API rather than the DataONE API. M builds a systemMetadata document for S, generating a PID, P1, to uniquely identify the initial snapshot, assigns an upload date of D1, and uses the identifier S for the seriesId. DataONE synchronizes the object, and replicates the snapshot P1 to one other Member Node, R1. The author, A, then saves changes to the item, whereupon M generates another PID, P2, to uniquely identify this newer snapshot, uses S in the seriesId field, puts P1 in the obsoletes field, and D2 in the dateUploaded field. (Size and checksum are also calculated for the new snapshot.) This is synchronized and replicated to a different Member Node, R2.
When v2.cn.resolve(S) is called, an ObjectLocationList for P2 is returned, listing Member Nodes M and R2 as locations for retrieval. A call to M.get(P2) or R2.get(P2) will return the latest snapshot, as will the same call using S as the identifier instead. However, a call to R1.get(P2) will return NotFound, because it was not a replication target for that snapshot, and a call to R1.get(S) will return the initial snapshot, because it has snapshot P1 with the associated seriesId S.
Notice, too, that v2.cn.resolve(P1) will return an ObjectLocationList containing both M and R1, although retrieval from M is no longer possible, since M doesn’t preserve past snapshots. M.get(P1) should return a NotFound, and the client will move on to R1, and be able to retrieve P1 with R1.get(P1).
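The client behavior just described, trying each location in turn and moving on when a node returns NotFound, might look like the sketch below. The retrieve function, the fetch callback, and the in-memory holdings are all hypothetical stand-ins; a real client would take the locations from the ObjectLocationList and call each node's get method.

```python
from typing import Callable, Dict, Iterable, List


class NotFound(Exception):
    """Stand-in for the DataONE NotFound exception."""


def retrieve(pid: str, locations: Iterable[str],
             fetch: Callable[[str, str], bytes]) -> bytes:
    """Try each listed node in order; skip nodes that no longer hold the PID."""
    for node_id in locations:
        try:
            return fetch(node_id, pid)
        except NotFound:
            continue
    raise NotFound(pid)


# Hypothetical holdings: M keeps only the current snapshot, R1 holds P1.
holdings: Dict[str, Dict[str, bytes]] = {
    "M": {"P2": b"v2"},
    "R1": {"P1": b"v1"},
}
attempts: List[str] = []


def fake_fetch(node_id: str, pid: str) -> bytes:
    attempts.append(node_id)
    try:
        return holdings[node_id][pid]
    except KeyError:
        raise NotFound(pid) from None


data = retrieve("P1", ["M", "R1"], fake_fetch)
```

Here the first attempt against M fails and the second against R1 succeeds, mirroring the example in the text.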
When v2.cn.resolve(S) was called, the CN determined the head of the series by first finding all of the snapshots where the seriesId is S and obsoletedBy is either null or refers to an object with a different seriesId. In this case, since the P1 systemMetadata is never updated to fill in the obsoletedBy field, the algorithm will get both P1 and P2. It will then notice that P2 has the later date of D2, so will choose P2 as the head of the series.
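The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the Coordinating Node implementation: SysMeta and head_of_series are hypothetical names, the index is a plain dictionary keyed by PID, and a record whose obsoletedBy points to a snapshot the index does not know about is kept as a candidate, a case the text does not address.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass(frozen=True)
class SysMeta:
    pid: str
    series_id: Optional[str]
    date_uploaded: datetime
    obsoleted_by: Optional[str] = None


def head_of_series(sid: str, records: Dict[str, SysMeta]) -> Optional[SysMeta]:
    """Pick the head: non-obsoleted within the series, then latest upload."""
    candidates = []
    for rec in records.values():
        if rec.series_id != sid:
            continue
        successor = records.get(rec.obsoleted_by) if rec.obsoleted_by else None
        # Assumption: an unknown successor cannot be checked, so the
        # record stays a candidate and the date tie-break decides.
        obsoleted_in_series = (successor is not None
                               and successor.series_id == sid)
        if not obsoleted_in_series:
            candidates.append(rec)
    return max(candidates, key=lambda r: r.date_uploaded, default=None)


# P1 was never updated with obsoletedBy, so both records are candidates.
index = {
    "P1": SysMeta("P1", "S", datetime(2020, 1, 1)),
    "P2": SysMeta("P2", "S", datetime(2020, 1, 2)),
}
head = head_of_series("S", index)
```

With both P1 and P2 left as candidates, the later dateUploaded value makes P2 the head, matching the walkthrough.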
Suppose now that A spawns two more snapshots in quick succession, and DataONE synchronizes afterwards. It misses P3(S) but picks up P4(S). cn.resolve(S) will return an ObjectLocationList for the P4 snapshot, since it is the latest of all non-obsoleted snapshots.
Later, A makes some changes and realizes that the content is significantly different from previous revisions, so renames it S2. The system treats it as related, so links to the P4 snapshot with the obsoletes field. P5(S2) is now hosted on M, but P4(S) is gone. cn.resolve(S) will return an ObjectLocationList for P4, but M.get(P4) will return NotFound, and the client will have to retrieve from a replica Member Node, if possible. cn.resolve(S2) will return an ObjectLocationList for P5. Note also that M.get(S) will not be able to resolve the SID to any PID, since M doesn’t host any of the snapshots of S.
cn.hasReservation(SID)