Warning: These documents are under active
development and subject to change (version 2.1.0-beta).
The latest release documents are at:
https://purl.dataone.org/architecture
All content synchronized by DataONE is immutable, and so resolution of a persistent identifier (PID) will always result in a pointer (URI) to a set of bytes that are in all respects identical to the original. Version 2.0 of the DataONE APIs introduced the ability to associate an optional series identifier (SID) with an object. Unlike a PID, resolution of a SID will always result in a pointer (URI) to a set of bytes that represent the latest revision of an object.
A revision or obsolescence chain is constructed by setting the obsoletes and obsoletedBy properties of the new and old objects respectively. For example, here PID_B represents the latest revision of the object as it obsoletes PID_A (object PID_A has a value of “PID_B” in its system metadata obsoletedBy property, and object PID_B has a value of “PID_A” in its system metadata obsoletes property):
+------------+ +------------+
| | ----- obsoletes ---> | |
| PID_B | | PID_A |
| | <--- obsoletedBy --- | |
+------------+ +------------+
resolve(PID_A) => PID_A
resolve(PID_B) => PID_B
In version 1.x of DataONE, it was necessary to manually follow the obsolescence chain in order to find the latest version of an object. This process is simplified in version 2.x and later through the use of series identifiers. The previous example can be augmented with series identifiers:
+------------+ +------------+
| | ----- obsoletes ---> | |
| PID_B | | PID_A |
| SID_1 | | SID_1 |
| | <--- obsoletedBy --- | |
+------------+ +------------+
resolve(PID_A) => PID_A
resolve(PID_B) => PID_B
resolve(SID_1) => PID_B
Each object in the obsolescence chain has the same value for the series identifier (“SID_1”), and calling resolve() with the value “SID_1” will result in the URIs from which the object “PID_B” may be retrieved, since that object is the latest revision in the obsolescence chain.
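The difference between following the chain by hand and resolving a series identifier can be sketched in a few lines of Python. This is an illustrative in-memory model only, not DataONE client code: the `SysMeta` class, the `catalog` dictionary and both helper functions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SysMeta:
    """Hypothetical, minimal stand-in for a system metadata record."""
    identifier: str                     # the PID
    series_id: Optional[str] = None     # the SID, if one was assigned
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


# The two-object chain from the diagram above.
catalog: Dict[str, SysMeta] = {
    "PID_A": SysMeta("PID_A", "SID_1", obsoleted_by="PID_B"),
    "PID_B": SysMeta("PID_B", "SID_1", obsoletes="PID_A"),
}


def follow_to_head(pid: str) -> str:
    """Version 1.x style: walk obsoletedBy links until none remain."""
    seen = set()
    while catalog[pid].obsoleted_by in catalog and pid not in seen:
        seen.add(pid)  # guards against accidental cycles
        pid = catalog[pid].obsoleted_by
    return pid


def resolve(identifier: str) -> str:
    """Version 2.x style: a PID resolves to itself, a SID to the head."""
    if identifier in catalog:
        return identifier
    members = [m for m in catalog.values() if m.series_id == identifier]
    heads = [m.identifier for m in members if m.obsoleted_by is None]
    if len(heads) != 1:
        raise LookupError(f"cannot resolve {identifier!r} in this simple model")
    return heads[0]
```

With this model, `resolve("SID_1")` and `follow_to_head("PID_A")` both return `"PID_B"`, while `resolve("PID_A")` returns `"PID_A"`. The sketch only handles a complete chain; incomplete chains need additional rules.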
The availability of PIDs and SIDs means users may now refer to objects using either a PID, when it is necessary or appropriate to refer to an exact set of bytes that represent an object, or a SID, when referring to the latest version of an object. The former is important for repeatable analyses, since the same content may be reliably referenced and retrieved. The latter is important for referencing the most up-to-date revision of some object, and so may be useful, for example, to perform analysis with the latest information available.
Unless indicated otherwise, the DataONE version 2.x and later APIs will accept either a PID or a SID when an identifier is specified as a request parameter.
In a perfect world, all obsolescence chains would have complete, bi-directional links, and the latest version of an object could be determined simply by examining the set of all objects with the same SID and selecting the object that is not obsoletedBy anything else. Obsolescence chains may be incomplete for various reasons, however, and in such situations resolution of series identifiers should still operate consistently.
The following series of scenarios demonstrates the behavior of the DataONE system when resolving a seriesId to a specific object. Resolution relies primarily on the obsoletes and obsoletedBy entries, falling back to the date when an object was added to a Member Node (dateUploaded) to determine the newer version.
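One way to read this rule is sketched below in Python. It is a simplified interpretation for illustration, not the Coordinating Node implementation: the `Obj` class and the `resolve_sid` function are hypothetical names, and the sketch only considers objects that are known and share the requested series identifier.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Obj:
    """Hypothetical minimal view of an object's system metadata."""
    pid: str
    sid: str
    date_uploaded: datetime
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


def resolve_sid(sid: str, objects: List[Obj]) -> str:
    """Pick the head of a series: links first, dateUploaded as fallback."""
    members = [o for o in objects if o.sid == sid]
    if not members:
        raise LookupError(sid)
    pids = {o.pid for o in members}

    # A member is superseded if either side of a link says so,
    # which tolerates chains where only one direction was recorded.
    superseded = set()
    for o in members:
        if o.obsoleted_by in pids:
            superseded.add(o.pid)
        if o.obsoletes in pids:
            superseded.add(o.obsoletes)

    # Fall back to all members if the links are contradictory (a cycle).
    candidates = [o for o in members if o.pid not in superseded] or members
    # Several candidates remain when the chain is broken: newest upload wins.
    return max(candidates, key=lambda o: o.date_uploaded).pid
```

In this sketch the obsolescence links always take precedence, so an object that another member of the series claims to obsolete is never chosen, whatever its dateUploaded. The date is consulted only to choose among objects that no known member of the series supersedes.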
The following notation is used herein:
Notation | Meaning
---|---
\(P_i\) | Refers to a Persistent Identifier (PID)
\(S_i\) | Refers to a Series Identifier (SID)
\(t_i\) | The value of dateUploaded for an object
\(t_1 < t_2\) | \(t_1\) is older than \(t_2\)
\(P_i \binom{S_j}{t_k}\) | An object with identifier (PID) \(P_i\), a seriesId (SID) of \(S_j\), and a dateUploaded of \(t_k\)
\(P_i \rightarrow P_j\) | \(P_i\) has an obsoletedBy entry that contains the value \(P_j\)
\(P_i \leftarrow P_j\) | \(P_j\) has an obsoletes entry that contains the value \(P_i\)
\(P_i \leftrightarrows P_j\) | \(P_i\) has an obsoletedBy entry that contains the value \(P_j\) and \(P_j\) has an obsoletes entry that contains the value \(P_i\)
\(P_i \square P_j\) | Neither obsoletedBy nor obsoletes is set by \(P_i\) or \(P_j\)
?? | Object was not synchronized, and so unknown to DataONE
\(resolve(S_i) \Rrightarrow P_j\) | Resolving SID \(S_i\) results in PID \(P_j\)
A set of objects \(O = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
All objects in \(O\) are participants in an obsolescence chain since \(P_2\) obsoletes \(P_1\) and \(P_1\) is obsoletedBy \(P_2\).
All elements of the obsolescence chain \(P_1 \leftrightarrows P_2\) have the same series identifier, \(S_1\).
The dateUploaded of \(P_1\) is older than that of \(P_2\).
This is a perfect obsolescence chain and resolving \(S_1\) will result in the object identified by \(P_2\).
A set of objects \(O = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
No obsolescence information associates objects in \(O\).
The dateUploaded of \(P_1\) is older than that of \(P_2\).
No obsolescence assertions are made, so the latest version is inferred from the most recent value of dateUploaded, and resolving \(S_1\) will result in \(P_2\).
A set of objects \(O = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
All objects in \(O\) are participants in an obsolescence chain since \(P_2\) obsoletes \(P_1\) even though \(P_1\) does not assert it is obsoletedBy \(P_2\).
All elements of the obsolescence chain \(P_1 \leftarrow P_2\) have the same series identifier, \(S_1\).
The dateUploaded of \(P_1\) is older than that of \(P_2\).
This is a damaged, but consistent obsolescence chain and resolving \(S_1\) will result in the object identified by \(P_2\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
A set of objects \(O_{S_2} = \lbrace P_3 \rbrace\) has the series identifier, \(S_2\).
Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a full, bi-directional obsolescence chain.
In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain but is the newest version in the obsolescence chain identified by \(S_1\).
Resolving \(S_2\) will result in \(P_3\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
A set of objects \(O_{S_2} = \lbrace P_3 \rbrace\) has the series identifier, \(S_2\).
Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a damaged, though consistent obsolescence chain.
In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain but is the newest version in the obsolescence chain identified by \(S_1\).
Resolving \(S_2\) will result in \(P_3\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O = O_{S_1} \cup P_3\) all participate in an obsolescence chain.
In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain but is the newest version in the obsolescence chain identified by \(S_1\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
A set of objects \(O_{S_2} = \lbrace P_4 \rbrace\) has the series identifier, \(S_2\).
Objects \(O = O_{S_1} \cup P_3 \cup O_{S_2}\) all participate in an obsolescence chain.
In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain but is the newest version in the obsolescence chain identified by \(S_1\).
Resolving \(S_2\) will result in \(P_4\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) all participate in an obsolescence chain, however the chain is broken with no way to traverse between \(P_2\) and \(P_4\) because the object that \(P_2\) indicates it is obsoletedBy, and the object that \(P_4\) indicates it obsoletes, is not recorded by the DataONE Coordinating Nodes (does not resolve).
In this case resolving \(S_1\) will result in \(P_4\) since that is the most recent object in the set of objects \(O_{S_1}\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) all participate in an obsolescence chain, however the chain is broken with no way to traverse between \(P_2\) and \(P_4\) because the object that \(P_4\) indicates it obsoletes is not recorded by the DataONE Coordinating Nodes (does not resolve).
In this case resolving \(S_1\) will result in \(P_4\) since that is the most recent object in the set of objects \(O_{S_1}\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).
The object \(P_{del}\) was deleted from the system, so the identifier is known, but the object and associated system metadata are no longer available.
Objects \(O_{S_1}\) all participate in an obsolescence chain, however the chain is broken with no way to traverse between \(P_2\) and \(P_4\) because the object that \(P_2\) indicates it is obsoletedBy, and the object that \(P_4\) indicates it obsoletes, is not recorded by the DataONE Coordinating Nodes (does not resolve).
In this case resolving \(S_1\) will result in \(P_4\) since that is the most recent object in the set of objects \(O_{S_1}\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_3 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) all participate in an obsolescence chain.
Object \(P_3\) has been archived, and so is not discoverable.
In this case resolving \(S_1\) will result in \(P_3\) which is the most recent object in the obsolescence chain even though it is archived.
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) participate in an obsolescence chain which is damaged by \(P_2\) indicating it is obsoletedBy some object that is not resolvable.
In this case resolving \(S_1\) will result in \(P_2\) which is the most recent resolvable object in the obsolescence chain.
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) participate in a damaged obsolescence chain since \(P_2\) indicates it is obsoletedBy some object that is not resolvable, and \(P_1\) does not assert it is obsoletedBy \(P_2\).
In this case resolving \(S_1\) will result in \(P_2\) which is the most recent resolvable object in the obsolescence chain.
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
A set of objects \(O_{S_2} = \lbrace P_3 \rbrace\) has the series identifier, \(S_2\).
Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a damaged obsolescence chain, with \(P_1\) not indicating it is obsoleted by \(P_2\), and \(P_3\) not indicating that it obsoletes \(P_2\).
In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain but is the newest version in the obsolescence chain identified by \(S_1\).
\(S_2\) will resolve to \(P_3\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).
A set of objects \(O_{S_2} = \lbrace P_5 \rbrace\) has the series identifier, \(S_2\).
Objects \(O = O_{S_1} \cup P_3 \cup O_{S_2}\) all participate in a damaged obsolescence chain with no assertion of the relationship between \(P_2\) and \(P_4\).
In this case resolving \(S_1\) will result in \(P_4\), which is not the most recent object in the obsolescence chain but is the newest version in the obsolescence chain identified by \(S_1\).
Resolving \(S_2\) will result in \(P_5\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).
A set of objects \(O_{S_2} = \lbrace P_4 \rbrace\) has the series identifier, \(S_2\).
Objects \(O_{S_1}\) and \(O_{S_2}\) are both damaged obsolescence chains, though the Coordinating Nodes may infer an association between \(O_{S_1}\) and \(O_{S_2}\): even though the object that \(P_2\) is obsoletedBy and the object that \(P_4\) obsoletes cannot be resolved, \(P_2.obsoletedBy\) and \(P_4.obsoletes\) are the same value.
In this case resolving \(S_1\) will result in \(P_2\) which is the most recent resolvable object in the obsolescence chain.
Resolving \(S_2\) will result in \(P_4\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) form a damaged obsolescence chain, though it can be inferred that \(P_2\) is obsoletedBy and \(P_4\) obsoletes the same object: even though that object cannot be resolved, \(P_2.obsoletedBy\) and \(P_4.obsoletes\) are the same value.
In this case resolving \(S_1\) will result in \(P_4\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_5 \rbrace\) have the same series identifier, \(S_1\).
The obsolescence chain \(O_{S_1}\) is broken, with no way to traverse from \(P_2\) to \(P_5\).
The dateUploaded places \(P_5\) as the newest object with the seriesId of \(S_1\).
Resolving \(S_1\) results in \(P_5\).
A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_3 \rbrace\) have the same series identifier, \(S_1\).
Objects \(O_{S_1}\) form a damaged obsolescence chain since only obsoletes values are specified.
The dateUploaded of \(P_1\) is newer than \(P_2\), which in turn is newer than \(P_3\).
In this case resolving \(S_1\) will result in \(P_3\) even though \(P_1\) is the most recent object since the obsolescence chain overrides the times.
The use of the PID or SID for either citation or analysis workflows is up to the user and is context dependent. In general, DataONE anticipates DATA and RESOURCE_MAP objects will be referenced by PID, to ensure reproducibility; and in general, METADATA documents will be referenced by SID, to take advantage of any data curation / correction efforts that would not otherwise affect scientific reproducibility. Additionally, clues for the content submitter’s preference can be found in the format of the identifiers themselves. For example, DOIs and EZIDs take a recognizable format, and are often encouraged in scientific communities for citations, so an end-user might take that into consideration when deciding which identifier to choose.
Todo
guidance on RESOURCE_MAPS - initial thoughts: depends on references to DATA objects, whether they be SIDs or PIDs
Depending on the Member Node used as the primary repository, content originators may have some choice in assigning identifiers. For those that do, it is advised that they assign PIDs and SIDs according to the typical usage pattern described above.
Some Member Nodes may not preserve past versions of content, in which case the PID is likely to be automatically generated, and the submitter only has to determine the SID, and may not need to know the difference between the SID and PID. Other Member Nodes may still be at v1 of the DataONE APIs and only allow assignment of the PID.
The SID is used to represent an object that may vary modestly over time, but remains conceptually the same. Content contributors should be careful to apply reasonable limits on the scope of documents so that an entity does not deviate too much from the original item. If it does, a new / different series should be initiated.
For Member Nodes that employ a mutable content storage model, the only additional DataONE requirement is that the Member Node generate a SystemMetadata document for the updated content, containing:
- unique PID in systemMetadata.identifier field
- new checksum
- the previous PID in the systemMetadata.obsoletes field
Ideally, the SystemMetadata of now unavailable versions will be maintained, and the obsoletedBy field is populated with the PID of the version that replaced it.
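The bookkeeping described above can be sketched as follows. This is a minimal illustration that represents system metadata as a plain Python dictionary, which is an assumption of the example rather than the DataONE SystemMetadata type. The function name `new_version_sysmeta` is hypothetical, and SHA-256 is simply the checksum algorithm chosen here.

```python
import hashlib


def new_version_sysmeta(new_pid: str, content: bytes, previous: dict) -> dict:
    """Build system metadata for updated content on a mutable Member Node.

    `previous` is the system metadata of the version being replaced. It is
    updated in place so that its obsoletedBy field points at the new PID.
    """
    if new_pid == previous["identifier"]:
        raise ValueError("each version must be given its own unique PID")

    current = {
        "identifier": new_pid,                    # unique PID
        "seriesId": previous.get("seriesId"),     # the series carries over
        "checksum": {                             # new checksum
            "algorithm": "SHA-256",
            "value": hashlib.sha256(content).hexdigest(),
        },
        "size": len(content),
        "obsoletes": previous["identifier"],      # the previous PID
        "obsoletedBy": None,
    }
    # Ideally the old record is kept and linked forward to its replacement.
    previous["obsoletedBy"] = new_pid
    return current
```

A node that discards old content could still retain the small `previous` record, which keeps the obsolescence chain bi-directional even though the old bytes are gone.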
Some Member Nodes may opt to preserve recent back-versions to aid the complete capture of versions by the DataONE network via synchronization.
to be determined
DataONE will attempt to synchronize all versions it is made aware of through the synchronization process, but may miss short-lived versions that exist only between two successive synchronizations of the Member Node. Note also that the synchronization schedule is not guaranteed: periods of DataONE maintenance may suspend synchronization, and high CN load could lengthen the synchronization interval.
Member Nodes keen to make sure versions have the highest chance of synchronization can choose to issue a CNCore.synchronize() command that will put the item on the synchronization queue instead of waiting for the harvest interval.
Conversely, if the Member Node expressly doesn’t want DataONE to preserve back-versions, it can set the systemMetadata.replicationPolicy.numberReplicas field to 0.
At its core, DataONE is in the business of preserving definite versions of content through centrally coordinated peer-to-peer replication. That is, DataONE Coordinating Nodes direct certain Member Nodes to replicate newly synchronized objects from the originating Member Node to better preserve them. New versions of objects appear as first class immutable objects with unique PIDs, even if originating from mutable Member Nodes.
From the DataONE perspective the only difference between objects from mutable Member Nodes and immutable Member Nodes is the completeness of the series of versions it is able to synchronize and replicate.
Current DataONE replication processes and fixity checks depend on content identified by a PID that does not change. If this were not enforced, mutable content from a member node would not be differentiated from corrupt copies of the object and our replication and recovery features would attempt to correct the byte inconsistency. The immutability requirement helps to ensure reproducible results of any use of an object. Any analysis on a data set repeated sometime in the future should yield identical results (within the limits of precision of the analytical tools) and this is one of the major guiding principles in creating DataONE as a long term data repository federation. By simply overwriting existing content using the same identifier, nodes cannot be relied upon for repeatable retrieval of content.
The proposal for supporting “mutable” content is to allow a series identifier (SID) to facilitate the semantics of citing an object at the conceptual level, instead of the version level. As content changes over time, new identifiers (PIDs) will still be used to mark each change, but the conceptual object can continue to be referred to with an unchanging identifier (SID). The member node will be responsible for creating each version and assigning a unique PID to it and these objects will be synchronized and replicated to other DataONE member nodes as they are today. So instead of allowing content to be directly modified, we are allowing strongly-versioned chains to be referenced by an identifier; and relaxing the requirement that all revisions be resolvable forever.
The proposed solution is to model and implement a “series identifier” (SID) along with modified services that would work with both SIDs and PIDs. From a DataONE perspective, the series identifiers would be assigned to all versions of an object, be unique in DataONE (assigned to only one version chain), and would be reserved just as PIDs - from the same namespace. The series identifier, once assigned to the version chain, would similarly be immutable, and could apply to all new versions of the item. It is also assumed that in order to coordinate users to use one identifier for citations, that the cardinality for the citation identifier would be 0..1. The semantics for making API calls with a SID would, in general, be to return responses as if the call were made with the most current PID.
Member Nodes that only maintain the latest version of an item would be required to use a new PID for any updated content, and modify the System Metadata appropriately so that the new version can be synchronized with the network. The same SID would typically be used for the updated object, although we would allow the revision chain to shift to a new SID as desired by the client and/or member node.
It cannot be assumed that a user with an identifier in hand knows whether it is a SID or a PID, so DataONE expects the user to refer to the System Metadata once it has the item to determine if the identifier used in the call matches the PID or the SID. Similarly, they could interrogate search results for the same information. For high-level interfaces, like D1Client.getD1Object(id), the PID of the object returned may or may not match the passed in ‘id’. So, high-level functions or applications that use resolve will have to make sure they handle the new resolving semantics.
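The check a caller would make can be sketched as below, assuming the system metadata has already been retrieved and is available as a plain dictionary with `identifier` and `seriesId` keys. The dictionary form and the `identifier_kind` helper are assumptions made for this example, not part of any DataONE client library.

```python
def identifier_kind(requested_id: str, sysmeta: dict) -> str:
    """Report whether the identifier used in a call was a PID or a SID."""
    if requested_id == sysmeta.get("identifier"):
        return "PID"
    if requested_id == sysmeta.get("seriesId"):
        return "SID"
    raise ValueError(
        f"{requested_id!r} matches neither the PID nor the SID of the returned object"
    )


# Example: the caller asked for "sid.1" and received the head of the series.
returned = {"identifier": "pid.7", "seriesId": "sid.1"}
kind = identifier_kind("sid.1", returned)   # "SID"
pinned_pid = returned["identifier"]         # use this for repeatable retrieval
```

An application that needs repeatable results can record `pinned_pid` alongside the identifier it was originally given, so later retrievals refer to exactly the same bytes.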
It is recommended that search indexes include a search field for the series identifier that can also be returned in the results.
A SID chain closes with two types of ends:
Type 1: An object on the SID chain doesn’t have the “obsoletedBy” field.
Example:
P1(S1) ⟺ P2(S1)
P2 is a type 1 end.
Type 2: An object on the SID chain does have the “obsoletedBy” field, but the PID in the “obsoletedBy” field has a different SID (including no SID value).
Examples:
P1(S1) ⟺ P2(S2)
P1(S1) ⟺ P2()
P1 is a type 2 end on both chains.
It is tricky to determine a type 2 end if the object in the “obsoletedBy” field is missing. For example, in P1(S1) ⟺ P2(S1) ⟹ ?? we do not know the series id of the object “??”. We therefore generally consider it a type 2 end unless we are sure it is not an end, that is, unless another object in the chain (one with the same series id) obsoletes the missing object.
In the previous example, P1(S1) ⟺ P2(S1) ⟹ ??, P2 is a type 2 end (case 12).
However, in P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S1), P2 is not an end (case 8), since “??” is in the obsoletes field of P4, which has the same series id, S1. We can therefore be sure that “??” has the series id S1 as well.
In P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S2), P2 is a type 2 end even though “??” is in the obsoletes field of P4, because P4 has a different series id, S2, and so we cannot be sure whether “??” has the series id S1 or S2.
Ideally, there is one and only one end on a SID chain, and that end is the HEAD (current) version. Chains of this kind are called ideal chains.
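The two end types, including the treatment of a missing “obsoletedBy” target, can be written down as a small classification routine. The sketch below follows the definitions above literally and is not taken from the Coordinating Node code. The `Obj` class and `chain_ends` function are hypothetical, and `known` stands for the set of objects that were actually synchronized.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Obj:
    """Hypothetical minimal view of an object's system metadata."""
    pid: str
    sid: Optional[str]
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


def chain_ends(sid: str, known: Dict[str, Obj]) -> Dict[str, int]:
    """Map the PID of each end of the SID chain to its end type (1 or 2)."""
    members = [o for o in known.values() if o.sid == sid]
    ends: Dict[str, int] = {}
    for obj in members:
        successor = obj.obsoleted_by
        if successor is None:
            ends[obj.pid] = 1                 # type 1: no obsoletedBy at all
        elif successor in known:
            if known[successor].sid != sid:   # different SID, or no SID
                ends[obj.pid] = 2
        else:
            # The successor is missing ("??"). It is an end unless another
            # member of the same series obsoletes that missing object.
            bridged = any(m.obsoletes == successor for m in members)
            if not bridged:
                ends[obj.pid] = 2
    return ends


def is_ideal(sid: str, known: Dict[str, Obj]) -> bool:
    """An ideal chain has exactly one end, which is then the HEAD."""
    return len(chain_ends(sid, known)) == 1
```

For P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S1) this yields a single type 1 end, P4, whereas replacing P4(S1) with P4(S2) makes P2 a type 2 end of the S1 chain.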
Mutable content implies that back-versions of content may not be readily available on the nodes that originally produce the content. For metadata and resource maps, the coordinating nodes will store previous versions of objects during the synchronization process, but any data updates will result in only the latest version being available at the originating node. If the data objects were replicated (as is the hope), it is likely that previous versions of the data can still be resolved from replica target nodes, though this is dependent on replication policies, synchronization schedules and the availability of replica storage across the federation.
The current DataONE storage model, through the MN_Storage.update method, places responsibility for storing versions squarely on the submitter. Each update to the object requires a new unique identifier (PID) and must state which PID the new version is obsoleting. We will continue to require that unique PIDs are provided for each and every version of an object, but the member node will not be required to maintain a copy of previous revisions if it chooses not to. An optional series identifier (SID) can be provided with object SystemMetadata to group revisions together and to provide a convenient way to refer to the latest version of the object.
As is currently the case, the member node should maintain all versions of content using unique identifiers (PID) and synchronization will harvest each new revision to the network. While there will be no requirement that the Member node continue to make available the object identified by the obsoleted PID, the hope is that they will persist the data history as best they can. If the objects in the revision chain have a SID assigned, the new PID will be considered the latest version of this series.
The member node can allow access to the current version of the object using MN_Read.get(sid) as a convenience and any reference to the SID would resolve to the latest version of the object with a potentially different checksum and PID from what was originally present when the citation was distributed.
The member node must [minimally] maintain system metadata for the current revision of the object. Any updated object is still required to be identified by a new unique PID, but would include the same SID used in the previous version. The obsoletes field should indicate that the new PID replaces the previous PID. The coordinating node learns about the updated content during synchronization because there is:
- a new PID
- an updated dateSystemMetadataUpdated timestamp
- an updated checksum (other fields may also be updated).
N.B. Multiple revisions between synchronization periods would not result in multiple versions recorded in the federation - just the revision[s] that happened to be synchronized would be persisted in DataONE. This leaves open the possibility of an end user retrieving a version from the MN that will ultimately not be persisted in perpetuity.
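The effect can be illustrated with a toy simulation. It assumes a Member Node that exposes only its latest revision at any moment and a harvester that visits at fixed times. The function `captured_revisions` and the integer timestamps are hypothetical and chosen only for the example.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def captured_revisions(
    revisions: Sequence[Tuple[float, str]],
    sync_times: Sequence[float],
) -> List[str]:
    """Return the PIDs a harvester would see.

    `revisions` is a list of (creation_time, pid) sorted by time. At each
    synchronization the node exposes only the newest revision created so far.
    """
    times = [t for t, _ in revisions]
    captured: List[str] = []
    for sync in sorted(sync_times):
        count = bisect_right(times, sync)    # revisions created up to `sync`
        if count:
            pid = revisions[count - 1][1]    # only the latest is available
            if pid not in captured:
                captured.append(pid)
    return captured


revisions = [(1, "p1"), (2, "p2"), (3, "p3"), (7, "p4")]
seen = captured_revisions(revisions, sync_times=[5, 10])
```

Here `seen` is `["p3", "p4"]`: revisions `p1` and `p2` existed only between synchronizations and are never recorded, although a user could have retrieved them from the Member Node in the meantime.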
DataONE essentially considers member nodes as the originators of selected versions of content. That is, not every intermediate revision on the way to a final product should necessarily be saved for future reference. Organizations following the mutable content model for storage may wish to limit the objects returned by listObjects() to those that are considered to be in their publishable form. Certainly these objects can later be updated as needed, but minimizing draft-status objects will reduce the amount of [possibly irretrievable] draft content floating around the federated network.
As illustrated in the optional use cases, the rate and regularity of change of objects can be widely variable. The more frequent the change, the less likely that all versions would need to be reproduced, and the utility of complete version history diminishes. One can imagine a member node serving up an unrecorded data stream, such as a web-cam, delaying creating a version until a user calls MN.get() on the item, by tee’ing the output stream to file while returning the object.
Additionally the need to keep past versions may be less important for metadata objects (correcting typos that do not change the meaning or interpretation of the data) than data objects or resource maps.
The use case of mutable data objects that grow with new records appended to the end of a table, for example, was given as a common practice for some groups, and one that would produce progressively redundant information with each persisted version. The motivation for rolling up records accumulated over time instead of new data files for each is the ease of use for end users. Using a SID to access the data object will always give the latest snapshot of the data records where old revisions may or may not also be accessible.
Objects like NetCDF files that include both metadata and data in the same object will be managed with the same PID and SID considerations. If only the metadata portion of the file is modified, the SID may remain the same, but a new PID and checksum must be created and made available for synchronization. The old revision may immediately become inaccessible using the PID and that is allowable under the proposal.
Implicit in the support for versioned content is support for retrieval of, or possibly just resolution to, the current object bytes by the identifier assigned in the originating system. At a minimum CNs will be required to support calculating which is the current version of a series of versions and returning it or its identifier. This will be accomplished using the series identifier (SID) associated with object[s] in a revision chain. The “current” version of an object is defined as the non-obsoleted object with a SID that matches the requested identifier. Objects that are marked as “archived” may be returned as the most current version, but they should not be seen in default search interfaces. Since DataONE identifiers have no special formatting semantics, those following a citation will not know by looking at the identifier whether it is referring to a specific version (PID) or the latest version of the item (SID), so services may be provided to easily investigate an entire version series. Existing services allow clients to deduce this information by inspecting the system metadata for the identifier and following any obsolescence properties as needed.
Because the content of an object is retrieved in a separate call from its system metadata, use of the SID for MN Read API calls is troublesome because the content may be updated between the two calls. It would be impossible to tell if the bytes retrieved were incorrect (bit rot) or correct (newer version) when comparing checksums in this case. If data consistency is important to the caller, the PID should be used to guarantee that only the expected bytes (or a NotFound exception) are returned by any MN.get calls.
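The race, and the way a PID avoids it, can be demonstrated with a small in-memory stand-in for a Member Node. Everything in this sketch is hypothetical: the `InMemoryNode` class, its method names and the `fetch_verified` helper are assumptions for illustration and do not correspond to a DataONE client library. `KeyError` stands in for a NotFound exception.

```python
import hashlib
from typing import Dict, Tuple


class InMemoryNode:
    """Toy node that serves objects by PID, or by SID for the latest version."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}   # pid -> content
        self.heads: Dict[str, str] = {}       # sid -> pid of latest version

    def add(self, pid: str, sid: str, data: bytes) -> None:
        self.objects[pid] = data
        self.heads[sid] = pid

    def _pid(self, identifier: str) -> str:
        pid = self.heads.get(identifier, identifier)
        if pid not in self.objects:
            raise KeyError(identifier)
        return pid

    def get_system_metadata(self, identifier: str) -> dict:
        pid = self._pid(identifier)
        digest = hashlib.sha256(self.objects[pid]).hexdigest()
        return {"identifier": pid, "checksum": digest}

    def get(self, identifier: str) -> bytes:
        return self.objects[self._pid(identifier)]


def fetch_verified(node: InMemoryNode, identifier: str) -> Tuple[str, bytes]:
    """Fetch system metadata first, then fetch the bytes by the exact PID."""
    sysmeta = node.get_system_metadata(identifier)
    pid = sysmeta["identifier"]        # pin the version before reading bytes
    data = node.get(pid)
    if hashlib.sha256(data).hexdigest() != sysmeta["checksum"]:
        raise ValueError(f"checksum mismatch for {pid}")
    return pid, data
```

If a new version is added between the two calls, `node.get(sid)` would return bytes that no longer match the checksum fetched earlier, while `node.get(pid)` still returns the expected bytes or fails outright.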
Those making a citation may wish to cite a specific version, or the latest current version. Followers of citations may wish to, if given an identifier representing a specific version (PID), find out what is the latest version (another, newer PID, or the SID). Conversely, if given a series identifier that navigates to the latest version, they may wish to find out what the content was at some previous point in time (e.g., the time of the citation) by following the obsolescence chain backward.
DataONE will be providing CN services for navigating to the latest version of an object, since the only way to do it currently is for clients to serially retrieve the system metadata for versions in the chain until they reach the head version, which can be inefficient. A new method to retrieve the entire version history is also under consideration.
The use cases below organize the identified requirements related to mutable content, with the most relevant use cases listed first.
Defined as activities that help ensure continued discoverability and usefulness and usually in reference to metadata, not data.
For institutions following a mutable content model:
What is the best way to version mutable data that changes frequently but may or may not be used? For example, a “current time” object replaced every minute, or a “current weather radar” object replaced every 3 hours.
The underlying dynamic here is the rate of mutation vs. the rate of synchronization.
This means supporting data objects that add records over time, either:
Some formats combine data with metadata, for example netCDF, so allowing the metadata to change without impacting the consistency assessment of the data itself.
but may be referenced using a seriesId
Mutable content can theoretically include things that are live feeds from sensors, but are otherwise not captured.
This proposal does not accommodate streams unless they have discrete snapshots that can be referenced as part of a seriesId.