Warning: These documents are under active
development and subject to change (version 2.1.0-beta).
The latest release documents are at:
https://purl.dataone.org/architecture
The DataONE system metadata is critical to ensuring that science data, science metadata, and other digital objects stored in DataONE are discoverable, accessible, auditable, verifiable, and associated with meaningful related digital objects. Digital objects in DataONE must also remain viable for the long term (many decades), so the system metadata must also include provenance information.
Due primarily to the project deliverable schedule, the current DataONE system metadata definition (DataONE, 2010b) is focused on the essential metadata values that must be available to support the earliest versions of DataONE. So far, relatively little attention has been paid to ensuring that system metadata contains appropriate attributes for the long haul.
This document describes some of the results of a practicum project carried out by Elizabeth (Betsy) Allen in Spring 2010. The PREMIS Data Dictionary for Preservation Metadata was used as the standard against which the explicit and implicit requirements of DataONE were measured: “The PREMIS Data Dictionary defines ‘preservation metadata’ as the information a repository uses to support the digital preservation process” (PREMIS Editorial Committee, 2008, p.3). PREMIS is focused on the “viability, renderability, understandability, authenticity, and identity” of digital objects “in a preservation context” (i.e., DataONE), and pays particular attention to “digital provenance (the history of an object) and to the documentation of relationships, especially relationships among different objects within the preservation repository” (p.3).
The PREMIS Data Dictionary seeks to identify “core” elements with its set of definitions, where core implies “things that most working preservation repositories are likely to need to know in order to support digital preservation” (p.3). PREMIS also considered “implementability”: because of the large amounts of data held within preservation systems, metadata values should be supplied automatically and be capable of automated processing; required human intervention should be avoided.
It should also be noted that PREMIS has been created according to the “1:1 principle” which “asserts that each description describes one and only one resource.... It is not possible to change a file...; one can only create a new file... that is related to the source object” (p.14). The practicality of this principle in DataONE has been debated by the Core Cyberinfrastructure Team, which recognizes its conceptual cleanliness as well as its operational impracticality.
PREMIS does not specify formats or other requirements for how preservation is implemented in a preservation repository: these are left as local decisions.
DataONE, as a preservation repository, could aim toward “PREMIS conformance” by implementing metadata elements that share the names and semantics of PREMIS Data Dictionary semantic units. PREMIS is also intended to be a foundation for interoperable preservation repositories (pp.15-16); PREMIS recommends using its semantic unit names to aid this interoperability. This document does not argue for or against seeking PREMIS conformance for DataONE; rather, it seeks to identify and summarize topics and outstanding issues for discussion within the broader DataONE community.
This practicum project also took the Open Archives Initiative’s Object Reuse and Exchange (OAI-ORE) (Lagoze et al, 2008) and the BagIt File Packaging Format (Boyko et al, 2009) into account as possible standards for aggregations, or packages, of Web resources and as possible methods for recording the relationships between preserved objects in DataONE. BagIt is “a hierarchical file packaging format designed to support disk-based or network-based storage and transfer of generalized digital content” (p.3). OAI-ORE “defines standards for the description and exchange of aggregations of Web resources”. This document looks at the points where BagIt and OAI-ORE may play a role in supporting the long-term preservation needs of DataONE.
BagIt and OAI-ORE have different strengths, and both systems have potential for use within DataONE. Their differences are significant, and so they cannot be viewed as substitutes for each other. The strengths of BagIt include simplicity (text and file/directory orientations; supports opaque payloads; supports aggregation; self-describing); its limitations include its hierarchical structure and what appears to be the “fixed-in-time” nature of a bag: the specification doesn’t discuss how the content of a bag might evolve over time, which limits its utility for supporting provenance tracking for objects in a preservation repository.
OAI-ORE’s strengths include its flexibility and extensibility, its graph-based architecture, specifications based upon stable and widely-used technologies including the URI and RDF, and the ability for relationships to be added to existing OAI-ORE resources as time passes. OAI-ORE is also related to other efforts that may play a role in DataONE, such as the Open Annotation Collaboration (http://www.openannotation.org/). OAI-ORE’s flexibility has a cost in terms of its complexity, so it will be more costly to develop and maintain a reliable implementation (although its current popularity may mean that existing implementations are available for use within DataONE). OAI-ORE is also expressed in XML, which tends to increase storage consumption. Large XML data stores are also time-consuming to parse without optimization.
The document is organized as follows. A set of high-level requirements was developed to represent the general needs of the DataONE system metadata. For each high-level requirement, the relevant sections of the PREMIS Data Dictionary are identified, and any missing, additional, or mismatched aspects are noted in the text. The documentation for BagIt and OAI-ORE was consulted, and its relevance to the requirement is also discussed. Optionally, use cases relevant to the requirement are described, using science data specified in EML, Dryad, ORNL DAAC, and/or NBII formats as examples. The section on each requirement ends with a general discussion of the overall analysis.
To increase accessibility and help ensure long-term preservation, the Coordinating Nodes will perform replications on digital objects. Systems metadata will be replicated at each of the three Coordinating Nodes, while datasets and their associated descriptive metadata will be replicated at a minimum of two Member Nodes.
When a replication is performed, the DataONE system will need to record which object was replicated (1.1), the unique identifier of the new copy (1.1), where the replicate is stored in the system (1.7), information on the derivative relationship between the original object and the new one (1.10) [in PREMIS, replication of an object is defined as one type of a derivation relationship; see p.13], and information on the event that created the replicate such as the unique identifier of the event (2.1), type = replication (2.2), time (2.3), who performed the replication (2.6), and a link between the replicated object and the event (2.7).
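The replication metadata listed above can be sketched as a pair of records. This is a minimal illustration under stated assumptions, not the DataONE system metadata schema: the class, field, and function names and the example identifiers are hypothetical, and only the PREMIS semantic unit numbers in the comments come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple
import uuid


@dataclass(frozen=True)
class ReplicationEvent:
    event_id: str                # unique identifier of the event (2.1)
    event_type: str              # type = replication (2.2)
    event_time: datetime         # time of the event (2.3)
    agent_id: str                # who performed the replication (2.6)
    object_ids: Tuple[str, ...]  # link between the objects and the event (2.7)


@dataclass(frozen=True)
class ReplicaRecord:
    source_id: str               # which object was replicated (1.1)
    replica_id: str              # unique identifier of the new copy (1.1)
    location: str                # where the replica is stored (1.7)
    relationship_type: str       # derivation relationship to the original (1.10)
    event_id: str                # the event that produced this replica


def record_replication(source_id: str, replica_id: str, location: str,
                       agent_id: str, when: Optional[datetime] = None):
    """Build the object-side and event-side records for one replication."""
    when = when or datetime.now(timezone.utc)
    event = ReplicationEvent(
        event_id=str(uuid.uuid4()),
        event_type="replication",
        event_time=when,
        agent_id=agent_id,
        object_ids=(source_id, replica_id),
    )
    replica = ReplicaRecord(
        source_id=source_id,
        replica_id=replica_id,
        location=location,
        relationship_type="derivation",
        event_id=event.event_id,
    )
    return replica, event
```

Keeping the event separate from the replica record mirrors the PREMIS split between object and event entities, so the same event shape could be reused for other event types.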
(Requirement) System supports data storage https://trac.dataone.org/ticket/383
(Requirement) The infrastructure must survive destruction of one or more data storage nodes https://trac.dataone.org/ticket/411
(Requirement) Data and metadata is replicated to at least one other Member Node https://trac.dataone.org/ticket/433
Migration is one kind of preservation strategy that Coordinating Nodes may choose to use when a particular format of an object is in danger of obsolescence. Also, over time, the physical media on which the digital objects are stored will degrade, and an object will need to be migrated to a new medium.
Prior to migrating an object to a different format the system must first know the following information: current format name and version (1.5.4.1.1 and 1.5.4.1.2, respectively), assurance that the file to be migrated is not corrupted (1.5.2 fixity), which alternative format is the best possible format to migrate the file to given the hardware and software requirements (refer to a digital format registry?), and name and version of application that created the object (1.5.5.1 and 1.5.5.2, respectively).
When performing a migration due to format obsolescence, the DataONE system will need to record the following metadata: which object is being migrated (1.1), the unique identifier of the new object (1.1), where in the system the new object is stored (1.7), and information on the derivative relationship between the original object and the new one (1.10). It also needs to record metadata on the event that created the newly migrated object, such as its unique identifier (2.1), type = migration (2.2), time (2.3), who performed the migration (2.6), and a link between the migrated object and the event (2.7).
When migration for physical media obsolescence occurs, the system should record where the object is now located (1.7.1 contentLocation).
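The information required before a format migration can be sketched as a simple readiness check. This is an illustration only: `MigrationCandidate`, `missing_prerequisites`, and the field names are hypothetical, and how a target format is chosen (for example via a format registry) is left open, as it is in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MigrationCandidate:
    object_id: str
    format_name: Optional[str]           # current format name (1.5.4.1.1)
    format_version: Optional[str]        # current format version (1.5.4.1.2)
    fixity_verified: bool                # file confirmed uncorrupted (1.5.2 fixity)
    target_format: Optional[str]         # chosen alternative format
    creating_app_name: Optional[str]     # creating application name (1.5.5.1)
    creating_app_version: Optional[str]  # creating application version (1.5.5.2)


def missing_prerequisites(candidate: MigrationCandidate) -> List[str]:
    """Return the names of prerequisites that are not yet satisfied."""
    checks = [
        ("format name", candidate.format_name),
        ("format version", candidate.format_version),
        ("fixity", candidate.fixity_verified),
        ("target format", candidate.target_format),
        ("creating application name", candidate.creating_app_name),
        ("creating application version", candidate.creating_app_version),
    ]
    # A prerequisite is missing when its value is None, empty, or False.
    return [label for label, value in checks if not value]
```

An empty list means the object has everything the text lists as necessary before migration; a non-empty list names what still has to be gathered.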
(Requirement) The infrastructure must support long term preservation of data https://trac.dataone.org/ticket/410
(Requirement) Maintain original copies of all science metadata https://trac.dataone.org/ticket/439
PREMIS suggests semantic units that the system should record to define structural, derivation, and dependency relationships. For structural relationships, which “show relationships between parts of objects” (p.13), the system records the relationship type, such as “structural” (1.10.1), the relationship sub-type, such as “is a part of” (1.10.2), and the unique identifier of the related object (1.10.3).
Derivative relationships “result from the replication or transformation of an object” (p.13). Because this type of relationship involves an event, the system must record the unique event identifier (1.10.4).
Dependency relationships exist “when one object requires another to support its function, delivery, or coherence of content”. Examples include a data type definition needed to render another file, or modules needed by a software program that is required to render an object. These relationships are characterized in 1.8.4 “dependency” and 1.8.5.5 “swDependency” respectively.
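The structural and derivation relationships described above can be sketched as one record type with a small validation rule. The `Relationship` class and its field names are hypothetical; dependency relationships are deliberately left out because the text places them under 1.8.4 and 1.8.5.5 rather than 1.10.

```python
from dataclasses import dataclass
from typing import Optional

# Relationship types recorded under semantic unit 1.10 in this sketch.
RELATIONSHIP_TYPES = {"structural", "derivation"}


@dataclass(frozen=True)
class Relationship:
    rel_type: str                           # relationship type (1.10.1)
    sub_type: str                           # relationship sub-type (1.10.2)
    related_object_id: str                  # identifier of the related object (1.10.3)
    related_event_id: Optional[str] = None  # identifier of the related event (1.10.4)

    def __post_init__(self):
        if self.rel_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.rel_type!r}")
        # A derivation results from an event, so the event must be identified.
        if self.rel_type == "derivation" and self.related_event_id is None:
            raise ValueError("derivation relationships require an event identifier")
```

For example, `Relationship("structural", "is a part of", "package-001")` is accepted, while a derivation without an event identifier raises `ValueError`.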
(Requirement) Identifiers for all objects https://trac.dataone.org/ticket/317
(Requirement) Support arbitrary unique identifiers https://trac.dataone.org/ticket/385
(Requirement) Identifiers always refer to the same object https://trac.dataone.org/ticket/412
Digital object discovery by DataONE users is supported primarily by the descriptive metadata associated with data objects ingested into DataONE. The DataONE design refers to this metadata as “science metadata” (DataONE, 2010a).
Other digital object scenarios should also be considered. For example, when managing digital objects for long-term curation and stewardship, DataONE personnel and processes may use the system metadata (DataONE, 2010b) as the means for digital object discovery.
PREMIS defines descriptive metadata as “...metadata ... used to describe Intellectual Entities” (p.23), which in DataONE maps to the science metadata submitted to the system.
DataONE Use Case 33 - Search for Data (http://mule1.dataone.org/ArchitectureDocs/UseCases/33_uc.html)
(Requirement) Enable efficient mechanisms for users to discover content https://trac.dataone.org/ticket/384
Relationships, entities, citation, life science identifiers [exchange of digital objects between repositories? METS?]
Potential users of digital objects need to know of any structural, derivative, and dependency relationships in order to re-use an object. For example, databases are often stored in repositories as two files: one for content and one for the schema. The user needs to access both files to re-use the database. The suggested PREMIS semantic units for relationships are described under general requirement 3. Citation and persistent identifiers, such as LSIDs, are not addressed in PREMIS.
(Requirement) Enable efficient mechanisms for users to discover content https://trac.dataone.org/ticket/384
Emulation is a core preservation strategy for digital objects.
PREMIS provides the notion of a representation as “the set of files required” to “maintain usable versions of intellectual entities over time” (p. 8). Emulation is one preservation approach to ensure long-term usability of digital objects. To emulate a digital object whose format is obsolete, the DataONE system must record information that characterizes both the software (1.8.5) and hardware (1.8.6) environment for each object. PREMIS requires software/hardware name and type to be recorded, while software version (1.8.5.2), software components needed by the software (1.8.5.5), and other information are optional.
(Requirement) The infrastructure must support long term preservation of data https://trac.dataone.org/ticket/410
(Requirement) Maintain original copies of all science metadata https://trac.dataone.org/ticket/439
Recording provenance allows users of digital objects to follow who has created and acted upon the object, what action was taken, and when the action occurred. PREMIS uses associations between events and objects to record provenance. PREMIS, however, leaves decisions on which events are worthy of recording to the preservation system.
PREMIS states that provenance is one of the many attributes necessary for a digital object to be authentic (pg. 200); however, because demonstrating provenance involves many semantic units, it deserves to be its own requirement rather than a sub-requirement for authenticity [bad justification?]. The DataONE system would capture provenance by recording who is doing what to the digital object through time. This includes recording information on the unique object identifier (1.1), the original name of the object if it was not created within the repository (1.6), and any relationships this item has with other digital objects such as “is a source of” (1.10). The majority of semantic units necessary to record provenance come from the event entity. The system will need to create a unique identifier for each event (2.1), describe the event type taken from a controlled vocabulary (e.g. migration and ingestion) (2.2), and record when the event occurred (2.3). Optionally, it could store details about the event, which are non-machine readable (2.4), and any information on the success of the event (2.5).
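A provenance trail built from event records can be sketched as a filter and sort over an event log. The `ProvenanceEvent` class, the `history` function, and the event type strings are hypothetical; which events DataONE would actually record is, as noted above, a decision left to the preservation system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    event_id: str                  # unique identifier of the event (2.1)
    event_type: str                # controlled-vocabulary event type (2.2)
    when: datetime                 # when the event occurred (2.3)
    object_id: str                 # object the event acted upon (1.1)
    agent_id: str                  # who performed the action
    detail: Optional[str] = None   # free-text, non-machine-readable detail (2.4)
    outcome: Optional[str] = None  # information on the success of the event (2.5)


def history(events: Iterable[ProvenanceEvent], object_id: str) -> List[ProvenanceEvent]:
    """Return the events for one object in chronological order."""
    return sorted(
        (e for e in events if e.object_id == object_id),
        key=lambda e: e.when,
    )
```

Reading the returned list from first to last answers the three provenance questions in the text: who acted, what they did, and when.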
(Requirement) Identifiers for all objects https://trac.dataone.org/ticket/317
(Requirement) Support arbitrary unique identifiers https://trac.dataone.org/ticket/385
(Requirement) Consistent mechanism for identifying users https://trac.dataone.org/ticket/390
(Requirement) Identifiers always refer to the same object https://trac.dataone.org/ticket/412
PREMIS defines viability as the “property of being readable from media”. The PREMIS working group intentionally avoided defining detailed semantic units for viability, with the exception of 1.7.2, storage media, where the medium for storing an object is defined. More detailed information on media would likely be desirable so that repository managers would know when to refresh the medium.
(Requirement) The infrastructure must support long term preservation of data https://trac.dataone.org/ticket/410
(Requirement) Maintain original copies of all science metadata https://trac.dataone.org/ticket/439
Authenticity is the “quality of being what it purports to be”. This includes the concepts of fixity, integrity, and the use of digital signatures.
PREMIS has many semantic units that can be used as evidence of an object’s authenticity (1.5 and its sub-units). It is mandatory to record either the format designation of the object from a controlled vocabulary (e.g. base64 or Adobe PDF) (1.5.4.1) or to identify the format type through reference to a format registry (1.5.4.2). It is recommended, though optional, that the DataONE system record the message digest (1.5.2.2), the specific algorithm used to create the message digest (1.5.2.1), and who created the original digest (1.5.2.3).
Digital signature information is an optional unit in PREMIS (1.9). [“A repository may have a policy of generating digital signatures for files on ingest, or may have a need to store and later validate incoming digital signatures”. Which is it for DataONE or is it both?] To use digital signatures the system needs to record the signature value (1.9.1.4), the “designation for the encryption and hash algorithms used for signature generation” (1.9.1.3), the rules for validating the signature (1.9.1.5), the encoding used for the signature (1.9.1.1), the signer’s public key (1.9.1.7), and who created the signature (1.9.1.2 or 3.1). [Should recording the object’s size be a requirement for authenticity? It is a characteristic, but I think it is more important for ensuring that a replication was successful]
The semantic unit 1.5 is used to record object characteristics, but demonstrating that the object characteristics are in fact valid occurs through events. For example, performing regular fixity checks is captured through the units event identifier (2.1), event type such as “fixity check” (2.2), and event date (2.3). Digital signature validation and format validation are also types of events that need to be recorded to show authenticity (2.3).
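A fixity check recorded as an event can be sketched with Python's standard `hashlib` module. The function name, the dictionary keys, and the `"success"`/`"failure"` outcome strings are assumptions made for illustration; the text does not prescribe a digest algorithm, so SHA-256 is used here only as a default.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def fixity_check(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> dict:
    """Recompute the digest of `data` and return an event describing the result."""
    # hashlib.new accepts any algorithm name the local Python build supports.
    actual_digest = hashlib.new(algorithm, data).hexdigest()
    matches = actual_digest == recorded_digest.lower()
    return {
        "event_id": str(uuid.uuid4()),             # event identifier (2.1)
        "event_type": "fixity check",              # event type (2.2)
        "event_time": datetime.now(timezone.utc),  # event date (2.3)
        "algorithm": algorithm,                    # digest algorithm (1.5.2.1)
        "outcome": "success" if matches else "failure",
    }
```

The stored digest (1.5.2.2) is an object characteristic, while each call produces a new event, which matches the distinction the text draws between recording characteristics and demonstrating that they remain valid.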
(Requirement) The infrastructure must support long term preservation of data https://trac.dataone.org/ticket/410
(Requirement) Maintain original copies of all science metadata https://trac.dataone.org/ticket/439
Software, organization, public key
Principals are called agents in PREMIS. They are associated with either events that occur to a digital object or the rights associated with an object, but they are never directly linked to an object. PREMIS has only one required semantic unit for a principal, which is agentIdentifier (3.1). Other optional units used to describe an agent include name (3.2) and type, such as organization, software, or person (3.3). The PREMIS Data Dictionary suggests that systems use digital signatures for authenticating submitters to and distributors from the system; however, because validation takes place right after signing, there is no need for the repository to preserve the signature itself through time. The system can record the act of validation as an Event if desired.
(Requirement) Consistent mechanism for identifying users https://trac.dataone.org/ticket/390
(Requirement) Enable different classes of users commensurate with their roles https://trac.dataone.org/ticket/391
Boyko, A., Kunze, J., Littman, J., Madden, L., Vargas, B. (2009). The BagIt File Packaging Format (V0.96). Retrieved April 2, 2010, from http://www.ietf.org/Internet-drafts/draft-kunze-bagit-04.txt
DataONE. (2010a). Metadata Attributes for Discovery. Retrieved April 2, 2010, from http://mule1.dataone.org/ArchitectureDocs/SearchMetadata.html.
DataONE. (2010b). System Metadata. Retrieved April 2, 2010, from http://mule1.dataone.org/ArchitectureDocs/SystemMetadata.html.
Lagoze, C., Van de Sompel, H., Johnston, P., Nelson, M., Sanderson, R., Warner, S. (2008). Open Archives Initiative Object Reuse and Exchange: ORE User Guide - Primer. Retrieved April 2, 2010, from http://www.openarchives.org/ore/1.0/primer.
PREMIS Editorial Committee. (2008). Data Dictionary for Preservation Metadata: PREMIS version 2.0. S.l. Retrieved April 2, 2010, from http://www.loc.gov/standards/premis/v2/premis-2-0.pdf.