Ecological Metadata Language (EML) Specification

Preface

Introduction

The Ecological Metadata Language (EML) is a metadata standard developed by the ecology discipline and for the ecology discipline. It is based on prior work done by the Ecological Society of America and associated efforts (Michener et al., 1997, Ecological Applications). EML is implemented as a series of XML document types that can be used in a modular and extensible manner to document ecological data. Each EML module is designed to describe one logical part of the total metadata that should be included with any ecological dataset.

Purpose Statement

To provide the ecological community with an extensible, flexible metadata standard for use in data analysis and archiving that allows automated machine processing, searching, and retrieval.

Features

The architecture of EML was designed to serve the needs of the ecological community, and has benefited from previous work in other related metadata languages. EML has adopted the strengths of many of these languages, but also addresses a number of shortcomings that have inhibited the automated processing and integration of dataset resources via their metadata. The following list represents some of the features of EML:

Modularity: EML was designed as a collection of modules rather than one large standard, to facilitate future growth of the language in both breadth and depth. Because EML is implemented with an extensible architecture, groups may choose which of the core modules are pertinent to describing their data, literature, and software resources. Also, if EML falls short in a particular area, it may be extended by creating a new module that describes the resource (e.g. a detailed soils metadata profile that extends eml-dataset). The intent is to provide a common set of core modules for information exchange, but to allow for future customizations of the language without the need to go through a lengthy 'approval' process.

Detailed Structure: EML strives to balance the tradeoff between too much detail and enough detail to enable advanced services that process data by parsing the accompanying metadata. Therefore, a driving question throughout the design was: 'Will this particular piece of information be machine-processed, just human readable, or both?' Information was then broken down into more highly structured elements when the answer involved machine processing.

Compatibility: EML adopts much of its syntax from other metadata standards that have evolved from the expertise of groups in other disciplines. Whenever possible, EML adopted entire trees of information in order to facilitate conversion of EML documents into other metadata languages. EML was designed with the following standards in mind: Dublin Core Metadata Initiative, the Content Standard for Digital Geospatial Metadata (CSDGM, from the US Geological Survey's Federal Geographic Data Committee (FGDC)), the Biological Profile of the CSDGM (from the National Biological Information Infrastructure), the International Organization for Standardization's Geographic Information Standard (ISO 19115), the ISO 8601 Date and Time Standard, the OpenGIS Consortium's Geography Markup Language (GML), the Scientific, Technical, and Medical Markup Language (STMML), and the Extensible Scientific Interchange Language (XSIL).

Strong Typing: EML is implemented in XML Schema, an Extensible Markup Language (XML) vocabulary that defines the rules governing EML syntax. XML Schema is a recommendation of the World Wide Web Consortium, so a metadata document that is said to comply with the syntax of EML will structurally meet the criteria defined in the XML Schema documents for EML. Over and above the structure (what elements can be nested within others, cardinality, etc.), XML Schema provides the ability to use strong data typing within elements. This allows for finer validation of the contents of the element, not just its structure. For instance, an element may be of type 'date', and so the value that is inserted in the field will be checked against XML Schema's definition of a date. Traditionally, XML documents (including previous versions of EML) have been validated against Document Type Definitions (DTDs), which do not provide a means to employ strong validation on field values through typing.

There is a distinction between the content model (i.e. the concepts behind the structure of a document - which fields go where, cardinality, etc.) and the syntactic implementation of that model (the technology used to express the concepts defined in the content model). The normative sections below define the content model, and the XML Schema documents distributed with EML define the syntactic implementation. For the foreseeable future, XML Schema will be the syntactic specification, although it may change later.
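The difference between structural validation and typed validation described under Strong Typing can be illustrated with a small Python sketch. This is not an XML Schema validator: the function name is_valid_date is hypothetical, and the check covers only a simplified form of the 'date' type (YYYY-MM-DD, with time zones and negative years deliberately left out).

```python
import re
from datetime import date

# Simplified lexical form of a schema 'date': YYYY-MM-DD.
# Assumption: time zone suffixes and negative years are not handled here.
_DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_valid_date(value: str) -> bool:
    """Return True if the field value looks like a real calendar date."""
    match = _DATE_PATTERN.fullmatch(value.strip())
    if match is None:
        return False  # wrong shape, e.g. free text
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as February 30
    except ValueError:
        return False
    return True


for candidate in ["2002-10-14", "2002-02-30", "October 14"]:
    print(candidate, is_valid_date(candidate))
```

A DTD-style check would accept all three values, since each is just character data inside a well-placed element; a typed check rejects the last two.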

Overview of EML modules and their use

Module Overview

Foreword

The following section briefly describes each EML module and how the modules are logically designed to document ecological resources. Some of the modules are dependent on others, while others may be used as stand-alone descriptions. This section describes the modules using a "top down" approach, starting from the top-level eml wrapper module, followed by modules of increasing detail. However, some modules, such as eml-access, may be used at many levels. These modules are described where appropriate.

Root-level structure

Top-level resources

The following four modules are used to describe separate resources: datasets, literature, software, and protocols. However, note that the dataset module makes use of the other top-level modules by importing them at different levels. For instance, a dataset may have been produced using a particular protocol, and that protocol may come from a protocol document in a library of protocols. Likewise, citations are used throughout the top-level resource modules by importing the literature module.

Supporting Modules - Adding detail to top-level resources

The following six modules are used to qualify the resources being described in more detail. They are used to describe access control rules, distribution of the metadata and data themselves, parties associated with the resource, the geographic, temporal, and taxonomic extents of the resource, the overall research context of the resource, and detailed methodology used for creating the resource. Some of these modules are imported directly into the top-level resource modules, often in many locations in order to limit the scope of the description. For instance, the eml-coverage module may be used for a particular column of a dataset, rather than the entire dataset as a whole.

Data organization - Modules describing dataset structures

The following three modules are used to document the logical layout of a dataset. Many datasets are composed of multiple entities (e.g. a series of tabular data files, or a set of GIS features, or a number of tables in a relational database). Each entity within a dataset may contain one or more attributes (e.g. multiple columns in a data file, multiple attributes of a GIS feature, or multiple columns of a database table). Lastly, there may be both simple and complex relationships among the entities within a dataset. The relationships, or the constraints that are to be enforced in the dataset, are described using the eml-constraint module. All entities share a common set of information (described using eml-entity), but some discipline-specific entities have characteristics that are unique to that entity type. Therefore, the eml-entity module is extended for each of these types (dataTable, spatialRaster, spatialVector, etc.), which are described in the next section.

Entity types - Detailed information for discipline-specific entities

The following six modules are used to describe a number of common types of entities found in datasets. Each entity type uses the eml-entity module elements as its base set of elements, but then extends the base with entity-specific elements. Note that the eml-spatialReference module is not an entity type, but is rather a common set of elements used to describe spatial reference systems in both eml-spatialRaster and eml-spatialVector. It is described here in relation to those two modules.

Utility modules - Metadata documentation enhancements

The following modules are used to highlight the information being documented in each of the above modules where prose may be needed to convey the critical metadata. The eml-text module provides a number of text-based constructs to enhance a document (including sections, paragraphs, lists, subscript, superscript, emphasis, etc.).

Dependency Chart

The modules in EML depend on each other in complex ways. To view these dependencies, see the EML Dependency Chart.

Internationalization - Metadata in multiple languages

EML supports internationalization using the i18nNonEmptyStringType. Fields defined as this type include:

- Title
- Keyword
- Contact information (e.g. names, organizations, addresses)

TextType fields also support language translations. These fields include:

- Abstract
- Methods
- Protocol

Internationalization techniques

Core metadata should be provided in English. The core elements can be augmented with translations in a native language. Detailed metadata can be provided in the native language as declared using the xml:lang attribute. Authors can opt to include English translations of this detailed metadata as they see fit.

The following example metadata document is provided primarily in Portuguese but includes English translations of core metadata fields. The xml:lang="pt_BR" attribute at the root of the EML document indicates that, unless otherwise specified, the content of the document is supplied in Portuguese (Brazil). The xml:lang="en_US" attributes on child elements denote that the content of that element is provided in English.

Core metadata (e.g. title) is provided in English, supplemented with a Portuguese translation using the value tag with an xml:lang attribute. Note that child elements can override the root language declaration of the document as well as the language declaration of their containing elements. The abstract element is primarily given in Portuguese (as inherited from the root language declaration), with an English translation.

Many EML fields are repeatable (e.g. keyword) so that multiple values can be provided for the same concept. Translations for these fields should be included as nested value tags to indicate that they are equivalent concepts expressed in different languages rather than entirely different concepts.

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xml:lang="pt_BR"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title xml:lang="en_US">
      Sample Dataset Description
      <value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
    </title>
    ...
    <abstract>
      <para>
        Neste exemplo, a tradução em Inglês é secundário
        <value xml:lang="en_US">In this example, the English translation is secondary</value>
      </para>
    </abstract>
    ...
    <keywordSet>
      <keyword keywordType="theme">
        árvore
        <value xml:lang="en_US">tree</value>
      </keyword>
      <keyword keywordType="theme">
        água
        <value xml:lang="en_US">water</value>
      </keyword>
    </keywordSet>
    ...
  </dataset>
</eml:eml>
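As a rough illustration of how a consumer might read such translations, the following Python sketch collects the text of an element and its nested value tags, keyed by language. It uses only the standard library's xml.etree.ElementTree. The function name translations and the trimmed-down sample document are assumptions made for this example, and language inheritance is simplified to a single value passed in by the caller rather than a full walk up the ancestor chain.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def translations(element, inherited_lang):
    """Return {language: text} for an element and its nested <value> tags."""
    lang = element.get(XML_LANG, inherited_lang)
    result = {}
    own_text = (element.text or "").strip()
    if own_text:
        result[lang] = own_text
    for value in element.findall("value"):
        text = (value.text or "").strip()
        if text:
            # A <value> without xml:lang falls back to its parent's language.
            result[value.get(XML_LANG, lang)] = text
    return result


# Hypothetical, shortened fragment modeled on the example document above.
SAMPLE = """<dataset xml:lang="pt_BR">
  <title xml:lang="en_US">
    Sample Dataset Description
    <value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
  </title>
  <keyword keywordType="theme">
    árvore
    <value xml:lang="en_US">tree</value>
  </keyword>
</dataset>"""

root = ET.fromstring(SAMPLE)
doc_lang = root.get(XML_LANG)
title = translations(root.find("title"), doc_lang)
keyword = translations(root.find("keyword"), doc_lang)
print(title)
print(keyword)
```

The keyword element carries no xml:lang of its own, so its primary text is attributed to the language declared on the enclosing element, which mirrors the override behavior described above.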

Technical Architecture (Normative)

Introduction

This section explains the rules of EML. Some rules cannot be written directly into the XML Schemas or enforced by an XML parser. These are guidelines that every EML package must follow in order to be considered EML-compliant.

Module Structure

Each EML module, with the exception of "eml" itself, has a top-level choice between the structured content of that module and a "references" field. This enables the reuse of content previously defined elsewhere in the document. Methods for defining and referencing content are described in the next section.

Reusable Content

EML allows the reuse of previously defined structured content (DOM sub-trees) through the use of key/keyRef type references. In order for an EML package to remain cohesive and to allow for the cross-platform compatibility of packages, the following rules with respect to packaging must be followed.

- An ID is required on the eml root element.
- IDs are optional on all other elements.
- If an ID is not provided, that content must be interpreted as representing a distinct object.
- If an ID is provided for content, then that content is distinct from all other content except for content that references its ID.
- If a user wants to reuse content to indicate the repetition of an object, a reference must be used. Two identical ids with the same system attribute cannot exist in a single document.
- "Document" scope is defined as identifiers unique only to a single instance document (if a document does not have a system attribute or if scope is set to 'document', then all IDs are defined as distinct content).
- "System" scope is defined as identifiers unique to an entire data management system (if two documents share a system string, then any IDs in those two documents that are identical refer to the same object).
- If an element references another element, it must not have an ID itself. The system attribute must have the same value in both the target and referencing elements or it must be absent in both.
- All EML packages must have the 'eml' module as the root.
- The system and scope attributes are always optional, except on the 'eml' module, where the scope attribute is fixed as 'system'. The scope attribute defaults to 'document' for all other modules.
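Three of the rules above apply within a single document and can be sketched in a few lines of Python: ids must be unique, every reference must point at an existing id, and an element that contains a references child must not carry an id itself. This is only an illustrative sketch, not the parser distributed with EML. The function name check_references and the shortened sample document are hypothetical, and the system and scope attributes are ignored for simplicity.

```python
import xml.etree.ElementTree as ET


def check_references(xml_text):
    """Return a list of problems; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    problems = []
    seen_ids = set()
    for element in root.iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        if element_id in seen_ids:
            problems.append(f"duplicate id: {element_id}")
        seen_ids.add(element_id)
        # An element may define content or reference it, but not both.
        if element.find("references") is not None:
            problems.append(f"element with id {element_id} also contains references")
    for ref in root.iter("references"):
        target = (ref.text or "").strip()
        if target not in seen_ids:
            problems.append(f"reference to missing id: {target}")
    return problems


# Hypothetical fragment containing a duplicate id and a dangling reference.
SAMPLE = """<dataset id="ds.1">
  <creator id="23445"><individualName><surName>Smith</surName></individualName></creator>
  <creator id="23445"><individualName><surName>Myer</surName></individualName></creator>
  <contact><references>23447</references></contact>
</dataset>"""

for problem in check_references(SAMPLE):
    print(problem)
```

Collecting all ids before checking references means a reference may legally appear earlier in the document than the element it points to.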

EML Parser

Because some of these rules cannot be enforced in XML Schema, we have written a parser that checks the validity of the references and IDs used in your document. This parser is included with the 2.1.0 release of EML. To run the parser, you must have Java 1.3.1 or higher. To execute it, change into the lib directory of the release and run the 'runEMLParser' script, passing your EML instance file as a parameter. There is also an online version of this parser that is publicly accessible. The online parser both validates your XML document against the schema and checks the integrity of your references.

ID and Scope Examples

Example Documents

Invalid EML due to duplicate identifiers

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Myer</surName>
      </individualName>
    </creator>
    ...
  </dataset>
</eml:eml>

This instance document is invalid because both creator elements have the same id. No two elements can have the same string as an id.

Invalid EML due to a non-existent reference

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Myer</surName>
      </individualName>
    </creator>
    ...
    <contact>
      <references>23447</references>
    </contact>
  </dataset>
</eml:eml>

This instance document is invalid because the contact element references an id that does not exist. Any referenced id must exist.

Invalid EML due to a conflicting id attribute and a <references> element

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Meyer</surName>
      </individualName>
    </creator>
    ...
    <contact id="522">
      <references>23445</references>
    </contact>
  </dataset>
</eml:eml>

This instance document is invalid because the contact element both references another element and has an id itself. If an element references another element, it may not have an id. This prevents circular references.

A valid EML document

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    ...
    <contact>
      <references>23446</references>
    </contact>
    <contact>
      <references>23445</references>
    </contact>
  </dataset>
</eml:eml>

This instance document is valid. Each contact is referencing one of the creators above and all the ids are unique.
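Once a document is known to be valid, a consumer can resolve each reference to the element that defines the content. The Python sketch below shows one way this might be done with the standard library. The function name resolve_references and the shortened sample are hypothetical, and the code assumes the document has already passed the ID and reference checks (a dangling reference would raise a KeyError here).

```python
import xml.etree.ElementTree as ET


def resolve_references(root):
    """Pair each referencing element with the element whose id it names."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    resolved = []
    for element in root.iter():
        ref = element.find("references")
        if ref is not None:
            resolved.append((element, by_id[(ref.text or "").strip()]))
    return resolved


# Hypothetical fragment modeled on the valid example document.
SAMPLE = """<dataset id="ds.1">
  <creator id="23445"><individualName><surName>Smith</surName></individualName></creator>
  <creator id="23446"><individualName><surName>Smith</surName></individualName></creator>
  <contact><references>23446</references></contact>
  <contact><references>23445</references></contact>
</dataset>"""

root = ET.fromstring(SAMPLE)
pairs = [(source.tag, target.tag, target.get("id"))
         for source, target in resolve_references(root)]
for source_tag, target_tag, target_id in pairs:
    print(f"{source_tag} -> {target_tag} {target_id}")
```

Both creators share the surname Smith, yet they remain distinct objects because their ids differ; each contact is tied to exactly one of them through its reference.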

Module Descriptions (Normative)
