EML Frequently Asked Questions

What is EML?

EML stands for Ecological Metadata Language. It exists as a set of XML Schema documents that allow for the structural expression of metadata necessary to document a typical data set in the ecological sciences.

Who is responsible for EML?

The first two released versions of EML, EML 1.0 and EML 1.4.1, were developed at the National Center for Ecological Analysis and Synthesis (NCEAS), University of California at Santa Barbara, in Santa Barbara, California, USA. The effort to produce EML 2.x (and all of the beta releases preceding it) is organized through the EML Project, an open source, community-oriented project dedicated to providing a high-quality metadata specification for describing data relevant to the ecological discipline. The project is composed entirely of voluntary project members who donate their time and experience in order to advance information management for ecology. Project decisions are made by consensus according to the voting procedures described in the ecoinformatics.org Charter. Significant contributions to these recent releases have come from individuals at NCEAS, the Long Term Ecological Research Program (CAP, NET, KBS, JRN), and the Joseph W. Jones Ecological Research Center in Newton, GA.

Why would I want to use EML when FGDC now supports biological data through the CSDGM?

First, EML is modular and extensible. The Content Standard for Digital Geospatial Metadata (CSDGM) developed by the Federal Geographic Data Committee (FGDC) is a monolithic standard, so it is difficult to mix and match parts of it with other standards, mainly because of all of its spatial requirements. We therefore built EML as a series of modules that can be linked together and can be linked to other metadata standards. This gives us the most flexibility, and given that we can easily translate EML into CSDGM-compliant documents, there is little cost.

Second, we're building advanced data processing tools that can automatically parse data sets and analyze them based on the EML metadata descriptions. Due to various shortcomings in the FGDC standard, mostly stemming from its tight focus on spatial data, we have found that the CSDGM isn't adequate for these needs. For example, how can one add machine-parsable, semantically oriented attribute tags to CSDGM? You can't, because it is monolithic and doesn't permit dynamic ties to other metadata specifications. The only extension method is the administrative challenge of creating a superset of the CSDGM, which is not very maintainable. In addition, the level of granularity for metadata in FGDC is very patchy: it goes into great detail for spatial projections and the like, but is incredibly terse with respect to describing methods and non-standard data formats. This is appropriate in the spatial world, where there are few data formats (< 100, many sensor derived streams), but not so good in ecology, where there is no standardization of data formats (>>>5000, very few sensor derived).

Is there documentation for EML in English?

Yes, there is a formal specification of EML describing its development history, architecture, and modules. The intent of each module is described in narrative, and there is a technical description of each module in XML notation. The technical description includes an element-by-element description of the module. We will eventually provide usage examples.

Why is EML such an important development?

The last decade has witnessed a tremendous explosion of ecological and environmental data, catalyzed by societal concerns and facilitated by advancing technologies. These data have the potential to greatly enhance understanding of the complexity of the biosphere.
However, broad-scale or synthetic research is stymied because data are largely unorganized and inaccessible as a consequence of their tremendous heterogeneity, complexity, and spatial dispersion in many separate repositories. EML is the first content standard designed specifically to address these issues for ecological data. Wide adoption and use of EML will create exciting new opportunities for data discovery, access, integration, and synthesis.

How do I get EML?

All the documents associated with the EML development effort are available via the project web server at http://knb.ecoinformatics.org/software/eml/. They are licensed under the GPL (GNU General Public License) and can be freely distributed and modified.

The EML Schema documents are quite complex. An average ecologist probably cannot, and more likely does not want to, mark up content in an XML editor. How then do you get content into EML?

The Knowledge Network for Biocomplexity project has developed a software client specifically to address this need. Morpho (named after the butterfly genus) is written in Java, which makes it portable across computer platforms, and combines an easy-to-use interface to EML with a number of tools that make it easier for ecologists to document data. These include a reverse-engineering wizard. Morpho is available from http://knb.ecoinformatics.org/software. Morpho currently supports the EML 2.0.0beta6 release, although the Morpho developers will update it to support the EML 2.0.0 release as soon as it has stabilized. In addition, the Xylographa and Xanthoria systems are designed to assist in editing EML documents and in producing EML-compliant metadata from existing database systems.

EML contains provisions for communication. Is it possible to document in EML dynamic online data resources?

Yes, there are provisions in the eml-physical module for descriptions of online data resources.
The eml-physical module describes the structural characteristics of data formats as delivered over the wire or as found in a file system. One physical object (which can be a bytestream or an object in a file system) might contain multiple entities (for example, this would be typical of an MS Access file that contains multiple tables of data). However, the module is typically used to describe a file or stream in some text-based format such as ASCII or UTF-8, and it includes the information needed to parse the data stream to extract the entity and its attributes. There are three distribution types: online, offline, and inline. To describe an online dataset in EML, you would populate the online element with the distribution information.

Do I need to download special software to use EML?

No, but there is software available to work with EML. See FAQ 7 above, "How then do you get content into EML?"

How can I get my existing metadata into EML?

There are several approaches that can be used to convert existing metadata into EML, depending on what form your existing metadata take.

Case 1: If your metadata is currently in a text format (not stored in a database), use one of the following conversion methods.
- Write a script (Perl, PHP, Java, etc.) to convert the text into EML-compliant XML.
- Convert the text metadata into XHTML (HTML that is XML compliant). Write an XSLT script to transform the XHTML file into EML-compliant XML.
- Use a special-purpose XML editor that generates EML (Morpho or Xylographa) and manually retype the metadata.
- Use a general-purpose XML development tool such as XML Spy that can create a sample document from an XML Schema, and retype the metadata manually.
- Use a simple text editor and do everything from scratch.
- Use specialized data transformation software such as the Data Junction suite to extract text data and then map it into an EML structure.

Case 2: If your metadata is stored in a relational database, use one of the following conversion methods.
- Both Microsoft SQL Server and Oracle have utilities to generate XML from their databases. If you use such a tool, you will have to write an XSLT script to transform the generated XML into EML.
- Use a vendor-neutral database-to-XML generator such as Cocoon (a free, open source Apache tool). Cocoon can query the database and generate XML, and it has a tool for creating the XSL Transformation scripts that convert the first-stage XML output into EML format.
- Use a specialized tool such as Xanthoria (like Cocoon in many respects, but easier to use) to generate XML from the database. Then use a tool such as XML Spy or Stylus Studio to develop the XSLT script that converts the generated XML into EML-compliant XML.
- Use specialized data transformation software such as Data Junction to query the database and map it into an EML structure.

Case 3: If your metadata is already in XML but in some other form such as NBII or FGDC, use the following conversion method.
- Write an XSLT script to convert from the current format to EML (e.g., FGDC to EML).

NOTE: In each of these cases it may be necessary to add some additional metadata in order to produce EML-compliant documents. Morpho will automatically create EML-compliant metadata, either by adding the required content for you or by indicating that certain fields are mandatory.

Once I convert my metadata into EML, what do I do with it? If I am storing all my metadata in text-based EML files, how am I supposed to query them or use them for data management?

EML is an exchange standard for communication of metadata, but it can also be used as the framework for a data management system. Metacat is a multipurpose XML metadata and data repository that is optimized for use with EML. If you store your metadata in a relational database management system (RDBMS), or plan to, there are also solutions. Cocoon and Xanthoria are examples of programs that can get EML out of an RDBMS.
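As a rough illustration of what "getting EML out of an RDBMS" involves, the Python sketch below reads rows from a database table and serializes each one as a small EML-style XML document. It is not how Cocoon or Xanthoria work internally. The table name dataset_metadata, its columns, the "example" system value, the namespace string, and the handful of elements used are all assumptions made for this example, and the output is not validated against the EML schemas.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace assumed for EML 2.0.0 documents; adjust to the release you target.
EML_NS = "eml://ecoinformatics.org/eml-2.0.0"


def row_to_eml(row):
    """Serialize one (package_id, title, creator_surname) row as EML-style XML."""
    package_id, title, surname = row
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(f"{{{EML_NS}}}eml",
                      {"packageId": package_id, "system": "example"})
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    creator = ET.SubElement(dataset, "creator")
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "surName").text = surname
    return ET.tostring(root, encoding="unicode")


def export_all(conn):
    """Query the (hypothetical) metadata table and return one XML string per row."""
    rows = conn.execute(
        "SELECT package_id, title, creator_surname "
        "FROM dataset_metadata ORDER BY package_id"
    )
    return [row_to_eml(row) for row in rows]


def export_demo():
    """Build an in-memory database with one sample row and export it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset_metadata "
        "(package_id TEXT, title TEXT, creator_surname TEXT)"
    )
    conn.execute(
        "INSERT INTO dataset_metadata VALUES (?, ?, ?)",
        ("demo.1.1", "Example plot survey", "Jones"),
    )
    try:
        return export_all(conn)
    finally:
        conn.close()
```

Calling export_demo() returns a list with one XML string. A real site would point export_all at its existing database connection and would still need to check the result with an EML-aware validator.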
Cocoon and Xanthoria are both Java applications that use Java database connection hooks and style sheets to retrieve and format data. Xanthoria is a lightweight solution, and the XSLT stylesheets for EML 2.0 have already been written. This solution lets a site stick with the RDBMS that it probably has integrated with its site management activities, yet also have its metadata exposed via EML.

Does the modularity of EML mean that one description can be shared by many documents?

In a previous version, EML packages (via RDF-like triples) supported linking across packages, so you could reuse the same document in multiple packages. In EML 2.0.0 Release Candidate 1 we redesigned the packaging structure to allow linking only within a single package. Thus, one could reuse a party description or attribute list within a package, but not across several. This is a compromise that keeps some reusability but has fewer management problems. Along with this change comes the ability to put all metadata and data in a single document for transport, while still not limiting ourselves to a monolithic structure. This has benefits (akin to db normalization) and costs (access control, ownership, and multiple update problems).

How are EML modules linked together?

With "id" attributes and "references" elements in each module. Certain modules within EML allow you to identify specific subtrees with a unique identifier (id). This identifier can then be used in place of content in other parts of the EML document by placing it in a "references" element. Our general approach in EML has been to create ComplexTypes (CT) when we wanted a particular block to be reusable. This concept was extended for linking modules together by adding an optional attribute named "id" of type "xs:string" to each ComplexType. This allows us to uniquely address each block defined by a CT.
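The id/references mechanism described above can be sketched in a few lines of Python. This is a minimal illustration and not a substitute for a real EML validator: the sample fragment and the function name check_references are hypothetical, and the only rules checked are that each id value is unique and that every references value points at an existing id.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one creator carries an id, two other blocks reuse it.
SAMPLE = """
<dataset>
  <creator id="id.p1">
    <individualName><surName>Jones</surName></individualName>
  </creator>
  <associatedParty>
    <references>id.p1</references>
    <role>lackey</role>
  </associatedParty>
  <contact>
    <references>id.p1</references>
  </contact>
</dataset>
"""


def check_references(xml_text):
    """Return a list of problems found in the id/references links."""
    root = ET.fromstring(xml_text)
    ids = {}
    problems = []
    # Collect every subtree that declares an id; ids are optional but must be unique.
    for elem in root.iter():
        elem_id = elem.get("id")
        if elem_id is None:
            continue
        if elem_id in ids:
            problems.append(f"duplicate id: {elem_id}")
        ids[elem_id] = elem
    # Every references element must name an id declared somewhere in the document.
    for ref in root.iter("references"):
        target = (ref.text or "").strip()
        if target not in ids:
            problems.append(f"unresolved reference: {target}")
    return problems
```

For the sample above, check_references(SAMPLE) returns an empty list. A document whose references element names an id that is never declared produces an "unresolved reference" entry instead.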
For the "ResourceBase" CT, this id attribute replaces the "identifier" element and acts as the overall identifier for the package. The content model for each CT is a choice between the existing content model and a new element named "references" of type "xs:string". This element is used to hold a reference to an existing subtree identified by its id. This relationship between the "references" element and the "id" identifiers is enforced by defining a "key" for the "id" attributes and a "keyref" for the "references" elements. This use of a key and keyref differs slightly from the XML Schema case because in XML Schema keys cannot be null, whereas we want people to be able to omit the "id" attribute. Consequently, we have incorporated the rules about the correspondence between keys and keyrefs into the EML specification, but not into the schemas directly. Thus, to confirm that an EML document is valid, you must use a parser that understands the referencing system in EML and can check that it is used correctly. An example system that handles this key validation is shipped with the EML distribution (see the "EML Parser").

Here's a fragment of an example XML document to illustrate:

[The XML markup of this example was lost; the surviving text values are: Jones, id.p1, lackey, id.p1 ...]

This even works for types that extend other types, as long as the subclass is the one that does the referencing (e.g., associatedParty can reference creator, but not vice versa).

Can I put data into EML as well as metadata?

Yes, there are provisions in the eml-physical module for inclusion of data. The module describes the structural characteristics of data formats as delivered over the wire or as found in a file system. One physical object (which can be a bytestream or an object in a file system) might contain multiple entities (for example, this would be typical of an MS Access file that contains multiple tables of data).
However, it is typically used to describe a file or stream in some text-based format such as ASCII or UTF-8, and it includes the information needed to parse the data stream to extract the entity and its attributes. There are three distribution types: online, offline, and inline. To include data in EML, you would populate the inline element with the data file described in the data format element. The data in the inline element should conform to the description provided by the eml-physical module. Binary data files can be included using Base64 encoding.

What can I do with my EML structured metadata?

Tools are currently being developed to allow automated heterogeneous data integration, analytical processing, and quality testing based on EML metadata. In general, using a metadata standard such as EML will lessen your data entropy and make your data more useful to you and others in the future.

Can I validate my EML documents against the DTD?

No. As of EML 2.0.0 we are no longer creating DTDs as part of the EML release. Only XML Schemas will be released. Even then, there are some EML rules that are not expressible in XML Schema and for which you must use a specialized validator, such as the "EML Parser" that ships with the distribution.

Are there required elements in EML?

Yes. Although we've made every attempt to limit required elements in the interest of flexibility, a number of pieces of information are required to make sense of the metadata document. To make the metadata more useful, we also provide recommended usages for the modules. See the specification for details about required fields and recommended usage. In the future we may provide usage compliance information, so that if you want your data and metadata to be useful in a particular analytical context, you will be told which elements of EML are required for that purpose.

There appear to be multiple places to put some types of metadata in EML.
How do I know which of these places is the right place for my information?

The EML Specification describes each element in a detailed, normative manner. EML is hierarchical, so where you use different elements is very important. For instance, if you use a TemporalCoverage element and reference it to a dataset element, you are saying that the entire dataset took place during that time. If, instead, you reference it to a DataTable, you are saying that only that table was covered by that time period. You must gauge exactly what you are trying to describe in the structure that you are using. Questions about possible bugs in the definitions of elements can be posed via email to the eml-dev mailing list.

The differences between "method" and "protocol" seem to be very subtle in EML. How do I distinguish between the two?

The eml-methods module describes the methods followed in the creation of the dataset being described, including descriptions of field, laboratory, and processing steps, sampling methods and units, and quality control procedures. It is used to describe the "actual" procedures used in the creation or subsequent processing of a dataset. Likewise, eml-methods is used to describe processes that have been used to define or improve the quality of a data file, or to identify potential problems with the data file. The eml-protocol module is intended to document a "prescribed" procedure, whereas the eml-methods module is used to describe procedures that were actually performed. The distinction is that the term "protocol" is used in the "prescriptive" sense, and the term "method" is used in the "descriptive" sense. This distinction allows managers to build a protocol library of well-known, established protocols (procedures), but also to document what procedure was truly performed in relation to the established protocol.
The method may have diverged from the protocol purposefully, or perhaps incidentally, but the procedural lineage is still preserved and understandable. The eml-methods module, like other modules, may be referenced via the <references> tag. This allows a method to be described once and then used as a reference in other locations within the EML document via its ID.

How can 'references' be treated in XSLT transformations of EML?

XSLT can be used to transform EML to other formats, but the treatment of 'references' elements is somewhat complicated. A text file describing the details of one method for handling the 'references' elements is available at http://knb.ecoinformatics.org/software/eml/eml-2.0.1/references_XSLT.txt

I'm interested in contributing to EML. Can I?

We welcome contributions to this work in any form. Individuals who invest substantial amounts of time and make valuable contributions to the development and maintenance of EML (in the opinion of current project members) will be invited to become EML project members according to the rules set forth in the ecoinformatics.org Charter. Contributions can take many forms, including developing the EML schemas, writing documentation, and helping with maintenance, among others. If you would like to contribute person-hours to this project and discuss how that might work, you can contact the eml-dev mailing list. In general, we want all of the help we can get!

Where can I get EML?

You can download archived releases from http://knb.ecoinformatics.org/software/eml/ or you can check out the latest development version from our CVS server.