# Validation and Content references

This section explains the validation rules of EML. While most of the validation rules are expressed as constraints within the XML Schema definition files, some rules cannot be written directly into the XML Schemas or enforced by an XML parser. These additional validation rules MUST be enforced by every EML package in order for it to be considered EML-compliant.

## Validation rules

For a document to be EML-valid, all of the following constraints must hold true:

- The document MUST validate using a compliant XML Schema validating parser
- All EML documents MUST have the `eml` module as the root
- A `packageId` attribute MUST be present on the root `eml` element
- All `id` attributes within the document MUST be unique
- Elements which contain an `annotation` child element MUST contain an `id` attribute, unless that `annotation` child element contains a `references` attribute
- If an element references another using a child `references` element, another element with that value in its `id` attribute MUST exist in the document
- When `references` is used, the `system` attribute MUST have the same value in both the target and source elements, or it must be absent in both. Frequently it is absent in both.
- If an element references another using a child `references` element, it MUST NOT have an `id` attribute itself
- If an `additionalMetadata` element references another using a child `describes` element, another element with that value in its `id` attribute MUST exist in the document

## Validation algorithm

One reasonable algorithm for assessing these constraints without loading the XML into a DOM structure is to check `id` and `references` fields while parsing the document, storing their values in `identifiersHash` and `referencesHash` data structures, and then to run a final consistency check over those two structures.
For example, in pseudocode:

- Parse the XML document using an XML Schema-compliant parser
    - If the root element is not `eml`, then the document is invalid
- For each element, record whether it has an `id` attribute or not
    - If an element does not contain an `id`, but it has a child `annotation` element, and that child annotation does not contain a `references` attribute, then the document is invalid
- For each `id` attribute
    - If `id` is not in `identifiersHash` then add it as the key of `identifiersHash`, with its `system` as the value
    - If `id` is already in `identifiersHash` then the document is invalid
    - If the element containing the id contains a `references` element as an immediate child then the document is invalid
- For each `references` element
    - If the `references` key is not in `referencesHash`, then add it as a key with the `system` value to `referencesHash`
    - If the `references` key is in `referencesHash`, but the current `system` value does not match the value for that key, then the document is invalid
- For each `references` attribute on an `annotation` element
    - If the `references` key is not in `referencesHash`, then add it as a key with the empty string '' value to `referencesHash`
- For each `describes` element within an `additionalMetadata` element
    - If the `describes` key is not in `referencesHash`, then add it as a key with the empty string '' value to `referencesHash`
- Once document processing is complete, for each `key` in `referencesHash`
    - If `!identifiersHash.hasKey(key) OR referencesHash[key] != identifiersHash[key]` then the document is invalid
- If no validity errors are found above or by the parser, then the document is valid

## Content references

Each EML module, with the exception of "eml" itself, has a top-level choice between the structured content of that element or a "references" field. This enables the reuse of content previously defined elsewhere in the document.
This allows, for example, an author to create a single `creator` element with all of its child detail, and then reference it from a `contact` element, using `m.jones` as the `references` value, to indicate that the same person is both the creator and contact. This creates an unambiguous linkage via the `id` field that the two elements refer to the same entity, in this case a person, and avoids having to re-enter the same information multiple times in the document. Another common location for reuse is when a single `attributeList` is defined with a set of variables and their metadata, and then that list is referenced in multiple `dataTable` elements to show that they are structured identically.

The reuse of structured content is accomplished through the use of `id`/`references` pairs. Each element that is to be reused will contain a unique `id` attribute on the element. Because this identifier is guaranteed to be unique within the EML document, any other location that wants to point at that content can do so using the `references` element, as shown in the example above. These types of references can also be used in the `references` attribute of `annotation` elements, and in the `describes` element within the `additionalMetadata` element.

If an `id` attribute is provided for content, then that content is considered to represent a different entity than all other elements that are defined in the document, except for those that include its `id` in the `references` child. This is useful to indicate, for example, that two people with similar names (e.g., "D. Clark" and "D. Clark") are in fact distinct individuals (e.g., "Deborah Clark" and "David Clark"), or that two variables with the same `attributeName` are in fact different variables. While it would be bad practice to reuse attribute names like this, it does happen, and EML needs to be able to document it when it does.
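
As a rough illustration of how an `id`/`references` pair can be followed, the sketch below looks up the element that a `references` child points to. It is a minimal example and not part of EML itself: the XML fragment is a hypothetical, simplified one (no namespaces, not schema-valid EML), the surname `Jones` is made up, and `resolve` is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: not a complete or schema-valid EML document.
DOC = """
<dataset>
  <creator id="m.jones">
    <individualName><surName>Jones</surName></individualName>
  </creator>
  <contact>
    <references>m.jones</references>
  </contact>
</dataset>
"""


def resolve(root, element):
    """Return the element whose content `element` reuses, or `element` itself.

    Raises KeyError if the `references` value matches no `id` in the document.
    """
    ref = element.find("references")
    if ref is None:
        return element  # structured content given inline, nothing to follow
    # Index every element that carries an id attribute.
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    return by_id[(ref.text or "").strip()]


root = ET.fromstring(DOC)
target = resolve(root, root.find("contact"))
print(target.tag, target.findtext("individualName/surName"))  # creator Jones
```

Because the `contact` resolves to the very same `creator` element, a consumer can treat the two roles as one person without comparing names.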
## EML Validity Parser

Because some of these rules cannot be enforced in XML Schema, we have written a parser which checks the validity of the references and `id`s used in a document. This parser is included with the release of EML. To run the parser, you must have Java installed. To execute it, change into the top-level directory of the EML release and run the `validate.sh` script, passing your EML instance file as a parameter. There may also be an [online version](https://knb.ecoinformatics.org/emlparser) of this parser, which is publicly accessible. The validator both validates your XML document against the schema and checks the integrity of your references.

## id and Scope Examples

**Example: Invalid EML due to duplicate identifiers**

```xml
Sample Dataset Description Smith Smith ...
```

This instance document is invalid because both creator elements have the same id. No two elements can have the same string as an id.

**Example: Invalid EML due to a non-existent reference**

```xml
Sample Dataset Description Smith Myer ... 23447
```

This instance document is invalid because the contact element references an `id` that does not exist. Any referenced `id` must exist in the document.

**Example: Invalid EML due to a conflicting id attribute and a `references` element**

```xml
Sample Dataset Description Smith Meyer ... 23445
```

This instance document is invalid because the contact element both references another element and has an id itself. If an element references another element, it may not have an id. This prevents circular references.

**Example: A valid EML document**

```xml
Sample Dataset Description Smith Smith ... 23446 23445
```

This instance document is valid. Each contact references one of the creators above, and all the ids are unique. The fact that each creator has its own `id` indicates that they are different people, even though they have the same `surName` and there is no other distinguishing metadata.
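
The four cases above can be checked with a short sketch of the `id`/`references` rules from the validation algorithm. This is an illustration under stated assumptions, not the parser shipped with EML: it does not perform XML Schema validation, it uses Python's `xml.etree.ElementTree` (which builds a tree in memory rather than checking while parsing), and the fragments are hypothetical, namespace-free stand-ins for the examples. The creator ids, the `packageId` value, and the contact id `c.1` are assumed for illustration, and `check_references` is an invented name.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def check_references(xml_text):
    """Return a list of error strings; an empty list means these checks passed."""
    root = ET.fromstring(xml_text)
    errors = []
    identifiers = {}  # id -> system ('' when the attribute is absent)
    references = {}   # referenced id -> system
    if local(root.tag) != "eml":
        errors.append("root element is not eml")
    if root.get("packageId") is None:
        errors.append("root element has no packageId")
    for el in root.iter():
        name = local(el.tag)
        el_id = el.get("id")
        if el_id is None:
            for child in el:
                if local(child.tag) == "annotation" and child.get("references") is None:
                    errors.append(f"{name} has an annotation but no id")
        else:
            if el_id in identifiers:
                errors.append(f"duplicate id: {el_id}")
            else:
                identifiers[el_id] = el.get("system", "")
            if any(local(child.tag) == "references" for child in el):
                errors.append(f"element with id {el_id} also has a references child")
        if name == "references":
            key, system = (el.text or "").strip(), el.get("system", "")
            # setdefault stores the first system seen and returns it afterwards
            if references.setdefault(key, system) != system:
                errors.append(f"conflicting system values for reference: {key}")
        elif name == "annotation" and el.get("references") is not None:
            references.setdefault(el.get("references"), "")
        elif name == "additionalMetadata":
            for child in el:
                if local(child.tag) == "describes":
                    references.setdefault((child.text or "").strip(), "")
    # Final consistency check between the two tables.
    for key, system in references.items():
        if key not in identifiers or identifiers[key] != system:
            errors.append(f"unresolved reference: {key}")
    return errors


def sample(body):
    # Hypothetical wrapper; a real EML root also declares namespaces.
    return f'<eml packageId="example.1"><dataset>{body}</dataset></eml>'


CREATOR = '<creator id="{}"><individualName><surName>Smith</surName></individualName></creator>'
CONTACT = "<contact{}><references>{}</references></contact>"

duplicate = sample(CREATOR.format("23445") + CREATOR.format("23445"))
dangling = sample(CREATOR.format("23445") + CREATOR.format("23446")
                  + CONTACT.format("", "23447"))
conflict = sample(CREATOR.format("23445") + CREATOR.format("23446")
                  + CONTACT.format(' id="c.1"', "23445"))
valid = sample(CREATOR.format("23445") + CREATOR.format("23446")
               + CONTACT.format("", "23446") + CONTACT.format("", "23445"))

for label, doc in [("duplicate", duplicate), ("dangling", dangling),
                   ("conflict", conflict), ("valid", valid)]:
    print(label, check_references(doc))
```

Collecting errors in a list, rather than stopping at the first one, is a design choice made here so that a single run reports every broken reference; the pseudocode itself only requires deciding whether the document is valid.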