The eml-physical module - Physical file format

'$RCSfile: eml-physical.xsd,v $'

Copyright: 1997-2002 Regents of the University of California, University of New Mexico, and Arizona State University

Sponsors: National Center for Ecological Analysis and Synthesis and Partnership for Interdisciplinary Studies of Coastal Oceans, University of California Santa Barbara; Long-Term Ecological Research Network Office, University of New Mexico; Center for Environmental Studies, Arizona State University

Other funding: National Science Foundation (see README for details); The David and Lucile Packard Foundation

For details: http://knb.ecoinformatics.org/

'$Author: obrien $' '$Date: 2009-03-05 22:33:04 $' '$Revision: 1.82 $'

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

The eml-physical module describes the external and internal physical characteristics of a data object as well as the information required for its distribution. Examples of the external physical characteristics of a data object would be the filename, size, compression, encoding methods, and authentication of a file or byte stream. Internal physical characteristics describe the format of the data object being described. Named binary or otherwise proprietary formats can be cited (e.g., Microsoft Access 2000), and text formats can be precisely described (e.g., ASCII text delimited with commas). For these text formats, the module also includes the information needed to parse the data object to extract the entity and its attributes. Distribution information describes how to retrieve the data object. The retrieval information can be either online (e.g., a URL or other connection information) or offline (e.g., a data object residing on an archival tape). The eml-physical module, like other modules, may be "referenced" via the <references> tag. This allows a physical document to be described once, and then used as a reference in other locations within the EML document via its ID.

Any data object that is being described by EML needs this information so that the entities and attributes that reside within the data object can be extracted.

Physical structure. Physical structure of an entity or entities. The content model for physical is a CHOICE between "references" and all of the elements that let you describe the internal/external characteristics and distribution of a data object (e.g., dataObject, dataFormat, distribution). A physical element can contain a reference to a physical element defined elsewhere. Using a reference means that the referenced physical is identical, not just in name but in its complete description.

The eml-physical module describes the physical characteristics of a data object and the information required for its distribution. External physical characteristics include the filename, size, compression, encoding methods, and authentication of a file or byte stream. Internal physical characteristics describe the format of the data object. Proprietary formats can be cited (e.g., Microsoft Access 2000), or text formats can be precisely described (e.g., ASCII text delimited with commas). The module includes the information needed to parse the text data object to extract the entity and its attributes. Distribution information describes how to retrieve the data object, either as online (a URL or connection definition), offline (e.g., a data object residing on an archival tape), or inline (i.e., the data are included with the metadata). Like many other EML elements, a physical Type can contain a reference to another physical element defined elsewhere in the document instead of a description of the resource. Using a reference means that the referenced physical is identical, not just in name but identical in its complete description.
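As a rough illustration of the reference mechanism and the three distribution modes described above, the following sketch parses a small fragment with Python's standard ElementTree module. The element names follow the names used in this documentation, but the `dataset` wrapper, the `phys.1` id, the example URL, and the helper functions `resolve_physical` and `distribution_kind` are assumptions made up for this example, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the wrapper element, id value, and URL are invented
# for illustration only.
SAMPLE = """
<dataset>
  <physical id="phys.1">
    <objectName>rainfall-sev-2002-10.txt</objectName>
    <dataFormat><textFormat/></dataFormat>
    <distribution><online><url>http://example.org/rainfall.txt</url></online></distribution>
  </physical>
  <physical><references>phys.1</references></physical>
</dataset>
"""


def resolve_physical(root, physical):
    """Return the physical element that actually carries the description."""
    ref = physical.find("references")
    if ref is None:
        return physical
    target_id = (ref.text or "").strip()
    for candidate in root.iter("physical"):
        if candidate.get("id") == target_id:
            return candidate
    raise KeyError(f"no physical element with id {target_id!r}")


def distribution_kind(physical):
    """Report whether the data are distributed online, offline, or inline."""
    dist = physical.find("distribution")
    if dist is None:
        return None
    for kind in ("online", "offline", "inline"):
        if dist.find(kind) is not None:
            return kind
    return None


root = ET.fromstring(SAMPLE)
summary = []
for phys in root.findall("physical"):
    resolved = resolve_physical(root, phys)
    summary.append((resolved.findtext("objectName"), distribution_kind(resolved)))
print(summary)
```

Both entries in `summary` describe the same object, because the second physical element only points at the first one by its id.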

Data object name. The name of the data object. This is possibly distinct from the entity name in that one physical object can contain multiple entities, even though that is not a recommended practice. The objectName is often the filename of a file in a file system or of a file that is accessible on the network. Example: rainfall-sev-2002-10.txt

Data object size. Describes the physical size of the data object. This element contains the physical size of the entity, represented in bytes by default unless the unit attribute is provided to change the units. Example: 134

Unit of measurement. The unit of measurement for the size of the entity; the default is byte. Example: byte

Authentication value. A value, typically a checksum, used to authenticate that the bitstream delivered to the user is identical to the original. This element describes authentication procedures or techniques, typically by giving a checksum value for the object. The method used to compute the authentication value (e.g., MD5) is listed in the method attribute. Example: f5b2177ea03aea73de12da81f896fe40

Authentication method. The method used to calculate an authentication checksum that can be used to validate a bytestream. Typical checksum methods include MD5 and CRC. Example: MD5

Compression method. Name of a compression method applied. This element lists a compression method used to compress the object, such as zip, compress, etc. Compression and encoding methods must be listed in the order in which they were applied, so that decompression and decoding occur in the reverse order of the listing. For example, if a file is compressed using zip and then encoded using MIME base64, the compression method would be listed first and the encoding method second.
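The ordering rule for compression and encoding, together with the checksum check, can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `unpack` and `matches_checksum` helpers and the two lookup tables are hypothetical, only gzip and base64 are handled, and the sketch assumes the checksum was computed over the fully restored bytes (the documentation does not say which form of the object is checksummed).

```python
import base64
import gzip
import hashlib

# Hypothetical lookup tables; a real processor would support more methods.
DECOMPRESSORS = {"gzip": gzip.decompress}
DECODERS = {"base64": base64.b64decode}


def unpack(data, methods):
    """Undo the listed methods in reverse order of application.

    `methods` is a list of (kind, name) pairs in the order they were applied,
    e.g. [("compression", "gzip"), ("encoding", "base64")].
    """
    for kind, name in reversed(methods):
        table = DECOMPRESSORS if kind == "compression" else DECODERS
        data = table[name](data)
    return data


def matches_checksum(data, expected_hex, method="MD5"):
    """Compare a digest of `data` with the recorded authentication value."""
    digest = hashlib.new(method.lower(), data).hexdigest()
    return digest == expected_hex.lower()


# Simulate a provider that compresses first and encodes second.
original = b"date,rain_mm\n2002-10-01,3.2\n"
delivered = base64.b64encode(gzip.compress(original))
methods = [("compression", "gzip"), ("encoding", "base64")]

restored = unpack(delivered, methods)
checksum = hashlib.md5(original).hexdigest()
print(restored == original, matches_checksum(restored, checksum))
```

Because the list is walked in reverse, the base64 layer is removed before the gzip layer, mirroring the rule that decoding and decompression happen in the opposite order to the listing.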
Compression method examples: zip, gzip, compress

Encoding method. Name of an encoding method applied. This element lists an encoding method used to encode the object, such as base64, BinHex, etc. Compression and encoding methods must be listed in the order in which they were applied, so that decompression and decoding occur in the reverse order of the listing. For example, if a file is compressed using zip and then encoded using MIME base64, the compression method would be listed first and the encoding method second. Examples: base64, uuencode, binhex

Character encoding. Contains the name of the character encoding used for the data. This is typically ASCII or UTF-8, or one of the other common encodings. Example: UTF-8

Data format. Describes the internal physical format of a data object. This element is the parent of a CHOICE between four possible internal physical formats, which describe the internal physical characteristics of the data object. Using this information, the user should be able to parse the physical object to extract the entity and its attributes. Note that this is the format of the physical object itself.

Text format. Description of a text formatted object. The description includes detailed parsing instructions for extracting attributes from the bytestream for simple delimited file formats (e.g., CSV), fixed format files that use fixed columns for attribute locations, and mixtures of the two. It also supports records that span multiple lines.

Number of header lines. Number of header lines preceding data. Lines are determined by the physicalLineDelimiter, or if it is absent, by the recordDelimiter. This value indicates the number of header lines that should be skipped before starting to parse the data. Example: 4

Number of footer lines. Number of footer lines following data.
Lines are determined by the physicalLineDelimiter, or if it is absent, by the recordDelimiter. This value indicates the number of footer lines that should be skipped after parsing the data. If this value is omitted, parsers should assume the data continues to the end of the data stream. Example: 4

Record delimiter character. Character used to delimit records. This element specifies the record delimiter character when the format is text. The record delimiter is usually a linefeed (\n) on UNIX, a carriage return (\r) on MacOS, or both (\r\n) on Windows/DOS. Multiline records are usually delimited with two line ending characters, for example on UNIX it would be two linefeed characters (\n\n). As record delimiters are often non-printing characters, one can use the special values "\n" to represent a linefeed (ASCII 0x0a) and "\r" to represent a carriage return (ASCII 0x0d). Alternatively, one can use the hex value to represent character values (e.g., 0x0a). Example: \n\r

Physical line delimiter character. Character used to delimit physical lines. This element specifies the physical line delimiter character when the format is text. The line delimiter is usually a linefeed (\n) on UNIX, a carriage return (\r) on MacOS, or both (\r\n) on Windows/DOS. Multiline records are usually delimited with two line ending characters, for example on UNIX it would be two linefeed characters (\n\n). As line delimiters are often non-printing characters, one can use the special values "\n" to represent a linefeed (ASCII 0x0a) and "\r" to represent a carriage return (ASCII 0x0d). Alternatively, one can use the hex value to represent character values (e.g., 0x0a). If this value is not provided, processors should assume that the physical line delimiter is the same as the record delimiter. Example: \n\r

Physical lines per record. The number of physical lines in the file spanned by a single logical data record.
A single logical data record may be written over several physical lines in a file, with no special marker to indicate the end of a record. In such cases, it is necessary to know the number of lines per record in order to correctly read them. If this value is not provided, processors should assume that records are wholly contained on one physical line. If the value is greater than 1, then processors should examine the lineNumber field for each attribute to determine which line of the record contains the information. Example: 3

Maximum record length. The maximum number of characters in any record in the physical file. For delimited files, the record length varies and this is not particularly useful. However, for fixed format files that do not contain record delimiters, this field is critical to tell processors when one record stops and another begins. Example: 597

Orientation of attributes. Specifies whether the attributes described in the physical stream are found in columns or rows. The valid values are column or row. If set to 'column', then the attributes are in columns. If set to 'row', then the attributes are in rows. Row orientation is rare, but some systems such as SPlus and R utilize it. For example, some data with column orientation:

DATE PLOT SPECIES
2002-01-15 hfr5 acer rubrum
2002-01-15 hfr5 acer xxxx

The same data in a rowMajor table:

DATE 2002-01-15
PLOT hfr5
SPECIES acer rubrum acer xxxx

Valid values: column, row

Simple delimited format. A simple delimited format that uses one of a series of delimiters to indicate the ends of fields in the data stream. More complex formats such as fixed format or mixed delimited and fixed formats can be described using the "complex" element.

Field delimiter character. Character used to delimit the end of an attribute. This element specifies a character to be used in the object for indicating the ending column for an attribute.
The delimiter character itself is not part of the attribute value, but rather is present in the column following the last character of the value. Typical delimiter characters include commas, tabs, spaces, and semicolons. The only time the fieldDelimiter character is not interpreted as a delimiter is if it is contained in a quoted string (see quoteCharacter) or is immediately preceded by a literalCharacter. Non-printable delimiter characters can be provided as their hex values, and tab characters by the ASCII string "\t". Processors should assume that the field starts in the column following the previous field if the previous field was fixed, or in the column following the delimiter from the previous field if the previous field was delimited. Examples: , (comma), \t, 0x09, 0x20

Treat consecutive delimiters as one. Specification of how to handle consecutive delimiters while parsing. The collapseDelimiters element specifies whether sequential delimiters should be treated as a single delimiter or multiple delimiters. An example is when a space delimiter is used; often there may be several repeated spaces that should be treated as a single delimiter, but not always. The valid values are yes or no. If set to yes, consecutive delimiters will be collapsed to one. If set to no or absent (the default), consecutive delimiters will be treated as separate delimiters. Valid values: yes, no

Quote character. Character used to quote values for delimiter escaping. This element specifies a character to be used in the object for quoting values so that field delimiters can be used within the value. This basically allows delimiter "escaping". The quoteCharacter is typically a " or '. When a processor encounters a quote character, it should not interpret any following characters as a delimiter until a matching quote character has been encountered (i.e., quotes come in pairs).
It is an error to not provide a closing quote before the record ends. Non-printable quote characters can be provided as their hex values. Examples: ", '

Literal character. Character used to escape other special characters. This element specifies a character to be used for escaping special character values so that they are treated as literal values. This allows "escaping" for special characters like quotes, commas, and spaces when they are intended to be used in an attribute value rather than being intended as a delimiter. The literalCharacter is typically a \. Example: \

Complex text format. A complex text format that can describe delimited fields, fixed width fields, and mixtures of the two. This supports multiline records (where one record is distributed across multiple physical lines). When using the complex format, the number of textFixed and textDelimited elements should exactly equal the number of attributes that have been described for the entity, and the order of the textFixed and textDelimited elements should correspond to the order of the attributes as described in the entity. Thus, for a delimited file with fourteen attributes, one should provide exactly fourteen textDelimited elements.

Fixed format text. Describes the physical format of data sequences that use a fixed number of characters in a specified position in the stream to locate attribute values. This method is common in sensor-derived data and in legacy database systems. To parse it, one must know the number of characters for each attribute and the starting column and line to begin reading the value.

Field width. Field width in characters for fixed field length. Fixed width fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number.
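To make the fixed-format rules concrete, here is a small sketch that groups physical lines into records and slices each field by line number, starting column, and width. The function names `split_records` and `read_fixed`, the sample data, and the field layout are all hypothetical. The sketch also assumes that line and column numbers are 1-based and that surrounding blanks may be trimmed, neither of which is stated in this documentation.

```python
def split_records(text, line_delimiter="\n", lines_per_record=1, header_lines=0):
    """Group physical lines into logical records after skipping header lines."""
    lines = text.split(line_delimiter)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing delimiter leaves one empty string behind
    lines = lines[header_lines:]
    return [lines[i:i + lines_per_record]
            for i in range(0, len(lines), lines_per_record)]


def read_fixed(record_lines, fields):
    """Extract fixed-width fields from one logical record.

    `fields` holds (name, line_number, start_column, width) tuples,
    assumed here to be 1-based.
    """
    values = {}
    for name, line_number, start_column, width in fields:
        line = record_lines[line_number - 1]
        begin = start_column - 1
        # The field ends at the starting column plus the field width.
        values[name] = line[begin:begin + width].strip()
    return values


# Invented sample: one header line, then two physical lines per record.
text = (
    "site  date\n"
    "hfr5  2002-01-15\n"
    "   12.5   3\n"
    "hfr6  2002-01-16\n"
    "    0.0   7\n"
)
fields = [
    ("plot", 1, 1, 4),
    ("date", 1, 7, 10),
    ("rain", 2, 1, 7),
    ("count", 2, 8, 4),
]
rows = [read_fixed(rec, fields)
        for rec in split_records(text, lines_per_record=2, header_lines=1)]
print(rows)
```

Trimming whitespace is a convenience of this sketch; a processor that must preserve padding would return the raw slice instead.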
Example field width: 7

Physical line number. The line on which the data field is found, when the data record is written over more than one physical line in the file. A single logical data record may be written over several physical lines in a file, with no special marker to indicate the end of a record. In such cases, the relative location of a data field must be indicated by both relative row and column number. The lineNumber should never be greater than the number of physical lines per record. Example: 3

Start column. The starting column number for a fixed format attribute. Fixed width fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number. If the starting column is not provided, processors should assume that the field starts in the column following the previous field if the previous field was fixed, or in the column following the delimiter from the previous field if the previous field was delimited. Example: 58

Delimited format text. Describes the physical format of data sequences that use delimiters in the stream to locate attribute values. This method is common in data exported from spreadsheets and database systems. To parse it, one must know the character that indicates the end of each attribute and the line to begin reading the value.

Field delimiter character. Character used to delimit the end of a particular attribute. This element specifies a character to be used in the object for indicating the ending column for an attribute. The delimiter character itself is not part of the attribute value, but rather is present in the column following the last character of the value. Typical delimiter characters include commas, tabs, spaces, and semicolons.
The only time the fieldDelimiter character is not interpreted as a delimiter is if it is contained in a quoted string (see quoteCharacter) or is immediately preceded by a literalCharacter. Non-printable delimiter characters can be provided as their hex values, and tab characters by the ASCII string "\t". Processors should assume that the field starts in the column following the previous field if the previous field was fixed, or in the column following the delimiter from the previous field if the previous field was delimited. Examples: , (comma), \t, 0x09, 0x20

Treat consecutive delimiters as single. Specification of how to handle consecutive delimiters while parsing. The collapseDelimiters element specifies whether sequential delimiters should be treated as a single delimiter or multiple delimiters. An example is when a space delimiter is used; often there may be several repeated spaces that should be treated as a single delimiter, but not always. The valid values are yes or no. If set to yes, consecutive delimiters will be collapsed to one. If set to no or absent (the default), consecutive delimiters will be treated as separate delimiters. Valid values: yes, no

Physical line number. The line on which the data field is found, when the data record is written over more than one physical line in the file. A single logical data record may be written over several physical lines in a file, with no special marker to indicate the end of a record. In such cases, the relative location of a data field must be indicated by both relative row and column number. The lineNumber should never be greater than the number of physical lines per record. When parsing the first field on a physical line as a delimited field, processors should assume that the field data starts in the first column. Otherwise, follow the rules indicated under fieldDelimiter.
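The delimiter rules above (hex or "\t" spellings, collapsing of consecutive delimiters, quoting, and literal escapes) can be combined in one small splitter. This is only a sketch: `decode_token` and `split_fields` are hypothetical helper names, only single-character delimiters are handled, and treating the literal character as active inside quoted strings is an assumption of this example.

```python
def decode_token(token):
    """Turn a documented delimiter spelling into the actual character."""
    specials = {"\\t": "\t", "\\n": "\n", "\\r": "\r"}
    if token in specials:
        return specials[token]
    if token.lower().startswith("0x"):
        return chr(int(token, 16))  # e.g. "0x09" becomes a tab
    return token


def split_fields(line, delimiter=",", collapse=False, quote=None, literal=None):
    """Split one record into fields using a single-character delimiter."""
    fields, current = [], []
    in_quotes = False
    prev_was_delim = False
    i = 0
    while i < len(line):
        ch = line[i]
        if literal is not None and ch == literal and i + 1 < len(line):
            # The character after the literal character is taken verbatim.
            current.append(line[i + 1])
            prev_was_delim = False
            i += 2
            continue
        if quote is not None and ch == quote:
            in_quotes = not in_quotes
            prev_was_delim = False
        elif ch == delimiter and not in_quotes:
            if not (collapse and prev_was_delim):
                fields.append("".join(current))
                current = []
            prev_was_delim = True
        else:
            current.append(ch)
            prev_was_delim = False
        i += 1
    if in_quotes:
        raise ValueError("record ended before the closing quote")
    fields.append("".join(current))
    return fields


tab = decode_token("0x09")
print(split_fields("hfr5" + tab + "acer rubrum", delimiter=tab))
print(split_fields("a  b", delimiter=" ", collapse=True))
print(split_fields('x,"a,b",y', quote='"'))
```

With `collapse=False` (the default, matching an absent collapseDelimiters element), the input `"a  b"` would instead yield an empty field between the two spaces.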
Example line number: 3

Quote character. Character used to quote values for delimiter escaping. This element specifies a character to be used in the object for quoting values so that field delimiters can be used within the value. This basically allows delimiter "escaping". The quoteCharacter is typically a " or '. When a processor encounters a quote character, it should not interpret any following characters as a delimiter until a matching quote character has been encountered (i.e., quotes come in pairs). It is an error to not provide a closing quote before the record ends. Non-printable quote characters can be provided as their hex values. Examples: ", '

Literal character. Character used to escape other special characters. This element specifies a character to be used for escaping special character values so that they are treated as literal values. This allows "escaping" for special characters like quotes, commas, and spaces when they are intended to be used in an attribute value rather than being intended as a delimiter. The literalCharacter is typically a \. Example: \

Externally defined format. Information about a non-text or proprietary formatted object. The description names the format explicitly, but assumes a processor implicitly knows how to parse that format to extract the data. A format version can be included. This is mainly used for proprietary formats, including binary files like Microsoft Excel and text formats like ESRI's ArcInfo export format. This is not a recommended way to permanently archive data because the software to parse the format is unlikely to be available over extended periods, but is included to allow for commonly used physical formats.
Format name. Name of the format of the data object. Example: Microsoft Excel

Format version. Version of the format of the data object. Example: 2000 (9.0.2720)

Format citation. Citation providing more detail about the physical format, including parsing information or information about the software required for reading the object.

Raster image format. Contains binary raster data header parameters. The binaryRasterInfo element is a container for various parameters used to describe the contents of binary raster image files. In this case, it is based on a white paper on the ESRI site that describes the header information used for BIP and BIL files ("Extendable Image Formats for ArcView GIS 3.1 and 3.2").

Orientation for reading rows and columns. Specifies whether the data should be read across rows or down columns. The valid values are column or row. If set to 'column', then the data are read down columns. If set to 'row', then the data are read across rows. Valid values: column, row

Multiple band image. Information needed to properly interpret a multiband image.

Number of bands. The number of spectral bands in the image. Must be greater than 1. Example: 2

Layout. The organization of the bands in the image file. Acceptable values are bil (band interleaved by line), bip (band interleaved by pixel), and bsq (band sequential). Valid values: bil, bip, bsq

Number of bits. The number of bits per pixel per band. Acceptable values are typically 1, 4, 8, 16, and 32. The default value is eight bits per pixel per band. For a true color image with three bands (R, G, B) stored using eight bits for each pixel in each band, nbits equals eight and nbands equals three, for a total of twenty-four bits per pixel.
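The three layouts and the bits-per-pixel arithmetic can be illustrated with a short sketch. It computes where a given sample sits in the sequence of stored samples for each layout. The function names are hypothetical, the row and column counts are assumed to come from elsewhere in the metadata, and the sketch assumes there are no header bytes or padding bytes in the file.

```python
def bits_per_pixel(nbits, nbands):
    """Total bits stored for one pixel across all bands."""
    return nbits * nbands


def sample_index(layout, row, col, band, nrows, ncols, nbands):
    """Zero-based position of one sample in the stream of stored samples.

    Assumes an unpadded file with no header; all arguments are zero-based
    indices or counts supplied by the caller.
    """
    if layout == "bip":   # all bands of a pixel are stored together
        return (row * ncols + col) * nbands + band
    if layout == "bil":   # each row stores one full line per band
        return (row * nbands + band) * ncols + col
    if layout == "bsq":   # each band is stored as a complete image
        return (band * nrows + row) * ncols + col
    raise ValueError(f"unknown layout: {layout!r}")


# True color example from the text: three bands of eight bits each.
print(bits_per_pixel(8, 3))

# Position of the third band of the second pixel in the first row
# of a hypothetical 2 x 4 image with three bands.
for layout in ("bip", "bil", "bsq"):
    print(layout, sample_index(layout, 0, 1, 2, nrows=2, ncols=4, nbands=3))
```

Multiplying the returned index by the number of bytes per sample gives a byte offset only under the no-padding assumption stated above.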
Example number of bits: 8

Byte order. The byte order in which values are stored. The byte order is important for sixteen-bit and higher images, which have two or more bytes per pixel. Acceptable values are little-endian (common on Intel systems like PCs) and big-endian (common on Motorola platforms). Valid values: little-endian, big-endian

Skip bytes. The number of bytes of data in the image file to skip in order to reach the start of the image data. This keyword allows you to bypass any existing image header information in the file. The default value is zero bytes. Example: 0

Bytes per band per row. The number of bytes per band per row. This must be an integer. This keyword is used only with BIL files when there are extra bits at the end of each band within a row that must be skipped. Example: 3

Total bytes of data per row. The total number of bytes of data per row. Use totalrowbytes when there are extra trailing bits at the end of each row. Example: 8

Bytes between bands. The number of bytes between bands in a BSQ format image. The default is zero. Example: 1

Distribution information. Information on how the resource is distributed online and offline. This element provides information on how the resource is distributed. Connections to online systems can be described as URLs or as a list of connection parameters. Please see the Type definition for complete information.

PhysicalDistributionType

The PhysicalDistributionType contains the information required for retrieving the resource. It differs from the res:DistributionType: generally, the PhysicalDistributionType is intended for download, whereas the Type at the resource level is intended primarily for information. The phys:PhysicalDistributionType includes an optional access tree which can be used to override access rules applied at the resource level. Access for the document's included entities can then be managed individually. Also see individual sub elements for more information.

online. Information for a resource that is distributed online. Please see the Type definition for complete information.

offline. Information for a resource that is distributed offline. Please see the Type definition for complete information.

inline. Information for a resource that is distributed inline, i.e., along with the metadata. Please see the Type definition for complete information.

access. When this element occurs in a distribution module, it controls access only to the resource being described by the same distribution parent. Please see the Type definition for complete information on constructing an access tree.

PhysicalOnlineType

Distribution information for accessing the resource online, represented either as a URL or as the series of named parameters needed to connect. The URL field can contain a simple web address or an entire query string. The connection element allows the components of a complex protocol to be described individually. The PhysicalOnlineType differs from the res:OnlineType in that this type only allows a connectionDefinition to appear as the child of a connection. In other words, in a PhysicalOnlineType, the connectionDefinition cannot be abstracted, and must be included as part of an actual connection.

onlineDescription. The onlineDescription element can hold a brief description of the content of the online element's online|offline|inline child. This description element could supply content for an html anchor tag.

url. The URL of the resource that is available online. Please see the Type definition for complete information.

connection. A connection to a resource that is available online. Please see the Type definition for complete information.
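As one possible use of the onlineDescription and url elements, the sketch below turns an online element into an HTML anchor tag. The sample fragment, its URL, and the `anchor_for` helper are invented for illustration, and falling back to the URL as link text when no description is present is a design choice of this example rather than anything the schema prescribes.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical fragment; the description text and URL are made up.
ONLINE = """
<online>
  <onlineDescription>Rainfall at Sevilleta, October 2002</onlineDescription>
  <url>http://example.org/data/rainfall-sev-2002-10.txt?format=csv&amp;v=1</url>
</online>
"""


def anchor_for(online):
    """Build an HTML anchor from an online element, or None if it has no url."""
    url = (online.findtext("url") or "").strip()
    if not url:
        return None
    label = (online.findtext("onlineDescription") or url).strip()
    # Escape both values so query strings and punctuation stay valid HTML.
    return '<a href="{}">{}</a>'.format(html.escape(url, quote=True),
                                        html.escape(label))


link = anchor_for(ET.fromstring(ONLINE))
print(link)
```

An online element that carries a connection instead of a url yields `None` here; describing such connections would need the connection parameters themselves.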