'$RCSfile: eml-physical.xsd,v $'
Copyright: 1997-2002 Regents of the University of California,
University of New Mexico, and
Arizona State University
Sponsors: National Center for Ecological Analysis and Synthesis and
Partnership for Interdisciplinary Studies of Coastal Oceans,
University of California Santa Barbara
Long-Term Ecological Research Network Office,
University of New Mexico
Center for Environmental Studies, Arizona State University
Other funding: National Science Foundation (see README for details)
The David and Lucile Packard Foundation
For Details: http://knb.ecoinformatics.org/
'$Author: jones $'
'$Date: 2002-12-06 22:23:43 $'
'$Revision: 1.62 $'
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
eml-physical
The eml-physical module - Physical file format
The eml-physical module describes the external
and internal physical characteristics of a data object as well as the
information required for its distribution. Examples of the external
physical characteristics of a data object would be the filename,
size, compression, encoding methods, and authentication of a file
or byte stream. Internal physical characteristics describe the
format of the data object being described. Named binary or
otherwise proprietary formats can be cited (e.g., Microsoft Access
2000), and text formats can be precisely described (e.g., ASCII text
delimited with commas). For these text formats, it also includes the
information needed to parse the data object to extract the entity
and its attributes from the data object. Distribution information
describes how to retrieve the data object. The retrieval information
can be either online (e.g., a URL or other connection information)
or offline (e.g., a data object residing on an archival tape).
The eml-physical module, like other modules, may be
"referenced" via the <references> tag. This
allows a physical document to be described once, and then
used as a reference in other locations within the EML document
via its ID.
Any data object that is being described by EML
needs this information so the entities and attributes that reside
within the data object can be extracted.
yes
Physical structure
Physical structure of an entity or entities.
The content model for physical is a CHOICE between
"references" and all of the elements that let you describe the
internal/external characteristics and distribution of a data object
(e.g., dataObject, dataFormat, distribution.) A physical element can
contain a reference to a physical element defined elsewhere. Using
a reference means that the referenced physical is identical, not just
in name but identical in its complete description.
Data object name
The name of the data object.
The name of the data object. This is
possibly distinct from the entity name in that one physical
object can contain multiple entities, even though that is not
a recommended practice. The objectName often is the filename
of a file in a filesystem or that is accessible on the network.
rainfall-sev-2002-10.txt
Data object size
Describes the physical size of the
data object.
This element contains information on the
physical size of the entity, by default represented in
bytes unless the unit attribute is provided to change
the units.
134
Unit of measurement
Unit of measurement for the entity
size, by default byte
This element gives the unit of
measurement for the size of the entity, and is
by default a byte.
byte
Authentication value
A value, typically a checksum, used to
authenticate that the bitstream delivered to the user is
identical to the original.
This element describes authentication
procedures or techniques, typically by giving a checksum
value for the object. The method used to compute the
authentication value (e.g., MD5) is listed in the method
attribute.
f5b2177ea03aea73de12da81f896fe40
Authentication method
The method used to calculate an
authentication checksum.
This element names the method used
to calculate an authentication checksum that can
be used to validate a bytestream. Typical checksum
methods include MD5 and CRC.
MD5
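As a minimal sketch of how a consumer might use the authentication value, the following Python compares a recorded checksum with one computed from the delivered bytes. The function name authentication_matches is hypothetical, and the sketch assumes the method attribute names an algorithm that Python's hashlib supports (MD5 is; CRC would need a different library).

```python
import hashlib


def authentication_matches(data: bytes, expected_hex: str, method: str = "MD5") -> bool:
    """Return True when the digest of data equals the recorded value.

    Assumption: `method` is an algorithm name known to hashlib.
    """
    digest = hashlib.new(method.lower(), data).hexdigest()
    return digest == expected_hex.strip().lower()


payload = b"hello"
recorded = hashlib.md5(payload).hexdigest()
print(authentication_matches(payload, recorded))
```

A mismatch only says that the bitstream differs from the original; it does not say where or why.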
Compression Method
Name of a compression method applied
This element lists a compression method used
to compress the object, such as zip, compress, etc. Compression
and encoding methods must be listed in the order in which they
were applied, so that decompression and decoding should
occur in the reverse order of the listing. For example,
if a file is compressed using zip and then encoded using
MIME base64, the compression method would be listed first
and the encoding method second.
zip
gzip
compress
Encoding Method
Name of an encoding method applied
This element lists an encoding method used
to encode the object, such as base64, binhex, etc. Compression
and encoding methods must be listed in the order in which they
were applied, so that decompression and decoding should
occur in the reverse order of the listing. For example,
if a file is compressed using zip and then encoded using
MIME base64, the compression method would be listed first
and the encoding method second.
base64
uuencode
binhex
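The ordering rule for compression and encoding methods can be sketched as follows. This is an illustration only: the UNDO table and the restore function are hypothetical names, and only gzip and base64 are handled, using the Python standard library.

```python
import base64
import gzip

# Hypothetical lookup from a method name to the function that undoes it.
UNDO = {
    "gzip": gzip.decompress,
    "base64": base64.b64decode,
}


def restore(stored: bytes, methods_in_applied_order) -> bytes:
    """Undo the listed methods in the reverse of the order they were applied."""
    data = stored
    for method in reversed(methods_in_applied_order):
        data = UNDO[method](data)
    return data


original = b"DATE,PLOT\n2002-01-15,hfr5\n"
# Compressed first, then encoded, so the listing is ["gzip", "base64"].
stored = base64.b64encode(gzip.compress(original))
print(restore(stored, ["gzip", "base64"]) == original)
```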
Character Encoding
Contains the name of the character encoding
used for the data.
This element contains the name of the
character encoding. This is typically ASCII or UTF-8, or
one of the other common encodings.
UTF-8
Data format
Describes the internal physical format
of a data object.
This element is the parent which is a CHOICE
between four possible internal physical formats
which describe the internal
physical characteristics of the data object. Using this
information the user should be able to parse the physical object to
extract the entity and its attributes. Note that this is
the format of the physical object itself.
Text Format
Description of a text formatted object
Description of a text formatted object.
The description includes detailed parsing instructions for
extracting attributes from the bytestream for simple
delimited file formats (e.g., CSV), fixed format files
that use fixed columns for attribute locations, and
mixtures of the two. It also supports records that
span multiple lines.
Number of header lines
Number of header lines preceding
data.
Number of header lines preceding
data. Lines are determined by the
physicalLineDelimiter, or if it is absent, by the
recordDelimiter. This value indicates the
number of header lines that should be skipped
before starting to parse the data.
4
Number of footer lines
Number of footer lines following
data.
Number of footer lines following
data. Lines are determined by the
physicalLineDelimiter, or if it is absent, by the
recordDelimiter. This value indicates the
number of footer lines that should be skipped
after parsing the data. If this value is omitted,
parsers should assume the data continues to the end
of the data stream.
4
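A minimal sketch of how a parser might honor the header and footer counts is shown below. The function name strip_header_footer and its parameters are hypothetical; the sketch assumes the whole object fits in memory and that lines are split on a single delimiter string.

```python
def strip_header_footer(text, line_delimiter="\n", num_header_lines=0, num_footer_lines=0):
    """Return only the data lines, skipping header and footer lines."""
    lines = text.split(line_delimiter)
    # A trailing delimiter leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    end = len(lines) - num_footer_lines
    return lines[num_header_lines:end]


sample = "title\nunits\na,1\nb,2\nend of file\n"
print(strip_header_footer(sample, "\n", num_header_lines=2, num_footer_lines=1))
```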
Record delimiter character
Character used to delimit
records.
This element specifies the record
delimiter character when the format is text. The
record delimiter is usually a linefeed (\n) on UNIX, a
carriage return (\r) on MacOS, or both (\r\n) on
Windows/DOS. Multiline records are usually delimited
with two line ending characters, for example on UNIX
it would be two linefeed characters (\n\n). As record
delimiters are often non-printing characters, one can
use the special values "\n" to represent a
linefeed (ASCII 0x0a) and "\r" to represent a carriage
return (ASCII 0x0d). Alternatively, one can use the
hex value to represent character values (e.g., 0x0a).
\r\n
Physical line delimiter character
Character used to delimit
physical lines.
This element specifies the physical
line delimiter character when the format is text. The
line delimiter is usually a linefeed (\n) on UNIX, a
carriage return (\r) on MacOS, or both (\r\n) on
Windows/DOS. Multiline records are usually delimited
with two line ending characters, for example on UNIX
it would be two linefeed characters (\n\n). As line
delimiters are often non-printing characters, one can
use the special values "\n" to represent a
linefeed (ASCII 0x0a) and "\r" to represent a carriage
return (ASCII 0x0d). Alternatively, one can use the
hex value to represent character values (e.g., 0x0a).
If this value is not provided, processors should
assume that the physical line delimiter is the same
as the record delimiter.
\r\n
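Because delimiters are written as escape notation or hex values, a processor has to translate them into real characters before use. The following is one possible sketch; decode_delimiter is a hypothetical helper, and it assumes hex values are written with exactly two hex digits (e.g., 0x0a). It also accepts "\t", which this module uses for tab field delimiters.

```python
import re

# Matches the notations described in the text: \n, \r, \t, or a two-digit hex value.
_TOKEN = re.compile(r"\\n|\\r|\\t|0x[0-9a-fA-F]{2}")
_ESCAPES = {"\\n": "\n", "\\r": "\r", "\\t": "\t"}


def decode_delimiter(spec: str) -> str:
    """Translate a delimiter as written in the metadata into real characters."""
    def replace(match):
        token = match.group(0)
        if token in _ESCAPES:
            return _ESCAPES[token]
        return chr(int(token, 16))  # e.g. "0x0a" becomes a linefeed
    return _TOKEN.sub(replace, spec)


print(repr(decode_delimiter("\\r\\n")))
```

Any other text, such as a comma, passes through unchanged.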
Physical lines per record
The number of physical lines in the file
spanned by a single logical data record.
A single logical data record may be
written over several physical lines in a file, with
no special marker to indicate the end of a record. In
such cases, it is necessary to know the number of
lines per record in order to correctly read
them. If this value is not provided, processors should
assume that records are wholly contained on one
physical line. If the value is greater than 1, then
processors should examine the lineNumber field for
each attribute to determine which line of the
record contains the information.
3
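As a rough illustration of this rule, physical lines can be grouped into logical records by count alone. The name group_physical_lines is hypothetical, and the sketch assumes header and footer lines have already been removed.

```python
def group_physical_lines(lines, lines_per_record=1):
    """Group consecutive physical lines into logical records."""
    return [
        lines[i:i + lines_per_record]
        for i in range(0, len(lines), lines_per_record)
    ]


physical = ["2002-01-15", "hfr5", "acer rubrum", "2002-01-15", "hfr5", "acer xxxx"]
for record in group_physical_lines(physical, 3):
    print(record)
```

Within each group, the lineNumber of an attribute would then select which physical line holds its value.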
Maximum record length
The maximum number of characters in any
record in the physical file.
The maximum number of characters
in any record in the physical file. For delimited
files, the record length varies and this is not
particularly useful. However, for fixed format files
that do not contain record delimiters, this field is
critical to tell processors when one record stops
and another begins.
597
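For the fixed format case without record delimiters, the record boundaries follow from the length alone. The sketch below assumes every record is padded to exactly the maximum record length; split_fixed_records is a hypothetical name.

```python
def split_fixed_records(stream: str, max_record_length: int):
    """Cut an undelimited stream into records of equal length."""
    return [
        stream[i:i + max_record_length]
        for i in range(0, len(stream), max_record_length)
    ]


print(split_fixed_records("AAAABBBBCCCC", 4))
```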
Orientation of attributes
Orientation of attributes.
Specifies whether the attributes
described in the physical stream are found in
columns or rows. The valid values are column or row.
If set to 'column', then the attributes are in
columns. If set to 'row', then the attributes
are in rows. Row orientation is rare, but some
systems such as Splus and R utilize it.
For example, some data with column orientation:
DATE PLOT SPECIES
2002-01-15 hfr5 acer rubrum
2002-01-15 hfr5 acer xxxx
The same data with row orientation:
DATE 2002-01-15
PLOT hfr5
SPECIES acer rubrum acer xxxx
column
row
Simple delimited format
A simple delimited format.
A simple delimited format that
uses one of a series of delimiters to indicate
the ends of fields in the data stream. More
complex formats such as fixed format or mixed
delimited and fixed formats can be described using
the "complex" element.
Field Delimiter character
Character used to delimit the
end of an attribute
This element specifies
a character to be used in the object for
indicating the ending column for an attribute.
The delimiter character itself is not part
of the attribute value, but rather is present
in the column following the last character
of the value. Typical delimiter characters
include commas, tabs, spaces, and semicolons.
The only time the fieldDelimiter character is
not interpreted as a delimiter is if it
is contained in a quoted string
(see quoteCharacter) or is immediately
preceded by a literalCharacter.
Non-printable delimiter characters can be
provided as their hex values, and tab
characters by the ASCII string "\t".
Processors should assume that the field
starts in the column following the previous
field if the previous field was fixed,
or in the column following the delimiter
from the previous field if the previous
field was delimited.
,
\t
0x09
0x20
Quote character
Character used to quote values
for delimiter escaping
This element specifies
a character to be used in the object for
quoting values so that field delimiters can
be used within the value. This basically
allows delimiter "escaping". The quoteCharacter
is typically a " or '. When a processor
encounters a quote character, it should
not interpret any following characters as
a delimiter until a matching quote character
has been encountered (i.e., quotes come in
pairs). It is an error to not provide a
closing quote before the record ends.
Non-printable quote characters can be
provided as their hex values.
"
'
Literal character
Character used to escape other
special characters
This element specifies
a character to be used for escaping
special character values so that they
are treated as literal values.
This allows "escaping" for special
characters like quotes, commas, and spaces
when they are intended to be used in an
attribute value rather than being intended
as a delimiter. The literalCharacter is
typically a \.
\
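The interaction of fieldDelimiter, quoteCharacter, and literalCharacter can be sketched with a small splitter. This is only an illustration under stated assumptions: split_fields is a hypothetical name, each of the three settings is a single character, and the quote and literal characters themselves are dropped from the returned values.

```python
def split_fields(record, field_delimiter=",", quote_character=None, literal_character=None):
    """Split one record into field values, honoring quoting and escaping."""
    fields, current = [], []
    in_quotes = False
    escaped = False
    for ch in record:
        if escaped:
            # The character after a literalCharacter is always taken literally.
            current.append(ch)
            escaped = False
        elif literal_character is not None and ch == literal_character:
            escaped = True
        elif quote_character is not None and ch == quote_character:
            in_quotes = not in_quotes
        elif ch == field_delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_quotes:
        raise ValueError("record ended before the closing quote character")
    fields.append("".join(current))
    return fields


print(split_fields('hfr5,"acer, rubrum",12', ",", '"'))
```

Raising an error on an unclosed quote mirrors the rule that a closing quote must appear before the record ends.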
Complex text format
A complex text format.
A complex text format that
can describe delimited fields, fixed width
fields, and mixtures of the two. This supports
multiline records (where one record is distributed
across multiple physical lines). When using the
complex format, the number of textFixed and
textDelimited elements should exactly equal the
number of attributes that have been described
for the entity, and the order of the textFixed
and textDelimited elements should correspond to
the order of the attributes as described in the
entity. Thus, for a delimited file with fourteen
attributes, one should provide exactly fourteen
textDelimited elements.
Fixed format text
Describes the physical format
of data sequences that use a fixed
number of characters in a specified position
in the stream to locate attribute values.
Describes the physical
format of data sequences that use a fixed
number of characters in a specified position
in the stream to locate attribute values.
This method is common in sensor-derived
data and in legacy database systems. To
parse it, one must know the number
of characters for each attribute and the
starting column and line to begin reading
the value.
Field width
Field width in
characters for fixed field
length.
Fixed width fields
have a set length, thus the end of
the field can always be determined by
adding the fieldWidth to the starting
column number.
7
Physical Line Number
The line on which
the data field is found, when
the data record is written over
more than one physical line in
the file.
A single logical
data record may be written over
several physical lines in a file,
with no special marker to indicate
the end of a record. In such
cases, the relative location of
a data field must be indicated
by both relative row and column
number. The lineNumber should never be
greater than the number of physical
lines per record.
3
Start column
The starting
column number for a fixed format
attribute.
Fixed width fields
have a set length, thus the end of
the field can always be determined by
adding the fieldWidth to the starting
column number. If the starting
column is not provided, processors
should assume that the field starts
in the column following the previous
field if the previous field was fixed,
or in the column following the
delimiter from the previous field if
the previous field was delimited.
58
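A minimal sketch of reading one fixed width value from a record is given below. The function read_fixed_field and its parameter names are hypothetical, and the sketch assumes that both the start column and the lineNumber are counted from 1; the text above does not state the numbering base.

```python
def read_fixed_field(physical_lines, start_column, field_width, line_number=1):
    """Return the characters of one fixed width field.

    Assumption: start_column and line_number are 1-based.
    """
    line = physical_lines[line_number - 1]
    begin = start_column - 1
    return line[begin:begin + field_width]


record = ["2002-01-15 hfr5", "acer rubrum   12.5"]
print(read_fixed_field(record, start_column=12, field_width=4, line_number=1))
print(read_fixed_field(record, start_column=1, field_width=11, line_number=2))
```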
Delimited format text
Describes the physical format
of data sequences that use delimiters
in the stream to locate attribute values.
Describes the physical
format of data sequences that use delimiters
in the stream to locate attribute values.
This method is common in data exported from
spreadsheets and database systems.
To parse it, one must know the character
that indicates the end of each attribute
and the line to begin reading the value.
Field Delimiter character
Character used
to delimit the end of a particular
attribute
This element
specifies a character to be used
in the object for indicating the
ending column for an attribute.
The delimiter character itself is
not part of the attribute value,
but rather is present in the column
following the last character of the
value. Typical delimiter characters
include commas, tabs, spaces,
and semicolons. The only time the
fieldDelimiter character is not
interpreted as a delimiter is if it
is contained in a quoted string (see
quoteCharacter) or is immediately
preceded by a literalCharacter.
Non-printable delimiter characters can
be provided as their hex values,
and tab characters by the ASCII
string "\t". Processors should
assume that the field starts in the
column following the previous field
if the previous field was fixed,
or in the column following the
delimiter from the previous field
if the previous field was delimited.
,
\t
0x09
0x20
Physical Line Number
The line on which
the data field is found, when
the data record is written over
more than one physical line in
the file.
A single logical
data record may be written over
several physical lines in a file,
with no special marker to indicate
the end of a record. In such
cases, the relative location of
a data field must be indicated
by both relative row and column
number.
The lineNumber should never be
greater than the number of physical
lines per record. When parsing the
first field on a physical line as
a delimited field, processors should assume
that the field data starts in the
first column. Otherwise, follow the
rules indicated under fieldDelimiter.
3
Quote character
Character used
to quote values for delimiter
escaping
This element
specifies a character to be used in
the object for quoting values so
that field delimiters can be used
within the value. This basically
allows delimiter "escaping". The
quoteCharacter is typically a " or
'. When a processor encounters
a quote character, it should not
interpret any following characters
as a delimiter until a matching
quote character has been encountered
(i.e., quotes come in pairs). It is
an error to not provide a closing
quote before the record ends.
Non-printable quote characters
can be provided as their hex
values.
"
'
Literal character
Character used
to escape other special
characters
This element
specifies a character to be used
for escaping special character
values so that they are treated
as literal values. This allows
"escaping" for special characters
like quotes, commas, and spaces
when they are intended to be used
in an attribute value rather than
being intended as a delimiter.
The literalCharacter is typically
a \.
\
Externally Defined Format
Information about a non-text or proprietary
formatted object.
Information about a non-text or
proprietary formatted object.
The description names the format explicitly, but assumes
a processor implicitly knows how to parse that format
to extract the data. A format version can be included.
This is mainly used for proprietary formats, including
binary files like Microsoft Excel and text formats like
ESRI's ArcInfo export format. This is not a recommended
way to permanently archive data because the software to
parse the format is unlikely to be available over extended
periods, but is included to allow for commonly used
physical formats.
Format Name
Name of the format of the data
object
Name of the format of
the data object
Microsoft Excel
Format Version
Version of the format of the
data object
Version of the format of
the data object
2000 (9.0.2720)
Format citation
Citation providing more details about
the physical format.
Citation providing more detail about
the physical format, including parsing information
or information about the software required for
reading the object.
Raster image format
Contains binary raster data header
parameters
The binaryRasterInfo element is a
container for various parameters used to describe the
contents of binary raster image files. In this case, it is
based on a white paper on the ESRI site that describes the
header information used for BIP and BIL files ("Extendable
Image Formats for ArcView GIS 3.1 and
3.2").
Orientation for reading rows and columns
Orientation for reading rows and columns.
Specifies whether the data should
be read across rows or down columns. The valid
values are column or row. If set to 'column', then
the data are read down columns. If set to 'row',
then the data are read across rows.
column
row
Multiple band image
Multiple band image information.
Information needed to properly
interpret a multiband image.
Number of Bands
The number of spectral bands in the
image.
The number of spectral
bands in the image. Must be greater than 1.
2
Layout
The organization of the bands
in the image file.
The organization of
the bands in the image file. Acceptable
values are bil - Band interleaved by
line. bip - Band interleaved by pixel.
bsq - Band sequential.
bil
bip
bsq
Number of Bits
The number of bits per pixel per
band.
The number of bits per pixel per
band. Acceptable values are typically 1, 4, 8, 16,
and 32. The default value is eight bits per pixel per
band. For a true color image with three bands (R, G,
B) stored using eight bits for each pixel in each
band, nbits equals eight and nbands equals three,
for a total of twenty-four bits per pixel.
8
Byte Order
The byte order in which values are
stored.
The byte order in which
values are stored. The byte order is important for
sixteen-bit and higher images, which have two or more
bytes per pixel.
Acceptable values are little-endian (common on Intel
systems like PCs) and big-endian (common on
Motorola platforms).
little-endian
big-endian
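To show why byte order matters for multi-byte samples, the sketch below decodes the same four bytes both ways using Python's struct module. The function unpack_samples is a hypothetical name, and the sketch assumes unsigned integer samples of 8, 16, or 32 bits, which the text above does not specify.

```python
import struct

# struct format codes for unsigned integers of each supported width.
_CODES = {8: "B", 16: "H", 32: "I"}


def unpack_samples(raw: bytes, nbits=16, byteorder="little-endian"):
    """Decode raw bytes into a list of unsigned integer sample values."""
    prefix = "<" if byteorder == "little-endian" else ">"
    count = len(raw) // (nbits // 8)
    return list(struct.unpack(f"{prefix}{count}{_CODES[nbits]}", raw))


raw = b"\x01\x00\x00\x02"
print(unpack_samples(raw, 16, "little-endian"))
print(unpack_samples(raw, 16, "big-endian"))
```

For eight-bit samples the two byte orders give identical results, since each value occupies a single byte.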
Skip Bytes
The number of bytes of data in the
image file to skip in order to reach the start of the
image data.
The number of bytes of data in the
image file to skip in order to reach the start of the
image data. This keyword allows you to bypass any
existing image header information in the file. The
default value is zero bytes.
0
Bytes per band per row
The number of bytes per band per
row.
The number of bytes per band per
row. This must be an integer. This keyword is used
only with BIL files when there are extra bits at the
end of each band within a row that must be
skipped.
3
Total bytes of data per row
The total number of bytes of data
per row.
The total number of bytes of data
per row. Use totalrowbytes when there are extra
trailing bits at the end of each
row.
8
Bytes between bands
The number of bytes between bands in
a BSQ format image.
The number of bytes between bands in
a BSQ format image. The default is
zero.
1
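The three layouts differ only in the order in which samples are written, which can be sketched as a byte offset calculation. This is an illustration under simplifying assumptions: sample_offset is a hypothetical name, nrows and ncols are hypothetical parameters for the image dimensions, samples occupy whole bytes, indexes are counted from 0, and there is no padding (that is, the extra bytes described by the per-band, per-row, and between-band settings are absent).

```python
def sample_offset(row, col, band, nrows, ncols, nbands,
                  nbits=8, layout="bil", skipbytes=0):
    """Byte offset of one sample in an unpadded raster file (0-based indexes)."""
    if nbits % 8 != 0:
        raise ValueError("this sketch only handles whole-byte samples")
    sample_size = nbits // 8
    if layout == "bip":    # all bands of a pixel are stored together
        index = (row * ncols + col) * nbands + band
    elif layout == "bil":  # within a row, each band's line is stored in turn
        index = (row * nbands + band) * ncols + col
    elif layout == "bsq":  # each band is stored as a complete image
        index = (band * nrows + row) * ncols + col
    else:
        raise ValueError("layout must be bil, bip, or bsq")
    return skipbytes + index * sample_size


# A 2-row, 3-column image with 2 bands of eight-bit samples.
for layout in ("bip", "bil", "bsq"):
    print(layout, sample_offset(0, 1, 1, nrows=2, ncols=3, nbands=2, layout=layout))
```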
Distribution Information
Information on how the resource is distributed
online and offline
This element provides information on how the
resource is distributed online and offline. Connections to online
systems can be described as URLs and as a list of relevant
connection parameters.
Online Distribution Information
Distribution information for accessing the
resource online.
Distribution information for accessing the
resource online, represented either as a URL or as a series of
named parameters that are needed in order to
connect. The URL field is provided for the simple cases where a
file is available for download directly from a web server or
other similar server and a complex connection protocol is not
needed. The connection field provides an alternative where a
complex protocol needs to be named and described, along with
the necessary parameters needed for the connection.
Download site URL
A URL (Uniform Resource Locator) from which
this resource can be downloaded or information can be
obtained about downloading it.
A URL (Uniform Resource Locator) from
which this resource can be downloaded or additional
information can be obtained. If accessing the URL would
directly return the data stream, then the "function"
attribute should be set to "download". If the URL
provides further information about downloading the
object but does not directly return the data stream, then
the "function" attribute should be set to "information".
If the "function" attribute is omitted, then "download"
is implied for the URL function.
In more complex cases where a non-standard connection
must be established that complies with application
specific procedures beyond what can be described in the
simple URL, then the "connection" element should
be used instead of the URL element.
http://data.org/getdata?id=98332
Connection
A description of the information needed
to make an application connection to a data service.
A description of the information needed
to make an application connection to a data service.
The connection starts with a connectionDefinition which
lists all of the parameters needed for the connection
and possible default values for each. It then includes a
list of parameter values, one for each parameter, that
override the defaults for this particular connection.
One parameter element should exist for every
parameterDefinition that is present in the
connectionDefinition, except that parameters that were
defined with a defaultValue in their parameterDefinition
can be omitted from the connection and the default
will be used. All information about how to use the
parameters to establish a session and extract data is
present in the connectionDefinition, possibly implicitly
by naming a connection schemeName that is well-known.
Connection Definition
Definition of the connection protocol
to be used for this connection.
Definition of the connection
protocol to be used for this connection. The
definition has a "scheme" which identifies the
protocol by name, and a detailed description of
the scheme and its required parameters.
Parameter
A parameter to be used to make this
connection.
A parameter to be used to make
this connection. This value overrides any
default value that may have been provided in the
connection definition.
Parameter Name
Name of the parameter to be
used to make this connection.
The name of the parameter
to be used to make this connection.
hostname
Parameter Value
The value of the parameter to
be used to make this connection.
The value of the parameter
to be used to make this connection. This
value overrides any default value that may
have been provided in the connection
definition.
nceas.ucsb.edu
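The override rule for connection parameters can be sketched with plain dictionaries. The function resolve_parameters is a hypothetical name, the "port" parameter and its default are invented for illustration, and a default of None stands for a parameterDefinition without a defaultValue.

```python
def resolve_parameters(parameter_definitions, parameters):
    """Combine defaults from the definition with values given in the connection."""
    resolved = {}
    for name, default in parameter_definitions.items():
        if name in parameters:
            resolved[name] = parameters[name]  # the explicit value overrides the default
        elif default is not None:
            resolved[name] = default
        else:
            raise ValueError(f"no value or default for parameter {name!r}")
    return resolved


definitions = {"hostname": None, "port": "5432"}  # "port" is a made-up example
print(resolve_parameters(definitions, {"hostname": "nceas.ucsb.edu"}))
```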
medium of the resource
the medium on which this resource is distributed,
either digitally or as hardcopy
the medium on which this resource is distributed
digitally, such as 3.5" floppy disk, or various tape media types,
or 'hardcopy'
CD-ROM, 3.5 in. floppy disk, Zip disk
Medium name
Name of the medium for this resource
distribution
Name of the medium on which this resource
is distributed. Can be various digital media such as tapes
and disks, or printed media which can collectively be
termed 'hardcopy'.
Tape, 3.5 inch Floppy Disk,
hardcopy
density of the digital medium
the density of the digital medium if this is
relevant.
the density of the digital medium if this
is relevant. Used mainly for floppy disks or
tape.
High Density (HD), Double Density
(DD)
units of a numerical density
a numerical density's units
if a density is given numerically, the
units should be given here.
B/cm
storage volume
total volume of the storage
medium
the total volume of the storage medium on
which this resource is shipped.
650 MB
medium format
format of the medium on which the resource is
shipped.
the file system format of the medium on
which the resource is shipped
NTFS, FAT32, EXT2, QIK80
note about the media
note about the media
any additional pertinent information about
the media
Inline distribution
Object data distributed inline in the metadata.
Object data distributed inline in the metadata.
Users have the option of including the data right inline in the
metadata by providing it inside of the "inline" element. For
many text formats, the data can be simply included directly in
the element. However, certain character sequences are invalid in
an XML document (e.g., <), so care will need to be taken to
either 1) wrap the data in a CDATA section if needed, or
2) encode the data using a text encoding algorithm such as
base64, and then include that in a CDATA section. The latter
will be necessary for binary formats. The data should be
de-encoded and de-compressed according to the encodingMethod
and compressionMethod fields in eml-physical as if the data
had been obtained out-of-band (e.g., from a URL).
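As a minimal sketch of reading inline data, the following uses Python's xml.etree.ElementTree on a bare "inline" element. The function inline_data is a hypothetical name; the sketch handles only plain text and base64, assumes UTF-8 when converting text to bytes, and ignores any compression step.

```python
import base64
import xml.etree.ElementTree as ET


def inline_data(xml_fragment: str, encoding_method=None) -> bytes:
    """Return the bytes carried by an inline element."""
    element = ET.fromstring(xml_fragment)
    text = element.text or ""
    if encoding_method == "base64":
        return base64.b64decode(text)
    return text.encode("utf-8")  # assumption: plain text is UTF-8


# Option 1: raw text wrapped in CDATA so that "<" is legal.
raw_fragment = "<inline><![CDATA[a<b,1\n]]></inline>"
# Option 2: the same bytes encoded with base64 inside CDATA.
encoded = base64.b64encode(b"a<b,1\n").decode("ascii")
b64_fragment = "<inline><![CDATA[" + encoded + "]]></inline>"

print(inline_data(raw_fragment))
print(inline_data(b64_fragment, "base64"))
```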