EcoGRID Query Interface (EQI)

Overview

Note

This document is a DRAFT for comment.

This document describes the EcoGRID Query Interface, an API for accessing structured data in the EcoGRID of the SEEK project.

This document is concerned only with the process of data retrieval. It does not consider editing or deletion of records.

Use Cases

The following use cases are derived from the document http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/docs/design/design-ecogrid-1.0.0.txt?rev=1.5&content-type=text/plain, with some additions.

UC1: User can search for data using a query that may contain any combination of keywords, spatial constraints, and other query terms, and be returned a list of data that match.

UC2: User can browse data using one or more inverted controlled vocabularies or ontologies.

UC3: User can discover Data Providers that support a particular Federation Schema.

UC4: User can discover the Federation Schema supported by a Data Provider.

UC5: User can retrieve a count of records that match a particular filter.

UC6: User can retrieve records conformant to the Federation Schema or any valid restriction of the Federation Schema.

UC7: The system must provide a Result Set Identifier (RSID) that may be used as a handle by which a third party may retrieve the records that match the RSID.

UC8: A Data Provider must be able to decode an RSID and return the filter that was used to create the RSID.

Glossary

EcoGRID
The "data layer" of the SEEK project. A distributed network of data sources that expose a common programmatic interface and conform to a common authentication scheme.
Data Provider
A service that exposes the EcoGRID Query Interface. A Data Provider exposes a single data source, although an implementation may use URL parameters to provide access to more than one data set through the same service instance.
Record
A record is a chunk of data that conforms to the Federation Schema used to describe the content accessible through a Data Provider.
Federation Schema
An XMLSchema document that defines a record structure that may be shared by multiple Data Providers. Federation Schemas should support at least simple inheritance.
Result Set Identifier (RSID)
An RSID is a handle that identifies a set of records available for retrieval from a Data Provider. An RSID will generally be created by applying a filter against the set of records available from a Data Provider. An RSID may be used by a third party to retrieve data based on an original query. For example, a web browser EcoGrid client might submit a query that resolves to a large set of records. Instead of returning that data to the client so that it can be submitted to some analysis application, the Data Provider returns an RSID, which the client then passes to the analysis application. The analysis application contacts the Data Provider with the RSID to retrieve the data, thus avoiding the need to transfer the data through the client.
Information Community
The collection of Data Providers that conform to a single Federation Schema.
Remote Join
The use of a record set from one Data Provider to identify a set of matching records on another Data Provider. For example, a set of records from a Taxon Information Community with <taxon:name> elements might be used in a query against an EML Information Community. The result would be the set of EML records with taxonomic information that matches the set of taxonomic names retrieved from the taxonomic Data Providers.

Data Source Model

A Data Provider is a service that provides access to a container of records. It is assumed that regardless of the internal data storage mechanisms, all records exposed through a Data Provider shall be rendered in XML. Each record on a Data Provider conforms to a single Federation Schema. The process of selecting a subset of records from the Data Provider is performed by applying a filter to construct a set of matching records.

The filter matches terms with values and attributes of the records in the Data Provider record container. It is necessary that the exact and relative location of nodes within the record can be indicated in the filter.
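
As an illustration of exact versus relative node location, the following minimal sketch uses the limited XPath subset in Python's standard xml.etree.ElementTree module. The record structure, the element names and the matches() helper are assumptions made for this example only; the draft does not define a filter syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the element names are illustrative only.
RECORD = (
    "<record><dataset>"
    "<title>Juncus survey</title>"
    "<creator><name>Jones</name></creator>"
    "</dataset></record>"
)

def matches(record_xml, path, value):
    """Return True if any node selected by `path` has text equal to `value`."""
    root = ET.fromstring(record_xml)
    return any((node.text or "").strip() == value for node in root.findall(path))

# Exact location: the full path from the record root to the node.
exact_hit = matches(RECORD, "./dataset/title", "Juncus survey")

# Relative location: any <name> element, at any depth below the root.
relative_hit = matches(RECORD, ".//name", "Jones")
```

The first call matches only if the title sits at the stated position in the record, while the second matches a name element wherever it occurs.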

Several Data Providers may reference a common Federation Schema, and form an Information Community. The same filter could be applied to each member of an Information Community to generate a set of all records from all Data Providers that match the filter.
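
The idea of applying one filter across an Information Community can be sketched as follows. This is a simplified in-memory model, not a proposed implementation: the provider names, the dictionary records and the search_community() function are hypothetical, and a real client would call each Data Provider over the network.

```python
def search_community(providers, predicate):
    """Apply the same filter to every provider and merge the matches.

    `providers` maps a provider name to its list of records; `predicate`
    stands in for the (not yet defined) filter.
    """
    results = []
    for name, records in providers.items():
        for record in records:
            if predicate(record):
                # Keep the provider name so the origin of each record is known.
                results.append((name, record))
    return results

# Hypothetical members of one Information Community.
providers = {
    "provider_a": [{"family": "Juncaceae", "name": "Juncus effusus"}],
    "provider_b": [
        {"family": "Poaceae", "name": "Poa annua"},
        {"family": "Juncaceae", "name": "Juncus bufonius"},
    ],
}

hits = search_community(providers, lambda r: r["family"] == "Juncaceae")
```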

Federation Schemas should support simple inheritance. For example, a ROOT Federation Schema could be defined that contained a simple, generic set of elements. Other Federation Schemas (such as an EML schema) could be derived from the ROOT schema, and inherit all of its properties. Then the EML Information Community could be searched using EML specific attributes or ROOT attributes, and hence all EML Data Providers will also participate in the ROOT Information Community.
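
A minimal sketch of how simple inheritance determines community membership is given below. The schema registry, the provider names and both helper functions are assumptions for illustration; the draft does not specify how schema derivation is recorded.

```python
# Hypothetical registry: schema name -> parent schema (None for the root).
SCHEMA_PARENT = {"ROOT": None, "EML": "ROOT", "Taxon": "ROOT"}

def communities_for(schema):
    """Return the schema and all of its ancestors, nearest first."""
    chain = []
    while schema is not None:
        chain.append(schema)
        schema = SCHEMA_PARENT[schema]
    return chain

def providers_in(community, provider_schemas):
    """Return the providers whose schema is, or derives from, `community`."""
    return sorted(
        provider
        for provider, schema in provider_schemas.items()
        if community in communities_for(schema)
    )

# Hypothetical providers and the Federation Schema each one conforms to.
provider_schemas = {"provider_a": "EML", "provider_b": "Taxon"}

root_members = providers_in("ROOT", provider_schemas)
eml_members = providers_in("EML", provider_schemas)
```

Under this model every provider appears in the ROOT community, while only the EML provider appears in the EML community.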

The process of record retrieval is separate from the process of identifying a subset of matching records. This is required since a client should not have to retrieve the records in order to pass them to a third party. Instead, a token (RSID) is returned in response to a record selection process. This token can then be used by other systems to retrieve the actual data.
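
The separation of selection from retrieval can be sketched with an in-memory stand-in for a Data Provider. Everything here is hypothetical: the class, the use of a random UUID as the RSID, and the omission of the securityToken, start, number and format parameters are simplifications for illustration.

```python
import uuid

class DataProvider:
    """In-memory stand-in for a Data Provider (illustrative only)."""

    def __init__(self, records):
        self._records = records
        self._result_sets = {}  # RSID -> indexes of the matching records

    def search(self, predicate):
        """Select matching records and return only a token for them."""
        rsid = uuid.uuid4().hex
        self._result_sets[rsid] = [
            i for i, record in enumerate(self._records) if predicate(record)
        ]
        return rsid

    def retrieve(self, rsid):
        """Return the records identified by a previously issued RSID."""
        return [self._records[i] for i in self._result_sets[rsid]]

def analysis_application(provider, rsid):
    """Third party: fetches the data itself, using the token it was handed."""
    return len(provider.retrieve(rsid))

provider = DataProvider(["Juncus effusus", "Poa annua", "Juncus bufonius"])

# The client holds only the token; no record data passes through it.
rsid = provider.search(lambda name: name.startswith("Juncus"))
count = analysis_application(provider, rsid)
```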

Records may be retrieved in their "natural" complete form, or may be transformed to another form that is compatible with the original structure. For example, instead of retrieving a very large record, a client may only be interested in information such as the Author and Title elements of the record. The retrieval process must support the specification of a particular rendering of records, most probably through the use of XMLSchema restrictions on the records, or perhaps the application of an XSL transform on the record.
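
The Author and Title example can be approximated with a simple projection. Note that this sketch swaps in a plain element whitelist in place of an XMLSchema restriction or XSL transform, neither of which is available in the Python standard library; the record content and the render() function are hypothetical.

```python
import xml.etree.ElementTree as ET

def render(record_xml, keep):
    """Return a copy of the record holding only the top-level children in `keep`."""
    source = ET.fromstring(record_xml)
    reduced = ET.Element(source.tag)
    for child in source:
        if child.tag in keep:
            reduced.append(child)
    return ET.tostring(reduced, encoding="unicode")

# Hypothetical record with one large element the client does not need.
FULL_RECORD = (
    "<record>"
    "<Author>Jones</Author>"
    "<Title>Juncaceae of the north</Title>"
    "<Abstract>A very long abstract</Abstract>"
    "</record>"
)

summary = render(FULL_RECORD, {"Author", "Title"})
```

The result keeps the original element order and nesting, so it remains compatible with the structure of the full record.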

Simple remote joins should be possible. For example, it should be possible to create a result set equivalent to the following pseudo-SQL applied against two hypothetical Information Communities:

SELECT * FROM EML_Information_Community 
  WHERE (<EML:taxon_name> IN (SELECT <taxon:name> FROM Taxon_Information_Community 
    WHERE (<taxon:classification>='Jones 1988' AND <taxon:Family>='Juncaceae')))

The subselect statement returns a set of records from the Taxon_Information_Community that contain just <taxon:name> values that match a hypothetical classification and Family. The resulting list of names is used to compare values in the (hypothetical) taxon_name element of the EML records held by the EML_Information_Community.
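
The same join can be sketched procedurally. In this simplified model each Information Community is a list of dictionaries held in memory; the field names mirror the hypothetical elements in the pseudo-SQL, and remote_join() is an illustrative name, not part of the interface.

```python
# Hypothetical contents of the two Information Communities.
taxon_records = [
    {"name": "Juncus effusus", "classification": "Jones 1988", "family": "Juncaceae"},
    {"name": "Poa annua", "classification": "Jones 1988", "family": "Poaceae"},
]
eml_records = [
    {"title": "Wetland survey", "taxon_name": "Juncus effusus"},
    {"title": "Lawn survey", "taxon_name": "Poa annua"},
]

def remote_join(eml, taxa, classification, family):
    """Return EML records whose taxon_name appears in the matching taxon names."""
    # Subselect: collect the names from the taxon community.
    names = {
        t["name"]
        for t in taxa
        if t["classification"] == classification and t["family"] == family
    }
    # Outer select: compare each EML record against that set of names.
    return [record for record in eml if record["taxon_name"] in names]

joined = remote_join(eml_records, taxon_records, "Jones 1988", "Juncaceae")
```

Only the list of names crosses from one community to the other, which is what keeps the join simple.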

Methods

The following methods are defined in the EcoGRID Query Interface.

RSID                  search          (securityToken, filter)
recordArray           retrieve        (securityToken, RSID, start, number, format)
inventoryArray        inventory       (securityToken, conceptpath, [filter])
filter                decodeRSID      (RSID)
statusArray           status          (securityToken)
recordArray           searchRetrieve  (securityToken, filter, start, number, format)
<<anyType>>           directGet       (<<target specific parameters>>)

retrieve()

recordArray retrieve (securityToken, RSID, start, number, format)

Given an RSID, returns the records in a specified format. A page of records can be selected from the full set of records circumscribed by the RSID.

http://tsadev.speciesanalyst.net/graphviz/dot.php?dot=http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/projects/ecogrid/docs/QueryInterface/eqi_retrieve.gdot
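
Paging with start and number might behave as in the sketch below. The draft does not state whether start is zero-based or one-based, nor what happens past the end of the set; this example assumes a zero-based offset and a short (possibly empty) final page.

```python
def page(records, start, number):
    """Return up to `number` records beginning at zero-based offset `start`."""
    if start < 0 or number < 0:
        raise ValueError("start and number must be non-negative")
    return records[start:start + number]

# Hypothetical result set identified by some RSID.
result_set = ["rec0", "rec1", "rec2", "rec3", "rec4"]

first_page = page(result_set, 0, 2)
last_page = page(result_set, 4, 2)   # fewer records than requested remain
```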

inventory()

inventoryArray inventory (securityToken, conceptpath, [filter])

Returns a list of the unique values found at a particular conceptpath (the equivalent of a column) and their number of occurrences, optionally restricted by a filter.
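
A minimal sketch of the counting behaviour follows. It assumes that conceptpath can be expressed in the XPath subset supported by Python's xml.etree.ElementTree and omits the securityToken and optional filter; the record content is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def inventory(records_xml, conceptpath):
    """Return sorted (value, occurrences) pairs for the nodes at `conceptpath`."""
    counts = Counter()
    for xml_text in records_xml:
        root = ET.fromstring(xml_text)
        for node in root.findall(conceptpath):
            counts[(node.text or "").strip()] += 1
    return sorted(counts.items())

# Hypothetical records held by one Data Provider.
records = [
    "<record><family>Juncaceae</family></record>",
    "<record><family>Poaceae</family></record>",
    "<record><family>Juncaceae</family></record>",
]

families = inventory(records, "./family")
```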

decodeRSID()

filter decodeRSID (RSID)

Given an RSID, this method should decode the RSID and return the filter that was used to generate it.
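
One way a Data Provider could satisfy this is to embed the filter in the RSID itself, so that decoding needs no server-side state. The sketch below assumes the filter is a string and uses URL-safe Base64; this is only one possible scheme, and a provider could equally keep a lookup table from opaque RSIDs to filters. The function names and the sample filter are hypothetical.

```python
import base64

def encode_rsid(filter_text):
    """Build an RSID that carries the filter used to create it."""
    return base64.urlsafe_b64encode(filter_text.encode("utf-8")).decode("ascii")

def decode_rsid(rsid):
    """Recover the original filter from an RSID made by encode_rsid()."""
    return base64.urlsafe_b64decode(rsid.encode("ascii")).decode("utf-8")

# Hypothetical filter; the filter syntax is not yet defined in this draft.
original_filter = "//taxon:Family = 'Juncaceae'"
rsid = encode_rsid(original_filter)
recovered = decode_rsid(rsid)
```

A self-describing RSID of this kind exposes the filter to anyone holding the token, which may matter once the authentication scheme is settled.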

searchRetrieve()

recordArray searchRetrieve (securityToken, filter, start, number, format)

Combines the search and retrieve methods to provide a single call interface to retrieve records. No RSID is generated by this method.

status()

statusArray status (securityToken)

Returns a set of status records that indicate the load on the service.

directGet()

<<anyType>> directGet (<<target specific parameters>>)

Provides a mechanism for directly retrieving a single object from the full set.

Note: should this really be a method of the EQI? Seems like a property of the data "object".

Types

WSDL types define the data types (beyond the standard XMLSchema data types) that are referenced within the WSDL document. Types may appear as external XMLSchema documents which are brought into context with the import statement, or may be entire XMLSchema documents embedded within the WSDL Document in the wsdl:types section. The latter option is generally better supported by WSDL toolkits.

filter
The filter structure is not yet fully defined. It is not yet determined if this will be an actual structure or simply a String.
securityToken
The securityToken type is a place holder for an as yet undefined structure. The intent is that this structure will provide the authentication context in which the operation should be invoked.
RSID
The RSID is a Result Set Identifier. The actual structure of an RSID will likely be defined by individual data providers. This may be changed to the simple string type in later revisions.
record
A record is used to contain record data retrieved from a data provider. This is a complex type that can contain any valid XML data.
inventoryRecord
A type of record that is common across all data providers and generated in response to the inventory method.
format
The format type is used to indicate the structure of data that is to be returned to a client of a Data Provider. The type will contain any valid XMLSchema document or parsable XMLSchema fragment.

Discussion Items

The following items require discussion for resolution.

Query Syntax

There are a few options for specifying a filter for the search and inventory operations.

Query candidates include: XPath, XQuery, SQL and a query structure.

The "query structure" option means that the query (filter) is represented as a data structure (e.g. an XML document) rather than a particular query syntax.
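
To make the "query structure" option concrete, the sketch below represents a filter as nested tuples and evaluates it against a flat record. The operator names ("eq", "and", "or"), the tuple layout and the dictionary record are all assumptions for illustration; an actual structure would more likely be an XML document addressing nodes by path.

```python
def evaluate(node, record):
    """Evaluate a structured filter against a record held as a dict."""
    op = node[0]
    if op == "eq":
        _, field, value = node
        return record.get(field) == value
    if op == "and":
        return all(evaluate(child, record) for child in node[1])
    if op == "or":
        return any(evaluate(child, record) for child in node[1])
    raise ValueError(f"unknown operator: {op}")

# Hypothetical filter: classification = 'Jones 1988' AND family = 'Juncaceae'.
query = ("and", [
    ("eq", "classification", "Jones 1988"),
    ("eq", "family", "Juncaceae"),
])

record = {"classification": "Jones 1988", "family": "Juncaceae"}
is_match = evaluate(query, record)
```

Because the filter is already a tree, a service does not need a parser for it and can walk the structure directly to build a query in its native language.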

SQL is probably not a viable option, but some mutated form of it might be (e.g. using XPaths in place of column names, somewhat like XQuery).

Questions/comments:

  1. Expressiveness of each possibility. Is it possible to formulate the necessary queries with the language?
  2. Implementation - how hard is it going to be to implement the language? We have a variety of data stores that use a variety of native languages. An EQI service instance must be able to parse the filter it receives and build an equivalent query in its native query language.
  3. Verbosity. Not really a big restriction, but something to keep in mind. How important is it that the language chosen is fairly concise?
  4. Must work with deeply structured XML records.

Low Level Data Retrieval

The "Get" method (directGet) is meant to return an object: an XML document, raster image, Excel spreadsheet, and so on.

Should this be defined as part of the EQI?

Should it simply be an element of records- e.g.:

<someRecord>
  ...
  <dataURL>http://someurl.com/data/mydata.xls</dataURL>
  ...
</someRecord>

Can the syntax of an RSID be used in some clever way to retrieve a specific record without the need to apply a search() operation first?

See Also

Additional resources that might be of interest:

WSDL Specification
http://www.w3.org/TR/wsdl
SOAP Specification
http://www.w3.org/TR/SOAP/
XQuery Specification
http://www.w3.org/TR/xquery/
XQuery Lite
http://phpxmlclasses.sourceforge.net/xquery_lite.html
XPath
http://www.w3.org/TR/xpath
XML Schema
http://www.w3.org/TR/xmlschema-0/