Note
This document is a DRAFT for comment. It describes the EcoGRID Query Interface, an API for accessing structured data in the EcoGRID of the SEEK project.
This document is concerned only with the process of data retrieval. It does not consider the editing or deletion of records.
The following use cases are derived from the document http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/docs/design/design-ecogrid-1.0.0.txt?rev=1.5&content-type=text/plain, with some additions.
UC1: User can search for data using a query that may contain any combination of keywords, spatial constraints, and other query terms, and receive a list of the data that match.
UC2: User can browse data using one or more inverted controlled vocabularies or ontologies
UC3: User can discover Data Providers that support a particular Federation Schema.
UC4: User can discover the Federation Schema supported by a Data Provider.
UC5: User can retrieve a count of records that match a particular filter.
UC6: User can retrieve records conformant to the Federation Schema or any valid restriction of the Federation Schema.
UC7: The system must provide a Result Set Identifier (RSID) that may be used as a handle by which a third party may retrieve the records that match the RSID.
UC8: A Data Provider must be able to decode an RSID and return the filter that was used to create the RSID.
A Data Provider is a service that provides access to a container of records. It is assumed that, regardless of the internal data storage mechanisms, all records exposed through a Data Provider shall be rendered in XML. Each record on a Data Provider conforms to a single Federation Schema. A subset of records is selected from the Data Provider by applying a filter, which constructs the set of matching records.
The filter matches terms with values and attributes of the records in the Data Provider record container. It is necessary that the exact and relative location of nodes within the record can be indicated in the filter.
Several Data Providers may reference a common Federation Schema, and form an Information Community. The same filter could be applied to each member of an Information Community to generate a set of all records from all Data Providers that match the filter.
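As a minimal sketch of this idea, the following Python applies one filter to every member of an Information Community and merges the matches. The provider names, the record layout, and the helper functions apply_filter and search_community are hypothetical; the filter is reduced to a single path/value pair and is not the filter syntax of the interface, which is still under discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory Data Providers. Every record is XML and is assumed
# to conform to the same (made-up) Federation Schema.
PROVIDERS = {
    "provider_a": [
        "<record><title>Juncus survey</title><keyword>wetland</keyword></record>",
        "<record><title>Soil cores</title><keyword>soil</keyword></record>",
    ],
    "provider_b": [
        "<record><title>Marsh birds</title><keyword>wetland</keyword></record>",
    ],
}


def apply_filter(records, path, value):
    """Return the records in which some node at `path` has text equal to `value`."""
    matches = []
    for xml_text in records:
        root = ET.fromstring(xml_text)
        if any(node.text == value for node in root.findall(path)):
            matches.append(xml_text)
    return matches


def search_community(providers, path, value):
    """Apply the same filter to every provider and merge the matching records."""
    merged = []
    for name in sorted(providers):
        merged.extend(apply_filter(providers[name], path, value))
    return merged


wetland = search_community(PROVIDERS, "keyword", "wetland")
titles = [ET.fromstring(r).findtext("title") for r in wetland]
```

The path argument stands in for the requirement that a filter can indicate the location of nodes within a record.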
Federation Schemas should support simple inheritance. For example, a ROOT Federation Schema could be defined that contains a simple, generic set of elements. Other Federation Schemas (such as an EML schema) could be derived from the ROOT schema and inherit all of its properties. The EML Information Community could then be searched using EML-specific attributes or ROOT attributes, and hence all EML Data Providers would also participate in the ROOT Information Community.
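One possible way to model this inheritance is sketched below in Python. The FederationSchema class, its method names, and the element names (title, author, taxon_name) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FederationSchema:
    """Hypothetical model of a Federation Schema with single inheritance."""
    name: str
    elements: frozenset
    parent: Optional["FederationSchema"] = None

    def all_elements(self):
        """Elements defined here plus everything inherited from ancestors."""
        inherited = self.parent.all_elements() if self.parent else frozenset()
        return inherited | self.elements

    def communities(self):
        """Names of every Information Community a provider of this schema joins."""
        names = [self.name]
        if self.parent:
            names.extend(self.parent.communities())
        return names


ROOT = FederationSchema("ROOT", frozenset({"title", "author"}))
EML = FederationSchema("EML", frozenset({"taxon_name"}), parent=ROOT)
```

Under this model a provider of EML records can be searched on any element returned by EML.all_elements(), and it takes part in both communities listed by EML.communities().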
The process of record retrieval is separate from the process of identifying a subset of matching records. This is required since a client should not have to retrieve the records in order to pass them to a third party. Instead, a token (RSID) is returned in response to a record selection process. This token can then be used by other systems to retrieve the actual data.
Records may be retrieved in their "natural" complete form, or may be transformed to another form that is compatible with the original structure. For example, rather than retrieving a very large record, a client may be interested only in the Author and Title elements of the record. The retrieval process must support the specification of a particular rendering of records, most probably through the use of XMLSchema restrictions on the records, or perhaps the application of an XSL transform on the record.
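The Author and Title example can be illustrated with a small Python sketch. It uses a plain projection of top-level elements rather than an XMLSchema restriction or an XSL transform, so it only approximates the mechanism proposed here; the function restrict and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def restrict(xml_text, keep):
    """Return a copy of the record containing only the top-level elements in `keep`."""
    root = ET.fromstring(xml_text)
    slim = ET.Element(root.tag)
    for child in root:
        if child.tag in keep:
            slim.append(child)
    return ET.tostring(slim, encoding="unicode")


# Made-up record: two small elements and one large payload.
full_record = (
    "<record><Author>Jones</Author><Title>Rushes</Title>"
    "<Data>very large payload</Data></record>"
)
summary = restrict(full_record, {"Author", "Title"})
summary_tags = [child.tag for child in ET.fromstring(summary)]
```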
Simple remote joins should be possible. For example, it should be possible to create a result set with something like the following pseudo-SQL applied against two hypothetical Information Communities:
SELECT * FROM EML_Information_Community WHERE (<EML:taxon_name> IN (SELECT <taxon:name> FROM Taxon_Information_Community WHERE (<taxon:classification>='Jones 1988' AND <taxon:Family>='Juncaceae')))
The subselect statement returns a set of records from the Taxon_Information_Community that contain just <taxon:name> values that match a hypothetical classification and Family. The resulting list of names is used to compare values in the (hypothetical) taxon_name element of the EML records held by the EML_Information_Community.
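The same join can be sketched in Python, assuming each Information Community is reduced to a list of flat records held in memory. The sample data and the helpers subselect_names and remote_join are hypothetical; a real implementation would issue the subselect against remote Data Providers.

```python
# Made-up records standing in for two Information Communities.
TAXON_COMMUNITY = [
    {"name": "Juncus effusus", "classification": "Jones 1988", "Family": "Juncaceae"},
    {"name": "Juncus bufonius", "classification": "Jones 1988", "Family": "Juncaceae"},
    {"name": "Carex nigra", "classification": "Jones 1988", "Family": "Cyperaceae"},
]
EML_COMMUNITY = [
    {"id": "eml-1", "taxon_name": "Juncus effusus"},
    {"id": "eml-2", "taxon_name": "Carex nigra"},
    {"id": "eml-3", "taxon_name": "Juncus bufonius"},
]


def subselect_names(taxa, classification, family):
    """Inner query: the set of taxon names for one classification and family."""
    return {
        t["name"]
        for t in taxa
        if t["classification"] == classification and t["Family"] == family
    }


def remote_join(eml_records, names):
    """Outer query: EML records whose taxon_name is in the subselect result."""
    return [r for r in eml_records if r["taxon_name"] in names]


names = subselect_names(TAXON_COMMUNITY, "Jones 1988", "Juncaceae")
joined_ids = [r["id"] for r in remote_join(EML_COMMUNITY, names)]
```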
The following methods are defined in the EcoGRID Query Interface (EQI).
RSID search (securityToken, filter)
recordArray retrieve (securityToken, RSID, start, number, format)
inventoryArray inventory (securityToken, conceptpath, [filter])
filter decodeRSID (RSID)
statusArray status (securityToken)
recordArray searchRetrieve (securityToken, filter, start, number, format)
<<anyType>> directGet (<<target specific parameters>>)
RSID search (securityToken, filter)
Applies a filter against a collection of records and returns an RSID that can be used later for retrieval of the records.
recordArray retrieve (securityToken, RSID, start, number, format)
Given an RSID, returns the records in a specified format. A page of records can be selected from the full set of records circumscribed by the RSID.
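The interplay of search and retrieve might look like the following Python sketch. Several things here are assumptions: the RSID is a random UUID string, the result set is held in memory on the provider, start is zero-based, the security token is accepted but not checked, and the format parameter is left out for brevity.

```python
import uuid

# Hypothetical record container of one Data Provider.
RECORDS = [
    {"id": i, "keyword": "wetland" if i % 2 == 0 else "soil"} for i in range(10)
]
_RESULT_SETS = {}  # RSID -> list of matching records


def search(security_token, filter_fn):
    """Apply the filter and return an RSID; no records are returned here."""
    rsid = uuid.uuid4().hex
    _RESULT_SETS[rsid] = [r for r in RECORDS if filter_fn(r)]
    return rsid


def retrieve(security_token, rsid, start, number):
    """Return one page of the result set identified by the RSID."""
    return _RESULT_SETS[rsid][start:start + number]


rsid = search("token", lambda r: r["keyword"] == "wetland")
page1 = retrieve("token", rsid, 0, 2)
page2 = retrieve("token", rsid, 2, 2)
```

Because only the RSID leaves the search call, it can be handed to a third party, who then calls retrieve without the original client ever downloading the records.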
inventoryArray inventory (securityToken, conceptpath, [filter])
Returns a list of unique values for a particular column and their number of occurrences.
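A minimal Python sketch of inventory follows, assuming that conceptpath is a simple element path understood by ElementTree and that the optional filter is a predicate on the parsed record. The sample records and the return type (a sorted list of value/count pairs) are made up for illustration, and the security token is omitted.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up records of one Data Provider.
RECORDS = [
    "<record><family>Juncaceae</family><site>A</site></record>",
    "<record><family>Juncaceae</family><site>B</site></record>",
    "<record><family>Cyperaceae</family><site>A</site></record>",
]


def inventory(records, conceptpath, record_filter=None):
    """Count the occurrences of each unique value found at `conceptpath`."""
    counts = Counter()
    for xml_text in records:
        root = ET.fromstring(xml_text)
        if record_filter is not None and not record_filter(root):
            continue  # record excluded by the optional filter
        for node in root.findall(conceptpath):
            counts[node.text] += 1
    return sorted(counts.items())


all_families = inventory(RECORDS, "family")
site_a_families = inventory(
    RECORDS, "family", lambda root: root.findtext("site") == "A"
)
```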
filter decodeRSID (RSID)
Given an RSID, this method should decode the RSID and return the filter that was used to generate it.
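This document does not say how an RSID is constructed. One option, sketched below in Python purely as an assumption, is to embed the filter in the RSID itself so that any Data Provider can decode it without shared state. The helper names and the filter layout are hypothetical.

```python
import base64
import json


def encode_rsid(filter_struct):
    """Build an RSID by serialising the filter (assumed to be JSON-compatible)."""
    payload = json.dumps(filter_struct, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_rsid(rsid):
    """Recover the filter that was used to create the RSID."""
    payload = base64.urlsafe_b64decode(rsid.encode("ascii"))
    return json.loads(payload.decode("utf-8"))


example_filter = {"path": "taxon/Family", "op": "eq", "value": "Juncaceae"}
example_rsid = encode_rsid(example_filter)
round_trip = decode_rsid(example_rsid)
```

The alternative is an opaque RSID that the issuing Data Provider looks up in its own store. That keeps the token short but means only the issuer can decode it.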
recordArray searchRetrieve (securityToken, filter, start, number, format)
Combines the search and retrieve methods to provide a single call interface to retrieve records. No RSID is generated by this method.
statusArray status (securityToken)
Returns a set of status records that indicate the load on the service.
<<anyType>> directGet (<<target specific parameters>>)
Provides a mechanism for directly retrieving a single object from the full set.
Note: Should this really be a method of the EQI? It seems more like a property of the data "object".
WSDL types define the data types (beyond the standard XMLSchema data types) that are referenced within the WSDL document. Types may appear as external XMLSchema documents which are brought into context with the import statement, or may be entire XMLSchema documents embedded within the WSDL Document in the wsdl:types section. The latter option is generally better supported by WSDL toolkits.
The following items require discussion for resolution.
There are a few options for specifying a filter for the search and inventory operations.
Query candidates include: XPath, XQuery, SQL and a query structure.
The "query structure" option means that the query (filter) is represented as a data structure (e.g. an XML document) rather than a particular query syntax.
SQL is probably not a viable option, but a modified form of it might be, e.g. using XPath expressions in place of column names (somewhat like XQuery).
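To make the "query structure" option more concrete, the Python sketch below represents a filter as nested data and evaluates it against an XML record. The node layout ("and"/"or" groups, leaf conditions with path, op and value) and the two operators are invented for this example and are not a proposal for the actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical query structure: "and"/"or" nodes hold sub-queries; leaves
# are conditions of the form {"path": ..., "op": ..., "value": ...}.
QUERY = {
    "and": [
        {"path": "classification", "op": "eq", "value": "Jones 1988"},
        {"or": [
            {"path": "Family", "op": "eq", "value": "Juncaceae"},
            {"path": "Family", "op": "eq", "value": "Cyperaceae"},
        ]},
    ]
}


def matches(root, query):
    """Evaluate a query structure against one parsed XML record."""
    if "and" in query:
        return all(matches(root, q) for q in query["and"])
    if "or" in query:
        return any(matches(root, q) for q in query["or"])
    values = [node.text or "" for node in root.findall(query["path"])]
    if query["op"] == "eq":
        return query["value"] in values
    if query["op"] == "contains":
        return any(query["value"] in v for v in values)
    raise ValueError(f"unsupported operator: {query['op']}")


rush = ET.fromstring(
    "<taxon><classification>Jones 1988</classification>"
    "<Family>Juncaceae</Family></taxon>"
)
grass = ET.fromstring(
    "<taxon><classification>Jones 1988</classification>"
    "<Family>Poaceae</Family></taxon>"
)
```

A structure like this can be validated and serialised (for instance as an XML document) without writing a parser for a query syntax, which is the main attraction of this option.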
Questions/comments:
The "Get" method is meant to return an object: an XML document, raster image, Excel spreadsheet, ...
Should this be defined as part of the EQI?
Or should it simply be an element of the records, e.g.:
<someRecord> ... <dataURL>http://someurl.com/data/mydata.xls</dataURL> ... </someRecord>
Can the syntax of an RSID be used in some clever way to retrieve a specific record without the need to apply a search() operation first?
Additional resources that might be of interest: