=============================
EcoGRID Query Interface (EQI)
=============================

.. contents::

Overview
========

.. note:: This document is a DRAFT for comment.

This document describes the EcoGRID Query Interface, an API for accessing
structured data in the EcoGRID of the SEEK project.

This document is concerned only with the process of data retrieval. It does
not consider editing or deletion of records.

Use Cases
=========

This is a set of use cases derived from the document
http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/docs/design/design-ecogrid-1.0.0.txt?rev=1.5&content-type=text/plain
with some additions.

UC1: User can search for data using a query that may contain any combination
of keywords, spatial constraints, and other query terms, and receive a list
of matching data.

UC2: User can browse data using one or more inverted controlled vocabularies
or ontologies.

UC3: User can discover Data Providers that support a particular Federation
Schema.

UC4: User can discover the Federation Schema supported by a Data Provider.

UC5: User can retrieve a count of records that match a particular filter.

UC6: User can retrieve records conformant to the Federation Schema or any
valid restriction of the Federation Schema.

UC7: The system must provide a Result Set Identifier (RSID) that may be used
as a handle by which a third party may retrieve the records that match the
RSID.

UC8: A Data Provider must be able to decode an RSID and return the filter
that was used to create the RSID.

Glossary
========

EcoGRID
    The "data layer" of the SEEK project. A distributed network of data
    sources that expose a common programmatic interface and conform to a
    common authentication scheme.

Data Provider
    A service that exposes the EcoGRID Query interface.
    A Data Provider exposes a single data source, although an implementation
    may use URL parameters to provide access to more than one data set
    through the same service instance.

Record
    A record is a chunk of data that conforms to the Federation Schema used
    to describe the content accessible through a Data Provider.

Federation Schema
    An XMLSchema document that defines a record structure that may be shared
    by multiple Data Providers. ``Federation Schemas`` should support at
    least simple inheritance.

Result Set Identifier (RSID)
    An RSID is a handle that identifies a set of records available for
    retrieval from a Data Provider. An RSID will generally be created by
    applying a filter against the set of records available from a Data
    Provider. An RSID may be used by a third party to retrieve data based on
    an original query. For example, a web browser EcoGrid client might submit
    a query that resolves to a large set of records. Instead of returning
    that data to the client to be submitted to some analysis application, the
    Data Provider returns an RSID, which the client then passes to the
    analysis application. The analysis application contacts the Data Provider
    with the RSID to retrieve the data, so the data never has to pass through
    the client.

Information Community
    The collection of ``Data Providers`` that conform to a single
    ``Federation Schema``.

Remote Join
    The use of a record set from one ``Data Provider`` to identify a set of
    matching records on another ``Data Provider``. For example, a set of
    records from a Taxon Information Community with elements might be used in
    a query against an EML Information Community. The result would be the set
    of EML records with taxonomic information that matches the set of
    taxonomic names retrieved from the taxonomic Data Providers.

Data Source Model
=================

A ``Data Provider`` is a service that provides access to a container of records.
It is assumed that, regardless of the internal data storage mechanisms, all
records exposed through a ``Data Provider`` shall be rendered in XML. Each
record on a ``Data Provider`` conforms to a single ``Federation Schema``.

The process of selecting a subset of records from the ``Data Provider`` is
performed by applying a ``filter`` to construct a set of matching records.
The ``filter`` matches terms with values and attributes of the records in the
Data Provider record container. The filter must be able to indicate both the
exact and the relative location of nodes within a record.

Several ``Data Providers`` may reference a common ``Federation Schema`` and
so form an ``Information Community``. The same filter could be applied to
each member of an ``Information Community`` to generate the set of all
records from all Data Providers that match the filter.

``Federation Schemas`` should support simple inheritance. For example, a ROOT
``Federation Schema`` could be defined that contains a simple, generic set of
elements. Other ``Federation Schemas`` (such as an EML schema) could be
derived from the ROOT schema and inherit all of its properties. The EML
``Information Community`` could then be searched using EML-specific
attributes or ROOT attributes, and hence all EML ``Data Providers`` would
also participate in the ROOT ``Information Community``.

The process of record retrieval is separate from the process of identifying a
subset of matching records. This is required since a client should not have
to retrieve the records in order to pass them to a third party. Instead, a
token (RSID) is returned in response to a record selection process. This
token can then be used by other systems to retrieve the actual data.

Records may be retrieved in their "natural" complete form, or may be
transformed to another form that is compatible with the original structure.
For example, instead of retrieving a very large record, a client may be
interested only in information such as the Author and Title elements of the
record. The retrieval process must support the specification of a particular
rendering of records, most probably through the use of XMLSchema restrictions
on the records, or perhaps the application of an XSL transform on the record.

Simple ``remote joins`` should be possible. For example, it should be
possible to create a result set with something like the following pseudo SQL
applied against two hypothetical Information Communities::

    SELECT * FROM EML_Information_Community
      WHERE ( IN (SELECT FROM Taxon_Information_Community
        WHERE (='Jones 1988' AND ='Juncaceae')))

The subselect statement returns a set of records from the
Taxon_Information_Community that contain just values that match a
hypothetical classification and Family. The resulting list of names is
compared against the values in the (hypothetical) taxon_name element of the
EML records held by the EML_Information_Community.

Methods
=======

The following methods are defined in the EcoGRID Query Interface.

::

    RSID search (securityToken, filter)
    recordArray retrieve (securityToken, RSID, start, number, format)
    inventoryArray inventory (securityToken, conceptpath, [filter])
    filter decodeRSID (RSID)
    statusArray status (securityToken)
    recordArray searchRetrieve (securityToken, filter, start, number, format)
    <> directGet (<>)

search()
--------

::

    RSID search (securityToken, filter)

Applies a filter against a collection of records and returns an RSID that can
be used later for retrieval of the records.

.. image:: http://tsadev.speciesanalyst.net/graphviz/dot.php?dot=http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/projects/ecogrid/docs/QueryInterface/eqi_search.gdot

retrieve()
----------

::

    recordArray retrieve (securityToken, RSID, start, number, format)

Given an RSID, returns the records in a specified format.
A page of records can be selected from the full set of records circumscribed
by the RSID.

.. image:: http://tsadev.speciesanalyst.net/graphviz/dot.php?dot=http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/projects/ecogrid/docs/QueryInterface/eqi_retrieve.gdot

inventory()
-----------

::

    inventoryArray inventory (securityToken, conceptpath, [filter])

Returns a list of unique values for a particular column and their number of
occurrences.

decodeRSID()
------------

::

    filter decodeRSID (RSID)

Given an RSID, this method should decode the RSID and return the filter that
was used to generate it.

searchRetrieve()
----------------

::

    recordArray searchRetrieve (securityToken, filter, start, number, format)

Combines the search and retrieve methods to provide a single call interface
to retrieve records. No RSID is generated by this method.

status()
--------

::

    statusArray status (securityToken)

Returns a set of status records that indicate the load on the service.

directGet()
-----------

::

    <> directGet (<>)

Provides a mechanism for directly retrieving a single object from the full
set.

Note: should this really be a method of the EQI? Seems like a property of the
data "object".

Types
=====

WSDL types define the data types (beyond the standard XMLSchema data types)
that are referenced within the WSDL document. Types may appear as external
XMLSchema documents which are brought into context with the ``import``
statement, or may be entire XMLSchema documents embedded within the WSDL
Document in the ``wsdl:types`` section. The latter option is generally better
supported by WSDL toolkits.

filter
    The ``filter`` structure is not yet fully defined. It is not yet
    determined if this will be an actual structure or simply a ``String``.

securityToken
    The ``securityToken`` type is a place holder for an as yet undefined
    structure. The intent is that this structure will provide the
    authentication context in which the operation should be invoked.

RSID
    The ``RSID`` is a Result Set Identifier. The actual structure of an RSID
    will likely be defined by individual data providers. This may be changed
    to the simple ``strings`` type in later revisions.

record
    A ``record`` is used to contain record data retrieved from a data
    provider. This is a complex type that can contain any valid XML data.

inventoryRecord
    A type of ``record`` that is common across all data providers and
    generated in response to the ``inventory`` method.

format
    The ``format`` type is used to indicate the structure of data that is to
    be returned to a client of a Data Provider. The type will contain any
    valid XMLSchema document or parsable XMLSchema fragment.

Discussion Items
================

The following items require discussion for resolution.

Query Syntax
------------

There are a few options for specifying a filter for the ``search`` and
``inventory`` operations. Query candidates include: XPath, XQuery, SQL and a
query structure. The "query structure" option means that the query (filter)
is represented as a data structure (e.g. an XML document) rather than in a
particular query syntax. SQL is probably not a viable option, but some
mutated form of it might be (e.g. using XPaths in place of column names,
somewhat like XQuery).

Questions/comments:

1. Expressiveness of each possibility. Is it possible to formulate the
   necessary queries with the language?

2. Implementation: how hard is it going to be to implement the language? We
   have a variety of data stores that use a variety of native languages. An
   EQI service instance must be able to parse the filter it receives and
   build something that works with the native query language.

3. Verbosity. Not really a big restriction, but something to keep in mind.
   How important is it that the language chosen is fairly concise?

4. Must work with deeply structured XML records.

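
To make the "query structure" option and implementation question 2 more
concrete, the following is a minimal sketch, not a proposed design. It assumes
a filter is a list of AND-ed terms, each naming a direct child element of a
record and the value it must equal, and it translates that structure into the
small XPath subset understood by Python's ``xml.etree.ElementTree``. The
names ``Term``, ``filter_to_xpath`` and ``search``, and the element names
``record``, ``classification``, ``family`` and ``name``, are hypothetical and
do not come from the EQI. The function ``search`` here returns matching
elements directly; it does not model the RSID returned by the EQI ``search``
method.

.. code:: python

    import xml.etree.ElementTree as ET
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class Term:
        """One filter term (hypothetical): child `element` must equal `value`."""
        element: str  # tag name of a direct child of the record
        value: str    # text content the child must have


    def filter_to_xpath(record_tag, terms):
        """Translate AND-ed terms into an ElementTree-compatible XPath string."""
        predicates = []
        for term in terms:
            # The ElementTree XPath subset has no quote escaping, so values
            # containing a single quote are rejected in this sketch.
            if "'" in term.value:
                raise ValueError(f"unsupported quote in value: {term.value!r}")
            predicates.append(f"[{term.element}='{term.value}']")
        # Chained predicates act as a logical AND.
        return f"./{record_tag}" + "".join(predicates)


    def search(container, record_tag, terms):
        """Apply the filter to a record container and return matching records."""
        return container.findall(filter_to_xpath(record_tag, terms))


    # Hypothetical record container standing in for a Data Provider's records.
    RECORDS = ET.fromstring("""
    <records>
      <record>
        <classification>Jones 1988</classification>
        <family>Juncaceae</family>
        <name>Juncus effusus</name>
      </record>
      <record>
        <classification>Smith 1990</classification>
        <family>Juncaceae</family>
        <name>Juncus bufonius</name>
      </record>
      <record>
        <classification>Jones 1988</classification>
        <family>Poaceae</family>
        <name>Poa annua</name>
      </record>
    </records>
    """)

    taxon_filter = [Term("classification", "Jones 1988"),
                    Term("family", "Juncaceae")]
    print(filter_to_xpath("record", taxon_filter))
    print([r.findtext("name") for r in search(RECORDS, "record", taxon_filter)])

A structured filter of this kind is easy for a service to validate and to
translate into whatever native query language sits behind it, but the sketch
covers only equality tests on direct children. Deeply structured records
(question 4) would need richer path handling than this subset offers.
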

Low Level Data Retrieval
------------------------

The "Get" method is meant to return an object: an XML document, raster image,
Excel spreadsheet, ... Should this be defined as part of the EQI? Should it
simply be an element of records, e.g.::

    ... http://someurl.com/data/mydata.xls ...

Can the syntax of an RSID be used in some clever way to retrieve a specific
record without the need to apply a search() operation first?

See Also
========

Additional resources that might be of interest:

WSDL Specification
    http://www.w3.org/TR/wsdl

SOAP Specification
    http://www.w3.org/TR/SOAP/

XQuery Specification
    http://www.w3.org/TR/xquery/

XQuery Lite
    http://phpxmlclasses.sourceforge.net/xquery_lite.html

XPath
    http://www.w3.org/TR/xpath

XML Schema
    http://www.w3.org/TR/xmlschema-0/