#
# Copyright: 2003 Partnership for Biodiversity Informatics and
#            The Regents of the University of California
#
# '$RCSfile: ecogrid-query-interface.txt,v $'
# '$Revision: 1.1 $'
# '$Author: vieglais $'
# '$Date: 2003-04-04 00:58:33 $'

SEEK EcoGrid Query Interface
----------------------------

1. Use XML document specifying XPath queries to specify query terms
   * This is likely to be a query structure in XML similar to the
     capabilities in Metacat/Xanthoria pathquery documents. DiGIR is
     different, and allows the query schema to be specified on a per-node
     basis. This introduces irregularities in what can be queried, but is
     similar to the ability to query based on EML fields. Might want to
     consider how to handle this in a hybrid approach.
   * Consider the other features such as joins that are provided in
     XQuery, and strive towards being XQuery compliant
   * Query can specify parameters of result set and format of result set
     (e.g., XML as SOAP msg or streamed XML, how many records,
     resultsetid, continuation index, expiration time, etc)
   * Use WSDL/SOAP for passing queries, with the option for out-of-band
     resultset transfers
   a) define schema for queries
   b) define schema/approach for result sets

2. Do we query/process/analyze opaque data objects?
   * Probably defer to the AMS for this sort of processing, but need to
     coordinate such that the efficiency is maintained
   * AMS interfaces could be implemented on same nodes as EcoGrid
     interfaces to allow this coordination

3. API/Service definition/WSDL
   * general, and can be extended over time
   * start with very high level interfaces (e.g., put/get), advance to
     interfaces with finer control (read/write/seek)
   * depends on access and authentication interfaces as well as those
     below
   * Interfaces needed:
     Lowlevel API
     a) query(String xmlQueryDocument): ResultSetIdentifierStruct (RSIDS)
        RSIDS = Description of a set of records that match the query.
        An RSIDS can be sent to any eco-grid component that understands an
        RSIDS, and can be used to subsequently manipulate the resultset
        contents.
     b) transferRecords(RSIDS, startrec, maxrecs, recordStruct, [to])
        Transfers a set of records (identified by RSIDS) to the caller or
        [to]. startrec is the (zero based) index of the first record.
        maxrecs is the maximum number of records to transfer.
        recordStruct specifies a projection of the records.
     x) read(RecordIdentifier)
        Returns an exact copy of the specified object.
     c) transformRecords(...)
     [ d) query(String xmlQueryDocument, startrec, maxrecs, recordStruct,
          [to]): ResultSet ]
     e) DiGIRResponse DiGIR:query(String resource,
          FilterStruct xmlQueryStruct, RecordStruct recstr,
          Integer startrec, Integer maxrecs, Bool countrecs)

4. Implementation of wrappers: notes for high-priority systems
   a) srb: xml query doc gets parsed and SRB "getmetainfo" is called with
      parsed parameters; this might need to be multiple SRB calls
   b) metacat
   c) xanthoria
   d) digir

5. How do we handle session management?

6. Authentication
   * need for shared auth api
   * closely related to session management
   * build community around this
   * probably should use GSI, which will allow us to use the existing
     LDAP infrastructure we have in place, possibly with the addition of
     fields in the schema for one or more certificates
   * need to consider how to set up a certificate management
     infrastructure, and establish trusted referral

7. Access control
   * granularity issues
   * metadata acls versus data acls

8. How do we handle the distributed model?
   * Do clients know about distributed nodes? Probably not, as that would
     mean we can't alter the backend optimization of queries. One of our
     requirements is that most common queries return within a few
     seconds, which will likely require some sort of optimization, even
     if nodes are distributed.
     In addition, failover requires that node metadata and data be
     replicated (at least optionally), and so the EcoGrid will need to
     coordinate the dispatch of queries to particular nodes, and possibly
     integrate result sets.
   * Nodes can register with the node registry, and can indicate node
     metadata that specifies:
     a) which EcoGrid interface clusters are available on the node
     b) computational, storage, and network capabilities (or this could
        be an API unto itself)
   * Need to determine how to establish trust networks among nodes,
     especially with respect to the interaction between replication,
     access control, and authentication. See the model used in metacat,
     with a 'hub' concept that is distinct from pairwise trust
     relationships between nodes.

---

Registry Notes

Any node can:
  * provide, store data
    - return information [in format x?]
  * computation "service"
    - using cycles on machine for running arbitrary code
  * registered analytical services
    - service is resident on machine and is called remotely

Registry must include information about node capabilities
  - processing power, bandwidth, % usage, ... to help decide where an
    operation should be performed
  - bandwidth is a property of an edge in the network rather than a node

Interface definitions are stored in registry (or location of interface
definition)
  - language for interface definition (WSDL, IDL, ...)?

Service (internet) locations are stored in registry, and the service
record in the registry must contain a reference to an interface definition
  - promote interface reuse

Registry provides a mechanism to generate a "system identifier". A local
ID can be combined with the system identifier (e.g. assigned to a site) to
create a GUID for a local object.
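
Sketch of the registry notes above (illustration only, nothing here is
decided). It assumes an in-memory registry that hands out system
identifiers, requires every service record to point at a registered
interface definition, and builds a GUID as "<system identifier>:<local ID>".
The class name, method names, identifier format, and example URLs are all
hypothetical.

```python
class Registry:
    """Toy in-memory registry; every name here is an assumption."""

    def __init__(self):
        self._next_site = 1
        self.interfaces = {}  # interface name -> location of its definition
        self.services = {}    # service location -> interface name

    def new_system_identifier(self):
        # Assumed format: a counter-based site identifier.
        system_id = f"ecogrid-site-{self._next_site}"
        self._next_site += 1
        return system_id

    def register_interface(self, name, definition_url):
        self.interfaces[name] = definition_url

    def register_service(self, location, interface_name):
        # A service record must reference a known interface definition,
        # which is what promotes interface reuse across nodes.
        if interface_name not in self.interfaces:
            raise KeyError(f"unknown interface definition: {interface_name}")
        self.services[location] = interface_name


def make_guid(system_id, local_id):
    # Assumed convention: system identifier and local ID joined by ':'.
    return f"{system_id}:{local_id}"


registry = Registry()
site_id = registry.new_system_identifier()
registry.register_interface("query", "http://example.org/ecogrid-query.wsdl")
registry.register_service("http://node.example.org/ecogrid", "query")
guid = make_guid(site_id, "dataset.42")
```

Two sites that both hold a local object "dataset.42" get distinct GUIDs
because their system identifiers differ.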
=============================
Eco-Grid Meeting Notes
02 April 2003

Basic Functionality
  * Define a simple API for search and retrieval
  * Transport independent specification

Replication
  * Need to keep track of where data is being replicated, and of accesses
    to replicated data sets

GRID Activities
  * Need to maintain alignment with GRID activities that have
    requirements similar to those of Eco-grid + SEEK

Authentication
  * Authentication needs to be associated with datasets (including
    replicated data) (who can access which resource / (record / column)
    sub-element ?)
  * Should provide functionality for read + write access control.
    Public/unrestricted access should not be treated differently from
    authenticated access (i.e. open access = authenticated to "open"
    group / user).

Query Structure
  . .
  triplet = concept (column) + term + comparison operator

  Example 1:
    /*:orig/ns2:title LIKE My Data %
    /ns:orig/ns:title LIKE My Data %
    /orig/title LIKE My Data %
    /*:orig/*:year EQUALS 1988
    /ns:orig/*:year LESS THAN 1940
    /ns:orig/ns2:year LESS THAN 1940

  operators
    =, <, >, !=, <=, >=, LIKE, NOT LIKE, EXISTS, NOT EXISTS, IN, NOT IN,
    (MATCH, ...)

  * schema
  * projection

  My Data % ... ...

Tasks
  Matt and Jing -- Schema document for query, read, RSIDS, structure,
                   with definitions and examples.
  Dave + Raja -- Static structure UML diagram for services
  Dave + Raja -- Collaboration diagram for services
  ??? -- WSDL Document for Interface ()
  Matt and Jing -- Review possible registry implementations - UDDI, LDAP,
                   Globus, ...
  Jing -- Experiment with a simple implementation
    * http://www.systinet.com/products/wasp_uddi/overview
    * jUDDI (http://www.juddi.org/ , was built by bowstreet.com, now
      appears to be SF project)
    * http://www.themindelectric.com/glue/index.html (supports LDAP
      authentication)
  Matt -- Get GSI certificates working within the SOAP authentication
          scheme (with LDAP | SRB accounts)

Protocol
  The protocol will be SOAP over HTTP

Timing
  April 4
  April 11 -- Design Diagrams
  April 18 -- WSDL, Registry instance operational, query + read, RSIDS
              schema and examples.
  April 25
  May 2
  May 9    Wrapper implementations + test client(s)
  May 16   (SEEK Technical WG meeting)
  May 23
  May 30 -- Hard deadline for implementation of Eco-GRID alpha 1

Authentication
  * LDAP
  * SRB, GSI (http://www.globus.org/security/)

See Also:
  http://soapclient.com/soapsecurity.html
  http://www.w3.org/TR/SOAP-dsig/
  http://www.ietf.org/internet-drafts/draft-bergeson-uddi-ldap-schema-01.txt
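
Sketch of the query triplet from the 02 April notes (concept + comparison
operator + term), as one possible starting point for the query schema work.
It assumes each triplet is written on one line as "<path> <operator> <term>",
that the spelled-out forms EQUALS and LESS THAN from Example 1 are accepted
alongside the listed operators, and that EXISTS / NOT EXISTS take no term.
The Triplet type and parse_triplet function are hypothetical names, not part
of any agreed interface.

```python
from typing import NamedTuple

# Operators from the notes plus the spelled-out forms used in Example 1.
# Sorted longest first so "NOT LIKE" is tried before "LIKE", "<=" before "<".
OPERATORS = sorted(
    ["=", "<", ">", "!=", "<=", ">=", "LIKE", "NOT LIKE", "EXISTS",
     "NOT EXISTS", "IN", "NOT IN", "EQUALS", "LESS THAN"],
    key=len,
    reverse=True,
)


class Triplet(NamedTuple):
    concept: str   # path to the queried element, e.g. /*:orig/ns2:title
    operator: str  # comparison operator
    term: str      # value compared against; empty for EXISTS / NOT EXISTS


def parse_triplet(line: str) -> Triplet:
    """Split '<path> <operator> <term>' into its three parts."""
    concept, _, rest = line.strip().partition(" ")
    rest = rest.strip()
    for op in OPERATORS:
        if rest == op or rest.startswith(op + " "):
            return Triplet(concept, op, rest[len(op):].strip())
    raise ValueError(f"no known operator in: {line!r}")


title_query = parse_triplet("/*:orig/ns2:title LIKE My Data %")
year_query = parse_triplet("/ns:orig/*:year LESS THAN 1940")
```

Matching the operator only at the start of the remainder keeps terms such
as "My Data %" intact even when they contain spaces.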