[seek-dev] implement XQuery in EcoGrid and SRB

Matt Jones jones at nceas.ucsb.edu
Fri Jun 6 10:04:54 PDT 2003


Bing,

In earlier discussions about the ecogrid query language, we agreed that 
we wanted to support multiple underlying metadata models.  So, for 
example, both EML and Darwin Core.  The syntax we proposed before 
explicitly allowed one to reference the model in the query.  For 
example, in the listing below, we can explicitly differentiate 
conditions that match EML models and Dublin Core models:

<egq:query queryId="test.1.1" system="test"
     xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0alpha1">

   <namespace prefix="eml" space="eml://ecoinformatics.org/eml-2.0.0"/>
   <namespace prefix="dc"  space="dublinCoreNamespaceURI"/>
   <title>Soils metadata query</title>
   <AND>
     <OR>
       <condition operator="LIKE" concept="eml:title">%soil%</condition>
       <condition operator="LIKE" concept="dc:title">%soil%</condition>
     </OR>
     <OR>
       <condition operator="LIKE"
                        concept="eml:surName">%Jones%</condition>
       <condition operator="LIKE"
                        concept="dc:Creator">%Vieglais%</condition>
     </OR>
   </AND>
</egq:query>
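
To make the role of the namespace declarations concrete, here is a minimal Python sketch of how a node might group the conditions in the listing above by metadata model. The function name conditions_by_model is hypothetical, and the sketch assumes each concept has the form prefix:name with the prefix declared in a <namespace> element. It is an illustration, not part of the ecogrid spec.

```python
import xml.etree.ElementTree as ET

# The query from the listing above, with the wrapped conditions joined.
QUERY = """<egq:query queryId="test.1.1" system="test"
     xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0alpha1">
   <namespace prefix="eml" space="eml://ecoinformatics.org/eml-2.0.0"/>
   <namespace prefix="dc"  space="dublinCoreNamespaceURI"/>
   <title>Soils metadata query</title>
   <AND>
     <OR>
       <condition operator="LIKE" concept="eml:title">%soil%</condition>
       <condition operator="LIKE" concept="dc:title">%soil%</condition>
     </OR>
     <OR>
       <condition operator="LIKE" concept="eml:surName">%Jones%</condition>
       <condition operator="LIKE" concept="dc:Creator">%Vieglais%</condition>
     </OR>
   </AND>
</egq:query>"""


def conditions_by_model(query_xml):
    """Group (concept, operator, value) triples by metadata model URI.

    Hypothetical helper: assumes every concept is written as prefix:name
    and that the prefix is declared in a <namespace> element.
    """
    root = ET.fromstring(query_xml)
    # Map each declared prefix (e.g. "eml") to its model URI.
    prefixes = {ns.get("prefix"): ns.get("space")
                for ns in root.findall("namespace")}
    grouped = {}
    for cond in root.iter("condition"):
        prefix, _, local_name = cond.get("concept").partition(":")
        triple = (local_name, cond.get("operator"), cond.text)
        grouped.setdefault(prefixes[prefix], []).append(triple)
    return grouped


if __name__ == "__main__":
    for model, conditions in conditions_by_model(QUERY).items():
        print(model, conditions)
```

A node that only understands one model could then ignore the groups for the others. How the AND/OR structure is evaluated is left out of the sketch.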

I agree with Dave that this is not a feature that we should lose if we 
decide to move to an XQuery-based syntax.  A single common schema is 
simply not realistic.  Here's an XQuery example that preserves this 
capability, directly copied from the XQuery Use Case spec:

   declare namespace xlink = "http://www.w3.org/1999/xlink"
   <Q4>
     {
       for $hr in input()//@xlink:href
       return <ns>{ $hr }</ns>
     }
   </Q4>

We did discuss adding, at a later date, a query translation service that 
makes use of the SMS to map a query expressed in terms of one namespace 
(e.g., EML) to a query in another namespace (e.g., Dublin Core), so that 
repositories that only support metadata in one format might still be 
able to respond to the query.  But that is independent of the query 
language design, I think, and belongs to a second phase of 
design.
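
As a rough illustration of what such a translation step could look like, here is a Python sketch. Everything in it is an assumption: the mapping table is hard-coded as a stand-in for whatever the SMS would actually provide, the two correspondences are just the concept pairs from the earlier query listing, and the function names are hypothetical.

```python
# Placeholder for mappings that would come from the SMS. The two entries
# are taken from the paired conditions in the earlier query listing and
# are illustrative only.
CONCEPT_MAP = {
    ("eml", "dc"): {
        "eml:title": "dc:title",
        "eml:surName": "dc:Creator",
    },
}


def translate_concept(concept, source, target, concept_map=CONCEPT_MAP):
    """Return the target-model concept, or None if no mapping is known."""
    return concept_map.get((source, target), {}).get(concept)


def translate_conditions(conditions, source, target):
    """Translate (concept, operator, value) triples between models.

    Returns the translated triples plus the list of concepts that could
    not be mapped, so the caller can decide whether to run a partial query.
    """
    translated, unmapped = [], []
    for concept, operator, value in conditions:
        mapped = translate_concept(concept, source, target)
        if mapped is None:
            unmapped.append(concept)
        else:
            translated.append((mapped, operator, value))
    return translated, unmapped


if __name__ == "__main__":
    print(translate_conditions(
        [("eml:title", "LIKE", "%soil%")], "eml", "dc"))
```

The sketch reports unmapped concepts rather than dropping them silently, since a repository that answers only part of a query should probably say so.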

On another issue, I think we need a better mechanism than a single 
'virtual' document to represent the various repositories.  The EcoGrid 
registry will presumably have a (hopefully large) list of ecogrid nodes 
and their corresponding capabilities (e.g., can export Darwin Core 
records).  This registry should be able to be used to dynamically 
determine which nodes to include in a query.  So, our query syntax might 
need to support a mechanism for naming the nodes against which a query 
should be run.  In our original spec we said the clients would simply 
decide and invoke the query interface for each node at the right URL. 
But the "IN" clause of XQuery is very closely related to this.  In 
XQuery the concept of an "Input Sequence" is implementation defined (see 
section 2.2 of the XQuery 1.0 spec).  This means that each ecogrid node 
can decide how to implement the XQuery input sequence functions 
(fn:input(), fn:collection(), and fn:doc()).  We'd need to explore how 
these functions would be used to bind nodesets to queries for several 
types of systems implementing the ecogrid interface (such as srb and 
metacat and digir).
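
To sketch the registry idea in code: the following Python fragment assumes a registry that lists, for each node, the metadata models it can answer queries in, and selects the nodes that support at least one model named in a query. The NodeEntry structure, the node names, the example.org URLs, and the model labels are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeEntry:
    """Hypothetical registry record for one ecogrid node."""
    name: str
    query_url: str
    models: frozenset  # metadata models the node can answer queries in


# Placeholder entries; a real registry would be populated dynamically.
REGISTRY = [
    NodeEntry("metacat-node", "http://metacat.example.org/ecogrid",
              frozenset({"eml"})),
    NodeEntry("digir-node", "http://digir.example.org/ecogrid",
              frozenset({"darwin-core"})),
    NodeEntry("srb-node", "http://srb.example.org/ecogrid",
              frozenset({"eml", "dublin-core"})),
]


def select_nodes(registry, query_models):
    """Return nodes supporting at least one model used by the query."""
    wanted = frozenset(query_models)
    return [node for node in registry if node.models & wanted]


if __name__ == "__main__":
    for node in select_nodes(REGISTRY, {"eml", "dublin-core"}):
        print(node.name, node.query_url)
```

Whether this selection happens in the client or behind an input-sequence function such as fn:collection() is exactly the open design question.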

For example, I think the "fn:collection()" function is closest in spirit 
to what Bing is trying to accomplish in his example.  One could imagine 
an XQuery like this:
   declare namespace srb = "SRBMetadataURI"
   for $e in
     collection("srb://srb1.sdsc.edu//srb/home/bzhu.sdsc/designDocs")
   where $e/@srb:objtype = "file" and $e/@srb:time lt date("09-23-2002")
   return <dataname>{ data($e/@srb:name) }</dataname>

This would allow precise specification of the srb network to hit, which 
collection to search, and the namespace bindings for metadata query 
semantics and resultset semantics.
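
For instance, a node implementing collection() would first have to take the URI apart. Here is a small Python sketch of that step. The function name is hypothetical, and treating the doubled slash after the host as a plain path separator is an assumption on my part; nothing here defines the srb:// scheme.

```python
from urllib.parse import urlsplit


def split_collection_uri(uri):
    """Split an srb:// collection URI into (host, collection path).

    Assumption: the host identifies the SRB server to contact and the
    rest of the URI is the collection path, with any run of leading
    slashes collapsed to one.
    """
    parts = urlsplit(uri)
    if parts.scheme != "srb" or not parts.hostname:
        raise ValueError("not an srb:// collection URI: %r" % uri)
    collection = "/" + parts.path.lstrip("/")
    return parts.hostname, collection


if __name__ == "__main__":
    print(split_collection_uri(
        "srb://srb1.sdsc.edu//srb/home/bzhu.sdsc/designDocs"))
```

Other systems (metacat, digir) would need their own convention for what the argument to collection() names.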

Of course, this introduces substantial implementation overhead for 
ecogrid implementors.  I'm still not convinced that going with XQuery is 
a smart thing to do for us at this early stage of EcoGrid design.  This 
will certainly scare off most other implementors (e.g., we'll probably 
be limited to srb, metacat, xanthoria, and digir implementations, which 
isn't really our goal).  Our main problem is in finding or designing a 
query language that doesn't require tremendous rewriting of existing 
systems to accommodate new features that they don't already support.

Matt

Dave Vieglais wrote:
> Hi Bing,
> 
> Your argument and examples below assume that all data sources 
> contributing to the ecogrid shall conform to a common schema, which 
> seems overly restrictive even though it does make building XQuery 
> statements a bit easier since you are then working with a single schema.
> 
> I'm not sure that this is the intent of the ecogrid implementation, but 
> please correct me if I'm mistaken.
> 
> regards,
>   Dave V.
> 
> 
> Bing Zhu wrote:
> 
>> Peter and Jing,
>>
>> The whole EcoGrid data stored in metacat, SRB, (others), can be viewed as a
>> virtual XML document (or a DOM object). This DOM object has at least two
>> sub-nodes, metacat and srb.
>>
>> Since SRB is organized as a hierarchy of collections and sub-collections,
>> each collection is a subnode in the XML tree under (/EcoGrid)/srb. (Actually,
>> we can define an XML schema for our EcoGrid data.) Thus we can implement
>> XQueries in our EcoGrid to act as a common query engine across different
>> systems (metacat, SRB, etc.).
>>
>> I compiled some examples of XQueries to search documents in SRB.
>>
>> (1)    Search all design documents stored in SRB collection
>>       /home/bzhu.sdsc/designDocs that were created before Sept 23, 2002.
>>
>>             for $e in document("EcoGrid.xml")/srb/home/bzhu.sdsc/designDocs
>>             where $e/@objtype = "file" and $e/@time lt date("09-23-2002")
>>             return (<datasrc>SRB</datasrc>,
>>                     <dataname>{ data($e/@name) }</dataname>)
>>
>> (2)    Find all datasets stored in SRB that have titles containing “protein”
>>       and are owned by Professor John, whose user name is john.
>>
>>             for $e in document("EcoGrid.xml")/srb/home//*
>>             where $e/@objtype = "file" and
>>                   contains($e/@name, "protein") and
>>                   contains($e/@owner, "john")
>>             return (<datasrc>SRB</datasrc>,
>>                     <dataname>{ data($e/@name) }</dataname>)
>>
>>
>> Jing, would you provide some examples (or info) regarding searching in
>> metacat?
>>
>> We can also start by designing an XML schema for the whole EcoGrid data
>> model, based roughly on the following.
>>
>> EcoGrid
>>      Metacat
>>      SRB
>>         collection (attribute: objtype, time, owner, …)
>>              dataset (attribute: objtype, time, owner, size, container, resource, …)
>>                   user-defined metadata
>>      other data source
>>
>>
>> Cheers,
>> Bing
>>
>> =====================================================
>> Bing Zhu
>> San Diego Supercomputer Center
>> bzhu at sdsc.edu
>> (858)534-8373
>> =====================================================
>>
>>
> 

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------



