[seek-dev] RE: resultset question

Bertram Ludaescher ludaesch at sdsc.edu
Wed Apr 28 09:30:01 PDT 2004


Peter:

Good points! 

Is there a tutorial or reference paper that describes the fundamental
principles in EML?

Bertram

>>>>> "PM" == Peter McCartney <peter.mccartney at asu.edu> writes:
PM> 
PM> I think it is clear that we need a joint ecogrid SMS/KR meeting. On the
PM> SMS side I'm still hearing some unfamiliarity with what information has
PM> been designed into EML (i.e., the low-level class distinctions between
PM> resource types are built into the extensible resource module). On the
PM> ecogrid side, I'm seeing a need to get some input on where to couple our
PM> existing data with the new semantic ontologies. Hopefully you will all
PM> get to talk about some of this in Scotland, which I unfortunately will
PM> miss. 
PM> 
PM> -----Original Message-----
PM> From: Bing Zhu [mailto:bzhu at sdsc.edu] 
PM> Sent: Tuesday, April 27, 2004 11:52 PM
PM> To: Shawn Bowers; Rod Spears
PM> Cc: Chad Berkley; Peter McCartney; Saritha Bhandarkar; seek-dev; Jing
PM> Tao; Matt Jones; Bertram Ludaescher
PM> Subject: RE: [seek-dev] RE: resultset question
PM> 
PM> 
PM> Ecogrid builds low-level tools for SEEK users to export, import and
PM> query datasets and metadata. For storing datasets or building a registry
PM> for "web service", "shape file", "source code", "PDF document",
PM> "ontology", etc., we need some data modeling work. For example, in SRB
PM> you can create and organize datasets in different collections. With this
PM> approach, we can have a web service registry collection in which we
PM> create and store datasets serving as our web service registry.
PM> 
PM> It seems to me that a perfect design for our Ecogrid resultset needs to
PM> use some knowledge from ontologies. I am not sure if it is appropriate
PM> to mix the ontology layer into the Ecogrid software layer.
PM> 
PM> Bing
PM> 
PM> 
PM> 
PM> -----Original Message-----
PM> From: seek-dev-admin at ecoinformatics.org
PM> [mailto:seek-dev-admin at ecoinformatics.org]On Behalf Of Shawn Bowers
PM> Sent: Monday, April 26, 2004 9:40 AM
PM> To: Rod Spears
PM> Cc: Chad Berkley; Peter McCartney; Saritha Bhandarkar; seek-dev; Jing
PM> Tao; Matt Jones; Bertram Ludaescher
PM> Subject: Re: [seek-dev] RE: resultset question
PM> 
PM> 
PM> 
PM> 
PM> Rod Spears wrote:
>> See comments below.... and please comment on my comments.
PM> 
PM> Comments below on your comments... (Thanks for responding to my original
PM> mail so quickly)
PM> 
PM> Also, you should comment on my comments on your comments ;-)
PM> 
>> 
>> Shawn Bowers wrote:
>> 
>>> On Fri, 23 Apr 2004, Rod Spears wrote:
>>> 
>>> [snip ...]
>>> 
>>> 
>>> 
>>>> What can be done to help generic consumers and SMS?
>>>> 
>>>> 
>>> 
>>> I have some opinions/observations about what Ecogrid can provide for 
>>> SMS (assuming you mean Semantic Mediation System). No one actually 
>>> asked for my opinion, but the door is opened by the question, and I 
>>> thought I'd barge in :-)
>>> 
>>> Here is what I see SMS needing. (Note that this might be a lot 
>>> different than what ecogrid actually intends to provide -- these items
>>> are more aligned with the architecture of traditional integration 
>>> systems and systems like those being developed for GEON.)
>>> 
>>> 1). Every resource registered in the Ecogrid should have a persistent,
>>> Ecogrid-relative unique identifier.
>>> 
>> Each does today. It has a unique name.
PM> 
PM> 
PM> I thought that it did.  This list of operations and data structures is
PM> just to say what SMS needs from Ecogrid -- I assumed much of this was
PM> implemented by Ecogrid already.
PM> 
PM> 
>>> 2). Every resource registered in the Ecogrid should fill in two 
>>> Ecogrid metadata tags (dublin-core style). The first is the type of 
>>> resource registered, e.g., the type could be "dataset", "web service",
>>> "shape file", "source code", "PDF document", "ontology", etc. (These 
>>> should be controlled values, i.e., come from a predefined list.)
>>> 
>> Dave and I were just talking about this. We hoped we could get by 
>> without an extra identifier. Meaning the "type" could be derived from 
>> the service's location (or the interfaces it implements). But maybe we
>> will need a simple field for easier identification.
PM> 
PM> 
PM> I think the assumption that the location determines the resource type is
PM> not general enough (and also not extensible).  For example, if we have
PM> an SRB repository used within Ecogrid for storing datasets as well as
PM> PDF documents and ontologies, then a namespace would have to capture all
PM> three of these types.  I believe that with many of these underlying
PM> systems, like SRB and Metacat, there is no requirement that all
PM> resources stored must be of the same type.
PM> 
PM> I stated above there should be a metadata tag for storing the type
PM> information, but it could just as easily be an operation (or query). For
PM> example, getResourceType : ResourceID -> ResourceType is a partial
PM> function, where ResourceID is the set of all possible Ecogrid resources
PM> identifiers and ResourceType is the set of all resource types known by
PM> Ecogrid ("dataset", "web service", and so on). So, for a given
PM> resource-id r, getResourceType(r) returns the associated resource type
PM> of r.  If Ecogrid calculates this op based on where r is stored, and
PM> that is really a valid assumption, that seems fine.  Note that the
PM> operation could also be expressed as a query, as opposed to a function
PM> or a metadata tag.
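
The partial function described above could be sketched in Python roughly as follows. This is only an illustration: the registry contents, the second identifier, and the name `get_resource_type` are hypothetical and do not reflect any actual Ecogrid API.

```python
from typing import Optional

# Controlled vocabulary of resource types (values taken from the message).
RESOURCE_TYPES = {
    "dataset", "web service", "shape file",
    "source code", "PDF document", "ontology",
}

# Hypothetical registry mapping resource ids to their resource type.
_REGISTRY = {
    "ecogrid:042604": "dataset",
    "ecogrid:042605": "ontology",  # made-up id, for illustration only
}

def get_resource_type(resource_id: str) -> Optional[str]:
    """Partial function: returns None where no type is defined for the id."""
    resource_type = _REGISTRY.get(resource_id)
    if resource_type is not None and resource_type not in RESOURCE_TYPES:
        raise ValueError(f"uncontrolled resource type: {resource_type!r}")
    return resource_type
```

Whether the lookup is backed by a metadata tag, by the storage location, or by a query is hidden behind the function, which is the point made in the paragraph above.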
PM> 
>>> The other tag
>>> states the available (and Ecogrid accessible) standards-based metadata
>>> for the resource, e.g., for a dataset this might include "FGDC", 
>>> "EML", "XML Schema" (for datasets stored in XML), "SQL DDL"; and for a
>>> web service, "WSDL"; and so on. (Again, these should be controlled 
>>> values.) Other tags that might be useful (but not required by SMS) are
>>> quality of resource (who registered it, whether it has been deemed 
>>> "accepted", and so on) and whether it is curated (stored by some 
>>> Ecogrid db) or stored externally (e.g., in the PNW database).
>>> 
>> Would a namespace be enough to be able to specify "how" the metadata 
>> was stored?
PM> 
PM> In this case, I don't think a namespace is enough.  Any given resource
PM> may have multiple metadata specifications. For example, if a given
PM> resource-id r happens to be a dataset, then there very easily could be
PM> both an FGDC and an EML metadata file for r.  So, what SMS needs is a
PM> (partial) function getMetadataType : ResourceID -> 2^MetadataType, which
PM> takes a resource id and returns a set (2^X means the powerset of X) of
PM> metadata types (e.g., "SQL DDL", "EML", and so on).
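
As a rough Python sketch of a set-valued lookup like this one (the table contents and the name `get_metadata_types` are hypothetical, not an existing Ecogrid call):

```python
from typing import FrozenSet

# Hypothetical table: one resource id may have several metadata descriptions.
_METADATA_TYPES = {
    "ecogrid:042604": frozenset({"FGDC", "EML"}),
}

def get_metadata_types(resource_id: str) -> FrozenSet[str]:
    """Return the set of metadata types registered for a resource.

    Unknown ids yield an empty set here, which makes the function total;
    a stricter version could raise an error to stay a partial function.
    """
    return _METADATA_TYPES.get(resource_id, frozenset())
```

Returning a set rather than a single value is what distinguishes this from a namespace-per-resource design: a dataset with both an FGDC and an EML description is represented without any special casing.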
PM> 
PM> One question I have about the Ecogrid, and probably a misconception I
PM> have, is that it seems like what is searched *for* is metadata (like EML
PM> files), and not the actual resource. This was what prompted my earlier
PM> post on how to get all resources from the Ecogrid... do I have to first
PM> query for all the metadata associated with the resource, then look in
PM> these files to see where each resource is actually being stored? Like I
PM> said, this might be a misconception I have -- it seems like this
PM> metadata-centric view represents the only examples I've seen for
PM> Ecogrid. I would like for SMS to have resource-centric access for
PM> datasets; the resource is what is of interest (I give an example in my
PM> next comment below). The same should be true for Kepler -- datasets can
PM> be processed in a workflow, not the EML files of the datasets (there is
PM> a caveat to this; both Chad's EML ingestor and in some ways, Iklay's web
PM> service actor, take metadata files, but their purpose is to get from the
PM> metadata to the actual resource, I believe).   Of course, for web
PM> services (as an example), SMS doesn't need the actual resource, and only
PM> needs the WSDL description (which happens to be all that is needed to
PM> execute the web service).  However, conceptually, it is still the
PM> web-service that is the resource -- the web-service implementation is
PM> what is of interest, and the WSDL could be viewed as just a by-product
PM> of the implementation. In fact, there could be many WSDL descriptions of
PM> the same implementation.  There may be some disagreement about this
PM> notion of Ecogrid being resource-centric, but I would argue it is the
PM> more general semantics.
PM> 
PM> Does that make sense?
PM> 
PM> 
>>> 3). Ecogrid should support an operation to retrieve the metadata 
>>> definition for a resource. For example, if a dataset is stored through
>>> the Ecogrid, and the resource has an EML description (which we know 
>>> from 2), then the operation would return the corresponding EML file 
>>> (of course, although not likely, there is nothing that would prevent a
>>> resource from having multiple EML files).
>>> 
>> Seems reasonable.
>> 
>>> 
>>> 4). Ecogrid should support an operation to retrieve the actual 
>>> resource (the thing managed by the ecogrid; either a dataset, a web 
>>> service, a "code", or whatever).  Also, datasets should be returned 
>>> using a standard representation. For example, the canonical XML 
>>> representation for relational data or CSV.  I believe EML-tools 
>>> already provide some support for this for relational data. Thus, at 
>>> least for datasets, the Ecogrid should serve as a standard wrapper 
>>> service as used in distributed dbs and in information-integration 
>>> architectures. This service I see as useful for both SMS and for 
>>> Kepler in general.
>>> 
>> It's either doing this, or I don't quite understand the question.
PM> 
PM> Here is an example.  I am a scientist, and I have a dataset (a single
PM> relation) stored in an Access database. I also have an FGDC file that I
PM> created to describe my dataset. They are both living on my laptop.  I
PM> want to store my dataset on the Ecogrid. I create an Ecogrid resource-id
PM> for the dataset, ecogrid:042604, and I register the resource-id for the
PM> dataset. That is, I upload the Access database to some Ecogrid
PM> repository as well as the FGDC file, and I tell Ecogrid that the FGDC
PM> file should be used as the metadata file for the dataset.
PM> 
PM> Later, SMS needs to integrate the dataset with some other dataset. SMS
PM> knows the resource-id for both datasets. To do the integration, SMS
PM> needs access to both datasets. To get access to the datasets, SMS calls
PM> the Ecogrid function getResource(ecogrid:042604, "CSV"), which returns
PM> the dataset as a comma-separated-value text-file representation.
PM> Alternatively (and preferred), SMS could call
PM> getResource(ecogrid:042604, "RelationalXML"), which returns the same
PM> exact dataset using the standard relational to XML mapping.
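
To make the getResource idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the in-memory store, the column names and values, and the XML layout, which is only a simple row/column mapping and not necessarily the standard relational-to-XML mapping referred to above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory store: resource id -> (column names, rows).
_STORE = {
    "ecogrid:042604": (["site", "biomass"], [["A", "12.5"], ["B", "7.0"]]),
}

def get_resource(resource_id: str, fmt: str) -> str:
    """Return the relation registered under resource_id, serialized as fmt."""
    columns, rows = _STORE[resource_id]
    if fmt == "CSV":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(columns)   # header line
        writer.writerows(rows)     # one line per tuple
        return buf.getvalue()
    if fmt == "RelationalXML":
        table = ET.Element("table")
        for row in rows:
            row_el = ET.SubElement(table, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_el, name).text = value
        return ET.tostring(table, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt!r}")
```

The caller only names the resource and the representation it wants; how the data was originally stored (an Access database in the example above) stays behind the wrapper.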
PM> 
PM> Does Ecogrid already provide something like getResource? (If so that
PM> would be awesome!)
PM> 
PM> 
PM> Thanks,
PM> Shawn
PM> 
>> 
>>> 
>>> 5). Optionally (at least for SMS, these aren't required), Ecogrid can 
>>> offer a query-routing/execution service and/or web service invocation.
>>> The purpose of offering query or invocation services would be for 
>>> optimization (in some cases) and to enable such operations for clients
>>> that cannot perform these locally.
>>> 
>> 
>> I think this functionality is one of the benefits of using Globus.
>> 
>>> 
>>> I believe that items 1-4 are the only things really needed by SMS from
>>> the Ecogrid. In particular, for SMS, it doesn't really matter how or 
>>> where the resource is stored (metacat, srb, digir, etc.), and it 
>>> doesn't need services to query the catalog entries of those systems.  
>>> If people bypass the SMS system, then I guess these types of things 
>>> are needed.
>>> 
>>> Items 1-3 seem relatively straightforward. Item 4 seems harder, 
>>> although EML-tools exist for much of this I guess -- I am not really 
>>> sure.
>>> 
>>> 
>>> Shawn
>>> 
>>> 
>>> 
>>> 
>>>> The issue at the moment is that the content of the <record> element 
>>>> is basically a blob and anything goes. For example:
>>>> 1) Metacat returns a bunch of param elements containing the data
>>>> 2) DiGIR contains a bunch of namespace qualified elements containing
>>>> the data.
>>>> 3) The SRB doesn't even have any data in the record; only the 
>>>> identifier attr is meaningful.
>>>> 
>>>> We need to provide a mechanism for the contents to be interpreted. To
>>>> do this we will add four things to the existing resultset schema:
>>>> 1) One or more <namespace> elements in the metadata - this will be the 
>>>> namespace for the new <returnfield> element
>>>> 2) Add a new element <returnfield>
>>>> 3) A "name" attribute for the returnfield element (basically the same
>>>> as Peter's 'xpath' attr) which is a unique name within the record and 
>>>> may be meaningful for wherever the data came from.
>>>> 4) A "type" attribute for the returnfield element that describes the 
>>>> type of data contained in the returnfield
>>>> 
>>>> The most important and powerful part of the new additions is the 
>>>> "type" attr. This enables the value to be interpreted. Most of the 
>>>> time it can be described by a schema definition type, for example 
>>>> "xsi:string" etc. Or it could be a URL that points to a schema 
>>>> definition document. This means the value of the returnfield element 
>>>> could be anything from a string or integer to an entire XML document.
>>>> 
>>>> (Note that the namespace attr has been removed from the record 
>>>> element)
>>>> 
>>>> The new namespace attrs in the metadata provide a way for the value 
>>>> of the name attr and the type attr to be interpreted.
>>>> 
>>>> Here is an example of a metacat resultset that is returned today:
>>>> <rs:resultset system="http://knb.ecoinformatics.org"
>>>> resultsetId="eml.001"
>>>> xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
>>>> xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1
>>>> ../../src/xsd/resultset.xsd">
>>>> <resultsetMetadata>
>>>> <sendTime>2004-03-10T13:47:26-0600</sendTime>
>>>> <startRecord>1</startRecord>
>>>> <endRecord>14</endRecord>
>>>> <recordCount>14</recordCount>
>>>> </resultsetMetadata>
>>>> <record number="1"
>>>> system="http://dev.nceas.ucsb.edu"
>>>> identifier="obfs2.379.1"
>>>> namespace="eml://ecoinformatics.org/eml-2.0.0"
>>>> lastModifiedDate="2003-11-02T11:07:43-0600"
>>>> creationDate="2003-11-02T11:07:43-0600">
>>>> <param name="/eml/dataset/keywordSet/keyword">seasonality</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">macroalgal bloom</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">green tide</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">Ulva</param>
>>>> <param name="/eml/dataset/creator/individualName/surName">Nelson</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">biomass</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">algal blooms</param>
>>>> <param name="/eml/dataset/title">Armitage Bay Ulvoid Algal Biomass and Species Composition</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">Enteromorpha</param>
>>>> <param name="/eml/dataset/keywordSet/keyword">Ulvaria</param>
>>>> </record>
>>>> 
>>>> Here is an example of the same resultset as described by the new
>>>> approach:
>>>> <rs:resultset system="http://knb.ecoinformatics.org"
>>>> resultsetId="eml.001"
>>>> xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
>>>> xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1
>>>> ../../src/xsd/resultset.xsd">
>>>> <resultsetMetadata>
>>>> <sendTime>2004-03-10T13:47:26-0600</sendTime>
>>>> <startRecord>1</startRecord>
>>>> <endRecord>14</endRecord>
>>>> <recordCount>14</recordCount>
>>>> <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
>>>> <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
>>>> </resultsetMetadata>
>>>> <record number="1"
>>>> system="http://dev.nceas.ucsb.edu"
>>>> identifier="obfs2.379.1"
>>>> lastModifiedDate="2003-11-02T11:07:43-0600"
>>>> creationDate="2003-11-02T11:07:43-0600">
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">seasonality</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">macroalgal bloom</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">green tide</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">Ulva</returnfield>
>>>> <returnfield name="/eml/dataset/creator/individualName/surName"
>>>> type="xsi:string">Nelson</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">biomass</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">algal blooms</returnfield>
>>>> <returnfield name="/eml/dataset/title"
>>>> type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">Enteromorpha</returnfield>
>>>> <returnfield name="/eml/dataset/keywordSet/keyword"
>>>> type="xsi:string">Ulvaria</returnfield>
>>>> </record>
>>>> 
>>>> Note how we now can interpret the resultset in a much more meaningful
>>>> way. Also, note that there are two new namespace elements: one 
>>>> contains a "prefix" attr, the other does not. The one without becomes
>>>> the default namespace for unqualified values in the name and type 
>>>> attrs.
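
A consumer could apply the default-namespace rule described above along these lines. This is a minimal Python sketch under stated assumptions: the sample document is a trimmed, well-formed stand-in for the proposed layout (the `rs:` prefix and most attributes are omitted), and the helper names `namespace_table` and `resolve` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed stand-in for the proposed resultset layout.
SAMPLE = """
<resultset>
  <resultsetMetadata>
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
  </resultsetMetadata>
  <record number="1" identifier="obfs2.379.1">
    <returnfield name="/eml/dataset/title" type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
  </record>
</resultset>
"""

def namespace_table(root):
    """Map each declared prefix to its URI; the prefix-less entry is keyed by None."""
    return {ns.get("prefix"): ns.text.strip() for ns in root.iter("namespace")}

def resolve(value, table):
    """Split 'prefix:local' and look up the prefix; unqualified values use the default."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in table:
        return table[prefix], local
    return table[None], value

root = ET.fromstring(SAMPLE)
table = namespace_table(root)
field = root.find("record/returnfield")
print(resolve(field.get("type"), table))  # qualified: resolved through the "xsi" entry
print(resolve(field.get("name"), table))  # unqualified: falls back to the default namespace
```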
>>>> 
>>>> Here is the before and after for the DiGIR query:
>>>> Before:
>>>> <rs:resultset resultsetId="foo.1.1"
>>>> system="urn:not://sure/what/to/put/here"
>>>> xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
>>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>> xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1
>>>> ../../src/xsd/resultset.xsd">
>>>> <resultsetMetadata>
>>>> <sendTime>2003-05-02T16:45:50-09:00</sendTime>
>>>> <startRecord>1</startRecord>
>>>> <endRecord>2</endRecord>
>>>> <recordCount>2</recordCount>
>>>> </resultsetMetadata>
>>>> <record number="1"
>>>> system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
>>>> identifier="mvz1"
>>>> namespace="http://digir.net/schema/conceptual/darwin/2003/1.0"
>>>> lastModifiedDate="2003-03-03T10:42:13"
>>>> creationDate="2003-03-03T10:42:13">
>>>> <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS</darwin:ScientificName>
>>>> <darwin:Longitude>121</darwin:Longitude>
>>>> <darwin:Latitude>33</darwin:Latitude>
>>>> </record>
>>>> 
>>>> After:
>>>> <rs:resultset resultsetId="foo.1.1"
>>>> system="urn:not://sure/what/to/put/here"
>>>> xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
>>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>> xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1
>>>> ../../src/xsd/resultset.xsd">
>>>> 
>>>> <resultsetMetadata>
>>>> <sendTime>2003-05-02T16:45:50-09:00</sendTime>
>>>> <startRecord>1</startRecord>
>>>> <endRecord>2</endRecord>
>>>> <recordCount>2</recordCount>
>>>> <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
>>>> <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
>>>> </resultsetMetadata>
>>>> 
>>>> <record number="1"
>>>> system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
>>>> identifier="mvz1"
>>>> lastModifiedDate="2003-03-03T10:42:13"
>>>> creationDate="2003-03-03T10:42:13">
>>>> <returnfield name="ScientificName"
>>>> type="xsi:string">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnfield>
>>>> <returnfield name="Longitude" type="xsi:int">121</returnfield>
>>>> <returnfield name="Latitude" type="xsi:int">33</returnfield>
>>>> </record>
>>>> 
>>>> Here is the SRB's before and after:
>>>> Before:
>>>> <rs:resultset system="http://knb.ecoinformatics.org"
>>>> resultsetId="SeekSRB_001"
>>>> xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
>>>> <resultsetMetadata>
>>>> <sendTime>2004-04-16T11:02:12-0500</sendTime>
>>>> <startRecord>1</startRecord>
>>>> <endRecord>2</endRecord>
>>>> <recordCount>2</recordCount>
>>>> </resultsetMetadata>
>>>> <record number="1"
>>>> system="http://srb.sdsc.edu"
>>>> identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
>>>> namespace="srb://srb.sdsc.edu"
>>>> lastModifiedDate="2003-11-30T13:04:59-0600"
>>>> creationDate="2003-11-30T13:04:58-0600">
>>>> </record>
>>>> 
>>>> After:
>>>> <rs:resultset system="http://knb.ecoinformatics.org"
>>>> resultsetId="SeekSRB_001"
>>>> xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
>>>> <resultsetMetadata>
>>>> <sendTime>2004-04-16T11:02:12-0500</sendTime>
>>>> <startRecord>1</startRecord>
>>>> <endRecord>2</endRecord>
>>>> <recordCount>2</recordCount>
>>>> <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
>>>> </resultsetMetadata>
>>>> <record number="1"
>>>> system="http://srb.sdsc.edu"
>>>> identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
>>>> lastModifiedDate="2003-11-30T13:04:59-0600"
>>>> creationDate="2003-11-30T13:04:58-0600">
>>>> <returnfield name="location"
>>>> type="xsi:string">/home/testuser.sdsc/SeekTestArea/Lesli Model::0</returnfield>
>>>> </record>
>>>> ------------------------------------------------------------------------
>>>> The Query
>>>> About the only difference between the old query and the new is that
>>>> if the returnfield values and concept attr values do not have a
>>>> namespace prefix, then the prefix should be dropped from the namespace
>>>> element; otherwise they should carry a namespace prefix if there is a
>>>> prefix in the element. For example:
>>>> 
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <egq:query queryId="test.1.1" system="http://knb.ecoinformatics.org"
>>>> xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
>>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>> xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1
>>>> ../../src/xsd/query.xsd">
>>>> <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
>>>> <returnfield>/eml/dataset/title</returnfield>
>>>> <returnfield>/eml/dataset/creator/individualName/surName</returnfield>
>>>> <returnfield>/eml/dataset/pubDate</returnfield>
>>>> <returnfield>/eml/dataset/keywordSet/keyword</returnfield>
>>>> <title>Soils metadata query</title>
>>>> <AND>
>>>> <OR>
>>>> <condition operator="LIKE" concept="title">%soil%</condition>
>>>> <condition operator="NOT LIKE" concept="title">%dirt%</condition>
>>>> </OR>
>>>> <OR>
>>>> <condition operator="LIKE" concept="surName">%Jones%</condition>
>>>> <condition operator="LIKE" concept="surName">%Vieglais%</condition>
>>>> </OR>
>>>> </AND>
>>>> </egq:query>
>>>> ------------------------------------------------------------------------
>>>> 
>>>> We can either discuss this via email, or think about it and discuss 
>>>> it further during our phone meeting.
>>>> 
>>>> Rod
>>>> 
>>>> 
>>>> Chad Berkley wrote:
>>>> 
>>>> 
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Sorry for my late reply...we've been busy with a morpho release. 
>>>>> Thanks for getting me in gear, Rod.
>>>>> 
>>>>> In metacat, we only return leaf nodes (i.e. the text node child of a
>>>>> CDATA element like in response 4 below).  The returnfield 
>>>>> functionality was originally meant as a convenient way to return 
>>>>> enough information for a meaningful resultset to display, say, on a 
>>>>> web page.  It was not meant to return whole document chunks for 
>>>>> further processing.  I can see how this would be useful, but it 
>>>>> would require returning a namespace defined chunk so that a parser 
>>>>> would know what to do with it.  Metacat currently uses the 
>>>>> returnfields to build the resultset table, then a request must be 
>>>>> made for the whole document in order to do further processing.
>>>>> 
>>>>> Looking at the responses 1-3 below, to me, they are all invalid and 
>>>>> potentially problematic.  Without a namespace to parse those xml 
>>>>> chunks against, the parser is left to just do well-formedness 
>>>>> checking, and any query into these document chunks may fail because 
>>>>> we don't know what to expect to get back before doing the processing
>>>>> (e.g. an xpath query).
>>>>> 
>>>>> So I guess to make a short answer long, I agree with Peter's 
>>>>> assessment of sticking with response 4 (which is basically what 
>>>>> metacat has done all along).
>>>>> 
>>>>> chad
>>>>> 
>>>>> 
>>>>> Rod Spears wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> Is anyone better qualified than me, going to address Peter's 
>>>>>> questions?
>>>>>> 
>>>>>> Please someone respond, thanks.
>>>>>> 
>>>>>> Rod
>>>>>> 
>>>>>> 
>>>>>> Peter McCartney wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> It has to be well formed no matter what. So the question is really
>>>>>>> how can we identify a namespace for the result set when the 
>>>>>>> content we stick in there has no hope of being valid? Further, how
>>>>>>> can we define a set of rules for how the results are to be 
>>>>>>> evaluated against that namespace yet not be valid? request 1: 
>>>>>>> '*/creator/individualName/surname', '/eml/dataset
>>>>>>> 
>>>>>>> Rule 1: "content must appear in the minimal xml tree needed to 
>>>>>>> accommodate the information"
>>>>>>> 
>>>>>>> Rule 2: "content must appear in a potentially valid xml tree that 
>>>>>>> invalidates only due to other required elements missing"
>>>>>>> 
>>>>>>> Rule 3: "content must appear in a tree that places it in the correct 
>>>>>>> node ancestry for the declared namespace"
>>>>>>> 
>>>>>>> 
>>>>>>> response 1: meets 1 and 3 and is well formed. Requires just 
>>>>>>> knowledge of parent ancestry to build.
>>>>>>> <eml>
>>>>>>> <dataset>
>>>>>>> <creator>
>>>>>>> <individualName>
>>>>>>> <surname>mccartney</surname>
>>>>>>> <surname>jones</surname>
>>>>>>> </individualName>
>>>>>>> </creator>
>>>>>>> </dataset>
>>>>>>> </eml>
>>>>>>> 
>>>>>>> response 2: meets 1, 2 and 3 and is well formed. Requires 
>>>>>>> knowledge of ancestry and index (i.e., jones is in creator[2] of 
>>>>>>> dataset[1]).
>>>>>>> <eml>
>>>>>>> <dataset>
>>>>>>> <creator>
>>>>>>> <individualName>
>>>>>>> <surname>mccartney</surname>
>>>>>>> </individualName>
>>>>>>> </creator>
>>>>>>> <creator>
>>>>>>> <individualName>
>>>>>>> <surname>jones</surname>
>>>>>>> </individualName>
>>>>>>> </creator>
>>>>>>> </dataset>
>>>>>>> </eml>
>>>>>>> 
>>>>>>> 
>>>>>>> response 3: meets 3 and is not well formed (two root elements). 
>>>>>>> Requires knowledge of ancestry.
>>>>>>> 
>>>>>>> <eml>
>>>>>>> <dataset>
>>>>>>> <creator>
>>>>>>> <individualName>
>>>>>>> <surname>mccartney</surname>
>>>>>>> </individualName>
>>>>>>> </creator>
>>>>>>> </dataset>
>>>>>>> </eml>
>>>>>>> <eml>
>>>>>>> <dataset>
>>>>>>> <creator>
>>>>>>> <individualName>
>>>>>>> <surname>jones</surname>
>>>>>>> </individualName>
>>>>>>> </creator>
>>>>>>> </dataset>
>>>>>>> </eml>
>>>>>>> 
>>>>>>> and just a reminder of where we originally started from
>>>>>>> (approximately)
>>>>>>> response 4: meets no rule, cannot be validated, but conveys all the 
>>>>>>> information to generate format 1 or 3 above using a string 
>>>>>>> tokenizer and a jDOM, but not option 2.
>>>>>>> <resultset namespace=eml......>
>>>>>>> <returnfield 
>>>>>>> xpath="dataset/creator/individualname/surname">mccartney</returnfield>
>>>>>>> <returnfield 
>>>>>>> xpath="dataset/creator/individualname/surname">jones</returnfield>
>>>>>>> </resultset>
>>>>>>> 
>>>>>>> I think we should really ask whether we are making ourselves deal 
>>>>>>> with some very complicated rules for really no gain in 
>>>>>>> functionality. None of the results will be valid according to the 
>>>>>>> namespace. All of them are valid if I make up my own namespace 
>>>>>>> for the result set.  Unless we can hold ourselves to the standard
>>>>>>> where any code or xsl written for the schema will successfully 
>>>>>>> process the result set (#2 is the closest to that, but depending 
>>>>>>> on how loose the code is, all three could work or none could 
>>>>>>> work), why shouldn't we opt for the easiest rule to comply with?
>>>>>>> 
>>>>>>> 
>>>>>>> Peter McCartney (peter.mccartney at asu.edu
>>>>>>> <mailto:peter.mccartney at asu.edu>)
>>>>>>> Center for Environmental-Studies
>>>>>>> Arizona State University
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> *From:* Saritha Bhandarkar
>>>>>>> *Sent:* Friday, April 09, 2004 10:28 AM
>>>>>>> *To:* 'seek-dev'
>>>>>>> *Cc:* Jing Tao; Peter McCartney; Saritha Bhandarkar
>>>>>>> *Subject:* resultset question
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I had a question about the resultset to be returned by 
>>>>>>> Xanthoria.
>>>>>>> 
>>>>>>> The schema of the resultset specifies that a record is of type
>>>>>>> "AnyRecordType" and optionally it may have some element content
>>>>>>> from the record. Now, my question here is, if I am to return the
>>>>>>> elements specified in the <returnfields> of the query, for the
>>>>>>> matching records (that is from the matching
>>>>>>> eml file), do I need to send it in eml format, with only relevant
>>>>>>> values for requested fields and no values for the fields which are
>>>>>>> not requested? Or is it enough to return only the requested fields
>>>>>>> with their values, as well-formed xml? Can someone please brief me
>>>>>>> on the contents of a record in resultsetType?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Saritha
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Saritha Bhandarkar
>>>>>>> 
>>>>>>> Research Assistant
>>>>>>> 
>>>>>>> Center for Environmental Studies
>>>>>>> 
>>>>>>> ASU-Tempe AZ
>>>>>>> 
>>>>>>> saritha.bhandarkar at asu.edu <mailto:saritha.bhandarkar at asu.edu>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> --
>>>>>> Rod Spears
>>>>>> Biodiversity Research Center
>>>>>> University of Kansas
>>>>>> 1345 Jayhawk Boulevard
>>>>>> Lawrence, KS 66045, USA
>>>>>> Tel: 785 864-4082, Fax: 785 864-5335
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 
PM> 
PM> _______________________________________________
PM> seek-dev mailing list
PM> seek-dev at ecoinformatics.org
PM> http://www.ecoinformatics.org/mailman/listinfo/seek-dev


