[seek-dev] RE: resultset question

Mon Apr 26 11:11:40 PDT 2004

I'm having a hard time making the concept of discovery fit in this
discussion because in every implementation of the query engines we are
talking about, they are used to enable queries on a publicly known schema
that the user (or application) would in fact be familiar with in order
to query it. I can see the type attribute being of use here if there are
cases where one of these services might have a default set of fields
that it returns, which could result in a client getting something back
that they hadn't explicitly asked for.
When I start to think of applications where users are attempting to
discover and query a data source on the ecogrid in a more dynamic
context, I keep coming back to the feeling that this is somehow a
different animal, and that what they need for that discovery is metadata
that does indeed have everything the app needs to know BEFORE it queries
the data and starts processing the results. It also seems to me that when
querying something like climate data, which could return many records, I
would NOT want the results in XML, especially in a format that is so
verbose, with metadata inlined as attributes (in every record) in the
result set. We tried passing data back and forth from web services in
XML formats early on in our xylopia project and found very quickly that
that approach just did not scale well when the data tended to be large
or in a format for which XML was not very efficient (e.g. grids). We found
that packages of binary data plus an eml file that describes it (which
could be different from the eml that describes the original data that was
queried) were much more efficient.

This isn't really to argue against including the type attribute, just to
point out that its value seems to rest on the fact that these
preliminary resultsets we are talking about are essentially
"schema-less". For each of our systems, when you ask for the full
record, you get data that conforms to the published schema and you thus
know what to do with it by reading the schema. A more robust solution
when you get down to things like importing climate or other data seems
to be to have the query services capable of actually generating eml
schemas for the returned object. This is what we do with our xylopia
query service now. A new eml file that references the source eml file is
generated to describe the data file that was returned. A subsequent
service then receives that new eml file in its input statement. I think
it makes a lot of sense to start writing the SMS services to process
information in accompanying eml files rather than starting to look for
it embedded in the data since, except in very special cases like these
catalog applications, XML is not going to be a very practical means for
passing data between services/components, especially if many of our
components are actually wrappers of existing apps that have their own
I/O functions, like GRASS or MATLAB.

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental-Studies
Arizona State University

	-----Original Message-----
	From: Rod Spears [mailto:rods at ku.edu] 
	Sent: Monday, April 26, 2004 9:34 AM
	To: Peter McCartney
	Cc: Chad Berkley; Saritha Bhandarkar; seek-dev; Jing Tao; Matt Jones
	Subject: Re: [seek-dev] RE: resultset question

	Certainly, pipelines can and will be constructed with intimate
knowledge of the data, where it comes from, the type of data being
requested, etc. But if we provide the "type" attribute for the return
data fields we can open it up more to automation. This enables
non-humans to do "discovery."

	For example, SMS may have a "library" of "converters" that can
be strung together to get one "domain" of data to a different "domain."

	A rather simple example: Let's say a user wants to add a
"climate" dataset to his/her pipeline. From the metadata they discover a
particular dataset holds the temperature data they need. Do they really
need to be concerned whether the data was collected in Celsius or
Fahrenheit? Probably not. 

	Does the analysis module need to build in any assumptions as to
whether (no pun intended) the temperature data was C or F or whether it
is an integer or decimal (float/double)? Well, it would be best if the
analysis module could "discover" everything it needed to know about the
incoming data, instead of having to hard-code the assumptions, thus
limiting its flexibility.

	The more we can describe what things are, the fewer assumptions
need to be built into the system. It may place a higher burden on the
data providers, but it "frees up" more things further in the pipeline.

	Rod
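
A minimal sketch of the idea in the message above: a consumer picks a converter from a field's declared "type" instead of hard-coding it. Everything named here is an assumption made for illustration: the CONVERTERS table, the convert_field and to_celsius helpers, and the "celsius"/"fahrenheit" unit labels. In particular, nothing in the proposed resultset carries units, so the unit would have to come from accompanying metadata.

```python
from decimal import Decimal

# Hypothetical converter table keyed by the local part of a declared type.
CONVERTERS = {
    "string": str,
    "int": int,
    "integer": int,
    "float": float,
    "double": float,
    "decimal": Decimal,
}


def convert_field(text, declared_type):
    """Turn a field's string value into a Python value using its declared type."""
    local_name = declared_type.split(":")[-1]    # "xsi:int" -> "int"
    converter = CONVERTERS.get(local_name, str)  # unknown types stay strings
    return converter(text.strip())


def to_celsius(value, unit):
    """Normalize a temperature; the unit label is assumed to come from metadata."""
    if unit == "fahrenheit":
        return (value - 32) * 5 / 9
    if unit == "celsius":
        return value
    raise ValueError(f"unknown temperature unit: {unit}")


longitude = convert_field("121", "xsi:int")
temperature_c = to_celsius(convert_field("212", "xsi:double"), "fahrenheit")
```

With a table like this, adding support for a new type means adding one entry rather than changing every analysis module that reads the field.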

	Peter McCartney wrote: 

		Ok thought so. 

		I'm not sure why we need the type attribute. I understand
that the user needs to know how to interpret the data type of the field
since they are all coming back as strings in the XML, but don't they
already know this? This schema is for defining the return from a
request that presumably some agent (person or software) has constructed
using knowledge of the resource's schema and has selected certain fields
to be returned. Wouldn't they thus have to already know the data types of
the return fields in order for them to have requested them in the first
place? That's certainly been the case with the (limited) apps we've built
using xanthoria so far.

		Peter McCartney (peter.mccartney at asu.edu)
		Center for Environmental-Studies
		Arizona State University

			-----Original Message-----
			From: Rod Spears [mailto:rods at ku.edu] 
			Sent: Sunday, April 25, 2004 6:49 AM
			To: Peter McCartney
			Cc: Chad Berkley; Saritha Bhandarkar; seek-dev; Jing Tao; Matt Jones
			Subject: Re: [seek-dev] RE: resultset question

			Oops, it should have "name" instead of "path"
(always "name")

			Rod

			Peter McCartney wrote:

				Is it a typo or a meaningful difference
that the metacat and srb examples have a "name" attribute whereas the
digir example has a "path" attribute in the <returnField> element?

				Peter McCartney
(peter.mccartney at asu.edu)
				Center for Environmental-Studies
				Arizona State University

				-----Original Message-----
				From: Rod Spears [mailto:rods at ku.edu] 
				Sent: Friday, April 23, 2004 11:00 AM
				To: Chad Berkley
				Cc: Peter McCartney; Saritha Bhandarkar; seek-dev; Jing Tao; Matt Jones
				Subject: Re: [seek-dev] RE: resultset question

				Dave and I spent some time thinking
about this and arrived at a similar place as to #4, but took it a little
further and changed how the resultset is defined and made a minor change
to the query.

				The main issue has to do with the
consumers of the resultset coming back from an Ecogrid query. 

				How does a consumer interpret the
results in a meaningful way?
				What can be done to help generic
consumers and SMS?

				The issue at the moment is that the
contents of the <record> element are basically a blob and anything goes.
For example:
				1) Metacat returns a bunch of param
elements containing the data.
				2) DiGIR contains a bunch of
namespace-qualified elements containing the data.
				3) The SRB doesn't even have any data in
the record; only the identifier attr is meaningful.

				We need to provide a mechanism for the
contents to be interpreted. To do this we will add four things to the
existing resultset schema:
				1) One or more <namespace> elements in
the metadata - this will be the namespace for the new <returnfield>
element
				2) A new element, <returnfield>
				3) A "name" attribute for the
returnfield element (basically the same as Peter's 'xpath' attr) which
is a unique name within the record and may be meaningful for wherever
the data came from.
				4) A "type" attribute for the
returnfield element that describes the type of data contained in the
returnfield 

				The most important and powerful part of
the new additions is the "type" attr. This enables the value to be
interpreted. Most of the time it can be described by a schema definition
type, for example "xsi:string" etc. Or it could be a URL that points to
a schema definition document. This means the value of the returnfield
element could be anything from a string or integer to an entire XML
document.

				(Note that the namespace attr has been
removed from the record element)

				The new namespace attrs in the metadata
provide a way for the value of the name attr and the type attr to be
interpreted.

				Here is an example of a metacat
resultset that is returned today:
				<rs:resultset system="http://knb.ecoinformatics.org"
				              resultsetId="eml.001"
				              xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
				              xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
				  <resultsetMetadata>
				    <sendTime>2004-03-10T13:47:26-0600</sendTime>
				    <startRecord>1</startRecord>
				    <endRecord>14</endRecord>
				    <recordCount>14</recordCount>
				  </resultsetMetadata>
				  <record number="1"
				          system="http://dev.nceas.ucsb.edu"
				          identifier="obfs2.379.1"
				          namespace="eml://ecoinformatics.org/eml-2.0.0"
				          lastModifiedDate="2003-11-02T11:07:43-0600"
				          creationDate="2003-11-02T11:07:43-0600">
				      <param name="/eml/dataset/keywordSet/keyword">seasonality</param>
				      <param name="/eml/dataset/keywordSet/keyword">macroalgal bloom</param>
				      <param name="/eml/dataset/keywordSet/keyword">green tide</param>
				      <param name="/eml/dataset/keywordSet/keyword">Ulva</param>
				      <param name="/eml/dataset/creator/individualName/surName">Nelson</param>
				      <param name="/eml/dataset/keywordSet/keyword">biomass</param>
				      <param name="/eml/dataset/keywordSet/keyword">algal blooms</param>
				      <param name="/eml/dataset/title">Armitage Bay Ulvoid Algal Biomass and Species Composition</param>
				      <param name="/eml/dataset/keywordSet/keyword">Enteromorpha</param>
				      <param name="/eml/dataset/keywordSet/keyword">Ulvaria</param>
				  </record>

				Here is an example of the same resultset
as described by the new approach:
				<rs:resultset system="http://knb.ecoinformatics.org"
				              resultsetId="eml.001"
				              xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
				              xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
				  <resultsetMetadata>
				    <sendTime>2004-03-10T13:47:26-0600</sendTime>
				    <startRecord>1</startRecord>
				    <endRecord>14</endRecord>
				    <recordCount>14</recordCount>
				    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
				    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
				  </resultsetMetadata>
				  <record number="1"
				          system="http://dev.nceas.ucsb.edu"
				          identifier="obfs2.379.1"
				          lastModifiedDate="2003-11-02T11:07:43-0600"
				          creationDate="2003-11-02T11:07:43-0600">
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">seasonality</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">macroalgal bloom</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">green tide</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulva</returnfield>
				      <returnfield name="/eml/dataset/creator/individualName/surName" type="xsi:string">Nelson</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">biomass</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">algal blooms</returnfield>
				      <returnfield name="/eml/dataset/title" type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Enteromorpha</returnfield>
				      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulvaria</returnfield>
				  </record>

				Note how we now can interpret the
resultset in a much more meaningful way. Also, note that there are two
new namespace elements: one contains a "prefix" attr, the other does not.
The one without becomes the default namespace for unqualified values in
the name and type attrs.
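
As a rough illustration of how a generic consumer might read the new layout, here is a sketch in Python using the standard library's ElementTree. The SAMPLE string is a trimmed copy of the metacat example with a closing tag added so that it parses on its own; the helper names read_namespaces and read_records are hypothetical and not part of any Ecogrid code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the metacat example, closed so that it parses on its own.
SAMPLE = """\
<rs:resultset xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
              system="http://knb.ecoinformatics.org" resultsetId="eml.001">
  <resultsetMetadata>
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
  </resultsetMetadata>
  <record number="1" identifier="obfs2.379.1">
    <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">seasonality</returnfield>
    <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulva</returnfield>
    <returnfield name="/eml/dataset/creator/individualName/surName" type="xsi:string">Nelson</returnfield>
  </record>
</rs:resultset>
"""


def read_namespaces(root):
    """Map each declared prefix to its URI; the default namespace gets key None."""
    return {ns.get("prefix"): ns.text.strip()
            for ns in root.iterfind("resultsetMetadata/namespace")}


def read_records(root):
    """Return {identifier: {field name: [(type, value), ...]}} for every record."""
    records = {}
    for record in root.iterfind("record"):
        fields = defaultdict(list)
        for field in record.iterfind("returnfield"):
            fields[field.get("name")].append((field.get("type"), field.text))
        records[record.get("identifier")] = dict(fields)
    return records


root = ET.fromstring(SAMPLE)
namespaces = read_namespaces(root)
records = read_records(root)
```

Because every provider would use the same returnfield element, the same two loops would work for the DiGIR and SRB results as well; only the names and types differ.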

				Here is the before and after for the
DiGIR query:
				Before:
				<rs:resultset resultsetId="foo.1.1"
				              system="urn:not://sure/what/to/put/here"
				              xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
				              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
				              xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
				    <resultsetMetadata>
				        <sendTime>2003-05-02T16:45:50-09:00</sendTime>
				        <startRecord>1</startRecord>
				        <endRecord>2</endRecord>
				        <recordCount>2</recordCount>
				    </resultsetMetadata>
				    <record number="1"
				            system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
				            identifier="mvz1"
				            namespace="http://digir.net/schema/conceptual/darwin/2003/1.0"
				            lastModifiedDate="2003-03-03T10:42:13"
				            creationDate="2003-03-03T10:42:13">
				        <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS</darwin:ScientificName>
				        <darwin:Longitude>121</darwin:Longitude>
				        <darwin:Latitude>33</darwin:Latitude>
				    </record>

				After:
				<rs:resultset resultsetId="foo.1.1"
				              system="urn:not://sure/what/to/put/here"
				              xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
				              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
				              xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">

				    <resultsetMetadata>
				        <sendTime>2003-05-02T16:45:50-09:00</sendTime>
				        <startRecord>1</startRecord>
				        <endRecord>2</endRecord>
				        <recordCount>2</recordCount>
				        <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
				        <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
				    </resultsetMetadata>

				    <record number="1"
				            system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
				            identifier="mvz1"
				            lastModifiedDate="2003-03-03T10:42:13"
				            creationDate="2003-03-03T10:42:13">
				        <returnfield path="ScientificName" type="xsi:string">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnfield>
				        <returnfield path="Longitude" type="xsi:int">121</returnfield>
				        <returnfield path="Latitude" type="xsi:int">33</returnfield>
				    </record>

				Here is the SRB's before and after:
				Before:
				<rs:resultset system="http://knb.ecoinformatics.org"
				              resultsetId="SeekSRB_001"
				              xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
				 <resultsetMetadata>
				   <sendTime>2004-04-16T11:02:12-0500</sendTime>
				   <startRecord>1</startRecord>
				   <endRecord>2</endRecord>
				   <recordCount>2</recordCount>
				 </resultsetMetadata>
				 <record number="1"
				         system="http://srb.sdsc.edu"
				         identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
				         namespace="srb://srb.sdsc.edu"
				         lastModifiedDate="2003-11-30T13:04:59-0600"
				         creationDate="2003-11-30T13:04:58-0600">
				 </record>

				After:
				<rs:resultset system="http://knb.ecoinformatics.org"
				              resultsetId="SeekSRB_001"
				              xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
				 <resultsetMetadata>
				   <sendTime>2004-04-16T11:02:12-0500</sendTime>
				   <startRecord>1</startRecord>
				   <endRecord>2</endRecord>
				   <recordCount>2</recordCount>
				   <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
				 </resultsetMetadata>
				 <record number="1"
				         system="http://srb.sdsc.edu"
				         identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
				         lastModifiedDate="2003-11-30T13:04:59-0600"
				         creationDate="2003-11-30T13:04:58-0600">
				  <returnfield name="location" type="xsi:string">/home/testuser.sdsc/SeekTestArea/Lesli Model::0</returnfield>
				 </record>

  _____  

				The Query
				About the only difference between the
old query and the new is this: if the returnfield values and concept attr
values do not have a namespace prefix, then the prefix should be dropped
from the namespace element; conversely, they should have a namespace
prefix if there is a prefix in the element. For example:

				<?xml version="1.0" encoding="UTF-8"?>
				<egq:query queryId="test.1.1"
				           system="http://knb.ecoinformatics.org"
				           xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
				           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
				           xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
				    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
				    <returnfield>/eml/dataset/title</returnfield>
				    <returnfield>/eml/dataset/creator/individualName/surName</returnfield>
				    <returnfield>/eml/dataset/pubDate</returnfield>
				    <returnfield>/eml/dataset/keywordSet/keyword</returnfield>
				    <title>Soils metadata query</title>
				    <AND>
				        <OR>
				            <condition operator="LIKE" concept="title">%soil%</condition>
				            <condition operator="NOT LIKE" concept="title">%dirt%</condition>
				        </OR>
				        <OR>
				            <condition operator="LIKE" concept="surName">%Jones%</condition>
				            <condition operator="LIKE" concept="surName">%Vieglais%</condition>
				        </OR>
				    </AND>
				</egq:query>

  _____  

				We can either discuss this via email, or
think about it and discuss it further during our phone meeting.

				Rod

				Chad Berkley wrote:

				Hi, 

				Sorry for my late reply...we've been
busy with a morpho release.  Thanks for getting me in gear, Rod. 

				In metacat, we only return leaf nodes
(i.e. the text node child of a CDATA element like in response 4 below).
The returnfield functionality was originally meant as a convenient way
to return enough information for a meaningful resultset to display, say,
on a web page.  It was not meant to return whole document chunks for
further processing.  I can see how this would be useful, but it would
require returning a namespace defined chunk so that a parser would know
what to do with it.  Metacat currently uses the returnfields to build
the resultset table, then a request must be made for the whole document
in order to do further processing. 

				Looking at responses 1-3 below, to
me, they are all invalid and potentially problematic.  Without a
namespace to parse those xml chunks against, the parser is left to just
do well-formedness checking, and any query into these document chunks may
fail because we don't know what to expect to get back before doing the
processing (e.g. an xpath query). 

				So I guess to make a short answer long,
I agree with Peter's assessment of sticking with response 4 (which is
basically what metacat has done all along). 

				chad 

				Rod Spears wrote: 

				Is anyone better qualified than me
going to address Peter's questions? 

				Please someone respond, thanks. 

				Rod 

				Peter McCartney wrote: 

				It has to be well formed no matter what.
So the question is really how can we identify a namespace for the result
set when the content we stick in there has no hope of being valid?
Further, how can we define a set of rules for how the results are to be
evaluated against that namespace yet not be valid? 
				request 1:
'*/creator/individualName/surname', '/eml/dataset 

				Rule 1: "content must appear in the minimal
xml tree needed to accommodate the information" 

				Rule 2: "content must appear in a
potentially valid xml tree that is invalid only due to other required
elements missing" 

				Rule 3: "content must appear in a tree
that places it in the correct node ancestry for the declared namespace" 

				response 1: meets 1 and 3 and is well
formed. Requires just knowledge of parent ancestry to build. 
				<eml>
				    <dataset>
				        <creator>
				            <individualName>
				                <surname>mccartney</surname>
				                <surname>jones</surname>
				            </individualName>
				        </creator>
				    </dataset>
				</eml>

				response 2: meets 1, 2 and 3 and is well
formed. Requires knowledge of ancestry and index (ie jones is in
creator[2] of dataset[1] ) 
				<eml>
				    <dataset>
				        <creator>
				            <individualName>
				                <surname>mccartney</surname>
				            </individualName>
				        </creator>
				        <creator>
				            <individualName>
				                <surname>jones</surname>
				            </individualName>
				        </creator>
				    </dataset>
				</eml>

				response 3: meets 3 and is not well
formed. Requires knowledge of ancestry. 

				<eml> 
				    <dataset> 
				    <creator> 
				        <individualName> 

<surname>mccartney</surname> 
				        </individualname> 
				    </creator> 
				</dataset> 
				<eml> 
				    <dataset> 
				    <creator> 
				        <individualName> 
				                <surname>jones</surname>

				        </individualname> 
				    </creator> 
				</dataset> 
				</eml> 

				and just a reminder of where we
originally started from (approximately)   
				response 4: meets no rule, cannot be
validated, but conveys all the information to generate format 1 or 3
above using a string tokenizer and a jDOM, but not option 2. 
				<resultset namespace=eml......> 
				    <returnfield
xpath="dataset/creator/individualname/surname">mccartney</returnfield> 
				    <returnfield
xpath="dataset/creator/individualname/surname">jones</returnfield> 
				</resultset> 
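
To make the claim about response 4 concrete, here is a sketch in Python (standing in for the string tokenizer plus jDOM mentioned above) that rebuilds the response 1 layout from flat path/value pairs. The function name build_minimal_tree is hypothetical, and using "eml" as the root element name is an assumption taken from the namespace attribute.

```python
import xml.etree.ElementTree as ET


def build_minimal_tree(root_name, fields):
    """Merge (path, value) pairs into one tree, reusing existing ancestors."""
    root = ET.Element(root_name)
    for path, value in fields:
        steps = path.strip("/").split("/")       # the "string tokenizer" step
        parent = root
        for step in steps[:-1]:
            child = parent.find(step)            # reuse an ancestor if present
            if child is None:
                child = ET.SubElement(parent, step)
            parent = child
        leaf = ET.SubElement(parent, steps[-1])  # every value gets its own leaf
        leaf.text = value
    return root


fields = [
    ("dataset/creator/individualname/surname", "mccartney"),
    ("dataset/creator/individualname/surname", "jones"),
]
tree = build_minimal_tree("eml", fields)
xml_text = ET.tostring(tree, encoding="unicode")
```

Both surnames end up under a single creator, as in response 1. Response 2 cannot be rebuilt this way because the flat pairs carry no index saying which creator each surname belongs to.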

				I think we should really ask whether we
are making ourselves deal with some very complicated rules for really no
gain in functionality. None of the results will be valid according to
the namespace. All of them are valid if I make up my own namespace for
the result set.  Unless we can hold ourselves to the standard where any
code or xsl written for the schema will successfully process the result
set (#2 is the closest to that, but depending on how loose the code is,
all three could work or none could work), why shouldn't we opt for the
easiest rule to comply with? 

				Peter McCartney (peter.mccartney at asu.edu)
				Center for Environmental-Studies 
				Arizona State University 

				    -----Original Message----- 
				    *From:* Saritha Bhandarkar 
				    *Sent:* Friday, April 09, 2004 10:28 AM 
				    *To:* 'seek-dev' 
				    *Cc:* Jing Tao; Peter McCartney; Saritha Bhandarkar 
				    *Subject:* resultset question 

				    Hi, 

				    I had a question about the resultset to be returned by Xanthoria. 

				    The schema of the resultset specifies that a record is of type 
				    "AnyRecordType" and optionally it may have some element content 
				    from the record. Now, my question here is, if I am to return the 
				    elements specified in the <returnfields> of the query, for the 
				    matching records (that is, from the matching eml file), do I need 
				    to send it in eml format, with only relevant values for requested 
				    fields and no values for the fields which are not requested? Or is 
				    it enough to return only the requested fields with their values, 
				    as well-formed xml? Can someone please brief me on the contents of 
				    a record in resultsetType? 

				    Thanks, 

				    Saritha 

				    Saritha Bhandarkar 

				    Research Assistant 

				    Center for Environmental Studies 

				    ASU-Tempe AZ 

				    saritha.bhandarkar at asu.edu

				-- 
				Rod Spears 
				Biodiversity Research Center 
				University of Kansas 
				1345 Jayhawk Boulevard 
				Lawrence, KS 66045, USA 
				Tel: 785 864-4082, Fax: 785 864-5335 
