[seek-dev] RE: resultset question

Fri Apr 23 11:42:22 PDT 2004

Is it a typo or a meaningful difference that the metacat and srb
examples have a "name" attribute whereas the digir example has a "path"
attribute in the <returnfield> element?

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental-Studies
Arizona State University

	-----Original Message-----
	From: Rod Spears [mailto:rods at ku.edu] 
	Sent: Friday, April 23, 2004 11:00 AM
	To: Chad Berkley
	Cc: Peter McCartney; Saritha Bhandarkar; seek-dev; Jing Tao;
Matt Jones
	Subject: Re: [seek-dev] RE: resultset question

	Dave and I spent some time thinking about this and arrived at a
similar place to #4, but took it a little further and changed how the
resultset is defined and made a minor change to the query.

	The main issue has to do with the consumers of the resultset
coming back from an Ecogrid query. 

	How does a consumer interpret the results in a meaningful way?
	What can be done to help generic consumers and SMS?

	The issue at the moment is that the content of the <record>
element is basically a blob and anything goes. For example:
	1) Metacat returns a bunch of param elements containing the data.
	2) DiGIR returns a bunch of namespace-qualified elements
containing the data.
	3) The SRB doesn't even have any data in the record; only the
identifier attr is meaningful.

	We need to provide a mechanism for the contents to be
interpreted. To do this we will add four things to the existing
resultset schema:
	1) One or more <namespace> elements in the metadata - this will
be the namespace for the new <returnfield> element
	2) A new element, <returnfield>
	3) A "name" attribute for the returnfield element (basically the
same as Peter's 'xpath' attr), which is a unique name within the record
and may be meaningful for wherever the data came from.
	4) A "type" attribute for the returnfield element that describes
the type of data contained in the returnfield 

	The most important and powerful part of the new additions is the
"type" attr. This enables the value to be interpreted. Most of the time
it can be described by a schema definition type, for example "xsi:string"
etc. Or it could be a URL that points to a schema definition document.
This means the value of the returnfield element could be anything from a
string or integer to an entire XML document.

	(Note that the namespace attr has been removed from the record
element)

	The new namespace attrs in the metadata provide a way for the
value of the name attr and the type attr to be interpreted.

	Here is an example of a metacat resultset that is returned
today:
	<rs:resultset system="http://knb.ecoinformatics.org"
	    resultsetId="eml.001"
	    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
	    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
	  <resultsetMetadata>
	    <sendTime>2004-03-10T13:47:26-0600</sendTime>
	    <startRecord>1</startRecord>
	    <endRecord>14</endRecord>
	    <recordCount>14</recordCount>
	  </resultsetMetadata>
	  <record number="1"
	          system="http://dev.nceas.ucsb.edu"
	          identifier="obfs2.379.1"
	          namespace="eml://ecoinformatics.org/eml-2.0.0"
	          lastModifiedDate="2003-11-02T11:07:43-0600"
	          creationDate="2003-11-02T11:07:43-0600">
	      <param name="/eml/dataset/keywordSet/keyword">seasonality</param>
	      <param name="/eml/dataset/keywordSet/keyword">macroalgal bloom</param>
	      <param name="/eml/dataset/keywordSet/keyword">green tide</param>
	      <param name="/eml/dataset/keywordSet/keyword">Ulva</param>
	      <param name="/eml/dataset/creator/individualName/surName">Nelson</param>
	      <param name="/eml/dataset/keywordSet/keyword">biomass</param>
	      <param name="/eml/dataset/keywordSet/keyword">algal blooms</param>
	      <param name="/eml/dataset/title">Armitage Bay Ulvoid Algal Biomass and Species Composition</param>
	      <param name="/eml/dataset/keywordSet/keyword">Enteromorpha</param>
	      <param name="/eml/dataset/keywordSet/keyword">Ulvaria</param>
	  </record>

	Here is an example of the same resultset as described by the new
approach:
	<rs:resultset system="http://knb.ecoinformatics.org"
	    resultsetId="eml.001"
	    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
	    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
	  <resultsetMetadata>
	    <sendTime>2004-03-10T13:47:26-0600</sendTime>
	    <startRecord>1</startRecord>
	    <endRecord>14</endRecord>
	    <recordCount>14</recordCount>
	    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
	    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
	  </resultsetMetadata>
	  <record number="1"
	          system="http://dev.nceas.ucsb.edu"
	          identifier="obfs2.379.1"
	          lastModifiedDate="2003-11-02T11:07:43-0600"
	          creationDate="2003-11-02T11:07:43-0600">
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">seasonality</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">macroalgal bloom</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">green tide</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulva</returnfield>
	      <returnfield name="/eml/dataset/creator/individualName/surName" type="xsi:string">Nelson</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">biomass</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">algal blooms</returnfield>
	      <returnfield name="/eml/dataset/title" type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Enteromorpha</returnfield>
	      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulvaria</returnfield>
	  </record>

	Note how we can now interpret the resultset in a much more
meaningful way. Also, note that there are two new namespace elements:
one contains a "prefix" attr, the other does not. The one without
becomes the default namespace for unqualified values in the name and
type attrs.
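
A minimal sketch of how a generic consumer might read this layout, assuming Python's standard xml.etree.ElementTree and a trimmed copy of the Metacat example (the xsi:schemaLocation attribute and most fields are left out to keep it short). The helper names read_namespaces, expand and fields_by_name are made up for illustration and are not part of any EcoGrid code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the proposed Metacat resultset (illustrative subset only).
SAMPLE = """\
<rs:resultset system="http://knb.ecoinformatics.org" resultsetId="eml.001"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
  <resultsetMetadata>
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
  </resultsetMetadata>
  <record number="1" identifier="obfs2.379.1">
    <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">seasonality</returnfield>
    <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulva</returnfield>
    <returnfield name="/eml/dataset/title" type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
  </record>
</rs:resultset>
"""

def read_namespaces(root):
    """Map prefix -> URI; the element without a prefix is stored under None."""
    return {ns.get("prefix"): (ns.text or "").strip()
            for ns in root.findall("resultsetMetadata/namespace")}

def expand(value, namespaces):
    """Split 'prefix:local' into (URI, local); bare values use the default namespace."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix], local
    return namespaces.get(None), value

def fields_by_name(record):
    """Group returnfield values by their name attr, keeping repeated names."""
    grouped = defaultdict(list)
    for field in record.findall("returnfield"):
        grouped[field.get("name")].append(field.text)
    return dict(grouped)

root = ET.fromstring(SAMPLE)
namespaces = read_namespaces(root)
fields = fields_by_name(root.find("record"))
print(expand("xsi:string", namespaces))
print(fields["/eml/dataset/keywordSet/keyword"])
```

Repeated names (the keywords) stay as a list, so the consumer does not have to know in advance which fields can occur more than once.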

	Here is the before and after for the DiGIR query:
	Before:
	<rs:resultset resultsetId="foo.1.1"
	    system="urn:not://sure/what/to/put/here"
	    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
	    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
	    <resultsetMetadata>
	        <sendTime>2003-05-02T16:45:50-09:00</sendTime>
	        <startRecord>1</startRecord>
	        <endRecord>2</endRecord>
	        <recordCount>2</recordCount>
	    </resultsetMetadata>
	     <record number="1"
	             system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
	             identifier="mvz1"
	             namespace="http://digir.net/schema/conceptual/darwin/2003/1.0"
	             lastModifiedDate="2003-03-03T10:42:13"
	             creationDate="2003-03-03T10:42:13">
	        <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS</darwin:ScientificName>
	        <darwin:Longitude>121</darwin:Longitude>
	        <darwin:Latitude>33</darwin:Latitude>
	     </record>

	After:
	<rs:resultset resultsetId="foo.1.1"
	    system="urn:not://sure/what/to/put/here"
	    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
	    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">

	    <resultsetMetadata>
	        <sendTime>2003-05-02T16:45:50-09:00</sendTime>
	        <startRecord>1</startRecord>
	        <endRecord>2</endRecord>
	        <recordCount>2</recordCount>
	        <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
	        <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
	    </resultsetMetadata>

	    <record number="1"
	             system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
	             identifier="mvz1"
	             lastModifiedDate="2003-03-03T10:42:13"
	             creationDate="2003-03-03T10:42:13">
	        <returnfield path="ScientificName" type="xsi:string">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnfield>
	        <returnfield path="Longitude" type="xsi:int">121</returnfield>
	        <returnfield path="Latitude" type="xsi:int">33</returnfield>
	    </record>
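
As a rough illustration of what the "type" attr buys a consumer, here is a sketch that converts the DiGIR values according to their declared type. It assumes Python's standard xml.etree.ElementTree; the CONVERTERS table and the typed_fields helper are hypothetical and cover only the two types that appear in this example. Because the examples in this thread key the field with either "name" or "path", the sketch accepts both.

```python
import xml.etree.ElementTree as ET

# The record from the DiGIR "After" example, reduced to its returnfield children.
RECORD = """\
<record number="1" identifier="mvz1">
  <returnfield path="ScientificName" type="xsi:string">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnfield>
  <returnfield path="Longitude" type="xsi:int">121</returnfield>
  <returnfield path="Latitude" type="xsi:int">33</returnfield>
</record>
"""

# Hypothetical converter table: type attr value -> Python callable.
CONVERTERS = {"xsi:string": str, "xsi:int": int}

def typed_fields(record):
    """Return {field key: converted value} for one record element."""
    values = {}
    for field in record.findall("returnfield"):
        # Accept either attribute as the key, since both appear in the examples.
        key = field.get("name") or field.get("path")
        # Unknown or missing types are left as plain text.
        convert = CONVERTERS.get(field.get("type"), str)
        values[key] = convert(field.text)
    return values

digir = typed_fields(ET.fromstring(RECORD))
print(digir)
```

A consumer that only ever saw the old namespace-qualified elements would have had to know the Darwin Core schema to tell that Longitude is numeric; here the record says so itself.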

	Here is the SRB's before and after:
	Before:
	<rs:resultset system="http://knb.ecoinformatics.org"
	    resultsetId="SeekSRB_001"
	    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
	 <resultsetMetadata>
	   <sendTime>2004-04-16T11:02:12-0500</sendTime>
	   <startRecord>1</startRecord>
	   <endRecord>2</endRecord>
	   <recordCount>2</recordCount>
	 </resultsetMetadata>
	 <record number="1"
	         system="http://srb.sdsc.edu"
	         identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
	         namespace="srb://srb.sdsc.edu"
	         lastModifiedDate="2003-11-30T13:04:59-0600"
	         creationDate="2003-11-30T13:04:58-0600">
	 </record>

	After:
	<rs:resultset system="http://knb.ecoinformatics.org"
	    resultsetId="SeekSRB_001"
	    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
	 <resultsetMetadata>
	   <sendTime>2004-04-16T11:02:12-0500</sendTime>
	   <startRecord>1</startRecord>
	   <endRecord>2</endRecord>
	   <recordCount>2</recordCount>
	   <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
	 </resultsetMetadata>
	 <record number="1"
	         system="http://srb.sdsc.edu"
	         identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
	         lastModifiedDate="2003-11-30T13:04:59-0600"
	         creationDate="2003-11-30T13:04:58-0600">
	  <returnfield name="location" type="xsi:string">/home/testuser.sdsc/SeekTestArea/Lesli Model::0</returnfield>
	 </record>

  _____  

	The Query
	About the only difference between the old query and the new is
that if the returnfield values and concept attr values do not have a
namespace prefix, then the prefix should be dropped from the namespace
element; if the namespace element does have a prefix, the values should
carry that prefix. For example:

	<?xml version="1.0" encoding="UTF-8"?>
	<egq:query queryId="test.1.1"
	    system="http://knb.ecoinformatics.org"
	    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
	    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
	    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
	    <returnfield>/eml/dataset/title</returnfield>
	    <returnfield>/eml/dataset/creator/individualName/surName</returnfield>
	    <returnfield>/eml/dataset/pubDate</returnfield>
	    <returnfield>/eml/dataset/keywordSet/keyword</returnfield>
	    <title>Soils metadata query</title>
	    <AND>
	        <OR>
	            <condition operator="LIKE" concept="title">%soil%</condition>
	            <condition operator="NOT LIKE" concept="title">%dirt%</condition>
	        </OR>
	        <OR>
	            <condition operator="LIKE" concept="surName">%Jones%</condition>
	            <condition operator="LIKE" concept="surName">%Vieglais%</condition>
	        </OR>
	    </AND>
	</egq:query>
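
To make the nesting of the conditions concrete, here is a small sketch that evaluates the AND/OR block of this query against one record. It is not how any EcoGrid node evaluates queries; it assumes SQL-style LIKE semantics where % matches any run of characters, case-insensitive matching, and a record held as a plain dict of concept name to list of values. The names like and matches are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The condition block from the query above, without the surrounding egq:query element.
QUERY = """\
<AND>
  <OR>
    <condition operator="LIKE" concept="title">%soil%</condition>
    <condition operator="NOT LIKE" concept="title">%dirt%</condition>
  </OR>
  <OR>
    <condition operator="LIKE" concept="surName">%Jones%</condition>
    <condition operator="LIKE" concept="surName">%Vieglais%</condition>
  </OR>
</AND>
"""

def like(pattern, text):
    """SQL-style LIKE: % matches any run of characters (case-insensitive here, by assumption)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.IGNORECASE | re.DOTALL) is not None

def matches(node, record):
    """Recursively evaluate an AND / OR / condition element against a record dict."""
    if node.tag == "AND":
        return all(matches(child, record) for child in node)
    if node.tag == "OR":
        return any(matches(child, record) for child in node)
    if node.tag == "condition":
        operator = node.get("operator")
        if operator not in ("LIKE", "NOT LIKE"):
            raise ValueError("unsupported operator: %r" % operator)
        values = record.get(node.get("concept"), [])
        hit = any(like(node.text, value) for value in values)
        # NOT LIKE is read here as "no value matches", which is an assumption.
        return not hit if operator == "NOT LIKE" else hit
    raise ValueError("unexpected element: " + node.tag)

tree = ET.fromstring(QUERY)
soil_record = {"title": ["Desert soil survey"], "surName": ["Jones"]}
dirt_record = {"title": ["Dirt roads"], "surName": ["Smith"]}
print(matches(tree, soil_record), matches(tree, dirt_record))
```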

  _____  

	We can either discuss this via email, or think about it and
discuss it further during our phone meeting.

	Rod

	Chad Berkley wrote:

		Hi, 

		Sorry for my late reply...we've been busy with a morpho
release.  Thanks for getting me in gear, Rod. 

		In metacat, we only return leaf nodes (i.e. the text
node child of a CDATA element like in response 4 below).  The
returnfield functionality was originally meant as a convenient way to
return enough information for a meaningful resultset to display, say, on
a web page.  It was not meant to return whole document chunks for
further processing.  I can see how this would be useful, but it would
require returning a namespace defined chunk so that a parser would know
what to do with it.  Metacat currently uses the returnfields to build
the resultset table, then a request must be made for the whole document
in order to do further processing. 

		Looking at responses 1-3 below, to me, they are all
invalid and potentially problematic.  Without a namespace to parse those
xml chunks against, the parser is left to just do well-formedness
checking, and any query into these document chunks may fail because we
don't know what to expect to get back before doing the processing (e.g.
an xpath query). 

		So I guess to make a short answer long, I agree with
Peter's assessment of sticking with response 4 (which is basically what
metacat has done all along). 

		chad 

		Rod Spears wrote: 

			Is anyone better qualified than me, going to
address Peter's questions? 

			Please someone respond, thanks. 

			Rod 

			Peter McCartney wrote: 

				It has to be well formed no matter what.
So the question is really how can we identify a namespace for the result
set when the content we stick in there has no hope of being valid?
Further, how can we define a set of rules for how the results are to be
evaluated against that namespace yet not be valid? 
				request 1:
'*/creator/individualName/surname', '/eml/dataset 

				Rule 1: "content must appear in the minimal
xml tree needed to accommodate the information" 

				Rule 2: "content must appear in a
potentially valid xml tree that is invalid only due to other required
elements missing" 

				Rule 3: "content must appear in a tree
that places it in the correct node ancestry for the declared namespace" 

				response 1: meets 1 and 3 and is well
formed. Requires just knowledge of parent ancestry to build. 
				<eml> 
				    <dataset> 
				    <creator> 
				        <individualName> 
				                <surname>mccartney</surname> 
				                <surname>jones</surname>
				        </individualName> 
				    </creator> 
				</dataset> 
				</eml> 

				response 2: meets 1, 2 and 3 and is well
formed. Requires knowledge of ancestry and index (i.e. jones is in
creator[2] of dataset[1]) 
				<eml> 
				    <dataset> 
				    <creator> 
				        <individualName> 
				                <surname>mccartney</surname> 
				        </individualName> 
				    </creator> 
				    <creator> 
				        <individualName> 
				                <surname>jones</surname>
				        </individualName> 
				    </creator> 
				  </dataset> 
				</eml> 

				response 3: meets 3 and is not well
formed. Requires knowledge of ancestry. 

				<eml> 
				    <dataset> 
				    <creator> 
				        <individualName> 
				                <surname>mccartney</surname> 
				        </individualName> 
				    </creator> 
				</dataset> 
				<eml> 
				    <dataset> 
				    <creator> 
				        <individualName> 
				                <surname>jones</surname>
				        </individualName> 
				    </creator> 
				</dataset> 
				</eml> 
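
The well-formedness claims above can be checked mechanically. The sketch below assumes Python's standard xml.etree.ElementTree, and it assumes response 3 is meant as two sibling <eml> trees, one per surname; the strings are condensed stand-ins for the responses, not copies of any real server output.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Condensed version of response 1: one root, both surnames under one creator.
ONE_ROOT = ("<eml><dataset><creator><individualName>"
            "<surname>mccartney</surname><surname>jones</surname>"
            "</individualName></creator></dataset></eml>")

# Assumed reading of response 3: two root elements side by side.
TWO_ROOTS = ("<eml><dataset><creator><individualName><surname>mccartney</surname>"
             "</individualName></creator></dataset></eml>"
             "<eml><dataset><creator><individualName><surname>jones</surname>"
             "</individualName></creator></dataset></eml>")

# Element names are case sensitive, so a mismatched close tag also fails.
CASE_MISMATCH = "<individualName><surname>jones</surname></individualname>"

print(is_well_formed(ONE_ROOT), is_well_formed(TWO_ROOTS), is_well_formed(CASE_MISMATCH))
```

Passing this check says nothing about validity against the EML schema, which is the distinction the rules above are trying to draw.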

				and just a reminder of where we
originally started from (approximately)   
				response 4: meets no rule, cannot be
validated, but conveys all the information to generate format 1 or 3
above using a string tokenizer and a jDOM, but not option 2. 
				<resultset namespace=eml......> 
				    <returnfield
xpath="dataset/creator/individualname/surname">mccartney</returnfield> 
				    <returnfield
xpath="dataset/creator/individualname/surname">jones</returnfield> 
				</resultset> 
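
A sketch of the "string tokenizer plus DOM" step, generating format 1 from the flat pairs in response 4. It uses Python's standard xml.etree.ElementTree in place of the Java tokenizer and jDOM mentioned above; build_tree is a hypothetical helper, and the root tag "eml" is supplied by hand because the xpath values in response 4 start at dataset.

```python
import xml.etree.ElementTree as ET

def build_tree(root_tag, pairs):
    """Merge (path, value) pairs under shared ancestors, one new leaf per value."""
    root = ET.Element(root_tag)
    for path, value in pairs:
        steps = path.split("/")              # the "string tokenizer" step
        node = root
        for step in steps[:-1]:
            child = node.find(step)          # reuse an ancestor that already exists
            if child is None:
                child = ET.SubElement(node, step)
            node = child
        leaf = ET.SubElement(node, steps[-1])  # always a new leaf, so repeats are kept
        leaf.text = value
    return root

pairs = [
    ("dataset/creator/individualname/surname", "mccartney"),
    ("dataset/creator/individualname/surname", "jones"),
]
tree = build_tree("eml", pairs)
print(ET.tostring(tree, encoding="unicode"))
```

Both surnames end up under a single creator, as in response 1. Nothing in the flat pairs says that jones belongs to a second creator, which is why format 2 cannot be rebuilt this way.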

				I think we should really ask whether we
are making ourselves deal with some very complicated rules for really no
gain in functionality. None of the results will be valid according to
the namespace. All of them are valid if I make up my own namespace for
the result set.  Unless we can hold ourselves to the standard where any
code or xsl written for the schema will successfully process the result
set (#2 is the closest to that, but depending on how loose the code is,
all three could work or none could work), why shouldn't we opt for the
easiest rule to comply with? 

				Peter McCartney (peter.mccartney at asu.edu) 
				Center for Environmental-Studies 
				Arizona State University 

				    -----Original Message----- 
				    *From:* Saritha Bhandarkar 
				    *Sent:* Friday, April 09, 2004 10:28
AM 
				    *To:* 'seek-dev' 
				    *Cc:* Jing Tao; Peter McCartney;
Saritha Bhandarkar 
				    *Subject:* resultset question 

				    Hi, 

				    I had a question about the resultset
to be returned by Xanthoria. 

				    The schema of the resultset
specifies that a record is of type 
				    "AnyRecordType" and optionally it
may have some element content 
				    from the record. Now, my question
here is, if I am to return the 
				    elements specified in the
<returnfields> of the query, for the matching records (that is from the
matching 
				    eml file), do I need to send it in
eml format,  with only relevant 
				    values for requested fields and no
values for the fields which are 
				    not requested? Or is it enough to
return only the requested fields 
				    with their values, as well-formed
xml? Can someone please brief me 
				    on the contents of a record in
resultsetType? 

				    Thanks, 

				    Saritha 

				    Saritha Bhandarkar 

				    Research Assistant 

				    Center for Environmental Studies 

				    ASU-Tempe AZ 

				    saritha.bhandarkar at asu.edu

			-- 
			Rod Spears 
			Biodiversity Research Center 
			University of Kansas 
			1345 Jayhawk Boulevard 
			Lawrence, KS 66045, USA 
			Tel: 785 864-4082, Fax: 785 864-5335 
