distribution element

Thu Aug 29 17:21:40 PDT 2002

I'm just not sure how many more ways I can explain this... I think we're
clearly at an impasse here and should probably drop it. But here's one for
the road...

The distinction I'm making is between a url that returns a discrete data
object and connection information for a service which, through some
proprietary language and transport protocol, accepts instructions and then
returns a set of data. For services that have been exposed as web services
and have a WSDL description, there is probably a perfectly good mechanism
for communicating all the information needed, which we would point to in EML
using a url that points to the UDDI entry for that web service (I assume?).
But for others, I just need to know the server name, port, or whatever other
things are essential to pass (either in code or by hand) to some software I
have (or have written) that is capable of interacting with that type of
service. If I use a url in the eml-physical for each of the tables in my
database, I'm probably going to create something like:
<dataTable>
<physical>
	<distribution>
		<url>schema://some.URL.syntax.that.is.universally.understood?table=mytable1</url>
	</distribution>
</physical>
</dataTable>
<dataTable>
<physical>
	<distribution>
		<url>schema://some.URL.syntax.that.is.universally.understood?table=mytable2</url>
	</distribution>
</physical>
</dataTable>
<dataTable>
<physical>
	<distribution>
		<url>schema://some.URL.syntax.that.is.universally.understood?table=mytable3</url>
	</distribution>
</physical>
</dataTable>
<dataTable>
<physical>
	<distribution>
		<url>schema://some.URL.syntax.that.is.universally.understood?table=mytable4</url>
	</distribution>
</physical>
</dataTable>
<dataTable>
<physical>
	<distribution>
		<url>schema://some.URL.syntax.that.is.universally.understood?table=mytable5</url>
	</distribution>
</physical>
</dataTable>

I now have to parse that string 5 times for the connection information and
compare it across all the entries to find out that yes, tables 4 and 5 are
both at the same connection and that I can probably send an SQL statement
that joins the two. I can't understand why it doesn't bother you to repeat
what is clearly the same server information so many times, yet you don't want
to see party descriptions repeated in an instance file. It's especially
frustrating for me because being able to point to a single server connection
that is relevant for multiple objects was the SINGLE reason that I supported
exploring the referencing features in the first place. OpenURL makes a clear
distinction between the Base-URL (what I'm calling the "connection") and the
Query portion of a url. All I'm asking for is the OPTION to record the
base-url portion once when I have multiple query segments that work off
the same one. I can't make it any clearer than that.
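
To make that bookkeeping concrete, here is a minimal Python sketch of the
parse-and-compare step a consumer would have to run over the five urls.
It is only an illustration under assumptions: the function name
group_by_connection is hypothetical, the table name is assumed to travel
in a "table" query parameter as in the placeholder urls, and a real custom
scheme may not follow generic URL syntax closely enough for a stock parser.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def group_by_connection(urls):
    """Group table names by the base ("connection") part of each url.

    Hypothetical helper: assumes the table is named in a 'table'
    query parameter, as in the placeholder urls in this message.
    """
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        # Everything before the '?' is treated as the connection.
        base = f"{parts.scheme}://{parts.netloc}{parts.path}"
        groups[base].extend(parse_qs(parts.query).get("table", []))
    return dict(groups)


# The five placeholder urls from the dataTable entries.
urls = [
    f"schema://some.URL.syntax.that.is.universally.understood?table=mytable{i}"
    for i in range(1, 6)
]
connections = group_by_connection(urls)
for base, tables in connections.items():
    print(base, tables)
```

Every consumer has to redo this work only to rediscover that all five
tables share one connection, which is what recording the base-url once
would state directly.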

I don't think I am confusing data with analytic requests at all. I am
attempting to provide a way of passing the necessary information (IN A
FORMAT THEY KNOW HOW TO DEAL WITH) to enable someone to make such
requests. I think it's clear that the product of that request is a "new"
dataset and that the original metadata description does not apply. Our
proposal clearly addresses this with our intent to have our web application
generate a new metadata description based on the query that was executed.
The difference is between analytic requests that are performed by a program
after retrieving the entire data object and those that can be performed
by the data service where the object(s) are stored.

I hope this helps, but I don't see the point in further debate: we either
add this capability or we don't. It will serve my needs faster to finalize
EML now so that I have a stable spec to extend with the additional features
that do what I want to do.

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental Studies
Arizona State University
480-965-6791 

> -----Original Message-----
> From: Matt Jones [mailto:jones at nceas.ucsb.edu]
> Sent: Wednesday, August 28, 2002 9:48 AM
> To: Peter McCartney
> Cc: Eml-Dev (E-mail)
> Subject: distribution element
> 
> 
> Hi Peter,
> 
> I looked over your new schemas wrt distribution.  Your changes seem
> minimal, but of course I'm still perplexed.  Mainly from past
> conversations, though.  Maybe you could elaborate further on exactly how
> you intend to use the "connection" function type?  The example URL you
> gave didn't have any parameter values, so how can someone use it to make
> a connection, even if they understood the "sde" scheme protocol
> semantics?  You left me in the dust here...
> 
> <soapbox>
> I'd like to elaborate on my own view of the world related to connections
> and downloading data.  Just to clarify.  Maybe it'll help ring a bell
> for someone.
> 
> I still don't see the difference between a download and a connection.
> As an example, here are a few urls that one might provide for a
> dataTable entity in the physical section as "download" urls:
> 
>     [1] http://example.org/data/latest/knb.1.2.txt
>     [2] http://example.org/cgi-bin/getdata?id=knb.1.2
>     [3] http://example.org/servlet/getdata/knb.1.2
>     [4] ftp://example.org/data/latest/knb.1.2
>     [5] meep://example.org/?database=harvey&user=jones&table=knb.1&rev=2
> 
> Question: which of these represents a connection?
> Answer:  all of them.  Each one requires a TCP/IP connection over a
> particular protocol, which is implied by the scheme of the URL.  You 
> have to 'know' the scheme to understand the URL parameters.
> 
> Question: which of these represents a dynamic resource?
> Answer: possibly all, possibly none.  Depends on how the data system is
> set up on the back end.  In url [2] it is obviously a script, which
> implies the results may very well come from a database query or
> whatever.  Slightly less obvious is that [3] could also be a script, and
> even further [1] could be a script.  In [1], I could define a script
> called "latest" that takes an id as a path element ("/knb.1.2.txt"), or
> I could define "data" as a script that takes a path element
> "/latest/knb.1.2.txt".  Even [4], the ftp url, could point to a dynamic
> object if a cron job replaces the file on the ftp server once a minute
> with the newest data.  URL [5] shows a url most software won't be able
> to use because it is a custom scheme, but that's ok, the connection
> parameters can still be listed for those people that do understand the
> meep protocol.  There is absolutely nothing about a URL that you can use
> to determine whether you'll get the same byte stream back on each access
> of that url.
> 
> Because of these properties of urls, we can only rely upon the EML
> metadata to determine what we will get back (i.e., whether it is dynamic
> or static).  If the coverage for an entity says it covers the time
> period 1999-2001, and the entity metadata says it contains 937 records
> and 10343 bytes, and the attribute list says it contains 5 attributes
> (id, site, time, replicate, size), etc., then when I access the url
> listed in physical, this is exactly what I should get back.  If you want
> to distribute a subset of that data, you could have a metadata record
> with only 3 attributes and 48 records representing 1999 data, but this
> would be a distinct metadata record, and a distinct url.  That the url
> might actually point at a script that queries a database for the subset
> of data described in the metadata is immaterial (i.e., data described in
> two eml records can share the same underlying data storage system).
> 
> So, what about analytic processing?  I agree it's critical.  Ken seemed
> to imply the other day on the phone that I was missing this point, but I
> was not.  People won't be able to download huge images, or huge datasets.
> Agreed.  They should be able to subset and summarize, and get those
> subsets and summaries instead of the original data.  However, we should
> not pass the subset or summary data off as being the same thing as the
> original data described in the original metadata -- it is a new dataset
> with its own metadata, hopefully showing its relationship to the
> original (as a derived or subsetted portion of the original).  At the
> current point in time, there is no standardization for how to express
> this kind of request for analytical processing on the server side -- in
> my mind it is orthogonal to describing data itself.  We are working on
> just such a beast in Monarch, but we are trying hard to not conflate the
> concept of a data stream with the concept of an analytical request.
> This is the same reason I've been so reluctant in the addition of
> "StoredProcedure" and "View" as entity types if all they do is return a
> table of data that could be described using a dataTable entity (where
> the only difference is backend analytical processing of a procedure).
> </soapbox>
> 
> As you can see, I don't understand the reasons for needing connection.
> But, that's just because I don't understand your rationale yet.  And it
> is critical that we have agreement so that we can build interoperable,
> metadata-driven data processing systems.  If you could provide some more
> concrete examples that show how a connection url differs from a download
> url in the context of this email, and elaborate more on how you intend
> that urls in general would be used to obtain the data described in the
> metadata, hopefully I'll be able to understand your approach.
> 
> Thanks,
> Matt
> 
> Peter McCartney wrote:
> > 1) my proposed change to distribution to allow one to provide a url
> > string that represents a connection rather than a discrete resource
> > object, identify the url as such with the function attribute, and
> > provide a url that points to online documentation for a custom schema
> > (eg sde://hostname:port ? instance=&database= )
> 
> 
> 
> -- 
> *******************************************************************
> Matt Jones                                    jones at nceas.ucsb.edu
> http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
> National Center for Ecological Analysis and Synthesis (NCEAS)
> 
> Interested in ecological informatics? http://www.ecoinformatics.org
> *******************************************************************
> 