distribution element

Matt Jones jones at nceas.ucsb.edu
Wed Aug 28 09:47:51 PDT 2002


Hi Peter,

I looked over your new schemas wrt distribution.  Your changes seem 
minimal, but of course I'm still perplexed.  Mainly from past 
conversations, though.  Maybe you could elaborate further on exactly how 
you intend to use the "connection" function type?  The example URL you 
gave didn't have any parameter values, so how can someone use it to make 
a connection, even if they understood the "sde" scheme protocol 
semantics?  You left me in the dust here...

<soapbox>
I'd like to elaborate on my own view of the world related to connections 
and downloading data.  Just to clarify.  Maybe it'll help ring a bell 
for someone.

I still don't see the difference between a download and a connection. 
As an example, here's a couple of urls that one might provide for a 
dataTable entity in the physical section as "download" urls:

    [1] http://example.org/data/latest/knb.1.2.txt
    [2] http://example.org/cgi-bin/getdata?id=knb.1.2
    [3] http://example.org/servlet/getdata/knb.1.2
    [4]  ftp://example.org/data/latest/knb.1.2
    [5] meep://example.org/?database=harvey&user=jones&table=knb.1&rev=2

Question: which of these represents a connection?
Answer:  all of them.  Each one requires a TCP/IP connection over a 
particular protocol, which is implied by the scheme of the URL.  You 
have to 'know' the scheme to understand the URL parameters.

Question: which of these represents a dynamic resource?
Answer: possibly all, possibly none.  Depends on how the data system is 
set up on the back end.  In url [2] it is obviously a script, which 
implies the results may very well come from a database query or 
whatever.  Slightly less obvious is that [3] could also be a script, and 
even further [1] could be a script.  In [1], I could define a script 
called "latest" that takes an id as a path element ("/knb.1.2.txt"), or 
I could define "data" as a script that takes a path element 
"/latest/knb.1.2.txt". Even [4], the ftp url, could point to a dynamic 
object if a cron job replaces the file on the ftp server once a minute 
with the newest data.  Finally, [5] shows a url most software won't be able 
to use because it is a custom scheme, but that's ok, the connection 
parameters can still be listed for those people that do understand the 
meep protocol.  There is absolutely nothing about a URL that you can use 
to determine whether you'll get the same byte stream back on each access 
of that url.

Because of these properties of urls, we can only rely upon the EML 
metadata to determine what we will get back (i.e., whether it is dynamic 
or static).  If the coverage for an entity says it covers the time 
period 1999-2001, and the entity metadata says it contains 937 records 
and 10343 bytes, and the attribute list says it contains 5 attributes 
(id, site, time, replicate, size), etc., then when I access the url 
listed in physical, this is exactly what I should get back.  If you want 
to distribute a subset of that data, you could have a metadata record 
with only 3 attributes and 48 records representing 1999 data, but this 
would be a distinct metadata record, and a distinct url.  That the url 
might actually point at a script that queries a database for the subset 
of data described in the metadata is immaterial (i.e., data described in 
two eml records can share the same underlying data storage system).
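
As a rough sketch of what "this is exactly what I should get back"
could mean for a client, the Python below checks a retrieved byte
stream against the size, record count, and attribute list stated in
the metadata.  The check_against_metadata() function, the dictionary
keys, and the assumption that the data is comma-delimited UTF-8 text
with a single header line are all hypothetical choices of mine for
illustration; a real client would take the delimiter, header, and
encoding details from the physical description.

```python
import csv
import io


def check_against_metadata(payload, expected):
    """Return a list of mismatches between the bytes and the metadata.

    payload  -- the bytes obtained from the url listed in physical
    expected -- dict with 'size_bytes', 'records', and 'attributes'
                (hypothetical keys standing in for the EML fields)
    """
    problems = []

    if len(payload) != expected["size_bytes"]:
        problems.append(
            f"size: expected {expected['size_bytes']}, got {len(payload)}"
        )

    # Assumption: comma-delimited text whose first row names the attributes.
    text = io.StringIO(payload.decode("utf-8"), newline="")
    rows = list(csv.reader(text))
    header = rows[0] if rows else []
    records = rows[1:]

    if header != expected["attributes"]:
        problems.append(
            f"attributes: expected {expected['attributes']}, got {header}"
        )
    if len(records) != expected["records"]:
        problems.append(
            f"records: expected {expected['records']}, got {len(records)}"
        )
    return problems
```

An empty list means the stream matches its description.  Whether the
bytes came from a static file or a database query never enters into
the check, and a subset of the data would simply be checked against
its own, distinct metadata record.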

So, what about analytic processing?  I agree it's critical.  Ken seemed 
to imply the other day on the phone that I was missing this point, but I 
was not. People won't be able to download huge images, or huge datasets. 
  Agreed. They should be able to subset and summarize, and get those 
subsets and summaries instead of the original data.  However, we should 
not pass the subset or summary data off as being the same thing as the 
original data described in the original metadata -- it is a new dataset 
with its own metadata, hopefully showing its relationship to the 
original (as a derived or subsetted portion of the original).  At the 
current point in time, there is no standardization for how to express 
this kind of request for analytical processing on the server side -- in 
my mind it is orthogonal to describing data itself.  We are working on 
just such a beast in Monarch, but we are trying hard to not conflate the 
concept of a data stream with the concept of an analytical request. 
This is the same reason I've been so reluctant in the addition of 
"StoredProcedure" and "View" as entity types if all they do is return a 
table of data that could be described using a dataTable entity (where 
the only difference is backend analytical processing of a procedure).
</soapbox>

As you can see, I don't understand the reasons for needing connection. 
But, that's just because I don't understand your rationale yet. And it 
is critical that we have agreement so that we can build interoperable, 
metadata-driven data processing systems.  If you could provide some more 
concrete examples that show how a connection url differs from a download 
url in the context of this email, and elaborate more on how you intend 
that urls in general would be used to obtain the data described in the 
metadata, hopefully I'll be able to understand your approach.

Thanks,
Matt

Peter McCartney wrote:
> 1) my proposed change to distribution to allow one to provide a url 
> string that represents a connection rather than a discrete resource 
> object, identify the url as such with the function attribute, and 
> provide a url that points to online documentation for a custom schema 
> (eg sde://hostname:port ? instance=&database= )



-- 
*******************************************************************
Matt Jones                                    jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)

Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************



