distribution element
Matt Jones
jones at nceas.ucsb.edu
Wed Aug 28 09:47:51 PDT 2002
Hi Peter,
I looked over your new schemas wrt distribution. Your changes seem
minimal, but of course I'm still perplexed. Mainly from past
conversations, though. Maybe you could elaborate further on exactly how
you intend to use the "connection" function type? The example URL you
gave didn't have any parameter values, so how can someone use it to make
a connection, even if they understood the "sde" scheme protocol
semantics? You left me in the dust here...
<soapbox>
I'd like to elaborate on my own view of the world related to connections
and downloading data. Just to clarify. Maybe it'll help ring a bell
for someone.
I still don't see the difference between a download and a connection.
As an example, here's a couple of urls that one might provide for a
dataTable entity in the physical section as "download" urls:
[1] http://example.org/data/latest/knb.1.2.txt
[2] http://example.org/cgi-bin/getdata?id=knb.1.2
[3] http://example.org/servlet/getdata/knb.1.2
[4] ftp://example.org/data/latest/knb.1.2
[5] meep://example.org/?database=harvey&user=jones&table=knb.1&rev=2
Question: which of these represents a connection?
Answer: all of them. Each one requires a TCP/IP connection over a
particular protocol, which is implied by the scheme of the URL. You
have to 'know' the scheme to understand the URL parameters.
Question: which of these represents a dynamic resource?
Answer: possibly all, possibly none. Depends on how the data system is
set up on the back end. In url [2] it is obviously a script, which
implies the results may very well come from a database query or
whatever. Slightly less obvious is that [3] could also be a script, and
even further [1] could be a script. In [1], I could define a script
called "latest" that takes an id as a path element ("/knb.1.2.txt"), or
I could define "data" as a script that takes a path element
"/latest/knb.1.2.txt". Even [4], the ftp url, could point to a dynamic
object if a cron job replaces the file on the ftp server once a minute
with the newest data. URL [5] shows a url most software won't be able
to use because it is a custom scheme, but that's ok, the connection
parameters can still be listed for those people that do understand the
meep protocol. There is absolutely nothing about a URL that you can use
to determine whether you'll get the same byte stream back on each access
of that url.
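A small sketch of that last claim, in Python: the only way to learn whether a url is static is to access it and compare what comes back. The two back ends below are stand-ins invented for this illustration (no real server is contacted), and same_bytes is a hypothetical helper name.

```python
import hashlib
from itertools import count

def same_bytes(fetch, url):
    """Access url twice via fetch and compare SHA-256 digests of the results."""
    first = hashlib.sha256(fetch(url)).hexdigest()
    second = hashlib.sha256(fetch(url)).hexdigest()
    return first == second

# Stand-in back end that always serves the same file.
def static_backend(url):
    return b"id,site\n1,A\n"

# Stand-in back end whose content changes on every access,
# e.g. a script or a file replaced by a cron job.
_revision = count(1)
def dynamic_backend(url):
    return f"id,site\n{next(_revision)},A\n".encode()

URL = "ftp://example.org/data/latest/knb.1.2"  # url [4]; identical for both back ends
print(same_bytes(static_backend, URL))
print(same_bytes(dynamic_backend, URL))
```

The url string is the same in both calls; only the back end differs. Note that two matching accesses would still not prove a resource is static, since it could change on the next access.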
Because of these properties of urls, we can only rely upon the EML
metadata to determine what we will get back (i.e., whether it is dynamic
or static). If the coverage for an entity says it covers the time
period 1999-2001, and the entity metadata says it contains 937 records
and 10343 bytes, and the attribute list says it contains 5 attributes
(id, site, time, replicate, size), etc., then when I access the url
listed in physical, this is exactly what I should get back. If you want
to distribute a subset of that data, you could have a metadata record
with only 3 attributes and 48 records representing 1999 data, but this
would be a distinct metadata record, and a distinct url. That the url
might actually point at a script that queries a database for the subset
of data described in the metadata is immaterial (i.e., data described in
two eml records can share the same underlying data storage system).
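The check described above could be sketched as follows in Python. This is only an illustration under stated assumptions: the data is taken to be comma-delimited text with a single header line, and EntityMetadata and mismatches are hypothetical names, not part of EML or any existing tool. A small sample stands in for the 937-record table.

```python
from dataclasses import dataclass

@dataclass
class EntityMetadata:
    """The few metadata facts used in this sketch (hypothetical structure)."""
    record_count: int
    size_bytes: int
    attributes: tuple

def mismatches(data, meta):
    """Return a list of ways the retrieved bytes disagree with the metadata."""
    problems = []
    if len(data) != meta.size_bytes:
        problems.append(f"size: expected {meta.size_bytes}, got {len(data)}")
    lines = data.decode("utf-8").splitlines()
    # Assumption: the first line is a comma-delimited header of attribute names.
    header = tuple(lines[0].split(",")) if lines else ()
    if header != meta.attributes:
        problems.append(f"attributes: expected {meta.attributes}, got {header}")
    records = max(len(lines) - 1, 0)
    if records != meta.record_count:
        problems.append(f"records: expected {meta.record_count}, got {records}")
    return problems

full = b"id,site,time,replicate,size\n1,A,1999,1,3.2\n2,B,2000,1,4.1\n"
meta = EntityMetadata(2, len(full), ("id", "site", "time", "replicate", "size"))

# A subset (3 attributes, 1999 only) is a different dataset and fails the check.
subset = b"id,site,time\n1,A,1999\n"

print(mismatches(full, meta))
print(mismatches(subset, meta))
```

The subset would pass only against its own metadata record, which is the reason it needs a distinct record and a distinct url.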
So, what about analytic processing? I agree it's critical. Ken seemed
to imply the other day on the phone that I was missing this point, but I
was not. People won't be able to download huge images, or huge datasets.
Agreed. They should be able to subset and summarize, and get those
subsets and summaries instead of the original data. However, we should
not pass the subset or summary data off as being the same thing as the
original data described in the original metadata -- it is a new dataset
with its own metadata, hopefully showing its relationship to the
original (as a derived or subsetted portion of the original). At the
current point in time, there is no standardization for how to express
this kind of request for analytical processing on the server side -- in
my mind it is orthogonal to describing data itself. We are working on
just such a beast in Monarch, but we are trying hard to not conflate the
concept of a data stream with the concept of an analytical request.
This is the same reason I've been so reluctant about the addition of
"StoredProcedure" and "View" as entity types if all they do is return a
table of data that could be described using a dataTable entity (where
the only difference is backend analytical processing of a procedure).
</soapbox>
As you can see, I don't understand the reasons for needing connection.
But, that's just because I don't understand your rationale yet. And it
is critical that we have agreement so that we can build interoperable,
metadata-driven data processing systems. If you could provide some more
concrete examples that show how a connection url differs from a download
url in the context of this email, and elaborate more on how you intend
that urls in general would be used to obtain the data described in the
metadata, hopefully I'll be able to understand your approach.
Thanks,
Matt
Peter McCartney wrote:
> 1) my proposed change to distribution to allow one to provide a url
> string that represents a connection rather than a discrete resource
> object, identify the url as such with the function attribute, and
> provide a url that points to online documentation for a custom schema
> (eg sde://hostname:port ? instance=&database= )
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************