[eml-dev] EML Question: a <url> within an EML doc that actually points to the described data table
Matt Jones
jones at nceas.ucsb.edu
Fri Feb 10 11:27:26 PST 2006
Hi Inigo,
Short response:
Use online/url or online/connection, but note that the problem is more
about exposing too much data in your EML description than about how it
is accessed.
Long response:
distribution/online/url was intended for links where it makes sense --
either because the data is in a static file, or because there is a
sensible web url interface for getting to it dynamically. We also
created the distribution/online/connection section to describe other
situations. This should allow you to define arbitrary connections, such
as database connections, but it is not machine-processable because there
are no standards as to how to organize the information. We discussed
this a lot when we were creating the section, and tried several
different structures. Ultimately, we decided it wasn't particularly
tractable. So, although you could use this structure, it's probably not
the best solution because there is no way to automate data access with it.
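A minimal sketch of the two options, built with Python's standard
library. The URL, the scheme name, and the parameter names and values
are hypothetical placeholders, and the child elements under connection
are an assumption about the schema that should be checked against the
EML specification.

```python
import xml.etree.ElementTree as ET


def url_distribution(url):
    """Build <distribution><online><url>...</url></online></distribution>."""
    dist = ET.Element("distribution")
    online = ET.SubElement(dist, "online")
    ET.SubElement(online, "url").text = url
    return dist


def connection_distribution(scheme_name, params):
    """Build a distribution/online/connection element.

    The child element names are an assumption about the schema. The
    parameter names are free-form, which is why a generic client
    cannot interpret them without extra conventions.
    """
    dist = ET.Element("distribution")
    online = ET.SubElement(dist, "online")
    conn = ET.SubElement(online, "connection")
    definition = ET.SubElement(conn, "connectionDefinition")
    ET.SubElement(definition, "schemeName").text = scheme_name
    for name, value in params.items():
        param = ET.SubElement(conn, "parameter")
        ET.SubElement(param, "name").text = name
        ET.SubElement(param, "value").text = value
    return dist


# Hypothetical values, for illustration only.
static_xml = ET.tostring(
    url_distribution("http://example.org/data/weather.csv"), encoding="unicode"
)
db_xml = ET.tostring(
    connection_distribution(
        "jdbc-oracle", {"host": "db.example.org", "table": "WEATHER"}
    ),
    encoding="unicode",
)
print(static_xml)
print(db_xml)
```

The url form can be fetched by any client that understands HTTP. The
connection form only carries name/value pairs, so each consumer still
needs to know what "host" or "table" means for that particular scheme.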
The underlying problem here is that, according to the IM, the data
object described by his EML document is larger than one would reasonably
want returned. This may be because the consumer doesn't want a data
dump that big, or the producer doesn't want people hitting their server
that hard. So, I guess the question is, why is the data described as one
large object if that's not the unit the IM wants to be managed? Could a
series of views or snapshots be described instead that expose a more
reasonable amount of data according to the IM? That's why we included
the 'View' entity type in EML. I'm not sure what the right solution is
with our existing EML.
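As a rough illustration of the "series of views" idea, the sketch below
splits one large time series into one query per calendar year. The
table and column names are hypothetical, and the SQL uses ANSI date
literals. Each resulting name/query pair could then be described as its
own entity instead of one very large table.

```python
from datetime import date


def yearly_views(table, time_column, first_year, last_year):
    """Return one (view name, SQL query) pair per calendar year.

    The table and column names are interpolated directly, so they must
    come from a trusted source, not from user input.
    """
    views = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1).isoformat()
        end = date(year + 1, 1, 1).isoformat()
        query = (
            f"SELECT * FROM {table} "
            f"WHERE {time_column} >= DATE '{start}' "
            f"AND {time_column} < DATE '{end}'"
        )
        views.append((f"{table}_{year}", query))
    return views


# Hypothetical table and column names.
views = yearly_views("weather_obs", "obs_time", 2004, 2005)
for name, query in views:
    print(name, "->", query)
```

Whether a year is the right unit depends on how much data the IM
considers reasonable to return in one request. The same function works
for any range of years.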
We've recognized this problem for a while. It comes down to this: large
data sets usually should be accessed through a process that supports
server-side subsetting, rather than being downloaded as a whole. The
problem is, everyone uses a different way of setting up and web-enabling
these processes. In the EcoGrid effort, one of the data access
definitions we defined is a web service that permits data subsetting. If
a web service such as this were standardized, then I think this whole
problem would get easier to deal with. I'd be happy to talk to people
about the EcoGrid web services and their current status (the data
subsetting API has not been implemented, but simpler data access
services have been).
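To show what server-side subsetting means in practice, here is a small
sketch using an in-memory SQLite table. It is not the EcoGrid API. The
function name, the table layout, and the row cap are all hypothetical.
The point is only that the client sends a station and a time window,
and the server returns a bounded slice instead of the whole table.

```python
import sqlite3

MAX_ROWS = 1000  # hypothetical per-request cap chosen by the data provider


def subset(conn, station, start, end, limit=MAX_ROWS):
    """Return at most `limit` rows for one station with start <= time < end."""
    limit = min(limit, MAX_ROWS)  # never exceed the provider's cap
    cur = conn.execute(
        "SELECT station, obs_time, air_temp FROM weather_obs "
        "WHERE station = ? AND obs_time >= ? AND obs_time < ? "
        "ORDER BY obs_time LIMIT ?",
        (station, start, end, limit),
    )
    return cur.fetchall()


# Tiny in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather_obs (station TEXT, obs_time TEXT, air_temp REAL)"
)
conn.executemany(
    "INSERT INTO weather_obs VALUES (?, ?, ?)",
    [
        ("A", "2006-02-10T11:00:00", 4.5),
        ("A", "2006-02-10T11:00:15", 4.6),
        ("A", "2006-02-10T11:00:30", 4.6),
        ("B", "2006-02-10T11:00:00", 7.1),
    ],
)
rows = subset(conn, "A", "2006-02-10T11:00:00", "2006-02-10T11:00:30")
print(rows)
```

Timestamps are stored as ISO 8601 strings here so that plain string
comparison orders them correctly. A real service would also need
paging and some agreed way to advertise which filters it supports.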
Cheers,
Matt
inigo san gil wrote:
> Hello all,
>
> Here is an EML-related question. The general broad question would be
> like this:
>
> You are describing a large dataset (300,000 rows + 10-15 columns), and
> you naturally want to put a direct link to the actual data table in your
> well-documented EML. What do you do?
>
> Here is more info, if you feel you need more.
>
> Your data may be in Oracle (or other DB) and a parametrized query (in
> the URL) on ALL data would slow down the server to a crawl.
> Yes, it is a time series of weather stations data, with sampling every
> 15 secs at best.
>
> Do we run a cron job -- a full query every so often --, zip the results
> in a file that is overwritten (updated)?
> Do you just place a link to the query page? A web service?
>
>
> <distribution>
>   <online>
>     <url>the url comes here</url>
>   </online>
> </distribution>
>
> * LTER Best practices encourages you to point to the data entity.
>
> Cheers,
> Inigo
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matt Jones Ph: 907-789-0496
jones at nceas.ucsb.edu SIP #: 1-747-626-7082
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~