[eml-dev] EML Question: a <url> within an EML doc that actually points to the described data table

Matt Jones jones at nceas.ucsb.edu
Fri Feb 10 11:27:26 PST 2006


Hi Inigo,

Short response:
Use online/url or online/connection, but note that the problem is more 
about how much data your EML description exposes than about how that 
data is accessed.

Long response:
distribution/online/url was intended for links where it makes sense -- 
either because the data is in a static file, or because there is a 
sensible web url interface for getting to it dynamically.  We also 
created the distribution/online/connection section to describe other 
situations.  This should allow you to define arbitrary connections, such 
as database connections, but it is not machine-processable because there 
are no standards as to how to organize the information.  We discussed 
this a lot when we were creating the section, and tried several 
different structures.  Ultimately, we decided it wasn't particularly 
tractable.  So, although you could use this structure, it's probably not 
the best solution because there is no way to automate data access with it.

The underlying problem here is that, according to the IM, the data 
object described by his EML document is larger than one would reasonably 
want returned.  This may be because the consumer doesn't want a data 
dump that big, or the producer doesn't want people hitting their server 
that hard. So, I guess the question is, why is the data described as one 
large object if that's not the unit the IM wants to be managed?  Could a 
series of views or snapshots be described instead that expose a more 
reasonable amount of data according to the IM?  That's why we included 
the 'View' entity type in EML.  I'm not sure what the right solution is 
with our existing EML.
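
To make the snapshot idea concrete, here is a minimal sketch in Python 
that builds one distribution/online/url fragment per yearly snapshot 
instead of a single link to the whole table.  The base URL, the file 
naming scheme, and the snapshot_distributions helper are all 
hypothetical; only the distribution/online/url nesting comes from EML, 
and a real document would need the rest of each entity description 
around these fragments.

```python
import xml.etree.ElementTree as ET


def snapshot_distributions(base_url, years):
    """Return one <distribution> element per yearly snapshot file."""
    elements = []
    for year in years:
        distribution = ET.Element("distribution")
        online = ET.SubElement(distribution, "online")
        url = ET.SubElement(online, "url")
        # Hypothetical naming scheme: one zipped file per year.
        url.text = f"{base_url}/weather_{year}.zip"
        elements.append(distribution)
    return elements


if __name__ == "__main__":
    # example.org is a placeholder, not a real data server.
    for element in snapshot_distributions("http://example.org/data", [2004, 2005]):
        print(ET.tostring(element, encoding="unicode"))
```

Each fragment then points at a unit of data that is reasonable to 
download on its own.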

We've recognized this problem for a while.  It comes down to this: large 
data sets usually should be accessed through a process that supports 
server-side subsetting, rather than being downloaded as a whole.  The 
problem is, everyone uses a different way of setting up and web-enabling 
these processes.  In the EcoGrid effort, one of the data access 
definitions we defined is a web service that permits data subsetting. If 
a web service such as this were standardized, then I think this whole 
problem would get easier to deal with.  I'd be happy to talk to people 
about the EcoGrid web services and their current status (the data 
subsetting API has not been implemented, but simpler data access 
services have been).
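
As a rough illustration of what server-side subsetting means, here is a 
minimal Python sketch in which the caller supplies a time window and a 
row cap, and only the matching rows leave the database.  This is not the 
EcoGrid API; the table layout, column names, and the fetch_subset and 
demo_connection helpers are assumptions made for the example, and an 
in-memory SQLite table stands in for the real database.

```python
import sqlite3


def fetch_subset(conn, start, end, max_rows=1000):
    """Return at most max_rows observations with start <= obs_time < end."""
    cursor = conn.execute(
        "SELECT obs_time, station, air_temp FROM observations "
        "WHERE obs_time >= ? AND obs_time < ? "
        "ORDER BY obs_time LIMIT ?",
        (start, end, max_rows),
    )
    return cursor.fetchall()


def demo_connection():
    """Build a small in-memory table standing in for the real database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (obs_time TEXT, station TEXT, air_temp REAL)"
    )
    # Hypothetical data: one reading per minute for an hour.
    rows = [(f"2006-02-10T11:{m:02d}:00", "STN1", 10.0 + m) for m in range(60)]
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for row in fetch_subset(conn, "2006-02-10T11:10:00", "2006-02-10T11:20:00"):
        print(row)
```

The row cap protects the producer's server, and the time window lets the 
consumer ask for only the slice they need.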

Cheers,
Matt

inigo san gil wrote:
> Hello all,
> 
> Here is an EML-related question.  The general broad question would be 
> like this:
>  
> You are describing a large dataset (300,000 rows + 10-15 columns), and 
> you naturally want to put a direct link to the actual data table in your 
> well-documented EML.  What do you do?
> 
> Here is more info, if you feel you need more.
> 
> Your data may be in Oracle (or other DB) and a parametrized query (in 
> the URL) on ALL data would slow down the server to a crawl.
> Yes, it is a time series of weather station data, with sampling every 
> 15 secs at best.
> 
> Do we run a cron job -- a full query every so often --, zip the results 
> in a file that is overwritten (updated)?
> Do you just place a link to the query page? A web service?
> 
> 
> <distribution>
>        <online>
>             <url>the url comes here</url>
>        </online>
> </distribution>
> 
> * LTER Best practices encourages you to point to the data entity.
>                                        
> Cheers,
> Inigo
> 
> _______________________________________________
> Eml-dev mailing list
> Eml-dev at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matt Jones                                   Ph: 907-789-0496
jones at nceas.ucsb.edu                    SIP #: 1-747-626-7082
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara     http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

