[eml-dev] Can one EML document describe multiple datasets?

inigo san gil isangil at lternet.edu
Tue Oct 27 11:34:40 PDT 2009


You are welcome, Steve.

Your team's interpretation of what a dataset is seems good to me.
Then, of course, there is the word "project", which is also subject
to interpretation. Sometimes a project is defined by the funding
it receives, other times by the overarching goals of a group or
individual, and so on.

Your vision of what an EML document would contain seems good to me too.

As for the next question, the answer is no - you cannot have more
than one "dataset" instance per EML document. EML as defined
in versions 2.* can accept only one instance of either "dataset",
"citation", "software" or "protocol". It seems you would like
to preserve in an EML document some of the native relationships
between datasets. While that may be done with a clever interpretation
of EML, you would still have to translate that relational vision into a
Kepler module to take advantage of it.
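
As a rough illustration of the one-resource rule above, the sketch below
counts the top-level "dataset", "citation", "software" and "protocol"
children of an EML root element. It is not schema validation, only a quick
check. The function names, the sample documents and the choice of the
EML 2.1.0 namespace are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The four resource elements named in the message above.
RESOURCE_TAGS = {"dataset", "citation", "software", "protocol"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def resource_children(eml_text):
    """Return the names of the resource elements directly under the root."""
    root = ET.fromstring(eml_text)
    return [local_name(child.tag) for child in root
            if local_name(child.tag) in RESOURCE_TAGS]


def has_single_resource(eml_text):
    """True when the document carries exactly one top-level resource."""
    return len(resource_children(eml_text)) == 1


# Hypothetical documents, trimmed to the bare minimum for the check.
ONE_DATASET = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.0"
    packageId="example.1.1" system="example">
  <dataset><title>2009 field season</title></dataset>
</eml:eml>"""

TWO_DATASETS = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.0"
    packageId="example.2.1" system="example">
  <dataset><title>2008 field season</title></dataset>
  <dataset><title>2009 field season</title></dataset>
</eml:eml>"""

if __name__ == "__main__":
    print(has_single_resource(ONE_DATASET))
    print(has_single_resource(TWO_DATASETS))
```

A document shaped like TWO_DATASETS is the case the answer above rules out;
a real check would validate against the EML schema instead.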

Producing as many EML documents as there are "datasets" in your
database is a typical implementation within LTER. One EML containing
it all is not very typical - it may seem awkward, given how the current
EML-friendly metadata clearinghouses out there [metacat, NBII] work
(each search return is broken into links to 'relevant' EML docs).
If you produce one EML for everything, it would be hard for users
to find a direct pointer to the particular bit of information
they queried. I'd stay away from too much lumping - some degree
of lumping (as you describe for your team) is fine, but too much may
have the side effect of diluting specific information in a
large block (per organization within clearinghouses - so more of
a problem of these repositories than of EML, really).

Scripts that produce EML dynamically are perfectly fine.
If you run into more trouble, email us.
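
A minimal sketch of such a script, assuming one EML document per "dataset"
record: the records, the "example" scope, the packageId pattern and the
EML 2.1.0 namespace are all hypothetical choices for illustration, and the
output is only a skeleton that has not been validated against the EML schema.

```python
import xml.etree.ElementTree as ET

# Assumed target version; adjust the namespace to the EML 2.* release in use.
EML_NS = "eml://ecoinformatics.org/eml-2.1.0"
ET.register_namespace("eml", EML_NS)

# Hypothetical rows, standing in for "dataset" records read from a database.
RECORDS = [
    {"id": 1, "title": "Crew A, protocol X, 2008 season", "org": "Example Team"},
    {"id": 2, "title": "Crew A, protocol X, 2009 season", "org": "Example Team"},
]


def package_id(record, scope="example", revision=1):
    """Build a scope.identifier.revision style id (an assumed convention)."""
    return f"{scope}.{record['id']}.{revision}"


def build_eml(record):
    """Return a skeleton EML document holding exactly one dataset."""
    root = ET.Element(f"{{{EML_NS}}}eml",
                      {"packageId": package_id(record), "system": "example"})
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = record["title"]
    # Title, creator and contact are the pieces mentioned in the thread.
    for role in ("creator", "contact"):
        party = ET.SubElement(dataset, role)
        ET.SubElement(party, "organizationName").text = record["org"]
    return ET.tostring(root, encoding="unicode")


def export_all(records):
    """Map each package id to its own EML document, one per record."""
    return {package_id(r): build_eml(r) for r in records}


if __name__ == "__main__":
    for pid, xml_text in export_all(RECORDS).items():
        print(pid, len(xml_text), "characters")
```

Rerunning a script like this after each database update, with a bumped
revision number, keeps the documents in step with the data.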

Inigo
 
Steve Rentmeester wrote:
> All,
>
> Thanks for the feedback.
>
> Typically, our data management team thinks of a dataset in terms of
> the data collection effort and defines a dataset as all data collected
> by a given project or field crew under a given protocol during a
> specified time frame. Typically the time frame is one field season or
> one year. Datasets typically include multiple observation tables.
>
> Reading the EML spec, I can see how to document a single dataset using
> eml. The dataset would have a title, contact, and creator. We would
> include a protocol, multiple method steps, equipment for methods, and
> dataTable to describe multiple entities and their attributes.
>
> So, the way we think about datasets works just fine in eml.
>
> The next question is: Can a single eml document include multiple datasets?
> Can the top-level eml-module include multiple datasets and then each
> dataset be referenced later in the document when describing individual
> protocols, method steps, and dataTables?
>
> My ultimate goal is to make the data in my database available as a
> data source in Kepler. Assuming that my database stores multiple
> "datasets", should we produce multiple eml documents, one for each
> "dataset"? Or should we produce a single eml document and use the
> top-level eml-module as a wrapper to describe the collection of all
> datasets in our database and then describe each dataset individually
> using the lower-level module eml-dataset?
>
> It should be noted that we are writing scripts that will dynamically
> generate the eml document. So, the document can be updated whenever
> the db is updated or a subset of the database is distributed.
>
> Thanks again for any advice or feedback,
>
> steve
>
>
> On Tue, Oct 27, 2009 at 7:24 AM, inigo san gil <isangil at lternet.edu> wrote:
>   
>> Steve,
>>
>> EML is flexible enough -- it accepts any implementation of your vision of
>> what a dataset is. Use this flexibility to your advantage -- see how you
>> manage sets locally, and try to reflect it in the EML packages (or
>> documents). Would it be nice to standardize what a 'dataset' is? I think
>> it would have helped, but we would not reach consensus.
>>
>> There are all sorts of interpretations out there - I'd point you to a paper
>> that we submitted a year ago, but the paper is still held captive by the
>> reviewers. At LTER, most of the sites (26) use the element "dataTable" to
>> describe either spreadsheet types of data, or views from a database. Some
>> use "spatialVector" or "spatialRaster" to detail GIS types of data, but
>> most defer that documentation to ESRI-based metadata, which ties in better
>> with the GIS data management systems.
>>
>> Most LTER EML documents ("datasets") contain more than one table
>> ("dataTable" - think of a meteo station describing several measurements
>> in different spreadsheets). A few sites describe *a lot* of data within
>> one data set (lumped data), and a few others split data to a level close
>> to the most atomic of parts - perhaps an example of a quintessential
>> atomized EML would be certain records generated by the "EcoTrends"
>> project, where an EML "dataset" may describe just one time series (two
>> variables - time and something else).
>>
>> What I would take home about EML is that it is a vehicle to transport
>> information in a common specification - you should define and manage your
>> datasets according to your group's understanding (see your database's
>> collection events as a possible working understanding, or split it a bit
>> from there if those become too massive).
>>
>> cheers, inigo
>>
>> Steve Rentmeester wrote:
>>     
>>> Hello,
>>>
>>> My programmer and I are working to export EML documents from a
>>> relational database that stores data and metadata from many data
>>> collection events, protocols, sites, and projects. We are attempting
>>> to gain a better understanding of how a dataset is defined within EML.
>>>
>>> Can one EML document describe multiple datasets?
>>>
>>> How is a dataset defined?
>>>
>>> Currently, I'm assuming datasets are defined based on data collection
>>> characteristics (agency, project, protocol, temporal range) and not
>>> defined based on data analysis or synthesis requirements (all data
>>> used to evaluate question x).
>>>
>>> thank you for any advice,
>>>
>>> steve
>>>
>>> Steve Rentmeester
>>> Environmental Data Services
>>> Contractor to Bonneville Power Administration
>>> Portland, OR 97203
>>> office: 503-247-8431
>>> cell: 503-348-5839
>>> _______________________________________________
>>> Eml-dev mailing list
>>> Eml-dev at ecoinformatics.org
>>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev
>>>
>>>       
>>     
>
>
>
>   


