[eml-dev] Can one EML document describe multiple datasets?

Mon Oct 26 18:11:41 PDT 2009

Hi Steve,

EML remains somewhat agnostic regarding just what defines a 'data set'.  A
definition we've commonly used is that a 'data set' is any collection of
data records that are usefully assembled for a particular scientific
purpose.  Often, the purpose is to organize the data associated with a given
data collection campaign, frequently segmented temporally (e.g., all height
measurements from the 1999 permanent plot survey).  Other times, a data set
might be all data associated with a particular analysis or synthesis
activity (e.g., data from a particular paper).  It is useful if these types
of synthetic data sets that contain derived data also have pointers in the
metadata back to the original raw data from which they are derived, although
this is not required.  Ultimately, it's up to the contributing investigator
to determine what a useful data set represents.

In practice, the KNB contains a lot of different types of 'data sets'.  Some
people are lumpers (all data from 20 years is one data set), and others are
splitters (every month of every region produces a new data set).  EML allows
for both of these, and everything in between.  Within a single data set, you
can also decide whether all of the data are contained in one or many data
entities (e.g., you might have one dataset with many dataTables in a single
EML document). You can even have multiple entities that share the same
schema but represent different coverages (divided along space, time, or
other axes).
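
To make the "one dataset, many entities" layout concrete, here is a rough
Python sketch that builds a skeletal EML document with a single dataset
containing several dataTable entities.  The namespace URI assumes EML 2.1.0,
and the function name, packageId, title, and table names are all made up for
illustration.  The output is not schema-valid: required elements such as
creator, contact, and each table's attributeList are left out to keep it
short.

```python
import xml.etree.ElementTree as ET

# Assumption: EML 2.1.0 namespace; change it to match the EML version you target.
EML_NS = "eml://ecoinformatics.org/eml-2.1.0"


def build_eml(package_id, title, table_names):
    """Build a skeletal EML tree: one <dataset> holding many <dataTable> entities.

    Hypothetical helper for illustration only. Required elements (creator,
    contact, attributeList, physical, ...) are deliberately omitted.
    """
    ET.register_namespace("eml", EML_NS)
    # Only the root element is namespace-qualified; child elements are not.
    root = ET.Element(
        "{%s}eml" % EML_NS,
        {"packageId": package_id, "system": "knb"},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    # Each entity shares the same layout but covers a different time slice.
    for name in table_names:
        table = ET.SubElement(dataset, "dataTable")
        ET.SubElement(table, "entityName").text = name
    return root


doc = build_eml(
    "example.1.1",  # hypothetical identifier
    "Permanent plot height measurements",
    ["heights_1999", "heights_2000", "heights_2001"],
)
print(ET.tostring(doc, encoding="unicode"))
```

A splitter would instead call the hypothetical build_eml() once per year,
producing three documents with one dataTable each; both layouts are
legitimate.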

For me, the crux is that data sets are the wrong level at which to be
managing scientific data for many purposes (especially synthesis).  Much
more fundamental is the observation, which can be aggregated and
re-aggregated into collections of data much more readily, and is closer to
the fundamental unit of data that scientists actually collect.  We're
working on how to extend EML with the semantics of observations through our
work with OBOE.
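
As a toy illustration of why the observation is the more flexible unit, the
Python sketch below regroups the same observation records into "data sets"
along different axes.  The records, field names, and the aggregate() helper
are invented for the example; this is not OBOE and says nothing about how
OBOE models observations.

```python
from collections import defaultdict

# Hypothetical observation records; field names are assumptions for the example.
observations = [
    {"year": 1999, "month": 6, "region": "north", "height_cm": 112.0},
    {"year": 1999, "month": 6, "region": "south", "height_cm": 98.5},
    {"year": 1999, "month": 7, "region": "north", "height_cm": 115.2},
    {"year": 2000, "month": 6, "region": "north", "height_cm": 120.1},
]


def aggregate(records, key):
    """Group observation records into collections keyed by key(record)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return dict(groups)


# Lumper: every observation lands in a single collection.
lumped = aggregate(observations, key=lambda r: "all")

# Splitter: one collection per (year, month, region).
split = aggregate(
    observations, key=lambda r: (r["year"], r["month"], r["region"])
)

# Something in between: one collection per survey year.
by_year = aggregate(observations, key=lambda r: r["year"])

print(len(lumped), len(split), len(by_year))
```

The same four records yield one, four, or two collections depending only on
the grouping key, which is the sense in which observations can be aggregated
and re-aggregated.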

Helpful?

Matt

On Mon, Oct 26, 2009 at 4:21 PM, Steve Rentmeester <
environmentaldataservices at gmail.com> wrote:

> Hello,
>
> My programmer and I are working to export EML documents from a
> relational database that stores data and metadata from many data
> collection events, protocols, sites, and projects. We are attempting
> to gain a better understanding of how a dataset is defined within EML.
>
> Can one EML document describe multiple datasets?
>
> How is a dataset defined?
>
> Currently, I'm assuming datasets are defined based on data collection
> characteristics (agency, project, protocol, temporal range) and not
> defined based on data analysis or synthesis requirements (all data
> used to evaluate question x).
>
> thank you for any advice,
>
> steve
>
> Steve Rentmeester
> Environmental Data Services
> Contractor to Bonneville Power Administration
> Portland, OR 97203
> office: 503-247-8431
> cell: 503-348-5839