EML questions at HFR

Fri Apr 4 14:04:17 PST 2003

What we were striving for in EML was to ensure that there was a place to
describe the research activities that were directly responsible for the
production of a dataset. In my opinion, a core area program or an LTER site
project is probably a related project to the specific project. But from the
data management workshop, it became obvious that many sites don't have this
concept of specific research projects, so we removed the methods discussion
from the project description to ensure that the specific activities related
to this dataset could be expressed even though they were not part of the
project description. So the bottom line is: describe the finest-grained
research activity as the project, and then link related project descriptions
for any "parent" projects.

EML requires that the original content for any block of information that is
referenced with <references> appear once in the document. That means
that if you want to have a unique list of content items that you only have
to enter once, you must manage that list separately and then draw from it
when you create your EML files. One obvious way is to have a database of
these things (for example, you already have the LTER persBIB database from
which you could draw personnel descriptions) and then write some tool that
pastes that content into your EML files as you build them up. Another way
would be to manage an XML file with party fragments in it, as you describe,
and copy and paste from that into your EML files as you go. There is really
no effective way to maintain the synchronization between your EML files and
the external source. The system and scope attributes can be used to indicate
that this content is identified in an external source, but you'd have to
write your own custom software to manage the relationship. We're discovering
that this can lead to key violations, since XML parsers don't consider the
scope attribute when enforcing the unique constraint on IDs, so we're now
looking at writing our own metadata into additionalMetadata to manage the
relationship between content blocks and our relational database.
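
As a rough illustration of the "tool that pastes content" idea, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the <parties> wrapper, the ids pers.001 and pers.002, the surnames, and the helper name paste_fragment. It copies one party fragment from a shared fragment file into a dataset and refuses to create a duplicate id, since XML parsers enforce id uniqueness regardless of scope. The output is not meant to be schema-valid EML.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical shared fragment file: one <creator> block per person,
# each carrying an id that is unique within the fragment list.
FRAGMENTS = """
<parties>
  <creator id="pers.001">
    <individualName><surName>Smith</surName></individualName>
  </creator>
  <creator id="pers.002">
    <individualName><surName>Jones</surName></individualName>
  </creator>
</parties>
"""

# Hypothetical, heavily abbreviated EML document to paste into.
EML_SKELETON = """
<eml>
  <dataset>
    <title>Example dataset</title>
  </dataset>
</eml>
"""


def paste_fragment(eml_root, fragments_root, fragment_id):
    """Copy the fragment with the given id into <dataset>.

    Raises ValueError if the id already occurs anywhere in the document,
    and KeyError if no fragment with that id exists.
    """
    existing = {el.get("id") for el in eml_root.iter() if el.get("id")}
    if fragment_id in existing:
        raise ValueError(f"id {fragment_id!r} already present in document")
    for frag in fragments_root:
        if frag.get("id") == fragment_id:
            # Deep copy so the shared fragment list is left untouched.
            eml_root.find("dataset").append(copy.deepcopy(frag))
            return
    raise KeyError(fragment_id)


fragments = ET.fromstring(FRAGMENTS)
eml = ET.fromstring(EML_SKELETON)
paste_fragment(eml, fragments, "pers.001")
print(ET.tostring(eml, encoding="unicode"))
```

A script like this only copies content in one direction. Keeping the pasted copies in sync with the external source would still need separate bookkeeping, as noted above.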

You can enter discontinuous polygons in geographic coverage if you want to
represent the study areas of two LTER sites as the study area of a project.
Bear in mind that geographic coverage at the resource level is primarily
discovery information. There is a separate section under methods where you
can define the extent of your study area with respect to its spatial
sampling universe. In principle these are the same, but entries in
/dataset/coverage tend to be imprecise bounding boxes.

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental Studies
Arizona State University

-----Original Message-----
From: David Blankman [mailto:dblankman at lternet.edu] 
Sent: Friday, April 04, 2003 1:51 PM
To: Emery R. Boose
Cc: Iml; Eml-Dev (E-mail); Jeanine
Subject: Re: EML questions at HFR

Hi Emery,

I am planning to be out in Boston sometime in the next 30 - 45 days.

In the meantime, I'll answer your questions as best I can. (Answers are
after questions.)

Emery R. Boose wrote:

Hi David, James & Peter, 

I'm writing for guidance on a few (very simple, I suspect) EML questions. 

We're currently revising the Harvard Forest online Data Archive in
preparation for our site review next summer.  I'd like to incorporate
project-level EML as part of this revision.  At our annual LTER meeting in
mid-February I circulated a metadata survey form to all of our researchers,
and now have most of the necessary content in hand.  I'd like to create a
project-level EML template into which I can cut & paste using an XML editor
(see attached hf001.xml for a first attempt).  Though not very elegant, I
think this approach will work fine for now and give us more time to think
about long-term plans for managing our metadata and EML. 

The technical documents on ecoinformatics.org are quite helpful but I'm
still puzzled on several basic points: 

(1) Dataset vs. project.  Our data & metadata are organized according to
(what we call) "project," which appears to correspond most closely to
"dataset" in the world of EML.  Is the "project" category in EML intended to
provide broader information for a specific dataset, or is the project
category really a broader entity that might encompass more than one dataset?

An EML <dataset> is comprised of one or more data entities. While there are
no specific standards for determining the contents of a dataset, the basic
guideline is that a dataset is composed of one or more data entities that
are clearly related. For example, if the title is "Effects of Hurricanes on
Primary Productivity in New England and Puerto Rico", then the dataset might
include data entities like "NEWeather", "NEProductivity", "PRWeather", and
"PRProductivity".

Project is a somewhat fuzzier area. The primary intent of project is to
allow for the documentation of a broader research context than just
a dataset. Continuing the previous example, suppose that you have
datasets with the following titles: "Effects of Hurricanes on
Primary Productivity in New England and Puerto Rico", "Effects of Hurricanes
on Biodiversity in New England and Puerto Rico", and "Effects of Hurricanes
on Water Quality in New England and Puerto Rico". Then a project might be:
"Ecological Effects of Hurricanes in New England and Puerto Rico".

Continuing on, there might be other similar datasets like "Effects of
Hurricanes on Primary Productivity in Andrews Experimental Forest and
Florida Coastal Everglades" and "Effects of Hurricanes on Biodiversity in
Andrews Experimental Forest and Florida Coastal Everglades", with a
corresponding project of "Ecological Effects of Hurricanes in Andrews
Experimental Forest and Florida Coastal Everglades".

This might produce something like:
<eml>
    <dataset>
       <title>Effects of Hurricanes on Primary Productivity in New England
and Puerto Rico</title>
       <project>
          <title>Ecological Effects of Hurricanes in New England and Puerto
Rico</title>
          <relatedProject>
             <title>Ecological Effects of Hurricanes in Andrews Experimental
Forest and Florida Coastal Everglades</title>
          </relatedProject>
       </project>
    </dataset>
</eml>

or alternatively

<eml>
    <dataset>
       <title>Effects of Hurricanes on Primary Productivity in New England
and Puerto Rico</title>
       <project>
          <title>Ecological Effects of Hurricanes</title>
          <relatedProject>
             <title>Ecological Effects of Hurricanes in Andrews Experimental
Forest and Florida Coastal Everglades</title>
          </relatedProject>
          <relatedProject>
             <title>Ecological Effects of Hurricanes in New England and
Puerto Rico</title>
          </relatedProject>
       </project>
    </dataset>
</eml>

A project could be something like an LTER Core Area or even the whole LTER
site research project. The project module is intended to place a given
dataset into a broader research context. Project is optional, but if you
have the information it makes the metadata richer.

(2) Scope of identifiers.  I'd like to place personnel and publications
information into separate EML files that are referenced from the dataset
files (see attached hfpers.xml & hfpubs.xml).  Is it necessary to wrap the
personnel, publications, and dataset files into a single EML file (where
scope = document)? YES Or can I implement these as separate EML files on the
same directory (where scope = system and system = URL)? 

References in EML 2.0 are internal to a document. An EML 2.0 document knows
only what is inside it. The only time that you can reference something
outside the EML document is when there is a <citation> element or something
like <dataset>/<distribution>/<online>/<url>.
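
Because references must resolve inside the same document, a quick check before publishing a file can catch dangling ones. The following is a minimal Python sketch under stated assumptions: the sample document, the ids, and the helper name unresolved_references are all hypothetical, and the check only compares <references> text against id attributes. It does no schema validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated document: one defined party (pers.001),
# one reference that resolves and one that does not.
DOC = """
<eml>
  <dataset>
    <creator id="pers.001">
      <individualName><surName>Smith</surName></individualName>
    </creator>
    <metadataProvider><references>pers.001</references></metadataProvider>
    <contact><references>pers.999</references></contact>
  </dataset>
</eml>
"""


def unresolved_references(root):
    """Return the <references> values that match no id in the same document."""
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    return [
        ref.text.strip()
        for ref in root.iter("references")
        if ref.text and ref.text.strip() not in ids
    ]


root = ET.fromstring(DOC)
print(unresolved_references(root))
```

Here the contact points at pers.999, which is not defined anywhere in the document, so it is the one value reported.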

(3) Multiple study sites.  Many of our projects are comparative studies
(e.g., hurricane impacts in New England and Puerto Rico).  Is it possible to
include spatial coverage information for two distinct sites at the project
(dataset) level?  Or is it necessary to move the spatial coverage
information to the data entity level and repeat it (as appropriate) for each
data entity? 

You can do <geographicCoverage> at the dataset level. This approach would be
best if your data are integrated, that is, a single data file includes data
from both New England and Puerto Rico. Let's say a data file looked like:
SITE         DATE            PRECIPITATION
NE            2002-12-12     .75
PR            2002-12-12     1.1

The geographic coverage might be something like:
<eml>
    <dataset>
        <geographicCoverage>
            <geographicDescription>New England</geographicDescription>
            <boundingCoordinates></boundingCoordinates>
        </geographicCoverage>
        <geographicCoverage>
            <geographicDescription>Puerto Rico</geographicDescription>
            <boundingCoordinates></boundingCoordinates>
        </geographicCoverage>
    </dataset>
</eml>

On the other hand, if the New England data were recorded in a separate
table from the Puerto Rico data, for example:

TABLE 1 NEweather
SITE         DATE            PRECIPITATION
NE1            2002-12-12     .75
NE2            2002-12-12     1.1

TABLE 2 PuertoRicoWeather
SITE         DATE            PRECIPITATION
PR1            2002-12-12     .75
PR2            2002-12-12     1.1

it might be better to do the following:

<eml>
    <dataset>

        <dataTable>
            <entityName>NEweather</entityName>
            <geographicCoverage>
                <geographicDescription>New England</geographicDescription>
                <boundingCoordinates></boundingCoordinates>
            </geographicCoverage>
        </dataTable>

        <dataTable>
            <entityName>PuertoRicoWeather</entityName>
            <geographicCoverage>
                <geographicDescription>Puerto Rico</geographicDescription>
                <boundingCoordinates></boundingCoordinates>
            </geographicCoverage>
        </dataTable>

    </dataset>
</eml>

A third alternative, and arguably the ideal solution (assuming that the NE &
PR data are in separate tables), would be to combine the two approaches as
follows:

<eml>
    <dataset>
        <geographicCoverage>
            <geographicDescription>New England</geographicDescription>
            <boundingCoordinates></boundingCoordinates>
        </geographicCoverage>
        <geographicCoverage>
            <geographicDescription>Puerto Rico</geographicDescription>
            <boundingCoordinates></boundingCoordinates>
        </geographicCoverage>

        <dataTable>
            <entityName>NEweather</entityName>
            <geographicCoverage>
                <geographicDescription>New England</geographicDescription>
                <boundingCoordinates></boundingCoordinates>
            </geographicCoverage>
        </dataTable>

        <dataTable>
            <entityName>PuertoRicoWeather</entityName>
            <geographicCoverage>
                <geographicDescription>Puerto Rico</geographicDescription>
                <boundingCoordinates></boundingCoordinates>
            </geographicCoverage>
        </dataTable>
    </dataset>
</eml>
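
The <boundingCoordinates> elements above are left empty. As a rough sketch of how they might be filled in when the coverage blocks are generated by a script, the Python below builds two <geographicCoverage> elements with the four bounding-coordinate children. The helper name geographic_coverage is hypothetical, and the coordinate values are approximate placeholders chosen for illustration only, not surveyed site boundaries.

```python
import xml.etree.ElementTree as ET


def geographic_coverage(description, west, east, north, south):
    """Build one <geographicCoverage> element with a bounding box.

    Coordinates are decimal degrees; west longitudes and south latitudes
    are negative.
    """
    cov = ET.Element("geographicCoverage")
    ET.SubElement(cov, "geographicDescription").text = description
    box = ET.SubElement(cov, "boundingCoordinates")
    for tag, value in (
        ("westBoundingCoordinate", west),
        ("eastBoundingCoordinate", east),
        ("northBoundingCoordinate", north),
        ("southBoundingCoordinate", south),
    ):
        ET.SubElement(box, tag).text = str(value)
    return cov


# Placeholder boxes (assumed values, for illustration only).
coverage = ET.Element("coverage")
coverage.append(geographic_coverage("New England", -73.7, -66.9, 47.5, 41.0))
coverage.append(geographic_coverage("Puerto Rico", -67.3, -65.2, 18.5, 17.9))

print(ET.tostring(coverage, encoding="unicode"))
```

The same helper could be called once per <dataTable> to produce the entity-level coverage in the combined approach.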

(4) Examples.  Most or all of my questions could be answered by looking at a
few well-chosen examples (many in fact have been answered by looking at the
NTL web page).  Are there other examples available for study?  Perhaps a
preliminary draft of the core EML specification? 

I'm working on the core EML specification.

Also, we are working on an "EML for Mere Mortals" that will provide examples
and guidance.

David Blankman

Many thanks, 

Emery

-- 

David E. Blankman
Database Integration Developer
Long Term Ecological Research Network Office
University of New Mexico
801 University, SE #104
Albuquerque, NM 87106
(505) 272-7346 / (505) 272-7080 FAX
