eml2rc2 feedback and questions

Matt Jones jones at nceas.ucsb.edu
Thu Oct 17 15:48:27 PDT 2002


Hi Barbara,

Thanks for taking the time to evaluate EML. I'm cc'ing eml-dev on this 
reply so the other EML developers can benefit from your feedback too. 
The sample document you provided was quite comprehensive, probably one 
of the best EML 2 instance documents around (even though it isn't quite 
valid).  A few notes, then I'll address your questions inline...

1) You don't have any "physical" elements in your dataTable 
descriptions, and so there is no information about where to download the 
data.  Is there a reason you specifically don't describe the 
distribution information for the data, given that you have an automated 
system for accessing the data?  And if you were to provide distribution 
information, would the EML physical descriptors for describing the data 
format suffice?

2)  You use several custom units.  To be valid EML, you should be 
providing an STMML formal definition for each custom unit in the 
"additionalMetadata" element, so that people can understand how your 
custom unit relates to more standard SI units.  The EML Parser 
(http://knb.ecoinformatics.org/software/emlparser/) can check the 
validity of these things.  You can see examples of STMML by looking in 
the current unit dictionary shipped with the EML release 
(eml-unitDictionary.xml).

3) You use several custom units that seem common to me, that I'll try to 
get added to the standard unit list.  Date and Time units come first to 
mind, although I need to figure out how these formally relate to the 
STMML data types for date and time. You also use a custom unit of 
"percent", but that isn't really a unit.  You need to say what is a 
percentage of what.  For relative humidity, it's probably a ratio of 
partial pressure of water to saturation pressure of water expressed as a 
percentage.  Most dimensionless units such as percent and ppm/ppb should 
be expressed in terms of their true numerator and denominator. E.g., 
percent vegetation cover and percent humidity are very different things.

4) For your numeric attributes, you have systematically omitted the 
attributeDomain description.  This is a serious flaw.  It was a required 
element in earlier versions of EML, and somehow became optional for the 
RC2 release, but I am going to try to make it required again for the 
final release (see 
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=637 ).  Is there a 
specific reason you don't provide this crucial information?

5) For numeric attributes, you also systematically omit the precision 
field.  This is also crucial for interpreting the data for synthetic 
analyses.  Is there a specific reason you don't include precision 
estimates?  It is optional in EML only because it isn't sensible for 
nominal values.
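
To illustrate note 3 on "percent": a minimal Python sketch of relative 
humidity written as an explicit ratio of two pressures and then scaled 
to a percentage.  The function name and the sample pressures are 
illustrative assumptions, not values from the NTL-LTER data.

```python
def relative_humidity_percent(partial_pressure_kpa, saturation_pressure_kpa):
    """Return relative humidity as a percentage.

    Numerator: partial pressure of water vapour (kPa).
    Denominator: saturation pressure of water vapour (kPa).
    Both names are hypothetical; only the ratio itself comes from the text.
    """
    if saturation_pressure_kpa <= 0:
        raise ValueError("saturation pressure must be positive")
    return 100.0 * partial_pressure_kpa / saturation_pressure_kpa


# Made-up sample values, chosen only to show the calculation.
rh = relative_humidity_percent(1.0, 2.0)
print(rh)
```

A "percent vegetation cover" value would need a different numerator and 
denominator (covered area over plot area), which is why a bare "percent" 
unit does not say enough on its own.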

Barbara Benson wrote:
> Hi Matt,
> 
> The information management group at NTL-LTER is in the process of 
> revising our dynamic database access program to produce EML2 compliant 
> metadata.  A number of issues have arisen that we would appreciate 
> getting comments on from you.
> 
> 1)      We are describing the geographic coverage of meteorological data 
> that are collected at a particular airport. It doesn't seem that EML2 
> gives an option of describing a geographic point by latitude and 
> longitude. To deal with this situation, we have made the east and west 
> bounding coordinates the same and the north and south bounding 
> coordinates the same.

Perfect.  The EML documentation for boundingCoordinates says to do 
exactly what you did in this situation.  Here's the text from the docs: 
"If your bounding area is a single point, use the same values for 
northBoundingCoordinate and southBoundingCoordinate, as well as the same 
value for westBoundingCoordinate and eastBoundingCoordinate. This will 
define the same lat/lon pairs since all four are required."
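
As a rough sketch of the single-point case, the following Python 
(standard library only) builds a boundingCoordinates element with 
matching west/east and north/south values.  The helper name and the 
sample latitude and longitude are placeholders, and the fragment is not 
validated against the EML schema here.

```python
import xml.etree.ElementTree as ET


def point_bounding_coordinates(lat, lon):
    """Build a boundingCoordinates element describing a single point.

    West and east share the longitude; north and south share the latitude.
    """
    bounding = ET.Element("boundingCoordinates")
    for tag, value in (
        ("westBoundingCoordinate", lon),
        ("eastBoundingCoordinate", lon),
        ("northBoundingCoordinate", lat),
        ("southBoundingCoordinate", lat),
    ):
        ET.SubElement(bounding, tag).text = str(value)
    return bounding


# Placeholder coordinates, not the actual airport location.
element = point_bounding_coordinates(43.14, -89.34)
print(ET.tostring(element, encoding="unicode"))
```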

> 2)      We currently have a one-to-one relationship between data tables 
> and data sets. I'm wondering if some time in the future we are going to 
> regret not having the option of allowing more than one data table per 
> data set. Our Oracle table that describes a data set is linked via a 
> pivot table to attributes. Do you have any advice on this issue?

Well, this is a tough issue.  As a dataset in EML is only loosely 
defined to include a series of tables that are related for some 
scientific purpose, there are no firm guidelines as to what should be 
"in" a dataset.  However, if you only provide one table per dataset, you 
cannot describe any of the relational constraints among tables that 
might be very useful for analysis.  Unfortunately, determining which 
tables should be described together as a dataset is really a question of 
what the analyses will be, which obviously can't be determined ahead 
of time.  I guess if I were in your shoes, I would be trying to describe 
functional clusters of data tables.  For example, if a fictional 
biodiversity data table can't be properly used without knowledge of the 
sampling sites in the "sites" table, then I would try to cluster those 
together in a dataset.  Unfortunately, there are no hard rules that I 
can come up with for this.
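
One possible way to find such functional clusters, sketched in Python 
under the assumption that the dependencies between tables are already 
known as (table, referenced table) pairs.  This is only one heuristic, 
not an EML rule; the function name and the "weather" table are 
hypothetical.

```python
from collections import defaultdict


def cluster_tables(tables, references):
    """Group tables into connected components of the reference graph.

    `references` is a list of (table, referenced_table) pairs, for
    example a foreign key from a data table to a "sites" table.
    """
    neighbors = defaultdict(set)
    for child, parent in references:
        neighbors[child].add(parent)
        neighbors[parent].add(child)

    seen = set()
    clusters = []
    for table in tables:
        if table in seen:
            continue
        # Depth-first walk over everything reachable from this table.
        seen.add(table)
        stack = [table]
        cluster = []
        while stack:
            current = stack.pop()
            cluster.append(current)
            for other in sorted(neighbors[current]):
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(cluster))
    return clusters


tables = ["biodiversity", "sites", "weather"]
references = [("biodiversity", "sites")]
clusters = cluster_tables(tables, references)
print(clusters)  # [['biodiversity', 'sites'], ['weather']]
```

Each resulting cluster would be a candidate for one dataset; a table 
with no references ends up in a dataset of its own.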

> 3)      Here is a suggested addition to EML2. When describing an 
> enumerated domain, the same code/definition pairs are used repeatedly in 
> an eml document. Could an id and reference be allowed in the attribute 
> domains?

Excellent point.  I have marked it as a feature request (see 
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=637 ) and I will try 
to get it incorporated for the 2.0.0 release.

> 4)      One piece of information that is currently part of our local 
> metadata is the sampling frequency (hourly, biweekly, etc.). We didn't 
> see a place other than as part of the overall methods for this 
> information yet it seems important for both data discovery and integration.

See dataset/methods/sampling/studyExtent

The documentation says to detail sampling frequency there.  It is not 
intended to be machine parseable for frequency, although we would like 
to eventually have a fully parseable description of sampling.  Having 
thought about it a lot, we decided it was beyond our ability for this 
version of EML.  So just a text-based description is accommodated. Does 
that work for you?

> Here is the URL for an example of the EML2 documents we are producing 
> with our program.   It uses the reference for the enumerated domains 
> even though this feature is not currently part of EML2.  Any feedback 
> you have would be greatly appreciated.
> 
> http://lterquery.limnology.wisc.edu/eml2.jsp?dataset_id=NTLME01

Looks great!  I'd be interested in knowing when it passes the EML 
validation test in the parser I mentioned above.

Thanks,
Matt

> Hope all goes well with you,
> Barbara
> 
> Barbara Benson
> Center for Limnology, University of Wisconsin-Madison
> 680 N. Park St.
> Madison, WI 53706
> 
> (608)262-2573
> fax: (608)265-2340
> 



-- 
*******************************************************************
Matt Jones                                    jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)

Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************



