eml2rc2 feedback and questions
Matt Jones
jones at nceas.ucsb.edu
Thu Oct 17 15:48:27 PDT 2002
Hi Barbara,
Thanks for taking the time to evaluate EML. I'm cc'ing eml-dev on this
reply so the other EML developers can benefit from your feedback too.
The sample document you provided was quite comprehensive, probably one
of the best EML 2 instance documents around (even though it isn't quite
valid). A couple of notes, then I'll address your questions inline...
1) You don't have any "physical" elements in your dataTable
descriptions, and so there is no information about where to download the
data. Is there a reason you specifically don't describe the
distribution information for the data, given that you have an automated
system for accessing the data? And if you were to provide distribution
information, would the EML physical descriptors for describing the data
format suffice?
2) You use several custom units. To be valid EML, you should be
providing an STMML formal definition for each custom unit in the
"additionalMetadata" element, so that people can understand how your
custom unit relates to more standard SI units. The EML Parser
(http://knb.ecoinformatics.org/software/emlparser/) can check the
validity of these things. You can see examples of STMML by looking in
the current unit dictionary shipped with the EML release
(eml-unitDictionary.xml).
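To illustrate what relating a custom unit to SI means in practice, here is a minimal Python sketch. The unit "fathom" and the attribute names (parentSI, multiplierToSI) are assumptions modeled on the general style of unit dictionary entries; check the shipped eml-unitDictionary.xml for the authoritative form, including the STMML namespace, which is omitted here.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom unit definition; attribute names are an assumption
# modeled on unit dictionary entries, and the STMML namespace is omitted.
UNIT_XML = (
    '<unit id="fathom" name="fathom" unitType="length" '
    'parentSI="meter" multiplierToSI="1.8288"/>'
)

def to_parent_si(value, unit_element):
    """Convert a value in the custom unit to its parent SI unit.

    Returns a (converted_value, parent_unit_name) pair.
    """
    multiplier = float(unit_element.get("multiplierToSI"))
    return value * multiplier, unit_element.get("parentSI")

unit = ET.fromstring(UNIT_XML)
converted, parent = to_parent_si(10.0, unit)
print(converted, parent)
```

With a definition like this in place, a reader of the metadata can convert any value in the custom unit to a standard unit without guessing.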
3) You use several custom units that seem common to me, that I'll try to
get added to the standard unit list. Date and Time units come first to
mind, although I need to figure out how these formally relate to the
STMML data types for date and time. You also use a custom unit of
"percent", but that isn't really a unit. You need to say what is a
percentage of what. For relative humidity, it's probably a ratio of
partial pressure of water to saturation pressure of water expressed as a
percentage. Most dimensionless units such as percent and ppm/ppb should
be expressed in terms of their true numerator and denominator. E.g.,
percent vegetation cover and percent humidity are very different things.
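The distinction can be sketched in a few lines of Python. The function names and the choice of inputs are hypothetical, chosen only to show that the two "percent" values have different numerators and denominators.

```python
def relative_humidity_percent(vapor_pressure, saturation_pressure):
    """Partial pressure of water over saturation pressure, as a percentage.

    Both pressures must be given in the same unit.
    """
    if saturation_pressure <= 0:
        raise ValueError("saturation pressure must be positive")
    return 100.0 * vapor_pressure / saturation_pressure

def vegetation_cover_percent(covered_area, plot_area):
    """Vegetated area over total plot area, as a percentage.

    Both areas must be given in the same unit.
    """
    if plot_area <= 0:
        raise ValueError("plot area must be positive")
    return 100.0 * covered_area / plot_area

# Both results are "percent", but one is pressure/pressure and the
# other is area/area, so they are not interchangeable quantities.
print(relative_humidity_percent(1.0, 2.0))
print(vegetation_cover_percent(25.0, 100.0))
```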
4) For your numeric attributes, you have systematically omitted the
attributeDomain description. This is a serious flaw. It was a required
element in earlier versions of EML, and somehow became optional for the
RC2 release, but I am going to try to make it required again for the
final release (see
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=637 ). Is there a
specific reason you don't provide this crucial information?
5) For numeric attributes, you also systematically omit the precision
field. This is also crucial for interpreting the data for synthetic
analyses. Is there a specific reason you don't include precision
estimates? It is optional in EML only because it isn't sensible for
nominal values.
Barbara Benson wrote:
> Hi Matt,
>
> The information management group at NTL-LTER is in the process of
> revising our dynamic database access program to produce EML2 compliant
> metadata. A number of issues have arisen that we would appreciate
> getting comments on from you.
>
> 1) We are describing the geographic coverage of meteorological data
> that are collected at a particular airport. It doesn't seem that EML2
> gives an option of describing a geographic point by latitude and
> longitude. To deal with this situation, we have made the east and west
> bounding coordinates the same and the north and south bounding
> coordinates the same.
Perfect. The EML documentation for boundingCoordinates says to do
exactly what you did in this situation. Here's the text from the docs:
"If your bounding area is a single point, use the same values for
northBoundingCoordinate and southBoundingCoordinate, as well as the same
value for westBoundingCoordinate and eastBoundingCoordinate. This will
define the same lat/lon pairs since all four are required."
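As a minimal sketch of that rule, the following Python builds a boundingCoordinates element for a single point. The four coordinate element names come from the documentation text above; the helper name, the child ordering, and the sample coordinates are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def point_bounding_coordinates(lat, lon):
    """Build a boundingCoordinates element that collapses to one point."""
    box = ET.Element("boundingCoordinates")
    # West and east share the longitude; north and south share the latitude.
    for tag, value in (
        ("westBoundingCoordinate", lon),
        ("eastBoundingCoordinate", lon),
        ("northBoundingCoordinate", lat),
        ("southBoundingCoordinate", lat),
    ):
        ET.SubElement(box, tag).text = str(value)
    return box

# Hypothetical station location, for illustration only.
element = point_bounding_coordinates(43.14, -89.34)
print(ET.tostring(element, encoding="unicode"))
```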
> 2) We currently have a one-to-one relationship between data tables
> and data sets. I'm wondering if some time in the future we are going to
> regret not having the option of allowing more than one data table per
> data set. Our Oracle table that describes a data set is linked via a
> pivot table to attributes. Do you have any advice on this issue?
Well, this is a tough issue. As a dataset in EML is only loosely
defined to include a series of tables that are related for some
scientific purpose, there are no firm guidelines as to what should be
"in" a dataset. However, if you only provide one table per dataset, you
cannot describe any of the relational constraints among tables that
might be very useful for analysis. Unfortunately, determining which
tables should be described together as a dataset is really a question of
what the analyses will be, which obviously can't be determined ahead
of time. I guess if I were in your shoes, I would be trying to describe
functional clusters of data tables. For example, if a fictional
biodiversity data table can't be properly used without knowledge of the
sampling sites in the "sites" table, then I would try to cluster those
together in a dataset. Unfortunately, there are no hard rules that I
can come up with for this.
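One possible way to approximate "functional clusters" is to group tables that reference one another, directly or indirectly. The Python sketch below does this with a simple graph traversal. It assumes that any reference between two tables is reason enough to describe them together, which is only one heuristic and not a rule from EML; the table names are hypothetical.

```python
from collections import defaultdict

def cluster_tables(tables, references):
    """Group tables into clusters linked by references.

    tables: iterable of table names.
    references: iterable of (table, referenced_table) pairs.
    Returns a list of sorted name lists, one per cluster.
    """
    neighbors = defaultdict(set)
    for source, target in references:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen = set()
    clusters = []
    for table in tables:
        if table in seen:
            continue
        # Depth-first walk over everything reachable from this table.
        cluster = set()
        stack = [table]
        while stack:
            current = stack.pop()
            if current in cluster:
                continue
            cluster.add(current)
            stack.extend(neighbors[current] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical tables: biodiversity depends on sites; weather stands alone.
print(cluster_tables(
    ["biodiversity", "sites", "weather"],
    [("biodiversity", "sites")],
))
```

Each resulting cluster is a candidate for a single dataset description; whether it is the right grouping still depends on the intended analyses.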
> 3) Here is a suggested addition to EML2. When describing an
> enumerated domain, the same code/definition pairs are used repeatedly in
> an eml document. Could an id and reference be allowed in the attribute
> domains?
Excellent point. I have marked it as a feature request (see
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=637 ) and I will try
to get it incorporated for the 2.0.0 release.
> 4) One piece of information that is currently part of our local
> metadata is the sampling frequency (hourly, biweekly, etc.). We didn't
> see a place other than as part of the overall methods for this
> information yet it seems important for both data discovery and integration.
See dataset/methods/sampling/studyExtent
The documentation says to detail sampling frequency there. It is not
intended to be machine parseable for frequency, although we would like
to eventually have a fully parseable description of sampling. Having
thought about it a lot, we decided it was beyond our ability for this
version of EML. So just a text-based description is accommodated. Does
that work for you?
> Here is the URL for an example of the EML2 documents we are producing
> with our program. It uses the reference for the enumerated domains
> even though this feature is not currently part of EML2. Any feedback
> you have would be greatly appreciated.
>
> http://lterquery.limnology.wisc.edu/eml2.jsp?dataset_id=NTLME01
Looks great! I'd be interested in knowing when it passes the EML
validation test in the parser I mentioned above.
Thanks,
Matt
> Hope all goes well with you,
> Barbara
>
> Barbara Benson
> Center for Limnology, University of Wisconsin-Madison
> 680 N. Park St.
> Madison, WI 53706
>
> (608)262-2573
> fax: (608)265-2340
>
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************