status of EML 2.0

Thu Aug 15 12:52:43 PDT 2002

Hi everyone. 

It's been a busy summer what with travel, vacation, and major scores on the
funding front. While we all probably needed a break, we do need to resolve
where we are with EML 2.0. I've noticed a trickle of traffic in bugzilla on
minor points, but it's so small that I suspect I'm not the only one who is
muddling over the path we should be taking with respect to some of the
feedback we've been getting. Here's my take on the issues, drawn both from our
experience working with beta 9 this summer and from the workshop. I don't
find bugzilla well suited to this level of comment, so I will make them here
first. I'm willing to put in the effort to turn some of these comments into
bugs once we have some general sense of how to respond to them, or at least
agreement that they are bugs. I've cc'd this to the LTER IM list so that
they can confirm whether or not my interpretations of the workshop response
are fair.

1) There were a number of issues that Chris and I both felt were simple
errors in beta 9 when we did our walk-through. The following are the most
glaring that I recall:
	a) there is no recursive link within project to a related project
description
	b) the dataSourceUsed element, which links protocol methodSteps to
existing eml-datasets from which this dataset was derived, is missing
	c) there is no recursive link within protocol to reference an
existing protocol (see separate comments on protocol below)
	d) the ASCII fixed section of physical doesn't work, nor does it
support records with multiple physical lines. We've already defined a
structure that does this.

2) There are some technical problems with the identifier and keyref
statements which prevent any instance file from validating. I don't
understand this aspect of XML very well, so I can't really suggest how to fix
it or where the problem lies, but I assume it is just a technical matter and
not a fundamental problem with what we are trying to do with references.
3) Literature needs fixing - it doesn't work intuitively with the way most of
us cite bibliographic information, even after we get some robust name
parsing tools written. I've already enumerated the problems in bugzilla, so
I won't belabor it here. I've had a student writing XSLs for various
journal formats as well as EndNote conversion, but they are held up waiting
for a final version. The fact that the network office is investing so much
effort into the EndNote export format as a means for harvesting bibliographic
information is in my opinion not a good letter of recommendation for
eml-literature, so we should fix it or drop it in favor of something
simpler.

4) The decision to record online distribution only using URLs and only as
stateless pointers to a single opaque object will, I fear, force us to
seriously limit the role of EML in the future development of a web service
based network. The fact that URLs are at best awkward and at worst not
usable for expressing some types of connections is one thing, but it is the
lack of support for describing a stateful connection that bothers me most.
Many LTER sites, not just CAP, are attempting to build internet applications
that are metadata driven and provide an interface (either direct or web
service based) to data stored in many different systems including SDE, SQL,
ASCII files and various GIS and hyperspectral formats. While few of us
intend to give out the stateful connection information to end users
directly, many of us would like to see the development of server-side tools
follow some standards so that we might all better share software components.
Without a standard in EML for describing connection information in a usable
format, administrators are still forced to develop local
solutions and then figure out how to relate them to EML. I'd hate to see EML
perceived as useful for enabling outside institutions to build applications
around site data but not very useful for sites in building their own
applications.

5) the recent traffic on reusable content partly underscores, I think, our
failure to adequately separate storage and management of metadata from its
presentation during this design process. The former benefits from a high
degree of granularity and normalization, the latter benefits from just the
opposite (assuming size of the eml document is not an issue). The references
element is a device to introduce some normalization capability within EML to
better serve management of information at the expense of some convenience in
reading it. It's not likely to satisfy everyone since it doesn't allow
addressing between documents, and this, as well as granularity, will be a
perennial problem when trying to use EML as both a metadata
management format and a metadata presentation format. For those of us
who are dynamically building an EML document from a normalized source such
as a relational database or collections of independent XML fragments, this
is far less of an issue: we can choose our own level of granularity within our
storage systems and frankly find it easier to write the same information out
twice rather than going through the hassle of creating identifiers and
remembering what they are during the entire output process. I'd hate to see
the issue of references and granularity hold up the design process given
that (in my opinion) they aren't really necessary at all in order to define
EML content (with the one exception of key definitions in eml-constraint,
which I don't like).

6) Finally, and most significantly, the response from the workshop indicated
that how we have organized project and protocol is at odds with the views of
most participants. The problem seems to stem from the fact that most sites view
projects as something that exists at a different level from a dataset. While
most agreed philosophically that there has to be a discrete intellectual
activity to produce a dataset, few make any formal recognition of this
activity. Instead, most see certain components (data collection methods,
sampling, QAQC, etc.) as direct properties of the dataset. What is recognized
as a project seems to be defined more by administrative or research criteria
that often are on a higher plane than an individual dataset. A few
acknowledged that they could live with using the immediate project element
to record these more dataset-specific items and include a link to a
higher-order project description, but this was hard to visualize at the time
because that link was missing in beta 9.

There were also similar problems with protocol. As Tim Bergsma put it (in
better words than I did), we are trying to use one module to carry both
prescriptive and descriptive information. In deciding to make protocol a
resource-level element, we have really made the choice to use it as a
prescriptive information tool - that is, a way of describing standardized
protocols independent of any particular data collection instance. I even
recall at least one person saying that their personal interpretation of
"protocol" was  As such, the information is only peripherally useful for
describing the actual methods used to produce a specific dataset.

There was also some dissatisfaction with the organization of protocol. Many
objected to the idea of binding QAQC descriptions to specific methodStep
descriptions. Again, no one disputed the philosophical point that quality
control measures by definition impose control over actions; nevertheless, it
does not agree with how most organize this information. Instead, QAQC
descriptions are typically stored independent of the descriptions of
the methodology and cannot be easily linked in this way. Finally, as we've
seen in recent email traffic, there are frustrations with the perennial gray
area of blending pure content markup (XML) with formatting markup (for
predominantly textual content).

If I were to suggest changes to beta 9 to best address these responses, they
might go something like this. I would change eml-project to be predominantly
a research project description including staffing, funding, publications,
and links to higher-level projects. I would also leave eml-protocol as a
resource module, but make it predominantly text based and prescriptive, used
only when a procedure has been formally worked out and used by many
datasets. I would make a new module called methods, which I would use in
every place that we now use protocol. Methods would contain a repeatable
methodStep element, which in turn would include references to source
datasets (type eml-dataset), software (type eml-software), instrumentation,
and any QAQC procedures that can be logically related to those steps.
Methods would also include optional links to eml-literature and eml-protocol
as references to formally published or cataloged procedures. I would create
a new module, researchContext, in which I would include the methodological
descriptors that directly qualify this dataset, like site description,
sampling, and the above methods module. Finally, for QAQC information that
isn't described under methodology but is directly related to specific
attributes in the data, I would suggest that the data-quality module (in
its current incarnations as attributeAccuracy, horizontalAccuracy and
verticalAccuracy) be used as the mechanism for describing both data
quality and the various control/assurance procedures used to arrive at that
quality.
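
To make the proposed nesting easier to picture, here is a rough sketch that
builds a researchContext holding a site description, sampling, and a methods
block with repeatable methodStep elements. Only researchContext, methods and
methodStep are names proposed above; the child tags (siteDescription,
description, dataSource, software, instrumentation, qualityControl, protocol)
and all the sample values are placeholders I made up for illustration, not
part of any agreed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical child tags for the kinds of things a methodStep would reference.
STEP_REFERENCE_TAGS = ("dataSource", "software", "instrumentation", "qualityControl")


def build_research_context(site, sampling, steps, protocol_ref=None):
    """Build a researchContext element tree from plain Python values."""
    ctx = ET.Element("researchContext")
    ET.SubElement(ctx, "siteDescription").text = site
    ET.SubElement(ctx, "sampling").text = sampling
    methods = ET.SubElement(ctx, "methods")
    for step in steps:
        node = ET.SubElement(methods, "methodStep")
        ET.SubElement(node, "description").text = step["description"]
        # Each kind of reference is optional and repeatable within a step.
        for tag in STEP_REFERENCE_TAGS:
            for value in step.get(tag, []):
                ET.SubElement(node, tag).text = value
    if protocol_ref is not None:
        # Optional link to a formally cataloged eml-protocol.
        ET.SubElement(methods, "protocol").text = protocol_ref
    return ctx


# Placeholder values only; none of these describe a real dataset.
ctx = build_research_context(
    site="Example study plot",
    sampling="Monthly transects",
    steps=[
        {
            "description": "Collect field samples",
            "instrumentation": ["GPS unit"],
            "qualityControl": ["Duplicate readings"],
        },
        {"description": "Merge with earlier survey", "dataSource": ["dataset.42"]},
    ],
    protocol_ref="protocol.7",
)
print(ET.tostring(ctx, encoding="unicode"))
```

The point of the sketch is only that QAQC hangs off a methodStep where it can
be logically related to one, while the prescriptive protocol stays a separate,
optional link.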

These of course are pretty extreme changes for a beta 9. I think that they
probably describe something much more in line with the metadata that the
current LTER network is producing, but we have to weigh that against time-honored
software development procedures which are designed to prevent knee-jerk
changes like this so late in the game! With both momentum from the LTER
network workshops and a desire to get going on the new ITR(s), we need EML
2.0 out the door soon or we will begin to lose our focus, but we also need it
to work.

So what's the best course of action? An IRC meeting? A conference call? Wait
a few days to see who responds to this email?

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental Studies
Arizona State University
480-965-6791 
