status of EML 2.0

Thu Aug 15 15:02:45 PDT 2002

Hello Peter et al.,

I agree that we need to get EML moving again.  We had originally said
that we would have EML 2.0 out by the beginning of this Summer.  We
have, however, been able to collect some very good feedback from Scott,
Tim and others as to the usability of EML2b9 that can only improve our
final product.  It is only through people trying to use the beta
releases that we will actually find the modeling errors and other bugs
that have popped up in the last couple months.  I would encourage
everyone who has an interest in EML's success to try to mark up at least
one of their datasets in EML and not just wait for the final release.  Your
time is not wasted in this process!  That having been said, please see
my comments inline below.

On Thu, 2002-08-15 at 12:52, Peter McCartney wrote:
> Hi everyone. 
> 
> It's been a busy summer what with travel, vacation, and major scores on the
> funding front. While we all probably needed a break, we do need to resolve
> where we are with EML 2.0. I've noticed a trickle of traffic in bugzilla on
> minor points, but it's so small that I suspect I'm not the only one that is
> muddling over the path we should be taking with respect to some of the
> feedback we've been getting. Here's my take on the issues drawn both from our
> experience working with beta 9 this summer and from the workshop. I don't
> find bugzilla well suited to this level of comment, so I will make them here
> first. I'm willing to put in the effort to turn some of these comments into
> bugs once we have some general sense of how to respond to them, or at least
> agreement that they are bugs. I've cc'd this to the LTER IM list so that
> they can confirm whether or not my interpretations of the workshop response
> are fair.
> 
> 1) There were a number of issues that Chris and I both felt were simple
> errors in beta 9 when we did our walk-through. The following are the most
> glaring that I recall:
> 	a) there is no recursive link within project to a related project
> 	description
> 	b) the dataSourceUsed element which links protocol methodSteps to
> 	existing eml-datasets from which this dataset was derived is missing
> 	c) there is no recursive link within protocol to reference an
> 	existing protocol (see separate comments on protocol below)
> 	d) the ascii fixed section of physical doesn't work, nor does it
> 	support records with multiple physical lines. We've already defined a
> 	structure that does this.
> 
I think these are just bugs that are easily fixed. I could not find any
of these bugs in bugzilla.  Chris or Peter, do you want to enter them
since you found them and are familiar with them? 

> 2) there are some technical problems with the identifier and keyref
> statements which prevent any instance file from validating. I don't
> understand this aspect of XML very well so I can't really suggest how to fix
> it or where the problem lies but I assume it is just a technical matter and
> not a fundamental problem with what we are trying to do with references
>  
I have not tried to validate an instance document so I have not seen
this problem.  Will the person who encountered this error please write a
detailed description and either put it in bugzilla or email it to
eml-dev.  

> 3) Literature needs fixing - it doesn't work intuitively with the way most of
> us cite bibliographic information, even after we get some robust name
> parsing tools written. I've already enumerated the problems in bugzilla, so
> I won't belabor it here. I've had a student writing XSLs for various
> journal formats as well as endnote conversion, but they are held up waiting
> for a final version. The fact that the network office is investing so much
> effort into endnote export format as a means for harvesting bibliographic
> information is in my opinion not a good letter of recommendation for
> eml-literature, so we should fix it or drop it in favor of something
> simpler. 
I would like to see a model of the proposed structure of a new
literature module.  I think that some of the fields there are not needed
and there are others that might be needed.  I would propose that someone
who is actively working on marking up citations using literature propose
a model for us to look at. The EndNote export format may be a good
starting point.  The important part is that we have input from people
who are actually trying to use eml-literature.

> 4) The decision to record online distribution only using URLs and only as
> stateless pointers to a single opaque object will, I fear, force us to
> seriously limit the role of EML in the future development of a web service
> based network. The fact that URLs are at best awkward and at worst not
> usable for expressing some types of connections is one thing, but it is the
> lack of support for describing a stateful connection that bothers me most.
> Many LTER sites, not just CAP, are attempting to build internet applications
> that are metadata driven and provide an interface (either direct or web
> service based) to data stored in many different systems including SDE, SQL,
> ascii files and various GIS and hyperspectral formats. While few of us
> intend to give out the stateful connection information to end users
> directly, many of us would like to see the development of server-side tools
> follow some standards so that we might all better share software components.
> Without a standard in EML for describing connection information in a usable
> format, the result is that administrators are still forced to develop local
> solutions and then figure out how to relate them to EML. I'd hate to see EML
> perceived as useful for enabling outside institutions to build applications
> around site data but not very useful for sites in building their own
> applications.
I see many of your points; however, in my mind there are 4 requirements
for the connection model.  1) It must be machine parsable and/or
directly machine usable.  2) It must not require that we add to or
change the standard every time a new connection protocol is introduced
or an existing one is revised.  3) It must not be based on proprietary
connection protocols that limit the scope of other types of
connections.  4) It must be comprehensive, allowing the description of
any type of connection that one may want to list.

The current method (using URLs to define connection points) follows
1, 2, 3 and mostly 4.  I'm sure you can find some connection somewhere
that doesn't have a standard URL, but they are few and far between.  The
only other method that I can see working is to develop a name/value pair
connection parameter model, where a connection is defined by the set of
name/value pairs of needed connection parameters.  The problem with this
is that to enable cross connections, we may have to have some sort of
dictionary or map that shows what types of connections need what kinds
of parameters.  Maybe we need a hybrid of the two.  What do you propose
should take the place of the URL?
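
To make the comparison concrete, here is a rough Python sketch of the
two approaches, plus the kind of dictionary the name/value model would
need.  The connection type names, the parameter names, and the
REQUIRED_PARAMS map are all made up for illustration; none of them come
from EML.

```python
from urllib.parse import urlsplit

# Hypothetical dictionary: which parameters each connection type needs.
# Neither the type names nor the parameter names come from EML.
REQUIRED_PARAMS = {
    "http": {"host", "path"},
    "sql": {"host", "port", "database", "table"},
}


def params_from_url(url):
    """Break a URL into name/value pairs (the URL-based approach)."""
    parts = urlsplit(url)
    params = {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}
    if parts.port is not None:
        params["port"] = str(parts.port)
    return params


def missing_params(conn_type, params):
    """List the required parameters that a name/value description lacks."""
    return sorted(REQUIRED_PARAMS[conn_type] - set(params))


# A URL already carries its parameters in a standard, parsable form.
url_conn = params_from_url("http://example.org:8080/data/site1.csv")

# A name/value description can only be checked against the dictionary.
sql_conn = {"host": "db.example.org", "port": "5432", "database": "lter"}
```

With these definitions, missing_params("sql", sql_conn) reports that
"table" is absent, which is only possible because the dictionary exists.
That dictionary is the part that would have to be maintained as new
connection types appear.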

> 
> 5) the recent traffic on reusable content partly underscores, I think, our
> failure to adequately separate storage and management of metadata from its
> presentation during this design process. 
I would say that, as EML is an XML metadata standard, presentation has
no place in EML.  We should be focusing on creating EML as a metadata
storage container.  Presentation can be done later with stylesheets if
the structure of the metadata storage mechanism is accurate enough to
hold all of the facets of the data which it is modeling.  I don't think
that EML has been built for presentation at all.  We have attempted to
organize certain sub-categories of information, but that is not
presentation, that is organization.  The problem with this is that
everyone tends to think of the data model for EML slightly differently,
so there have been some disagreements as to what types of information
need to be repeatable (normalized), and the organization itself is
sometimes an issue.

> The former benefits from a high
> degree of granularity and normalization, the latter benefits from just the
> opposite (assuming size of the eml document is not an issue). The references
> element is a device to introduce some normalization capability within EML to
> better serve management of information at the expense of some convenience in
> reading it. It's not likely to satisfy everyone since it doesn't allow
> addressing between documents and this, as well as granularity, will be a
> perennial problem when trying to use EML to serve as both a metadata
> management format as well as a metadata presentation format. For those of us
> that are dynamically building an EML document from a normalized source such
> as a relational database or collections of independent xml fragments, this
> is far less an issue: we can choose our own level of granularity within our
> storage systems and frankly find it easier to write the same information out
> twice rather than going through the hassle of creating identifiers and
> remembering what they are during the entire output process. 
The problem with this approach is that there is no way to know whether
two sub-trees that have the same content, are, in fact, the same
object.  For instance, if you have entity alpha(A,B,C,D) and entity
beta(A,B,C,D) are they the same entity?  If you use references, you know
that alpha(A,B,C,D) with id=1 is the same object as beta(refid=1).  This
is very important for machine processing of this metadata.  I would say
that one should view EML as a metadata propagation unit that no one will
ever look at.  It is, in essence, a machine language.  You would never
look at the binary format of an Excel file and try to follow the
pointers around, would you?  I don't think a human will try to do that
with EML.  The presentation should be completely separate from the
storage and we must follow the concrete rules that govern when to use a
relationship and when not to.  See
http://knb.ecoinformatics.org/software/eml/eml20docs/eml-docbook.html#reusableContent
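
As a small illustration of the alpha/beta point, the Python sketch
below uses a simplified, made-up document (the element names and the
refid attribute follow my notation above, not the actual EML schema).
A processor can tell that beta is the same object as alpha, but it has
no way to say the same about gamma, even though gamma has identical
content.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup; real EML element names differ.
DOC = """
<dataset>
  <entity name="alpha" id="1"><attr>A</attr><attr>B</attr></entity>
  <entity name="beta" refid="1"/>
  <entity name="gamma"><attr>A</attr><attr>B</attr></entity>
</dataset>
"""


def resolve(root, entity):
    """Return the element an entity stands for, following refid if present."""
    target_id = entity.get("refid")
    if target_id is None:
        return entity  # no reference: the entity only stands for itself
    for candidate in root.iter("entity"):
        if candidate.get("id") == target_id:
            return candidate
    raise KeyError(f"no entity with id={target_id!r}")


root = ET.fromstring(DOC)
alpha, beta, gamma = root.findall("entity")

same_by_reference = resolve(root, beta) is alpha   # identity is explicit
same_by_content = resolve(root, gamma) is alpha    # identical content, unknown identity
```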

When you talk about the hassle of creating identifiers, I'm not sure
what you are referencing.  It is really no trouble at all to add
identifiers programmatically.
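
For example, a sketch along these lines (Python, with made-up element
names and an arbitrary "id" prefix, so treat it as an assumption rather
than how any EML tool does it) assigns an id to every element that
lacks one, without touching ids that already exist:

```python
import itertools
import xml.etree.ElementTree as ET


def add_ids(root, tag, prefix="id"):
    """Give every <tag> element without an id attribute a unique one."""
    counter = itertools.count(1)
    # Collect ids already in use anywhere in the document.
    used = {el.get("id") for el in root.iter() if el.get("id") is not None}
    added = 0
    for el in root.iter(tag):
        if el.get("id") is None:
            new_id = f"{prefix}{next(counter)}"
            while new_id in used:  # skip over ids that are already taken
                new_id = f"{prefix}{next(counter)}"
            el.set("id", new_id)
            used.add(new_id)
            added += 1
    return added


# Hypothetical fragment: one entity already has an id, two do not.
root = ET.fromstring('<dataset><entity id="id1"/><entity/><entity/></dataset>')
count = add_ids(root, "entity")
ids = [el.get("id") for el in root.iter("entity")]
```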

> I'd hate to see
> the issue of references and granularity hold up the design process given
> that (in my opinion) they aren't really necessary at all in order to define
> EML content (with the one exception of key definitions in eml-constraint
> which I don't like).
I don't see references as holding up the design process.   They are part
of the design and hence need to be part of the design process.  Like I
said above, I think they are very necessary for advanced machine
processing with EML.

> 
> 6) Finally, and most significantly, the response from the workshop indicated
> that how we have organized project and protocol are at odds with most
> participants. the problem seems to stem from the fact that most sites view
> projects as something that exists at a different level from a dataset. while
> most agreed philosophically that there has to be a discrete intellectual
> activity to produce a dataset, few make any formal recognition of this
> activity. Instead, most see certain components (data collection methods,
> sampling, qaqc, etc) as direct properties of the dataset. what is recognized
> as a project seems to be defined more by administrative or research criteria
> that often are on a higher plane than an individual dataset. A few
> acknowledged that they could live with using the immediate project element
> to record these more dataset-specific items and include a link to a
> higher-order project description, but this was hard to visualize at the time
> because that link was missing in beta 9.
I don't necessarily think that project needs to be at the root level of
EML.  In fact, I think it used to be buried farther down.  If people think
of it as being farther down in the tree, I have no problem with that
change.  Where should it go though?  Any ideas?

> 
>  There were also similar problems with protocol. As Tim Bergsma put it (in
> better words that I did), we are trying to use one module to carry both
> prescriptive and descriptive information. In deciding to make protocol a
> resource-level element, we have really made the choice to use it as a
> prescriptive information tool - that is, a way of describing standardized
> protocols independent of any particular data collection instance. I even
> recall at least one person saying that their personal interpretation of
> "protocol" was  As such, the information is only peripherally useful for
> describing the actual methods used to produce a specific dataset. 
> 
> There was also some dissatisfaction with the organization of protocol. many
> objected to the idea of binding QAQC descriptions to specific methodStep
> descriptions. Again, there was no philosophical argument that quality
> control measures by definition impose control over actions, nevertheless it
> does not agree with how most organize this information. Instead, QAQC
> descriptions are typically stored independent of the descriptions of
> the methodology and cannot be easily linked in this way.  Finally, as we've
> seen in recent email traffic, there are frustrations with the perennial gray
> area of blending pure content markup (XML) with formatting markup (for
> predominantly textual content). 
I would agree with this.  I'm not sure how to handle it though. I'll
need to think more about it.

> 
> If I were to suggest changes to Beta9 to best address these responses, they
> might go something like this. I would change eml-project to be predominantly
> a research project description including staffing, funding, publications,
> and links to higher level projects. I would also leave eml-protocol as a
> resource module, but make it predominantly text based and prescriptive, used
> only when a procedure has been formally worked out and used by many
> datasets. 
Sounds logical.

> I would make a new module called methods, which I would use in
> every place that we now use protocol. Methods would contain a repeatable
> methodStep element, which in turn would include references to source
> datasets (type eml-dataset), software (type eml-software), instrumentation,
> and any QAQC procedures that can be logically related to those steps.
> Methods would also include optional links to eml-literature and eml-protocol
> as references to formally published or cataloged procedures.  I would create
> a new module researchContext in which I would include the methodological
> descriptors that directly qualify this dataset like site description,
> sampling, and the above methods module. Finally, for QAQC information that
> isn't described under methodology but is directly related to specific
> attributes in the data, I would suggest that the data-quality module (in
> its current incarnations as attributeAccuracy, horizonalAccuracy and
> verticalAccuracy) be used as the mechanism for describing both data
> quality and the various control/assurance procedures used to arrive at that
> quality. 
> 
> These of course are pretty extreme changes for a beta 9. I think that they
> probably describe something much more in line with the metadata that the current
> LTER network is producing, but we have to weigh that against time-honored
> software development procedures which are designed to prevent knee-jerk
> changes like this so late in the game! With both momentum from the LTER
> network workshops and a desire to get going on the new ITR(s), we need EML
> 2.0 out the door soon or we will begin to lose our focus but we also need it
> to work. 
It seems like this would work fine.  I need to think it over a bit more
though.

> 
> So what's the best course of action? an irc meeting? conference call? wait
> for a few days to see who responds to this email? 
I think we should give people a few days to digest this then have a
conference call with all interested parties some time next week.  I
propose Wednesday at 9:00 PDT for the call. Who would want to be in on
the call?  I think it would be good to have Scott and/or Tim and/or
other interested outside parties in on it to get an outside perspective,
lest we fall back into our original arguments that we've been having for
2 years.  Anyone interested in the call, please respond to eml-dev.  I can set
up a Verizon conference call if we don't have a situation where we can
chain enough 2 line phones together.  If you have a problem with the
date/time of the call, propose a different one.  I'm available all next
week.

chad

-- 
-----------------------
Chad Berkley
National Center for 
Ecological Analysis 
and Synthesis (NCEAS)
berkley at nceas.ucsb.edu
-----------------------