status of EML 2.0

Tue Aug 20 11:45:04 PDT 2002

Thanks for the comments, Scott. A couple of clarifications about my comments
for the discussion:

1) I've attached a file generated by one of our tools for reverse engineering
metadata from an RDBMS. The file is not very complete, but it should be valid
as near as I can tell from manually inspecting it. However, I am unable to
validate it with Excelon Stylus Studio, XML Spy, or Forte (each reports
different errors). While I could easily believe that one or another of these
has less than perfect support for schema, the fact that I can't validate with
any of the three (two of which use the Xerces parser) is significant. By
comparison, I had no problem validating instance files against the various
NCEAS and ASU drafts prior to beta9. If anyone has a separate tool for
validating, please either send it or try this file and let me know what's
wrong. Alternatively, send me an instance file that you have been able to
validate, and I'll try it here. To be honest, validation isn't all that
important to us - we'd prefer to have our applications attempt to use the
metadata and trap errors rather than give up just because it didn't
validate - but I'd like to know why I'm having such a problem with beta 9
when no one else is....

2) I should not have used the word "presentation" in my comments about the
granularity discussion - it's clear that both Chad and Scott interpreted me
as contrasting storage with display. What I meant was the form in which
metadata is delivered to an application as opposed to the form in which it
is managed and maintained. My point was that if you attempt to build
normalization into the design of EML, then you must decide on a level of
granularity that works for everyone. If you leave matters of how metadata
are managed and stored out of the EML spec, then data managers are free to
design whatever storage system works best for them with however much
granularity they want. We could then move the granularity debate to a
different thread from that of the EML content spec.
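
On point 1 above (having an application try to use the metadata and trap
errors, rather than reject a document that fails validation), a minimal
sketch might look like the following. The element paths in WANTED and the
function name read_leniently are assumptions made for illustration; they are
not taken from the beta 9 schema or from any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical fields an application might want; these paths are
# illustrative only and are not taken from the EML beta 9 schema.
WANTED = {
    "title": "dataset/title",
    "creator": "dataset/creator/surName",
    "table": "dataset/dataTable/entityName",
}


def read_leniently(xml_text):
    """Return (values, problems) instead of failing on the first error."""
    values, problems = {}, []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Not well-formed: nothing can be recovered from the document.
        return values, [f"not well-formed: {exc}"]
    for name, path in WANTED.items():
        node = root.find(path)
        if node is None or not (node.text or "").strip():
            problems.append(f"missing or empty: {path}")
        else:
            values[name] = node.text.strip()
    return values, problems


SAMPLE = """<eml>
  <dataset>
    <title>Example table</title>
    <dataTable><entityName>plots</entityName></dataTable>
  </dataset>
</eml>"""

values, problems = read_leniently(SAMPLE)
print(values)    # the fields that could be used
print(problems)  # what was missing, reported rather than fatal
```

The application keeps whatever it could read and reports the rest, which is
a different trade-off from schema validation: it tolerates partial documents
but cannot say why a validating parser rejects them.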

I concur with Owen's first two expectations of EML (a text specification and
a corresponding implementation in XML Schema for exchange). I'm far less in
agreement that EML should include specifications for storage and management
of metadata (RDBMS or otherwise) at this time. For many LTER sites, EML is,
and will remain for some time, a metadata "report format" that is
dynamically generated from a separate system for managing metadata. Those
systems have evolved to fit the local research environment, are often far
more sophisticated than anything we are likely to offer based on EML, and
are not likely to be discarded anytime soon in favor of an entirely
EML-based management system. Both EML and XML management systems are simply
far too immature at this time to attempt anything but a standard for
exchanging and searching metadata.  Given our progress, I think we need to
think in baby steps, with the first one being to get sites to use a common,
comprehensive, machine-parsable metadata standard. We should get that done,
and then worry about what standards can be promoted for how metadata are
managed, stored, and maintained.
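
To make the "report format" idea concrete, here is a rough sketch of an
exchange document being generated on demand from a separately managed
relational store. The table layout, the element names, and the function
export_dataset are all hypothetical; a real site database and the real EML
schema would differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical local schema; a real site database would differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE attribute (dataset_id INTEGER, name TEXT, unit TEXT);
    INSERT INTO dataset VALUES (1, 'Stream chemistry');
    INSERT INTO attribute VALUES (1, 'ph', 'dimensionless');
    INSERT INTO attribute VALUES (1, 'temp', 'celsius');
""")


def export_dataset(conn, dataset_id):
    """Build an exchange document for one dataset from the local tables."""
    (title,) = conn.execute(
        "SELECT title FROM dataset WHERE id = ?", (dataset_id,)
    ).fetchone()
    root = ET.Element("eml")
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    rows = conn.execute(
        "SELECT name, unit FROM attribute WHERE dataset_id = ? ORDER BY name",
        (dataset_id,),
    )
    for name, unit in rows:
        attr = ET.SubElement(dataset, "attribute")
        ET.SubElement(attr, "attributeName").text = name
        ET.SubElement(attr, "unit").text = unit
    return ET.tostring(root, encoding="unicode")


print(export_dataset(conn, 1))
```

The storage side stays as normalized as the site likes; only the exported
document has to follow the shared content specification.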

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental Studies
Arizona State University
480-965-6791 

-----Original Message-----
From: Scott Chapal [mailto:scott.chapal at jonesctr.org]
Sent: Monday, August 19, 2002 10:36 AM
To: Chad Berkley
Cc: Peter McCartney; Eml-Dev (E-mail); 'im at lternet.edu'; 'Scott Chapal'
Subject: Re: status of EML 2.0

Chad, Peter, et al.:

I have been somewhat disappointed at the lack of traffic on eml-dev
for the past month or so, but now understand that to be summer
meetings and vacations.  I have been waiting on the side-lines
anticipating the release of EML [2] for 18 months - 2 years.  It is
only recently that I have gotten involved more than vicariously, and
even then, just through testing beta9 instance documents.

Let me just begin by stating that I believe that EML is *EXTREMELY*
important.  From my perspective, EML is more important than Morpho, or
Metacat, or any site specific requirements.  EML needs to be completed
soon, but more importantly, to echo what has been said before, it
needs to be done right.

I have nothing but the utmost respect for NCEAS' (and in particular,
Matt's) leadership on this front and I think that the quality and
potential usefulness of the standard are a tribute to the dedication
to the long-term general utility of EML.  What appears to be somewhat
problematic is defining the purpose which EML should be designed to
fulfill.

What is the scope of intent that EML addresses?

  Minimalist design prescribes a solution which is just complex enough
  to achieve the result, but no more so.  That might be apropos but it
  begs the difficult question of what is being solved with EML, in
  precise terms.

  To an extent, the scope has been determined by the state of EML
  itself, its documentation, and by the historical legacy of the
  eml-dev archive - short as it is.  But a concise specification,
  including scope-of-intent, doesn't occur in any obvious public place
  that I can find.  A useful adjunct to that definition would be a
  subsection regarding what EML is *NOT* designed to solve.

  Defining the specification for EML, as Owen Eddins so aptly
  outlines below, is blatantly necessary.

"Owen Eddins" <oeddins at lternet.edu> surmised Mon, 1 Apr 2002 11:22:18 :

> And finally is EML an XML Schema/DTD or is it a specification?  EML
> is an XML Schema/DTD.  The specification, a guideline for metadata
> management, is implicit and needs to be made explicit. If we had a
> well defined specification for a metadata standard then we could
> have

> 1) a metadata specification in English, as a guideline for ecologists and
> data managers and for the purpose of soliciting community review.

> 2) implementation of the metadata specification in XML Schema for purposes
> of sharing and storage.
> 
> 3) implementation of metadata specification in relational database systems
> for purposes of storage.
> 
> With a metadata specification we think we could meet the needs of all
> categories of potential users of EML.

Hear, hear!

The lack of a detailed specification definition leads to a confused
discussion and analysis of EML's architecture, module structural
design and relationships, as well as its purpose.

My comments to individual points in this thread are interspersed
below, and reflect the idea that a more complete specification
definition for EML would lend clarity to the process.

Chad Berkley <berkley at nceas.ucsb.edu> writes:

> On Thu, 2002-08-15 at 12:52, Peter McCartney wrote:

> > 2) there are some technical problems with the identifier and keyref
> > statements which prevent any instance file from validating. I don't
> > understand this aspect of XML very well so I can't really suggest how to
> > fix it or where the problem lies, but I assume it is just a technical
> > matter and not a fundamental problem with what we are trying to do with
> > references.
> >
> I have not tried to validate an instance document so I have not seen
> this problem.  Will the person who encountered this error please write a
> detailed description and either put it in bugzilla or email it to
> eml-dev.  

This is not our experience at all.  We have several projects
consisting of multiple data sets [tables] which have EML instance
documents that validate fine with ID and keyref statements used
liberally.  (However we have not yet tested everything, notably
eml-literature, or the gis/spatial modules).

> > 3) Literature needs fixing - it doesn't work intuitively with the way
> > most of us cite bibliographic information, even after we get some robust
> > name parsing tools written. I've already enumerated the problems in
> > bugzilla, so I won't belabor it here. I've had a student writing XSLs for
> > various journal formats as well as EndNote conversion, but they are held
> > up waiting for a final version. The fact that the network office is
> > investing so much effort into EndNote export format as a means for
> > harvesting bibliographic information is in my opinion not a good letter
> > of recommendation for eml-literature, so we should fix it or drop it in
> > favor of something simpler.

> I would like to see a model of the proposed structure of a new
> literature module.  I think that some of the fields there are not needed
> and there are others that might be needed.  I would propose that someone
> who is actively working on marking up citations using literature propose
> a model for us to look at. The EndNote export format may be a good
> starting point.  The important part is that we have input from people
> who are actually trying to use eml-literature.

What is 'EndNote export format'?  EndNote 5 Documentation lists
Refer/BibIX, BibTeX and RIS as default 'styles' included with EndNote
for export.  This has little to do with the design intent of
eml-literature.  I would think that what eml-literature should be
judged on is its syntactic capacity to represent bibliographic
metadata correctly.  Period.  Everything else is simply a matter of
conversion/translation of bibliographic 'format'.

Along these lines, why re-invent the wheel?  This problem has been
solved before, and better than we could hope to, in the Library
Information Technology discipline.  Why not defer to them?  E.g. MARC, or
http://www.niso.org/, etc...

> > 4) The decision to record online distribution only using URLs and
> > only as stateless pointers to a single opaque object will, I fear,
> > force us to seriously limit the role of EML in the future
> > development of a web service based network. The fact that URLs are
> > at best awkward and at worst not usable for expressing some types
> > of connections is one thing, but it is the lack of support for
> > describing a stateful connection that bothers me most.  Many LTER
> > sites, not just CAP, are attempting to build internet applications
> > that are metadata driven and provide an interface (either direct
> > or web service based) to data stored in many different systems
> > including SDE, SQL, ascii files and various GIS and hyperspectral
> > formats. While few of us intend to give out the stateful
> > connection information to end users directly, many of us would
> > like to see the development of server-side tools follow some
> > standards so that we might all better share software components.
> > Without a standard in EML for describing connection information in
> > a usable format, the result is that administrators are still forced
> > to develop local solutions and then figure out how to relate them to
> > EML. I'd hate to see EML perceived as useful for enabling outside
> > institutions to build applications around site data but not very
> > useful for sites in building their own applications.

> I see many of your points, however, in my mind there are 4
> requirements for the connection model.  1) It must be machine
> parsable and/or directly machine usable.  2) It must not require
> that we add to or change the standard every time a new connection
> protocol is introduced or an existing one is revised. 3) It must not
> be based on proprietary connection protocols that limit the scope of
> other types of connections.  4) It must be comprehensive, allowing
> the description of any type of connection that one may want to list.

> The current method (using URLs to define connection points) follows
> 1,2,3 and mostly 4.  I'm sure you can find some connection somewhere
> that doesn't have a standard URL, but they are few and far between.
> The only other method that I can see working is to develop a
> name/value pair connection parameter model, where a connection is
> defined by the set of name/value pairs of needed connection
> parameters.  The problem with this is that to enable cross
> connections, we may have to have some sort of dictionary or map that
> shows what types of connections need what kinds of parameters.
> Maybe we need a hybrid of the two.  What do you propose should take
> the place of the URL?
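
The name/value pair model and the "dictionary or map" Chad mentions could
be sketched roughly as below. The connection types, the parameter names,
and both function names are invented for illustration and do not come from
EML or any existing tool. The hybrid he suggests is shown by treating a
plain URL as a connection with a single pair.

```python
# Hypothetical dictionary mapping a connection type to the parameter
# names it needs; none of these names come from EML.
REQUIRED = {
    "http": {"url"},
    "sql": {"host", "port", "database", "table"},
    "file": {"path"},
}


def check_connection(conn_type, params):
    """Return the sorted list of parameter names missing for conn_type."""
    if conn_type not in REQUIRED:
        raise KeyError(f"unknown connection type: {conn_type}")
    return sorted(REQUIRED[conn_type] - set(params))


def url_to_params(url):
    """Hybrid case: express a plain URL as a one-pair connection."""
    return "http", {"url": url}


sql = {"host": "db.example.org", "port": "5432", "database": "lter"}
print(check_connection("sql", sql))  # lists what is still missing
print(check_connection(*url_to_params("http://example.org/data.csv")))
```

In this arrangement a new protocol means a new entry in the dictionary
rather than a change to the metadata standard itself, which is the trade-off
behind requirement 2. The cost is that the dictionary then has to be
maintained and shared somewhere.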

This is where the lack of a detailed specification for EML really
begins to show, in my opinion.

Why should connection detail be necessary in EML, at all? 

What is the 'role of EML in the future development of a web service
based network'?

I'm an unabashed supporter of the concept of 'Web-Services', but I'm
not convinced that EML needs to be architected specially to support
them.  It seems to me that support for connection definitions would be
better defined elsewhere, perhaps in the WSDL or SOAP-RPC component(s)
of the web-service in question.  In other words, EML should be
'usable' by a web-service, but it shouldn't define one.  Connection
details, and other service info, should be defined somehow in the
web-service, itself.  A link to this 'definition' could be served by
the simple URL pointer, couldn't it?

Regarding the 'server-side tools' that are necessary, the connection
details need to be known between the data source and the application
server, and those might be site/application specific. Should EML
maintain these details?  Expecting EML to provide the basis and
momentum for standardized tools is unrealistic, IMHO.

But, this point of view reflects my incomplete understanding of EML's
scope of intent.

> > 5) the recent traffic on reusable content partly underscores, I
> > think, our failure to adequately separate storage and management
> > of metadata from its presentation during this design process.

> I would say, that as EML is an XML metadata standard, presentation
> has no place in EML.  We should be focusing on creating EML as a
> metadata storage container.  Presentation can be done later with
> stylesheets if the structure of the metadata storage mechanism is
> accurate enough to hold all of the facets of the data which it is
> modeling.  I don't think that eml has been built for presentation at
> all.  We have attempted to organize certain sub-categories of
> information, but that is not presentation, that is organization.
> The problem with this is that everyone tends to think of the data
> model for EML slightly differently, so there have been some
> disagreements as to what types of information needs to be repeatable
> (normalized) and some of the organization is sometimes an issue.

My comments regarding 'Granularity of Reusable Content' had nothing
to do with presentation, nor does the issue as far as I'm concerned.
I agree with Chad's assessment above, and would only add that the
basic issue at stake here is 'Metadata maintenance'.  The real problem
with metadata is keeping it up-to-date.  Creating it is definitely a
challenge, but the promise of automated processing will help there.
'Updating' metadata is the true long-term burden, however, and that
was the genesis of my thread: because the ability to identify
repeatable discrete 'objects' in EML means updating those objects is
simplified.

> > The former benefits from a high degree of granularity and
> > normalization, the latter benefits from just the opposite
> > (assuming size of the eml document is not an issue). The
> > references element is a device to introduce some normalization
> > capability within EML to better serve management of information at
> > the expense of some convenience in reading it. It's not likely to
> > satisfy everyone since it doesn't allow addressing between
> > documents and this, as well as granularity, will be a perennial
> > problem when trying to use EML to serve as both a metadata
> > management format as well as a metadata presentation format. For
> > those of us that are dynamically building an EML document from a
> > normalized source such as a relational database or collections of
> > independent xml fragments, this is far less an issue: we can
> > choose our own level of granularity within our storage systems and
> > frankly find it easier to write the same information out twice
> > rather than going through the hassle of creating identifiers and
> > remembering what they are during the entire output process.

I am building EML from normalized sources, and the problem of
granularity still persists.  Chad summarizes this well...

> The problem with this approach is that there is no way to know
> whether two sub-trees that have the same content, are, in fact, the
> same object.  For instance, if you have entity alpha(A,B,C,D) and
> entity beta(A,B,C,D) are they the same entity?  If you use
> references, you know that alpha(A,B,C,D) with id=1 is the same
> object as beta(refid=1).  This is very important for machine
> processing of this metadata.  I would say that one should view EML
> as a metadata propagation unit that no one will ever look at.  It
> is, in essence, a machine language.  You would never look at the
> binary format of an Excel file and try to follow the pointers around,
> would you?  I don't think a human will try to do that with EML.  The
> presentation should be completely separate from the storage and we
> must follow the concrete rules that govern when to use a
> relationship and when not to.  See
> http://knb.ecoinformatics.org/software/eml/eml20docs/eml-docbook.html#reusableContent
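
Chad's alpha/beta example can be illustrated with a small sketch. The
markup below is an assumption made for the example (an id attribute and a
references child element); it is not the actual EML 2.0 schema, and the
function name resolve is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative markup only: the element and attribute names are
# assumptions, not the EML 2.0 schema.
DOC = """<eml>
  <entity name="alpha" id="1"><a/><b/><c/><d/></entity>
  <entity name="beta"><references>1</references></entity>
  <entity name="gamma"><a/><b/><c/><d/></entity>
</eml>"""


def resolve(root):
    """Map each entity name to the element that holds its content."""
    by_id = {e.get("id"): e for e in root.iter("entity") if e.get("id")}
    resolved = {}
    for entity in root.iter("entity"):
        ref = entity.find("references")
        if ref is not None:
            # A KeyError here would mean a dangling reference.
            resolved[entity.get("name")] = by_id[ref.text.strip()]
        else:
            resolved[entity.get("name")] = entity
    return resolved


objs = resolve(ET.fromstring(DOC))
print(objs["alpha"] is objs["beta"])   # linked by reference
print(objs["alpha"] is objs["gamma"])  # equal content, identity unknown
```

Here beta resolves to the very same element as alpha, while gamma merely
has the same children, and a program has no way to tell from the content
alone whether that is a coincidence.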

> When you talk about the hassle of creating identifiers, I'm not sure
> what you are referencing.  It is really no trouble at all to add
> identifiers programmatically.
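
As a rough illustration of adding identifiers programmatically when
writing from a normalized source, the sketch below derives each identifier
from the source's own key, so nothing has to be remembered beyond a set of
keys already written. The element names, the "person." prefix, and the
function write_people are assumptions for the example, not EML definitions.

```python
import xml.etree.ElementTree as ET


def write_people(parent, roles):
    """roles: list of (role, person_key, surname) from a normalized source."""
    seen = set()
    for role, person_key, surname in roles:
        elem = ET.SubElement(parent, role)
        ident = f"person.{person_key}"
        if ident in seen:
            # Already written once: point at it instead of repeating it.
            ET.SubElement(elem, "references").text = ident
        else:
            elem.set("id", ident)
            ET.SubElement(elem, "surName").text = surname
            seen.add(ident)


root = ET.Element("dataset")
write_people(root, [("creator", 7, "Smith"), ("contact", 7, "Smith"),
                    ("contact", 9, "Jones")])
print(ET.tostring(root, encoding="unicode"))
```

Whether this is less hassle than writing the same information out twice
depends on the exporter, but the bookkeeping is limited to the one set.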

> > I'd hate to see the issue of references and granularity hold up the
> > design process given that (in my opinion) they aren't really
> > necessary at all in order to define EML content (with the one
> > exception of key definitions in eml-constraint, which I don't like).

> I don't see references as holding up the design process.   They are part
> of the design and hence need to be part of the design process.  Like I
> said above, I think they are very necessary for advanced machine
> processing with EML.

> > 6) Finally, and most significantly, the response from the workshop
> > indicated that how we have organized project and protocol is at
> > odds with most participants. The problem seems to stem from the
> > fact that most sites view projects as something that exists at a
> > different level from a dataset. While most agreed philosophically
> > that there has to be a discrete intellectual activity to produce a
> > dataset, few make any formal recognition of this
> > activity. Instead, most see certain components (data collection
> > methods, sampling, QAQC, etc.) as direct properties of the
> > dataset. What is recognized as a project seems to be defined more
> > by administrative or research criteria that often are on a higher
> > plane than an individual dataset.

I don't understand the objection. Could you please elaborate?

We have 'projects' which have one to 20 or more data sets and the
basic structure of the modules in this regard doesn't seem to be an
impediment.

> > A few acknowledged that they could live with using the immediate
> > project element to record these more dataset-specific items and
> > include a link to a higher-order project description, but this was
> > hard to visualize at the time because that link was missing in
> > beta 9.

> I don't necessarily think that project needs to be at the root level
> of EML.  In fact, I think it used to be buried farther down.  If people
> think of it as being farther down in the tree, I have no problem
> with that change.  Where should it go though?  Any ideas?

If project isn't at the root level, what would be?

Here, a diagram of 'Module Dependencies in EML' would be quite
useful.  I have taken the graphical module representations from the
documentation and assembled them onto a huge poster.  But it is still
really hard to visualize.

> > There were also similar problems with protocol. As Tim Bergsma put
> > it (in better words than I did), we are trying to use one module
> > to carry both prescriptive and descriptive information. In
> > deciding to make protocol a resource-level element, we have really
> > made the choice to use it as a prescriptive information tool -
> > that is, a way of describing standardized protocols independent of
> > any particular data collection instance. I even recall at least
> > one person saying that their personal interpretation of "protocol"
> > was As such, the information is only peripherally useful for
> > describing the actual methods used to produce a specific dataset.

> > There was also some dissatisfaction with the organization of
> > protocol. Many objected to the idea of binding QAQC descriptions
> > to specific methodStep descriptions. Again, no one disputed
> > philosophically that quality control measures by definition
> > impose control over actions; nevertheless, this binding does not
> > agree with how most organize this information. Instead, QAQC
> > descriptions are typically stored independent of the descriptions
> > of the methodology and cannot be easily linked in this way.

This problem may be larger than EML can hope to solve.  I would say
this is an inherent problem in the way much research is conducted and
audited.  The demands of structured metadata are just bringing the
problem out in the light.

> > Finally, as we've seen in recent email traffic, there are
> > frustrations with the perennial gray area of blending pure content
> > markup (XML) with formatting markup (for predominantly textual
> > content).

> I would agree with this.  I'm not sure how to handle it though. I'll
> need to think more about it.

DocBook?

> > If I were to suggest changes to Beta9 to best address these
> > responses, they might go something like this. I would change
> > eml-project to be predominantly a research project description
> > including staffing, funding, publications, and links to higher
> > level projects. I would also leave eml-protocol as a resource
> > module, but make it predominantly text based and prescriptive,
> > used only when a procedure has been formally worked out and used
> > by many datasets.

> Sounds logical.

> > I would make a new module called methods, which I would use in
> > every place that we now use protocol. Methods would contain a
> > repeatable methodStep element, which in turn would include
> > references to source datasets (type eml-dataset), software (type
> > eml-software), instrumentation, and any QAQC procedures that can
> > be logically related to those steps.  Methods would also include
> > optional links to eml-literature and eml-protocol as references to
> > formally published or cataloged procedures.  I would create a new
> > module researchContext in which I would include the methodological
> > descriptors that directly qualify this dataset like site
> > description, sampling, and the above methods module. Finally, for
> > QAQC information that isn't described under methodology but is
> > directly related to specific attributes in the data, I would
> > suggest that the data-quality module (in its current incarnations
> > as attributeAccuracy, horizonalAccuracy and verticalAccuracy)
> > be used as the mechanism for describing both data quality
> > and the various control/assurance procedures used to arrive at
> > that quality.

> > These of course are pretty extreme changes for a beta 9. I think
> > that they probably describe something much more in line with
> > metadata that the current LTER network is producing, but we have
> > to weigh that against time-honored software development procedures
> > which are designed to prevent knee-jerk changes like this so late
> > in the game! With both momentum from the LTER network workshops
> > and a desire to get going on the new ITR(s), we need EML 2.0 out
> > the door soon or we will begin to lose our focus but we also need
> > it to work.

> It seems like this would work fine.  I need to think it over a bit more
> though.

The distinction of method and protocol seems useful.  Although I think
I followed the arguments for the other changes more-or-less I don't
yet see how they would create a dramatic improvement.  They don't
appear to simplify things.

> > So what's the best course of action? an irc meeting? conference
> > call? wait for a few days to see who responds to this email?

> I think we should give people a few days to digest this then have a
> conference call with all interested parties some time next week.  I
> propose Wednesday at 9:00 PDT for the call. Who would want to be in
> on the call?  I think it would be good to have Scott and/or Tim
> and/or other interested outside parties in on it to get an outside
> perspective, lest we fall back into our original arguments that
> we've been having for 2 years.  Anyone interested in the call
> respond to eml-dev.  I can set up a Verizon conference call if we
> don't have a situation where we can chain enough 2 line phones
> together.  If you have a problem with the date/time of the call,
> propose a different one.  I'm available all next week.

I would be happy to participate, if you think that would benefit the
process.  Thanks for asking.

I really appreciate all the work everyone has contributed to the EML
project.

> chad

-- 
Scott E. Chapal_________________________________________________
Database & Network Manager             scott.chapal at jonesctr.org
J.W. Jones Ecological Research Center          229.734.4706 x227
Rt. 2. Box. 2324. Newton, GA 31770-9651        229.734.6650 :FAX