status of EML 2.0

Mon Aug 19 10:46:39 PDT 2002

Scott,

Thanks for your insightful comments.  I think your presence on the
conference call would be a benefit to everyone.  I'll add you to the
participant list.  I won't respond to your comments here; rather, I'll
add them to the agenda so everyone can discuss them on the call.

thanks,
chad

On Mon, 2002-08-19 at 10:36, Scott Chapal wrote:
> 
> Chad, Peter, et al.:
> 
> I have been somewhat disappointed at the lack of traffic on eml-dev
> for the past month or so, but now understand that to be summer
> meetings and vacations.  I have been waiting on the side-lines
> anticipating the release of EML [2] for 18 months - 2 years.  It is
> only recently that I have gotten involved more than vicariously, and
> even then, just through testing beta9 instance documents.
> 
> Let me just begin by stating that I believe that EML is *EXTREMELY*
> important.  From my perspective, EML is more important than Morpho, or
> Metacat, or any site specific requirements.  EML needs to be completed
> soon, but more importantly, to echo what has been said before, it
> needs to be done right.
> 
> I have nothing but the utmost respect for NCEAS' (and in particular,
> Matt's) leadership on this front and I think that the quality and
> potential usefulness of the standard are a tribute to the dedication
> to the long-term general utility of EML.  What appears to be somewhat
> problematic is defining the purpose that EML should be designed to
> fulfill.
> 
> What is the scope of intent that EML addresses?
> 
>   Minimalist design prescribes a solution which is just complex enough
>   to achieve the result, but no more so.  That might be apropos but it
>   begs the difficult question of what is being solved with EML, in
>   precise terms.
> 
>   To an extent, the scope has been determined by the state of EML
>   itself, its documentation, and by the historical legacy of the
>   eml-dev archive - short as it is.  But, a concise specification,
>   including scope-of-intent, doesn't occur in any obvious public place
>   that I can find.  A useful adjunct to that definition would be a
>   subsection regarding what EML is *NOT* designed to solve.
> 
>   Defining the specification for EML, as Owen Eddins so aptly
>   outlines below, is blatantly necessary.
> 
> "Owen Eddins" <oeddins at lternet.edu> surmised Mon, 1 Apr 2002 11:22:18 :
> 
> > And finally is EML an XML Schema/DTD or is it a specification?  EML
> > is an XML Schema/DTD.  The specification, a guideline for metadata
> > management, is implicit and needs to be made explicit. If we had a
> > well defined specification for a metadata standard then we could
> > have
> 
> > 1) a metadata specification in English, as a guideline for ecologists and
> > data managers and for the purpose of soliciting community review.
> 
> > 2) implementation of the metadata specification in XML Schema for purposes
> > of sharing and storage.
> > 
> > 3) implementation of metadata specification in relational database systems
> > for purposes of storage.
> > 
> > With a metadata specification we think we could meet the needs of all
> > categories of potential users of EML.
> 
> Hear, hear!
> 
> The lack of a detailed specification definition leads to a confused
> discussion and analysis of EML's architecture, module structural
> design and relationships, as well as its purpose.
> 
> My comments to individual points in this thread are interspersed
> below, and reflect the idea that a more complete specification
> definition for EML would lend clarity to the process.
> 
> Chad Berkley <berkley at nceas.ucsb.edu> writes:
> 
> > On Thu, 2002-08-15 at 12:52, Peter McCartney wrote:
> 
> > > 2) there are some technical problems with the identifier and keyref
> > > statements which prevent any instance file from validating. I don't
> > > understand this aspect of XML very well so I can't really suggest how to fix
> > > it or where the problem lies but I assume it is just a technical matter and
> > > not a fundamental problem with what we are trying to do with references
> > >  
> > I have not tried to validate an instance document so I have not seen
> > this problem.  Will the person who encountered this error please write a
> > detailed description and either put it in bugzilla or email it to
> > eml-dev.  
> 
> This is not our experience at all.  We have several projects
> consisting of multiple data sets [tables] which have EML instance
> documents that validate fine with ID and keyref statements used
> liberally.  (However we have not yet tested everything, notably
> eml-literature, or the gis/spatial modules).
> 
> > > 3) Literature needs fixing - it doesn't work intuitively with the way most of
> > > us cite bibliographic information, even after we get some robust name
> > > parsing tools written. I've already enumerated the problems in bugzilla, so
> > > I won't belabor it here. I've had a student writing XSLs for various
> > > journal formats as well as endnote conversion, but they are held up waiting
> > > for a final version. The fact that the network office is investing so much
> > > effort into endnote export format as a means for harvesting bibliographic
> > > information is in my opinion not a good letter of recommendation for
> > > eml-literature, so we should fix it or drop it in favor of something
> > > simpler. 
> 
> > I would like to see a model of the proposed structure of a new
> > literature module.  I think that some of the fields there are not needed
> > and there are others that might be needed.  I would propose that someone
> > who is actively working on marking up citations using literature propose
> > a model for us to look at. The EndNote export format may be a good
> > starting point.  The important part is that we have input from people
> > who are actually trying to use eml-literature.
> 
> What is 'EndNote export format'?  EndNote 5 Documentation lists
> Refer/BibIX, BibTex and RIS as default 'styles' included with EndNote
> for export.  This has little to do with the design intent of
> eml-literature.  I would think that what eml-literature should be
> judged on is its syntactic capacity to represent bibliographic
> metadata correctly.  Period.  Everything else is simply a matter of
> conversion/translation of bibliographic 'format'.
> 
> Along these lines, why re-invent the wheel?  This problem has been
> solved before and better than we could hope to reinvent in the Library
> Information Technology discipline.  Why not defer to them?  E.g., MARC, or
> http://www.niso.org/, etc...
> 
> > > 4) The decision to record online distribution only using URLs and
> > > only as stateless pointers to a single opaque object will, I fear,
> > > force us to seriously limit the role of EML in the future
> > > development of a web service based network. The fact that URLs are
> > > at best awkward and at worst not usable for expressing some types
> > > of connections is one thing, but it is the lack of support for
> > > describing a stateful connection that bothers me most.  Many LTER
> > > sites, not just CAP, are attempting to build internet applications
> > > that are metadata driven and provide an interface (either direct
> > > or web service based) to data stored in many different systems
> > > including SDE, SQL, ascii files and various GIS and hyperspectral
> > > formats. While few of us intend to give out the stateful
> > > connection information to end users directly, many of us would
> > > like to see the development of server-side tools follow some
> > > standards so that we might all better share software components.
> > > Without a standard in EML for describing connection information in
> > > a usable format, the result is that administrators are still forced to
> > > develop local solutions and then figure out how to relate them to
> > > EML. I'd hate to see EML perceived as useful for enabling outside
> > > institutions to build applications around site data but not very
> > > useful for sites in building their own applications.
> 
> > I see many of your points, however, in my mind there are 4
> > requirements for the connection model.  1) It must be machine
> > parsable and/or directly machine usable.  2) It must not require
> > that we add to or change the standard every time a new connection
> > protocol is introduced or an existing one is revised. 3) It must not
> > be based on proprietary connection protocols that limit the scope of
> > other types of connections.  4) It must be comprehensive, allowing
> > the description of any type of connection that one may want to list.
> 
> > The current method (using URLs to define connection points) follows
> > 1,2,3 and mostly 4.  I'm sure you can find some connection somewhere
> > that doesn't have a standard URL, but they are few and far between.
> > The only other method that I can see working is to develop a
> > name/value pair connection parameter model, where a connection is
> > defined by the set of name/value pairs of needed connection
> > parameters.  The problem with this is that to enable cross
> > connections, we may have to have some sort of dictionary or map that
> > shows what types of connections need what kinds of parameters.
> > Maybe we need a hybrid of the two.  What do you propose should take
> > the place of the URL?
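
To make the name/value pair idea above concrete, here is a minimal
Python sketch.  Everything in it is an assumption made for the
example: the connection type names, the parameter names, and the
REQUIRED_PARAMS dictionary are hypothetical and are not part of EML.

```python
# Hypothetical dictionary mapping a connection type to the parameter
# names it needs.  None of these names come from EML itself.
REQUIRED_PARAMS = {
    "http": {"url"},
    "sql": {"host", "port", "database", "table"},
    "ftp": {"host", "path"},
}


def missing_params(conn_type, params):
    """Return the sorted names of required parameters that are absent.

    Raises KeyError if the connection type is not in the dictionary.
    """
    return sorted(REQUIRED_PARAMS[conn_type] - set(params))


def is_complete(conn_type, params):
    """True when every required parameter for conn_type is supplied."""
    return not missing_params(conn_type, params)


# A connection described purely as name/value pairs.
sql_conn = {"host": "db.example.org", "port": "5432", "database": "lter"}

print(missing_params("sql", sql_conn))  # ['table']
print(is_complete("http", {"url": "http://example.org/data.csv"}))  # True
```

Under this sketch a new protocol would need only a new dictionary
entry rather than a schema change, though the dictionary itself would
still have to be agreed on and maintained somewhere.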
> 
> This is where the lack of a detailed specification for EML really
> begins to show, in my opinion.
> 
> Why should connection detail be necessary in EML, at all? 
> 
> What is the 'role of EML in the future development of a web service
> based network'?
> 
> I'm an unabashed supporter of the concept of 'Web-Services', but I'm
> not convinced that EML needs to be architected specially to support
> them.  It seems to me that support for connection definitions would be
> better defined elsewhere, perhaps in the WSDL or SOAP-RPC component(s)
> of the web-service in question.  In other words, EML should be
> 'usable' by a web-service, but it shouldn't define one.  Connection
> details, and other service info, should be defined somehow in the
> web-service, itself.  A link to this 'definition' could be served by
> the simple URL pointer, couldn't it?
> 
> Regarding the 'server-side tools' that are necessary, the connection
> details need to be known between the data source and the application
> server, and those might be site/application specific. Should EML
> maintain these details?  Expecting EML to provide the basis and
> momentum for standardized tools is unrealistic, IMHO.
> 
> But, this point of view reflects my incomplete understanding of EML's
> scope of intent.
> 
> > > 5) the recent traffic on reusable content partly underscores, I
> > > think, our failure to adequately separate storage and management
> > > of metadata from its presentation during this design process.
> 
> > I would say, that as EML is an XML metadata standard, presentation
> > has no place in EML.  We should be focusing on creating EML as a
> > metadata storage container.  Presentation can be done later with
> > stylesheets if the structure of the metadata storage mechanism is
> > accurate enough to hold all of the facets of the data which it is
> > modeling.  I don't think that eml has been built for presentation at
> > all.  We have attempted to organize certain sub-categories of
> > information, but that is not presentation, that is organization.
> > The problem with this is that everyone tends to think of the data
> > model for EML slightly differently, so there have been some
> > disagreements as to what types of information needs to be repeatable
> > (normalized) and some of the organization is sometimes an issue.
> 
> My comments regarding 'Granularity of Reusable Content' had nothing
> to do with presentation, nor does the issue as far as I'm concerned.
> I agree with Chad's assessment above, and would only add that the
> basic issue at stake here is 'Metadata maintenance'.  The real problem
> with metadata is keeping it up-to-date.  Creating it is definitely a
> challenge, but the promise of automated processing will help there.
> 'Updating' metadata is the true long-term burden, however, and that
> was the genesis of my thread: because the ability to identify
> repeatable discrete 'objects' in EML means updating those objects is
> simplified.
> 
> > > The former benefits from a high degree of granularity and
> > > normalization, the latter benefits from just the opposite
> > > (assuming size of the eml document is not an issue). The
> > > references element is a device to introduce some normalization
> > > capability within EML to better serve management of information at
> > > the expense of some convenience in reading it. It's not likely to
> > > satisfy everyone since it doesn't allow addressing between
> > > documents and this, as well as granularity, will be a perennial
> > > problem when trying to use EML to serve as both a metadata
> > > management format as well as a metadata presentation format. For
> > > those of us that are dynamically building an EML document from a
> > > normalized source such as a relational database or collections of
> > > independent xml fragments, this is far less an issue: we can
> > > choose our own level of granularity within our storage systems and
> > > frankly find it easier to write the same information out twice
> > > rather than going through the hassle of creating identifiers and
> > > remembering what they are during the entire output process.
> 
> I am building EML from normalized sources, and the problem of
> granularity still persists.  Chad summarizes this well...
> 
> > The problem with this approach is that there is no way to know
> > whether two sub-trees that have the same content, are, in fact, the
> > same object.  For instance, if you have entity alpha(A,B,C,D) and
> > entity beta(A,B,C,D) are they the same entity?  If you use
> > references, you know that alpha(A,B,C,D) with id=1 is the same
> > object as beta(refid=1).  This is very important for machine
> > processing of this metadata.  I would say that one should view EML
> > as a metadata propagation unit that no one will ever look at.  It
> > is, in essence, a machine language.  You would never look at the
> > binary format of an excel file and try to follow the pointers around
> > would you?  I don't think a human will try to do that with EML.  The
> > presentation should be completely separate from the storage and we
> > must follow the concrete rules that govern when to use a
> > relationship and when not to.  See
> > http://knb.ecoinformatics.org/software/eml/eml20docs/eml-docbook.html#reusableContent
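
The difference between equal content and shared identity can be
sketched in Python with the standard library's xml.etree.ElementTree.
The element names (dataset, entity, attribute) and the refid attribute
follow the informal notation used in this message, not the actual EML
schema, so they should be read as assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: beta points at alpha by reference, while
# gamma merely repeats the same content with no id or refid.
DOC = """
<dataset>
  <entity id="1" name="alpha"><attribute>A</attribute><attribute>B</attribute></entity>
  <entity name="beta" refid="1"/>
  <entity name="gamma"><attribute>A</attribute><attribute>B</attribute></entity>
</dataset>
"""

root = ET.fromstring(DOC)

# Index every entity that declares an id.
by_id = {e.get("id"): e for e in root.iter("entity") if e.get("id")}


def resolve(entity):
    """Follow a refid to the element it names, else return the entity."""
    refid = entity.get("refid")
    return by_id[refid] if refid is not None else entity


def content(entity):
    """List the attribute texts of the (resolved) entity."""
    return [a.text for a in resolve(entity).findall("attribute")]


alpha, beta, gamma = root.findall("entity")

print(resolve(beta) is alpha)            # True: known to be the same object
print(content(gamma) == content(alpha))  # True: same content...
print(resolve(gamma) is alpha)           # False: ...but identity is unknown
```

A program reading gamma can only guess that it duplicates alpha; with
the reference on beta there is nothing to guess.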
> 
> > When you talk about the hassle of creating identifiers, I'm not sure
> > what you are referencing.  It is really no trouble at all to add
> > identifiers programmatically.
> 
> > > I'd hate to see the issue of references and granularity hold up the
> > > design process given that (in my opinion) they aren't really
> > > necessary at all in order to define EML content (with the one
> > > exception of key definitions in eml-constraint which I don't like).
> 
> > I don't see references as holding up the design process.   They are part
> > of the design and hence need to be part of the design process.  Like I
> > said above, I think they are very necessary for advanced machine
> > processing with EML.
> 
> > > 6) Finally, and most significantly, the response from the workshop
> > > indicated that how we have organized project and protocol is at
> > > odds with most participants. the problem seems to stem from the
> > > fact that most sites view projects as something that exists at a
> > > different level from a dataset. while most agreed philosophically
> > > that there has to be a discrete intellectual activity to produce a
> > > dataset, few make any formal recognition of this
> > > activity. Instead, most see certain components (data collection
> > > methods, sampling, qaqc, etc) as direct properties of the
> > > dataset. what is recognized as a project seems to be defined more
> > > by administrative or research criteria that often are on a higher
> > > plane than an individual dataset.
> 
> I don't understand the objection. Could you please elaborate?
> 
> We have 'projects' which have one to 20 or more data sets and the
> basic structure of the modules in this regard doesn't seem to be an
> impediment.
> 
> > > A few acknowledged that they could live with using the immediate
> > > project element to record these more dataset-specific items and
> > > include a link to a higher-order project description, but this was
> > > hard to visualize at the time because that link was missing in
> > > beta 9.
> 
> > I don't necessarily think that project needs to be at the root level
> > of EML.  In fact, I think it used to be buried farther down.  If people
> > think of it as being farther down in the tree, I have no problem
> > with that change.  Where should it go though?  Any ideas?
> 
> If project isn't at the root level, what would be?
> 
> Here, a diagram of 'Module Dependencies in EML' would be quite
> useful.  I have taken the graphical module representations from the
> documentation and assembled them onto a huge poster.  But it is still
> really hard to visualize.
> 
> > > There were also similar problems with protocol. As Tim Bergsma put
> > > it (in better words than I did), we are trying to use one module
> > > to carry both prescriptive and descriptive information. In
> > > deciding to make protocol a resource-level element, we have really
> > > made the choice to use it as a prescriptive information tool -
> > > that is, a way of describing standardized protocols independent of
> > > any particular data collection instance. i even recall at least
> > > one person saying that their personal interpretation of "protocol"
> > > was As such, the information is only peripherally useful for
> > > describing the actual methods used to produce a specific dataset.
> 
> > > There was also some dissatisfaction with the organization of
> > > protocol. many objected to the idea of binding QAQC descriptions
> > > to specific methodStep descriptions. Again, there was no
> > > philosophical argument that quality control measures by definition
> > > impose control over actions, nevertheless it does not agree with
> > > how most organize this information. Instead, QAQC descriptions are
> > > typically stored independent of the descriptions of the
> > > methodology and cannot be easily linked in this way.
> 
> This problem may be larger than EML can hope to solve.  I would say
> this is an inherent problem in the way much research is conducted and
> audited.  The demands of structured metadata are just bringing the
> problem out in the light.
> 
> > > Finally, as we've seen in recent email traffic, there are
> > > frustrations with the perennial gray area of blending pure content
> > > markup (XML) with formatting markup (for predominantly textual
> > > content).
> 
> > I would agree with this.  I'm not sure how to handle it though. I'll
> > need to think more about it.
> 
> DocBook?
> 
> > > If I were to suggest changes to Beta9 to best address these
> > > responses, they might go something like this. I would change
> > > eml-project to be predominantly a research project description
> > > including staffing, funding, publications, and links to higher
> > > level projects. I would also leave eml-protocol as a resource
> > > module, but make it predominately text based and prescriptive,
> > > used only when a procedure has been formally worked out and used
> > > by many datasets.
> 
> > Sounds logical.
> 
> > > I would make a new module called methods, which i would use in
> > > every place that we now use protocol. methods would contain a
> > > repeatable methodStep element, which in turn would include
> > > references to source datasets (type eml-dataset), software (type
> > > eml-software), instrumentation, and any QAQC procedures that can
> > > be logically related to those steps.  Methods would also include
> > > optional links to eml-literature and eml-protocol as references to
> > > formally published or cataloged procedures.  I would create a new
> > > module researchContext in which i would include the methodological
> > > descriptors that directly qualify this dataset like site
> > > description, sampling, and the above methods module. Finally, for
> > > QAQC information that isn't described under methodology but is
> > > directly related to specific attributes in the data, I would
> > > suggest that the data-quality module (in its current incarnations
> > > as attributeAccuracy, horizonalAccuracy and verticalAccuracy)
> > > be used as the mechanism for describing both data quality
> > > and the various control/assurance procedures used to arrive at
> > > that quality.
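
The proposed methods and researchContext modules can be pictured
roughly as the Python dataclasses below.  This is only a sketch of the
containment relationships described above: the field names, the use of
plain strings for references, and the example values are all
assumptions, and nothing here is taken from an EML schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MethodStep:
    """One repeatable step; references are held as plain id strings."""
    description: str
    source_datasets: List[str] = field(default_factory=list)   # eml-dataset ids
    software: List[str] = field(default_factory=list)          # eml-software ids
    instrumentation: List[str] = field(default_factory=list)
    qaqc: List[str] = field(default_factory=list)  # QAQC tied to this step


@dataclass
class Methods:
    """Descriptive record of what was done for one dataset."""
    steps: List[MethodStep]
    literature_refs: List[str] = field(default_factory=list)  # optional links
    protocol_refs: List[str] = field(default_factory=list)    # optional links


@dataclass
class ResearchContext:
    """Methodological descriptors that directly qualify a dataset."""
    site_description: str
    sampling: str
    methods: Methods


# Hypothetical example values.
step = MethodStep(
    description="Measure stem diameter at breast height",
    instrumentation=["diameter tape"],
    qaqc=["remeasure 10% of stems"],
)
ctx = ResearchContext(
    site_description="upland forest plots",
    sampling="annual census",
    methods=Methods(steps=[step], protocol_refs=["protocol-17"]),
)

print(len(ctx.methods.steps))        # 1
print(ctx.methods.steps[0].qaqc)     # ['remeasure 10% of stems']
```

In this picture a prescriptive eml-protocol document is only pointed
to from Methods, while the descriptive detail lives in the steps.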
> 
> > > These of course are pretty extreme changes for a beta 9. I think
> > > that they probably describe something much more in line with
> > > metadata that the current LTER network is producing, but we have
> > > to weigh that against time-honored software development procedures
> > > which are designed to prevent knee-jerk changes like this so late
> > > in the game! With both momentum from the LTER network workshops
> > > and a desire to get going on the new ITR(s), we need EML 2.0 out
> > > the door soon or we will begin to lose our focus but we also need
> > > it to work.
> 
> > It seems like this would work fine.  I need to think it over a bit more
> > though.
> 
> The distinction of method and protocol seems useful.  Although I think
> I followed the arguments for the other changes more-or-less I don't
> yet see how they would create a dramatic improvement.  They don't
> appear to simplify things.
> 
> > > So what's the best course of action? an irc meeting? conference
> > > call? wait for a few days to see who responds to this email?
> 
> > I think we should give people a few days to digest this then have a
> > conference call with all interested parties some time next week.  I
> > propose Wednesday at 9:00 PDT for the call. Who would want to be in
> > on the call?  I think it would be good to have Scott and/or Tim
> > and/or other interested outside parties in on it to get an outside
> > perspective, lest we fall back into our original arguments that
> > we've been having for 2 years.  Anyone interested in the call
> > respond to eml-dev.  I can set up a Verizon conference call if we
> > don't have a situation where we can chain enough 2 line phones
> > together.  If you have a problem with the date/time of the call,
> > propose a different one.  I'm available all next week.
> 
> I would be happy to participate, if you think that would benefit the
> process.  Thanks for asking.
> 
> I really appreciate all the work everyone has contributed to the EML
> project.
> 
> > chad
> 
> -- 
> Scott E. Chapal_________________________________________________
> Database & Network Manager             scott.chapal at jonesctr.org
> J.W. Jones Ecological Research Center          229.734.4706 x227
> Rt. 2. Box. 2324. Newton, GA 31770-9651        229.734.6650 :FAX
> 
> _______________________________________________
> eml-dev mailing list
> eml-dev at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/eml-dev
-- 
-----------------------
Chad Berkley
National Center for 
Ecological Analysis 
and Synthesis (NCEAS)
berkley at nceas.ucsb.edu
-----------------------