[LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004]]

Peter McCartney peter.mccartney at asu.edu
Tue Aug 31 09:57:36 PDT 2004


Thanks, James and Mark, for following up on this. The more attention this
gets, the more likely we are to arrive at a solution that works for the
community.

I amended my bug to say basically the same thing you are
saying: that the "bug" is in the discrepancy between the EML spec and
what has more recently evolved as a best practice regarding ids.
Originally this was to have been enforced by the schema, but the key
definition was removed. So the only enforcement is by independent
software like the EML parser, which lies (in my opinion) outside the spec.

Obviously I'm not suggesting that we allow documents to contain
references tags pointing to nonexistent ids. What my proposal meant is
that if a user chooses to duplicate the same content within EML and give
each copy the same ID, that should not be flagged as invalid. Nor should
duplicate ids that have different content but also different system
attributes. If the EML parser has code to check for broken references, then
it should be left in, but those are not the kind of violations that are
being generated by our dynamically produced EML.

That said, let me comment on Mark's proposed fixes:

(1) Deprecate SYSTEM/SCOPE attributes in this context, update the
specification to reflect such change, and do not allow duplicate IDs.

I think the dialog between Matt and Duane indicates that this is not the
ultimate direction we want to go, so I don't recommend it as an interim
fix.


(2) Modify the specification to allow SYSTEM/SCOPE to narrow the ID
scope, thereby allowing duplicate IDs when qualified by either
SYSTEM/SCOPE -- and, modify the specification for REFERENCES to make use
of such change.

This is slightly more than the recommendation in my bug entry. Adding a
system attribute to <references> would require changing the schemas, and
if it is made required it would not be a backward-compatible change, as it
would cause previously valid documents to become invalid. Again, this does
not address ALL problems with IDs, but it does deal with some of them.
Either way, this would entail a 2.02 release, though if it is a serious
enough issue and the solution is kept backward compatible, that could
probably be fast-tracked.


(3) Deprecate REFERENCES completely and force repeated content.

This would invalidate existing documents, so it is not an option. A
backward-compatible fix would be to simply change the spec to read that
a references tag MUST point to an existing ID that is unique (within
system, if Mark's #2 is followed). However, references are optional, and
the user may elect to place the same content in multiple places, signifying
that the copies are identical by giving them the same id and system
attribute. Parsing software would be changed to 1) check that every
references tag resolves to one and only one ID, and 2) check that every id
and system combination that repeats has identical content. These
changes could be made without requiring a new release of EML (except for
the bit about checking references within restricted system scopes).
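
To make the two parser checks concrete, here is a rough sketch in Python
(3.8 or later, for ET.canonicalize). It is not the EML parser's actual
code. The names content_key and check_ids are hypothetical, and two
details are assumptions of this sketch: the wrapper element name is
ignored when comparing copies (so a party repeated as <creator> and
<contact> counts as identical), and a references tag "resolves to one ID"
when every element carrying that id has the same content.

```python
import copy
import xml.etree.ElementTree as ET
from collections import defaultdict


def content_key(elem):
    """Canonical form of an element's attributes and children, for comparison."""
    clone = copy.deepcopy(elem)
    clone.tail = None    # text after the closing tag is not part of the node
    clone.tag = "node"   # assumption: the wrapper element name is ignored
    return ET.canonicalize(ET.tostring(clone, encoding="unicode"), strip_text=True)


def check_ids(root):
    """Return a list of error strings for the two proposed checks."""
    by_pair = defaultdict(set)  # (id, system) -> distinct contents seen
    by_id = defaultdict(set)    # id -> distinct contents, used for <references>
    for elem in root.iter():
        if "id" in elem.attrib:
            key = content_key(elem)
            by_pair[(elem.get("id"), elem.get("system"))].add(key)
            by_id[elem.get("id")].add(key)

    errors = []
    # Check 2: a repeated id/system combination must carry identical content.
    for (id_, system), contents in by_pair.items():
        if len(contents) > 1:
            errors.append(f"id {id_!r} (system {system!r}) repeats with different content")
    # Check 1: every <references> must resolve to exactly one distinct object.
    for ref in root.iter("references"):
        target = (ref.text or "").strip()
        if len(by_id.get(target, ())) != 1:
            errors.append(f"references {target!r} does not resolve to exactly one object")
    return errors
```

Because <references> carries no system attribute today, the sketch looks
up references by id alone, so the same id under two different systems
still counts as ambiguous for a reference.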


It would really help me justify the extra work involved in managing ids
and references if someone could give me a concrete example of why it
would be bad to have a document contain two elements with identical ids
and identical content. If my xpath returns one or several nodes and they
are all identical, why is it so bad to just assume that the rule is
"identical id (and system) means identical content" and just use the
first one in the list? I think it is no more work to write parsers to
check for differences between nodes with identical ids than it is to check
for duplicate ids in the first place, but it makes generating valid EML
a LOT simpler.
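
As an illustration of the "use the first one in the list" rule, a
consumer could look something like the sketch below. The function name
resolve_first is hypothetical, and it assumes the document has already
been checked so that matching nodes really are identical.

```python
import xml.etree.ElementTree as ET


def resolve_first(root, ref_id, system=None):
    """Return the first element, in document order, whose id matches.

    If system is given, the element's system attribute must match too.
    Returns None when nothing matches.
    """
    for elem in root.iter():
        if elem.get("id") != ref_id:
            continue
        if system is not None and elem.get("system") != system:
            continue
        return elem
    return None
```

With an id of "30" on both a dataset (system "ces_dataset") and a creator
(system "ces_party"), passing the system string is what lets the caller
pick the right one.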

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental-Studies
Arizona State University
 


> -----Original Message-----
> From: owner-im at lternet.edu [mailto:owner-im at lternet.edu] On
> Behalf Of James W Brunt
> Sent: Monday, August 30, 2004 2:57 PM
> To: eml-dev at ecoinformatics.org; emlbestpractices at lternet.edu; im at lternet.edu
> Subject: [LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004]]
> 
> 
> Peter, et al.,
> 
> Mark's email to me (below) has reinforced my own conclusion about the
> id, system, references question. There are at least 2, possibly 3, issues
> (bugs if you will) here to be dealt with:
> 
> 1. The eml normative documentation needs to reflect the real intent and
> use of the system attribute. Read (Can O Worms). Options as I see them:
> 	a. deprecate the system attribute until it can be better defined -
> 	   ignore 2 and 3 below (Mark goes even further on this one below).
> 	b. clearly define the system attribute and make the changes in 2 and 3
> 	   below.
> 
> 2. <references> tag needs to be made system/scope aware
> 
> 3. EMLparser needs to enforce the final outcome of 1 and 2.
> 
> Currently, the documentation introduces system but its definition does
> not supersede the unique ID requirement within a document, references
> is not system aware, and EMLparser is enforcing exactly what the
> documentation says.
> 
> Turning off the ID checking as Peter has suggested (different thread)
> would result in uninterpretable EML documents were the references tag
> to be used (although in all but one case in the example below there
> were no references to the IDs). I don't see this as an intermediate
> solution.
> 
> The intent, as I remember all that long discussion ago, was to create a
> way to get around having to completely duplicate content in a document,
> thus creating a more compact document and one that would be more
> easily maintained for someone not generating the documents from a
> database. I'm sure I can be clarified some here by others that were
> present. I realize the difficulty in tracking a document ID map for
> every document you automatically generate; however, I really don't
> understand why you wouldn't completely duplicate the content.
> However, the inclusion of a second qualifying attribute that has to be
> checked for every id tag is doable, but before we begin something like
> this it must be clearly spelled out and agreeable to the group(s).
> We'd like to hear from eml-dev, eml-bestpractices, and im as well as
> individual stakeholders.
> 
> Thanks,
> 
> James
> 
> --
> James W. Brunt
> Associate Director for Information Management
> Long Term Ecological Research Network Office
> Department of Biology
> University of New Mexico
> Albuquerque, NM 87131-1091
> 505 272 7085
> jbrunt at lternet.edu
> 
> 
> -------- Original Message --------
> From: Mark Servilla <servilla at lternet.edu>
> To: James Brunt <jbrunt at lternet.edu>
> Subject: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25
> 11:00:36 MDT 2004]
> 
> James,
> 
> After reviewing the EML specification documents, it appears
> to me that duplicate IDs within a single instance document are 
> not valid EML, and therefore (IMHO), the EML Parser is 
> behaving correctly.  I cannot see how setting either the 
> SYSTEM or SCOPE attribute can be used by the REFERENCES 
> element to distinguish duplicate IDs within a single document 
> (perhaps someone in eml-dev can help answer how SYSTEM/SCOPE 
> are used in this context).
> 
> Some possible solutions are:
> (1) Deprecate SYSTEM/SCOPE attributes in this context, update
> the specification to reflect such change, and do not allow 
> duplicate IDs.
> (2) Modify the specification to allow SYSTEM/SCOPE to narrow 
> the ID scope, thereby allowing duplicate IDs when qualified 
> by either SYSTEM/SCOPE -- and, modify the specification for 
> REFERENCES to make use of such change.
> (3) Deprecate REFERENCES completely and force repeated content.
> 
> Just my thoughts - thanks!
> 
> Mark
> 
> -------- Original Message --------
> Subject: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
> Date: Mon, 30 Aug 2004 09:26:13 -0600
> From: Mark Servilla <servilla at lternet.edu>
> To: 'Corinna Gries' <corinna at asu.edu>
> CC: James Brunt <jbrunt at lternet.edu>, Duane Costa <dcosta at lternet.edu>
> References: <E1C0TNQ-00066I-00 at lternet.lternet.edu>
> 
> Hi Corinna,
> 
> I have been discussing this issue of ID attributes with James
> and Duane here at LNO.  Please correct me if I am wrong, but
> the section on Reusable Content (below or
> http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html#reusableContent)
> states that "two identical ids cannot exist in a single
> document".  It appears that the "SYSTEM" attribute only
> allows identical ids in multiple documents within the system
> (that is, only if the repeated ids reference the exact same
> object) - something like globalizing the id'ed object to the
> system for repeated reference in one or more documents, but
> not necessarily allowing identical ids within a single
> document by changing the SYSTEM attribute value.  I am not
> really sure how one would take advantage of the SYSTEM
> attribute for reusable content.  And, I don't know the
> provenance of this particular issue (the documentation could
> certainly be more clear), but if we were to follow the
> documentation as we interpret it, would this still be a bug in
> the Harvester/Metacat software?
> 
> Sincerely,
> Mark
> 
> 3.3. Reusable Content
> EML allows the reuse of previously defined structured content (DOM
> sub-trees) through the use of key/keyRef type references. In
> order for an EML package to remain cohesive and to allow for
> the cross platform compatibility of packages, the following
> rules with respect to packaging must be followed.
>  1. An ID is required on the eml root element.
>  2. IDs are optional on all other elements.
>  3. If an ID is not provided, that content must be interpreted as
>     representing a distinct object.
>  4. If an ID is provided for content then that content is distinct
>     from all other content except for that content that references
>     its ID.
>  5. If a user wants to reuse content to indicate the repetition of an
>     object, a reference must be used. Two identical ids cannot exist
>     in a single document.
>  6. "Document" scope is defined as identifiers unique only to a
>     single instance document (if a document does not have a system
>     attribute or if scope is set to 'document' then all IDs are
>     defined as distinct content).
>  7. "System" scope is defined as identifiers unique to an entire data
>     management system (if two documents share a system string, then
>     any IDs in those two documents that are identical refer to the
>     same object).
>  8. If an element references another element, it must not have an ID
>     itself.
>  9. All EML packages must have the 'eml' module as the root.
> 10. The system and scope attribute are always optional except for at
>     the 'eml' module where the scope attribute is fixed as 'system'.
>     The scope attribute defaults to 'document' for all other modules.
> 
> Duane Costa wrote:
> > Could anyone comment as to whether the EML error reported by Metacat
> > below is a genuine EML error versus a bug in Metacat or the EML
> > validator program? The issue is whether the id value for <dataset> 
> > must be unique from the id value for <creator>.
> > 
> > Thanks,
> > Duane
> > 
> > -----Original Message-----
> > From: Corinna Gries [mailto:corinna at asu.edu]
> > Sent: Thursday, August 26, 2004 3:48 PM
> > To: dcosta at lternet.edu
> > Subject: RE: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
> > 
> > Hi Duane,
> > 
> > I am trying to fix these problems with our eml files. Some are easy
> > because they are actual errors in our files, but there is one where I
> > wonder if the ID checking is right. I understood IDs should be unique
> > within the system, that is for example:
> > 
> > <dataset id="30" system="ces_dataset"> ... is different from
> > <creator id="30" system="ces_party"> ....
> > 
> > However, your harvester complains that they are the same:
> > 
> > 
> > *****************************************************************************
> > *
> > * METACAT HARVESTER REPORT: Wed Aug 25 11:00:36 MDT 2004
> > *
> > * A TOTAL OF 22 ERRORS WERE DETECTED.
> > * Please see the log entries below for additonal details.
> > *
> > *****************************************************************************
> > 
> > *****************************************************************************
> > *
> > * harvestLogID:         5549
> > * harvestDate:          Wed Aug 25 11:00:36 MDT 2004
> > * status:               1
> > * message:              
> > * harvestOperationCode: InsertDocError
> > * description:          Error inserting EML document to Metacat
> > * detailLogID:          383
> > * errorMessage:         MetacatException: <?xml version="1.0"?>
> > <error>
> > Error running xpath expression:
> > //dateTimeDomain|//nonNumericDomain|//numericDomain|//access|
> > //attributeList|//constraint|//coverage|//temporalCoverage|
> > //geographicCoverage|//taxonomicCoverage|/dataset|/eml/dataset|
> > //dataSource|//dataTable|//otherEntity|//citation|//address|
> > //conferenceLocation|//party|//originator|//creator|//contact|
> > //publisher|//editor|//recipient|//performer|//institution|
> > //metadataProvider|//associatedParty|//personnel|//physical|
> > //connectionDefinition|//distribution|//researchProject|//project|
> > //relatedProject|//software|//spatialRaster|//spatialReference|
> > //spatialVector|//storedProcedure|//view|//protocol|
> > //additionalMetadata : Error in xml
> > document.  This EML document is not valid because the id 30 occurs
> > more than once.  IDs must be unique. </error>
> > 
> > * scope:                ces_dataset
> > * identifier:           30
> > * revision:             1
> > * documentType:         eml://ecoinformatics.org/eml-2.0.0
> > * documentURL:
> > http://seinet.asu.edu/DataCatalog/getXanthoriaRecord.jsp?source=ces_dataset_mohave&id=30
> > *
> > *****************************************************************************
> > 
> > What do you think?
> > 
> > Corinna
> > 
> 
> --
> Mark Servilla, Ph.D.
> 
> LTER Network Office
> Department of Biology
> MSC 03 2020
> 1 University of New Mexico
> Albuquerque, NM 87131-0001
> 
> servilla at lternet.edu
> Office (505) 277-2619
> Cell   (505) 453-8593
> 
> 
> 
> 


