[LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004]]

Matt Jones jones at nceas.ucsb.edu
Tue Aug 31 11:35:09 PDT 2004


Hi all,

Glad we're having this discussion....

Peter McCartney wrote:
> Thanks james and mark for following up on this. The more attention this
> gets, the more we will arrive at a solution that works for the
> community. 
> 
> I amended my bug to basically say the same as what you are
> saying...that the "bug" is in the discrepancy between the eml spec and
> what has more recently evolved as a best practice regarding ids.
> Originally this was to have been enforced by the schema but the key
> definition was removed. So the only enforcement is by independent
> software like EML parser that lies (in my opinion) outside the spec.

Well, I removed the key defs from the schema because XML Schema could 
not express the keys as we wanted to define them in EML.  So, way back 
when, we agreed that we would need to define a separate parser to enforce 
those constraints -- which is what we did with the EML parser.  We fully 
intended the EML Parser to be part of the spec.  Here's the 
relevant discussion in a bug:

http://bugzilla.ecoinformatics.org/show_bug.cgi?id=586

Also, see sections 3.3 and 3.3.1.1 of the spec for a discussion of 
the use of IDs and the parser in the spec.

> 
> Obviously I'm not suggesting that we allow documents to contain
> references tags pointing to non-existent ids. What my proposal meant is
> that if a user chooses to duplicate the same content within eml and give
> them the same ID, that should not be flagged as invalid. Nor should
> duplicate ids that have different content but also different system
> tags. If the eml parser has code to check for broken references, then it
> should be left in. But those are not the kind of violations that are
> being generated by our dynamically produced EML. 
> 
> That said, let me comment on Mark's proposed fixes:
> 
> (1) Deprecate SYSTEM/SCOPE attributes in this context, update the
> specification to reflect such change, and do not allow duplicate IDs.
> 
> I think the dialog between Matt and Duane indicates that this is not the
> ultimate direction we want to go, so I don't recommend it as an interim
> fix.

I agree.

> 
> 
> (2) Modify the specification to allow SYSTEM/SCOPE to narrow the ID
> scope, thereby allowing duplicate IDs when qualified by either
> SYSTEM/SCOPE -- and, modify the specification for REFERENCES to make use
> of such change.
> 
> This is slightly more than the recommendation of my bug entry. Adding a
> system attribute to <references> would require changing the schemas and
> if it is made required, would not be a backward compatible change as it
> would cause previously valid documents to be invalid. Again, this does
> not address ALL problems with IDs but it does deal with some of them.
> Either way, this would entail a 2.02 release, though if it is a serious
> enough issue and the solution is kept backward compatible, that could
> probably be fast-tracked.

I think this is the right approach.  It lets us still validate refs, but 
allows more flexibility in making the ids conform to existing ids used 
in various external systems.  It may not be backward compatible, 
depending on what format we concoct for the references tag.  But we 
might be able to do it in a backwards compatible manner.
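
A minimal sketch, in Python, of what checking uniqueness on the
(system, id) pair rather than on id alone could look like.  This is only
an illustration: the function name is hypothetical, the fallback to the
root element's system attribute is an assumption, and it says nothing
about what format the references tag would end up using.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_scoped_ids(xml_text):
    """Return the (system, id) pairs that occur more than once."""
    root = ET.fromstring(xml_text)
    # Assumption: an element without its own system attribute falls
    # back to the system declared on the root element, else "".
    default_system = root.get("system") or ""
    counts = Counter(
        (el.get("system") or default_system, el.get("id"))
        for el in root.iter()
        if el.get("id") is not None
    )
    return sorted(pair for pair, n in counts.items() if n > 1)


# Same id under two different systems: no clash under this rule.
doc_ok = (
    '<eml system="ces">'
    '<dataset id="30" system="ces_dataset">'
    '<creator id="30" system="ces_party"/>'
    '</dataset></eml>'
)
# Same id twice under the same system: still a duplicate.
doc_bad = (
    '<eml system="ces">'
    '<creator id="30" system="ces_party"/>'
    '<creator id="30" system="ces_party"/>'
    '</eml>'
)
print(duplicate_scoped_ids(doc_ok))
print(duplicate_scoped_ids(doc_bad))
```

Under this rule the dataset/creator case from the harvester report would
pass, while a true repeat within one system would still be rejected.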

> 
> 
> (3) Deprecate REFERENCES completely and force repeated content.
> 
> This would invalidate existing documents, so it is not an option. A
> backward compatible fix would be to simply change the spec to read that
> a references tag MUST point to an existing ID that is unique (within
> system if Mark's #2 is followed). However, references are optional and the
> user may elect to place the same content in multiple places, signifying
> that they are identical by giving them the same id and system
> attribute. 

I don't like this.  In particular, it makes the keys non-unique, which I 
think is a real problem -- it's like having a relational db with a table 
that allows duplicate keys -- they aren't really keys then, are they?

> Parsing software would be changed to 1) check that every
> references tag resolves to one and only one ID, and 2) every id and
> system combination that repeats is checked for identical match. These
> changes could be done without requiring a new release of EML (except for
> the bit about checking references within restricted system scopes)
> 
> 
> It would really help me justify the extra work involved in managing ids
> and references if someone could give me a concrete example of why it
> would be bad to have a document contain two elements with identical ids
> and identical content. 

As in relational systems, the key (id) acts as a surrogate for 
the content.  So, references should resolve to one (and only one) id. 
It is far harder to validate that the content is the same between two 
nodes with identical keys than it is to validate that no key is 
duplicated.  I think they got this right in the relational model, and we 
should follow that lead.  If you allow duplicate ids, then I am sure 
this situation will arise:

<a id="1">foo</a>
<a id="1">bar</a>
<b><references>1</references></b>

What is the value of <b>?  foo, or bar?  It is indeterminate.  And this 
is precisely why this is a problem.
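
To make the check concrete, here is a rough Python sketch of the two
validations discussed in this thread: no id occurs twice, and every
references tag points at an id that exists.  It is not the EML parser's
actual code; the function name and the <eml> wrapper element around the
example are assumptions made so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_ids(xml_text):
    """Return (duplicate_ids, dangling_references) for one document."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    # An id used more than once cannot serve as a key.
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    # A reference must name an id that is present in the document.
    referenced = {(ref.text or "").strip() for ref in root.iter("references")}
    dangling = sorted(referenced - set(counts))
    return duplicates, dangling


doc = (
    '<eml>'
    '<a id="1">foo</a>'
    '<a id="1">bar</a>'
    '<b><references>1</references></b>'
    '</eml>'
)
print(check_ids(doc))  # (['1'], [])
```

Counting ids is a single pass over the tree; comparing the content of
every pair of same-id subtrees is the harder job that unique keys avoid.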

> If my xpath returns one or several nodes and they
> are all identical, why is it so bad to just assume that the rule is:
> "identical id (and system) means identical content" and just use the
> first one in the list?

Because experience with relational databases has shown that this does 
not work in practice.  I think that such an assumption will result in 
lots of broken docs.

> I think it is no more work to write parsers to
> check for differences between nodes with similar ids than it is to check
> for duplicate ids in the first place, but it makes generating valid eml
> a LOT simpler.

Generating valid eml with only one copy of a subtree is easy -- just 
track whether you've already inserted it, and reference it thereafter. 
I don't understand at all why this is hard.  However, I do understand 
the problem with system not being included in the assessment of the 
uniqueness of the ID.  So I like the idea of pursuing Mark's suggestion (2).
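
As a sketch of the "track whether you've already inserted it" approach,
something like the following Python would do.  The class name, the
"party.30" key, and the surName builder are all made up for
illustration; a real generator would take its keys from whatever
database is producing the document.

```python
import xml.etree.ElementTree as ET


class SubtreeWriter:
    """Emit each keyed subtree once; later uses become references."""

    def __init__(self):
        self._seen = set()  # ids already written in full

    def add(self, parent, tag, key, build):
        el = ET.SubElement(parent, tag)
        if key in self._seen:
            # Already written: point at it instead of repeating it.
            # The referencing element carries no id of its own.
            ET.SubElement(el, "references").text = key
        else:
            self._seen.add(key)
            el.set("id", key)
            build(el)  # caller fills in the real content
        return el


def person(el):
    ET.SubElement(el, "surName").text = "Jones"


root = ET.Element("dataset")
writer = SubtreeWriter()
writer.add(root, "creator", "party.30", person)
writer.add(root, "contact", "party.30", person)
print(ET.tostring(root, encoding="unicode"))
```

The only state the generator has to carry is the set of ids it has
already written for the current document.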

Matt

> 
> Peter McCartney (peter.mccartney at asu.edu)
> Center for Environmental-Studies
> Arizona State University
>  
> 
> 
> 
>>-----Original Message-----
>>From: owner-im at lternet.edu [mailto:owner-im at lternet.edu] On
>>Behalf Of James W Brunt
>>Sent: Monday, August 30, 2004 2:57 PM
>>To: eml-dev at ecoinformatics.org; emlbestpractices at lternet.edu; 
>>im at lternet.edu
>>Subject: [LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat 
>>Harvester: Wed Aug 25 11:00:36 MDT 2004]]
>>
>>
>>Peter,  et. al,
>>
>>Mark's email to me (below) has reinforced my own conclusion about the
>>id, system, references question. There are at least 2, possibly 3, issues
>>(bugs if you will) here to be dealt with:
>>
>>1. The eml normative documentation needs to reflect the real intent and
>>use of the system attribute. Read (Can O Worms). Options as I see them:
>>	a. deprecate the system attribute until it can be better defined -
>>ignore 2 and 3 below (Mark goes even further on this one below).
>>	b. clearly define the system attribute and make the changes in 2 and 3
>>below.
>>
>>2. <references> tag needs to be made system/scope aware
>>
>>3. EMLparser needs to enforce the final outcome of 1 and 2.
>>
>>Currently, the documentation introduces system but its definition does
>>not supersede the unique ID requirement within a document, references
>>is not system aware, and EMLparser is enforcing exactly what the
>>documentation says.
>>
>>Turning off the ID checking as Peter has suggested (different thread)
>>would result in uninterpretable EML documents were the references tag
>>to be used (although in all but one case in the example below there
>>were no references to the IDs). I don't see this as an intermediate
>>solution.
>>
>>The intent as I remember all that long discussion ago was to create a
>>way to get around having to completely duplicate content in a document,
>>thus creating a more compact document and one that would be more
>>easily maintained for someone not generating the documents from a
>>database. I'm sure I can be clarified some here by others that were
>>present. I realize the difficulty in tracking a document ID map for
>>every document you automatically generate; however, I really don't
>>understand why you wouldn't completely duplicate the content.
>>However, the inclusion of a second qualifying attribute that has to be
>>checked for every id tag is doable, but before we begin something like
>>this it must be clearly spelled out and agreeable to the group(s).
>>We'd like to hear from eml-dev, eml-bestpractices, and im as well as
>>individual stakeholders.
>>
>>Thanks,
>>
>>James
>>
>>--
>>James W. Brunt
>>Associate Director for Information Management
>>Long Term Ecological Research Network Office
>>Department of Biology
>>University of New Mexico
>>Albuquerque, NM 87131-1091
>>505 272 7085
>>jbrunt at lternet.edu
>>
>>
>>-------- Original Message --------
>>From: Mark Servilla <servilla at lternet.edu>
>>To: James Brunt <jbrunt at lternet.edu>
>>Subject: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>11:00:36 MDT 2004]
>>
>>James,
>>
>>After reviewing the EML specification documents, it appears to me that
>>duplicate IDs within a single instance document are not valid EML, and
>>therefore (IMHO), the EML Parser is behaving correctly.  I cannot see
>>how setting either the SYSTEM or SCOPE attribute can be used by the
>>REFERENCES element to distinguish duplicate IDs within a single document
>>(perhaps someone in eml-dev can help answer how SYSTEM/SCOPE are used
>>in this context).
>>
>>Some possible solutions are:
>>(1) Deprecate SYSTEM/SCOPE attributes in this context, update
>>the specification to reflect such change, and do not allow 
>>duplicate IDs.
>>(2) Modify the specification to allow SYSTEM/SCOPE to narrow 
>>the ID scope, thereby allowing duplicate IDs when qualified 
>>by either SYSTEM/SCOPE -- and, modify the specification for 
>>REFERENCES to make use of such change.
>>(3) Deprecate REFERENCES completely and force repeated content.
>>
>>Just my thoughts - thanks!
>>
>>Mark
>>
>>-------- Original Message --------
>>Subject: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
>>Date: Mon, 30 Aug 2004 09:26:13 -0600
>>From: Mark Servilla <servilla at lternet.edu>
>>To: 'Corinna Gries' <corinna at asu.edu>
>>CC: James Brunt <jbrunt at lternet.edu>, Duane Costa <dcosta at lternet.edu>
>>References: <E1C0TNQ-00066I-00 at lternet.lternet.edu>
>>
>>Hi Corinna,
>>
>>I have been discussing this issue of ID attributes with James
>>and Duane here at LNO.  Please correct me if I am wrong, but 
>>the section on Reusable Content (below or
>>http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html#reusableContent)
>>states that "two identical ids cannot exist in a single 
>>document".  It appears that the "SYSTEM" attribute only 
>>allows identical ids in multiple documents within the system 
>>(that is, only if the repeated ids reference the exact same 
>>object) - something like globalizing the id'ed object to the 
>>system for repeated reference in one or more documents, but 
>>not necessarily allowing identical ids within a single 
>>document by changing the SYSTEM attribute value.  I am not 
>>really sure how one would take advantage of the SYSTEM 
>>attribute for reusable content.  And, I don't know the 
>>provenance of this particular issue (the documentation could 
>>certainly be more clear), but if we were to follow the 
>>documentation as we interpret, would this still be a bug in 
>>the Harvester/Metacat software?
>>
>>Sincerely,
>>Mark
>>
>>3.3. Reusable Content
>>EML allows the reuse of previously defined structured content (DOM
>>sub-trees) through the use of key/keyRef type references. In order for
>>an EML package to remain cohesive and to allow for the cross platform
>>compatability of packages, the following rules with respect to
>>packaging must be followed.
>>1. An ID is required on the eml root element.
>>2. IDs are optional on all other elements.
>>3. If an ID is not provided, that content must be interpreted as
>>   representing a distinct object.
>>4. If an ID is provided for content then that content is distinct from
>>   all other content except for that content that references its ID.
>>5. If a user wants to reuse content to indicate the repetition of an
>>   object, a reference must be used. Two identical ids cannot exist in
>>   a single document.
>>6. "Document" scope is defined as identifiers unique only to a single
>>   instance document (if a document does not have a system attribute or
>>   if scope is set to 'document' then all IDs are defined as distinct
>>   content).
>>7. "System" scope is defined as identifiers unique to an entire data
>>   management system (if two documents share a system string, then any
>>   IDs in those two documents that are identical refer to the same
>>   object).
>>8. If an element references another element, it must not have an ID
>>   itself.
>>9. All EML packages must have the 'eml' module as the root.
>>10. The system and scope attribute are always optional except for at
>>   the 'eml' module where the scope attribute is fixed as 'system'. The
>>   scope attribute defaults to 'document' for all other modules.
>>
>>Duane Costa wrote:
>>
>>>Could anyone comment as to whether the EML error reported by Metacat
>>>below is a genuine EML error versus a bug in Metacat or the EML
>>>validator program? The issue is whether the id value for <dataset> 
>>>must be unique from the id value for <creator>.
>>>
>>>Thanks,
>>>Duane
>>>
>>>-----Original Message-----
>>>From: Corinna Gries [mailto:corinna at asu.edu]
>>>Sent: Thursday, August 26, 2004 3:48 PM
>>>To: dcosta at lternet.edu
>>>Subject: RE: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
>>>
>>>Hi Duane,
>>>
>>>I am trying to fix these problems with our eml files. Some are easy
>>>because they are actual errors in our files, but there is one where I
>>>wonder if the ID checking is right. I understood IDs should be unique
>>>within the system, that is for example:
>>>
>>><dataset id="30" system="ces_dataset"> ... Is different from
>>><creator id="30" system="ces_party"> ....
>>>
>>>However, your harvester complains that they are the same:
>>>
>>>
>>>*****************************************************************************
>>>*
>>>* METACAT HARVESTER REPORT: Wed Aug 25 11:00:36 MDT 2004
>>>*
>>>* A TOTAL OF 22 ERRORS WERE DETECTED.
>>>* Please see the log entries below for additonal details.
>>>*
>>>*****************************************************************************
>>>
>>>*****************************************************************************
>>>*
>>>* harvestLogID:         5549
>>>* harvestDate:          Wed Aug 25 11:00:36 MDT 2004
>>>* status:               1
>>>* message:              
>>>* harvestOperationCode: InsertDocError
>>>* description:          Error inserting EML document to Metacat
>>>* detailLogID:          383
>>>* errorMessage:         MetacatException: <?xml version="1.0"?>
>>><error>
>>>Error running xpath expression:
>>>//dateTimeDomain|//nonNumericDomain|//numericDomain|//access|
>>>//attributeList|//constraint|//coverage|//temporalCoverage|
>>>//geographicCoverage|//taxonomicCoverage|/dataset|/eml/dataset|
>>>//dataSource|//dataTable|//otherEntity|//citation|//address|
>>>//conferenceLocation|//party|//originator|//creator|//contact|
>>>//publisher|//editor|//recipient|//performer|//institution|
>>>//metadataProvider|//associatedParty|//personnel|//physical|
>>>//connectionDefinition|//distribution|//researchProject|//project|
>>>//relatedProject|//software|//spatialRaster|//spatialReference|
>>>//spatialVector|//storedProcedure|//view|//protocol|
>>>//additionalMetadata : Error in xml
>>>document.  This EML document is not valid because the id 30 occurs
>>>more than once.  IDs must be unique. </error>
>>>
>>>* scope:                ces_dataset
>>>* identifier:           30
>>>* revision:             1
>>>* documentType:         eml://ecoinformatics.org/eml-2.0.0
>>>* documentURL:
>>>http://seinet.asu.edu/DataCatalog/getXanthoriaRecord.jsp?source=ces_dataset_mohave&id=30
>>>*
>>>*****************************************************************************
>>>
>>>What do you think?
>>>
>>>Corinna
>>>
>>>_______________________________________________
>>>eml-dev mailing list
>>>eml-dev at ecoinformatics.org
>>>http://www.ecoinformatics.org/mailman/listinfo/eml-dev
>>
>>--
>>Mark Servilla, Ph.D.
>>
>>LTER Network Office
>>Department of Biology
>>MSC 03 2020
>>1 University of New Mexico
>>Albuquerque, NM 87131-0001
>>
>>servilla at lternet.edu
>>Office (505) 277-2619
>>Cell   (505) 453-8593
>>
>>
>>
>>--
>>James W. Brunt
>>Associate Director for Information Management
>>Long Term Ecological Research Network Office
>>Department of Biology
>>University of New Mexico
>>Albuquerque, NM 87131-1091
>>505 272 7085
>>jbrunt at lternet.edu
>>
>>-------------------------------------------------
>>Long-Term Ecological Research Network Mailing List
>>im at LTERnet.edu http://sql.lternet.edu/cgi/mailgroups_view.pl?im
>>

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------