[LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004]]

Tue Aug 31 13:31:38 PDT 2004

Just a clarification: the specific error we have been discussing
concerns two identical ids with different content:

<dataset id="30" system="ces_dataset"> ... is different from
<creator id="30" system="ces_party"> ....

Admittedly, were the content the same we would still get the error (if 
the parser is written to the spec). However, if there were (in this case 
there wasn't) a

<references>30</references>

it would be ambiguous. Correct?
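
To make the ambiguity concrete, here is a minimal sketch, assuming Python's standard xml.etree.ElementTree. The document is a made-up, simplified fragment (not schema-valid EML), and targets_of is a hypothetical helper. It looks up the target of a <references> value by id alone, which is all the current <references> element gives us:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: two elements share id "30" but
# carry different system values, plus one reference to "30".
DOC = """
<eml>
  <dataset id="30" system="ces_dataset">
    <creator id="30" system="ces_party"><surName>Example</surName></creator>
    <contact><references>30</references></contact>
  </dataset>
</eml>
"""

def targets_of(root, ref_value):
    """Return every element whose id attribute equals ref_value."""
    return [el for el in root.iter() if el.get("id") == ref_value]

root = ET.fromstring(DOC)
ref = root.find(".//references")
matches = targets_of(root, ref.text.strip())

# More than one match means the reference cannot be resolved by id alone.
print([m.tag for m in matches])
print("ambiguous" if len(matches) > 1 else "unambiguous")
```

With both elements present the lookup returns two candidates, so a reader of the document has no way to pick one without extra information such as the system value.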

James

Peter McCartney wrote:
> On Tue, 2004-08-31 at 11:35, Matt Jones wrote:
> 
>>>It would really help me justify the extra work involved in managing ids
>>>and references if someone could give me a concrete example of why it
>>>would be bad to have a document contain two elements with identical ids
>>>and identical content. 
>>
>>Like in other relational systems, the key (id) acts as a surrogate for 
>>the content.  So, references should resolve to one (and only one) id. 
>>It is far harder to validate that the content is the same between two 
>>nodes with identical keys than it is to validate that no key is 
>>duplicated.  I think they got this right in the relational model, and we 
>>should follow that lead.  If you allow duplicate ids, then I am sure 
>>this situation will arise:
>>
>><a id="1">foo</a>
>><a id="1">bar</a>
>><b><references>1</references></b>
>>
>>What is the value of <b>?  foo, or bar?  It is indeterminate.  And this 
>>is precisely why this is a problem.
> 
> 
> I agree this would be bad, but this is not what is happening. The
> documents that are being rejected have:
> <a id="1">foo</a>
> <a id="1">foo</a>
> Typically, when this happens, the code is obviously not bothering with
> references tags, so we aren't likely to create broken or ambiguous
> reference tags. Even if we did throw in a
> <b><references>1</references></b>, it really wouldn't be a problem. In
> some of our files where attributes are repeated in view entities, we are
> also getting this:
> 
> <a id="1">foo</a>
> <a id="2">foo</a>
> 
> but your parser hasn't spotted that one yet :) and again, even though it
> violates the spec, I would contend that this causes no problem.
> 
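
The cases above can be told apart mechanically. The following is a sketch only, assuming Python 3.8+ (for xml.etree.ElementTree.canonicalize); classify_duplicate_ids is a hypothetical helper, not part of Metacat or the EML parser. It groups elements by id and compares their canonical XML form:

```python
import copy
import xml.etree.ElementTree as ET
from collections import defaultdict

def canonical(el):
    """Canonical (C14N 2.0) text of one element, ignoring trailing text."""
    clone = copy.deepcopy(el)
    clone.tail = None  # the tail belongs to the parent, not to this subtree
    return ET.canonicalize(ET.tostring(clone, encoding="unicode"),
                           strip_text=True)

def classify_duplicate_ids(xml_text):
    """Map each repeated id to 'identical' or 'conflicting' content."""
    root = ET.fromstring(xml_text)
    by_id = defaultdict(list)
    for el in root.iter():
        if el.get("id") is not None:
            by_id[el.get("id")].append(el)
    result = {}
    for id_value, elements in by_id.items():
        if len(elements) > 1:
            forms = {canonical(el) for el in elements}
            result[id_value] = "identical" if len(forms) == 1 else "conflicting"
    return result

print(classify_duplicate_ids('<r><a id="1">foo</a><a id="1">foo</a></r>'))
print(classify_duplicate_ids('<r><a id="1">foo</a><a id="1">bar</a></r>'))
print(classify_duplicate_ids('<r><a id="1">foo</a><a id="2">foo</a></r>'))
```

Note that the third case (different ids, same content) is not reported at all, because the check only looks at ids that repeat. Attributes are part of the canonical form, so two elements that differ only in their system value would count as conflicting here.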
> 
>>>If my xpath returns one or several nodes and they
>>>are all identical, why is it so bad to just assume that the rule is:
>>>"identical id (and system) means identical content" and just use the
>>>first one in the list?
>>
>>Because relational models have shown that this never works.  I think 
>>that such an assumption will result in lots of broken docs.
>>
>>>I think it is no more work to write parsers to
>>>check for differences between nodes with similar ids than it is to check
>>>for duplicate ids in the first place, but it makes generating valid eml
>>>a LOT simpler.
>>
>>Generating valid eml with only one copy of a subtree is easy -- just 
>>track whether you've already inserted it, and reference it thereafter. 
>>I don't understand at all why this is hard.  However, I do understand 
>>the problem with system not being included in the assessment of the 
>>uniqueness of the ID.  So I like the idea of pursuing Mark's suggestion (2).
>>
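
The "track whether you've already inserted it" approach can be sketched as follows. This is an illustration only, assuming Python's standard xml.etree.ElementTree; append_party, the party dictionary, and the id value "party-30" are all hypothetical, and the output is a simplified fragment rather than schema-valid EML:

```python
import xml.etree.ElementTree as ET

def append_party(parent, tag, party, emitted_ids):
    """Append <tag> under parent: full content the first time a party is
    seen, a bare <references> to its id every time after that."""
    el = ET.SubElement(parent, tag)
    party_id = party["id"]
    if party_id in emitted_ids:
        # Already written once: point at it and give this element no id.
        ET.SubElement(el, "references").text = party_id
    else:
        el.set("id", party_id)
        ET.SubElement(el, "surName").text = party["surName"]
        emitted_ids.add(party_id)
    return el

dataset = ET.Element("dataset")
emitted = set()  # ids already written into this document
person = {"id": "party-30", "surName": "Example"}

append_party(dataset, "creator", person, emitted)
append_party(dataset, "contact", person, emitted)

xml_out = ET.tostring(dataset, encoding="unicode")
print(xml_out)
```

The only state carried between calls is the set of ids already written, which is what makes this straightforward in a general-purpose language and more awkward in a stylesheet that cannot update variables.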
> 
> 
> I also like pursuing 2 regardless of how we debate over 3 and would
> support a hasty 2.02 to revise the spec documentation and add an
> optional system attribute as such:
> 
>  <references system="ces_dataset">201</references>. 
> 
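
As a rough illustration of how a system-aware lookup might behave, here is a sketch assuming Python's standard xml.etree.ElementTree. The system attribute on <references> is only the proposal under discussion, not part of the current spec; resolve and the sample document are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document using the proposed optional
# system attribute on <references>.
DOC = """
<eml>
  <dataset id="201" system="ces_dataset"><title>Example</title></dataset>
  <creator id="201" system="ces_party"><surName>Example</surName></creator>
  <a><references system="ces_dataset">201</references></a>
  <b><references>201</references></b>
</eml>
"""

def resolve(root, ref):
    """Return the single element a <references> points at, matching on id
    and, when the reference carries one, on system as well."""
    wanted_id = (ref.text or "").strip()
    wanted_system = ref.get("system")
    matches = [
        el for el in root.iter()
        if el.get("id") == wanted_id
        and (wanted_system is None or el.get("system") == wanted_system)
    ]
    if len(matches) != 1:
        raise ValueError(f"reference {wanted_id!r} matched {len(matches)} elements")
    return matches[0]

root = ET.fromstring(DOC)
qualified, unqualified = root.findall(".//references")

print(resolve(root, qualified).tag)
try:
    resolve(root, unqualified)
except ValueError as err:
    print(err)
```

The qualified reference resolves to one element, while the unqualified one still matches two and is rejected, which is why the attribute would have to be present wherever ids repeat across systems.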
> Keep in mind that when people hear you (or me, or anyone...) say "it's
> not that hard" they are thinking "sure, if you have a team of Java
> programmers!" So perhaps it would help to provide some code samples
> that can be adapted to the kind of approaches people are taking with
> more off-the-shelf tools so that people don't feel like the only way to
> work with valid eml is to use one set of tools from one shop. For
> example, the approach we take in Xanthoria for converting from RDBMS to
> xml is actually a fairly common one that appears in Cocoon, XMLSpy's
> RDBMS mapping tool, and many other vendor-specific DB->xml modules.
> Specifically, the rdbms content is exported to a generic, denormalized
> xml and then transformed with xsl to map to the desired schema. So for
> most cases, the place where this tracking needs to be done is likely to
> be in XSL. While we have found it relatively easy when parsing EML in
> XSL to follow references to find the content, we have also found
> tracking things within XSL when writing out EML to be a cumbersome
> process, let alone making sure that it comes out consistent each
> time we do it.
> 
> So if there is some xsl sample that we can easily add to xanthoria style
> sheets to solve this problem, then that's cool. Otherwise, I really think
> it would be folly to hang too long on this when we (LTER that is) have
> bigger fish to fry, namely building a better search interface for
> searching LTER data via eml. The query interface is what the CC spent
> hours talking about in Fairbanks, so if we come back in Miami with the
> ID problem solved but no improved query system, I'd prefer not to be the
> one to give that powerpoint.
> 
> 
> 
>>Matt
>>
>>
>>>Peter McCartney (peter.mccartney at asu.edu)
>>>Center for Environmental-Studies
>>>Arizona State University
>>> 
>>>
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: owner-im at lternet.edu [mailto:owner-im at lternet.edu] On
>>>>Behalf Of James W Brunt
>>>>Sent: Monday, August 30, 2004 2:57 PM
>>>>To: eml-dev at ecoinformatics.org; emlbestpractices at lternet.edu; 
>>>>im at lternet.edu
>>>>Subject: [LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat 
>>>>Harvester: Wed Aug 25 11:00:36 MDT 2004]]
>>>>
>>>>
>>>>Peter, et al.,
>>>>
>>>>Mark's email to me (below) has reinforced my own conclusion about the
>>>>id, system, references question. There are at least 2, possibly 3, issues
>>>>(bugs if you will) here to be dealt with:
>>>>
>>>>1. The eml normative documentation needs to reflect the real intent and
>>>>use of the system attribute. Read (Can O Worms). Options as I see them:
>>>>	a. deprecate the system attribute until it can be better defined -
>>>>	   ignore 2 and 3 below (Mark goes even further on this one below).
>>>>	b. clearly define the system attribute and make the changes in 2
>>>>	   and 3 below.
>>>>
>>>>2. <references> tag needs to be made system/scope aware
>>>>
>>>>3. EMLparser needs to enforce the final outcome of 1 and 2.
>>>>
>>>>Currently, the documentation introduces system, but its definition does
>>>>not supersede the unique ID requirement within a document, references
>>>>is not system aware, and EMLparser is enforcing exactly what the
>>>>documentation says.
>>>>
>>>>Turning off the ID checking as Peter has suggested (different thread)
>>>>would result in uninterpretable EML documents were the references tag
>>>>to be used (although, in all but one case in the example below, there
>>>>were no references to the IDs). I don't see this as an intermediate
>>>>solution.
>>>>
>>>>The intent, as I remember from that long discussion ago, was to create a
>>>>way to get around having to completely duplicate content in a document,
>>>>thus creating a more compact document and one that would be more
>>>>easily maintained for someone not generating the documents from a
>>>>database. I'm sure others who were present can clarify some of this.
>>>>I realize the difficulty in tracking a document ID map for
>>>>every document you automatically generate; however, I really don't
>>>>understand why you wouldn't completely duplicate the content.
>>>>However, the inclusion of a second qualifying attribute that has to be
>>>>checked for every id tag is doable, but before we begin something like
>>>>this it must be clearly spelled out and agreeable to the group(s).
>>>>We'd like to hear from eml-dev, eml-bestpractices, and im as well as
>>>>individual stakeholders.
>>>>
>>>>Thanks,
>>>>
>>>>James
>>>>
>>>>--
>>>>James W. Brunt
>>>>Associate Director for Information Management
>>>>Long Term Ecological Research Network Office
>>>>Department of Biology
>>>>University of New Mexico
>>>>Albuquerque, NM 87131-1091
>>>>505 272 7085
>>>>jbrunt at lternet.edu
>>>>
>>>>
>>>>-------- Original Message --------
>>>>From: Mark Servilla <servilla at lternet.edu>
>>>>To: James Brunt <jbrunt at lternet.edu>
>>>>Subject: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>>>11:00:36 MDT 2004]
>>>>
>>>>James,
>>>>
>>>>After reviewing the EML specification documents, it appears
>>>>to me that duplicate IDs within a single instance document are 
>>>>not valid EML, and therefore (IMHO), the EML Parser is 
>>>>behaving correctly.  I cannot see how setting either the 
>>>>SYSTEM or SCOPE attribute can be used by the REFERENCES 
>>>>element to distinguish duplicate IDs within a single document 
>>>>(perhaps someone in eml-dev can help answer how SYSTEM/SCOPE 
>>>>are used in this context).
>>>>
>>>>Some possible solutions are:
>>>>(1) Deprecate SYSTEM/SCOPE attributes in this context, update
>>>>the specification to reflect such change, and do not allow 
>>>>duplicate IDs.
>>>>(2) Modify the specification to allow SYSTEM/SCOPE to narrow 
>>>>the ID scope, thereby allowing duplicate IDs when qualified 
>>>>by either SYSTEM/SCOPE -- and, modify the specification for 
>>>>REFERENCES to make use of such change.
>>>>(3) Deprecate REFERENCES completely and force repeated content.
>>>>
>>>>Just my thoughts - thanks!
>>>>
>>>>Mark
>>>>
>>>>-------- Original Message --------
>>>>Subject: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>>>11:00:36 MDT 2004
>>>>Date: Mon, 30 Aug 2004 09:26:13 -0600
>>>>From: Mark Servilla <servilla at lternet.edu>
>>>>To: 'Corinna Gries' <corinna at asu.edu>
>>>>CC: James Brunt <jbrunt at lternet.edu>, Duane Costa <dcosta at lternet.edu>
>>>>References: <E1C0TNQ-00066I-00 at lternet.lternet.edu>
>>>>
>>>>Hi Corinna,
>>>>
>>>>I have been discussing this issue of ID attributes with James
>>>>and Duane here at LNO.  Please correct me if I am wrong, but 
>>>>the section on Reusable Content (below or
>>>>http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html#reusableContent)
>>>>states that "two identical ids cannot exist in a single 
>>>>document".  It appears that the "SYSTEM" attribute only 
>>>>allows identical ids in multiple documents within the system 
>>>>(that is, only if the repeated ids reference the exact same 
>>>>object) - something like globalizing the id'ed object to the 
>>>>system for repeated reference in one or more documents, but 
>>>>not necessarily allowing identical ids within a single 
>>>>document by changing the SYSTEM attribute value.  I am not 
>>>>really sure how one would take advantage of the SYSTEM 
>>>>attribute for reusable content.  And, I don't know the 
>>>>provenance of this particular issue (the documentation could 
>>>>certainly be more clear), but if we were to follow the 
>>>>documentation as we interpret it, would this still be a bug in 
>>>>the Harvester/Metacat software?
>>>>
>>>>Sincerely,
>>>>Mark
>>>>
>>>>3.3. Reusable Content
>>>>EML allows the reuse of previously defined structured content (DOM
>>>>sub-trees) through the use of key/keyRef type references. In
>>>>order for an EML package to remain cohesive and to allow for
>>>>the cross platform compatability of packages, the following
>>>>rules with respect to packaging must be followed.
>>>>1. An ID is required on the eml root element.
>>>>2. IDs are optional on all other elements.
>>>>3. If an ID is not provided, that content must be interpreted as
>>>>   representing a distinct object.
>>>>4. If an ID is provided for content then that content is distinct
>>>>   from all other content except for that content that references
>>>>   its ID.
>>>>5. If a user wants to reuse content to indicate the repetition of an
>>>>   object, a reference must be used. Two identical ids cannot exist
>>>>   in a single document.
>>>>6. "Document" scope is defined as identifiers unique only to a
>>>>   single instance document (if a document does not have a system
>>>>   attribute or if scope is set to 'document' then all IDs are
>>>>   defined as distinct content).
>>>>7. "System" scope is defined as identifiers unique to an entire data
>>>>   management system (if two documents share a system string, then
>>>>   any IDs in those two documents that are identical refer to the
>>>>   same object).
>>>>8. If an element references another element, it must not have an ID
>>>>   itself.
>>>>9. All EML packages must have the 'eml' module as the root.
>>>>10. The system and scope attribute are always optional except for at
>>>>    the 'eml' module where the scope attribute is fixed as 'system'.
>>>>    The scope attribute defaults to 'document' for all other modules.
>>>>
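
A few of these rules are simple enough to check mechanically. The sketch below assumes Python's standard xml.etree.ElementTree and covers only rules 1, 5 and 8 as read literally from the text above; check_reusable_content and the sample document are hypothetical, and this is not the logic of the EML parser or Metacat:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_reusable_content(xml_text):
    """Return a list of human-readable problems found in xml_text."""
    root = ET.fromstring(xml_text)
    problems = []

    # Rule 1: an ID is required on the root element.
    if root.get("id") is None:
        problems.append("root element has no id")

    # Rule 5: two identical ids cannot exist in a single document.
    counts = Counter(el.get("id") for el in root.iter()
                     if el.get("id") is not None)
    for id_value, n in counts.items():
        if n > 1:
            problems.append(f"id {id_value} occurs {n} times")

    # Rule 8: an element that references another must not have an id.
    for el in root.iter():
        if el.get("id") is not None and el.find("references") is not None:
            problems.append(f"<{el.tag}> has an id and a <references> child")

    return problems

# Made-up, simplified document that breaks rules 5 and 8.
SAMPLE = (
    '<eml id="pkg.1">'
    '<dataset id="30">'
    '<creator id="30"><surName>Example</surName></creator>'
    '<contact id="c1"><references>30</references></contact>'
    '</dataset>'
    '</eml>'
)

for problem in check_reusable_content(SAMPLE):
    print(problem)
```

The id comparison here ignores the system attribute entirely, which matches the literal wording of rule 5 and is the behavior the harvester report below complains about.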
>>>>Duane Costa wrote:
>>>>
>>>>>Could anyone comment as to whether the EML error reported by Metacat
>>>>>below is a genuine EML error versus a bug in Metacat or the EML
>>>>>validator program? The issue is whether the id value for <dataset>
>>>>>must be unique from the id value for <creator>.
>>>>>
>>>>>Thanks,
>>>>>Duane
>>>>>
>>>>>-----Original Message-----
>>>>>From: Corinna Gries [mailto:corinna at asu.edu]
>>>>>Sent: Thursday, August 26, 2004 3:48 PM
>>>>>To: dcosta at lternet.edu
>>>>>Subject: RE: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
>>>>>
>>>>>Hi Duane,
>>>>>
>>>>>I am trying to fix these problems with our eml files. Some are easy
>>>>>because they are actual errors in our files, but there is one where I
>>>>>wonder if the ID checking is right. I understood IDs should be unique
>>>>>within the system, that is for example:
>>>>>
>>>>><dataset id="30" system="ces_dataset"> ... Is different from
>>>>><creator id="30" system="ces_party"> ....
>>>>>
>>>>>However, your harvester complains that they are the same:
>>>>>
>>>>>
>>>>>*****************************************************************************
>>>>>*
>>>>>* METACAT HARVESTER REPORT: Wed Aug 25 11:00:36 MDT 2004
>>>>>*
>>>>>* A TOTAL OF 22 ERRORS WERE DETECTED.
>>>>>* Please see the log entries below for additonal details.
>>>>>*
>>>>>*****************************************************************************
>>>>>
>>>>>*****************************************************************************
>>>>>*
>>>>>* harvestLogID:         5549
>>>>>* harvestDate:          Wed Aug 25 11:00:36 MDT 2004
>>>>>* status:               1
>>>>>* message:              
>>>>>* harvestOperationCode: InsertDocError
>>>>>* description:          Error inserting EML document to Metacat
>>>>>* detailLogID:          383
>>>>>* errorMessage:         MetacatException: <?xml version="1.0"?>
>>>>><error>
>>>>>Error running xpath expression:
>>>>>//dateTimeDomain|//nonNumericDomain|//numericDomain|//access|
>>>>>//attributeList|//constraint|//coverage|//temporalCoverage|
>>>>>//geographicCoverage|//taxonomicCoverage|/dataset|/eml/dataset|
>>>>>//dataSource|//dataTable|//otherEntity|//citation|//address|
>>>>>//conferenceLocation|//party|//originator|//creator|//contact|
>>>>>//publisher|//editor|//recipient|//performer|//institution|
>>>>>//metadataProvider|//associatedParty|//personnel|//physical|
>>>>>//connectionDefinition|//distribution|//researchProject|//project|
>>>>>//relatedProject|//software|//spatialRaster|//spatialReference|
>>>>>//spatialVector|//storedProcedure|//view|//protocol|
>>>>>//additionalMetadata : Error in xml
>>>>>document.  This EML document is not valid because the id 30 occurs
>>>>>more than once.  IDs must be unique. </error>
>>>>>
>>>>>* scope:                ces_dataset
>>>>>* identifier:           30
>>>>>* revision:             1
>>>>>* documentType:         eml://ecoinformatics.org/eml-2.0.0
>>>>>* documentURL:
>>>>>http://seinet.asu.edu/DataCatalog/getXanthoriaRecord.jsp?source=ces_dataset_mohave&id=30
>>>>>*
>>>>>*****************************************************************************
>>>>>What do you think?
>>>>>
>>>>>Corinna
>>>>>
>>>>>_______________________________________________
>>>>>eml-dev mailing list
>>>>>eml-dev at ecoinformatics.org
>>>>>http://www.ecoinformatics.org/mailman/listinfo/eml-dev
>>>>
>>>>--
>>>>Mark Servilla, Ph.D.
>>>>
>>>>LTER Network Office
>>>>Department of Biology
>>>>MSC 03 2020
>>>>1 University of New Mexico
>>>>Albuquerque, NM 87131-0001
>>>>
>>>>servilla at lternet.edu
>>>>Office (505) 277-2619
>>>>Cell   (505) 453-8593
>>>>
>>>>
>>>>
>>>>-------------------------------------------------
>>>>Long-Term Ecological Research Network Mailing List
>>>>im at LTERnet.edu http://sql.lternet.edu/cgi/mailgroups_view.pl?im
>>>>