[LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004]]

Tue Aug 31 13:43:32 PDT 2004

Hi James,

Yes, that's exactly the problem.  Peter is proposing to solve it by 
taking the *first* of the redundant trees. But, which is first depends 
on whether you traverse the document in breadth-first order or 
depth-first order.  That, to me, is just asking for trouble -- we'd be 
asking people to remember to put the subtree they want referenced in the 
"depth-first" first node, which can change as the structure of the tree 
changes.  Hard to do and harder to maintain.
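
To make the ordering problem concrete, here is a minimal Python sketch. The toy document and the two helper functions are made up for illustration; they are not taken from any real EML file or from the EML parser. It shows the same duplicate id resolving to different subtrees depending on traversal order:

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical toy document: two subtrees share id="1" but sit at
# different depths of the tree.
DOC = """
<eml>
  <dataset>
    <creator id="1">foo</creator>
  </dataset>
  <contact id="1">bar</contact>
</eml>
"""

def first_depth_first(root, wanted_id):
    """Return the first element with the given id in document (depth-first) order."""
    for elem in root.iter():  # Element.iter() walks in document order
        if elem.get("id") == wanted_id:
            return elem
    return None

def first_breadth_first(root, wanted_id):
    """Return the first element with the given id in level-by-level order."""
    queue = deque([root])
    while queue:
        elem = queue.popleft()
        if elem.get("id") == wanted_id:
            return elem
        queue.extend(elem)  # enqueue the children for the next level
    return None

root = ET.fromstring(DOC)
dfs = first_depth_first(root, "1")
bfs = first_breadth_first(root, "1")
print(dfs.tag, dfs.text)  # creator foo
print(bfs.tag, bfs.text)  # contact bar
```

Moving <contact> inside another wrapper element would flip the breadth-first answer, which is the maintenance hazard described above.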

Also, if we do it this way, we should probably check to be sure that two 
subtrees that have identical ids also have identical content, which is 
not a trivial programming task (assuming they are identical could easily 
lead to conflicting information).
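
As a rough idea of what that content check involves, here is a Python sketch under simplifying assumptions: it compares tag, attributes, stripped text, and children recursively, and it ignores tail text, namespaces, and comments. The function names are hypothetical and this is not how the EML parser works:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def normalized(elem):
    """Hashable form of a subtree; ignores attribute order and whitespace padding.

    Simplification: tail text (mixed content) is not compared.
    """
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(normalized(child) for child in elem),
    )

def conflicting_ids(root):
    """Return the ids that are attached to subtrees with differing content."""
    forms_by_id = defaultdict(set)
    for elem in root.iter():
        if "id" in elem.attrib:
            forms_by_id[elem.get("id")].add(normalized(elem))
    return sorted(i for i, forms in forms_by_id.items() if len(forms) > 1)

same = ET.fromstring('<eml><a id="1">foo</a><a id="1"> foo </a></eml>')
diff = ET.fromstring('<eml><a id="1">foo</a><a id="1">bar</a></eml>')
print(conflicting_ids(same))  # []
print(conflicting_ids(diff))  # ['1']
```

Even this simplified version has to decide what "identical" means (whitespace, attribute order, mixed content), which is where the real work would be.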

I would far prefer to keep the links unambiguous (i.e., references can 
always be resolved to one and only one id).  If someone doesn't want to 
deal with that stuff, they can always omit the ids and just duplicate 
the content, which is why we made the ids optional originally.
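
By comparison, the uniqueness check is short. The following Python sketch is only an illustration of the rule "each reference resolves to exactly one id". The function name and the sample document are hypothetical, and it deliberately ignores the system attribute, as the current spec wording does:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_ids(root):
    """Return (duplicate ids, references that match no id) for one document."""
    counts = Counter(
        elem.get("id") for elem in root.iter() if elem.get("id") is not None
    )
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    referenced = {(ref.text or "").strip() for ref in root.iter("references")}
    dangling = sorted(referenced - set(counts))
    return duplicates, dangling

# Hypothetical sample: id "1" is used twice, and "9" is referenced but never defined.
root = ET.fromstring(
    '<eml id="pkg">'
    '<a id="1">foo</a><a id="1">bar</a>'
    '<b><references>1</references></b>'
    '<c><references>9</references></c>'
    '</eml>'
)
duplicates, dangling = check_ids(root)
print(duplicates)  # ['1']
print(dangling)    # ['9']
```

A document passes when both lists are empty. Because system is ignored here, the dataset/creator example with id="30" from earlier in the thread would be reported as a duplicate.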

Matt

James W Brunt wrote:
> Just a clarification...The specific error example we have been 
> discussing is concerning two identical ids with different content...
> 
> <dataset id="30" system="ces_dataset"> ... Is different from
> <creator id="30" system="ces_party"> ....
> 
> Admittedly, were the content the same we would still get the error (if 
> the parser is written to the spec). However, if there were (in this case 
> there wasn't) a
> 
> <references>30</references>
> 
> it would be ambiguous. Correct?
> 
> James
> 
> Peter McCartney wrote:
> 
>> On Tue, 2004-08-31 at 11:35, Matt Jones wrote:
>>
>>>> It would really help me justify the extra work involved in managing ids
>>>> and references if someone could give me a concrete example of why it
>>>> would be bad to have a document contain two elements with identical ids
>>>> and identical content. 
>>>
>>>
>>> As in relational systems, the key (id) acts as a surrogate 
>>> for the content.  So, references should resolve to one (and only one) 
>>> id. It is far harder to validate that the content is the same between 
>>> two nodes with identical keys than it is to validate that no key is 
>>> duplicated.  I think they got this right in the relational model, and 
>>> we should follow that lead.  If you allow duplicate ids, then I am 
>>> sure this situation will arise:
>>>
>>> <a id="1">foo</a>
>>> <a id="1">bar</a>
>>> <b><references>1</references></b>
>>>
>>> What is the value of <b>?  foo, or bar?  It is indeterminate.  And 
>>> this is precisely why this is a problem.
>>
>>
>>
>> I agree this would be bad, but this is not what is happening. The
>> documents that are being rejected have:
>> <a id="1">foo</a>
>> <a id="1">foo</a>
>> Typically, when this happens, the code is obviously not bothering with
>> references tags, so we aren't likely to create broken or ambiguous
>> reference tags. Even if we did throw in a
>> <b><references>1</references></b>, it really wouldn't be a problem. In
>> some of our files where attributes are repeated in view entities, we are
>> also getting this:
>>
>> <a id="1">foo</a>
>> <a id="2">foo</a>
>>
>> but your parser hasn't spotted that one yet :) and again, even though it
>> violates the spec, I would contend that this causes no problem.
>>
>>
>>>> If my xpath returns one or several nodes and they
>>>> are all identical, why is it so bad to just assume that the rule is:
>>>> "identical id (and system) means identical content" and just use the
>>>> first one in the list?
>>>
>>>
>>> Because relational models have shown that this never works.  I think 
>>> that such an assumption will result in lots of broken docs.
>>>
>>>> I think it is no more work to write parsers to
>>>> check for differences between nodes with similar ids than it is to check
>>>> for duplicate ids in the first place, but it makes generating valid eml
>>>> a LOT simpler.
>>>
>>>
>>> Generating valid eml with only one copy of a subtree is easy -- just 
>>> track whether you've already inserted it, and reference it 
>>> thereafter. I don't understand at all why this is hard.  However, I 
>>> do understand the problem with system not being included in the 
>>> assessment of the uniqueness of the ID.  So I like the idea of 
>>> pursuing Mark's suggestion (2).
>>>
>>
>>
>> I also like pursuing 2 regardless of how we debate over 3 and would
>> support a hasty 2.02 to revise the spec documentation and add an
>> optional system attribute as such:
>>
>>  <references system="ces_dataset">201</references>.
>> Keep in mind that when people hear you (or me, or anyone...) say "it's
>> not that hard" they are thinking "sure, if you have a team of Java
>> programmers!". So perhaps it would help to provide some code samples
>> that can be adapted to the kind of approaches people are taking with
>> more off-the-shelf tools so that people don't feel like the only way to
>> work with valid eml is to use one set of tools from one shop. For
>> example, the approach we take in Xanthoria for converting from RDBMS to
>> xml is actually a fairly common one that appears in Cocoon, XMLSpy's
>> RDBMS mapping tool, and many other vendor-specific DB->xml modules.
>> Specifically, the rdbms content is exported to a generic, denormalized
>> xml and then transformed with xsl to map to the desired schema. So for
>> most cases, the place where this tracking needs to be done is likely to
>> be in XSL. While we have found it relatively easy when parsing EML in
>> XSL to follow references to find the content, we have also found that
>> tracking things within xsls when writing out eml to be a cumbersome
>> process, let alone making sure that each time we do it it is going to
>> come out consistent.
>> So if there is some xsl sample that we can easily add to Xanthoria style
>> sheets to solve this problem, then that's cool. Otherwise, I really think
>> it would be folly to hang too long on this when we (LTER that is) have
>> bigger fish to fry, namely, building a better search interface for
>> searching LTER data via eml. The query interface is what the CC spent
>> hours talking about in Fairbanks, so if we come back in Miami with the
>> ID problem solved but no improved query system, I'd prefer not to be the
>> one to give that powerpoint.
>>
>>
>>
>>> Matt
>>>
>>>
>>>> Peter McCartney (peter.mccartney at asu.edu)
>>>> Center for Environmental-Studies
>>>> Arizona State University
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: owner-im at lternet.edu [mailto:owner-im at lternet.edu] On
>>>>> Behalf Of James W Brunt
>>>>> Sent: Monday, August 30, 2004 2:57 PM
>>>>> To: eml-dev at ecoinformatics.org; emlbestpractices at lternet.edu; 
>>>>> im at lternet.edu
>>>>> Subject: [LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat 
>>>>> Harvester: Wed Aug 25 11:00:36 MDT 2004]]
>>>>>
>>>>>
>>>>> Peter, et al.,
>>>>>
>>>>> Mark's email to me (below) has reinforced my own conclusion about the
>>>>> id, system, references question. There are at least 2, possibly 3, issues 
>>>>> (bugs if you will) here to be dealt with:
>>>>>
>>>>> 1. The eml normative documentation needs to reflect the real
>>>>> intent and use of the system attribute. Read (Can O Worms). Options 
>>>>> as I see them:
>>>>>     a. deprecate the system attribute until it can be better 
>>>>> defined - ignore 2 and 3 below (Mark goes even further on this one 
>>>>> below).
>>>>>     b. clearly define the system attribute and make the changes in 
>>>>> 2 and 3 below.
>>>>>
>>>>> 2. <references> tag needs to be made system/scope aware
>>>>>
>>>>> 3. EMLparser needs to enforce the final outcome of 1 and 2.
>>>>>
>>>>> Currently, the documentation introduces system but its
>>>>> definition does not supersede the unique ID requirement within a 
>>>>> document, references is not system aware, and EMLparser is enforcing 
>>>>> exactly what the documentation says.
>>>>>
>>>>> Turning off the ID checking as Peter has suggested (different thread)
>>>>> would  result in uninterpretable EML documents were the references 
>>>>> tag to be used (Although, in all but one case in the example below 
>>>>> there were no references to the IDs). I don't see this as an 
>>>>> intermediate solution.
>>>>>
>>>>> The intent as I remember all that long discussion ago was to create a
>>>>> way to get around having to completely duplicate content in a 
>>>>> document. Thus creating a more compact document and one that would 
>>>>> be more easily maintained for someone not generating the documents 
>>>>> from a database. I'm sure I can be clarified some here by others 
>>>>> that were present. I realize the difficulty in tracking a document 
>>>>> ID map for every document you automatically generate however I 
>>>>> really don't understand why you wouldn't completely duplicate the 
>>>>> content. However, the inclusion of a second qualifying attribute 
>>>>> that has to be checked for every id tag is doable but before we 
>>>>> begin something like this it must be clearly spelled-out and 
>>>>> agreeable to the group(s). We'd like to hear from eml-dev, 
>>>>> eml-bestpractices, and im as well as individual stakeholders.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>> -- 
>>>>> James W. Brunt
>>>>> Associate Director for Information Management
>>>>> Long Term Ecological Research Network Office
>>>>> Department of Biology
>>>>> University of New Mexico
>>>>> Albuquerque, NM 87131-1091
>>>>> 505 272 7085
>>>>> jbrunt at lternet.edu
>>>>>
>>>>>
>>>>> -------- Original Message --------
>>>>> From: Mark Servilla <servilla at lternet.edu>
>>>>> To: James Brunt <jbrunt at lternet.edu>
>>>>> Subject: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>>>> 11:00:36 MDT 2004]
>>>>>
>>>>> James,
>>>>>
>>>>> After reviewing the EML specification documents, it appears
>>>>> to me that duplicate IDs within a single instance document are not 
>>>>> valid EML, and therefore (IMHO), the EML Parser is behaving 
>>>>> correctly.  I cannot see how setting either the SYSTEM or SCOPE 
>>>>> attribute can be used by the REFERENCES element to distinguish 
>>>>> duplicate IDs within a single document (perhaps someone in eml-dev 
>>>>> can help answer how SYSTEM/SCOPE are used in this context).
>>>>>
>>>>> Some possible solutions are:
>>>>> (1) Deprecate SYSTEM/SCOPE attributes in this context, update
>>>>> the specification to reflect such change, and do not allow 
>>>>> duplicate IDs.
>>>>> (2) Modify the specification to allow SYSTEM/SCOPE to narrow the ID 
>>>>> scope, thereby allowing duplicate IDs when qualified by either 
>>>>> SYSTEM/SCOPE -- and, modify the specification for REFERENCES to 
>>>>> make use of such change.
>>>>> (3) Deprecate REFERENCES completely and force repeated content.
>>>>>
>>>>> Just my thoughts - thanks!
>>>>>
>>>>> Mark
>>>>>
>>>>> -------- Original Message --------
>>>>> Subject: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>>>> 11:00:36 MDT 2004
>>>>> Date: Mon, 30 Aug 2004 09:26:13 -0600
>>>>> From: Mark Servilla <servilla at lternet.edu>
>>>>> To: 'Corinna Gries' <corinna at asu.edu>
>>>>> CC: James Brunt <jbrunt at lternet.edu>, Duane Costa <dcosta at lternet.edu>
>>>>> References: <E1C0TNQ-00066I-00 at lternet.lternet.edu>
>>>>>
>>>>> Hi Corinna,
>>>>>
>>>>> I have been discussing this issue of ID attributes with James
>>>>> and Duane here at LNO.  Please correct me if I am wrong, but the 
>>>>> section on Reusable Content (below or
>>>>> http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html#reusableContent)
>>>>> states that "two identical ids cannot exist in a single document".  
>>>>> It appears that the "SYSTEM" attribute only allows identical ids in 
>>>>> multiple documents within the system (that is, only if the repeated 
>>>>> ids reference the exact same object) - something like globalizing 
>>>>> the id'ed object to the system for repeated reference in one or 
>>>>> more documents, but not necessarily allowing identical ids within a 
>>>>> single document by changing the SYSTEM attribute value.  I am not 
>>>>> really sure how one would take advantage of the SYSTEM attribute 
>>>>> for reusable content.  And, I don't know the provenance of this 
>>>>> particular issue (the documentation could certainly be more clear), 
>>>>> but if we were to follow the documentation as we interpret, would 
>>>>> this still be a bug in the Harvester/Metacat software?
>>>>>
>>>>> Sincerely,
>>>>> Mark
>>>>>
>>>>> 3.3. Reusable Content
>>>>> EML allows the reuse of previously defined structured content (DOM
>>>>> sub-trees) through the use of key/keyRef type references. In
>>>>> order for an EML package to remain cohesive and to allow for the 
>>>>> cross platform compatability of packages, the following rules with 
>>>>> respect to packaging must be followed.
>>>>>
>>>>> 1. An ID is required on the eml root element.
>>>>> 2. IDs are optional on all other elements.
>>>>> 3. If an ID is not provided, that content must be interpreted as 
>>>>>    representing a distinct object.
>>>>> 4. If an ID is provided for content then that content is distinct 
>>>>>    from all other content except for that content that references 
>>>>>    its ID.
>>>>> 5. If a user wants to reuse content to indicate the repetition of 
>>>>>    an object, a reference must be used. Two identical ids cannot 
>>>>>    exist in a single document.
>>>>> 6. "Document" scope is defined as identifiers unique only to a 
>>>>>    single instance document (if a document does not have a system 
>>>>>    attribute or if scope is set to 'document' then all IDs are 
>>>>>    defined as distinct content).
>>>>> 7. "System" scope is defined as identifiers unique to an entire 
>>>>>    data management system (if two documents share a system string, 
>>>>>    then any IDs in those two documents that are identical refer to 
>>>>>    the same object).
>>>>> 8. If an element references another element, it must not have an 
>>>>>    ID itself.
>>>>> 9. All EML packages must have the 'eml' module as the root.
>>>>> 10. The system and scope attribute are always optional except for 
>>>>>     at the 'eml' module where the scope attribute is fixed as 
>>>>>     'system'. The scope attribute defaults to 'document' for all 
>>>>>     other modules.
>>>>>
>>>>> Duane Costa wrote:
>>>>>
>>>>>
>>>>>> Could anyone comment as to whether the EML error reported by Metacat
>>>>>> below is a genuine EML error versus a bug in Metacat or the EML
>>>>>> validator program? The issue is whether the id value for <dataset> 
>>>>>> must be unique from the id value for <creator>.
>>>>>>
>>>>>> Thanks,
>>>>>> Duane
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Corinna Gries [mailto:corinna at asu.edu]
>>>>>> Sent: Thursday, August 26, 2004 3:48 PM
>>>>>> To: dcosta at lternet.edu
>>>>>> Subject: RE: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
>>>>>>
>>>>>> Hi Duane,
>>>>>>
>>>>>> I am trying to fix these problems with our eml files. Some are easy
>>>>>> because they are actual errors in our files, but there is one where I
>>>>>> wonder if the ID checking is right. I understood IDs should be unique
>>>>>> within the system, that is for example:
>>>>>>
>>>>>> <dataset id="30" system="ces_dataset"> ... Is different from
>>>>>> <creator id="30" system="ces_party"> ....
>>>>>>
>>>>>> However, your harvester complains that they are the same:
>>>>>>
>>>>>> ************************************************************************
>>>>>> *
>>>>>> * METACAT HARVESTER REPORT: Wed Aug 25 11:00:36 MDT 2004
>>>>>> *
>>>>>> * A TOTAL OF 22 ERRORS WERE DETECTED.
>>>>>> * Please see the log entries below for additonal details.
>>>>>> *
>>>>>> ************************************************************************
>>>>>> ************************************************************************
>>>>>> *
>>>>>> * harvestLogID:         5549
>>>>>> * harvestDate:          Wed Aug 25 11:00:36 MDT 2004
>>>>>> * status:               1
>>>>>> * message:
>>>>>> * harvestOperationCode: InsertDocError
>>>>>> * description:          Error inserting EML document to Metacat
>>>>>> * detailLogID:          383
>>>>>> * errorMessage:         MetacatException: <?xml version="1.0"?>
>>>>>> <error>
>>>>>> Error running xpath expression:
>>>>>> //dateTimeDomain|//nonNumericDomain|//numericDomain|//access|
>>>>>> //attributeList|//constraint|//coverage|//temporalCoverage|
>>>>>> //geographicCoverage|//taxonomicCoverage|/dataset|/eml/dataset|
>>>>>> //dataSource|//dataTable|//otherEntity|//citation|//address|
>>>>>> //conferenceLocation|//party|//originator|//creator|//contact|
>>>>>> //publisher|//editor|//recipient|//performer|//institution|
>>>>>> //metadataProvider|//associatedParty|//personnel|//physical|
>>>>>> //connectionDefinition|//distribution|//researchProject|//project|
>>>>>> //relatedProject|//software|//spatialRaster|//spatialReference|
>>>>>> //spatialVector|//storedProcedure|//view|//protocol|
>>>>>> //additionalMetadata : Error in xml
>>>>>> document.  This EML document is not valid because the id 30 occurs
>>>>>> more than once.  IDs must be unique. </error>
>>>>>>
>>>>>> * scope:                ces_dataset
>>>>>> * identifier:           30
>>>>>> * revision:             1
>>>>>> * documentType:         eml://ecoinformatics.org/eml-2.0.0
>>>>>> * documentURL:
>>>>>> http://seinet.asu.edu/DataCatalog/getXanthoriaRecord.jsp?source=ces_dataset_mohave&id=30
>>>>>> *
>>>>>> ************************************************************************
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Corinna
>>>>>>
>>>>>> _______________________________________________
>>>>>> eml-dev mailing list
>>>>>> eml-dev at ecoinformatics.org
>>>>>> http://www.ecoinformatics.org/mailman/listinfo/eml-dev
>>>>>
>>>>>
>>>>> -- 
>>>>> Mark Servilla, Ph.D.
>>>>>
>>>>> LTER Network Office
>>>>> Department of Biology
>>>>> MSC 03 2020
>>>>> 1 University of New Mexico
>>>>> Albuquerque, NM 87131-0001
>>>>>
>>>>> servilla at lternet.edu
>>>>> Office (505) 277-2619
>>>>> Cell   (505) 453-8593
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> James W. Brunt
>>>>> Associate Director for Information Management
>>>>> Long Term Ecological Research Network Office
>>>>> Department of Biology
>>>>> University of New Mexico
>>>>> Albuquerque, NM 87131-1091
>>>>> 505 272 7085
>>>>> jbrunt at lternet.edu
>>>>>
>>>>> -------------------------------------------------
>>>>> Long-Term Ecological Research Network Mailing List
>>>>> im at LTERnet.edu http://sql.lternet.edu/cgi/mailgroups_view.pl?im
>>>>>
>>>>

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------