[LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004]]
James W Brunt
jbrunt at lternet.edu
Tue Aug 31 13:31:38 PDT 2004
Just a clarification...The specific error example we have been
discussing is concerning two identical ids with different content...
<dataset id="30" system="ces_dataset"> ... Is different from
<creator id="30" system="ces_party"> ....
Admittedly, were the content the same we would still get the error (if
the parser is written to the spec). However, if there were (in this case
there wasn't) a
<references>30</references>
it would be ambiguous. Correct?
James
Peter McCartney wrote:
> On Tue, 2004-08-31 at 11:35, Matt Jones wrote:
>
>>>It would really help me justify the extra work involved in managing ids
>>>and references if someone could give me a concrete example of why it
>>>would be bad to have a document contain two elements with identical ids
>>>and identical content.
>>
>>Like in other relational systems, the key (id) acts as a surrogate for
>>the content. So, references should resolve to one (and only one) id.
>>It is far harder to validate that the content is the same between two
>>nodes with identical keys than it is to validate that no key is
>>duplicated. I think they got this right in the relational model, and we
>>should follow that lead. If you allow duplicate ids, then I am sure
>>this situation will arise:
>>
>><a id="1">foo</a>
>><a id="1">bar</a>
>><b><references>1</references></b>
>>
>>What is the value of <b>? foo, or bar? It is indeterminate. And this
>>is precisely why this is a problem.
>
>
> I agree this would be bad, but this is not what is happening. The
> documents that are being rejected have:
> <a id="1">foo</a>
> <a id="1">foo</a>
> Typically, when this happens, the code is obviously not bothering with
> references tags, so we aren't likely to create broken or ambiguous
> reference tags. Even if we did throw in a
> <b><references>1</references></b>, it really wouldn't be a problem. In
> some of our files where attributes are repeated in view entities, we are
> also getting this:
>
> <a id="1">foo</a>
> <a id="2">foo</a>
>
> but your parser hasn't spotted that one yet :) and again, even though it
> violates the spec, I would contend that this causes no problem.
>
>
>>>If my xpath returns one or several nodes and they
>>>are all identical, why is it so bad to just assume that the rule is:
>>>"identical id (and system) means identical content" and just use the
>>>first one in the list?
>>
>>Because relational models have shown that this never works. I think
>>that such an assumption will result in lots of broken docs.
>>
>>>I think it is no more work to write parsers to
>>>check for differences between nodes with similar ids than it is to check
>>>for duplicate ids in the first place, but it makes generating valid eml
>>>a LOT simpler.
>>
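
[Editorial sketch] For comparison, here is roughly what the "check for differences between nodes with the same id" approach could look like. It is a sketch only, assuming Python 3.8 or later (for ET.canonicalize) and assuming that "identical content" means identical canonical XML; the function names are hypothetical and nothing here reflects how EMLparser works.

```python
import copy
import xml.etree.ElementTree as ET
from collections import defaultdict

def canonical_form(elem):
    """Serialize one element to canonical XML so that attribute order and
    insignificant whitespace do not count as differences."""
    clone = copy.copy(elem)
    clone.tail = None  # text following the element is not part of its content
    return ET.canonicalize(ET.tostring(clone, encoding="unicode"), strip_text=True)

def conflicting_ids(xml_text):
    """Return ids that occur more than once with differing content."""
    root = ET.fromstring(xml_text)
    forms = defaultdict(set)
    for elem in root.iter():
        if "id" in elem.attrib:
            forms[elem.attrib["id"]].add(canonical_form(elem))
    return sorted(i for i, seen in forms.items() if len(seen) > 1)

same = '<doc><a id="1">foo</a><a id="1">foo</a></doc>'
different = '<doc><a id="1">foo</a><a id="1">bar</a></doc>'
print(conflicting_ids(same))
print(conflicting_ids(different))
```

Note that this check has to serialize and compare whole subtrees, whereas the uniqueness check only has to count id values.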
>>Generating valid eml with only one copy of a subtree is easy -- just
>>track whether you've already inserted it, and reference it thereafter.
>>I don't understand at all why this is hard. However, I do understand
>>the problem with system not being included in the assessment of the
>>uniqueness of the ID. So I like the idea of pursuing Mark's suggestion (2).
>>
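
[Editorial sketch] The "track whether you've already inserted it, and reference it thereafter" idea can be sketched as follows. The row layout, the <entry> wrapper, and the function name are assumptions for illustration; the output is deliberately simplified and is not schema-valid EML.

```python
import xml.etree.ElementTree as ET

def build_document(rows):
    """rows: denormalized (entry_name, creator_id, creator_name) tuples,
    a hypothetical shape for a flat database export."""
    root = ET.Element("dataset")
    emitted = set()  # ids whose full content has already been written
    for entry_name, creator_id, creator_name in rows:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "name").text = entry_name
        creator = ET.SubElement(entry, "creator")
        if creator_id in emitted:
            # Later occurrence: point back at the first copy, carry no id.
            ET.SubElement(creator, "references").text = creator_id
        else:
            # First occurrence: write the content once, with its id.
            creator.set("id", creator_id)
            ET.SubElement(creator, "individualName").text = creator_name
            emitted.add(creator_id)
    return root

rows = [("table_1", "p30", "A. Person"), ("table_2", "p30", "A. Person")]
out = ET.tostring(build_document(rows), encoding="unicode")
print(out)
```

The only state carried between rows is the set of ids already written, which is the part that is awkward to express in a pure XSL transform.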
>
>
> I also like pursuing 2 regardless of how we debate over 3 and would
> support a hasty 2.02 to revise the spec documentation and add an
> optional system attribute as such:
>
> <references system="ces_dataset">201</references>.
>
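
[Editorial sketch] If the proposed optional system attribute on <references> were adopted, resolution could key on the (system, id) pair instead of the id alone. The sketch below is hypothetical throughout: the attribute on <references> is only a proposal in this thread, and the sample document, element content, and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the proposed system-qualified reference.
DOC = """<eml>
  <dataset id="30" system="ces_dataset"><title>example dataset</title></dataset>
  <creator id="30" system="ces_party"><surName>example party</surName></creator>
  <contact><references system="ces_party">30</references></contact>
</eml>"""

def build_index(root):
    """Index elements by (system, id); a repeated pair is an error."""
    index = {}
    for elem in root.iter():
        if "id" in elem.attrib:
            key = (elem.get("system"), elem.get("id"))
            if key in index:
                raise ValueError(f"duplicate (system, id) pair: {key}")
            index[key] = elem
    return index

def resolve(ref, index):
    """Look up the element a <references> node points at."""
    key = (ref.get("system"), (ref.text or "").strip())
    return index[key]

root = ET.fromstring(DOC)
index = build_index(root)
target = resolve(root.find("./contact/references"), index)
print(target.tag)
```

Under this scheme the two elements with id 30 no longer collide, because their system values differ, and the reference resolves to the creator.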
> Keep in mind that when people hear you (or me, or anyone...) say "it's
> not that hard" they are thinking "sure, if you have a team of Java
> programmers!". So perhaps it would help to provide some code samples
> that can be adapted to the kind of approaches people are taking with
> more off-the-shelf tools so that people don't feel like the only way to
> work with valid eml is to use one set of tools from one shop. For
> example, the approach we take in Xanthoria for converting from RDBMS to
> xml is actually a fairly common one that appears in Cocoon, XMLSpy's
> RDBMS mapping tool, and many other vendor-specific DB->xml modules.
> Specifically, the rdbms content is exported to a generic, denormalized
> xml and then transformed with xsl to map to the desired schema. So for
> most cases, the place where this tracking needs to be done is likely to
> be in XSL. While we have found it relatively easy when parsing EML in
> XSL to follow references to find the content, we have also found
> tracking things within XSL when writing out EML to be a cumbersome
> process, let alone making sure that each time we do it, it is going to
> come out consistent.
>
> So if there is some xsl sample that we can easily add to Xanthoria style
> sheets to solve this problem, then that's cool. Otherwise, I really think
> it would be folly to hang too long on this when we (LTER that is) have
> bigger fish to fry. Namely, building a better search interface for
> searching LTER data via eml. The query interface is what the CC spent
> hours talking about in Fairbanks, so if we come back in Miami with the
> ID problem solved but no improved query system, I'd prefer not to be the
> one to give that powerpoint.
>
>
>
>>Matt
>>
>>
>>>Peter McCartney (peter.mccartney at asu.edu)
>>>Center for Environmental-Studies
>>>Arizona State University
>>>
>>>
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: owner-im at lternet.edu [mailto:owner-im at lternet.edu] On
>>>>Behalf Of James W Brunt
>>>>Sent: Monday, August 30, 2004 2:57 PM
>>>>To: eml-dev at ecoinformatics.org; emlbestpractices at lternet.edu;
>>>>im at lternet.edu
>>>>Subject: [LTER-im] [Fwd: [Fwd: Re: FW: Report from Metacat
>>>>Harvester: Wed Aug 25 11:00:36 MDT 2004]]
>>>>
>>>>
>>>>Peter, et al.,
>>>>
>>>>Mark's email to me (below) has reinforced my own conclusion about the
>>>>id, system, references question. There are at least 2, possibly 3, issues
>>>>(bugs if you will) here to be dealt with:
>>>>
>>>>1. The eml normative documentation needs to reflect the real intent and
>>>>use of the system attribute. Read (Can O Worms). Options as I see them:
>>>>   a. deprecate the system attribute until it can be better defined -
>>>>   ignore 2 and 3 below (Mark goes even further on this one below).
>>>>   b. clearly define the system attribute and make the changes in 2 and 3
>>>>   below.
>>>>
>>>>2. <references> tag needs to be made system/scope aware
>>>>
>>>>3. EMLparser needs to enforce the final outcome of 1 and 2.
>>>>
>>>>Currently, the documentation introduces system but its definition does
>>>>not supersede the unique ID requirement within a document, references
>>>>is not system aware, and EMLparser is enforcing exactly what the
>>>>documentation says.
>>>>
>>>>Turning off the ID checking as Peter has suggested (different thread)
>>>>would result in uninterpretable EML documents were the references tag
>>>>to be used (although, in all but one case in the example below, there
>>>>were no references to the IDs). I don't see this as an intermediate
>>>>solution.
>>>>
>>>>The intent, as I remember from that long discussion ago, was to create a
>>>>way to get around having to completely duplicate content in a document,
>>>>thus creating a more compact document and one that would be more
>>>>easily maintained for someone not generating the documents from a
>>>>database. I'm sure others who were present can clarify this. I realize
>>>>the difficulty in tracking a document ID map for every document you
>>>>automatically generate; however, I really don't understand why you
>>>>wouldn't completely duplicate the content. However, the inclusion of a
>>>>second qualifying attribute that has to be checked for every id tag is
>>>>doable, but before we begin something like this it must be clearly
>>>>spelled out and agreeable to the group(s). We'd like to hear from
>>>>eml-dev, eml-bestpractices, and im as well as individual stakeholders.
>>>>
>>>>Thanks,
>>>>
>>>>James
>>>>
>>>>--
>>>>James W. Brunt
>>>>Associate Director for Information Management
>>>>Long Term Ecological Research Network Office
>>>>Department of Biology
>>>>University of New Mexico
>>>>Albuquerque, NM 87131-1091
>>>>505 272 7085
>>>>jbrunt at lternet.edu
>>>>
>>>>
>>>>-------- Original Message --------
>>>>From: Mark Servilla <servilla at lternet.edu>
>>>>To: James Brunt <jbrunt at lternet.edu>
>>>>Subject: [Fwd: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>>>11:00:36 MDT 2004]
>>>>
>>>>James,
>>>>
>>>>After reviewing the EML specification documents, it appears
>>>>to me that duplicate IDs within a single instance document are
>>>>not valid EML, and therefore (IMHO), the EML Parser is
>>>>behaving correctly. I cannot see how setting either the
>>>>SYSTEM or SCOPE attribute can be used by the REFERENCES
>>>>element to distinguish duplicate IDs within a single document
>>>>(perhaps someone in eml-dev can help answer how SYSTEM/SCOPE
>>>>are used in this context).
>>>>
>>>>Some possible solutions are:
>>>>(1) Deprecate SYSTEM/SCOPE attributes in this context, update
>>>>the specification to reflect such change, and do not allow
>>>>duplicate IDs.
>>>>(2) Modify the specification to allow SYSTEM/SCOPE to narrow
>>>>the ID scope, thereby allowing duplicate IDs when qualified
>>>>by either SYSTEM/SCOPE -- and, modify the specification for
>>>>REFERENCES to make use of such change.
>>>>(3) Deprecate REFERENCES completely and force repeated content.
>>>>
>>>>Just my thoughts - thanks!
>>>>
>>>>Mark
>>>>
>>>>-------- Original Message --------
>>>>Subject: Re: FW: Report from Metacat Harvester: Wed Aug 25
>>>>11:00:36 MDT 2004
>>>>Date: Mon, 30 Aug 2004 09:26:13 -0600
>>>>From: Mark Servilla <servilla at lternet.edu>
>>>>To: 'Corinna Gries' <corinna at asu.edu>
>>>>CC: James Brunt <jbrunt at lternet.edu>, Duane Costa <dcosta at lternet.edu>
>>>>References: <E1C0TNQ-00066I-00 at lternet.lternet.edu>
>>>>
>>>>Hi Corinna,
>>>>
>>>>I have been discussing this issue of ID attributes with James
>>>>and Duane here at LNO. Please correct me if I am wrong, but
>>>>the section on Reusable Content (below or
>>>>http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html#reusableContent)
>>>>states that "two identical ids cannot exist in a single
>>>>document". It appears that the "SYSTEM" attribute only
>>>>allows identical ids in multiple documents within the system
>>>>(that is, only if the repeated ids reference the exact same
>>>>object) - something like globalizing the id'ed object to the
>>>>system for repeated reference in one or more documents, but
>>>>not necessarily allowing identical ids within a single
>>>>document by changing the SYSTEM attribute value. I am not
>>>>really sure how one would take advantage of the SYSTEM
>>>>attribute for reusable content. And, I don't know the
>>>>provenance of this particular issue (the documentation could
>>>>certainly be more clear), but if we were to follow the
>>>>documentation as we interpret it, would this still be a bug in
>>>>the Harvester/Metacat software?
>>>>
>>>>Sincerely,
>>>>Mark
>>>>
>>>>3.3. Reusable Content
>>>>EML allows the reuse of previously defined structured content (DOM
>>>>sub-trees) through the use of key/keyRef type references. In
>>>>order for an EML package to remain cohesive and to allow for
>>>>the cross platform compatibility of packages, the following
>>>>rules with respect to packaging must be followed.
>>>>1. An ID is required on the eml root element.
>>>>2. IDs are optional on all other elements.
>>>>3. If an ID is not provided, that content must be interpreted as
>>>>   representing a distinct object.
>>>>4. If an ID is provided for content then that content is distinct
>>>>   from all other content except for that content that references
>>>>   its ID.
>>>>5. If a user wants to reuse content to indicate the repetition of
>>>>   an object, a reference must be used. Two identical ids cannot
>>>>   exist in a single document.
>>>>6. "Document" scope is defined as identifiers unique only to a
>>>>   single instance document (if a document does not have a system
>>>>   attribute or if scope is set to 'document' then all IDs are
>>>>   defined as distinct content).
>>>>7. "System" scope is defined as identifiers unique to an entire
>>>>   data management system (if two documents share a system string,
>>>>   then any IDs in those two documents that are identical refer to
>>>>   the same object).
>>>>8. If an element references another element, it must not have an
>>>>   ID itself.
>>>>9. All EML packages must have the 'eml' module as the root.
>>>>10. The system and scope attribute are always optional except for
>>>>   at the 'eml' module where the scope attribute is fixed as
>>>>   'system'. The scope attribute defaults to 'document' for all
>>>>   other modules.
>>>>
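
[Editorial sketch] Three of the rules quoted above lend themselves to a mechanical check: ids must be unique within a document (rule 5), an element that references another must not carry an id itself (rule 8), and a reference should point at an id that exists. The following is a minimal sketch under those assumptions, not the EMLparser implementation; the function name and the sample document are hypothetical, and the system attribute is ignored, as in the documentation's current wording.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_reusable_content(xml_text):
    """Return a list of human-readable problems found in one document."""
    root = ET.fromstring(xml_text)
    problems = []
    # Count every id value in the document, regardless of element type.
    counts = Counter(e.get("id") for e in root.iter() if e.get("id") is not None)
    for value, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"id {value} occurs {n} times")
    for parent in root.iter():
        ref = parent.find("references")
        if ref is None:
            continue
        if parent.get("id") is not None:
            problems.append(f"<{parent.tag}> has both an id and a references child")
        target = (ref.text or "").strip()
        if target not in counts:
            problems.append(f"references points at unknown id {target}")
    return problems

SAMPLE = """<eml>
  <creator id="30"/>
  <dataset id="30">
    <contact id="c1"><references>99</references></contact>
  </dataset>
</eml>"""
problems = check_reusable_content(SAMPLE)
for line in problems:
    print(line)
```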
>>>>Duane Costa wrote:
>>>>
>>>>>Could anyone comment as to whether the EML error reported by Metacat
>>>>>below is a genuine EML error versus a bug in Metacat or the EML
>>>>>validator program? The issue is whether the id value for <dataset>
>>>>>must be unique from the id value for <creator>.
>>>>>
>>>>>Thanks,
>>>>>Duane
>>>>>
>>>>>-----Original Message-----
>>>>>From: Corinna Gries [mailto:corinna at asu.edu]
>>>>>Sent: Thursday, August 26, 2004 3:48 PM
>>>>>To: dcosta at lternet.edu
>>>>>Subject: RE: Report from Metacat Harvester: Wed Aug 25 11:00:36 MDT 2004
>>>>>
>>>>>Hi Duane,
>>>>>
>>>>>I am trying to fix these problems with our eml files. Some are easy
>>>>>because they are actual errors in our files, but there is one where I
>>>>>wonder if the ID checking is right. I understood IDs should be unique
>>>>>within the system, that is for example:
>>>>>
>>>>><dataset id="30" system="ces_dataset"> ... Is different from <creator
>>>>>id="30" system="ces_party"> ....
>>>>>
>>>>>However, your harvester complains that they are the same:
>>>>>
>>>>>*****************************************************************************
>>>>>*
>>>>>* METACAT HARVESTER REPORT: Wed Aug 25 11:00:36 MDT 2004
>>>>>*
>>>>>* A TOTAL OF 22 ERRORS WERE DETECTED.
>>>>>* Please see the log entries below for additonal details.
>>>>>*
>>>>>*****************************************************************************
>>>>>
>>>>>*****************************************************************************
>>>>>*
>>>>>* harvestLogID: 5549
>>>>>* harvestDate: Wed Aug 25 11:00:36 MDT 2004
>>>>>* status: 1
>>>>>* message:
>>>>>* harvestOperationCode: InsertDocError
>>>>>* description: Error inserting EML document to Metacat
>>>>>* detailLogID: 383
>>>>>* errorMessage: MetacatException: <?xml version="1.0"?>
>>>>><error>
>>>>>Error running xpath expression:
>>>>>//dateTimeDomain|//nonNumericDomain|//numericDomain|//access|
>>>>>//attributeList|//constraint|//coverage|//temporalCoverage|
>>>>>//geographicCoverage|//taxonomicCoverage|/dataset|/eml/dataset|
>>>>>//dataSource|//dataTable|//otherEntity|//citation|//address|
>>>>>//conferenceLocation|//party|//originator|//creator|//contact|
>>>>>//publisher|//editor|//recipient|//performer|//institution|
>>>>>//metadataProvider|//associatedParty|//personnel|//physical|
>>>>>//connectionDefinition|//distribution|//researchProject|//project|
>>>>>//relatedProject|//software|//spatialRaster|//spatialReference|
>>>>>//spatialVector|//storedProcedure|//view|//protocol|
>>>>>//additionalMetadata : Error in xml
>>>>>document. This EML document is not valid because the id 30 occurs
>>>>>more than once. IDs must be unique. </error>
>>>>>
>>>>>* scope: ces_dataset
>>>>>* identifier: 30
>>>>>* revision: 1
>>>>>* documentType: eml://ecoinformatics.org/eml-2.0.0
>>>>>* documentURL:
>>>>>http://seinet.asu.edu/DataCatalog/getXanthoriaRecord.jsp?source=ces_dataset_mohave&id=30
>>>>>*
>>>>>*****************************************************************************
>>>>>
>>>>>What do you think?
>>>>>
>>>>>Corinna
>>>>>
>>>>>_______________________________________________
>>>>>eml-dev mailing list
>>>>>eml-dev at ecoinformatics.org
>>>>>http://www.ecoinformatics.org/mailman/listinfo/eml-dev
>>>>
>>>>--
>>>>Mark Servilla, Ph.D.
>>>>
>>>>LTER Network Office
>>>>Department of Biology
>>>>MSC 03 2020
>>>>1 University of New Mexico
>>>>Albuquerque, NM 87131-0001
>>>>
>>>>servilla at lternet.edu
>>>>Office (505) 277-2619
>>>>Cell (505) 453-8593
>>>>
>>>>
>>>>
>>>>-------------------------------------------------
>>>>Long-Term Ecological Research Network Mailing List
>>>>im at LTERnet.edu http://sql.lternet.edu/cgi/mailgroups_view.pl?im
>>>>