eml packaging -- proposed changes

Thu May 23 02:27:20 PDT 2002

Sorry -- I attached the wrong image in my last email. Here's the right 
image...

Matt

Matt Jones wrote:
> I've been doing a lot of thinking about the changes in packaging that we 
> discussed at Sevilleta.  I've also talked it over at some length with 
> Chad, Dan, and Chris, and then continued my rumination.  We have a 
> number of possibilities, and so I have been weighing the pros and cons 
> of each.  In this email, I am hoping to outline my thoughts in terms of 
> 1) what we're trying to achieve with packaging, 2) what technologies are 
> available and their possible problems, and 3) a proposed approach. 
> Needless to say, this is going to be a long email (just think how long 
> we could talk about it :^).
> 
> I have checked in some changes to a subset of the eml modules (eml, 
> resource, dataset, party, coverage) that demonstrate how this new 
> solution would work.  You might want to grab those to look at.  I'm 
> looking forward to your feedback.
> 
> 1) Our goals for linking are:
> -----------------------------
> a) reduce repeated information by allowing references internally to 
> existing subtrees (this requires an ability to substitute a whole 
> subtree for a reference when processing);
> b) provide extensibility by allowing new metadata types to be added to a 
> package and associated without knowing ahead of time what they are (this
> requires an ability to state an association between two metadata 
> components).
> 
> 2) Technologies available
> -------------------------
> At least five linking technologies are available to use: EML triples,
> ID/IDREF, XLink, RDF, and XML Schema key/keyref.  ID/IDREF are part of
> XML 1.0 and require only an XML parser for validation.  The others
> require an additional parser for validation.  All of them require an
> additional parser in order to use the links beyond validation (i.e., to
> resolve them).  ID/IDREF and key/keyref do not have a concept of a
> "role" for the link; the others do.  Our triples are a non-standard
> equivalent to RDF.  We could also use XPath/XPointer addresses, but
> because these can be relative, they can break when the document changes,
> and so are not particularly robust for our purposes.  XLink requires
> that the links be attributes in the xlink namespace, and it also
> requires an additional processor.  ID/IDREF must be attributes as well.
> 
> The RDF Statement is generally used at a finer granularity than we
> are using it, in that the predicate/role/property is usually atomic
> (e.g., "creator").  We, however, use it to point two complex structures
> at one another, and so the role provides no additional information that
> is not already implicit in the document types of the subject and object.
>   Consequently, I propose that we do not actually need a "role" for our
> linking purposes.  Thus, eml triples and RDF are probably overkill.  We 
> could talk this over for a long time.
> 
> XLink allows a role and other link metadata, but requires the use of
> the particular xlink attributes on your linking elements (e.g.,
> <mylinkelement xlink:href="some-link-uri"/>).  ID/IDREF allows you to
> create a link id or idref attribute with any name on any element (e.g.,
> <mylinkelement ref="someid"/>).  The xlink:show attribute allows one to
> specify what to do with a link, including values like "replace" for
> substitution and "new" to indicate a link.  These loosely correspond to
> our desire for both replacement and pointer links.  XLink processors
> seem to provide more convenient access to a compiled link database after
> processing, but this is probably a fairly easy library to provide for
> ID/IDREF too.  ID/IDREF links MUST be internal to the doc, whereas XLink
> can point at external resources.  Overall, we thought the simplicity of
> ID/IDREF was good, and that it had the features we needed, but that
> XLink would be almost equivalent.  XLink may allow some growth that
> ID/IDREF would not.
> 
> Our first goal, to reduce redundancy by using references to other 
> identified subtrees, introduces some issues with validation.  In 
> particular, if we use IDREF, then we would need to write a content model 
> where the element content depends on the presence of an id or idref 
> attribute, which is technically not possible.  For example, the element 
> should be considered valid if it either 1) has an idref attribute and no 
> content, or 2) has an id attribute and valid content.  This cannot be
> expressed in a DTD or XML Schema content model, since neither can make
> element content depend on an attribute.  So, overall, this whole scheme introduces a huge
> complication for validation, but the proposed solution gets around this 
> problem.
> 
> 3) Proposed approach
> --------------------
> Our general approach in EML has been to create ComplexTypes (CT) when we 
> wanted a particular block to be reusable.  I propose that this concept 
> be extended by adding an optional attribute named "id" of type "xs:ID" 
> for each ComplexType.  This allows us to uniquely address each block 
> defined by a CT, and any XML 1.0 parser will validate that all of the 
> "id" values are in fact locally unique.  For the "ResourceBase" CT, this 
> new id attribute would replace the current "identifier" element and would 
> also act as the overall identifier for the package.  ResourceBase would 
> also  have the "system" attribute (from identifier) for globally scoping 
> the id.
> 
> Next, we would change the content model for each CT to be a choice 
> between the existing content model and a new element named "references" 
> of type "xs:string".  This element will be used to hold a reference to 
> an existing subtree identified by its id.   We use this element instead 
> of an IDREF to surmount the validation issues mentioned above. This 
> relationship between the "references" element and the "id" identifiers 
> will be enforced by defining an XML Schema "key" for the "id" attributes 
> and a "keyref" for the "references" elements.  Thus, any XML parser that 
> supports XML Schema validation will be able to validate the 
> correspondence between each "id" and "references" field (e.g., Xerces 
> 2.0 supports this).  I've attached a picture of ResponsibleParty as 
> modified using this approach for illustration purposes (but note it 
> doesn't show the "id" attribute on "ResponsibleParty").
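>
> A minimal sketch of what such a content model might look like, assuming
> a heavily simplified ResponsibleParty.  The abbreviated child elements
> are taken from the sample document; the checked-in schema files are the
> authoritative definition, and this is only an illustration of the
> choice-plus-id pattern:
>
> ```xml
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <!-- Simplified stand-in for ResponsibleParty; the real content model is larger. -->
>   <xs:complexType name="ResponsibleParty">
>     <xs:choice>
>       <!-- Option 1: the existing content model (abbreviated for illustration). -->
>       <xs:sequence>
>         <xs:element name="individualName" minOccurs="0">
>           <xs:complexType>
>             <xs:sequence>
>               <xs:element name="givenName" type="xs:string" minOccurs="0"/>
>               <xs:element name="surName" type="xs:string"/>
>             </xs:sequence>
>           </xs:complexType>
>         </xs:element>
>         <xs:element name="organizationName" type="xs:string" minOccurs="0"/>
>       </xs:sequence>
>       <!-- Option 2: a pointer to the id of a subtree defined elsewhere. -->
>       <xs:element name="references" type="xs:string"/>
>     </xs:choice>
>     <!-- Optional handle that must be unique within the document. -->
>     <xs:attribute name="id" type="xs:ID" use="optional"/>
>   </xs:complexType>
> </xs:schema>
> ```
>
> Because the choice is part of the content model itself, an element of
> this type holds either real content or a single "references" child,
> without any rule that ties element content to an attribute.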
> 
> Here's a fragment of an example xml doc to illustrate:
>     ...
>     <creator id="p1">
>       <individualName><surName>Jones</surName></individualName>
>     </creator>
>     <associatedParty>
>       <references>p1</references>
>       <role>lackey</role>
>     </associatedParty>
>     <contact>
>       <references>p1</references>
>     </contact>
>     ...
> 
> Note that this even works for types that extend other types as long as 
> the subclass is the one that does the referencing (e.g., associatedParty 
> can reference creator, but not vice versa).  This rule will actually be 
> enforced by validating parsers.
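>
> To see why a "references" element followed by "role" can validate for
> associatedParty, here is a sketch of type extension under stated
> assumptions: the base type is reduced to a bare choice, and the type
> name AssociatedParty is hypothetical, chosen only for this example:
>
> ```xml
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <!-- Base type reduced to the bare choice; the real content model is omitted. -->
>   <xs:complexType name="ResponsibleParty">
>     <xs:choice>
>       <xs:element name="organizationName" type="xs:string"/>
>       <xs:element name="references" type="xs:string"/>
>     </xs:choice>
>     <xs:attribute name="id" type="xs:ID"/>
>   </xs:complexType>
>
>   <!-- Hypothetical name for the extended type behind associatedParty. -->
>   <xs:complexType name="AssociatedParty">
>     <xs:complexContent>
>       <xs:extension base="ResponsibleParty">
>         <!-- Extension appends to the base content model, so "role"
>              follows whichever branch of the choice was used. -->
>         <xs:sequence>
>           <xs:element name="role" type="xs:string"/>
>         </xs:sequence>
>       </xs:extension>
>     </xs:complexContent>
>   </xs:complexType>
> </xs:schema>
> ```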
> 
> Existing modules that are currently associated via triples will instead 
> be directly included in the content models (e.g., entity will contain 
> attributeList), but the "references" element allows us to define each 
> attributeList only once and reference it in the other entities that 
> share it.
> 
> So, that lets us reuse portions of documents, satisfying goal 1 while 
> still minimizing our processor needs.  In a worst-case scenario, if a 
> schema validator is not available, we can still validate the ids as 
> unique because they are defined as type xs:ID.
> 
> The key and keyref are defined in the eml.xsd module.  In this scenario, 
> a package is defined by all of the content included in the <eml> tag, 
> including the nested modules like attribute in entity.  The only thing 
> we lose with this approach is the ability to use alternative specs 
> (e.g., use something other than eml-attribute for attribute 
> descriptions) for a given module because they will be included directly 
> in the content models, but that's not a very big deal.  The content 
> model of the eml element requires one of the types that extend resource 
> (dataset, lit, software, ...), and then has an optional, repeatable 
> element "additionalMetadata" with content model ANY in which arbitrary 
> other metadata docs can be placed.  The additionalMetadata element has 
> an id attribute and another attribute named "describes" that is a 
> reference to an id with which this subtree should be associated.  The 
> nature of the association is implied by the types of the document (i.e., 
> role/predicate/property/relationship is not specified directly).  The 
> reference/id linkage is enforced by defining another "keyref" 
> constraint.  So, this lets us add arbitrary metadata documents and point 
> them at existing ids in the tree. Thus, the id serves as both ends of 
> the link (subject and object in RDF terms) depending on whether it is 
> referred to in a "references" element or in a "describes" attribute.
> 
> This satisfies our second goal of being able to include arbitrary 
> metadata types.
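>
> The identity constraints described above might be sketched as follows.
> This is an illustration, not the contents of eml.xsd: the constraint
> names are hypothetical and the content model of "eml" is left out.  It
> also substitutes xs:unique for the "key", because a key requires its
> field to be present on every selected element, while "id" is optional
> here; a keyref is allowed to refer to either kind of constraint.
>
> ```xml
> <xs:element name="eml" xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <xs:complexType>
>     <!-- content model omitted from this sketch -->
>   </xs:complexType>
>
>   <!-- Every id attribute anywhere below eml must be distinct. -->
>   <xs:unique name="ids">
>     <xs:selector xpath=".//*"/>
>     <xs:field xpath="@id"/>
>   </xs:unique>
>
>   <!-- The text of each "references" element must match an existing id.
>        Unprefixed names in the selector match elements in no namespace,
>        as in the sample document. -->
>   <xs:keyref name="referencesToIds" refer="ids">
>     <xs:selector xpath=".//references"/>
>     <xs:field xpath="."/>
>   </xs:keyref>
>
>   <!-- The "describes" attribute of additionalMetadata must also match an id. -->
>   <xs:keyref name="describesToIds" refer="ids">
>     <xs:selector xpath="additionalMetadata"/>
>     <xs:field xpath="@describes"/>
>   </xs:keyref>
> </xs:element>
> ```
>
> With constraints of this shape, one id can be the target of both a
> "references" element and a "describes" attribute, which is the
> two-ended use of ids described above.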
> 
> I've attached a sample xml document illustrating these concepts that 
> validates using the Xerces schema processor. You'll need to check out 
> the updated schema files (obviously) for it to work. I've also modified 
> the SAX validator script to optionally include schema validation if you 
> have xerces2 on your classpath.
> 
> Lots of stuff to ponder, for sure.  I didn't go into detail about the 
> several other approaches that I've considered and rejected, because I'm 
> trying to keep this email somewhat manageable in terms of length.
> 
> Thanks for your feedback.
> 
> Matt
> 
> 
> ------------------------------------------------------------------------
> 
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <eml:eml xmlns:eml="eml:eml-2.0.0beta8"
>          xmlns:cit="eml:literature-2.0.0beta8"
>          xmlns:doc="eml:documentation-2.0.0beta8"
>          xmlns:ds="eml:dataset-2.0.0beta8"
>          xmlns:sw="eml:software-2.0.0beta8"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="eml:eml-2.0.0beta8 eml.xsd">
>   <ds:dataset id="knb.2.1" system="knb">
>     <title>Bison data</title>
>     <creator id="knb.3.1">
>       <individualName>
>         <givenName>Matthew</givenName>
>         <surName>Jones</surName>
>       </individualName>
>       <organizationName>NCEAS</organizationName>
>       <address>
>         <city>Santa Barbara</city>
>         <country>CA</country>
>       </address>
>       <electronicMailAddress>jones at nceas.ucsb.edu</electronicMailAddress>
>       <onlineUrl>http://www.nceas.ucsb.edu</onlineUrl>
>     </creator>
>     <metadataProvider>
>       <references>knb.3.1</references>
>     </metadataProvider>
>     <associatedParty>
>       <references>knb.3.1</references>
>       <role>lackey</role>
>     </associatedParty>
>     <contact>
>       <references>knb.3.1</references>
>     </contact>
>   </ds:dataset>
>   <additionalMetadata id="knb.4.1" describes="knb.3.1">
>     <personalInfo>
>       <nickname>Matt</nickname>
>     </personalInfo>
>   </additionalMetadata>
> </eml:eml>

-- 
*******************************************************************
Matt Jones                                    jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)

Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eml-party.png
Type: image/png
Size: 2149 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20020523/1e4d7fb2/eml-party.png