eml packaging -- proposed changes

Thu May 23 02:23:50 PDT 2002

I've been doing a lot of thinking about the changes in packaging that we 
discussed at Sevilleta.  I've also talked it over at some length with 
Chad, Dan, and Chris, and then continued my rumination.  We have a 
number of possibilities, and so I have been weighing the pros and cons 
of each.  In this email, I am hoping to outline my thoughts in terms of 
1) what we're trying to achieve with packaging, 2) what technologies are 
available and their possible problems, and 3) a proposed approach. 
Needless to say, this is going to be a long email (just think how long 
we could talk about it :^).

I have checked in some changes to a subset of the eml modules (eml, 
resource, dataset, party, coverage) that demonstrate how this new 
solution would work.  You might want to grab those to look at.  I'm 
looking forward to your feedback.

1) Our goals for linking are:
-----------------------------
a) reduce repeated information by allowing references internally to 
existing subtrees (this requires an ability to substitute a whole 
subtree for a reference when processing);
b) provide extensibility by allowing new metadata types to be added to a 
package and associated without knowing ahead of time what they are (this
requires an ability to state an association between two metadata 
components).

2) Technologies available
-------------------------
At least 5 linking technologies are available to use: eml triples,
ID/IDREF, XLink, RDF, and XML Schema key/keyref.  ID/IDREF are part of
XML 1.0 and require only an XML parser for validation.  The others
require an additional parser for validation.  All of them require an
additional parser in order to use the links beyond validation (i.e., to
resolve them).  ID/IDREF and key/keyref do not have a concept of a
"role" for the link; the others do.  Our triples are a non-standard
equivalent to RDF.  We could also use XPath/XPointer addresses, but
because these can be relative, they can break when the document
changes, and so are not particularly robust for our purposes.  XLink
requires that the links be attributes in the xlink namespace, as well
as requiring an additional processor.  ID/IDREF must be attributes.

The RDF Statement is generally used at a finer granularity than we
are using it, in that the predicate/role/property is usually atomic
(e.g., "creator").  We, however, use it to point two complex structures
at one another, and so the role provides no additional information that
is not already implicit in the document types of the subject and object.
Consequently, I propose that we do not actually need a "role" for our
linking purposes.  Thus, eml triples and RDF are probably overkill.  We 
could talk this over for a long time.

XLink allows a role and other link metadata, but requires the use of
the particular xlink attributes on your linking elements (e.g.,
<mylinkelement xlink:href="some-link-uri"/>).  ID/IDREF allows you to
create a link id or idref attribute with any name on any element (e.g.,
<mylinkelement ref="someid"/>).  The xlink:show attribute allows one to
specify what to do with a link, including values like "replace" for
substitution and "new" to indicate a link.  These loosely correspond to
our desire for both replacement and pointer links.  XLink processors
seem to provide more convenient access to a compiled link database after
processing, but this is probably a fairly easy library to provide for
ID/IDREF too.  ID/IDREF links MUST be internal to the doc, whereas XLink
can point at external resources.  Overall, we thought the simplicity of
ID/IDREF was good, and that it had the features we needed, but that
XLink would be almost equivalent.  XLink may allow some growth that
ID/IDREF would not.

Our first goal, to reduce redundancy by using references to other 
identified subtrees, introduces some issues with validation.  In 
particular, if we use IDREF, then we would need to write a content model 
where the element content depends on the presence of an id or idref 
attribute, which is technically not possible.  For example, the element 
should be considered valid if it either 1) has an idref attribute and no 
content, or 2) has an id attribute and valid content.  This cannot be 
expressed in a DTD or XML Schema content model.  So, overall, the IDREF 
scheme introduces a huge complication for validation, but the proposed 
solution gets around this problem.
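
To make the rule concrete, here is a minimal sketch in Python (standard
library only) of the either/or condition described above.  It is a
procedural check, not something a grammar can state.  The function name
is hypothetical, and "valid content" is approximated as "any content at
all", since checking the real content model is outside the sketch.

```python
import xml.etree.ElementTree as ET


def satisfies_cooccurrence(el):
    """Return True if el has (idref and no content) or (id and content).

    Assumption: "content" means any child element or non-whitespace text;
    the real rule would also require that the content be valid.
    """
    has_content = len(el) > 0 or bool((el.text or "").strip())
    has_id = el.get("id") is not None
    has_idref = el.get("idref") is not None
    return (has_idref and not has_content) or (has_id and has_content)


if __name__ == "__main__":
    pointer = ET.fromstring('<contact idref="p1"/>')
    full = ET.fromstring('<creator id="p1"><surName>Jones</surName></creator>')
    print(satisfies_cooccurrence(pointer), satisfies_cooccurrence(full))
```

The point of the sketch is only that the attribute decides which content
is allowed, which is exactly the dependency a content model has no way
to declare.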

3) Proposed approach
--------------------
Our general approach in EML has been to create ComplexTypes (CT) when we 
wanted a particular block to be reusable.  I propose that this concept 
be extended by adding an optional attribute named "id" of type "xs:ID" 
for each ComplexType.  This allows us to uniquely address each block 
defined by a CT, and any XML 1.0 parser will validate that all of the 
"id" values are in fact locally unique.  For the "ResourceBase" CT, this 
new id attribute would replace the current "identifier" element and would 
also act as the overall identifier for the package.  ResourceBase would 
also have the "system" attribute (from identifier) for globally scoping 
the id.

Next, we would change the content model for each CT to be a choice 
between the existing content model and a new element named "references" 
of type "xs:string".  This element will be used to hold a reference to 
an existing subtree identified by its id.   We use this element instead 
of an IDREF to surmount the validation issues mentioned above. This 
relationship between the "references" element and the "id" identifiers 
will be enforced by defining an XML Schema "key" for the "id" attributes 
and a "keyref" for the "references" elements.  Thus, any XML parser that 
supports XML Schema validation will be able to validate the 
correspondence between each "id" and "references" field (e.g., Xerces 
2.0 supports this).  I've attached a picture of ResponsibleParty as 
modified using this approach for illustration purposes (but note it 
doesn't show the "id" attribute on "ResponsibleParty").

Here's a fragment of an example xml doc to illustrate:
     ...
     <creator id="p1">
       <individualName><surName>Jones</surName></individualName>
     </creator>
     <associatedParty>
       <references>p1</references>
       <role>lackey</role>
     </associatedParty>
     <contact>
       <references>p1</references>
     </contact>
     ...

Note that this even works for types that extend other types as long as 
the subclass is the one that does the referencing (e.g., associatedParty 
can reference creator, but not vice versa).  This rule will actually be 
enforced by validating parsers.
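
The substitution step from goal 1a (swap a whole subtree in for a
reference) could look roughly like the following Python sketch, using
only the standard library.  This is an illustration, not the processor
we would ship: the function name and the <dataset> wrapper element are
assumptions, and it does not handle a referenced subtree that itself
contains a "references" element.

```python
import copy
import xml.etree.ElementTree as ET

# The fragment from above, wrapped in a hypothetical <dataset> root.
SAMPLE = """
<dataset>
  <creator id="p1">
    <individualName><surName>Jones</surName></individualName>
  </creator>
  <associatedParty>
    <references>p1</references>
    <role>lackey</role>
  </associatedParty>
  <contact>
    <references>p1</references>
  </contact>
</dataset>
"""


def resolve_references(root):
    """Replace each <references> child with copies of the target's children."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    # Collect parents first so the tree is not modified while iterating it.
    parents = [el for el in root.iter() if el.find("references") is not None]
    for parent in parents:
        ref = parent.find("references")
        key = (ref.text or "").strip()
        if key not in by_id:
            raise KeyError("no element with id " + repr(key))
        index = list(parent).index(ref)
        parent.remove(ref)
        # Insert the referenced content where the reference used to be.
        for offset, child in enumerate(by_id[key]):
            parent.insert(index + offset, copy.deepcopy(child))
    return root


if __name__ == "__main__":
    resolved = resolve_references(ET.fromstring(SAMPLE))
    print(ET.tostring(resolved, encoding="unicode"))
```

After resolution, associatedParty holds the individualName copied from
creator followed by its own role, and contact holds just the copied
individualName.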

Existing modules that are currently associated via triples will instead 
be directly included in the content models (e.g., entity will contain 
attributeList), but the "references" element allows us to define each 
attributeList only once and reference it in the other entities that 
share it.

So, that lets us reuse portions of documents, satisfying goal 1 while 
still minimizing our processor needs.  In a worst-case scenario, if a 
schema validator is not available, we can still validate the ids as 
unique because they are defined as type xs:ID.

The key and keyref are defined in the eml.xsd module.  In this scenario, 
a package is defined by all of the content included in the <eml> tag, 
including the nested modules like attribute in entity.  The only thing 
we lose with this approach is the ability to use alternative specs 
(e.g., use something other than eml-attribute for attribute 
descriptions) for a given module because they will be included directly 
in the content models, but that's not a very big deal.  The content 
model of the eml element requires one of the types that extend resource 
(dataset, lit, software, ...), and then has an optional, repeatable 
element "additionalMetadata" with content model ANY in which arbitrary 
other metadata docs can be placed.  The additionalMetadata element has 
an id attribute and another attribute named "describes" that is a 
reference to an id with which this subtree should be associated.  The 
nature of the association is implied by the types of the document (i.e., 
role/predicate/property/relationship is not specified directly).  The 
reference/id linkage is enforced by defining another "keyref" 
constraint.  So, this lets us add arbitrary metadata documents and point 
them at existing ids in the tree. Thus, the id serves as both ends of 
the link (subject and object in RDF terms) depending on whether it is 
referred to in a "references" element or in a "describes" attribute.

This satisfies our second goal of being able to include arbitrary 
metadata types.
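
As a rough illustration of what the key/keyref constraints buy us, here
is a Python sketch (standard library only) that performs the same
checks by hand: ids are unique, and every "references" element and
"describes" attribute points at an existing id.  It also shows how a
consumer might find the additional metadata attached to a given id.
The function names and the <projectNotes> element are hypothetical;
the real enforcement would come from the schema, not from code like
this.

```python
import xml.etree.ElementTree as ET

# Hypothetical package; <projectNotes> stands in for any extra metadata.
PACKAGE = """
<eml>
  <dataset id="d1">
    <creator id="p1">
      <individualName><surName>Jones</surName></individualName>
    </creator>
    <contact><references>p1</references></contact>
  </dataset>
  <additionalMetadata id="m1" describes="d1">
    <projectNotes>extra metadata of an arbitrary type</projectNotes>
  </additionalMetadata>
</eml>
"""


def check_links(root):
    """Return a list of problems: duplicate ids or dangling links."""
    errors = []
    ids = set()
    for el in root.iter():
        ident = el.get("id")
        if ident is None:
            continue
        if ident in ids:
            errors.append(f"duplicate id: {ident}")
        ids.add(ident)
    for el in root.iter("references"):
        value = (el.text or "").strip()
        if value not in ids:
            errors.append(f"references points at unknown id: {value}")
    for el in root.iter("additionalMetadata"):
        value = el.get("describes")
        if value not in ids:
            errors.append(f"describes points at unknown id: {value}")
    return errors


def metadata_for(root, ident):
    """Return the additionalMetadata elements that describe the given id."""
    return [el for el in root.iter("additionalMetadata")
            if el.get("describes") == ident]


if __name__ == "__main__":
    doc = ET.fromstring(PACKAGE)
    print(check_links(doc))
    print([m.get("id") for m in metadata_for(doc, "d1")])
```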

I've attached a sample xml document illustrating these concepts that 
validates using the Xerces schema processor. You'll need to check out 
the updated schema files (obviously) for it to work. I've also modified 
the SAX validator script to optionally include schema validation if you 
have xerces2 on your classpath.

Lots of stuff to ponder, for sure.  I didn't go into detail about the 
several other approaches that I've considered and rejected, because I'm 
trying to keep this email somewhat manageable in terms of length.

Thanks for your feedback.

Matt

-- 
*******************************************************************
Matt Jones                                    jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)

Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eml-party.png
Type: image/png
Size: 5591 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20020523/805dc82b/eml-party.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eml-test.xml
Type: text/xml
Size: 1280 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20020523/805dc82b/eml-test.xml