eml packaging -- proposed changes
Matt Jones
jones at nceas.ucsb.edu
Thu May 23 02:23:50 PDT 2002
I've been doing a lot of thinking about the changes in packaging that we
discussed at Sevilleta. I've also talked it over at some length with
Chad, Dan, and Chris, and then continued my rumination. We have a
number of possibilities, and so I have been weighing the pros and cons
of each. In this email, I am hoping to outline my thoughts in terms of
1) what we're trying to achieve with packaging, 2) what technologies are
available and their possible problems, and 3) a proposed approach.
Needless to say, this is going to be a long email (just think how long
we could talk about it :^).
I have checked in some changes to a subset of the eml modules (eml,
resource, dataset, party, coverage) that demonstrate how this new
solution would work. You might want to grab those to look at. I'm
looking forward to your feedback.
1) Our goals for linking are:
-----------------------------
a) reduce repeated information by allowing references internally to
existing subtrees (this requires an ability to substitute a whole
subtree for a reference when processing);
b) provide extensibility by allowing new metadata types to be added to a
package and associated without knowing ahead of time what they are (this
requires an ability to state an association between two metadata
components).
2) Technologies available
-------------------------
At least 5 linking technologies are available to use: eml triples,
ID/IDREF, XLink, RDF, XML Schema key/keyref. ID/IDREF are part of XML
1.0 and require only an XML parser for validation. The others require
an additional parser for validation. All of them require an additional
parser in order to use the links beyond validation (i.e., to resolve
them). ID/IDREF and key/keyref do not have a concept of a "role" for
the link; the others do. Our triples are a non-standard
equivalent to RDF. We could also use XPath/XPointer addresses, but
because these can be relative, they can break by changing the document,
and so are not particularly robust for our purposes. XLink requires
that the links be attributes, and that they are in the xlink namespace,
as well as requiring an additional processor. ID/IDREF must be attributes.
The RDF Statement is generally used at a finer granularity than we
are using it, in that the predicate/role/property is usually atomic
(e.g., "creator"). We, however, use it to point two complex structures
at one another, and so the role provides no additional information that
is not already implicit in the document types of the subject and object.
Consequently, I propose that we do not actually need a "role" for our
linking purposes. Thus, eml triples and RDF are probably overkill. We
could talk this over for a long time.
XLink allows a role and other link metadata, but requires the use of
the particular xlink attributes on your linking elements (e.g.,
<mylinkelement xlink:href="some-link-uri"/>). ID/IDREF allows you to
create a link id or idref attribute with any name on any element (e.g.,
<mylinkelement ref="someid"/>). The xlink:show attribute allows one to
specify what to do with a link, including values like "replace" for
substitution and "new" to indicate a link. These loosely correspond to
our desire for both replacement and pointer links. XLink processors
seem to provide more convenient access to a compiled link database after
processing, but this is probably a fairly easy library to provide for
ID/IDREF too. ID/IDREF links MUST be internal to the doc, whereas XLink
can point at external resources. Overall, we thought the simplicity of
ID/IDREF was good, and that it had the features we needed, but that
XLink would be almost equivalent. XLink may allow some growth that
ID/IDREF would not.
Our first goal, to reduce redundancy by using references to other
identified subtrees, introduces some issues with validation. In
particular, if we use IDREF, then we would need to write a content model
where the element content depends on the presence of an id or idref
attribute, which is technically not possible. For example, the element
should be considered valid if it either 1) has an idref attribute and no
content, or 2) has an id attribute and valid content. This cannot be
expressed in a DTD or XML Schema content model, because neither lets
element content depend on which attributes are present. So, overall,
this whole scheme introduces a huge complication for validation, but
the proposed solution gets around this problem.
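
To make the either/or rule above concrete, here is a minimal Python sketch of the check that a content model cannot express. The attribute names "id" and "idref" and the sample elements are assumptions for illustration only, and "valid content" is reduced to "has some content".

```python
import xml.etree.ElementTree as ET


def satisfies_cooccurrence(element):
    """Check the either/or rule: idref with no content, or id with content.

    Simplification: "valid content" is approximated as "non-empty".
    """
    has_content = len(element) > 0 or bool((element.text or "").strip())
    if element.get("idref") is not None:
        return not has_content
    return element.get("id") is not None and has_content


# Hypothetical sample elements, not taken from the EML schemas.
pointer_ok = satisfies_cooccurrence(ET.fromstring('<contact idref="p1"/>'))
pointer_bad = satisfies_cooccurrence(
    ET.fromstring('<contact idref="p1"><surName>Jones</surName></contact>'))
holder_ok = satisfies_cooccurrence(
    ET.fromstring('<creator id="p1"><surName>Jones</surName></creator>'))
```

A rule like this has to live in application code, which is the complication the proposal in section 3 tries to avoid.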
3) Proposed approach
--------------------
Our general approach in EML has been to create ComplexTypes (CT) when we
wanted a particular block to be reusable. I propose that this concept
be extended by adding an optional attribute named "id" of type "xs:ID"
for each ComplexType. This allows us to uniquely address each block
defined by a CT, and any XML 1.0 parser will validate that all of the
"id" values are in fact locally unique. For the "ResourceBase" CT, this
new id attribute would replace the current "identifier" element and would
also act as the overall identifier for the package. ResourceBase would
also have the "system" attribute (from identifier) for globally scoping
the id.
Next, we would change the content model for each CT to be a choice
between the existing content model and a new element named "references"
of type "xs:string". This element will be used to hold a reference to
an existing subtree identified by its id. We use this element instead
of an IDREF to surmount the validation issues mentioned above. This
relationship between the "references" element and the "id" identifiers
will be enforced by defining an XML Schema "key" for the "id" attributes
and a "keyref" for the "references" elements. Thus, any XML parser that
supports XML Schema validation will be able to validate the
correspondence between each "id" and "references" field (e.g., Xerces
2.0 supports this). I've attached a picture of ResponsibleParty as
modified using this approach for illustration purposes (but note it
doesn't show the "id" attribute on "ResponsibleParty").
Here's a fragment of an example XML doc to illustrate:
...
<creator id="p1">
  <individualName><surName>Jones</surName></individualName>
</creator>
<associatedParty>
  <references>p1</references>
  <role>lackey</role>
</associatedParty>
<contact>
  <references>p1</references>
</contact>
...
Note that this even works for types that extend other types as long as
the subclass is the one that does the referencing (e.g., associatedParty
can reference creator, but not vice versa). This rule will actually be
enforced by validating parsers.
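
For goal (a), a processor has to substitute the referenced subtree wherever a "references" element appears. The following is a rough Python sketch of that step using only the standard library. The <dataset> wrapper and the function name resolve_references are assumptions for illustration, and the sketch does not handle chained references.

```python
import copy
import xml.etree.ElementTree as ET

# The fragment from this email, wrapped in a hypothetical <dataset> root.
DOC = """
<dataset>
  <creator id="p1">
    <individualName><surName>Jones</surName></individualName>
  </creator>
  <associatedParty>
    <references>p1</references>
    <role>lackey</role>
  </associatedParty>
  <contact>
    <references>p1</references>
  </contact>
</dataset>
"""


def resolve_references(root):
    """Replace each <references> child with copies of the target's children."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    for parent in list(root.iter()):            # snapshot before mutating
        for index, child in enumerate(list(parent)):
            if child.tag != "references":
                continue
            target = by_id[child.text.strip()]  # KeyError if the id is unknown
            parent.remove(child)
            for offset, node in enumerate(target):
                parent.insert(index + offset, copy.deepcopy(node))
    return root


root = resolve_references(ET.fromstring(DOC))
contact_surname = root.findtext("contact/individualName/surName")
party_tags = [child.tag for child in root.find("associatedParty")]
```

After resolution, contact and associatedParty each carry their own copy of the individualName subtree, and associatedParty keeps its extra role element.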
Existing modules that are currently associated via triples will instead
be directly included in the content models (e.g., entity will contain
attributeList), but the "references" element allows us to define each
attributeList only once and reference it in the other entities that
share it.
So, that lets us reuse portions of documents, satisfying goal 1 while
still minimizing our processor needs. In a worst-case scenario, if a
schema validator is not available, we can still validate the ids as
unique because they are defined as type xs:ID.
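
If no schema-aware parser is at hand, the same two constraints (unique ids, and every reference pointing at an existing id) can also be checked by hand. Here is a small Python sketch of such a fallback check. It is not the XML Schema key/keyref mechanism itself, and the function name check_links and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_links(xml_text):
    """Return a list of problems: duplicate ids and dangling references."""
    root = ET.fromstring(xml_text)
    ids = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    problems = [f"duplicate id: {value}" for value, count in ids.items() if count > 1]
    for ref in root.iter("references"):
        key = (ref.text or "").strip()
        if key not in ids:
            problems.append(f"dangling reference: {key}")
    return problems


# Hypothetical documents: one consistent, one with both kinds of error.
GOOD = '<eml><creator id="p1"/><contact><references>p1</references></contact></eml>'
BAD = '<eml><creator id="p1"/><contact id="p1"><references>p2</references></contact></eml>'

good_problems = check_links(GOOD)
bad_problems = check_links(BAD)
```

The first document yields no problems. The second reports the repeated id "p1" and the reference to the undefined id "p2".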
The key and keyref are defined in the eml.xsd module. In this scenario,
a package is defined by all of the content included in the <eml> tag,
including the nested modules like attribute in entity. The only thing
we lose with this approach is the ability to use alternative specs
(e.g., use something other than eml-attribute for attribute
descriptions) for a given module because they will be included directly
in the content models, but that's not a very big deal. The content
model of the eml element requires one of the types that extend resource
(dataset, lit, software, ...), and then has an optional, repeatable
element "additionalMetadata" with content model ANY in which arbitrary
other metadata docs can be placed. The additionalMetadata element has
an id attribute and another attribute named "describes" that is a
reference to an id with which this subtree should be associated. The
nature of the association is implied by the types of the document (i.e.,
role/predicate/property/relationship is not specified directly). The
reference/id linkage is enforced by defining another "keyref"
constraint. So, this lets us add arbitrary metadata documents and point
them at existing ids in the tree. Thus, the id serves as both ends of
the link (subject and object in RDF terms) depending on whether it is
referred to in a "references" element or in a "describes" attribute.
This satisfies our second goal of being able to include arbitrary
metadata types.
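
As a rough illustration of how a processor might use the "describes" attribute, the Python sketch below groups each additionalMetadata subtree under the id it points at. The document content (including the fundingNote element) and the function name metadata_by_subject are hypothetical, and the real schemas may differ in structure and namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical package: element names other than additionalMetadata,
# describes, and id are placeholders for illustration.
DOC = """
<eml>
  <dataset id="d1">
    <creator id="p1"><individualName><surName>Jones</surName></individualName></creator>
  </dataset>
  <additionalMetadata id="m1" describes="p1">
    <fundingNote>hypothetical extension content</fundingNote>
  </additionalMetadata>
</eml>
"""


def metadata_by_subject(root):
    """Map each described id to the additionalMetadata elements about it."""
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    links = {}
    for extra in root.iter("additionalMetadata"):
        subject = extra.get("describes")
        if subject not in ids:
            raise ValueError(f"describes points at unknown id: {subject}")
        links.setdefault(subject, []).append(extra)
    return links


links = metadata_by_subject(ET.fromstring(DOC))
described = sorted(links)
first_child = links["p1"][0][0].tag
```

The processor never needs to understand the content of additionalMetadata. It only needs the id it describes, which is what keeps the package extensible.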
I've attached a sample xml document illustrating these concepts that
validates using the Xerces schema processor. You'll need to check out
the updated schema files (obviously) for it to work. I've also modified
the SAX validator script to optionally include schema validation if you
have xerces2 on your classpath.
Lots of stuff to ponder, for sure. I didn't go into detail about the
several other approaches that I've considered and rejected, because I'm
trying to keep this email somewhat manageable in terms of length.
Thanks for your feedback.
Matt
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eml-party.png
Type: image/png
Size: 5591 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20020523/805dc82b/eml-party.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eml-test.xml
Type: text/xml
Size: 1280 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20020523/805dc82b/eml-test.xml