eml packaging -- proposed changes
Matt Jones
jones at nceas.ucsb.edu
Thu May 23 02:27:20 PDT 2002
Sorry -- I attached the wrong image in my last email. Here's the right
image...
Matt
Matt Jones wrote:
> I've been doing a lot of thinking about the changes in packaging that we
> discussed at Sevilleta. I've also talked it over at some length with
> Chad, Dan, and Chris, and then continued my rumination. We have a
> number of possibilities, and so I have been weighing the pros and cons
> of each. In this email, I am hoping to outline my thoughts in terms of
> 1) what we're trying to achieve with packaging, 2) what technologies are
> available and their possible problems, and 3) a proposed approach.
> Needless to say, this is going to be a long email (just think how long
> we could talk about it :^).
>
> I have checked in some changes to a subset of the eml modules (eml,
> resource, dataset, party, coverage) that demonstrate how this new
> solution would work. You might want to grab those to look at. I'm
> looking forward to your feedback.
>
> 1) our goals for linking are:
> -----------------------------
> a) reduce repeated information by allowing references internally to
> existing subtrees (this requires an ability to substitute a whole
> subtree for a reference when processing);
> b) provide extensibility by allowing new metadata types to be added to a
> package and associated without knowing ahead of time what they are (this
> requires an ability to state an association between two metadata
> components).
>
> 2) Technologies available
> -------------------------
> At least five linking technologies are available to us: EML triples,
> ID/IDREF, XLink, RDF, and XML Schema key/keyref. ID/IDREF are part of XML
> 1.0 and require only a validating XML parser. The others require
> an additional processor for validation. All of them require additional
> processing in order to use the links beyond validation (i.e., to resolve
> them). ID/IDREF and key/keyref do not have a concept of a "role" for
> the link; the others do. Our triples are a non-standard
> equivalent to RDF. We could also use XPath/XPointer addresses, but
> because these can be relative, they can break when the document changes,
> and so are not particularly robust for our purposes. XLink requires
> that the links be attributes in the xlink namespace,
> as well as requiring an additional processor. ID/IDREF must also be attributes.
>
> The RDF Statement is generally used at a finer granularity than we
> are using it, in that the predicate/role/property is usually atomic
> (e.g., "creator"). We, however, use it to point two complex structures
> at one another, and so the role provides no additional information that
> is not already implicit in the document types of the subject and object.
> Consequently, I propose that we do not actually need a "role" for our
> linking purposes. Thus, eml triples and RDF are probably overkill. We
> could talk this over for a long time.
>
> XLink allows a role and other link metadata, but requires the use of
> the particular xlink attributes on your linking elements (e.g.,
> <mylinkelement xlink:href="some-link-uri"/>). ID/IDREF allows you to
> create a link id or idref attribute with any name on any element (e.g.,
> <mylinkelement ref="someid"/>). The xlink:show attribute allows one to
> specify what to do with a link, including values like "embed" for
> in-place substitution and "new" to indicate a link to be followed. These
> loosely correspond to our desire for both replacement and pointer links.
> XLink processors
> seem to provide more convenient access to a compiled link database after
> processing, but this is probably a fairly easy library to provide for
> ID/IDREF too. ID/IDREF links MUST be internal to the document, whereas XLink
> can point at external resources. Overall, we thought the simplicity of
> ID/IDREF was good, and that it had the features we needed, but that
> XLink would be almost equivalent. XLink may allow some growth that
> ID/IDREF would not.
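>
> To make the comparison concrete, here is a small sketch of the same
> link written both ways. The element names (package, party, idrefLink,
> xlinkLink) are invented for illustration and are not part of EML; the
> internal DTD subset is only there so that a validating parser treats
> the id and ref attributes as ID and IDREF.
>
> ```xml
> <?xml version="1.0"?>
> <!DOCTYPE package [
>   <!ELEMENT package (party, idrefLink, xlinkLink)>
>   <!ELEMENT party (#PCDATA)>
>   <!ATTLIST party id ID #REQUIRED>
>   <!-- ID/IDREF: the attribute can have any name, here "ref". -->
>   <!ELEMENT idrefLink EMPTY>
>   <!ATTLIST idrefLink ref IDREF #REQUIRED>
>   <!-- XLink: the attribute names are fixed by the xlink namespace. -->
>   <!ELEMENT xlinkLink EMPTY>
>   <!ATTLIST xlinkLink
>     xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink"
>     xlink:type (simple) #FIXED "simple"
>     xlink:href CDATA #REQUIRED
>     xlink:show (new|replace|embed|other|none) #IMPLIED>
> ]>
> <package>
>   <party id="p1">Jones</party>
>   <!-- Checked by the validating parser: "p1" must exist as an ID. -->
>   <idrefLink ref="p1"/>
>   <!-- Not checked by the parser; an XLink processor resolves "#p1". -->
>   <xlinkLink xmlns:xlink="http://www.w3.org/1999/xlink"
>              xlink:type="simple" xlink:href="#p1" xlink:show="embed"/>
> </package>
> ```
>
> The IDREF form only works within one document, while the xlink:href
> value is a URI and so could just as well point at another document.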
>
> Our first goal, to reduce redundancy by using references to other
> identified subtrees, introduces some issues with validation. In
> particular, if we use IDREF, then we would need to write a content model
> where the element content depends on the presence of an id or idref
> attribute, which is technically not possible. For example, the element
> should be considered valid if it either 1) has an idref attribute and no
> content, or 2) has an id attribute and valid content. This kind of
> co-occurrence constraint cannot be expressed in a DTD or in XML Schema.
> So, overall, this whole scheme introduces a huge
> complication for validation, but the proposed solution gets around this
> problem.
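>
> As a rough illustration of the problem, consider the hypothetical
> fragment below. The "parties" wrapper and the "idref" attribute name
> are assumptions made up for this example, not EML definitions.
>
> ```xml
> <parties>
>   <!-- Form 2: id attribute plus full content. Should be valid. -->
>   <creator id="p1">
>     <individualName><surName>Jones</surName></individualName>
>   </creator>
>   <!-- Form 1: idref attribute and no content. Should be valid. -->
>   <contact idref="p1"/>
>   <!-- Should be rejected (a reference and content together), but the
>        content model cannot be made conditional on the attribute, so a
>        grammar that accepts both forms above accepts this one as well. -->
>   <contact idref="p1">
>     <individualName><surName>Smith</surName></individualName>
>   </contact>
> </parties>
> ```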
>
> 3) Proposed approach
> --------------------
> Our general approach in EML has been to create ComplexTypes (CT) when we
> wanted a particular block to be reusable. I propose that this concept
> be extended by adding an optional attribute named "id" of type "xs:ID"
> for each ComplexType. This allows us to uniquely address each block
> defined by a CT, and any XML 1.0 parser will validate that all of the
> "id" values are in fact locally unique. For the "ResourceBase" CT, this
> new id attribute would replace the current "identifier" element and would
> also act as the overall identifier for the package. ResourceBase would
> also have the "system" attribute (from identifier) for globally scoping
> the id.
>
> Next, we would change the content model for each CT to be a choice
> between the existing content model and a new element named "references"
> of type "xs:string". This element will be used to hold a reference to
> an existing subtree identified by its id. We use this element instead
> of an IDREF to surmount the validation issues mentioned above. This
> relationship between the "references" element and the "id" identifiers
> will be enforced by defining an XML Schema "key" for the "id" attributes
> and a "keyref" for the "references" elements. Thus, any XML parser that
> supports XML Schema validation will be able to validate the
> correspondence between each "id" and "references" field (e.g., Xerces
> 2.0 supports this). I've attached a picture of ResponsibleParty as
> modified using this approach for illustration purposes (but note it
> doesn't show the "id" attribute on "ResponsibleParty").
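>
> In schema terms, the pattern could be sketched roughly as follows.
> This is a trimmed-down, hypothetical ResponsibleParty (the real type
> has more fields, and individualName is simplified to a string here);
> it is meant only to show the choice and the id attribute, not to
> reproduce the checked-in module.
>
> ```xml
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <xs:complexType name="ResponsibleParty">
>     <xs:choice>
>       <!-- Either the existing content model (abbreviated here) ... -->
>       <xs:sequence>
>         <xs:element name="individualName" type="xs:string"/>
>         <xs:element name="organizationName" type="xs:string"
>                     minOccurs="0"/>
>       </xs:sequence>
>       <!-- ... or a pointer to a block defined elsewhere by its id. -->
>       <xs:element name="references" type="xs:string"/>
>     </xs:choice>
>     <!-- Optional id so that any instance of the type can be addressed. -->
>     <xs:attribute name="id" type="xs:ID" use="optional"/>
>   </xs:complexType>
> </xs:schema>
> ```
>
> Because the choice is between two element-based branches, a validator
> can reject an element that mixes "references" with ordinary content,
> which is exactly what an attribute-based IDREF could not give us.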
>
> Here's a fragment of an example xml doc to illustrate:
> ...
> <creator id="p1">
>   <individualName><surName>Jones</surName></individualName>
> </creator>
> <associatedParty>
>   <references>p1</references>
>   <role>lackey</role>
> </associatedParty>
> <contact>
>   <references>p1</references>
> </contact>
> ...
>
> Note that this even works for types that extend other types as long as
> the subclass is the one that does the referencing (e.g., associatedParty
> can reference creator, but not vice versa). This rule will actually be
> enforced by validating parsers.
>
> Existing modules that are currently associated via triples will instead
> be directly included in the content models (e.g., entity will contain
> attributeList), but the "references" element allows us to define each
> attributeList only once and reference it in the other entities that
> share it.
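>
> A sketch of what that sharing could look like in an instance document
> is shown here. The child elements (entityName, attribute,
> attributeName) and all of the values are placeholders assumed for the
> example, not the actual entity or attribute content models.
>
> ```xml
> <dataset>
>   <entity id="e1">
>     <entityName>first table</entityName>
>     <!-- The attributeList is written out once and given an id. -->
>     <attributeList id="al1">
>       <attribute><attributeName>count</attributeName></attribute>
>     </attributeList>
>   </entity>
>   <entity id="e2">
>     <entityName>second table</entityName>
>     <!-- The second entity reuses it by reference instead of repeating it. -->
>     <attributeList>
>       <references>al1</references>
>     </attributeList>
>   </entity>
> </dataset>
> ```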
>
> So, that lets us reuse portions of documents, satisfying goal 1 while
> still minimizing our processor needs. In a worst-case scenario, if a
> schema validator is not available, we can still validate the ids as
> unique because they are defined as type xs:ID.
>
> The key and keyref are defined in the eml.xsd module. In this scenario,
> a package is defined by all of the content included in the <eml> tag,
> including the nested modules like attribute in entity. The only thing
> we lose with this approach is the ability to use alternative specs
> (e.g., use something other than eml-attribute for attribute
> descriptions) for a given module because they will be included directly
> in the content models, but that's not a very big deal. The content
> model of the eml element requires one of the types that extend resource
> (dataset, lit, software, ...), and then has an optional, repeatable
> element "additionalMetadata" with content model ANY in which arbitrary
> other metadata docs can be placed. The additionalMetadata element has
> an id attribute and another attribute named "describes" that is a
> reference to an id with which this subtree should be associated. The
> nature of the association is implied by the types of the document (i.e.,
> the role/predicate/property/relationship is not specified directly). The
> reference/id linkage is enforced by defining another "keyref"
> constraint. So, this lets us add arbitrary metadata documents and point
> them at existing ids in the tree. Thus, the id serves as both ends of
> the link (subject and object in RDF terms) depending on whether it is
> referred to in a "references" element or in a "describes" attribute.
>
> This satisfies our second goal of being able to include arbitrary
> metadata types.
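>
> For readers who want to see the shape of the constraints, here is a
> rough, self-contained sketch. It is not the checked-in eml.xsd: the
> namespace is omitted, dataset and Party are cut down to a few
> placeholder fields, and the constraint names are invented. One
> deliberate substitution: the id side is declared with xs:unique rather
> than xs:key, on the assumption that a key would demand an id on every
> selected element while ids are optional here; a keyref may refer to
> either.
>
> ```xml
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <!-- Placeholder party type: real content or a reference. -->
>   <xs:complexType name="Party">
>     <xs:choice>
>       <xs:element name="organizationName" type="xs:string"/>
>       <xs:element name="references" type="xs:string"/>
>     </xs:choice>
>     <xs:attribute name="id" type="xs:ID"/>
>   </xs:complexType>
>
>   <xs:element name="eml">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="dataset">
>           <xs:complexType>
>             <xs:sequence>
>               <xs:element name="creator" type="Party"/>
>               <xs:element name="contact" type="Party"/>
>             </xs:sequence>
>             <xs:attribute name="id" type="xs:ID"/>
>           </xs:complexType>
>         </xs:element>
>         <!-- Arbitrary extra metadata, pointed at an existing id. -->
>         <xs:element name="additionalMetadata"
>                     minOccurs="0" maxOccurs="unbounded">
>           <xs:complexType mixed="true">
>             <xs:sequence>
>               <xs:any processContents="lax"
>                       minOccurs="0" maxOccurs="unbounded"/>
>             </xs:sequence>
>             <xs:attribute name="id" type="xs:ID"/>
>             <xs:attribute name="describes" type="xs:string"
>                           use="required"/>
>           </xs:complexType>
>         </xs:element>
>       </xs:sequence>
>     </xs:complexType>
>
>     <!-- Every id attribute anywhere in the package. -->
>     <xs:unique name="packageIds">
>       <xs:selector xpath=".//*"/>
>       <xs:field xpath="@id"/>
>     </xs:unique>
>     <!-- Each "references" element must name one of those ids. -->
>     <xs:keyref name="referencesToId" refer="packageIds">
>       <xs:selector xpath=".//references"/>
>       <xs:field xpath="."/>
>     </xs:keyref>
>     <!-- Each "describes" attribute must name one of those ids too. -->
>     <xs:keyref name="describesToId" refer="packageIds">
>       <xs:selector xpath="additionalMetadata"/>
>       <xs:field xpath="@describes"/>
>     </xs:keyref>
>   </xs:element>
> </xs:schema>
> ```
>
> With constraints along these lines, a "references" or "describes"
> value that matches no id in the package should be reported by a
> schema-validating parser.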
>
> I've attached a sample xml document illustrating these concepts that
> validates using the Xerces schema processor. You'll need to check out
> the updated schema files (obviously) for it to work. I've also modified
> the SAX validator script to optionally include schema validation if you
> have xerces2 on your classpath.
>
> Lots of stuff to ponder, for sure. I didn't go into detail about the
> several other approaches that I've considered and rejected, because I'm
> trying to keep this email somewhat manageable in terms of length.
>
> Thanks for your feedback.
>
> Matt
>
>
> ------------------------------------------------------------------------
>
> <?xml version="1.0" encoding="UTF-8"?>
> <eml:eml xmlns:eml="eml:eml-2.0.0beta8"
>          xmlns:cit="eml:literature-2.0.0beta8"
>          xmlns:doc="eml:documentation-2.0.0beta8"
>          xmlns:ds="eml:dataset-2.0.0beta8"
>          xmlns:sw="eml:software-2.0.0beta8"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="eml:eml-2.0.0beta8 eml.xsd">
>   <ds:dataset id="knb.2.1" system="knb">
>     <title>Bison data</title>
>     <creator id="knb.3.1">
>       <individualName>
>         <givenName>Matthew</givenName>
>         <surName>Jones</surName>
>       </individualName>
>       <organizationName>NCEAS</organizationName>
>       <address>
>         <city>Santa Barbara</city>
>         <country>CA</country>
>       </address>
>       <electronicMailAddress>jones at nceas.ucsb.edu</electronicMailAddress>
>       <onlineUrl>http://www.nceas.ucsb.edu</onlineUrl>
>     </creator>
>     <metadataProvider>
>       <references>knb.3.1</references>
>     </metadataProvider>
>     <associatedParty>
>       <references>knb.3.1</references>
>       <role>lackey</role>
>     </associatedParty>
>     <contact>
>       <references>knb.3.1</references>
>     </contact>
>   </ds:dataset>
>   <additionalMetadata id="knb.4.1" describes="knb.3.1">
>     <personalInfo>
>       <nickname>Matt</nickname>
>     </personalInfo>
>   </additionalMetadata>
> </eml:eml>
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eml-party.png
Type: image/png
Size: 2149 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20020523/1e4d7fb2/eml-party.png