eml PARAGRAPH

Tue Jul 16 09:20:16 PDT 2002

Dear eml-dev,

I am very impressed with the way the eml standard is shaping up.  Many
thanks are due to those who have taken leadership in authoring and
revising it:  it is likely (and regrettable) that the extent of their
effort will tend not to be fully appreciated by most end-users.

As an implementer of eml, I still need guidance on the use of
<paragraph>.  It is the only vehicle in eml for representing prose.  I
count ten occurrences of <paragraph> in eml.  Some may be trivial, such
as <dataset><maintenance><paragraph>; others are not, such as
<ResourceGroup><abstract><paragraph>.

The problem, as I have mentioned previously, is that prose metadata
(text) is often highly structured.  <paragraph> gives us no way of
representing the structure of text, which is itself information.  In
many instances, of course, <paragraph> is repeatable, which allows us
some leeway to represent sequential structure.  But there is still no
way to represent hierarchical structure.  This has significant
consequences.  For example, a project-level abstract may include a short
outline of purposes or hypotheses.  A research protocol may include
finely-grained outlines of contingencies and responses.

Three alternative solutions have emerged from previous discussion.
1.  Decompose structured text into a series of <paragraph> elements.
2.  Inject structured text, with its native markup, as a CDATA block in
<paragraph>.
3.  Make <paragraph> nestable.

The decomposition approach has the advantage that it works directly with
eml as currently written.  However, converting hierarchically-structured
text to serially-structured text will require innovations by the data
manager that raise him/her to the status of author, a status not
necessarily sanctioned by those who contributed the original material. 
Furthermore, the result will probably not look good on the web, so the
data manager is forced to keep two versions of each document (a
protocol, for example):  one that looks good on the web, and one that
has been thoroughly decomposed to comply with eml.
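
To make the concern concrete, here is a rough sketch of a decomposed
abstract.  The hypotheses and their wording are invented purely for
illustration; only <abstract> and <paragraph> come from eml.

```xml
<!-- Hypothetical content: the hypotheses below are invented. -->
<abstract>
  <paragraph>This project tests two hypotheses.</paragraph>
  <paragraph>Hypothesis 1: Tillage reduces soil carbon.</paragraph>
  <paragraph>Hypothesis 1a: The effect is strongest in the top 10 cm.</paragraph>
  <paragraph>Hypothesis 2: Cover crops increase nitrogen retention.</paragraph>
</abstract>
```

Note that the relationship between 1 and 1a survives only in the labels
typed into the text; nothing in the markup records it.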

The CDATA approach is certainly legal, requires virtually no effort on
the part of the information manager, and works with eml in its current
form.  But it opens the door to passing a great deal of custom markup
to the eml consumer.  Will the consumer know what to do with all those
custom tags?  It seems contrary to the whole purpose of generating a
standard.
If this is the preferred approach, perhaps we should establish the
expectation that the content of the CDATA block will be XHTML, as
specified by the relevant W3C recommendation.
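
A minimal sketch of the CDATA approach, assuming XHTML as the embedded
markup, might look like this.  The abstract content is invented for
illustration.

```xml
<!-- Hypothetical content; XHTML inside the CDATA block is an assumption. -->
<abstract>
  <paragraph><![CDATA[
    <p>This project tests two hypotheses.</p>
    <ol>
      <li>Tillage reduces soil carbon.
        <ol>
          <li>The effect is strongest in the top 10 cm.</li>
        </ol>
      </li>
      <li>Cover crops increase nitrogen retention.</li>
    </ol>
  ]]></paragraph>
</abstract>
```

An XML parser hands everything between the CDATA delimiters to the
consumer as plain character data, so the consumer must parse that text
a second time to recover the list structure.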

The nestable paragraph approach shuts the door on custom markup while
still allowing the information manager to provide unsupervised,
completely hierarchical transformations of richly structured text.  The
major disadvantage is that it requires editing the draft standard.  It
may not, however, require the creation of any new elements, as long as
the paragraph element can have parsed character content as well as other
paragraphs.  This is a technical issue beyond my competence.
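
For what it is worth, XML itself permits an element to mix character
data with child elements; in DTD notation that would be declared as
<!ELEMENT paragraph (#PCDATA | paragraph)*>.  How this would be
expressed in the eml schema is an open question here.  An instance
using nested paragraphs might look like the sketch below, with invented
content.

```xml
<!-- Hypothetical content; nesting is NOT allowed by the current draft. -->
<abstract>
  <paragraph>This project tests two hypotheses.
    <paragraph>Tillage reduces soil carbon.
      <paragraph>The effect is strongest in the top 10 cm.</paragraph>
    </paragraph>
    <paragraph>Cover crops increase nitrogen retention.</paragraph>
  </paragraph>
</abstract>
```

Here the hierarchy is carried by the markup rather than by labels in
the text, and no element other than <paragraph> is needed.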

I hope that the leadership of the eml development community will offer
me some guidance on this issue.  I really don't think number 1 is a
viable option, but I could make peace with either 2 or 3.

Best regards,

Tim
-- 
Tim Bergsma
LTER Information Manager
W.K. Kellogg Biological Station
Michigan State University
Hickory Corners, MI   49060
616/671-2337
tbergsma at kbs.msu.edu
http://lter.kbs.msu.edu