[eml-dev] EML 2.0.2 changes to text leaf nodes

Sat Mar 22 08:03:50 PDT 2008

Mark Servilla wrote:

> Hi Everyone,
>
> This is a great discussion, and certainly presses the issue of meaning 
> versus presentation in XML.  In my humble opinion, I disagree with the 
> movement toward allowing more presentation-like tagging within EML 
> specifically, and XML in general.  I realize that it simplifies the 
> decisions to be made within the rendering process, but it does not add 
> any meaning to the textual components of the content in question 
> except when inferred by the human who is viewing the rendered 
> content.  This is because the "meaning" is context sensitive.

I like openings of the "in my humble opinion" kind; they
remind me of the days of Usenet before AOL (~1994),
when flame wars were not the [Usenet] norm. We would
all be oh so careful to insert those "markups" to avoid
creating harsh feelings. They made me smile :) .

As you know very well, Mark, XML provides
namespaces to avoid the contextual ambiguities and
tag collisions that seem to concern you. More specifically,
when two identical tags carry two different meanings, as in the
case you mention, the formula superscript would have the
MathML namespace as a cue to readers and machines.
So if we need to use the same tag name for a
footnote (superscript) and a math formula power (superscript),
we would qualify each with the namespace of its parent schema. In
EML, we borrowed DocBook 4.* tags, so we wouldn't
even need to use the DocBook namespace. (Those are
still unique tag names.)
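
A minimal sketch of the namespace point, using Python's standard
xml.etree.ElementTree module. Both namespace URIs and the sample
paragraph are invented for illustration; they are not part of EML,
DocBook, or MathML. The sketch only shows that two elements with the
same local name stay distinguishable once each is namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, made up for this illustration only.
NOTE_NS = "urn:example:footnote"
MATH_NS = "urn:example:math"

doc = f"""<para xmlns:n="{NOTE_NS}" xmlns:m="{MATH_NS}">
  Area scales as r<m:superscript>2</m:superscript>
  in this plot<n:superscript>1</n:superscript>.
</para>"""


def classify_superscripts(xml_text):
    """Return (kind, text) pairs for every element named 'superscript'."""
    kinds = {NOTE_NS: "footnote", MATH_NS: "exponent"}
    found = []
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as "{namespace-uri}localname".
        if elem.tag.startswith("{") and elem.tag.endswith("}superscript"):
            uri = elem.tag[1:].split("}", 1)[0]
            found.append((kinds.get(uri, "unknown"), elem.text))
    return found


result = classify_superscripts(doc)
print(result)
```

The same local name appears twice, yet a reader (human or machine)
can tell the exponent from the footnote by the namespace alone.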

We argue that those format-oriented tags do not
interfere with the content, since any machine parser
may ignore them if its purpose
is just to extract specific content.
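
As a rough illustration of ignoring the format-oriented tags, here is
a sketch in Python with the standard xml.etree.ElementTree module.
The title is the one from Mark's message quoted below; the function
name "flatten" is my own choice, not anything defined by EML.

```python
import xml.etree.ElementTree as ET

title_xml = (
    "<title>Sex in <emphasis>Ephedra trifurca</emphasis> (Ephedraceae) "
    "with Relation to Chihuahuan Desert Habitats</title>"
)


def flatten(xml_text):
    """Drop all child markup and return only the character content."""
    root = ET.fromstring(xml_text)
    # itertext() yields the text and tail of every node in document order.
    return "".join(root.itertext())


flat = flatten(title_xml)
print(flat)
```

A consumer that only wants the string pays one line for the markup;
a consumer that wants the italics still has them.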

However, if we do not have the ability to indicate
to the machine, reader, or parser where a newline falls,
we are losing a VERY important cue to the reader.
And the last recipient of the EML content is always
a person (a scientist, a lawmaker, a K-12 student),
not a machine (at least for now).

You can argue that we would instruct a parser
(for example) to read the "\n" (Unix) or the "\r\n" (Windows)
and to consider all the OS-dependent ways of
encoding a newline. True. But we chose XML
precisely to have a machine-independent,
portable, unambiguous way to communicate content
AND format: why rely on magic rules to reproduce
(as a best guess) the original formatting?
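
To make the "magic rules" concrete, this is roughly what a consumer
of a bare string ends up writing. It is only a sketch in Python; the
function name and the sample text are invented, and it covers just
three common conventions ("\r\n", a lone "\r", and "\n").

```python
def split_lines(raw):
    """Guess line breaks in a bare string by unifying OS-specific endings."""
    # Replace the two-character sequence first, so that "\r\n" is not
    # counted as two separate line breaks.
    unified = raw.replace("\r\n", "\n").replace("\r", "\n")
    return unified.split("\n")


sample = "Plot A\r\nPlot B\rPlot C\nPlot D"
lines = split_lines(sample)
print(lines)
```

Even when this works, it recovers only where lines ended, not whether
a break was meant as a paragraph, a list item, or literal layout.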

I realize nothing is perfect. We rely heavily on
web browsers (Internet Explorer, Firefox and friends)
to interpret valuable encoding not expressed
in HTML and render the content in a humane
fashion.  The vast majority of web pages are
rendered correctly not because of the perfect
use of HTML, but because of the heavy lifting done by
browsers. In that vein, in the example above,
the newlines are frequently rescued by other means.
But there are plenty of examples, including the
one above, where the format is not quite recovered.
For those instances, it would be good to provide
those formatting choices.

Providing tools (tags) within EML for formatting
does not dramatically impact the content. Not
providing those tags may cripple the content and
give the reader a good headache.

Finally, if you do not want to use those format-oriented
markup tags, you have the choice to provide content
as flat as you please, as those tags are not mandatory.
But by not providing those tags (as a choice) in EML,
we are making the lack of formatting mandatory.

cheers, inigo

>   As an example, I can infer that "Ephedra trifurca" is a species name 
> in the title "Sex in Ephedra trifurca (Ephedraceae) with Relation to 
> Chihuahuan Desert Habitats" Brunt, J.W. et al (1987) because it is 
> italicized and I am familiar with plant ecology.  In this example, the 
> title would be written "<title>Sex in <emphasis>Ephedra 
> trifurca</emphasis> (Ephedraceae) with Relation to Chihuahuan Desert 
> Habitats</title>" based on the suggested changes to the EML 2.0.1 
> schema.  Would it not be more powerful to provide semantic tagging to 
> textual components, thereby giving the content specific and concise 
> meaning?  As an alternative - "<title>Sex in <specie>Ephedra 
> trifurca</specie> (Ephedraceae) with Relation to Chihuahuan Desert 
> Habitats</title>."  In the latter example, "Ephedra trifurca" is 
> clearly defined as a species and the rendering process can decide how 
> to publish the text based on its meaning.  This approach may open a 
> can of worms because of the unlimited number of possible tags, but it 
> is certainly more informative in systems where context cannot be 
> inferred, such as machine-to-machine interactions.  I would make a 
> similar argument against the use of superscript and subscript for use 
> in both chemical and mathematical formulas; the former can easily 
> result in mistaking an exponent for a footnote, while the latter can 
> result in mistaking a chemical formula for a variable index in a 
> mathematical expression.  I believe I understand the motivation for 
> the suggested changes, but I don't believe they will serve as a 
> benefit in the long run.  Please bang on me if I am really missing 
> something here.  And with the economy tanking, it is only my 0.0002 
> cents.
>
> Sincerely,
> Mark
>
>
>
> inigo wrote:
>
>>
>>
>> ...And how do you envision, in practice,  XSL interpreting
>> a bare "string" into formatted text?  If you don't give any cues
>> to XSLT in the form of markup tags (as for when to emphasize,
>> make a newline, or a new section, an underline, or boldface)
>> it is a guessing game.
>>   Those markup tags do not get in the way of content.
>> Whenever desired, XSL can flatten out all the content
>> of a branch (leaf) and pipe it as needed.  On the contrary,
>> without markup for formatting, you lose all the richness
>> associated with text.  Did you ever wonder why the vast
>> majority of people choose <i>MS Word</i> or <i>OpenOffice</i>
>> as opposed to 'vi', 'ed', or DOS 'edit'?  We are not just
>> programming here, we are passing content with certain
>> syntactical and formal cues to the reader.  Do you ever
>> wonder why a scientist in Geneva decided to come
>> up with HTML?  Maybe adding some tags (title, underline,
>> strikeout, italics, boldface, a suite of fonts, etc.) was
>> not such a bad idea to replace the good ol' gophers.
>> Imagine e-commerce in flat text.
>>
>> At the extreme is the case of people who pass ASCII-based
>> "maps" of plot divisions (e.g., Cedar Creek LTER),
>> completely destroyed by the Metacat stylesheets, which
>> are unable to honor even minimal markup (such
>> as "literalLayout").  But how about methodologies that
>> are not well described by the tandem <substep>-<description>?
>> A little format goes a long way in helping the reader.
>> And it does not get that much in the way of
>> the "content".  But
>> Christopher Jones wrote:
>>
>>> Hi all,
>>>
>>> I strongly agree that content and presentation, ideally, should be  
>>> kept separate by allowing stylesheets to handle the latter.  I'm  
>>> struggling a bit with what constitutes 'content'.  A structural tag  
>>> such as <title> lends 'meaning' to the contained text, at least in  
>>> English.  A <b> tag in HTML seems much more presentational - it  
>>> doesn't add meaning, merely emphasis.  However, when formatting  
>>> conventions in scientific domains lend 'meaning' to text, like  
>>> italicizing species binomials, it seems that we need to provide the  
>>> facility for this, lest we lose semantic information.
>>>
>>> I agree with Wade that we walk a fine line here between expressing  
>>> semantics and presenting.  Cluttered EML docs could abound.  Is the  
>>> preservation of 'meaning' worth the trade-off?
>>>
>>> On Mar 20, 2008, at 3:06:43 PM, Wade Sheldon wrote:
>>>  
>>>
>>>> I think your casual example makes this point very well - what real
>>>> use is preserving <emphasis> markup in a data set title? That's
>>>> what XSL is for. If this is a legacy issue for some metadata
>>>> providers, then I think they should be encouraged (or helped) to
>>>> offload embedded display markup when porting to EML.
>>>>     
>>>
>>>
>>> True, my example was a bit simple.  A better example would be the  
>>> species binomial case:
>>>
>>> <title>
>>>    Acetylene reduction and 15N2 uptake rates for
>>>    <emphasis>Alnus tenuifolia</emphasis> and
>>>    <emphasis>Alnus crispa</emphasis>
>>>    in six different successional habitats
>>> </title>
>>>
>>> where the stylesheet renders emphasis tags nested in title tags
>>> in italics.  This certainly is a presentation issue, but one that
>>> imparts meaning based on known conventions.  Notice how the 15N2
>>> also seems to lose meaning in this title without appropriate
>>> formatting.
>>>
>>> Perhaps there is another way to deal with this, though?  It seems
>>> too big of a job to try to infer meaning from straight xs:string
>>> word combinations (such as Alnus tenuifolia) and then present it
>>> correctly with the right markup for presentation.
>>>
>>> On Mar 20, 2008, at 3:22:06 PM, inigo wrote:
>>>  
>>>
>>>> Margaret O'Brien and I, with the help of Mark Servilla and, to some
>>>> extent, J. Brunt and Corinna Gries, worked on this minor fix. In it,
>>>> we addressed the bug that Chris is talking about, yet the workaround
>>>> that Chris is proposing does not fix the fact that there are
>>>> DocBook 4.* schema tags present in the documentation module of EML
>>>> that are not declared in the text module of EML. Examples are <url>
>>>> and <citetitle>. By redefining the types, we address these errors
>>>> partially, yet some stringent XML editors (XML Spy 2007, 2008) will
>>>> flag these undeclared tags as critical errors. This makes the
>>>> schema rather unprofessional.
>>>>     
>>>
>>>
>>> On Mar 20, 2008, at 3:39:10 PM, James Brunt wrote:
>>>  
>>>
>>>> Also, I'm in agreement with Inigo that making the schema "clean"  
>>>> should be a priority in this bug-fix release.
>>>>     
>>>
>>>
>>> Fair enough.  Consistent and complete support for either DocBook
>>> 4.x or DocBook 5.x throughout the EML schemas (in the eml-text
>>> module and the documentation tags in every module) seems like a
>>> good goal, and one that isn't particularly onerous.  Likewise, an
>>> audit of the documentation tags is in order to ensure completeness.
>>>
>>> Questions -
>>>
>>> Have the EML-2.0.2 proposed fixes stated in the "Community opinion
>>> on minor revision of EML" post been implemented in a branch in the
>>> Ecoinformatics EML repository? If so, are they tagged?
>>>
>>> Besides bug #s 2054 and 2073, have the other 11 bullets in this
>>> email post been entered into the ecoinfo bugzilla?
>>>
>>> Cheers,
>>> Chris
>>> _________________________________________________________________
>>> christopher jones       cjones at msi.ucsb.edu      (805) 680-5946
>>> marine science institute  university of california, santa barbara
>>> _________________________________________________________________
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Eml-dev mailing list
>>> Eml-dev at ecoinformatics.org
>>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev
>>>   
>>
>>
>>
>
>