[eml-dev] EML 2.0.2 changes to text leaf nodes

Mon Mar 24 08:13:35 PDT 2008

A Monday morning discussion on this subject with Duane Costa has
provided me with some excellent clarity on this issue.

My previous, and perhaps overzealous, musings on separating meaning
from presentation are best restated as follows: some elements in EML
are well structured and should be reserved for the pure semantic
meaning of the content (e.g., taxonomicCoverage) so that machine-based
parsing can be effective, while other elements best convey meaning
through presentation-based formatting (e.g., title).
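
A minimal sketch of this distinction, assuming Python's standard
xml.etree.ElementTree and a simplified, namespace-free fragment (it is
not a valid EML document). The helper names flatten_text and
taxon_ranks are hypothetical and only for illustration:

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment for illustration only (assumption:
# real EML documents carry namespaces and many more required elements).
FRAGMENT = """
<dataset>
  <title>Sex in <emphasis>Ephedra trifurca</emphasis> (Ephedraceae) with Relation to Chihuahuan Desert Habitats</title>
  <coverage>
    <taxonomicCoverage>
      <taxonomicClassification>
        <taxonRankName>Genus</taxonRankName>
        <taxonRankValue>Ephedra</taxonRankValue>
        <taxonomicClassification>
          <taxonRankName>Species</taxonRankName>
          <taxonRankValue>trifurca</taxonRankValue>
        </taxonomicClassification>
      </taxonomicClassification>
    </taxonomicCoverage>
  </coverage>
</dataset>
"""


def flatten_text(element):
    """Join all text under element, dropping inline markup and extra whitespace."""
    return " ".join("".join(element.itertext()).split())


def taxon_ranks(coverage):
    """Return (rank name, rank value) pairs for every taxonomicClassification."""
    pairs = []
    for node in coverage.iter("taxonomicClassification"):
        # findtext only looks at direct children, so nested ranks stay separate.
        pairs.append((node.findtext("taxonRankName"), node.findtext("taxonRankValue")))
    return pairs


root = ET.fromstring(FRAGMENT)
title_text = flatten_text(root.find("title"))
ranks = taxon_ranks(root.find("coverage/taxonomicCoverage"))
print(title_text)
print(ranks)
```

The structured element yields rank/value pairs directly, while the title
can only be flattened to a string; nothing in the title itself tells the
program that the emphasized words are a species name.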

Duane said it quite eloquently: "the former can be rigorously
machine-parsed, while the latter is a blob of text that is inherently
more presentation oriented."

If this is a better appreciation of the issue and basically a
restatement of Margaret's email (I should have read more carefully),
then I agree with her note on the judicious replacement of xs:strings
with txt:TypeText.

My apologies for any distractions that I may have caused.

Sincerely,
Mark

Mark Servilla wrote:
> Namespaces only disambiguate the domain use of the specific tag.
> Presentation-based tags, even when contextually constrained by a
> namespace, still do not add concise semantic meaning to unique terms.
> 
> To be clear, I am not advocating the removal of presentation tags
> (DocBook-like or other) when used to identify the layout and
> presentation of text, but rather the removal of these tags as a
> substitute for assigning meaning to text - e.g., emphasis equates to
> species.  I will concede that the DocBook tags make rendering of content
> quite nice, especially when displaying the complex information that
> often is part of EML.  And yes, end-consumers of such information rely
> on such cues to infer meaning.  I believe, however, that allowing
> presentation markup for text-based content creates the possibility of
> misinterpretation by both the producer (e.g., assuming that emphasis
> clearly defines species) and the consumer (e.g., assuming that italics
> imply species).  And certainly applications cannot make such
> context-based inferences.  It is this slippery slope that I wish to avoid.
> 
> And by the way, I use "vi" often in LaTeX document preparation ;-).
> 
> Mark
> 
> 
> inigo wrote:
>> Mark Servilla wrote:
>>
>>> Hi Everyone,
>>>
>>> This is a great discussion, and certainly presses the issue of meaning 
>>> versus presentation in XML.  In my humble opinion, I disagree with the 
>>> movement toward allowing more presentation-like tagging within EML 
>>> specifically, and XML in general.  I realize that it simplifies the 
>>> decisions to be made within the rendering process, but it does not add 
>>> any meaning to the textual components of the content in question 
>>> except when inferred by the human who is viewing the rendered 
>>> content.  This is because the "meaning" is context sensitive.
>> I like openings of the "in my humble opinion" kind; they
>> remind me of the days of Usenet before AOL (~1994),
>> when flame wars were not the [Usenet] norm. We would
>> all be oh so careful to insert those "markups" to avoid
>> creating harsh feelings. They made me smile :) .
>>
>> As you know very well, Mark, the XML schema provides us
>> with namespaces to avoid the contextual ambiguities and
>> tag collisions that seem to concern you. More specifically,
>> given two identical tags with two different meanings, as in the
>> case you mention, the formula superscript would have the
>> MathML namespace as a cue to readers and machines.
>> So if we need to use identical tag names for a
>> footnote (superscript) and a math formula power (superscript),
>> we would use the parent schema namespaces. In
>> EML, we borrowed DocBook 4.* tags, so we wouldn't
>> even need to use the DocBook namespace. (Those are
>> still unique tag names.)
>>
>> We argue that those format-directed tags do not
>> interfere with the content as any machine parsing
>> may ignore format oriented tags if the parsing purpose
>> is to just extract specific content.
>>
>> However, if we do not have the ability to indicate
>> to the machine, reader, or parser where a newline is,
>> we lose a VERY important cue to the reader.
>> And the last recipient of the EML content is always
>> a person (a scientist, a lawmaker, a K-12 student),
>> not a machine (at least for now).
>>
>> You can argue that we would instruct a parser
>> (for example) to read the "\n" (Unix) or the "\r\n" (Windows)
>> and to consider all the OS-dependent ways of
>> encoding a newline. True. But we chose XML
>> precisely to have a machine-independent,
>> portable, unambiguous way to communicate content
>> AND format: why rely on magic rules to reproduce
>> (as a best guess) the original formatting?
>>
>> I realize nothing is perfect. We rely heavily on
>> internet browsers (Internet Explorer, Firefox, and friends)
>> to interpret valuable encoding not expressed
>> in HTML and render the content in a humane
>> fashion.  The vast majority of web pages are
>> rendered correctly not because of the perfect
>> use of HTML, but because of the heavy lifting done by
>> browsers. In that vein, in the example above,
>> the newlines are frequently rescued by other means.
>> But there are plenty of examples, including the
>> one above, where the format is not quite recovered.
>> For those instances, it would be good to provide
>> those formatting choices.
>>
>> Providing tools (tags) within EML for formatting
>> does not dramatically affect the content. Not
>> providing those tags may cripple the content and
>> give the reader a good headache.
>>
>> Finally, if you do not want to use those format-oriented
>> markup tags, you have the choice to provide content
>> as flat as you please, as those tags are not mandatory.
>> But by not providing those tags (as a choice) in EML,
>> we are making the lack of formatting mandatory.
>>
>> cheers, inigo
>>
>>>   As an example, I can infer that "Ephedra trifurca" is a species name
>>> in the title "Sex in Ephedra trifurca (Ephedraceae) with Relation to
>>> Chihuahuan Desert Habitats" Brunt, J.W. et al (1987) because it is
>>> italicized and I am familiar with plant ecology.  In this example, the
>>> title would be written "<title>Sex in <emphasis>Ephedra
>>> trifurca</emphasis> (Ephedraceae) with Relation to Chihuahuan Desert
>>> Habitats</title>" based on the suggested changes to the EML 2.0.1
>>> schema.  Would it not be more powerful to provide semantic tagging to
>>> textual components, thereby giving the content specific and concise
>>> meaning?  As an alternative - "<title>Sex in <species>Ephedra
>>> trifurca</species> (Ephedraceae) with Relation to Chihuahuan Desert
>>> Habitats</title>."  In the latter example, "Ephedra trifurca" is
>>> clearly defined as a species and the rendering process can decide how
>>> to publish the text based on its meaning.  This approach may open a
>>> can of worms because of the unlimited number of possible tags, but it
>>> is certainly more informative in systems where context cannot be
>>> inferred, such as machine-to-machine interactions.  I would make a
>>> similar argument against the use of superscript and subscript in
>>> both chemical and mathematical formulas; the former can easily
>>> result in mistaking an exponent for a footnote, while the latter can
>>> result in mistaking a chemical formula for a variable index in a
>>> mathematical expression.  I believe I understand the motivation for
>>> the suggested changes, but I don't believe they will serve as a
>>> benefit in the long run.  Please bang on me if I am really missing
>>> something here.  And with the economy tanking, it is only my 0.0002
>>> cents.
>>>
>>> Sincerely,
>>> Mark
>>>
>>>
>>>
>>> inigo wrote:
>>>
>>>> ...And how do you envision, in practice,  XSL interpreting
>>>> a bare "string" into formatted text?  If you don't give any cues
>>>> to XSLT in the form of markup tags (as for when to emphasize,
>>>> make a newline, or a new section, an underline, or boldface)
>>>> it is a guessing game.
>>>>   Those markup tags do not get in the way of content.
>>>> Whenever desired, XSL can flatten out all the content
>>>> of a branch (leaf) and pipe it as needed.  On the contrary,
>>>> without markup for formatting, you lose all the richness
>>>> associated with text.  Did you ever wonder why the vast
>>>> majority of people choose <i>MS Word</i> or <i>OpenOffice</i>
>>>> as opposed to 'vi', 'ed', or DOS 'edit'?  We are not just
>>>> programming here, we are passing content with certain
>>>> syntactical and formal cues to the reader.  Do you ever
>>>> wonder why a scientist in Geneva decided to come
>>>> up with HTML? Maybe adding some tags (title, underline,
>>>> strikeout, italics, boldface, a suite of fonts, etc.) was
>>>> not such a bad idea to replace the good ol' gophers.
>>>> Imagine e-commerce in flat text.
>>>>
>>>> An extreme case is that of people who pass ASCII-based
>>>> "maps" of plot divisions (e.g., Cedar Creek LTER), which are
>>>> completely destroyed by the Metacat stylesheets that
>>>> are unable to observe even the minimum markup (such
>>>> as "literalLayout"). But how about methodologies that
>>>> are not well described by the tandem <substep>-<description>?
>>>> A little format goes a long way in helping the reader.
>>>> And it does not get that much in the way of
>>>> the "content".  But
>>>> Christopher Jones wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I strongly agree that content and presentation, ideally, should be  
>>>>> kept separate by allowing stylesheets to handle the latter.  I'm  
>>>>> struggling a bit with what constitutes 'content'.  A structural tag  
>>>>> such as <title> lends 'meaning' to the contained text, at least in  
>>>>> English.  A <b> tag in HTML seems much more presentational - it  
>>>>> doesn't add meaning, merely emphasis.  However, when formatting  
>>>>> conventions in scientific domains lend 'meaning' to text, like  
>>>>> italicizing species binomials, it seems that we need to provide the  
>>>>> facility for this, lest we lose semantic information.
>>>>>
>>>>> I agree with Wade that we walk a fine line here between expressing  
>>>>> semantics and presenting.  Cluttered EML docs could abound.  Is the  
>>>>> preservation of 'meaning' worth the trade-off?
>>>>>
>>>>> On Mar 20, 2008, at 3:06:43 PM, Wade Sheldon wrote:
>>>>>  
>>>>>
>>>>>> I think your casual example makes this point very well - what real  
>>>>>> use is preserving <emphasis> markup in a data set title? That's 
>>>>>> what  XSL is for. If this is a legacy issue for some metadata 
>>>>>> providers,  then I think they should be encouraged (or helped) to 
>>>>>> offload  embedded display markup when porting to EML.
>>>>>>     
>>>>> True, my example was a bit simple.  A better example would be the  
>>>>> species binomial case:
>>>>>
>>>>> <title>
>>>>>    Acetylene reduction and 15N2 uptake rates for
>>>>>    <emphasis>Alnus tenuifolia</emphasis> and
>>>>>    <emphasis>Alnus crispa</emphasis>
>>>>>    in six different successional habitats
>>>>> </title>
>>>>>
>>>>> where the stylesheet renders emphasis tags inside title tags in
>>>>> italics.  This certainly is a presentation issue, but one that
>>>>> imparts meaning based on known conventions.  Notice how the 15N2
>>>>> also seems to lose meaning in this title without appropriate
>>>>> formatting.
>>>>>
>>>>> Perhaps there is another way to deal with this, though?  It seems
>>>>> too big of a job to try to infer meaning from straight xs:string
>>>>> word combinations (such as Alnus tenuifolia) and then present it
>>>>> correctly with the right markup for presentation.
>>>>>
>>>>> On Mar 20, 2008, at 3:22:06 PM, inigo wrote:
>>>>>  
>>>>>
>>>>>> Margaret O'Brien and I, with the help of Mark Servilla and, to some
>>>>>> extent, J. Brunt and Corinna Gries, worked on this minor fix. In it,
>>>>>> we addressed the bug that Chris is talking about, yet the workaround
>>>>>> that Chris is proposing does not fix the fact that there are
>>>>>> DocBook 4.* schema tags present in the documentation module of EML
>>>>>> that are not declared in the text module of EML. Examples are <url>
>>>>>> and <citetitle>. By redefining the types, we address these errors
>>>>>> partially, yet some stringent XML editors (XML Spy 2007, 2008) will
>>>>>> flag these undeclared tags as critical errors. This makes the
>>>>>> schema rather unprofessional.
>>>>>>     
>>>>> On Mar 20, 2008, at 3:39:10 PM, James Brunt wrote:
>>>>>  
>>>>>
>>>>>> Also, I'm in agreement with Inigo that making the schema "clean"  
>>>>>> should be a priority in this bug-fix release.
>>>>>>     
>>>>> Fair enough.  Consistent and complete support for either DocBook
>>>>> 4.x or DocBook 5.x throughout the EML schemas (in the eml-text
>>>>> module and the documentation tags in every module) seems like a
>>>>> good goal, and one that isn't particularly onerous.  Likewise, an
>>>>> audit of the documentation tags is in order to ensure completeness.
>>>>>
>>>>> Questions -
>>>>>
>>>>> Have the EML-2.0.2 proposed fixes stated in the "Community opinion
>>>>> on minor revision of EML" post been implemented in a branch in the
>>>>> Ecoinformatics EML repository? If so, are they tagged?
>>>>>
>>>>> Besides bug #s 2054 and 2073, have the other 11 bullets in this
>>>>> email post been entered into the ecoinfo bugzilla?
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>> _________________________________________________________________
>>>>> christopher jones       cjones at msi.ucsb.edu      (805) 680-5946
>>>>> marine science institute  university of california, santa barbara
>>>>> _________________________________________________________________
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Eml-dev mailing list
>>>>> Eml-dev at ecoinformatics.org
>>>>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev
>>>>>   
>>>>

-- 
Mark Servilla, Ph.D.

LTER Network Office
Department of Biology
MSC 03 2020
1 University of New Mexico
Albuquerque, NM 87131-0001

servilla at lternet.edu
Office (505) 277-2619
Cell   (505) 453-8593