[eml-dev] xml:lang attribute for title in EML 2.1.0

Mon Sep 20 14:59:19 PDT 2010

Folks interested in i18n progress - 
I've added a section to the EML documentation that describes the internationalization features implemented so far for EML 2.1.1. Please look it over. Your comments should help make the documentation clearer, and I'd welcome discussion of best practices for generating multilingual EML documents.
Thanks,
-ben

--------------------------------------------------------------------------------------------------
2.8. Internationalization - Metadata in multiple languages

EML supports internationalization using the i18nNonEmptyStringType. Fields defined as this type include:
	• Title
	• Keyword
	• Contact information (e.g. names, organizations, addresses)

TextType fields also support language translations. These fields include:
	• Abstract
	• Methods
	• Protocol

Example 2.1. Internationalization techniques
Core metadata should be provided in English. The core elements can be augmented with translations in a native language. Detailed metadata can be provided in the native language, as declared using the xml:lang attribute. Authors can include English translations of this detailed metadata as they see fit.
The following example metadata document is provided primarily in Portuguese but includes English translations of core metadata fields. The xml:lang="pt_BR" attribute at the root of the EML document indicates that, unless otherwise specified, the content of the document is supplied in Portuguese (Brazil). An xml:lang="en_US" attribute on a child element denotes that the content of that element is provided in English. Core metadata (e.g. title) is provided in English, supplemented with a Portuguese translation using the value tag with an xml:lang attribute. Note that child elements can override the root language declaration of the document as well as the language declaration of their containing elements. The abstract element is given primarily in Portuguese (as inherited from the root language declaration), with an English translation.
Many EML fields are repeatable (e.g. keyword) so that multiple values can be provided for the same concept. Translations for these fields should be included as nested value tags to indicate that they are equivalent concepts expressed in different languages rather than entirely different concepts.

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb" 
    xml:lang="pt_BR"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <!-- English title with Portuguese translation -->    
    <title xml:lang="en_US">
    	Sample Dataset Description
    	<value xml:lang="pt_BR" >Exemplo Descrição Dataset</value>
    </title>
    ...
    <!-- Portuguese abstract with English translation -->    
    <abstract>
      <para>
        Neste exemplo, a tradução em Inglês é secundário
        <value xml:lang="en_US">In this example, the English translation is secondary</value>
      </para>
    </abstract>
    ...
    <!-- two keywords, each with an equivalent translation -->    
    <keywordSet>
      <keyword keywordType="theme">
        árvore
        <value xml:lang="en_US">tree</value>
      </keyword>
      <keyword keywordType="theme">
        água
        <value xml:lang="en_US">water</value>
      </keyword>
    </keywordSet>
    ...
  </dataset>
</eml:eml>
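
For anyone wondering what reading this mixed content looks like on the consuming side, here is a rough sketch in Python using the standard xml.etree.ElementTree module. It is only an illustration, not part of EML or of any existing EML tooling: the translations() helper is a hypothetical name, the embedded document is a trimmed, namespace-free copy of the example above, and language inheritance is simplified to "use the element's own xml:lang, else the document-level one" rather than walking every ancestor.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the example document (assumption: no eml namespace, two fields only).
SAMPLE = """<dataset xml:lang="pt_BR">
  <title xml:lang="en_US">
    Sample Dataset Description
    <value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
  </title>
  <keywordSet>
    <keyword keywordType="theme">
      árvore
      <value xml:lang="en_US">tree</value>
    </keyword>
  </keywordSet>
</dataset>"""


def translations(element, inherited_lang):
    """Hypothetical helper: map language tag -> text for one mixed-content element."""
    # The element's own xml:lang overrides the language it would otherwise inherit.
    lang = element.get(XML_LANG, inherited_lang)
    result = {}
    # Primary text is the element's direct text plus any text trailing its children.
    parts = [element.text or ""] + [child.tail or "" for child in element]
    primary = " ".join(" ".join(parts).split())
    if primary:
        result[lang] = primary
    # Each nested <value> carries the same concept in another language.
    for value in element.findall("value"):
        value_lang = value.get(XML_LANG, lang)
        result[value_lang] = " ".join((value.text or "").split())
    return result


root = ET.fromstring(SAMPLE)
doc_lang = root.get(XML_LANG)
title = translations(root.find("title"), doc_lang)
keyword = translations(root.find("keywordSet/keyword"), doc_lang)
print(title)
print(keyword)
```

With this input the title maps en_US to the English text and pt_BR to the Portuguese translation, while the keyword picks up pt_BR from the root declaration and en_US from its nested value tag. A consumer that only wants English for discovery could then look up the en_US entry and fall back to whatever language is present.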

-------------------------------------------------

On Sep 17, 2010, at 12:14 AM, Markus Döring (GBIF) wrote:

> Impressive.
> Just woke up and all seems to be settled already. I'm happy to see all those changes and the new namespace version.
> I will adapt the gbif recommended subset of eml to work with this 2.1.1 solution right now.
> 
> Thanks so much,
> Markus
> 
> 
> 
> On Sep 17, 2010, at 8:12, ben leinfelder wrote:
> 
>> Matt,
>> Looking over some TFRI business cards I have, I can certainly see a utility in allowing translations for contact information...
>> I've augmented the eml-party schema to allow for internationalized values.
>> Method steps are TextType (a la DocBook) and now support internationalization because of the changes I made to TextType for the 'abstract' element.
>> 
>> The next step will be to update the EML namespace in the schema modules to reflect this minor update. I am targeting "eml://ecoinformatics.org/eml-2.1.1" unless anyone objects.
>> Thanks,
>> -ben
>> 
>> 
>> On Sep 16, 2010, at 8:55 PM, Matt Jones wrote:
>> 
>>> Ben, 
>>> 
>>> I agree with you on the identifier and creator, contact, etc fields as not strictly needing translation.  However, even some of those fields may benefit, such as the use of a name in Mandarin and its Romanized translation.  Also, I think all of the fields that might allow for general text should be included, such as methods, etc.  Certainly anything that accepts TextType might need to be translated in addition to the fields you listed.  Does that sound reasonable?
>>> 
>>> Matt
>>> 
>>> On Thu, Sep 16, 2010 at 6:18 PM, Mark Servilla <servilla at lternet.edu> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>> 
>>> I agree with the changes proposed by Ben in his Jul 30, 2010 email -
>>> that is, the "mixed content" approach.  It is not entirely clear what the
>>> long-term implications of the changes will be for parsing and validation, but
>>> the short-term approach seems manageable (albeit, with effort from some warm
>>> body).  Internationalization of EML is a necessary step as we move research and
>>> its associated data/metadata to a global level.
>>> 
>>> I personally do not see the alternatives as being viable long-term solutions.
>>> Multiple documents of differing languages will ultimately be too cumbersome and,
>>> likely, not kept synchronized.  Introducing another inline content attribute
>>> (i.e., <title>[Language:En]Snow cover...) only adds yet more syntactical parsing
>>> issues.  The use of the xml:lang attribute is, at least, a recognized and
>>> standard approach in many systems.
>>> 
>>> I appreciate and thank Ben and others for their efforts in this matter.
>>> 
>>> Sincerely,
>>> Mark
>>> 
>>> On Sep 16, 2010 1:56 PM, Matt Jones wrote:
>>>> The solution that Ben proposed is meant to address the requirements that arose
>>>> from the iLTER Lake Taihu meeting for providing core metadata in multiple
>>>> languages.  These recommendations then were also at the core of the
>>>> recommendations made to GBIF about which fields should contain English
>>>> translations, but the set of fields differs slightly in the two recommendations.
>>>> Because many of these fields are not currently repeatable according to the EML
>>>> 2.1 schema, we would need to, at a minimum, change cardinality rules to allow
>>>> for each field to be included multiple times if the xml:lang tag were used to
>>>> differentiate them (or for the approach Inigo points to).  As Ben points out, it
>>>> would still be ambiguous as to whether the repeating fields represent different
>>>> information, or the same information translated.  So his proposal is meant to
>>>> explicitly flag translations as such within mixed content string fields, with
>>>> the goal of doing so without breaking existing EML 2.1 compatibility and without
>>>> having to change existing cardinality rules.
>>>> 
>>>> Ben's prior discussion on this highlighted the conflict with the NonEmptyString
>>>> type that was introduced in EML 2.1, in that mixed content elements would not be
>>>> validated and so the rules for NonEmptyString would not be enforced.  I think
>>>> this would only be a small issue, and that the advantages in compatibility
>>>> provided by using a mixed content model for language translations outweigh the
>>>> loss of validation within our string types.  Either way, we would need to add
>>>> the xml:lang attribute so that it can be used throughout EML, including in the
>>>> translation elements that Ben proposed.
>>>> 
>>>> Are there any objections to moving forward with the schema changes to use a
>>>> mixed content model for translations that Ben proposed in his earlier emails?
>>>> 
>>>> Matt
>>>> 
>>>> On Thu, Sep 16, 2010 at 11:35 AM, Inigo San Gil <isangil at canyon.lternet.edu
>>>> <mailto:isangil at canyon.lternet.edu>> wrote:
>>>> 
>>>> 
>>>>   We'll keep our eyes on the ball, then.
>>>> 
>>>>   Meanwhile others have adopted their own solution.
>>>>   Here are two examples:
>>>>   1) a site from Spain reports this implementation
>>>> 
>>>>   <title>[Language:En]Snow cover data provided by MODIS satellite imagery</title>
>>>>   <title>[Language:Sp]Datos de innivaci&#243;n seg&#250;n im&#225;genes
>>>>   MODIS</title>
>>>> 
>>>>   We thought that the use of the XML attribute "lang=en | sp"
>>>>   was interesting -but, among other problems,  we would  have
>>>>   gotten screwed by eml-dev eventual internationalization
>>>>   implementation.  Call it luck, but you can bet the "eventual
>>>>   eml-dev decision" would force us to re-code the EML
>>>>   generation.
>>>> 
>>>>   2) From Taiwan, it is also a mix and match.  I had the
>>>>   internationalization conversation years ago, when we set
>>>>   harvesting into the NBII clearinghouse.  At the TFRI, we
>>>>   found EML documents that have a hybrid of English and
>>>>   Chinese, with no sign whatsoever of the language used.
>>>>   We had to devise a mechanism to detect language.  We
>>>>   simply did not harvest those docs whose critical content
>>>>   was not translated in English.
>>>> 
>>>>   ILTER discussed (two years ago?) some guidelines on
>>>>   how the different countries were going to deal with the
>>>>   tower of Babel problem.  Maybe you can look into those
>>>>   if you feel curious, but if I recall correctly, it went along
>>>>   the lines of encoding the metadata in the native language,
>>>>   and produce some discovery-level EML in English. This
>>>>   strategy would create two EMLs per EML..
>>>> 
>>>>   Sparks or not, I still have to recommend that EML users
>>>>   implement some solution. I'm inclined to suggest that
>>>>   such a solution 1) does not break the current EML rules and
>>>>   2) allows for easy language detection.
>>>>   Spain's case fits here, for example.
>>>> 
>>>>   Cheers, inigo
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>   On 9/16/2010 10:56 AM, ben leinfelder wrote:
>>>> 
>>>>       Hi Markus,
>>>>       I'm afraid your findings are accurate with respect to the xml:lang
>>>>       attribute in the <title> element (or any "NonEmptyStringType" element).
>>>>       In the course of my experimentation with allowing backwards-compatible
>>>>       internationalization with a new EML version (2.1.1) I did have to
>>>>       include the "http://www.w3.org/XML/1998/namespace" namespace just as you
>>>>       did and also declare the xml:lang attribute in elements where I wanted
>>>>       to employ it.
>>>>       While certain EML elements are repeatable, it's not always clear what
>>>>       the presence of multiple elements represents (are they translations in
>>>>       different languages or are they alternate titles?). In order to clarify
>>>>       this confusion and also allow multiple translations for non-repeatable
>>>>       elements I proposed a solution for allowing mixed element content for
>>>>       fields that should be internationalized. There's a fairly comprehensive
>>>>       discussion of this approach in our eml-dev archives:
>>>>       http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2010-July/001828.html
>>>>       I didn't get a lot of decisive feedback and so have not moved forward
>>>>       with releasing an updated EML version. Hopefully this thread will again
>>>>       set the ball rolling.
>>>>       -ben
>>>> 
>>>> 
>>>>   _______________________________________________
>>>>   Eml-dev mailing list
>>>>   Eml-dev at ecoinformatics.org <mailto:Eml-dev at ecoinformatics.org>
>>>>   http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> - --
>>> Mark Servilla, Ph.D.
>>> 
>>> LTER Network Office
>>> Department of Biology
>>> MSC 03 2020
>>> 1 University of New Mexico
>>> Albuquerque, NM 87131-0001
>>> 
>>> servilla at LTERnet.edu
>>> Office (505) 277-2619
>>> Cell   (505) 453-8593
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.8 (Darwin)
>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>> 
>>> iEYEARECAAYFAkySz/4ACgkQqFW3+12RyXOEggCeLtSSf8r3pJty+lv06lk9uSVH
>>> z0YAn1HQNykMFDCt8zIm02bwMv5iecng
>>> =z21i
>>> -----END PGP SIGNATURE-----
>>> 
>> 
>> 
>