[eml-dev] [Bug 585] - internationalization needed in EML

Tue Dec 9 15:15:31 PST 2008

http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585

------- Comment #4 from jones at nceas.ucsb.edu  2008-12-09 15:15 -------
After more conversation on the email list, it seems that the approach to
localization used in ISO 19139 may be applicable here.  A fragment of a 19139
document might include:

<abstract xsi:type="PT_FreeText_PropertyType">
  <gco:CharacterString>Brief narrative summary of the content of the
resource</gco:CharacterString>
  <!--== Alternative value ==-->
  <PT_FreeText>
    <textGroup>
      <LocalisedCharacterString locale="#locale-fr">Résumé succinct
du contenu de la ressource</LocalisedCharacterString>
    </textGroup>
  </PT_FreeText>
</abstract>

So, the PT_FreeText_PropertyType is very similar in concept to the EML
TextType.  We could indeed define a new LocalizedTextType that uses this same
trick, basically allowing textGroup subelements with alternate language
strings.  Or we could simply use the definition of PT_FreeText in EML via an
import, although there may be restrictions on free reuse of ISO standards
that would prevent us from incorporating such a thing directly in EML, as
redistribution is critical to an open standard.  The naming conventions
they've used are also not particularly readable.

Note that this approach depends on a previously defined locale
(locale="#locale-fr"), which is provided by a different set of elements earlier
in the metadata for defining locales.  The locale defines both a language code
and a character encoding for strings in that locale:

  <locale>
    <PT_Locale id="locale-fr">
      <languageCode>
        <LanguageCode
           codeList="resources/Codelist/gmxcodelists.xml#LanguageCode"
           codeListValue="fra"> French </LanguageCode>
      </languageCode>
      <characterEncoding>
        <MD_CharacterSetCode
           codeList="resources/Codelist/gmxcodelists.xml#MD_CharacterSetCode"
           codeListValue="utf8">UTF 8</MD_CharacterSetCode>
      </characterEncoding>
    </PT_Locale>
  </locale>

This is powerful, but it seems to me that the characterEncoding could get one
in trouble with XML parsers if the locale character encoding differs from the
encoding defined in the XML Prolog.  As far as I know, an XML document can have
one and only one character encoding, so mixing different elements with
different encodings will probably mess up standard XML processors.  This would
have to be explored to see if it is a significant issue.
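
As a first, informal probe of that concern, the sketch below feeds two byte
strings to Python's bundled expat-based parser: one whose bytes all match the
declared encoding, and one that declares UTF-8 but carries a single string
written as ISO-8859-1 bytes, as if a per-locale characterEncoding had been
honored for that element.  The <s> element and the helper name parse_text are
made up for the illustration, and other XML processors may report the problem
differently.

```python
import xml.etree.ElementTree as ET


def parse_text(raw: bytes):
    """Return the root element's text, or None if the bytes fail to parse."""
    try:
        return ET.fromstring(raw).text
    except ET.ParseError:
        return None


# Every byte agrees with the declared encoding, so this parses cleanly.
consistent = '<?xml version="1.0" encoding="UTF-8"?><s>Résumé</s>'.encode("utf-8")

# Declared UTF-8, but the element content was written as ISO-8859-1 bytes.
# The parser decodes the whole stream as UTF-8 and hits an invalid sequence.
mixed = (
    b'<?xml version="1.0" encoding="UTF-8"?><s>'
    + "Résumé".encode("iso-8859-1")
    + b"</s>"
)

ok = parse_text(consistent)
bad = parse_text(mixed)
```

Here ok holds the decoded string while bad is None, which suggests that, at
least with this parser, a per-element encoding cannot differ from the one the
document declares.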