[eml-dev] [Bug 585] - internationalization needed in EML

Tue Dec 9 11:07:25 PST 2008

Thanks, Eamonn, for the information.  Very useful.

The frustrating thing about ISO standards is how impossible they are to
obtain.  I have an old draft copy of ISO 19115, but neither I nor the UC
library has a copy of the current standard or of ISO 19139.  I have a
fundamental philosophical problem with standards that are not free and
open.  Nevertheless, I will continue to try to find a copy of these so that
I can look into it.

In the meantime, a couple of comments below in your note...

On Tue, Dec 9, 2008 at 12:23 AM, Eamonn O Tuama (GBIF) <eotuama at gbif.org> wrote:

>  Hi All,
>
> I presume any extensions to EML will involve changes to the schemas and
> therefore versioning. I do not know how complicated that will be – someone
> familiar with the EML schemas construction is best suited to answering.
> However, I think we might be able to learn something from the ISO
> 19115/19139 standard regarding multilingual metadata.
>
> First of all, it provides a distinct set of attributes for the metadata
> document itself (rather than the data the metadata document is describing).
> These include:
>
> 1. fileIdentifier
>
same as the EML packageId attribute

> 2. language
>
see my earlier discussion on this issue

> 3. characterSet
>
same as 'encoding' attribute in the XML prolog

> 4. contact
>
Why would the metadata contact be different from the data contact?  We have
trouble enough keeping one up to date.

> 5. dateStamp
>
this would be useful, we should consider adding it to EML.  Presumably, this
is the date on which the metadata document was last updated, which would
probably belong in the 'maintenance' section of EML

> 6. metadataStandardName
>
provided in the namespace of EML

> 7. metadataStandardVersion
>
provided in the namespace of EML

> 8. locale
>
this could be useful, although providing the language code seems just as
effective, which would make a separate locale essentially redundant.
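
To make the mapping above concrete, here is a minimal Python sketch that reads
the ISO-like fields from the root element of an EML document.  It is only an
illustration under stated assumptions: the sample document, its packageId
value, and the helper name iso_like_fields are invented; the namespace is
assumed to follow the eml://ecoinformatics.org/eml-X.Y.Z pattern; and whether
the EML schema accepts xml:lang on the root element is not checked here.  The
standard-library parser does not expose the prolog's encoding, so characterSet
is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical EML fragment; the packageId value and title are made up.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
         packageId="example.1.1" system="example" xml:lang="en">
  <dataset><title>Forests of New Mexico</title></dataset>
</eml:eml>"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iso_like_fields(xml_text):
    """Return ISO 19115-style document attributes found on the EML root."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # root.tag looks like "{namespace-uri}eml"; strip the braces and local name.
    namespace = root.tag[1:].split("}")[0]
    # Assumed pattern: the last path segment is "<name>-<version>".
    name, _, version = namespace.rpartition("/")[2].partition("-")
    return {
        "fileIdentifier": root.get("packageId"),
        "language": root.get(XML_LANG),
        "metadataStandardName": name,
        "metadataStandardVersion": version,
    }


if __name__ == "__main__":
    for field, value in iso_like_fields(SAMPLE).items():
        print(field, "=", value)
```

Nothing here covers contact or dateStamp, since those would need new or
agreed-upon elements rather than values already present on the root.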

> ISO 19139 (the implementation standard for the conceptual model in ISO
> 19115) also provides a means for encoding multilingual metadata. This is
> achieved through use of an optional, repeatable "locale" attribute
> consisting of language, country and characterset encodings.
>
This sounds interesting.  So, how does it repeat?  XML attributes are not
repeatable, nor do they have substructure.  Is it an element?  And if so, is
it a child element of every other element in the model?

> Multiple instances of locale may be defined for a metadata document and
> translations representing those locales provided for each metadata element.
> So, repeatability in multiple languages is built in.
>
I don't quite see how this would work.  Could you show a brief snippet as an
example?  For example, for the title of the dataset, how would you encode
two titles, each in both English and Spanish, and be able to tell which of
the elements were semantically linked?  Here's one way I could see doing it,
but it's a bit clunky:
 <title>
    <translation xml:lang="en">Forests of New Mexico</translation>
    <translation xml:lang="es">Bosques del Nuevo México</translation>
 </title>
 <title>
    <translation xml:lang="en">Survey of Plants and Animals</translation>
    <translation xml:lang="es">Estudio de Plantas y Animales</translation>
 </title>

How would ISO 19139 propose representing this content?
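
As a rough check that the nested structure above does keep translations
linked, here is a minimal Python sketch that reads it back.  The <translation>
wrapper is only the hypothetical structure proposed in this message, not part
of any released EML schema; the <dataset> container and the helper name
linked_titles are likewise assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical structure from this message; <dataset> is just a container.
SAMPLE = """<dataset>
 <title>
    <translation xml:lang="en">Forests of New Mexico</translation>
    <translation xml:lang="es">Bosques del Nuevo México</translation>
 </title>
 <title>
    <translation xml:lang="en">Survey of Plants and Animals</translation>
    <translation xml:lang="es">Estudio de Plantas y Animales</translation>
 </title>
</dataset>"""


def linked_titles(xml_text):
    """Return one {language: text} mapping per <title> element.

    Translations that share a parent <title> are treated as semantically
    equivalent; separate <title> elements are treated as distinct titles.
    """
    root = ET.fromstring(xml_text)
    return [
        {t.get(XML_LANG): t.text for t in title.findall("translation")}
        for title in root.findall("title")
    ]


if __name__ == "__main__":
    for number, title in enumerate(linked_titles(SAMPLE), start=1):
        for lang, text in title.items():
            print(number, lang, text)
```

Each mapping corresponds to one title, so the question of which elements
are semantically linked is answered by the shared parent element rather than
by any extra identifier.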

> The ability to work with multiple languages is seen as a strong advantage
> in moving from the FGDC metadata standard to the North American Profile
> (NAP) of ISO 19115. The problem, at the moment, is that a biological profile
> in ISO 19115 does not exist but it seems that work is underway to express
> the FGDC Biological Profile in ISO. (I understand that EML based their
> taxonomic module directly on the FGDC biological profile component.)
>
Actually, the BDP standard first got these fields from EML 1.3.x and 1.4.x,
and then EML 2.x reincorporated the changes from the BDP. Either way, we've
been looking at replacing the EML taxonomic module with something more in
line with TDWG standards, in particular with TCS.  I have worked out a new
set of schemas for eml-taxon with Jessie Kennedy and Bob Peet that directly
incorporate TCS, but I haven't had time to introduce these changes to the
rest of the EML community.  On the TODO list.  Nevertheless, as you said,
there's a lot of compatibility between EML and the BDP.

>
>
> The European Union, because of its composition, has always faced the
> challenge of dealing with multiple languages. A document by the European
> Committee for Standardisation (CEN) on "Geographic information — Standards,
> specifications, technical reports and guidelines, required to implement
> Spatial Data Infrastructure" (can't find URL where I downloaded originally
> but have PDF if anyone wants it) provides some insights on "Cultural and
> Linguistic Adaptability" where it places the emphasis on use of multilingual
> thesauri rather than efforts to translate element contents.
>
Interesting.  I'd like to see that.  So, given a metadata document in
Chinese, they are arguing that scientists who speak other languages can
get by with multilingual thesaurus entries in place of the natural language
metadata?  I find this somewhat unconvincing if you really want to re-use
the data.

Thanks for your comments, Eamonn.

Matt

> See also Nowak et al paper "Issues of multilinguality in creating a
> European SDI – the perspective for spatial data interoperability"
>
> http://www.ec-gis.org/Workshops/11ec-gis/papers/309nowak.pdf
>
>
>
> Regards,
>
>
>
> Éamonn
>
>
>
>
>
> *From:* David Blankman [mailto:dblankman1 at gmail.com]
> *Sent:* 08 December 2008 20:59
> *To:* Matt Jones
> *Cc:* inigo san gil; eml-dev at ecoinformatics.org;
> bugzilla-daemon at ecoinformatics.org; Vivian B Hutchison;
> burkeker at gate.sinica.edu.tw; chin at tfri.gov.tw; guoxb at igsnrr.ac.cn;
> hehl at igsnrr.ac.cn; lijh at sdb.cnic.cn; Aikiko Ogawa; Eamonn O Tuama; Kristin
> Vanderbilt; Schentz Herbert; Shang; Su Wen; Werf, Bert van der
> *Subject:* Re: [eml-dev] [Bug 585] - internationalization needed in EML
>
>
>
> As I think back upon the discussions in China and my discussions with Matt
> at ISEI, it seems to me that my initial thought was right: multiple-language
> versions of EML documents are probably better handled by creating separate
> EML documents for each language used. EML is already complex; I see no
> reason to make it more complex.
>
>
> In the ILTER situation we are asking ILTER member networks to provide a
> core of EML in English, on the understanding that more complete metadata may
> be in another language. In this case, should there be an EML module
> (eml-ilter or eml-language, analogous to eml-access) that specifies the
> identifier of the "main" EML document and the language of that document?
> This module might also include an element to record a brief statement about
> the amount of data in that foreign language. I am not sure what else might
> be appropriate for this module. I know that Matt was thinking that there
> might be some modifications to metacat replication that might be needed.
>
> David
>
>
>
>
>  On Mon, Dec 8, 2008 at 1:34 PM, Matt Jones <jones at nceas.ucsb.edu> wrote:
>
> David and I discussed (briefly) some of these issues at ISEI.  And we also
> discussed them at the ILTER meeting in China.  The 'language' tag in
> eml-resource defines the language of the resource, which in the case of
> eml-dataset resources means the language of the data.  Interestingly, we
> don't really have a language tag per se for the EML document content itself,
> except that all XML documents can use the built-in "xml:lang" attribute,
> which is optional for all XML elements (
> http://www.w3.org/TR/REC-xml/#sec-lang-tag).  This allows one to set the
> language for each and every element in an XML document, such as:
>
> <title xml:lang="en">North American Forests</title>
> <title xml:lang="es">Bosques de Norte Americano</title>
>
> Two problems we would need to address with this approach come immediately to
> mind:
>
> 1) Many elements in EML are not repeatable, and therefore it is not
> possible to have one copy of the element in English and another in a
> different language. So cardinality would have to be updated throughout the
> EML schemas, which would make some aspects of validation more confusing.
> 2) For those elements that are already repeatable or are made repeatable
> through a revision, there is no mechanism to indicate that the two element
> nodes are meant to have the same semantic meaning in different languages,
> as opposed to two semantically different elements that happen to also differ
> in their language.
>
> This second issue is the one that would require more structural changes to
> EML.  For example, one might sometimes want to have more than one title
> (which is why title is currently repeatable), but other times want to have
> one title in two different languages.  Either way, EML's current structures
> don't allow these subtleties to be specified.
>
> Matt
>
>
>
> On Fri, Dec 5, 2008 at 12:54 PM, inigo san gil <isangil at lternet.edu>
> wrote:
>
>
> Metadata folks:
>
> I think this opens (perhaps re-opens) an interesting discussion.
>
> EML's resource (main module) offers us a <language> element that,
> as I understand it, serves to specify the language used for the document.
> The cardinality is set to <= 1, so it is optional, and if used, only one
> language.
>
> However, we understood from Kristin Vanderbilt and David Blankman
> that at a recent ILTER meeting, there was an agreement to provide
> referential-level EML for all metadata in English (and perhaps
> richer EML in their native languages).
> The option David proposes, providing content in two languages,
> one being English, does not play well with the EML schema as is.
> There are options in the interim, while we think about whether 'we' tweak
> the EML schema.  Some solutions go in the direction of "duplicating" the
> original EML record: take what is in the native language, and either
> have it translated at some minimal-compliance level of EML (ouch) or
> run it through a translation web service and laugh (or rather cry) at the
> results.
>
> There are of course many other approaches to this problem, Mark
> Servilla mentioned some in the hallways of the LTER Network Office.
>
> The thing is that part of the international community in ecology has
> expressed formal interest/commitment in using EML to document their
> metadata. The ILTER group quickly realized the Babelian challenge
> ahead (see Blankman's ISEI-6 presentation & future paper), and
> David, Akiko Ogawa and others took on helping ILTER provide
> basic EML in English (remember, ILTER committed to use English
> -chinglish and spanglish- as the lingua franca for referential-level EML:
> EML level 1, with at least title, creator, abstract, and contact).
>
> Cheers,
> Inigo
>
>
>
>
> bugzilla-daemon at ecoinformatics.org wrote:
>
> http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585
>
>
>
>
>
> ------- Comment #2 from mob at icess.ucsb.edu  2008-12-05 09:31 -------
> This comment from an email from David Blankman:
> As EML is becoming an international standard, we need to start thinking
> about ways to make EML more intelligent about multiple languages. While EML
> allows multiple titles, there is currently no way to indicate that multiple
> titles are equivalent. For example, if I have:
> <title> North American Forests </title>  AND
> <title> Bosques de Norte Americano</title>
>
> EML currently has no way to indicate that these are the same title, just in
> a different language.
>
> Matt and I were talking about this at the ISEI-Cancun meeting, but I
> thought that it would be a good idea to get this discussion started within
> eml-dev and the ILTER group as well.
> _______________________________________________
> Eml-dev mailing list
> Eml-dev at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev
>
>
>
>
>
>
>   --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Matthew B. Jones
> Director of Informatics Research and Development
> National Center for Ecological Analysis and Synthesis (NCEAS)
> UC Santa Barbara
> jones at nceas.ucsb.edu                       Ph: 1-907-523-1960
> http://www.nceas.ucsb.edu/ecoinfo
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
>
>
> --
> Nature is trying very hard to make us succeed, but nature does not depend
> on us. We are not the only experiment.
>  - R. Buckminster Fuller
>
> If I am not for myself, then who will be for me? If I am for myself alone,
> then who am I? If not now, when?
> - Rabbi Hillel
>

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matthew B. Jones
Director of Informatics Research and Development
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara
jones at nceas.ucsb.edu                       Ph: 1-907-523-1960
http://www.nceas.ucsb.edu/ecoinfo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~