[eml-dev] eml globalization

Thu Jun 24 13:28:34 PDT 2010

Hi --

This is an important issue, and one that I think we should tackle very soon
for EML as we have a lot of new international groups producing EML in many
languages.  I was in Brazil 2 weeks ago setting up a Metacat for PELD and
the issue of supporting multiple languages came up immediately.  We've
discussed this in the past, and the approach I was thinking of is summarized
here:

http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c4

The alternative of producing multiple metadata documents, each in a
different language, leaves no standard way to locate a particular
translation -- I guess it would be done by file naming convention, but that
is difficult to standardize without a specification.
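
As a rough sketch of what such a convention would have to pin down, the
Python below assumes that translations share a packageId apart from a
two-letter language suffix on the scope (the pattern in Inigo's example
further down, e.g. knb-spainlster-snv-en vs. knb-spainlster-snv-sp).  That
convention and both helper names are hypothetical; nothing in EML specifies
them.

```python
def parse_package_id(package_id):
    """Split 'scope.identifier.revision' and pull a trailing '-xx' tag off the scope."""
    # Raises ValueError if the id does not have three dot-separated parts.
    scope, identifier, revision = package_id.rsplit(".", 2)
    base, sep, lang = scope.rpartition("-")
    # Assumed convention: the language is a two-letter suffix on the scope.
    if not sep or len(lang) != 2 or not lang.isalpha():
        base, lang = scope, None
    return {"base": base, "lang": lang,
            "identifier": identifier, "revision": revision}


def find_translation(package_ids, package_id, lang):
    """Return the id in package_ids that is the `lang` version of package_id, or None."""
    wanted = parse_package_id(package_id)
    for candidate in package_ids:
        parsed = parse_package_id(candidate)
        if (parsed["base"], parsed["identifier"], parsed["lang"]) == (
                wanted["base"], wanted["identifier"], lang):
            return candidate
    return None


ids = ["knb-spainlster-snv-en.0100.1230493704",
       "knb-spainlster-snv-sp.0100.1230493704"]
print(find_translation(ids, ids[0], "sp"))
```

Every consumer would have to agree on the suffix position, the language
codes, and whether revisions must match, which is the standardization
problem mentioned above.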

The three ways I can see doing this are:
1) At the element level, allow repeating content in multiple languages
   -- matches how ISO 19115 does it
   -- this is the proposal in bug
      http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c4
2) At the document level, allow two or more sections, each in their own
   language
3) Multiple documents

Personally, I think the 1st is the most approachable, and allows groups to
add translated content for a few fields easily.  Marking these with the
appropriate locale using xml:lang and related attributes would be
straightforward.  The hard part would be changing the content model of EML
to allow the repeating fields -- it would be best if we could do this in a
way that does not invalidate existing EML 2.1 documents, but I'm not sure if
that is possible.  Also, as attributes in XML cannot repeat, we'd need to
determine how to best provide translations for attribute content -- we use
few attributes in EML (mostly for things like packageId), so maybe they
don't need to be translated at all.
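
To illustrate option 1, here is a minimal Python sketch of how a consumer
could pick a title by locale if repeated elements carried xml:lang.  It
assumes a content model that allows repeated title elements with xml:lang,
which is the schema change discussed above; the sample fragment and the
pick_title helper are hypothetical.  It uses "es", the ISO 639-1 code for
Spanish.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: assumes <title> may repeat and carry xml:lang.
SAMPLE = """<dataset>
  <title xml:lang="en">Snow cover data provided by MODIS satellite imagery</title>
  <title xml:lang="es">Datos de innivación según imágenes MODIS</title>
</dataset>"""


def pick_title(dataset, lang, fallback="en"):
    """Return the title for lang, else for fallback, else the first title."""
    titles = dataset.findall("title")
    by_lang = {t.get(XML_LANG): t.text for t in titles}
    if lang in by_lang:
        return by_lang[lang]
    if fallback in by_lang:
        return by_lang[fallback]
    return titles[0].text if titles else None


dataset = ET.fromstring(SAMPLE)
print(pick_title(dataset, "es"))  # Spanish title
print(pick_title(dataset, "pt"))  # no Portuguese title, falls back to English
```

A document with a single untagged title still works with this lookup, since
it falls through to the first title, so readers written this way would not
break on existing documents.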

Thanks to contributions from our collaborators in Taiwan, we now have a
localizable version of Morpho, with the UI translated into Chinese,
Japanese, Spanish, French, and Portuguese.  So the next version of Morpho
will support the UI in multiple languages when we release it -- we need to
get native speakers of those languages to help validate and fix the
translations.  It would be great if we could also add in multi-language
support for metadata content in that same release. If you're interested in
seeing this development version of Morpho, contact Ben Leinfelder and he can
point you in the right direction.

We'd be willing to put some time into i18n for Morpho and EML over the next
6 months if others want to help out too.  New releases need not take a long
time, assuming that people are willing to contribute to making sure the
changes are broadly acceptable and won't break a lot for existing EML users.

Matt

On Thu, Jun 24, 2010 at 12:03 PM, David Blankman <dblankman1 at gmail.com> wrote:

> Inigo,
>
> We talked about the possibility of using one document with repeating
> elements  with a language tag, but I think that it creates a document that
> is confusing. EML is sufficiently complex even in one language. Personally I
> do not think that mixing languages is a good idea.
>
> I am copying Matt and Eamonn O Tuama (GBIF) on this since both were a part
> of the meeting in China. They may have different ideas. GBIF, I know, deals
> with multiple languages on a regular basis.
>
> It seems to me that mixing languages creates two problems. For the human
> reader, it makes the document harder to read. You have more experience with
> the machine parsing approach than I do, but intuitively it seems to me that
> it is easier to parse two single language documents than one mixed document,
> although clearly one can use the language tag to separate the two languages.
> ILTER is a resource-poor organization relying on volunteers. ILTER doesn't
> have the resources to develop the parsing of a mixed document.
>
> Most ILTER users have minimal information management staff. There are
> exceptions: China and Taiwan are the most obvious. But their technical
> expertise cannot be counted upon by ILTER in general.
>
> It also seems to me that generating a mixed document is more difficult.
> Morpho can be used easily to create two documents. Creating a mixed
> document, as far as I know, requires either hand editing or the development
> of a tool specifically for this purpose. Since ILTER does not have the
> resources to create such a tool, I think the recommendation has to be two
> separate documents.
>
> Kristin, Matt or Eamonn, feel free to comment.
>
> David
>
>
> ———————————————————
> Everything is possible with a chocolate cookie!
>  - Rabbi Herbie of Jerusalem
>
> If I am not for myself, then who will be for me? If I am for myself alone,
> then who am I? If not now, when?
> - Rabbi Hillel
>
>
> 2010/6/24 Inigo San Gil <isangil at canyon.lternet.edu>
>
>>
>> Thanks David,
>>
>> Cool.. as for the actual implementation:
>>
>> For example, the title tag can be duplicated, so i can see having this
>> sort of logic
>>
>> <title>[Language:En]Snow cover data provided by MODIS satellite imagery
>> </title>
>> <title>[Language:Sp]Datos de innivaci&#243;n seg&#250;n im&#225;genes
>> MODIS</title>
>> <creator>(this translation would only apply for non latin
>> codesets)</creator>
>> <abstract>
>>   <para>[Language:En] These data show all the information obtained
>> through the MODIS satellite imagery about the snow cover at Sierra
>> Nevada</para>
>>         <para>[Language:Sp]Incluye toda la informaci&#243;n obtenida de
>> las im&#225;genes de sat&#233;lite de MODIS sobre nieve en Sierra
>> Nevada</para>
>>   </abstract>
>> etc...
>>
>> An alternative would be to tweak EML to allow for an attribute "lang"
>> within the EML tags
>> (this could be painful as it would need to be sanctioned by eml-dev -- a 2
>> year wait or more)
>>
>> <title lang='en'>Snow cover data provided by MODIS satellite imagery
>> </title>
>> <title lang='sp'>Datos de innivaci&#243;n seg&#250;n im&#225;genes
>> MODIS</title>
>>
>> But if I understand it correctly, ILTER suggests two documents,
>> (optionally).
>> One must at least be "discovery level" in English, and the other "full
>> document" in the
>> native tongue.  Is this what we should do?
>> like "snowcover.xml" (packageId='knb-spainlster-snv-en.0100.1230493704')
>> and "innivacion.xml" (packageId='knb-spainlster-snv-sp.0100.1230493704')
>>
>> (note the different scope in the packageId)
>>
>> i don't know of any specific implementations, all i encountered in dealing
>> with this monster issue is the Taiwan EML, which does not follow a single
>> strategy.  maybe i should take a look at the Brazilian or Chilean EML and
>> such (if they have any..)
>>
>> cheers,
>> Inigo
>>
>>
>>
>> David Blankman wrote:
>>
>>> Hi Inigo,
>>>
>>> We discussed this issue in an ILTER workshop in China. This workshop
>>> produced a recommendation which the ILTER coordinating committee agreed
>>> to at the ILTER meeting in Slovakia in 2008. The strategy is to provide,
>>> at minimum, a basic discovery level document in English to include: title,
>>> creator, contact, abstract, and keywords.  A site could then produce a
>>> full document in the native language. In both cases the language tag
>>> should probably be used.
>>>
>>> Let me know if you need more information.
>>>
>>> David
>>>
>>>
>>> On Thu, Jun 24, 2010 at 8:59 PM, Inigo San Gil
>>> <isangil at canyon.lternet.edu> wrote:
>>>
>>>
>>>
>>>> remind me, David
>>>>
>>>> how are we tackling the Babelian problem in EML? are we duplicating
>>>> titles, and descriptive tags in the natural language and english? do we
>>>> use some sort of XML attribute to denote the language? separate EML docs?
>>>> what was the strategy outlined at ISEI6 (cancun)?
>>>>
>>>> it is urgent cause the spaniards are producing EML, and we are wondering
>>>> what would be the best way.  I know the Taiwan TFRI have a mix-and-match
>>>> of instances (all chinese, a mix of chinese and english, all english).
>>>> cheers, inigo
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>