[eml-dev] eml globalization

Mon Jun 28 05:51:12 PDT 2010

Hi Matt,

It’s good to hear that you are considering i18n for Morpho and EML. GBIF, having adopted a profile of EML for use with its Integrated Publishing Toolkit, is very keen to see this development - following on from the recommendations of the GBIF metadata task group (and of the Lake Taihu workshop). I’m not sure how much we could commit in terms of actual development though as we are very under-resourced as regards developers. One approach might be to call on the wider GBIF community – we already have some common partners active in the area such as GBIF Taiwan.

At GBIF, we had to extend EML to meet our specific requirements using the “additionalMetadata” element. One of the extensions was to be able to state the metadata language - we needed that to be INSPIRE compliant and enable cross-walking to ISO19115.

<additionalMetadata>
  <metadata>
    <!-- language of the metadata document; use ISO language code -->
    <metadataLanguage></metadataLanguage>
    ….
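
A minimal Python sketch of how a consumer might read this extension. The sample document is hypothetical: it closes the elements that the fragment above elides, uses a bare `eml` root for brevity, and assumes the language is given as a short code such as `en`. The function name `metadata_language` is an illustration only, not part of any GBIF tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the fragment above, closed so that it parses.
SAMPLE = """<eml>
  <additionalMetadata>
    <metadata>
      <metadataLanguage>en</metadataLanguage>
    </metadata>
  </additionalMetadata>
</eml>"""


def metadata_language(xml_text, default=None):
    """Return the text of the first metadataLanguage element, or default."""
    root = ET.fromstring(xml_text)
    node = root.find("./additionalMetadata/metadata/metadataLanguage")
    if node is None or not (node.text or "").strip():
        return default
    return node.text.strip()


print(metadata_language(SAMPLE))
```

Returning a caller-supplied default keeps documents without the extension usable, which matters if the element stays optional.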

I think Matt’s first choice below for enabling multiple language expression in EML is certainly the most complete and builds on the ISO19115 model. However, I can’t gauge the amount of work required here (especially given the need to maintain backwards compatibility). Method 2 offers a faster solution. At the very minimum, and while not so elegant, it could be restricted to the minimal element set recommended by the Lake Taihu workshop: title, creator, contact, abstract, and keywords. It seems to me that the only meaningful elements to translate here are title, abstract, and keywords (the last only if they do not originate from a multilingual thesaurus or glossary). So a section identified as containing content in a particular language/encoding, restricted to whatever subset of elements is considered necessary to enable cross-language discovery, might be the minimum to aim for.

Éamonn

From: David Blankman [mailto:dblankman1 at gmail.com] 
Sent: 25 June 2010 21:45
To: Matt Jones
Cc: isangil at lternet.edu; Kristin Vanderbilt; Eamonn O Tuama; eml-dev; Ben Leinfelder; Terry Parr; chin at tfri.gov.tw
Subject: Re: eml globalization

Matt,

Thank you for clarifying the issue.

 From an ILTER perspective I am excited to hear that you are willing to commit resources to modifying Morpho to create multi-lingual documents. 

As part of the ILTER information management committee, I am willing to coordinate efforts to get native-speaker translations. My Java programming skills are not good enough to make a substantive contribution to Morpho.

As I read your comments, it looks like EML will require modification in order to adequately accommodate internationalization and that the changes are not trivial. Because this is so important to ILTER, I am willing to have a significant involvement. On the other hand, I know that I am not currently up to speed on the parser issues.

There will be an ILTER meeting in Israel in August. Chin Chau-Lin from Taiwan will be there. Chin had also proposed a follow-up meeting to the Lake Taihu meeting which he said could be hosted in Taiwan.

I would appreciate your suggestions as far as the process for moving forward.

David 

———————————————————
Everything is possible with a chocolate cookie!
 - Rabbi Herbie of Jerusalem

If I am not for myself, then who will be for me? If I am for myself alone, then who am I? If not now, when?
- Rabbi Hillel

On Thu, Jun 24, 2010 at 11:28 PM, Matt Jones <jones at nceas.ucsb.edu> wrote:

Hi --

This is an important issue, and one that I think we should tackle very soon for EML as we have a lot of new international groups producing EML in many languages.  I was in Brazil 2 weeks ago setting up a Metacat for PELD and the issue of supporting multiple languages came up immediately.  We've discussed this in the past, and the approach I was thinking of is summarized here:

http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c4

The alternate solution of producing multiple metadata documents each in a different language has the problem of not knowing how to locate a particular translation -- I guess it would be done by file naming convention, but this is problematic as it is difficult to standardize without a specification.

The three ways I can see doing this are:

1) At the element level, allow repeating content in multiple languages

   -- matches how ISO19115 does it

   -- this is the proposal in bug http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c4

2) At the document level, allow two or more sections, each in their own language

3) Multiple documents

Personally, I think the 1st is the most approachable, and allows groups to add translated content for a few fields easily.  Marking these with the appropriate locale using xml:lang and related attributes would be straightforward.  The hard part would be changing the content model of EML to allow the repeating fields -- it would be best if we could do this in a way that does not invalidate existing EML 2.1 documents, but I'm not sure if that is possible.  Also, as attributes in XML can not repeat, we'd need to determine how to best provide translations for attribute content -- we use few attributes in EML (mostly for things like packageId), so maybe they don't need to be translated at all.
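
A minimal Python sketch of what option 1 could look like to a reader of the document: repeated elements marked with `xml:lang`, and a lookup that falls back to a default language. The sample document and the helper name `pick` are hypothetical, and the fragment has not been validated against any EML schema. It uses the ISO 639-1 code `es` for Spanish.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: repeated titles, each marked with xml:lang.
SAMPLE = """<dataset>
  <title xml:lang="en">Snow cover data provided by MODIS satellite imagery</title>
  <title xml:lang="es">Datos de innivación según imágenes MODIS</title>
</dataset>"""


def pick(parent, tag, lang, fallback="en"):
    """Return the text of child `tag` in `lang`, else in `fallback`,
    else the first occurrence; None if the element is absent."""
    by_lang = {}
    first = None
    for child in parent.findall(tag):
        text = (child.text or "").strip()
        if first is None:
            first = text
        # Keep the first occurrence for each language tag.
        by_lang.setdefault(child.get(XML_LANG), text)
    for wanted in (lang, fallback):
        if wanted in by_lang:
            return by_lang[wanted]
    return first


root = ET.fromstring(SAMPLE)
print(pick(root, "title", "es"))
print(pick(root, "title", "fr"))  # no French title, so the English one is used
```

A consumer that ignores `xml:lang` would simply see several titles, which is one reason the backwards-compatibility question above matters.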

Thanks to contributions from our collaborators in Taiwan, we now have a localizable version of Morpho, with the UI translated into Chinese, Japanese, Spanish, French, and Portuguese.  So the next version of Morpho would support the UI in multiple languages when we release it -- we need to get native speakers of those languages to help validate and fix the translations.  It would be great if we could also add in multi-language support for metadata content in that same release. If you're interested in seeing this development version of Morpho, contact Ben Leinfelder and he can point you in the right direction.

We'd be willing to put some time into i18n for Morpho and EML over the next 6 months if others want to help out too.  New releases need not take a long time, assuming that people are willing to contribute to making sure the changes are broadly acceptable and won't break a lot for existing EML users.

Matt

On Thu, Jun 24, 2010 at 12:03 PM, David Blankman <dblankman1 at gmail.com> wrote:

Inigo,

We talked about the possibility of using one document with repeating elements with a language tag, but I think that it creates a document that is confusing. EML is sufficiently complex even in one language. Personally I do not think that mixing languages is a good idea.

I am copying Matt and Eamonn O Tuama (GBIF) on this since both were a part of the meeting in China. They may have different ideas. GBIF, I know, deals with multiple languages on a regular basis.

It seems to me that mixing languages creates two problems. For the human reader, it makes the document harder to read. You have more experience with the machine parsing approach than I do, but intuitively it seems to me that it is easier to parse two single language documents than one mixed document, although clearly one can use the language tag to separate the two languages. ILTER is a resource poor organization relying on volunteers. ILTER doesn't have the resources to develop the parsing of a mixed document. 

Most ILTER users have minimal information management people. There are exceptions: China and Taiwan are the most obvious. But their technical expertise cannot be counted upon by ILTER in general.

It also seems to me that generating a mixed document is more difficult. Morpho can be used easily to create two documents. Creating a mixed document, as far as I know, requires either hand editing or the development of a tool specifically for this purpose. Since ILTER does not have the resources to create such a tool, I think the recommendation has to be two separate documents.

Kristin, Matt or Eamonn, feel free to comment.

David 

2010/6/24 Inigo San Gil <isangil at canyon.lternet.edu>

Thanks David,

Cool.. as for the actual implementation:

For example, the title tag can be duplicated, so i can see having this sort of logic

<title>[Language:En]Snow cover data provided by MODIS satellite imagery</title>
<title>[Language:Sp]Datos de innivaci&#243;n seg&#250;n im&#225;genes MODIS</title>
<creator>(this translation would only apply for non-Latin codesets)</creator>
<abstract>
  <para>[Language:En]These data show all the information obtained through the MODIS satellite imagery about the snow cover at Sierra Nevada</para>
  <para>[Language:Sp]Incluye toda la informaci&#243;n obtenida de las im&#225;genes de sat&#233;lite de MODIS sobre nieve en Sierra Nevada</para>
</abstract>
etc...
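
A minimal Python sketch of how a reader could separate the bracketed marker from the content under this convention. The marker syntax is taken from the example above; the helper name `split_language` is hypothetical, and the code passes through whatever tag it finds (for instance `Sp`) without checking it against ISO codes.

```python
import re

# Matches a leading marker such as "[Language:En]".
PREFIX = re.compile(r"^\s*\[Language:([A-Za-z-]+)\]\s*")


def split_language(text):
    """Return (lowercased language tag, remaining text).

    The tag is None when the text carries no leading marker.
    """
    match = PREFIX.match(text)
    if match is None:
        return None, text.strip()
    return match.group(1).lower(), text[match.end():].strip()


print(split_language("[Language:En]Snow cover data provided by MODIS satellite imagery "))
print(split_language("Untagged title"))
```

Because the marker lives inside the element text, any tool that is unaware of the convention will display it verbatim.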

An alternative would be to tweak EML to allow for an attribute "lang" within the EML tags
(this could be painful as it would need to be sanctioned by eml-dev -- a 2 year wait or more)

<title lang='en'>Snow cover data provided by MODIS satellite imagery </title>
<title lang='sp'>Datos de innivaci&#243;n seg&#250;n im&#225;genes MODIS</title>

But if I understand it correctly, ILTER suggests two documents (optionally).
One must at least be "discovery level" in English, and the other a "full document" in the
native tongue.  Is this what we should do?
Like "snowcover.xml" (packageId='knb-spainlster-snv-en.0100.1230493704')
and "innivacion.xml" (packageId='knb-spainlster-snv-sp.0100.1230493704')

(note the different scope in the packageId)
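
A minimal Python sketch of how two such documents could be paired by packageId. It assumes, based only on the two example identifiers above, that the packageId has the form scope.identifier.revision and that the last hyphen-separated token of the scope is the language tag. Both function names are hypothetical.

```python
def parse_package_id(package_id):
    """Split 'scope.identifier.revision' and read a trailing '-xx' tag from the scope.

    Raises ValueError if the id does not contain at least two dots.
    """
    scope, identifier, revision = package_id.rsplit(".", 2)
    base, sep, lang = scope.rpartition("-")
    if not sep:
        # No hyphen in the scope: treat the whole scope as the base.
        base, lang = scope, None
    return {
        "scope": scope,
        "base": base,
        "lang": lang,
        "identifier": identifier,
        "revision": revision,
    }


def are_translations(id_a, id_b):
    """True when two ids differ only in the language suffix of the scope."""
    a, b = parse_package_id(id_a), parse_package_id(id_b)
    same_dataset = (a["base"], a["identifier"], a["revision"]) == (
        b["base"], b["identifier"], b["revision"])
    return same_dataset and a["lang"] != b["lang"]


print(are_translations("knb-spainlster-snv-en.0100.1230493704",
                       "knb-spainlster-snv-sp.0100.1230493704"))
```

The sketch cannot tell a language suffix from any other final token in a scope, so it only works if every site follows the same naming rule.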

I don't know of any specific implementations; all I have encountered in dealing with this monster issue is the Taiwan EML, which does not follow a single strategy.  Maybe I should take a look at the Brazilian or Chilean EML and such (if they have any..)

cheers,
Inigo

David Blankman wrote:

Hi Inigo,

We discussed this issue in an ILTER workshop in China. This workshop
produced a recommendation which the ILTER coordinating committee agreed to at
the ILTER meeting in Slovakia in 2008. The strategy is to provide,
at minimum, a basic discovery level document in English to include: title,
creator, contact, abstract, and keywords.  A site could then produce a full
document in the native language. In both cases the language tag should
probably be used.

Let me know if you need more information.

David

On Thu, Jun 24, 2010 at 8:59 PM, Inigo San Gil
<isangil at canyon.lternet.edu> wrote:

remind me, David

how are we tackling the Babelian problem in EML? are we duplicating titles
and descriptive tags in the natural language and English? do we use some
sort of XML attribute to denote the language? separate EML docs? what was
the strategy outlined at ISEI6 (Cancun)?

it is urgent because the Spaniards are producing EML, and we are wondering
what would be the best way.  I know the Taiwan TFRI have a mix-and-match of
instances (all Chinese, a mix of Chinese and English, all English).
cheers, inigo
