[eml-dev] eml globalization

Fri Jul 30 12:35:39 PDT 2010

Hello all - 
We've discovered another drawback to using mixed content elements to support localized strings. I've summarized the issue below and provided notes about our options and their implications.

-You cannot perform any schema-based validation on the text content within a mixed content element.
-Currently the NonEmptyString elements (newly introduced in EML 2.1.0) require not only that the element be present in the document, but also that it contain some non-whitespace content.
-If we extended the NonEmptyString elements to allow mixed content (additional language translation subelements) then we would no longer be able to enforce the non-whitespace restriction. The element would still be required, but it could be left blank and not be caught by schema-based validation.
-The content of the language translation subelements could, however, be subject to this non-whitespace restriction.
-TextType elements are not affected by this because their content is currently optional
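
To illustrate the points above, here is a minimal Python sketch (standard library only) of what "blank" means for a mixed content element. The character data that belongs to the element itself is scattered around its children, so a program has to gather it explicitly. The helper name `own_text` and the sample documents are assumptions made for illustration; the `value` child follows the prototype in the bugzilla entry.

```python
import xml.etree.ElementTree as ET

def own_text(element):
    """Return the character data that belongs directly to `element`,
    ignoring whatever sits inside its child elements."""
    parts = [element.text or ""]
    # Text that follows each child is stored on the child's `tail`.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()

# A title with its own text plus one translation.
filled = ET.fromstring(
    '<title>Original title<value xml:lang="es">Titulo</value></title>'
)
# A title that is present but holds only whitespace around the translation.
blank = ET.fromstring(
    '<title>   <value xml:lang="es">Titulo</value>  </title>'
)

print(repr(own_text(filled)))  # 'Original title'
print(repr(own_text(blank)))   # ''
```

The second document is the case that schema-based validation would no longer catch once the element allows mixed content.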

Options:
1) Continue with the mixed content approach (backward compatible with EML 2.1.0)
	-EML 2.1.1 would be a more relaxed schema than the current EML 2.1.0
	-we could augment existing EML-specific parsers to perform additional checks on the mixed content after schema-based validation was performed. 
		-Metacat already includes quite a bit of EML parsing when documents are submitted to the server
		-The EML project has a utility parser that could also include a check for blank fields
2) Pursue a more structured localization approach (not backward compatible)
	-allows us to continue to enforce the NonEmptyString restrictions for all language translations
	-simplifies xpath-based searching within documents
	-immediately requires modifications be made to:
		-Morpho
			-write XSL for upgrading the EML version
			-refactor all paths that are used for newly-localized EML elements
		-Metacat
			-search paths
			-stylesheets used by the skins
		-Ecogrid
			-search paths
		-Kepler
			-EML datasource actor
			-the query interface would need to include
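
As a rough sketch of the extra check described under option 1, the following Python (standard library only) walks an already schema-validated document and reports localized elements whose own text is blank. It is not how Metacat or the EML utility parser work today. The field list, the function names, and the un-namespaced sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of elements that would use the relaxed NonEmptyString type.
LOCALIZED_FIELDS = {"title", "attributeName"}

def own_text(element):
    """Character data directly inside `element`, excluding child content."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()

def find_blank_fields(root, field_names=LOCALIZED_FIELDS):
    """Return the tags of localized elements whose own text is blank.

    Intended to run after schema validation, which can no longer
    enforce the non-whitespace rule on mixed content elements.
    """
    return [
        el.tag
        for el in root.iter()
        if el.tag in field_names and not own_text(el)
    ]

doc = ET.fromstring("""
<dataset>
  <title>
    <value xml:lang="es">Titulo en Español</value>
  </title>
  <attributeName>site<value xml:lang="es">sitio</value></attributeName>
</dataset>
""")

print(find_blank_fields(doc))  # ['title']
```

Whether a non-blank translation alone should satisfy the rule is a policy choice; this sketch takes the stricter reading and requires the element's own text.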

Your comments on how to proceed are greatly appreciated (full disclosure: I am still in favor of using the mixed content approach).
Thanks,
-ben

On Jul 23, 2010, at 3:22 PM, Matt Jones wrote:

> Hi Ben,
> 
> This does look good.  I'd considered using a mixed content model, but previously we had decided to avoid mixed content in EML so that searches could work on atomic elements.  However, since we introduced the TextType, that has no longer been possible, so we really need a search solution for mixed content elements anyways.  So, if we added a similar structure to TextType that you did for NonEmptyString, would we cover all of the fields that are intended to contain natural language?
> 
> I think that the one issue is that software that assumes that a field contains just a string -- such as attributeName -- will need to be changed to be sure to handle the new mixed content model.  This would include the DataManager library, Morpho, Metacat, and Kepler at a minimum, and probably others.  But it's a small price to pay for the added language compatibility.  
> 
> One other issue is that we should be sure to indicate that people should use the xml:lang attribute on their container elements so that we can tell what the primary language of the field is.
> 
> Can you go ahead and change TextType properly as well, which I think would cover the majority of the parts of EML that accept strings?  We also should go through and look at all of the fields that currently either use or extend xs:string and evaluate whether these should be changed to either the non empty string type or the text type as well.
> 
> As this would be backwards compatible, we could probably release this as EML 2.1.1 and then people would be able to use it quickly.  Thoughts on that?
> 
> Matt
> 
> On Fri, Jul 23, 2010 at 9:37 AM, Eamonn O Tuama (GBIF) <eotuama at gbif.org> wrote:
> Hello Ben,
> 
> Thank you for looking into this.
> 
> It looks like a relatively simple and non-disruptive approach. Being able
> to use this mixed content model in the title and abstract elements alone
> would be a positive step forward to dealing with multi-lingual metadata.
> Obviously, parsers would need to understand the convention of repeating
> information in different languages but then that is the case for any kind
> of processing for presentation. If multi-lingual glossaries/ontologies are
> also adopted by metadata providers, I think we will have covered the basic
> requirements.
> 
> Regards,
> 
> Éamonn
> 
> > Hello everyone,
> > I've taken a few moments to experiment with the approach of adding
> > localized translations to existing EML elements.
> > The bugzilla comment outlines my prototype:
> > http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c5
> > It seems relatively simple and successfully preserves backwards
> > compatibility, but I'd like the group to take a look and provide feedback.
> > Thanks,
> > -ben
> >
> >
> > For the impatient, here's an excerpt from the bugzilla entry:
> > <title>
> >             Original title
> >             <!-- language translations -->
> >             <value xml:lang="en">Title in English</value>
> >             <value xml:lang="es">Titulo en Español</value>
> > </title>
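
For readers who consume documents written this way, here is a minimal Python sketch (standard library only) of picking a translation by `xml:lang` and falling back to the original text when none matches. The function name `localized_text` is an assumption made for illustration, the match is an exact string comparison (a tag such as `es-MX` would not match `es`), and the XML comment from the excerpt is left out for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def localized_text(element, lang):
    """Return the <value> translation tagged with `lang`, or the
    element's leading text when no such translation exists."""
    for value in element.findall("value"):
        if value.get(XML_LANG) == lang:
            return (value.text or "").strip()
    return (element.text or "").strip()

title = ET.fromstring("""
<title>
    Original title
    <value xml:lang="en">Title in English</value>
    <value xml:lang="es">Titulo en Español</value>
</title>
""")

print(localized_text(title, "es"))  # Titulo en Español
print(localized_text(title, "fr"))  # Original title
```

Software that reads only the leading text keeps working on such documents, which is the backward compatibility the prototype aims for.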
> >
> > On Jun 28, 2010, at 5:51 AM, Éamonn Ó Tuama (GBIF) wrote:
> >
> >> Hi Matt,
> >>
> >> It’s good to hear that you are considering i18n for Morpho and EML.
> >> GBIF, having adopted a profile of EML for use with its Integrated
> >> Publishing Toolkit, is very keen to see this development - following on
> >> from the recommendations of the GBIF metadata task group (and of the
> >> Lake Taihu workshop). I’m not sure how much we could commit in terms of
> >> actual development though as we are very under-resourced as regards
> >> developers. One approach might be to call on the wider GBIF community –
> >> we already have some common partners active in the area such as GBIF
> >> Taiwan.
> >>
> >> At GBIF, we had to extend EML to meet our specific requirements using
> >> the “additionalMetadata” element. One of the extensions was to be able
> >> to state the metadata language - we needed that to be INSPIRE compliant
> >> and enable cross-walking to ISO19115.
> >>
> >> <additionalMetadata>
> >>                <metadata>
> >>                <!-- language of the metadata document; use ISO language
> >> code  -->
> >>                               <metadataLanguage></metadataLanguage>
> >>                ….
> >>
> >> I think Matt’s first choice below for enabling multiple language
> >> expression in EML is certainly the most complete and builds on the
> >> ISO19115 model. However, I can’t gauge the amount of work required here
> >> (esp with need to maintain backwards compatibility). Method 2 offers a
> >> faster solution. At the very minimum, and while not so elegant, it could
> >> be restricted to the minimal element set recommendation of the Lake
> >> Taihu workshop: title, creator, contact, abstract, and keywords. It
> >> seems to me that the only meaningful elements to translate here are
> >> title and abstract, and keywords, if the latter do not originate from
> >> multilingual thesauri/glossary source. So a section identified as
> >> containing content in a particular language/encoding and restricted to
> >> whatever subset of elements is considered necessary to enable cross
> >> language discovery might be the minimum to aim for.
> >>
> >> Éamonn
> >>
> >> From: David Blankman [mailto:dblankman1 at gmail.com]
> >> Sent: 25 June 2010 21:45
> >> To: Matt Jones
> >> Cc: isangil at lternet.edu; Kristin Vanderbilt; Eamonn O Tuama; eml-dev;
> >> Ben Leinfelder; Terry Parr; chin at tfri.gov.tw
> >> Subject: Re: eml globalization
> >>
> >> Matt,
> >>
> >> Thank you for clarifying the issue.
> >>
> >>  From an ILTER perspective I am excited to hear that you are willing to
> >> commit resources to modifying Morpho to create multi-lingual documents.
> >>
> >> As part of the ILTER information management committee, I am willing to
> >> coordinate efforts to get native-speaker translations. My Java
> >> programming skills are not good enough to make a substantive
> >> contribution to Morpho.
> >>
> >> As I read your comments, it looks like EML will require modification in
> >> order to adequately accommodate internationalization and that the changes
> >> are not trivial. Because this is so important to ILTER, I am willing to
> >> have a significant involvement. On the other hand, I know that I am not
> >> currently up to speed on the parser issues.
> >>
> >> There will be an ILTER meeting in Israel in August. Chin Chau-Lin from
> >> Taiwan will be there. Chin had also proposed a follow-up meeting to the
> >> Lake Taihu meeting which he said could be hosted in Taiwan.
> >>
> >> I would appreciate your suggestions as far as the process for moving
> >> forward.
> >>
> >> David
> >> ———————————————————
> >> Everything is possible with a chocolate cookie!
> >>  - Rabbi Herbie of Jerusalem
> >>
> >> If I am not for myself, then who will be for me? If I am for myself
> >> alone, then who am I? If not now, when?
> >> - Rabbi Hillel
> >>
> >>
> >> On Thu, Jun 24, 2010 at 11:28 PM, Matt Jones <jones at nceas.ucsb.edu>
> >> wrote:
> >> Hi --
> >>
> >> This is an important issue, and one that I think we should tackle very
> >> soon for EML as we have a lot of new international groups producing EML
> >> in many languages.  I was in Brazil 2 weeks ago setting up a Metacat for
> >> PELD and the issue of supporting multiple languages came up immediately.
> >>  We've discussed this in the past, and the approach I was thinking of is
> >> summarized here:
> >>
> >> http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c4
> >>
> >> The alternate solution of producing multiple metadata documents each in
> >> a different language has the problem of not knowing how to locate a
> >> particular translation -- I guess it would be done by file naming
> >> convention, but this is problematic as it is difficult to standardize
> >> without a specification.
> >>
> >> The three ways I can see doing this are:
> >> 1) At the element level, allow repeating content in multiple languages
> >>    -- matches how ISO19115 does it
> >>    -- this is the proposal in bug
> >> http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c4
> >> 2) At the document level, allow two or more sections, each in their own
> >> language
> >> 3) Multiple documents
> >>
> >> Personally, I think the 1st is the most approachable, and allows groups
> >> to add translated content for a few fields easily.  Marking these with
> >> the appropriate locale using xml:lang and related attributes would be
> >> straightforward.  The hard part would be changing the content model of
> >> EML to allow the repeating fields -- it would be best if we could do
> >> this in a way that does not invalidate existing EML 2.1 documents, but
> >> I'm not sure if that is possible.  Also, as attributes in XML can not
> >> repeat, we'd need to determine how to best provide translations for
> >> attribute content -- we use few attributes in EML (mostly for things
> >> like packageId), so maybe they don't need to be translated at all.
> >>
> >> Thanks to our contributions from our collaborators in Taiwan, we now
> >> have a localizable version of Morpho, with the UI translated into
> >> Chinese, Japanese, Spanish, French, and Portuguese.  So the next version
> >> of Morpho would support the UI in multiple languages when we release it
> >> -- we need to get native speakers from those languages to help validate
> >> and fix the translations.  It would be great if we could also add in
> >> multi-language support for metadata content in that same release. If
> >> you're interested in seeing this development version of Morpho, contact
> >> Ben Leinfelder and he can point you in the right direction.
> >>
> >> We'd be willing to put some time into i18n for Morpho and EML over the
> >> next 6 months if others want to help out too.  New releases need not
> >> take a long time, assuming that people are willing to contribute to
> >> making sure the changes are broadly acceptable and won't break a lot for
> >> existing EML users.
> >>
> >> Matt
> >>
> >>
> >> On Thu, Jun 24, 2010 at 12:03 PM, David Blankman <dblankman1 at gmail.com>
> >> wrote:
> >> Inigo,
> >>
> >> We talked about the possibility of using one document with repeating
> >> elements  with a language tag, but I think that it creates a document
> >> that is confusing. EML is sufficiently complex even in one language.
> >> Personally I do not think that mixing languages is a good idea.
> >>
> >> I am copying Matt and Eamonn O Tuama (GBIF) on this since both were a
> >> part of the meeting in China. They may have different ideas. GBIF, I
> >> know, deals with multiple languages on a regular basis.
> >>
> >> It seems to me that mixing languages creates two problems. For the human
> >> reader, it makes the document harder to read. You have more experience
> >> with the machine parsing approach than I do, but intuitively it seems to
> >> me that it is easier to parse two single language documents than one
> >> mixed document, although clearly one can use the language tag to
> >> separate the two languages. ILTER is a resource poor organization
> >> relying on volunteers. ILTER doesn't have the resources to develop the
> >> parsing of a mixed document.
> >>
> >> Most ILTER users have minimal information management people. There are
> >> exceptions: China and Taiwan are the most obvious. But their technical
> >> expertise cannot be counted upon by ILTER in general.
> >>
> >> It also seems to me that generating a mixed document is more difficult.
> >> Morpho can be used easily to create two documents. Creating a mixed
> >> document, as far as I know, requires either hand editing or the
> >> development of a tool specifically for this purpose. Since ILTER does
> >> not have the resources to create such a tool, I think the recommendation
> >> has to be two separate documents.
> >>
> >> Kristin, Matt or Eamonn, feel free to comment.
> >>
> >> David
> >>
> >>
> >>
> >>
> >> 2010/6/24 Inigo San Gil <isangil at canyon.lternet.edu>
> >>
> >> Thanks David,
> >>
> >> Cool.. as for the actual implementation:
> >>
> >> For example, the title tag can be duplicated, so i can see having this
> >> sort of logic
> >>
> >> <title>[Language:En]Snow cover data provided by MODIS satellite imagery
> >> </title>
> >> <title>[Language:Sp]Datos de innivaci&#243;n seg&#250;n im&#225;genes
> >> MODIS</title>
> >> <creator>(this translation would only apply for non latin
> >> codesets)</creator>
> >> <abstract>
> >>   <para>[Language:En] These data show all the information obtained
> >> through the MODIS satellite imagery about the Snow cover at Sierra
> >> Nevada</para>
> >>         <para>[Language:Sp]Incluye toda la informaci&#243;n obtenida de
> >> las im&#225;genes de sat&#233;lite de MODIS sobre nieve en
> >> Sierra Nevada</para>
> >>   </abstract>
> >> etc...
> >>
> >> An alternative would be to tweak EML to allow for an attribute "lang"
> >> within the EML tags
> >> (this could be painful as it would need to be sanctioned by eml-dev -- a
> >> 2 year wait or more)
> >>
> >> <title lang='en'>Snow cover data provided by MODIS satellite imagery
> >> </title>
> >> <title lang='sp'>Datos de innivaci&#243;n seg&#250;n im&#225;genes
> >> MODIS</title>
> >>
> >> But if I understand it correctly, ILTER suggests two documents,
> >> (optionally).
> >> One must at least be "discovery level" in english, and other "full
> >> document" in the
> >> native tongue.  Is this what we should do?
> >> like "snowcover.xml" (packageId='knb-spainlster-snv-en.0100.1230493704')
> >> and "innivacion.xml" (packageId='knb-spainlster-snv-sp.0100.1230493704')
> >>
> >> (note the different scope in the packageId)
> >>
> >> i dont know of any specific implementations, all i encountered in
> >> dealing with this monster issue is the Taiwan EML, which does not follow
> >> a unique strategy.  may be i should take a look at the Brazilian or
> >> Chilean EML and such (if they have any..)
> >>
> >> cheers,
> >> Inigo
> >>
> >>
> >>
> >> David Blankman wrote:
> >> Hi Inigo,
> >>
> >> We discussed this issue in an ILTER workshop in China. This workshop
> >> produced a recommendation which the ILTER coordinating committee agreed
> >> to at
> >> the ILTER meeting in Slovakia in 2008. The strategy is to provide,
> >> at minimum a basic discovery level document in English to include:
> >> title,
> >> creator, contact, abstract, and keywords.  A site could then produce a
> >> full
> >> document in the native language. In both cases the language tag should
> >> probably be used.
> >>
> >> Let me know if you need more information.
> >>
> >> David
> >>
> >>
> >> On Thu, Jun 24, 2010 at 8:59 PM, Inigo San Gil
> >> <isangil at canyon.lternet.edu>wrote:
> >>
> >>
> >> remind me, David
> >>
> >> how are we tackling the Babelian problem in EML? are we duplicating
> >> titles,
> >> and descriptive tags in the natural language and english? do we use some
> >> sort of XML attribute to denote the language? separate EML docs? what
> >> was
> >> the strategy outlined at ISEI6 (cancun)?
> >>
> >> it is urgent cause the spaniards are producing EML, and we are wondering
> >> what would be the best way.  I know the Taiwan TFRI have a mix-and-match
> >> of
> >> instances (all chinese, a mix of chinese and english, all english).
> >> cheers, inigo