[eml-dev] xml:lang attribute for title in EML 2.1.0

Thu Sep 16 19:14:20 PDT 2010

All - 
I've gone ahead and committed EML schema changes that allow internationalization of 'title' and 'keyword' elements:
	https://code.ecoinformatics.org/code/eml/trunk/eml-resource.xsd
and also 'abstract' elements (as well as any DocBook-based elements in EML):
	https://code.ecoinformatics.org/code/eml/trunk/eml-text.xsd
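
For anyone who wants to experiment with reading the translated content, here is a minimal sketch in Python. It assumes a mixed-content 'title' whose default text is followed by translation child elements carrying xml:lang. The child element name "value", the sample fragment, and the helper name translations() are assumptions for illustration only; please check the committed XSD files for the actual structure.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace; ElementTree exposes
# the attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: default text plus one translation child.
# The child element name "value" is an assumption, not taken from the XSD.
SAMPLE = (
    "<title>Snow cover data provided by MODIS satellite imagery"
    '<value xml:lang="es">Datos de innivación según imágenes MODIS</value>'
    "</title>"
)


def translations(element, default_lang="en"):
    """Return a dict mapping language code to text for a mixed-content element."""
    result = {}
    # Text directly inside the element is treated as the default-language value.
    default_text = (element.text or "").strip()
    if default_text:
        result[element.get(XML_LANG, default_lang)] = default_text
    # Each child carrying xml:lang is treated as one translation.
    for child in element:
        lang = child.get(XML_LANG)
        text = (child.text or "").strip()
        if lang and text:
            result[lang] = text
    return result


title = ET.fromstring(SAMPLE)
by_lang = translations(title)
print(by_lang["es"])
```

Treating the untagged text as English is only a convention of this sketch; a real consumer would more likely take the default language from an xml:lang on an enclosing element.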

In terms of 'basic metadata', the items marked with ** remain to be internationalized:
-Identifier
-Title
-Abstract
-Keywords
**-Creator
**-Contact 
**-Metadata Publisher

I believe the Identifier need not be internationalized as this is a special string.
The other fields describe a person's name or an organization's name, address, phone, email and other contact details - should these allow for multiple translations?

If you would like to see other EML elements internationalized please let me know.
Thanks,
-ben

On Sep 16, 2010, at 5:04 PM, Matt Jones wrote:

> Inigo -- 
> 
> If you look a little deeper into that thread that Ben cited, you'll see a link to the summary of the changes, which is in Bugzilla here:
> http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585#c5
> The details are of course in the changed XSD files for EML in the svn repository, but that summary and the comments in the email thread that Ben cited about downsides of this approach should be enough to evaluate things.
> 
> Matt
> 
> On Thu, Sep 16, 2010 at 12:46 PM, Inigo San Gil <isangil at canyon.lternet.edu> wrote:
> 
> 
> David, Congrats on the ILTER election - you bring quality experience there.
> 
> As for objections, I just wouldn't know what to object to :) .. I don't 
> see specifics.  The quoted text below is the language I see closest 
> to a specific solution to this issue.
> [...]
> I proposed a solution for allowing mixed element content for fields that should be internationalized.
> 
> [...]
> -EML 2.1.1 would be a more relaxed schema than the current EML 2.1.0
> 	-we could augment existing EML-specific parsers to perform additional checks on the mixed content after schema-based validation was performed. 
> 		-Metacat already includes [...] 
> 		-The EML project has a utility parser that [...]
> 
> 
> It sounds good to me, more details welcomed.
> 
> Inigo
> 
> 
> On 9/16/2010 2:19 PM, David Blankman wrote:
>> Matt and EML-Dev,
>> 
>> I think that Ben's solution should be pushed forward. At the ILTER level we are starting to push strongly for EML documents from ILTER member networks. Having a solution that allows for multiple languages in a single document is certainly preferable to two EML documents for the same dataset.
>> 
>> By the way, at the ILTER meeting earlier this month, I got elected to be the new Chair of the ILTER IM Committee. Kristin has been made co-chair of the US-ILTER committee. One of her tasks will be to encourage US LTER researchers to start pursuing global synthetic research. Having a clear approach to handling EML will help to move this forward.
>> 
>> David
>> ———————————————————
>> Everything is possible with a chocolate cookie!
>>   - Rabbi Herbie of Jerusalem
>> 
>> If I am not for myself, then who will be for me? If I am for myself alone, then who am I? If not now, when?
>>  - Rabbi Hillel
>> 
>> 
>> On Thu, Sep 16, 2010 at 9:56 PM, Matt Jones <jones at nceas.ucsb.edu> wrote:
>> The solution that Ben proposed is meant to address the requirements that arose from the iLTER Lake Taihu meeting for providing core metadata in multiple languages.  These recommendations then were also at the core of the recommendations made to GBIF about which fields should contain English translations, but the set of fields differs slightly in the two recommendations.  Because many of these fields are not currently repeatable according to the EML 2.1 schema, we would need to, at a minimum, change cardinality rules to allow for each field to be included multiple times if the xml:lang tag were used to differentiate them (or for the approach Inigo points to).  As Ben points out, it would still be ambiguous as to whether the repeating fields represent different information, or the same information translated.  So his proposal is meant to explicitly flag translations as such within mixed content string fields, with the goal of doing so without breaking existing EML 2.1 compatibility and without having to change existing cardinality rules.
>> 
>> Ben's prior discussion on this highlighted the conflict with the NonEmptyString type that was introduced in EML 2.1, in that mixed content elements would not be validated and so the rules for NonEmptyString would not be enforced.  I think this would only be a small issue, and that the advantages in compatibility provided by using a mixed content model for language translations outweigh the loss of validation within our string types.  Either way, we would need to add the xml:lang attribute so that it can be used throughout EML, including in the translation elements that Ben proposed.
>> 
>> Are there any objections to moving forward with the schema changes to use a mixed content model for translations that Ben proposed in his earlier emails?
>> 
>> Matt
>> 
>> On Thu, Sep 16, 2010 at 11:35 AM, Inigo San Gil <isangil at canyon.lternet.edu> wrote:
>> 
>> We'll keep our eyes on the ball, then.
>> 
>> Meanwhile others have adopted their own solution.
>> Here are two examples:
>> 1) a site from Spain reports this implementation
>> 
>> <title>[Language:En]Snow cover data provided by MODIS satellite imagery</title>
>> <title>[Language:Sp]Datos de innivaci&#243;n seg&#250;n im&#225;genes MODIS</title>
>> 
>> We thought that the use of the XML attribute "lang=en | sp"
>> was interesting, but, among other problems, we would have
>> gotten screwed by eml-dev's eventual internationalization
>> implementation.  Call it luck, but you can bet the "eventual
>> eml-dev decision" would force us to re-code the EML
>> generation.
>> 
>> 2) From Taiwan, it is also a mix and match.  I had the
>> internationalization conversation years ago, when we set up
>> harvesting into the NBII clearinghouse.  At the TFRI, we
>> found EML documents that have a hybrid of English and
>> Chinese, with no sign whatsoever of the language used.
>> We had to devise a mechanism to detect language.  We
>> simply did not harvest those docs whose critical content
>> was not translated into English.
>> 
>> ILTER discussed (two years ago?) some guidelines on
>> how the different countries were going to deal with the
>> Tower of Babel problem.  Maybe you can look into those
>> if you feel curious, but if I recall correctly, it went along
>> the lines of encoding the metadata in the native language,
>> and producing some discovery-level EML in English. This
>> strategy would create two EML documents per dataset.
>> 
>> Sparks or not, I still have to recommend that EML users
>> implement some solution. I'm inclined to suggest that
>> such a solution 1) should not break the current EML rules, and
>> 2) should allow for easy language detection.
>> Spain's case fits here, for example.
>> 
>> Cheers, inigo
>> 
>> 
>> 
>> 
>> 
>> On 9/16/2010 10:56 AM, ben leinfelder wrote:
>> Hi Markus,
>> I'm afraid your findings are accurate with respect to the xml:lang attribute in the <title> element (or any "NonEmptyStringType" element).
>> In the course of my experimentation with allowing backwards-compatible internationalization with a new EML version (2.1.1) I did have to include the "http://www.w3.org/XML/1998/namespace" namespace just as you did and also declare the xml:lang attribute in elements where I wanted to employ it.
>> While certain EML elements are repeatable, it's not always clear what the presence of multiple elements represents (are they translations in different languages or are they alternate titles?). In order to resolve this ambiguity and also allow multiple translations for non-repeatable elements I proposed a solution for allowing mixed element content for fields that should be internationalized. There's a fairly comprehensive discussion of this approach in our eml-dev archives: http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2010-July/001828.html
>> I didn't get a lot of decisive feedback and so have not moved forward with releasing an updated EML version. Hopefully this thread will again set the ball rolling.
>> -ben
>> 
>> _______________________________________________
>> Eml-dev mailing list
>> Eml-dev at ecoinformatics.org
>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev