[tcs-lc] Modularisation of standards - identification of names

Sally Hinchcliffe S.Hinchcliffe at rbgkew.org.uk
Tue Mar 8 02:53:46 PST 2005


On the second part of Donald's email:

> (Now that I think of it) I guess there is one other possible reason why we
> may wish to be able to separate names out.  This is a data processing issue
> and Bob Morris may just tell me that my problem comes from assuming a
> particular implementation, but here goes...
> 
> Assume a document in which two concepts refer to the same published name
> (using an abbreviated representation of TCS data):
> 
> <TaxonConcepts>
>   <TaxonConcept id="tc1">
>     <Name>
>       <Label>Aus bus</Label>
>       <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
>     </Name>
>     <AccordingTo>Smith</AccordingTo>
>   </TaxonConcept>
>   <TaxonConcept id="tc2">
>     <Name>
>       <Label>Aus bus</Label>
>       <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
>     </Name>
>     <AccordingTo>Jones</AccordingTo>
>   </TaxonConcept>
> </TaxonConcepts>
> 
> The nomenclatural data here under name is rather simple and there may be
> little problem with denormalising it.  A DiGIR-style search could allow a
> user to find all TaxonConcepts based on "Aus bus Black, 1965".
> 
> However will it ever matter to an application processing such a document
> that the two <Name> elements are the same?  Do we need a better way to
> indicate this than simply relying on the byte-identity of the XML content?

One use case that springs to mind is the separation of homonyms, 
particularly when it comes to homonymous genera.
In the canonical names part of the Linnean Core we included (I'm 
not sure if it's disappeared in the latest version, but I don't think so) 
scope for a reference attribute in the separate atoms of the names. 
So in a binomial, the <genus> object could have a reference (id) 
that would allow the output to unambiguously identify _which_ Aus 
we had in mind when we said Aus bus. 

When passing information about uninomials, there is a lot more 
scope for ambiguity between byte-identical XML content (or 
'homonyms' as I old-fashionedly like to call them).
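
To make the homonym case concrete, here is a minimal sketch in 
Python. The element and attribute names (CanonicalName, Genus, ref) 
and the ref values are invented for illustration and are not taken 
from the Linnean Core schema. The point is only that the reference, 
not the text, tells the two genera apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: names and ref values are illustrative only.
DOC = """
<Names>
  <CanonicalName id="n1">
    <Genus ref="genus-101">Aus</Genus>
  </CanonicalName>
  <CanonicalName id="n2">
    <Genus ref="genus-202">Aus</Genus>
  </CanonicalName>
</Names>
"""


def genus_key(name):
    """Return (text, ref) for the Genus atom of a name element."""
    genus = name.find("Genus")
    return genus.text, genus.get("ref")


names = ET.fromstring(DOC).findall("CanonicalName")

# Comparing text alone collapses the two homonymous genera into one.
texts = {genus_key(n)[0] for n in names}

# Comparing (text, ref) keeps them apart.
keys = {genus_key(n) for n in names}
```

Here texts holds a single entry while keys holds two, which is the 
distinction a processing application would need.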

> An alternative representation (if this mattered) would be:
> 
> <TaxonConcepts>
>   <TaxonConcept id="tc1">
>     <NameRef id="tn1"/>
>     <AccordingTo>Smith</AccordingTo>
>   </TaxonConcept>
>   <TaxonConcept id="tc2">
>     <NameRef id="tn1"/>
>     <AccordingTo>Jones</AccordingTo>
>   </TaxonConcept>
> </TaxonConcepts>
> <TaxonNames>
>   <TaxonName id="tn1">
>     <Label>Aus bus</Label>
>     <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
>   </TaxonName>
> </TaxonNames>
>

There's a third way (sorry to introduce a note of domestic UK 
politics, but Rich and Nico started it) which is to take the LC 
approach and embed both identifiers and data:

 <TaxonConcepts>
   <TaxonConcept id="tc1">
     <Name id="123-1">
       <Label>Aus bus</Label>
       <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
     </Name>
     <AccordingTo>Smith</AccordingTo>
   </TaxonConcept>
   <TaxonConcept id="tc2">
     <Name id="123-1">
       <Label>Aus bus</Label>
       <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
     </Name>
     <AccordingTo>Jones</AccordingTo>
   </TaxonConcept>
 </TaxonConcepts>

Of course, to a computer the inclusion of the second set of 
information within the name is redundant, but we shouldn't 
underestimate the amount of human eyeballing of XML data that goes 
on. Also, if the name id has an existence outside the transient life 
of the XML document instance (for instance, if it were an id from a 
nomenclator), then the processing power involved in producing that 
sort of document on demand as part of a web service would be 
reduced.
Possibly we've been missing a trick in how we implement these 
things, but the one thing that made implementing TCS a bit of a 
challenge for IPNI was trying to create a document on the fly using 
templates. We had to keep track of a list of publications, a list of 
vouchers and now a list of names, and put ad hoc ids (1, 2, 3 ...) 
into the main schema referring to the separate lists at the bottom 
of the document.
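
A rough sketch of the difference in bookkeeping, in Python rather 
than in a template language. The record layout, the DataSet wrapper 
element and the function names are assumptions made up for the 
example; it is not how IPNI actually generates TCS.

```python
import xml.etree.ElementTree as ET

# Illustrative records; name_id stands in for a persistent nomenclator id.
CONCEPTS = [
    {"id": "tc1", "name_id": "123-1", "label": "Aus bus",
     "authorship": "Black, 1965", "according_to": "Smith"},
    {"id": "tc2", "name_id": "123-1", "label": "Aus bus",
     "authorship": "Black, 1965", "according_to": "Jones"},
]


def embedded(concepts):
    """One pass, no shared state: each concept carries its own name."""
    root = ET.Element("TaxonConcepts")
    for c in concepts:
        tc = ET.SubElement(root, "TaxonConcept", id=c["id"])
        name = ET.SubElement(tc, "Name", id=c["name_id"])
        ET.SubElement(name, "Label").text = c["label"]
        ET.SubElement(name, "CanonicalAuthorship").text = c["authorship"]
        ET.SubElement(tc, "AccordingTo").text = c["according_to"]
    return root


def normalised(concepts):
    """Needs a lookup table that lives as long as the whole document."""
    root = ET.Element("DataSet")  # hypothetical wrapper element
    tcs = ET.SubElement(root, "TaxonConcepts")
    tns = ET.SubElement(root, "TaxonNames")
    ad_hoc = {}  # (label, authorship) -> transient id such as "tn1"
    for c in concepts:
        key = (c["label"], c["authorship"])
        if key not in ad_hoc:
            ad_hoc[key] = f"tn{len(ad_hoc) + 1}"
            tn = ET.SubElement(tns, "TaxonName", id=ad_hoc[key])
            ET.SubElement(tn, "Label").text = c["label"]
            ET.SubElement(tn, "CanonicalAuthorship").text = c["authorship"]
        tc = ET.SubElement(tcs, "TaxonConcept", id=c["id"])
        ET.SubElement(tc, "NameRef", id=ad_hoc[key])
        ET.SubElement(tc, "AccordingTo").text = c["according_to"]
    return root
```

The embedded version can emit each concept as soon as it is read. 
The normalised version has to carry the ad_hoc table through the 
whole run, and the same again for publications and vouchers.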

> [Note that it is much harder reliably to assign and police meaningful
> identifiers for name elements if they are fully embedded.  There would
> certainly be no way to enforce a single consistent representation in all
> occurrences for the same <Name> across different <TaxonConcept> elements.]
> 

One thing I've found about XML: if you approach it with an OO 
programmer's hat on and try to make it enforce business rules, you 
very quickly get frustrated or end up with very complicated 
schemas. It's a weakness (or a strength) of XML, and trying to make 
it more strict may be a mistake. We may just have to design a schema 
that _allows_ data to be transferred properly, and include 
human-readable guidance around areas like consistency.
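
As an illustration of keeping such a rule outside the schema, here 
is a small Python check that a consumer could run after parsing. 
The function name and the rule itself (same Name id implies same 
embedded content) are assumptions for the sake of the example, not 
part of TCS or LC.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def inconsistent_name_ids(root):
    """Return ids of <Name> elements whose content differs between uses."""
    seen = defaultdict(set)
    for name in root.iter("Name"):
        name_id = name.get("id")
        if name_id is None:
            continue  # nothing to compare against
        content = tuple(
            (child.tag, (child.text or "").strip()) for child in name
        )
        seen[name_id].add(content)
    return sorted(i for i, variants in seen.items() if len(variants) > 1)


# The two occurrences of 123-1 disagree on the authorship string.
DOC = """
<TaxonConcepts>
  <TaxonConcept id="tc1">
    <Name id="123-1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
  </TaxonConcept>
  <TaxonConcept id="tc2">
    <Name id="123-1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black 1965</CanonicalAuthorship>
    </Name>
  </TaxonConcept>
</TaxonConcepts>
"""

problems = inconsistent_name_ids(ET.fromstring(DOC))
```

The schema stays permissive; the document is still valid, and the 
check simply reports which ids need a human to look at them.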

On a slightly related subject, Gregor and I did some thinking and 
discussing on ids (it's on the LC wiki somewhere), trying to come 
up with a structure that would allow ids to have different scopes: 
transient ones (1, 2, 3 ...) that are created for the life of a 
document and only make sense within that context; ones that are 
unique and immutable in the context of a dataset (e.g. IPNI ids); 
and so on all the way up to full-blown LSIDs. Looking at that might 
help future-proof any schema we do come up with, so that if LSIDs or 
whatever do take off we're able to deal with them.
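
A minimal sketch of what scoped ids might look like to a consuming 
application, in Python. The class names, the three scope labels and 
the comparison rule are my own assumptions for illustration; they 
are not the structure written up on the LC wiki.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    DOCUMENT = "document"  # transient: 1, 2, 3 ... within one document
    DATASET = "dataset"    # stable within one dataset, e.g. a nomenclator
    GLOBAL = "global"      # globally unique, e.g. an LSID


@dataclass(frozen=True)
class Identifier:
    value: str
    scope: Scope
    authority: str = ""  # dataset the id belongs to; unused otherwise

    def same_entity(self, other, same_document=False):
        """Can we conclude that both ids point at the same thing?"""
        if self.scope != other.scope or self.value != other.value:
            return False
        if self.scope is Scope.GLOBAL:
            return True
        if self.scope is Scope.DATASET:
            return self.authority != "" and self.authority == other.authority
        # Transient ids only mean anything inside a single document.
        return same_document


# Made-up example values.
transient_a = Identifier("1", Scope.DOCUMENT)
transient_b = Identifier("1", Scope.DOCUMENT)
ipni_a = Identifier("123-1", Scope.DATASET, authority="IPNI")
ipni_b = Identifier("123-1", Scope.DATASET, authority="IPNI")
```

With this rule, transient_a and transient_b only match when the 
caller says they came from the same document, while ipni_a and 
ipni_b match wherever they turn up.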

> Will there ever be any reasons why tools that generate TCS data would need
> to present it in a more normalised form like this?  I am fairly confident
> that the answer here is "No".
> 
> Will there ever be any reasons why tools that process TCS data would be
> better served by the more normalised form?  Here I am less sure.  Do we have
> any use cases that would drive us that way?
> 
Darn ... I swore I wouldn't get involved in this discussion until I'd 
finished reading my backlog of emails. But Donald said the magic 
words (use case) and I couldn't help but dive in ...

Sally

*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe at rbgkew.org.uk


