[tcs-lc] Modularisation of standards - identification of names
Sally Hinchcliffe
S.Hinchcliffe at rbgkew.org.uk
Tue Mar 8 02:53:46 PST 2005
On the second part of Donald's email:
> (Now that I think of it) I guess there is one other possible reason why we
> may wish to be able to separate names out. This is a data processing issue
> and Bob Morris may just tell me that my problem comes from assuming a
> particular implementation, but here goes...
>
> Assume a document in which two concepts refer to the same published name
> (using an abbreviated representation of TCS data):
>
> <TaxonConcepts>
> <TaxonConcept id="tc1">
> <Name>
> <Label>Aus bus</Label>
> <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
> </Name>
> <AccordingTo>Smith</AccordingTo>
> </TaxonConcept>
> <TaxonConcept id="tc2">
> <Name>
> <Label>Aus bus</Label>
> <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
> </Name>
> <AccordingTo>Jones</AccordingTo>
> </TaxonConcept>
> </TaxonConcepts>
>
> The nomenclatural data here under name is rather simple and there may be
> little problem with denormalising it. A DiGIR-style search could allow a
> user to find all TaxonConcepts based on "Aus bus Black, 1965".
>
> However will it ever matter to an application processing such a document
> that the two <Name> elements are the same? Do we need a better way to
> indicate this than simply relying on the byte-identity of the XML content?
One use case that springs to mind is the separation of homonyms,
particularly where it comes to homonym genera.
In the canonical names part of the Linnean Core we included (I'm
not sure if it's disappeared in the latest version, but I don't think so)
scope for a reference attribute in the separate atoms of the names.
So in a binomial, the <genus> object could have a reference (id)
that would allow the output to unambiguously identify _which_ Aus
we had in mind when we said Aus bus.
When passing information about uninomials, there is a lot more
scope for ambiguity between byte-identical XML content (or
'homonyms', as I old-fashionedly like to call them).
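To make the homonym case concrete, here is a minimal Python sketch. The element name `Genus`, the attribute name `ref` and the id values are invented for illustration and are not taken from the LC or TCS schemas. It only shows that two uninomials can be identical as text while a reference attribute still tells them apart.

```python
import xml.etree.ElementTree as ET

# Two uninomials with byte-identical text content. The "ref" attribute and
# its values are hypothetical, not taken from any real schema or nomenclator.
DOC = """
<Names>
  <Genus ref="g-101">Aus</Genus>
  <Genus ref="g-202">Aus</Genus>
</Names>
"""

def compare_genera(xml_text):
    """Return (same_text, same_ref) for the first two Genus elements."""
    first, second = ET.fromstring(xml_text).findall("Genus")[:2]
    same_text = first.text == second.text
    same_ref = first.get("ref") == second.get("ref")
    return same_text, same_ref

print(compare_genera(DOC))  # (True, False): same spelling, different genus
```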
> An alternative representation (if this mattered) would be:
>
> <TaxonConcepts>
> <TaxonConcept id="tc1">
> <NameRef id="tn1"/>
> <AccordingTo>Smith</AccordingTo>
> </TaxonConcept>
> <TaxonConcept id="tc2">
> <NameRef id="tn1"/>
> <AccordingTo>Jones</AccordingTo>
> </TaxonConcept>
> </TaxonConcepts>
> <TaxonNames>
> <TaxonName id="tn1">
> <Label>Aus bus</Label>
> <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
> </TaxonName>
> </TaxonNames>
>
There's a third way (sorry to introduce a note of domestic UK
politics, but Rich and Nico started it) which is to take the LC
approach and embed both identifiers and data:
<TaxonConcepts>
<TaxonConcept id="tc1">
<Name id="123-1">
<Label>Aus bus</Label>
<CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
</Name>
<AccordingTo>Smith</AccordingTo>
</TaxonConcept>
<TaxonConcept id="tc2">
<Name id="123-1">
<Label>Aus bus</Label>
<CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
</Name>
<AccordingTo>Jones</AccordingTo>
</TaxonConcept>
</TaxonConcepts>
Of course, to a computer the inclusion of the second set of
information within the name is redundant, but we shouldn't
underestimate the amount of human eyeballing of XML data that goes
on. Also, if the name id has an existence outside the transient life
of the XML document instance (for instance, if it were an id from a
nomenclator), then the processing power involved in producing that
sort of document on demand as part of a web service would be
reduced.
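As a rough illustration of what a consumer gains from the embedded id, the Python sketch below groups concepts by the id on their <Name> element without comparing the name content at all. It assumes the abbreviated element names used in the fragment above; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated document in the "embed both identifiers and data" style.
DOC = """
<TaxonConcepts>
  <TaxonConcept id="tc1">
    <Name id="123-1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
    <AccordingTo>Smith</AccordingTo>
  </TaxonConcept>
  <TaxonConcept id="tc2">
    <Name id="123-1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
    <AccordingTo>Jones</AccordingTo>
  </TaxonConcept>
</TaxonConcepts>
"""

def concepts_by_name_id(xml_text):
    """Map each Name id to the ids of the concepts that use it."""
    groups = defaultdict(list)
    for concept in ET.fromstring(xml_text).findall("TaxonConcept"):
        name = concept.find("Name")
        groups[name.get("id")].append(concept.get("id"))
    return dict(groups)

print(concepts_by_name_id(DOC))  # {'123-1': ['tc1', 'tc2']}
```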
Possibly we've been missing a trick in how we implement these
things, but the one thing that made implementing TCS a bit of a
challenge for IPNI was creating a document on the fly using
templates: we had to keep track of a list of publications, a list of
vouchers and now a list of names, and put ad hoc ids (1, 2, 3 ...)
into the main schema referring to the separate lists at the bottom
of the document.
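The bookkeeping involved in the normalised form can be sketched in Python as follows. This is not how IPNI generates its output; it is a minimal illustration, and the wrapper element `DataSet`, the function name and the `tnN` id pattern are assumptions. The point is that the generator must remember which names it has already emitted and hand out transient ids as it goes.

```python
import xml.etree.ElementTree as ET

def build_normalised(concepts):
    """Build a normalised document from (concept_id, label, authorship, according_to) tuples."""
    root = ET.Element("DataSet")  # hypothetical wrapper element
    concepts_el = ET.SubElement(root, "TaxonConcepts")
    names_el = ET.SubElement(root, "TaxonNames")
    name_ids = {}  # (label, authorship) -> transient id such as "tn1"
    for concept_id, label, authorship, according_to in concepts:
        key = (label, authorship)
        if key not in name_ids:
            # First time this name is seen: assign an ad hoc id and emit it once.
            name_ids[key] = "tn%d" % (len(name_ids) + 1)
            name_el = ET.SubElement(names_el, "TaxonName", id=name_ids[key])
            ET.SubElement(name_el, "Label").text = label
            ET.SubElement(name_el, "CanonicalAuthorship").text = authorship
        concept_el = ET.SubElement(concepts_el, "TaxonConcept", id=concept_id)
        ET.SubElement(concept_el, "NameRef", id=name_ids[key])
        ET.SubElement(concept_el, "AccordingTo").text = according_to
    return root

root = build_normalised([
    ("tc1", "Aus bus", "Black, 1965", "Smith"),
    ("tc2", "Aus bus", "Black, 1965", "Jones"),
])
print(ET.tostring(root, encoding="unicode"))
```

With an id that already exists outside the document, the `name_ids` lookup table disappears and each concept can be written out independently.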
> [Note that it is much harder reliably to assign and police meaningful
> identifiers for name elements if they are fully embedded. There would
> certainly be no way to enforce a single consistent representation in all
> occurrences for the same <Name> across different <TaxonConcept> elements.]
>
One thing I've found about XML: if you approach it with an OO
programmer's hat on and try to make it enforce business rules,
you very quickly get frustrated, or end up with very
complicated schemas. It's a weakness (or a strength) of XML, and
trying to make it more strict may be a mistake. We may have to
just design a schema that _allows_ data to be transferred properly,
and include human-readable guidance around areas like
consistency.
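A consistency rule of this kind can be checked by the receiving application rather than by the schema. The Python sketch below is one possible check, assuming the abbreviated element names used in this thread; the function name is hypothetical. It reports any Name id whose embedded content differs between occurrences.

```python
import xml.etree.ElementTree as ET

# The second occurrence of id "123-1" drops the comma in the authorship.
DOC = """
<TaxonConcepts>
  <TaxonConcept id="tc1">
    <Name id="123-1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
  </TaxonConcept>
  <TaxonConcept id="tc2">
    <Name id="123-1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black 1965</CanonicalAuthorship>
    </Name>
  </TaxonConcept>
</TaxonConcepts>
"""

def inconsistent_name_ids(xml_text):
    """Return the sorted Name ids whose embedded content differs between occurrences."""
    seen = {}    # name id -> content of its first occurrence
    bad = set()
    for name in ET.fromstring(xml_text).iter("Name"):
        content = (name.findtext("Label"), name.findtext("CanonicalAuthorship"))
        first = seen.setdefault(name.get("id"), content)
        if first != content:
            bad.add(name.get("id"))
    return sorted(bad)

print(inconsistent_name_ids(DOC))  # ['123-1']
```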
On a slightly related subject, Gregor and I did some thinking and
discussing on ids (it's on the LC wiki somewhere), trying to come
up with a structure that would allow ids to have different scopes:
either transient ones (1, 2, 3 ...) that were created for the life of a
document and only made sense within that context, or ones which
referred to ids that were unique and immutable in the context of a
dataset (e.g. IPNI ids), all the way up to full-blown LSIDs. Looking
at that might help futureproof any schema we do come up with so
that if LSIDs or whatever do take off we're able to deal with them.
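The three scopes could be modelled along the following lines. This Python sketch is not the structure from the LC wiki, which is not reproduced here; the class, field and function names are assumptions, and the example LSID is made up. The only external fact relied on is that LSIDs are URNs beginning with "urn:lsid:".

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    DOCUMENT = "document"  # transient: 1, 2, 3 ... valid only within one document
    DATASET = "dataset"    # unique and immutable within one dataset
    GLOBAL = "global"      # globally unique, e.g. an LSID

@dataclass(frozen=True)
class ScopedId:
    scope: Scope
    value: str
    dataset: str = ""      # only meaningful for Scope.DATASET

def classify(raw, dataset=""):
    """Assign a scope to a raw id string (hypothetical helper)."""
    if raw.lower().startswith("urn:lsid:"):
        return ScopedId(Scope.GLOBAL, raw)
    if dataset:
        return ScopedId(Scope.DATASET, raw, dataset)
    return ScopedId(Scope.DOCUMENT, raw)

print(classify("1"))
print(classify("123-1", dataset="IPNI"))
print(classify("urn:lsid:example.org:names:123-1"))  # made-up LSID
```

Carrying the scope alongside the value lets a consumer decide whether an id can be stored and reused or must be thrown away with the document.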
> Will there ever be any reasons why tools that generate TCS data would need
> to present it in a more normalised form like this? I am fairly confident
> that the answer here is "No".
>
> Will there ever be any reasons why tools that process TCS data would be
> better served by the more normalised form? Here I am less sure. Do we have
> any use cases that would drive us that way?
>
Darn ... I swore I wouldn't get involved in this discussion until I'd
finished reading my backlog of emails. But Donald said the magic
words (use case) and I couldn't help but dive in ...
Sally
*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe at rbgkew.org.uk