[SEEK-Taxon] GUID (was Re: first cut at guid decision document)

Fri Mar 12 13:14:12 PST 2004

Hi James,

Thanks for the prompt reply!

> You are right that a text_ID depending on variable contents can't be
> the primary key of a tuple (or an object).  However, that does not mean
> that no text_ID works; if the contents are fixed, it (or its hashed
> friend) should work fine.  Different tuples in distributed
> databases may indicate the same potential taxon.  In that case I
> prefer to identify these tuples (or, more precisely, their contents) as
> the same potential taxon.  Isn't that one of the reasons why we need
> a GUID?

I think the concatenated text strings are very powerful for a first-pass
synchronization of datasets -- but I've worked with synchronizing taxonomic
datasets enough to know how incredibly inconsistently the "same" data may be
entered into different databases. So much so that I'm finding the
full-context concatenations to be useful (without human
proofing/intervention) for fewer than half of the records, on average. And
most of these synchronizations are for name-only matchups (concept matching
would present yet another level of complexity and potential inconsistency).
I see the value in an arbitrary GUID system as metaphorically a common "flag
pole" around which all datasets can rally.  There would still be the tedious
process of synchronizing the big datasets (e.g., ITIS with SP2000) -- and
part of that process can be accomplished using a concatenated string (or
hashed equivalent) to identify the synchrony -- but I just don't see how we
can ever arrive at the state of data "cleanliness" such that an
information-bearing package (text_ID) will be the optimal currency of data
exchange. The concatenations are, ultimately, derived from human fingers on
keyboards; and as such, I fear there is just too much imprecision for them
to be relied upon as unique identifiers.
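
To make the concern concrete, here is a minimal sketch in Python. It is
not taken from ITIS, SP2000, or any existing system: the field layout, the
"|" separator, and the choice of SHA-1 are all assumptions made for
illustration. It shows that two spellings of the same authority give
different concatenated strings and different hashes, whereas an arbitrary
identifier can be shared regardless of how each database spells the data.

```python
import hashlib
import uuid


def text_id(genus, epithet, author, year):
    """Build an information-bearing key (hypothetical field layout)."""
    return "|".join([genus, epithet, author, str(year)])


def hashed_id(text):
    """Hashed equivalent of the concatenated string (SHA-1 is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# The "same" name as it might be keyed into two different databases.
record_a = text_id("Canis", "lupus", "Linnaeus", 1758)
record_b = text_id("Canis", "lupus", "L.", 1758)

# Neither the raw strings nor their hashes agree.
same_text = record_a == record_b
same_hash = hashed_id(record_a) == hashed_id(record_b)

# An arbitrary identifier carries no information, so both databases can
# attach it to their own spelling once the match has been established.
shared_guid = str(uuid.uuid4())
database_a = {shared_guid: record_a}
database_b = {shared_guid: record_b}
```

Hashing does not help here: it only shortens the string, so any
inconsistency in the input survives into the key.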

On the other hand, I am aware that "informationless" (arbitrary) identifiers
run the risk of being orphaned/divorced from the data they represent, and
that there will still be the problem of inadvertent duplication -- but on
balance, I feel this to be the lesser of evils.

> Nomencurator is based on a publication model; once you have published
> an article you can't modify it, only publish an amendment,

True, but in taxonomy, you can modify the *interpretation* of the elements
of an original published description.  Many names were described in old
publications, where the date of publication is not known with perfect
certainty.  New evidence may come to light that reveals a different year of
description (for example).  If year of description is part of the
concatenated string identifier, then we have a problem in propagating the
correction to all affected datasets.  With a surrogate ID number, the
linking value does not need to be changed -- only the data corresponding to
that ID number.
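
A small sketch of the difference, in Python. The name "Aus bus", the
author, the years, and the ID value 1001 are placeholders invented for the
example, not real data, and the dictionary layout is only an assumption
about how such records might be held.

```python
def text_id(name, author, year):
    """Concatenated identifier that embeds the year of description."""
    return f"{name}|{author}|{year}"


# Placeholder name record, keyed by an arbitrary surrogate ID.
names = {1001: {"name": "Aus bus", "author": "Smith", "year": 1801}}

# Another dataset links to the name by surrogate ID only.
usages = [{"usage_id": 1, "name_id": 1001}]

old_text_id = text_id(**names[1001])

# New evidence reveals a different year: correct the data in one place.
names[1001]["year"] = 1802
new_text_id = text_id(**names[1001])

# The surrogate link is untouched; the concatenated key is not.
link_still_valid = usages[0]["name_id"] in names
text_id_changed = old_text_id != new_text_id
```

Any dataset that had stored the old concatenated string as its link would
now have to be found and updated; the dataset that stored 1001 would not.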

> Once you
> contribute a data entry to the public Nomencurator server, you
> can't modify the entry.  You can contribute a new entry with a
> reference to the previous one to be amended.  This reference
> is retained in the Annotation data.  This mechanism allows us to record
> mistakes, including typographical ones.

O.K., I understand -- this is a form of versioning.  As long as the
Annotation data are maintained properly, I see how this system would work.
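
As I read it, the mechanism could be sketched roughly as follows in Python.
This is only my interpretation, not Nomencurator's actual data model: the
class names Entry, Annotation, and Store, and the method names, are
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """An immutable contributed entry."""
    entry_id: int
    text: str


@dataclass(frozen=True)
class Annotation:
    """Records that one entry amends an earlier one."""
    amends: int      # ID of the entry being corrected
    amended_by: int  # ID of the entry carrying the correction


class Store:
    """Append-only store: entries are never modified, only amended."""

    def __init__(self):
        self.entries = {}
        self.annotations = []

    def contribute(self, text, amends=None):
        if amends is not None and amends not in self.entries:
            raise KeyError(f"no entry {amends} to amend")
        entry_id = len(self.entries) + 1
        self.entries[entry_id] = Entry(entry_id, text)
        if amends is not None:
            self.annotations.append(Annotation(amends, entry_id))
        return entry_id

    def current(self, entry_id):
        """Follow amendment links to the most recent version."""
        while True:
            later = [a.amended_by for a in self.annotations
                     if a.amends == entry_id]
            if not later:
                return self.entries[entry_id]
            entry_id = later[-1]


store = Store()
first = store.contribute("Canis lupis L.")                 # typographical error
second = store.contribute("Canis lupus L.", amends=first)  # amendment
```

The mistaken entry stays on record under its original ID, and the
Annotation is the only thing connecting it to its correction, which is why
the Annotation data have to be maintained properly.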

> Nomencurator is also designed
> to link fragmentary data, e.g. "Canis but I do not know authority"
> or "Canis L., but do not know citation".  Nomencurator is expected
> to create integrated data to manage these 'raw' data.  The
> integrated data has references to the raw data used to determine
> its contents.  We need only 'cooked' data for ordinary use; if
> you have doubts about its contents, then you can examine the 'raw' data.
> This implies we need two modes in Nomencurator, cooked and raw.
> While public 'raw' data can have a fixed text_ID, 'cooked' data
> has a text_ID that varies when new but relevant 'raw' data are
> contributed.  We need a sophisticated method, including N-grams, to
> manage the cooked text_ID.  Misspellings in Latin names and variants of
> author names are inevitable, so we need such an intelligent mechanism
> even with a fixed text_ID, or a numeric ID referring to such contents.

O.K., I understand that approach better now -- thank you.  I'll have to
think on it some to explore the implications for GUIDs, though...
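
For reference, one common form of N-gram matching compares the sets of
character trigrams in two strings. The Python sketch below uses the Dice
coefficient; it is a generic illustration, not the method Nomencurator
uses, and the function names and the padding scheme are assumptions.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over character n-grams: 1.0 means identical sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# A misspelling scores much closer to the intended name than an
# unrelated name does.
close = similarity("Canis lupus", "Canis lupis")
far = similarity("Canis lupus", "Felis catus")
```

A score like this can rank candidate matches for human review, but it does
not by itself decide that two records refer to the same name.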

> I have strong sympathy with Taxonomer's design, especially its
> 'extreme' positioning.  Just as Taxonomer does with 'Assertion',
> Nomencurator distinguishes each name usage as 'NameUsage' data
> (it was called NameRecord in our paper).  We also recognise the
> importance of richness in the linkage types of 'Annotation', expecting
> it to be similar to the subtyping of 'Assertion'.  I think that we need to
> compare Taxonomer and Nomencurator in more detail,

Personally, I would VERY much like to do this at some point -- for my own
understanding, more than anything else.  I've read most of the "MoRETax"
booklet of the Berlin group, and have read what I have of Nomencurator and
VegBank -- but it was MOST helpful to be in the same room with Bob Peet to
have him explain VegBank face-to-face.  I came away from that meeting with a
much better understanding of how one might crosswalk the data.  I would
very much like to do likewise with you and Walter as well.

> Could be... porting data from one to
> the other may be good practice.

That's exactly what Jessie is working on (as I understand it).  I am behind
in providing a fully-fleshed sample dataset in Taxonomer for Jessie & Robert
to manipulate (sorry about that, folks!  I will get it done!). But the idea
was to find a common "core" that could accommodate output from all of the
major concept models (VegBank, Walter's model, Nomencurator, etc.).

I'm not sure when next I will have email access -- so if I'm silent for a
while, it's because I am traveling.

Aloha,
Rich