[SEEK-Taxon] Thoughts on GUIDs

Richard Pyle deepreef at bishopmuseum.org
Tue May 25 15:51:41 PDT 2004


Hi Shawn,

I think that Jim's point was that the key value itself should not embed
information.

For example, consider my earlier reference to Paracentropyge SEC Pyle, 2003.
To create a truly unique "natural" key for this instance, you'd have to
unambiguously resolve "Paracentropyge" from all possible homonyms, which
means unambiguously indicating sufficient detail about the reference in
which it was described (including page number, because there are cases where
homonyms are described within the same reference); AND you'd have to
unambiguously resolve "Pyle, 2003".  I'm not sure what the most reliable
natural key for a published reference would be, but it would probably have
to minimally include the title of the reference (a single author or set of
authors may very well publish within the same year more than one article
with overlapping page numbers).  It's hard to use citation details, because
these vary so much depending on the nature of the reference (book, journal
article, newspaper article, etc.). So, the bottom line is that it's
cumbersome (if not nearly impossible) to come up with a natural key for
uniquely indicating a reference, and the natural unique identifier for a
taxonomic concept would almost certainly have to include within it the
unique identifiers for at least two references (one to give the namestring
context, and the other to anchor the namestring to a particular
concept-usage).

So, at its worst, an information-bearing GUID for a concept would look
something like:

"Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanth
idae:69SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_gen
era_and_subgenera_of_Pomacanthidae"

And even that probably won't quite do it (uniquely) in all cases.  But more
significantly, the probability that one of those characters is initially
entered incorrectly when the GUID is created is very high, e.g.:

"Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanth
idae:68SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_gen
era_and_subgenera_of_Pomacanthidae"

If local databases are all using this sort of information-bearing GUID to
identify their concepts and to map their concepts to the SEEK
infrastructure, then on the discovery of a minor error (in this case, that
Paracentropyge was described on page 69 of Burgess 1991, rather than on page
68) we are faced with a dilemma:  maintain the GUID as permanent and
unchanging, with its error intact (in which case, why have
information-bearing keys in the first place, if the information cannot be
trusted as accurate?); or correct the error in the GUID (and deal with having
to propagate the correction to all instances of the GUID in all local
databases). Of course, there is a third option, which is where I think the
"version" thing comes in, which is to maintain BOTH GUIDs, and have a
secondary structure that maintains the fact that one is the more correct
version of the other.
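
As a rough illustration of that dilemma, here is a minimal Python sketch. The
natural_key() function, its key format, and the "local record 17" entry are
assumptions made up for this example; they are not part of SEEK or of any
existing system.

```python
def natural_key(genus, name_ref, page, concept_ref):
    """Build an information-bearing key from its parts (hypothetical format)."""
    return f"{genus}-{name_ref}:{page}SEC{concept_ref}"


BURGESS = "BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanthidae"
PYLE = ("PyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_"
        "of_the_genera_and_subgenera_of_Pomacanthidae")

# Key as first entered, with the wrong page number (68).
key_with_error = natural_key("Paracentropyge", BURGESS, 68, PYLE)
# Key after the page number is corrected (69).
key_corrected = natural_key("Paracentropyge", BURGESS, 69, PYLE)

# A local database that stored the key as first issued (hypothetical record).
local_db = {key_with_error: "local record 17"}

# Option 1: keep the erroneous key forever. It still resolves locally,
# but the page number embedded in it is wrong.
still_resolves = key_with_error in local_db

# Option 2: correct the key. Every local copy is now stale until updated.
found_after_correction = key_corrected in local_db

# Option 3: keep both keys and record which one supersedes the other.
superseded_by = {key_with_error: key_corrected}
```

Under option 2 the lookup with the corrected key fails until every local
database has been updated; under option 3 someone has to own and maintain the
superseded_by table.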

The third option seems attractive, but requires a central authority to
maintain the version equivalencies.  Once you have to commit to a central
authority, why not capitalize on it maximally?  For instance, the central
authority could establish a GUID system for references, such that it
records that 1234 is the surrogate key for the Reference cited as
"Burgess, W.E. 1991. Two new genera of angelfishes, family
Pomacanthidae. etc."; and 5678 is the surrogate key for the Reference cited
as "Pyle, R.L. 2003. Chapter 2. Revision and phylogenetic analysis of the
genera and subgenera of Pomacanthidae. etc."

Then we could collapse our concept GUIDs to something like:

"Paracentropyge[taxaserver.org:Ref/1234]:68_SEC_[taxaserver.org:Ref/5678]"

But even this is subject to potential future change (e.g., it could be
discovered that the Burgess reference assigned to ID#1234 was not actually
the original description of the genus name Paracentropyge), so we're stuck
with the same problems again.
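
A small Python sketch of the collapsed form, and of why it is still fragile.
The concept_id() helper and the replacement reference ID 4321 are
hypothetical, introduced only for illustration; the IDs 1234 and 5678 and the
taxaserver.org prefix are the example values used above.

```python
# Hypothetical central table of reference surrogate keys.
references = {
    1234: "Burgess, W.E. 1991. Two new genera of angelfishes, "
          "family Pomacanthidae. etc.",
    5678: "Pyle, R.L. 2003. Chapter 2. Revision and phylogenetic analysis "
          "of the genera and subgenera of Pomacanthidae. etc.",
}


def concept_id(genus, name_ref_id, page, concept_ref_id,
               server="taxaserver.org"):
    """Compose a concept ID that embeds reference surrogate keys."""
    return (f"{genus}[{server}:Ref/{name_ref_id}]:{page}"
            f"_SEC_[{server}:Ref/{concept_ref_id}]")


original = concept_id("Paracentropyge", 1234, 68, 5678)

# Suppose the original description turns out to be in another reference,
# registered under the (hypothetical) ID 4321. The concept ID changes.
revised = concept_id("Paracentropyge", 4321, 68, 5678)
ids_differ = original != revised
```

The string is shorter, but it still carries the genus name, a page number and
two reference links, so any correction to those still produces a different ID.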

I agree with you that the Surrogate Keys are established for pragmatic
purposes.  In that sense, having a central authority to resolve the
meaning/currency/versioning of a concept identifier is an impediment to
pragmatism.  However, once you cross the threshold of needing a central
authority for ID resolution, you might as well milk it for all of its
pragmatic worth, and establish an arbitrary surrogate key system to uniquely
identify concepts without any attempt to embed information within the key
(except, perhaps, for the metadata about the key itself, such as its
issuer).  The risk of embedding information about the taxonomic concept
within the keystring is that errors will undoubtedly be discovered in that
keystring -- leaving us with the problem of correction propagation or
versioning.  Removing all taxon concept information from the keystring
allows the arbitrary key to remain constant and unchanging.  The downside,
of course, is that you need to involve a centralized registry that maintains
the "most correct" version of the information that a human would look at to
uniquely identify the concept that the GUID is intended to represent.
But I just don't see how we'll ever move forward on fluid taxonomic data
exchange without the establishment of such an authority.
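
To make the trade-off concrete, here is a minimal Python sketch of an
arbitrary surrogate key backed by a registry. The ConceptRegistry class, its
method names, the metadata fields and the "local record 17" entry are all
assumptions for illustration, not a proposed SEEK design. The key carries
only its issuer plus a random UUID.

```python
import uuid


class ConceptRegistry:
    """Toy central registry: opaque keys, with correctable metadata."""

    def __init__(self, issuer):
        self.issuer = issuer
        self._records = {}

    def register(self, **metadata):
        # The key embeds only the issuer and a random UUID, nothing
        # about the taxon concept itself.
        key = f"{self.issuer}:{uuid.uuid4()}"
        self._records[key] = dict(metadata)
        return key

    def correct(self, key, **changes):
        # Fix the descriptive information; the key itself never changes.
        self._records[key].update(changes)

    def resolve(self, key):
        # Return a copy of the current "most correct" metadata.
        return dict(self._records[key])


registry = ConceptRegistry("taxaserver.org")
guid = registry.register(name="Paracentropyge", name_ref="Burgess 1991",
                         page=68, concept_ref="Pyle 2003")

# A local database maps the opaque key to its own record (hypothetical).
local_db = {guid: "local record 17"}

# The page-number error is discovered and fixed centrally.
registry.correct(guid, page=69)
corrected_page = registry.resolve(guid)["page"]
```

After the correction the local mapping is untouched and still valid; only the
registry's record changed. The cost is that nobody can tell what the key
means without asking the registry.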

This ended up being MUCH longer than I intended it -- and it's not directed
specifically at you, Shawn (you undoubtedly understand general surrogate key
theory much better than I do).  But it seems to me that the goal of SEEK is
to get things moving forward, which means the implementation of pragmatic
steps. It's pretty clear that natural (purely non-arbitrary
information-bearing) identifiers for taxonomic names and concepts are not at
all practical, nor will they be any time soon (owing mostly to issues of
homonymy, and the complexity of establishing natural unique identifiers for
references). So if some sort of resolver is necessary, it seems to me that
pragmatism is maximized by excluding any taxon-concept information within
the GUID string.

Clearly, it's time now for me to shut up.

Aloha,
Rich

> -----Original Message-----
> From: seek-taxon-admin at ecoinformatics.org
> [mailto:seek-taxon-admin at ecoinformatics.org]On Behalf Of Shawn Bowers
> Sent: Tuesday, May 25, 2004 9:39 AM
> To: Beach, James H
> Cc: SEEK Taxon
> Subject: Re: [SEEK-Taxon] Thoughts on GUIDs
>
>
>
> Beach, James H wrote:
>
> > One of the strongest arguments for the evaluation of 'artificial' or
> > 'surrogate' key fields in a database context is that the 'key' should
> > not contain any implicit or explicit information about the object being
> > identified, other than its identity!
>
> The comment above doesn't seem quite right.  How can something have an
> identity that is independent (i.e., a surrogate) of the identity of the
> thing?  In other words, if the key doesn't contain any information about
> the object being identified, it surely can't uniquely describe or
> identify the object, right?
>
> In general, surrogate keys are used for purely pragmatic purposes within
> a database management system, e.g., so that a clustered B+-tree index
> can be constructed for the table, or to identify certain relationships
> in an ER or OO database.  But, surrogate keys always "pass the buck" of
> identity to something else.  For example, in OODBs there are two notions
> of equality, where objects can be deep-equal (value-equal) or
> shallow-equal (id-equal).
>
> Another problem with surrogate keys is that they are arbitrarily
> assigned, and conceptually, don't provide any information to a user
> about the corresponding object (other than the MAC address used to
> construct the identifier, or the time the thing was put into the system,
> etc.). Often, surrogate keys are "hidden" from the user, which gets back
> to the problem of how to really identify objects.  Also, with
> surrogates, true uniqueness is always in question.  Hence arguments for
> "globally" unique identifiers versus "universally" unique identifiers,
> and so on.
>
> > If the key itself has information then you will inevitably run into a
> > situation where the key will need to be changed because something about
> > the information represented by the key value has changed or is in doubt
> > or is a matter of interpretation, (thus losing the temporal uniqueness
> > of the GUID).
>
> Again, then the information used as the key isn't really "identifying"
> information, and you have a problem anyway.
>
> There is a very interesting article that people may want to read
> concerning properties of things and classification, including identity
> and unity, that may be relevant to what taxon is trying to accomplish
> with concepts.
>
> The paper can be found here, and was published in the Communications of
> the ACM in 2002.  There are longer, more detailed versions available,
> but this is a good primer.
>
> http://www.loa-cnr.it/Papers/CACM2002.pdf
>
>
> > If for example, we decide to embed version numbers within
> > the GUID, then there will be relationships between GUIDs that need to be
> > maintained and respected and modeled as a consequence of the version
> > numbers themselves (sort of an embedded data model within the ID), which
> > adds another layer of abstraction to the whole enterprise of managing
> > concepts.  Instead of just worrying about mapping the taxonomic
> > relationships among concepts using unique IDs as the handles, such as in
> > the recent examples, one now has to verify that the subkey/version
> > identifiers are accurate (and that may be a matter of differing
> > interpretations) and related in the appropriate way that corresponds to
> > the taxonomy.
> >
> > I would recommend that versioning be handled outside of the key or ID.
> > Let resolver services deal with version differences based on the
> > metadata, don't hard code relationships among concept versions in the
> > identifier.
>
>
>
> >
> > _____________________________
> > James H. Beach
> > Biodiversity Research Center
> > University of Kansas
> > 1345 Jayhawk Boulevard
> > Lawrence, KS 66045, USA
> > T 785 864-4645, F 785 864-5335
> >
> >
> >
>




