[SEEK-Taxon] Thoughts on GUIDs

Tue May 25 12:38:36 PDT 2004

Beach, James H wrote:

> One of the strongest arguments for the evaluation of 'artificial' or 
> 'surrogate' key fields in a database context is that the 'key' should 
> not contain any implicit or explicit information about the object being 
> identified, other than its identity!

The comment above doesn't seem quite right.  How can a key have an 
identity that is independent of (i.e., a surrogate for) the identity of 
the thing itself?  In other words, if the key doesn't contain any 
information about the object being identified, it surely can't uniquely 
describe or identify the object, right?

In general, surrogate keys are used for purely pragmatic purposes within 
a database management system, e.g., so that a clustered B+-tree index 
can be constructed for the table, or to identify certain relationships 
in an ER or OO database.  But surrogate keys always "pass the buck" of 
identity to something else.  For example, in OODBs there are two notions 
of equality, where objects can be deep-equal (value-equal) or 
shallow-equal (id-equal).
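
To make the two notions of equality concrete, here is a minimal sketch 
in Python.  The TaxonConcept class and its fields are made up purely for 
illustration (they are not from any SEEK schema); the point is only that 
value-equality looks at the contents, while id-equality looks at which 
object you are holding.

```python
from dataclasses import dataclass


@dataclass
class TaxonConcept:
    # Hypothetical fields, chosen only for illustration.
    name: str
    authority: str


a = TaxonConcept("Quercus alba", "L.")
b = TaxonConcept("Quercus alba", "L.")  # separate object, same values
c = a                                   # second reference to the same object

# Deep (value) equality: the generated __eq__ compares the fields.
value_equal_ab = (a == b)

# Shallow (id) equality: compares object identity only.
id_equal_ab = (a is b)
id_equal_ac = (a is c)

print(value_equal_ab, id_equal_ab, id_equal_ac)  # True False True
```

Here a and b are indistinguishable by value yet are different objects, 
so the object id by itself says nothing about what is being described.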

Another problem with surrogate keys is that they are arbitrarily 
assigned and, conceptually, don't provide any information to a user 
about the corresponding object (other than the MAC address used to 
construct the identifier, or the time the thing was put into the system, 
etc.). Often, surrogate keys are "hidden" from the user, which gets back 
to the problem of how to really identify objects.  Also, with 
surrogates, true uniqueness is always in question; hence the arguments 
for "globally" unique identifiers versus "universally" unique 
identifiers, and so on.
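
As a rough illustration of the parenthetical above, the sketch below 
uses Python's standard uuid module.  A version-1 UUID is built from a 
node id (normally the MAC address) and a timestamp, and both can be read 
back out of it; a version-4 UUID is just random bits.  The node value 
here is made up so that the example does not depend on any real 
hardware.

```python
import uuid
from datetime import datetime, timedelta

# Made-up 48-bit node id standing in for a MAC address (assumption).
FAKE_NODE = 0x001122334455

u1 = uuid.uuid1(node=FAKE_NODE)  # time-based, embeds node and timestamp
u4 = uuid.uuid4()                # random, carries no such information

# The node id can be recovered from the version-1 identifier.
recovered_node = u1.node

# u1.time counts 100-nanosecond intervals since 1582-10-15 (RFC 4122).
GREGORIAN_EPOCH = datetime(1582, 10, 15)
created_at = GREGORIAN_EPOCH + timedelta(microseconds=u1.time // 10)

print(u1.version, hex(recovered_node), created_at)
print(u4.version)
```

Neither kind of identifier says anything about the object it was 
assigned to; the most one learns is where and when the id was minted.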

> If the key itself has information then you will inevitably run into a 
> situation where the key will need to be changed because something about 
> the information represented by the key value has changed or is in doubt 
> or is a matter of interpretation, (thus losing the temporal uniqueness 
> of the GUID).  

Again, in that case the information used as the key isn't really 
"identifying" information, and you have a problem anyway.

There is a very interesting article that people may want to read 
concerning properties of things and classification, including identity 
and unity, that may be relevant to what Taxon is trying to accomplish 
with concepts.

The paper can be found here, and was published in the Communications of 
the ACM in 2002.  There are longer, more detailed versions available, 
but this is a good primer.

http://www.loa-cnr.it/Papers/CACM2002.pdf

> If for example, we decide to embed version numbers within 
> the GUID, then there will be relationships between GUIDs that need to be 
> maintained and respected and modeled as a consequence of the version 
> numbers themselves (sort of an embedded data model within the ID), which 
> adds another layer of abstraction to the whole enterprise of managing 
> concepts.  Instead of just worrying about mapping the taxonomic 
> relationships among concepts using unique IDs as the handles, such as in 
> the recent examples, one now has to verify that the subkey/version 
> identifiers are accurate (and that may be a matter of differing 
> interpretations) and related in the appropriate way that corresponds to 
> the taxonomy.
>  
> I would recommend that versioning be handled outside of the key or ID. 
> Let resolver services deal with version differences based on the 
> metadata, don't hard code relationships among concept versions in the 
> identifier.
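
For what it's worth, the recommendation above (opaque ids, with version 
relationships kept in metadata and handled by a resolver) might look 
roughly like the sketch below.  Everything in it is hypothetical: the 
Resolver class, the ConceptRecord fields, and the "supersedes" link are 
my own names for illustration, not an existing SEEK service or schema.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ConceptRecord:
    # Hypothetical metadata stored beside the id, never inside it.
    guid: str
    name: str
    version: int
    supersedes: Optional[str] = None  # guid of the previous version


class Resolver:
    """Toy in-memory resolver mapping opaque ids to metadata."""

    def __init__(self) -> None:
        self._records: Dict[str, ConceptRecord] = {}

    def register(self, name: str, version: int,
                 supersedes: Optional[str] = None) -> str:
        guid = str(uuid.uuid4())  # opaque: no version encoded in the id
        self._records[guid] = ConceptRecord(guid, name, version, supersedes)
        return guid

    def resolve(self, guid: str) -> ConceptRecord:
        return self._records[guid]

    def history(self, guid: str) -> List[ConceptRecord]:
        # Walk the supersedes links from newest to oldest.
        chain: List[ConceptRecord] = []
        current: Optional[str] = guid
        while current is not None:
            record = self._records[current]
            chain.append(record)
            current = record.supersedes
        return chain


resolver = Resolver()
v1 = resolver.register("Quercus alba", version=1)
v2 = resolver.register("Quercus alba", version=2, supersedes=v1)
versions = [r.version for r in resolver.history(v2)]
print(versions)  # [2, 1]
```

If a version relationship turns out to be wrong or disputed, only the 
metadata record changes; the two ids stay exactly as they were.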

> 
> _____________________________
> James H. Beach
> Biodiversity Research Center
> University of Kansas
> 1345 Jayhawk Boulevard
> Lawrence, KS 66045, USA
> T 785 864-4645, F 785 864-5335