[SEEK-Taxon] Thoughts on GUIDs

Richard Pyle deepreef at bishopmuseum.org
Tue May 25 13:02:13 PDT 2004


Hi Jim,

> Completely independent of the choice of identifier schemes,
> is the question Nico, Rich and Dave have been tangoing around
> -- whether the identifier should contain explicitly or
> implicitly any information about the identity or the
> relationship of a concept to something else.

That is an important issue (which I have opinions on), but it is not the
issue I was asking about.  My question was more generally about whether
there is such a notion of "versions" of the same concept (as opposed to
different concepts), where each "version" required its own GUID (whether or
not the GUID itself contains information about the relationship of a version
to its concept).  And if so, then what distinguishes a case where there are
two versions of the same concept, from a case where there are two distinct
concepts.  The fundamental question is whether there is one and only one
GUID per concept (except in cases of inadvertent duplication), or if the
system needs to accommodate potentially more than one (intentional) GUID per
concept (i.e., one unique GUID for each "version" of the same concept).

> Embedding version numbers in ID's is additional information,
> i.e. metadata, about the taxon concept that may be present
> nowhere else.  One of the strongest arguments for the
> evaluation of 'artificial' or 'surrogate' key fields in a
> database context is that the 'key' should not contain any
> implicit or explicit information about the object being
> identified, other than its identity!

We are in FULL agreement on this issue!

> If the key itself has information then you will inevitably
> run into a situation where the key will need to be changed
> because something about the information represented by the
> key value has changed or is in doubt or is a matter of
> interpretation, (thus losing the temporal uniqueness of the
> GUID).

Yes - EXACTLY!  I do understand the value of preserving *some* information
within the content of the GUID string (e.g., the server domain that issued
the number).  But in my opinion, the GUID should not attempt to include
metadata/information about the concept itself (only metadata about the
GUID -- like where it was issued).

> If for example, we decide to embed version numbers within
> the GUID, then there will be relationships between GUIDs
> that need to be maintained and respected and modeled as a
> consequence of the version numbers themselves (sort of an
> embedded data model within the ID), which adds another
> layer of abstraction to the whole enterprise of managing
> concepts.  Instead of just worrying about mapping the
> taxonomic relationships among concepts using unique IDs as
> the handles, such as in the recent examples, one now has to
> verify that the subkey/version identifiers are accurate
> (and that may be a matter of differing interpretations)
> and related in the appropriate way that corresponds to
> the taxonomy.

Again, we seem to be in full agreement on this.

> I would recommend that versioning be handled outside of the
> key or ID. Let resolver services deal with version differences
> based on the metadata, don't hard code relationships among
> concept versions in the identifier.

Yes, exactly -- this is one of the points I was originally trying to make.
But more fundamentally, I wanted to first understand what a "version" is,
and how it differed from a case where you would simply identify two distinct
concepts (and then secondarily map their congruencies).  The
important/relevant question is what does a GUID represent?  It makes the
most sense to me that the GUID represents one concept, and not potentially
multiple versions of one concept. But my position on this may be based on a
flawed understanding of what a "version" is, and how it differs from a case
of two distinct concepts (hence my refined questions in later posts).

Examples seem to be helpful for communication for these sorts of
discussions, so I'll go back to Dave's example GUIDs:

urn:lsid:taxaserver.org:3232:1
urn:lsid:taxaserver.org:3232:2

These constitute two distinct GUIDs, pertaining to one concept.  The concept
ID is 3232, within the context of taxaserver.org's LSID series. In this
case, two GUIDs have been assigned to two different versions of the same
concept.

There seem to me to be two kinds of metadata/information embedded within the
GUIDs themselves.  First, there is metadata about the GUID:  it
self-identifies as an LSID, and it states that it was issued by
taxaserver.org.  I see no real harm in embedding this sort of information
within the GUID.
Second, there is, as Jim described, information about the relationship
between a version of a concept, and a concept. In other words, the GUID
refers to two discrete entities: the concept (3232), and the version (1 or
2), and the implied relationship between them.  Therefore, there is no
single GUID for the "concept".
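
As a rough illustration of those two kinds of embedded information, here is
a minimal Python sketch that pulls the example strings apart.  The names
split_guid and GuidParts are hypothetical, and reading the last two fields
as "concept" and "version" simply follows the interpretation used in this
message; it is not taken from any formal LSID parsing rule.

```python
from typing import NamedTuple


class GuidParts(NamedTuple):
    scheme: str   # metadata about the GUID: which identifier scheme it uses
    issuer: str   # metadata about the GUID: who issued it
    concept: str  # information about the concept (as read in this thread)
    version: str  # information about the version of that concept


def split_guid(guid: str) -> GuidParts:
    """Split a string shaped like urn:<scheme>:<issuer>:<concept>:<version>."""
    fields = guid.split(":")
    if len(fields) != 5 or fields[0] != "urn":
        raise ValueError(f"unexpected identifier shape: {guid!r}")
    _, scheme, issuer, concept, version = fields
    return GuidParts(scheme, issuer, concept, version)


a = split_guid("urn:lsid:taxaserver.org:3232:1")
b = split_guid("urn:lsid:taxaserver.org:3232:2")

# Two distinct GUIDs that nonetheless claim to be about the same concept.
same_concept = a.concept == b.concept
distinct_guids = a != b
```

Nothing in either string stands for concept 3232 by itself; the concept is
only reachable by reading inside the identifiers.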

My concern about the distinction between different versions of the same
concept, vs. different concepts, is that if there is *any* subjectivity at
all in making that distinction, you may potentially be tempted to interpret
it a different way later on, so that you instead have:

urn:lsid:taxaserver.org:3232:1
urn:lsid:taxaserver.org:3233:1

This would require a change in GUID (and consequent need for propagation of
that change), which, as Jim states, is one of the main things you're trying
to avoid when establishing a surrogate key.
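
The propagation cost can be sketched in a few lines of Python.  Everything
here is made up for illustration: the citing records, the opaque keys
"k-0001" and "k-0002", and the relationship labels "version_of" and
"distinct_concept".  It is only meant to contrast a key that embeds the
version with a key that carries no information about the concept.

```python
# Hypothetical downstream records that cite a concept by its identifier.
citations = {
    "specimen-A": "urn:lsid:taxaserver.org:3232:2",
    "checklist-B": "urn:lsid:taxaserver.org:3232:2",
    "checklist-C": "urn:lsid:taxaserver.org:3232:1",
}


def stale_after_rename(cited, old_guid):
    """Return the records that still point at an identifier that was replaced."""
    return sorted(rec for rec, guid in cited.items() if guid == old_guid)


# Scheme 1: version embedded in the key.  Deciding that "version 2 of 3232"
# is really a separate concept (3233) retires one key and issues another,
# so every record citing the old key has to be found and rewritten.
stale = stale_after_rename(citations, "urn:lsid:taxaserver.org:3232:2")

# Scheme 2: opaque keys, with the relationship stored as metadata beside them.
opaque_citations = {
    "specimen-A": "k-0002",
    "checklist-B": "k-0002",
    "checklist-C": "k-0001",
}
relationship = {("k-0002", "k-0001"): "version_of"}

before = dict(opaque_citations)
# The same change of opinion edits one metadata entry and no keys.
relationship[("k-0002", "k-0001")] = "distinct_concept"
changed = [rec for rec in opaque_citations if opaque_citations[rec] != before[rec]]

print(stale)
print(changed)
```

In the first scheme two citing records go stale; in the second, none do,
because the reinterpretation never touches the keys.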

Even if there would never be any ambiguity about whether two records should
be treated as different versions of the same concept, or two separate
concepts, I still feel uneasy about extending the meaning of the GUID to
include concept versions, rather than simply representing distinct concepts
(1:1 Concept:GUID).

So really there are (at least) two subtly different, but I think
fundamentally important, questions here:  What kinds of information, if any,
should be embedded within the GUID string itself?  And is the minimal unit
of a GUID a concept, or a version of a concept?

If I hadn't thoroughly confused the issue before, certainly I have done so
now!

Aloha,
Rich

=======================================================
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
