[SEEK-Taxon] Thoughts on GUIDs

Nico M. Franz franz at nceas.ucsb.edu
Tue May 25 16:20:26 PDT 2004


Hi Rich:

    very good stuff as always. Keep up the examples, please. I'm currently 
working (actually writing a manuscript!) on concept relations, so I'm 
trying to focus on that. However, I do have a general comment. Whoever among
us takes the lead on this "potentially (mis)informative vs. uninformative
keys" issue should take half a day or so to do some literature research.
At least for me, I can say that there are many aspects to SEEK that go 
beyond my training. Eventually I may or may not realize that they go beyond 
anyone's training. After Edinburgh, I'm pretty much convinced that's the 
case for concept relations.

    In the case of this particular non-/natural decision - I don't know. It
would surprise me (though that's obviously just happened with concept
relations) if nobody had previously struggled with this question and come
up with a (however tentative) maxim. I'm sure Dave already knows a whole
lot about this. I always find it neat to say "in adopting 'our' view, we 
followed so-and-so who successfully did this and that in a similar 
situation." Or if not, still cite that project and argue why its solution 
doesn't apply here.

    In short, our voices and convictions should only sound the loudest if
we know a fair amount about those of others. Some things we clearly learn by
doing; for others we can build on past successes. I have no idea where we stand
in this particular case, but *someone* in SEEK Taxon should know in my 
view. Why did ITIS use numbers? Etc.

Cheers,

Nico

At 12:51 PM 5/25/2004 -1000, Richard Pyle wrote:

>Hi Shawn,
>
>I think that Jim's point was that the key value itself should not embed
>information.
>
>For example, consider my earlier reference to Paracentropyge SEC Pyle, 2003.
>To create a truly unique "natural" key for this instance, you'd have to
>unambiguously resolve "Paracentropyge" from all possible homonyms, which
>means unambiguously indicating sufficient detail about the reference in
>which it was described (including page number, because there are cases where
>homonyms are described within the same reference); AND you'd have to
>unambiguously resolve "Pyle, 2003".  I'm not sure what the most reliable
>natural key for a published reference would be, but it would probably have
>to minimally include the title of the reference (a single author or set of
>authors may very well publish within the same year more than one article
>with overlapping page numbers).  It's hard to use citation details, because
>these vary so much depending on the nature of the reference (book, journal
>article, newspaper article, etc.). So, the bottom line is that it's
>cumbersome (if not nearly impossible) to come up with a natural key for
>uniquely indicating a reference, and the natural unique identifier for a
>taxonomic concept would almost certainly have to include within it the
>unique identifiers for at least two references (one to give the namestring
>context, and the other to anchor the namestring to a particular
>concept-usage).
>
>So, at its worst, an information-bearing GUID for a concept would look
>something like:
>
>"Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanth
>idae:69SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_gen
>era_and_subgenera_of_Pomacanthidae"
>
>And even that probably won't quite do it (uniquely) in all cases.  But more
>significantly, the probability that one of those characters is initially
>entered incorrectly when the GUID is created is very high, e.g.:
>
>"Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanth
>idae:68SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_gen
>era_and_subgenera_of_Pomacanthidae"
>
>If local databases are all using this sort of information-bearing GUID to
>identify their concepts and to map their concepts to the SEEK
>infrastructure, then on the discovery of a minor error (in this case, that
>Paracentropyge was described on page 69 of Burgess 1991, rather than on page
>68) we are faced with a dilemma:  maintain the GUID as permanent and
>unchanging, with its error intact (in which case, why have
>information-bearing keys in the first place, if the information cannot be
>trusted as accurate); or correct the error in the GUID (and deal with having
>to propagate the correction to all instances of the GUID on all local
>databases). Of course, there is a third option, which is where I think the
>"version" thing comes in, which is to maintain BOTH GUIDs, and have a
>secondary structure that maintains the fact that one is the more correct
>version of the other.
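
A minimal Python sketch of the dilemma described above. The natural_key
helper and its layout are assumptions made for illustration, modelled on
the example strings in the message; nothing here is an agreed SEEK format.

```python
def natural_key(genus, name_ref, page, concept_ref):
    """Build an information-bearing key (hypothetical layout)."""
    return f"{genus}-{name_ref}:{page}SEC{concept_ref}"


BURGESS = "BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanthidae"
PYLE = ("PyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_"
        "of_the_genera_and_subgenera_of_Pomacanthidae")

# Key as first issued, with the page number entered incorrectly.
issued = natural_key("Paracentropyge", BURGESS, 68, PYLE)
# Key rebuilt after the page number has been corrected.
corrected = natural_key("Paracentropyge", BURGESS, 69, PYLE)

# A local database that stored the key as it was first issued.
local_db = {issued: "local record 17"}

print(issued == corrected)      # the two keys are different strings
print(local_db.get(corrected))  # the corrected key no longer matches
```

Either the stored key keeps its error, or every local copy has to be
rewritten, which is the choice the paragraph above describes.
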
>
>The third option seems attractive, but requires a central authority to
>maintain the version equivalencies.  Once you have to commit to a central
>authority, why not capitalize on it maximally?  For instance, the central
>authority could establish a GUID system for references, such that it
>maintains the equivalencies of 1234 is the surrogate key for the Reference
>cited as "Burgess, W.E. 1991. Two new genera of angelfishes, family
>Pomacanthidae. etc."; and 5678 is the surrogate key for the Reference cited
>as "Pyle, R.L. 2003. Chapter 2. Revision and phylogenetic analysis of the
>genera and subgenera of Pomacanthidae. etc."
>
>Then we could collapse our concept GUIDs to something like:
>
>"Paracentropyge[taxaserver.org:Ref/1234]:68_SEC_[taxaserver.org:Ref/5678]"
>
>But even this is subject to potential future change (e.g., it could be
>discovered that the Burgess reference assigned to ID#1234 was not actually
>the original description of the genus name Paracentropyge), so we're stuck
>with the same problems again.
>
>I agree with you that the Surrogate Keys are established for pragmatic
>purposes.  In that sense, having a central authority to resolve the
>meaning/currency/versioning of a concept identifier is an impediment to
>pragmatism.  However, once you cross the threshold of needing a central
>authority for ID resolution, you might as well milk it for all of its
>pragmatic worth, and establish an arbitrary surrogate key system to uniquely
>identify concepts without any attempt to embed information within the key
>(except, perhaps, for the metadata about the key itself, such as its
>issuer).  The risk of embedding information about the taxonomic concept
>within the keystring, is that errors will undoubtedly be discovered in that
>keystring -- leaving us with the problem of correction propagation or
>versioning.  Removing all taxon concept information from the keystring
>allows the arbitrary key to remain constant and unchanging.  The downside,
>of course, is that you need to involve a centralized registry that maintains
>the "most correct" version of the information that a human would look at to
>uniquely identify the concept that the GUID is intended to represent.
>But I just don't see how we'll ever move forward on fluid taxonomic data
>exchange without the establishment of such an authority.
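
A minimal Python sketch of the arbitrary-key alternative, assuming a single
central registry. The ConceptRegistry class, its method names, and the
metadata fields are all hypothetical; the point is only that a correction
changes the description held by the registry and leaves the key alone.

```python
import itertools


class ConceptRegistry:
    """Hypothetical central registry issuing opaque surrogate keys."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._records = {}

    def register(self, **metadata):
        key = next(self._next_id)  # carries no taxon-concept information
        self._records[key] = dict(metadata)
        return key

    def correct(self, key, **changes):
        # Fix the description in place; the key itself never changes.
        self._records[key].update(changes)

    def resolve(self, key):
        # Return a copy of the current "most correct" description.
        return dict(self._records[key])


registry = ConceptRegistry()
key = registry.register(name="Paracentropyge", name_ref="Burgess 1991",
                        page=68, concept_ref="Pyle 2003")

# A local database that refers to the concept only by its key.
local_db = {key: "local record 17"}

registry.correct(key, page=69)        # the page error is discovered later
print(registry.resolve(key)["page"])  # description now carries the fix
print(local_db[key])                  # local mapping is untouched
```

The cost is the one noted above: nobody can tell what the key means
without asking the registry.
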
>
>This ended up being MUCH longer than I intended it -- and it's not directed
>specifically at you, Shawn (you undoubtedly understand general surrogate key
>theory much better than I do).  But it seems to me that the goal of SEEK is
>to get things moving forward, which means the implementation of pragmatic
>steps. It's pretty clear that natural (purely non-arbitrary
>information-bearing) identifiers for taxonomic names and concepts are not at
>all practical, nor will they be any time soon (owing mostly to issues of
>homonymy, and the complexity of establishing natural unique identifiers for
>references). So if some sort of resolver is necessary, it seems to me that
>pragmatism is maximized by excluding any taxon-concept information within
>the GUID string.
>
>Clearly, it's time now for me to shut up.
>
>Aloha,
>Rich
>
> > -----Original Message-----
> > From: seek-taxon-admin at ecoinformatics.org
> > [mailto:seek-taxon-admin at ecoinformatics.org]On Behalf Of Shawn Bowers
> > Sent: Tuesday, May 25, 2004 9:39 AM
> > To: Beach, James H
> > Cc: SEEK Taxon
> > Subject: Re: [SEEK-Taxon] Thoughts on GUIDs
> >
> >
> >
> > Beach, James H wrote:
> >
> > > One of the strongest arguments for the evaluation of 'artificial' or
> > > 'surrogate' key fields in a database context is that the 'key' should
> > > not contain any implicit or explicit information about the object being
> > > identified, other than its identity!
> >
> > The comment above doesn't seem quite right.  How can something have an
> > identity that is independent (i.e., a surrogate) of the identity of the
> > thing?  In other words, if the key doesn't contain any information about
> > the object being identified, it surely can't uniquely describe or
> > identify the object, right?
> >
> > In general, surrogate keys are used for purely pragmatic purposes within
> > a database management system, e.g., so that a clustered B+-tree index
> > can be constructed for the table, or to identify certain relationships
> > in an ER or OO database.  But, surrogate keys always "pass the buck" of
> > identity to something else.  For example, in OODBs there are two notions
> > of equality, where objects can be deep-equal (value-equal) or
> > shallow-equal (id-equal).
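
A small Python sketch of the two notions of equality mentioned above.
Python is used only as a stand-in for an object database, and the Concept
class is a hypothetical illustration: "==" compares field values, while
"is" compares object identity.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    name: str
    reference: str


a = Concept("Paracentropyge", "Pyle 2003")
b = Concept("Paracentropyge", "Pyle 2003")  # same values, separate object
c = a                                       # second handle on the same object

print(a == b)  # value-equal: every field matches
print(a is b)  # not id-equal: two distinct objects
print(a is c)  # id-equal: one object, two names
```
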
> >
> > Another problem with surrogate keys is that they are arbitrarily
> > assigned, and conceptually, don't provide any information to a user
> > about the corresponding object (other than the MAC address used to
> > construct the identifier, or the time the thing was put into the system,
> > etc.). Often, surrogate keys are "hidden" from the user, which gets back
> > to the problem of how to really identify objects.  Also, with
> > surrogates, true uniqueness is always in question.  Hence arguments for
> > "globally" unique identifiers versus "universally" unique identifiers,
> > and so on.
> >
> > > If the key itself has information then you will inevitably run into a
> > > situation where the key will need to be changed because something about
> > > the information represented by the key value has changed or is in doubt
> > > or is a matter of interpretation, (thus losing the temporal uniqueness
> > > of the GUID).
> >
> > Again, then the information used as the key isn't really "identifying"
> > information, and you have a problem anyway.
> >
> > There is a very interesting article that people may want to read
> > concerning properties of things and classification, including identity
> > and unity, that may be relevant to what taxon is trying to accomplish
> > with concepts.
> >
> > The paper can be found here, and was published in the Communications of
> > the ACM in 2002.  There are longer, more detailed versions available,
> > but this is a good primer.
> >
> > http://www.loa-cnr.it/Papers/CACM2002.pdf
> >
> >
> > > If for example, we decide to embed version numbers within
> > > the GUID, then there will be relationships between GUIDs that need to be
> > > maintained and respected and modeled as a consequence of the version
> > > numbers themselves (sort of an embedded data model within the ID), which
> > > adds another layer of abstraction to the whole enterprise of managing
> > > concepts.  Instead of just worrying about mapping the taxonomic
> > > relationships among concepts using unique IDs as the handles, such as in
> > > the recent examples, one now has to verify that the subkey/version
> > > identifiers are accurate (and that may be a matter of differing
> > > interpretations) and related in the appropriate way that corresponds to
> > > the taxonomy.
> > >
> > > I would recommend that versioning be handled outside of the key or ID.
> > > Let resolver services deal with version differences based on the
> > > metadata, don't hard code relationships among concept versions in the
> > > identifier.
> >
> >
> >
> > >
> > > _____________________________
> > > James H. Beach
> > > Biodiversity Research Center
> > > University of Kansas
> > > 1345 Jayhawk Boulevard
> > > Lawrence, KS 66045, USA
> > > T 785 864-4645, F 785 864-5335
> > >
> > >
> > >
> >
> > _______________________________________________
> > seek-taxon mailing list
> > seek-taxon at ecoinformatics.org
> > http://www.ecoinformatics.org/mailman/listinfo/seek-taxon
