[SEEK-Taxon] Thoughts on GUIDs

Robert A. Morris ram at cs.umb.edu
Wed May 26 08:04:23 PDT 2004


dave thau wrote:

>>In the specific case of concepts, and the single relation
>>S=isTheSameConceptAs, one may say that the sole requirement of a guid is
>>that
>>     c1 S c2 <==> guid(c1) = guid(c2)
> 
> 
> So the question is what is "S" or... if I'm getting what you're saying,
> what is guid().  In the case of the taxonomic exchange schema, the guid()
> function, or the S relation, amounts to which elements can be changed, and
> to what degree, before the guid() function generates a new result.  Right?
> 

guid() is the function that, given a concept, returns its guid. Whether 
the guid comes from the concept instance document or from the algorithm 
that assigns it ought to be immaterial.

S here was meant to be the relation "isTheSameConceptAs". So in natural 
language, that expression would be the requirement that

    c1 is the same concept as c2 if, and only if, their guids are identical
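
To make that requirement concrete, here is a minimal sketch in Python. 
Everything in it is an assumption for illustration only: concepts are 
modeled as hypothetical (name, reference) pairs, tuple equality stands in 
for S, and the two guid functions are made-up examples, not anything 
from the SEEK schema.

```python
from itertools import combinations, count


def satisfies_guid_requirement(concepts, same_concept, guid):
    """Check c1 S c2 <==> guid(c1) == guid(c2) for every pair of concepts."""
    return all(
        same_concept(c1, c2) == (guid(c1) == guid(c2))
        for c1, c2 in combinations(concepts, 2)
    )


# Hypothetical concept records, modeled as (name, reference) pairs.
records = [
    ("Paracentropyge", "Pyle 2003"),
    ("Paracentropyge", "Pyle 2003"),     # second record of the same concept
    ("Paracentropyge", "Burgess 1991"),  # same name, different concept
]


def same_concept(a, b):
    """Stand-in for S; a real S would come from the ontology."""
    return a == b


_registry = {}
_next_id = count(1)


def registry_guid(concept):
    """Opaque guid: one arbitrary id per distinct concept."""
    if concept not in _registry:
        _registry[concept] = f"guid-{next(_next_id)}"
    return _registry[concept]


def constant_guid(concept):
    """Deliberately bad guid: every concept gets the same value."""
    return "guid-0"


print(satisfies_guid_requirement(records, same_concept, registry_guid))
print(satisfies_guid_requirement(records, same_concept, constant_guid))
```

The first call prints True and the second prints False. Note that the 
check never parses a guid value; it only compares guids for equality.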

Undoubtedly, the algebra of concept relations will come not from the 
concept schema, but rather from whatever ontology you impose on it. For 
example: I suppose that one might have enough in the schema (which I 
haven't studied---can you point me at the appropriate file in the CVS 
repository?) to decide from an instance document for concept c2 that it 
stands in the relation "isaTaxonomicRevisionOf" or 
"isaTaxonomicSynonymOf" to the concept c1. But it probably takes an 
external ontology to specify the transitivity of such relations (if, 
indeed, that is what the biologists mean in these cases).
In this example, that means that a guarantee of

    If c3 is a synonym of c2 and c2 is a synonym of c1, then c3 is a
    synonym of c1

is unlikely to be easily expressible in the schema, but rather only in 
the ontology. But that's where you want to reason anyway, so it seems fine.
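
As a rough illustration of what the reasoning layer would add, here is a 
small Python sketch that computes the transitive closure of a set of 
asserted pairs. The names c1, c2, c3 and the pair encoding are 
hypothetical, and this assumes the biologists really do intend synonymy 
to be transitive.

```python
def transitive_closure(pairs):
    """Return the smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    while True:
        # Compose the relation with itself: (a, b) and (b, d) give (a, d).
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new


# Asserted in (hypothetical) instance documents: c3 ~ c2 and c2 ~ c1.
asserted = {("c3", "c2"), ("c2", "c1")}
inferred = transitive_closure(asserted)

print(("c3", "c1") in asserted)  # not stated anywhere
print(("c3", "c1") in inferred)  # follows once transitivity is declared
```

The pair (c3, c1) is absent from the asserted facts and present in the 
closure, which is exactly the guarantee the schema alone cannot express.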

It might be an interesting problem how to map relations on concepts to 
relations on their proxies, but if this is not already well-studied in 
KR, I can imagine some number-theoretic ways to do it. Similar problems 
come up in encryption, where you desire the opposite: you wish relations 
on the proxies (i.e. the cyphertext) to NOT betray relations on the 
proxied, i.e. the cleartext.
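
One naive way to carry a relation R over to the proxies is to take its 
image under the proxy function p. The Python sketch below does that and 
then tests the converse implication, which only holds when p keeps 
distinct concepts apart. The concept names, the ids, and both proxy 
tables are invented for the example.

```python
def induced_relation(relation, proxy):
    """p(R): the image of R under the proxy function p.

    By construction, c1 R c2 implies p(c1) p(R) p(c2).
    """
    return {(proxy(a), proxy(b)) for (a, b) in relation}


def reflects(relation, proxy, universe):
    """True if p(c1) p(R) p(c2) implies c1 R c2 for all c1, c2."""
    p_rel = induced_relation(relation, proxy)
    return all(
        (a, b) in relation
        for a in universe
        for b in universe
        if (proxy(a), proxy(b)) in p_rel
    )


universe = ["c1", "c2", "c3"]
R = {("c2", "c1")}  # e.g. "c2 is a revision of c1"

# Hypothetical proxy functions, given as lookup tables.
injective = {"c1": "id-1", "c2": "id-2", "c3": "id-3"}.get
lossy = {"c1": "id-A", "c2": "id-B", "c3": "id-A"}.get  # c1, c3 collide

print(reflects(R, injective, universe))
print(reflects(R, lossy, universe))
```

With the injective proxy the relation on proxies says nothing the 
relation on concepts does not; with the lossy one, the proxies claim 
that c2 is related to c3, which R never asserted.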

> 
>>If the meaning of this requirement is that the function guid() is
>>"informative", so be it. But it certainly has nothing to do with parsing
>>the values of the function to extract that information.
> 
> 
> Down with version numbers!
> 
> Dave
> 
> 
> 
>>The TDWG SDD effort presently is regarding guids (actually we've
>>probably settled on LSIDs) as one of several kinds of /proxy/ objects
>>(in the sense of the proxy design pattern). Others include web services
>>and registry entries. I'm unaware of any formalizations of the
>>connections between relations on proxied objects---in this discussion
>>concepts---and relations on the proxies---in this discussion guids,
>>though it will surprise me if the Description Logic experts in SEEK
>>don't have such a formalism. But as a recovering homological algebraist,
>>I would not cast it in a database framework, which is after all, simply
>>one way to represent information. Instead I would ask questions like
>>this: if R is a relation on a set {c} of proxied objects, and p(c) the
>>proxy function, is there always an interesting relation p(R) on the
>>proxy objects such that
>>	c1 R c2 ==> p(c1) p(R) p(c2)
>>
>>[If eyes are already glazed, skip this paragraph.] If one restricts to
>>composable relations---which I guess may be interesting but too
>>restrictive---then with a few more conditions, p is a Functor in the
>>sense of category theory and for about 50 years quite a bit has been
>>known about how such things behave. FWIW, in category theory the objects
>>count for nothing: the only things interesting are the transformations
>>among them. Category theory is sort of the yoga of algebra.
>>
>>In the specific case of concepts, and the single relation
>>S=isTheSameConceptAs, one may say that the sole requirement of a guid is
>>that
>>     c1 S c2 <==> guid(c1) = guid(c2)
>>
>>If the meaning of this requirement is that the function guid() is
>>"informative", so be it. But it certainly has nothing to do with parsing
>>the values of the function to extract that information.
>>
>>Although my colleagues on the SDD committee might take me to the
>>woodshed for introducing this kind of framework, they would probably all
>>agree to something like "by their behavior, proxies should at least in
>>some measure reflect the behavior of the thing they proxy". Otherwise,
>>they are just keys. Informally, in SDD the idea behind proxies is
>>roughly, "if you can't get the object, making do with the proxy should
>>not be a show stopper." [In truth, I might get more agreement to this
>>second than the first statement. And it's still on the table whether we
>>actually accomplish this...]
>>
>>Bob Morris
>>seek-taxon lurker
>>sober for 24 years ("Frobenius Endomorphisms in Umbral Calculus,
>>    MIT Studies in Applied Math., 62, 1980, 85-92")
>>
>>
>>Nico M. Franz wrote:
>>
>>
>>>Hi Rich:
>>>
>>>   very good stuff as always. Keep up the examples, please. I'm
>>>currently working (actually writing a manuscript!) on concept relations,
>>>so I'm trying to focus on that. However, I do have a general comment.
>>>Whoever of us takes the lead on this "potentially (mis)informative vs.
>>>uninformative keys"-issue, should take half a day or so to do some
>>>literature research. At least for me, I can say that there are many
>>>aspects to SEEK that go beyond my training. Eventually I may or may not
>>>realize that they go beyond anyone's training. After Edinburgh, I'm
>>>pretty much convinced that's the case for concept relations.
>>>
>>>   In the case of this particular non-/natural decision - I don't know.
>>>It would surprise me (though that's obviously just happened with concept
>>>relations) if nobody had previously struggled with this question and
>>>came up with a (however tentative) maxim. I'm sure Dave already knows a
>>>whole lot about this. I always find it neat to say "in adopting 'our'
>>>view, we followed so-and-so who successfully did this and that in a
>>>similar situation." Or if not, still cite that project and argue why its
>>>solution doesn't apply here.
>>>
>>>   In short, our voices and convictions should only sound the loudest if
>>>we know about a fair amount of others. Some things we clearly learn by
>>>doing, others we can build on past successes. I have no idea where we
>>>stand in this particular case, but *someone* in SEEK Taxon should know
>>>in my view. Why did ITIS use numbers? Etc.
>>>
>>>Cheers,
>>>
>>>Nico
>>>
>>>At 12:51 PM 5/25/2004 -1000, Richard Pyle wrote:
>>>
>>>
>>>>Hi Shawn,
>>>>
>>>>I think that Jim's point was that the key value itself should not embed
>>>>information.
>>>>
>>>>For example, consider my earlier reference to Paracentropyge SEC Pyle,
>>>>2003. To create a truly unique "natural" key for this instance, you'd
>>>>have to unambiguously resolve "Paracentropyge" from all possible
>>>>homonyms, which means unambiguously indicating sufficient detail about
>>>>the reference in which it was described (including page number, because
>>>>there are cases where homonyms are described within the same
>>>>reference); AND you'd have to unambiguously resolve "Pyle, 2003".  I'm
>>>>not sure what the most reliable natural key for a published reference
>>>>would be, but it would probably have to minimally include the title of
>>>>the reference (a single author or set of authors may very well publish
>>>>within the same year more than one article with overlapping page
>>>>numbers).  It's hard to use citation details, because these vary so
>>>>much depending on the nature of the reference (book, journal article,
>>>>newspaper article, etc.). So, the bottom line is that it's cumbersome
>>>>(if not nearly impossible) to come up with a natural key for uniquely
>>>>indicating a reference, and the natural unique identifier for a
>>>>taxonomic concept would almost certainly have to include within it the
>>>>unique identifiers for at least two references (one to give the
>>>>namestring context, and the other to anchor the namestring to a
>>>>particular concept-usage).
>>>>
>>>>So, at its worst, an information-bearing GUID for a concept would look
>>>>something like:
>>>>
>>>>"Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanthidae:69SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_genera_and_subgenera_of_Pomacanthidae"
>>>>
>>>>And even that probably won't quite do it (uniquely) in all cases.  But
>>>>more significantly, the probability that one of those characters is
>>>>initially entered incorrectly when the GUID is created is very high,
>>>>e.g.:
>>>>
>>>>"Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanthidae:68SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_genera_and_subgenera_of_Pomacanthidae"
>>>>
>>>>If local databases are all using this sort of information-bearing GUID
>>>>to identify their concepts and to map their concepts to the SEEK
>>>>infrastructure, then on the discovery of a minor error (in this case,
>>>>that Paracentropyge was described on page 69 of Burgess 1991, rather
>>>>than on page 68) we are faced with a dilemma:  maintain the GUID as
>>>>permanent and unchanging, with its error intact (in which case, why
>>>>have information-bearing keys in the first place, if the information
>>>>cannot be trusted as accurate); or correct the error in the GUID (and
>>>>deal with having to propagate the correction to all instances of the
>>>>GUID on all local databases). Of course, there is a third option, which
>>>>is where I think the "version" thing comes in, which is to maintain
>>>>BOTH GUIDs, and have a secondary structure that maintains the fact that
>>>>one is the more correct version of the other.
>>>>
>>>>The third option seems attractive, but requires a central authority to
>>>>maintain the version equivalencies.  Once you have to commit to a
>>>>central authority, why not capitalize on it maximally?  For instance,
>>>>the central authority could establish a GUID system for references,
>>>>such that it records that 1234 is the surrogate key for the Reference
>>>>cited as "Burgess, W.E. 1991. Two new genera of angelfishes, family
>>>>Pomacanthidae. etc."; and 5678 is the surrogate key for the Reference
>>>>cited as "Pyle, R.L. 2003. Chapter 2. Revision and phylogenetic
>>>>analysis of the genera and subgenera of Pomacanthidae..etc."
>>>>
>>>>Then we could collapse our concept GUIDs to something like:
>>>>
>>>>"Paracentropyge[taxaserver.org:Ref/1234]:68_SEC_[taxaserver.org:Ref/5678]"
>>>>
>>>>
>>>>But even this is subject to potential future change (e.g., it could be
>>>>discovered that the Burgess reference assigned to ID#1234 was not
>>>>actually the original description of the genus name Paracentropyge), so
>>>>we're stuck with the same problems again.
>>>>
>>>>I agree with you that the Surrogate Keys are established for pragmatic
>>>>purposes.  In that sense, having a central authority to resolve the
>>>>meaning/currency/versioning of a concept identifier is an impediment to
>>>>pragmatism.  However, once you cross the threshold of needing a central
>>>>authority for ID resolution, you might as well milk it for all of its
>>>>pragmatic worth, and establish an arbitrary surrogate key system to
>>>>uniquely identify concepts without any attempt to embed information
>>>>within the key (except, perhaps, for the metadata about the key itself,
>>>>such as its issuer).  The risk of embedding information about the
>>>>taxonomic concept within the keystring is that errors will undoubtedly
>>>>be discovered in that keystring -- leaving us with the problem of
>>>>correction propagation or versioning.  Removing all taxon concept
>>>>information from the keystring allows the arbitrary key to remain
>>>>constant and unchanging.  The downside, of course, is that you need to
>>>>involve a centralized registry that maintains the "most correct"
>>>>version of the information that a human would look at to uniquely
>>>>identify the concept that the GUID is intended to represent. But I just
>>>>don't see how we'll ever move forward on fluid taxonomic data exchange
>>>>without the establishment of such an authority.
>>>>
>>>>This ended up being MUCH longer than I intended it -- and it's not
>>>>directed specifically at you, Shawn (you undoubtedly understand general
>>>>surrogate key theory much better than I do).  But it seems to me that
>>>>the goal of SEEK is to get things moving forward, which means the
>>>>implementation of pragmatic steps. It's pretty clear that natural
>>>>(purely non-arbitrary information-bearing) identifiers for taxonomic
>>>>names and concepts are not at all practical, nor will they be any time
>>>>soon (owing mostly to issues of homonymy, and the complexity of
>>>>establishing natural unique identifiers for references). So if some
>>>>sort of resolver is necessary, it seems to me that pragmatism is
>>>>maximized by excluding any taxon-concept information from the GUID
>>>>string.
>>>>
>>>>Clearly, it's time now for me to shut up.
>>>>
>>>>Aloha,
>>>>Rich
>>>>
>>>>
>>>>>-----Original Message-----
>>>>>From: seek-taxon-admin at ecoinformatics.org
>>>>>[mailto:seek-taxon-admin at ecoinformatics.org]On Behalf Of Shawn Bowers
>>>>>Sent: Tuesday, May 25, 2004 9:39 AM
>>>>>To: Beach, James H
>>>>>Cc: SEEK Taxon
>>>>>Subject: Re: [SEEK-Taxon] Thoughts on GUIDs
>>>>>
>>>>>
>>>>>
>>>>>Beach, James H wrote:
>>>>>
>>>>>
>>>>>>One of the strongest arguments for the evaluation of 'artificial' or
>>>>>>'surrogate' key fields in a database context is that the 'key' should
>>>>>>not contain any implicit or explicit information about the object being
>>>>>>identified, other than its identity!
>>>>>
>>>>>The comment above doesn't seem quite right.  How can something have an
>>>>>identity that is independent (i.e., a surrogate) of the identity of the
>>>>>thing?  In other words, if the key doesn't contain any information about
>>>>>the object being identified, it surely can't uniquely describe or
>>>>>identify the object, right?
>>>>>
>>>>>In general, surrogate keys are used for purely pragmatic purposes within
>>>>>a database management system, e.g., so that a clustered B+-tree index
>>>>>can be constructed for the table, or to identify certain relationships
>>>>>in an ER or OO database.  But, surrogate keys always "pass the buck" of
>>>>>identity to something else.  For example, in OODBs there are two notions
>>>>>of equality, where objects can be deep-equal (value-equal) or
>>>>>shallow-equal (id-equal).
>>>>>
>>>>>Another problem with surrogate keys is that they are arbitrarily
>>>>>assigned, and conceptually, don't provide any information to a user
>>>>>about the corresponding object (other than the MAC address used to
>>>>>construct the identifier, or the time the thing was put into the system,
>>>>>etc.). Often, surrogate keys are "hidden" from the user, which gets back
>>>>>to the problem of how to really identify objects.  Also, with
>>>>>surrogates, true uniqueness is always in question.  Hence arguments for
>>>>>"globally" unique identifiers versus "universally" unique identifiers,
>>>>>and so on.
>>>>>
>>>>>
>>>>>>If the key itself has information then you will inevitably run into a
>>>>>>situation where the key will need to be changed because something about
>>>>>>the information represented by the key value has changed or is in doubt
>>>>>>or is a matter of interpretation, (thus losing the temporal uniqueness
>>>>>>of the GUID).
>>>>>
>>>>>Again, then the information used as the key isn't really "identifying"
>>>>>information, and you have a problem anyway.
>>>>>
>>>>>There is a very interesting article that people may want to read
>>>>>concerning properties of things and classification, including identity
>>>>>and unity, that may be relevant to what taxon is trying to accomplish
>>>>>with concepts.
>>>>>
>>>>>The paper can be found here, and was published in the Communications of
>>>>>the ACM in 2002.  There are longer, more detailed versions available,
>>>>>but this is a good primer.
>>>>>
>>>>>http://www.loa-cnr.it/Papers/CACM2002.pdf
>>>>>
>>>>>
>>>>>
>>>>>>If for example, we decide to embed version numbers within
>>>>>>the GUID, then there will be relationships between GUIDs that need to be
>>>>>>maintained and respected and modeled as a consequence of the version
>>>>>>numbers themselves (sort of an embedded data model within the ID), which
>>>>>>adds another layer of abstraction to the whole enterprise of managing
>>>>>>concepts.  Instead of just worrying about mapping the taxonomic
>>>>>>relationships among concepts using unique IDs as the handles, such as in
>>>>>>the recent examples, one now has to verify that the subkey/version
>>>>>>identifiers are accurate (and that may be a matter of differing
>>>>>>interpretations) and related in the appropriate way that corresponds to
>>>>>>the taxonomy.
>>>>>>
>>>>>>I would recommend that versioning be handled outside of the key or ID.
>>>>>>Let resolver services deal with version differences based on the
>>>>>>metadata, don't hard code relationships among concept versions in the
>>>>>>identifier.
>>>>>
>>>>>
>>>>>
>>>>>>_____________________________
>>>>>>James H. Beach
>>>>>>Biodiversity Research Center
>>>>>>University of Kansas
>>>>>>1345 Jayhawk Boulevard
>>>>>>Lawrence, KS 66045, USA
>>>>>>T 785 864-4645, F 785 864-5335
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>_______________________________________________
>>>>>seek-taxon mailing list
>>>>>seek-taxon at ecoinformatics.org
>>>>>http://www.ecoinformatics.org/mailman/listinfo/seek-taxon
>>>>
>>>>
>>>
>>>
>>
>>--
>>Robert A. Morris
>>Professor of Computer Science
>>UMASS-Boston
>>http://www.cs.umb.edu/~ram
>>phone (+1)617 287 6466
>>
>>
> 
> 

-- 
Robert A. Morris, Professor of Computer Science
University of Massachusetts at Boston
100 Morrissey Blvd; Boston, MA 02125
http://www.cs.umb.edu/~ram http://www.cs.umb.edu/efg
phone: (+1)617-287-6466 fax:   (+1)617-287-6433



