[SEEK-Taxon] Thoughts on GUIDs

dave thau thau at learningsite.com
Tue May 25 21:58:37 PDT 2004


> In the specific case of concepts, and the single relation
> S=isTheSameConceptAs, one may say that the sole requirement of a guid is
> that
>      c1 S c2 <==> guid(c1) = guid(c2)

So the question is what is "S" or... if I'm getting what you're saying,
what is guid().  In the case of the taxonomic exchange schema, the guid()
function, or the S relation, amounts to which elements can be changed, and
to what degree, before the guid() function generates a new result.  Right?
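
A minimal Python sketch of that reading, under stated assumptions: the `Concept` fields, the choice of which fields count toward identity (`identity_view`), and the `GuidRegistry` class are all hypothetical and not part of any SEEK schema. It only illustrates that deciding what S means is the same decision as deciding which changes make guid() return a new value.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str          # namestring, e.g. "Paracentropyge"
    according_to: str  # the "SEC" reference
    remarks: str = ""  # hypothetical field assumed NOT to affect identity


def identity_view(c):
    """Assumed definition of S: the fields whose change yields a new concept."""
    return (c.name, c.according_to)


def same_concept(c1, c2):
    """The relation S = isTheSameConceptAs, under the assumption above."""
    return identity_view(c1) == identity_view(c2)


class GuidRegistry:
    """Issues opaque identifiers so that c1 S c2 <==> guid(c1) == guid(c2)."""

    def __init__(self):
        self._issued = {}

    def guid(self, c):
        key = identity_view(c)
        if key not in self._issued:
            # The value itself carries no information about the concept.
            self._issued[key] = uuid.uuid4().hex
        return self._issued[key]
```

Editing `remarks` leaves the guid alone, while changing `according_to` produces a new one; moving a field in or out of `identity_view` is exactly the "which elements, and to what degree" question.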

> If the meaning of this requirement is that the function guid() is
> "informative", so be it. But it certainly has nothing to do with parsing
> the values of the function to extract that information.

Down with version numbers!

Dave


> The TDWG SDD effort presently is regarding guids (actually we've
> probably settled on LSID's) as one of several kinds of /proxy/ objects
> (in the sense of the proxy design pattern). Others include web services
> and registry entries. I'm unaware of any formalizations of the
> connections between relations on proxied objects---in this discussion
> concepts---and relations on the proxies---in this discussion guids,
> though it will surprise me if the Description Logic experts in SEEK
> don't have such a formalism. But as a recovering homological algebraist,
> I would not cast it in a database framework, which is after all, simply
> one way to represent information. Instead I would ask questions like
> this: if R is a relation on a set {c} of proxied objects, and p(c) the
> proxy function, is there always an interesting relation p(R) on the
> proxy objects such that
> 	c1 R c2 ==> p(c1) p(R) p(c2)
>
> [If eyes are already glazed, skip this paragraph.] If one restricts to
> composable relations---which I guess may be interesting but too
> restrictive---then with a few more conditions, p is a Functor in the
> sense of category theory and for about 50 years quite a bit has been
> known about how such things behave. FWIW, in category theory the objects count
> for nothing: the only things interesting are the transformations among
> them. Category theory is sort of the yoga of algebra.
>
> In the specific case of concepts, and the single relation
> S=isTheSameConceptAs, one may say that the sole requirement of a guid is
> that
>      c1 S c2 <==> guid(c1) = guid(c2)
>
> If the meaning of this requirement is that the function guid() is
> "informative", so be it. But it certainly has nothing to do with parsing
> the values of the function to extract that information.
>
> Although my colleagues on the SDD committee might take me to the
> woodshed for introducing this kind of framework, they would probably all
> agree to something like "by their behavior, proxies should at least in
> some measure reflect the behavior of the thing they proxy". Otherwise,
> they are just keys. Informally, in SDD the idea behind proxies is
> roughly, "if you can't get the object, making do with the proxy should
> not be a show stopper." [In truth, I might get more agreement to this
> second than the first statement. And it's still on the table whether we
> actually accomplish this...]
>
> Bob Morris
> seek-taxon lurker
> sober for 24 years ("Frobenius Endomorphisms in Umbral Calculus,
>     MIT Studies in Applied Math., 62, 1980, 85-92")
>
>
> Nico M. Franz wrote:
>
>> Hi Rich:
>>
>>    very good stuff as always. Keep up the examples, please. I'm
>> currently working (actually writing a manuscript!) on concept relations,
>> so I'm trying to focus on that. However, I do have a general comment.
>> Whoever of us takes the lead on this "potentially (mis)informative vs.
>> uninformative keys"-issue, should take half a day or so to do some
>> literature research. At least for me, I can say that there are many
>> aspects to SEEK that go beyond my training. Eventually I may or may not
>> realize that they go beyond anyone's training. After Edinburgh, I'm
>> pretty much convinced that's the case for concept relations.
>>
>>    In the case of this particular non-/natural decision - I don't know.
>> It would surprise me (though that's obviously just happened with concept
>> relations) if nobody had previously struggled with this question and
>> came up with a (however tentative) maxim. I'm sure Dave already knows a
>> whole lot about this. I always find it neat to say "in adopting 'our'
>> view, we followed so-and-so who successfully did this and that in a
>> similar situation." Or if not, still cite that project and argue why its
>> solution doesn't apply here.
>>
>>    In short, our voices and convictions should only sound the loudest if
>> we know about a fair amount of others. Some things we clearly learn by
>> doing, others we can build on past successes. I have no idea where we
>> stand in this particular case, but *someone* in SEEK Taxon should know
>> in my view. Why did ITIS use numbers? Etc.
>>
>> Cheers,
>>
>> Nico
>>
>> At 12:51 PM 5/25/2004 -1000, Richard Pyle wrote:
>>
>>> Hi Shawn,
>>>
>>> I think that Jim's point was that the key value itself should not embed
>>> information.
>>>
>>> For example, consider my earlier reference to Paracentropyge SEC Pyle,
>>> 2003. To create a truly unique "natural" key for this instance, you'd
>>> have to unambiguously resolve "Paracentropyge" from all possible
>>> homonyms, which means unambiguously indicating sufficient detail about
>>> the reference in which it was described (including page number, because
>>> there are cases where homonyms are described within the same reference);
>>> AND you'd have to unambiguously resolve "Pyle, 2003".  I'm not sure what
>>> the most reliable natural key for a published reference would be, but it
>>> would probably have to minimally include the title of the reference (a
>>> single author or set of authors may very well publish within the same
>>> year more than one article with overlapping page numbers).  It's hard to
>>> use citation details, because these vary so much depending on the nature
>>> of the reference (book, journal article, newspaper article, etc.). So,
>>> the bottom line is that it's cumbersome (if not nearly impossible) to
>>> come up with a natural key for uniquely indicating a reference, and the
>>> natural unique identifier for a taxonomic concept would almost certainly
>>> have to include within it the unique identifiers for at least two
>>> references (one to give the namestring context, and the other to anchor
>>> the namestring to a particular concept-usage).
>>>
>>> So, at its worst, an information-bearing GUID for a concept would look
>>> something like:
>>>
>>> "Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanthidae:69SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_genera_and_subgenera_of_Pomacanthidae"
>>>
>>> And even that probably won't quite do it (uniquely) in all cases.  But
>>> more significantly, the probability that one of those characters is
>>> initially entered incorrectly when the GUID is created is very high,
>>> e.g.:
>>>
>>> "Paracentropyge-BurgessWE1991Two_new_genera_of_angelfishes,_family_Pomacanthidae:68SECPyleRL2003Chapter_2._Revision_and_phylogenetic_analysis_of_the_genera_and_subgenera_of_Pomacanthidae"
>>>
>>> If local databases are all using this sort of information-bearing GUID
>>> to identify their concepts and to map their concepts to the SEEK
>>> infrastructure, then the discovery of a minor error (in this case, that
>>> Paracentropyge was described on page 69 of Burgess 1991, rather than on
>>> page 68) leaves us with a dilemma:  maintain the GUID as permanent and
>>> unchanging, with its error intact (in which case, why have
>>> information-bearing keys in the first place, if the information cannot
>>> be trusted as accurate); or correct the error in the GUID (and deal with
>>> having to propagate the correction to all instances of the GUID on all
>>> local databases). Of course, there is a third option, which is where I
>>> think the "version" thing comes in, which is to maintain BOTH GUIDs, and
>>> have a secondary structure that maintains the fact that one is the more
>>> correct version of the other.
>>>
>>> The third option seems attractive, but requires a central authority to
>>> maintain the version equivalencies.  Once you have to commit to a
>>> central authority, why not capitalize on it maximally?  For instance,
>>> the central authority could establish a GUID system for references, such
>>> that it maintains the equivalencies: 1234 is the surrogate key for the
>>> Reference cited as "Burgess, W.E. 1991. Two new genera of angelfishes,
>>> family Pomacanthidae. etc."; and 5678 is the surrogate key for the
>>> Reference cited as "Pyle, R.L. 2003. Chapter 2. Revision and
>>> phylogenetic analysis of the genera and subgenera of
>>> Pomacanthidae..etc."
>>>
>>> Then we could collapse our concept GUIDs to something like:
>>>
>>> "Paracentropyge[taxaserver.org:Ref/1234]:68_SEC_[taxaserver.org:Ref/5678]"
>>>
>>>
>>> But even this is subject to potential future change (e.g., it could be
>>> discovered that the Burgess reference assigned to ID#1234 was not
>>> actually the original description of the genus name Paracentropyge), so
>>> we're stuck with the same problems again.
>>>
>>> I agree with you that the Surrogate Keys are established for pragmatic
>>> purposes.  In that sense, having a central authority to resolve the
>>> meaning/currency/versioning of a concept identifier is an impediment to
>>> pragmatism.  However, once you cross the threshold of needing a central
>>> authority for ID resolution, you might as well milk it for all of its
>>> pragmatic worth, and establish an arbitrary surrogate key system to
>>> uniquely identify concepts without any attempt to embed information
>>> within the key (except, perhaps, for the metadata about the key itself,
>>> such as its issuer).  The risk of embedding information about the
>>> taxonomic concept within the keystring is that errors will undoubtedly
>>> be discovered in that keystring -- leaving us with the problem of
>>> correction propagation or versioning.  Removing all taxon concept
>>> information from the keystring allows the arbitrary key to remain
>>> constant and unchanging.  The downside, of course, is that you need to
>>> involve a centralized registry that maintains the "most correct" version
>>> of the information that a human would look at to uniquely identify the
>>> concept that the GUID is intended to represent. But I just don't see how
>>> we'll ever move forward on fluid taxonomic data exchange without the
>>> establishment of such an authority.
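
The trade-off described above can be sketched in Python. This is only an illustration under assumptions: `natural_key`, `ConceptRegistry`, and the field names are hypothetical, and the key format is a shortened stand-in for the long strings quoted earlier. It shows that correcting a page number changes an information-bearing key, whereas an arbitrary key issued by a registry stays fixed while the metadata behind it is corrected.

```python
import itertools


def natural_key(genus, orig_ref, page, sec_ref):
    """Information-bearing key: built from the very data it identifies."""
    return f"{genus}-{orig_ref}:{page}SEC{sec_ref}"


class ConceptRegistry:
    """Hypothetical central registry issuing arbitrary integer keys."""

    def __init__(self):
        self._next = itertools.count(1)
        self._records = {}

    def register(self, **metadata):
        key = next(self._next)          # carries no concept information
        self._records[key] = dict(metadata)
        return key

    def correct(self, key, **changes):
        # Fix the metadata in one place; the key held by local
        # databases does not change.
        self._records[key].update(changes)

    def resolve(self, key):
        # Return a copy of the current "most correct" metadata.
        return dict(self._records[key])
```

With the natural key, the page 68 and page 69 variants are two different strings, so every local copy must be updated or mapped. With the registry, only the record behind the key is edited, at the cost of depending on the registry to resolve it.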
>>>
>>> This ended up being MUCH longer than I intended it -- and it's not
>>> directed specifically at you, Shawn (you undoubtedly understand general
>>> surrogate key theory much better than I do).  But it seems to me that
>>> the goal of SEEK is to get things moving forward, which means the
>>> implementation of pragmatic steps. It's pretty clear that natural
>>> (purely non-arbitrary information-bearing) identifiers for taxonomic
>>> names and concepts are not at all practical, nor will they be any time
>>> soon (owing mostly to issues of homonymy, and the complexity of
>>> establishing natural unique identifiers for references). So if some sort
>>> of resolver is necessary, it seems to me that pragmatism is maximized by
>>> excluding any taxon-concept information from the GUID string.
>>>
>>> Clearly, it's time now for me to shut up.
>>>
>>> Aloha,
>>> Rich
>>>
>>> > -----Original Message-----
>>> > From: seek-taxon-admin at ecoinformatics.org
>>> > [mailto:seek-taxon-admin at ecoinformatics.org]On Behalf Of Shawn Bowers
>>> > Sent: Tuesday, May 25, 2004 9:39 AM
>>> > To: Beach, James H
>>> > Cc: SEEK Taxon
>>> > Subject: Re: [SEEK-Taxon] Thoughts on GUIDs
>>> >
>>> >
>>> >
>>> > Beach, James H wrote:
>>> >
>>> > > One of the strongest arguments for the evaluation of 'artificial' or
>>> > > 'surrogate' key fields in a database context is that the 'key'
>>> > > should not contain any implicit or explicit information about the
>>> > > object being identified, other than its identity!
>>> >
>>> > The comment above doesn't seem quite right.  How can something have an
>>> > identity that is independent (i.e., a surrogate) of the identity of
>>> > the thing?  In other words, if the key doesn't contain any information
>>> > about the object being identified, it surely can't uniquely describe
>>> > or identify the object, right?
>>> >
>>> > In general, surrogate keys are used for purely pragmatic purposes
>>> > within a database management system, e.g., so that a clustered B+-tree
>>> > index can be constructed for the table, or to identify certain
>>> > relationships in an ER or OO database.  But, surrogate keys always
>>> > "pass the buck" of identity to something else.  For example, in OODBs
>>> > there are two notions of equality, where objects can be deep-equal
>>> > (value-equal) or shallow-equal (id-equal).
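
The two notions of equality can be illustrated in Python, as a rough analogy rather than a description of any particular OODB. The `Reference` class and its fields are assumptions made for the example: `==` on a dataclass compares field values (value-equal), while `is` compares object identity (id-equal).

```python
from dataclasses import dataclass


@dataclass
class Reference:
    author: str
    year: int
    title: str


a = Reference("Burgess, W.E.", 1991,
              "Two new genera of angelfishes, family Pomacanthidae")
b = Reference("Burgess, W.E.", 1991,
              "Two new genera of angelfishes, family Pomacanthidae")
c = a  # a second handle on the same object

value_equal_ab = (a == b)  # same field values, compared field by field
id_equal_ab = (a is b)     # two distinct objects
id_equal_ac = (a is c)     # one object, two names
```

Here `a` and `b` are value-equal but not id-equal, which is the sense in which a surrogate identifier leaves open the question of whether two records describe the same thing.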
>>> >
>>> > Another problem with surrogate keys is that they are arbitrarily
>>> > assigned, and conceptually, don't provide any information to a user
>>> > about the corresponding object (other than the MAC address used to
>>> > construct the identifier, or the time the thing was put into the
>>> > system, etc.). Often, surrogate keys are "hidden" from the user, which
>>> > gets back to the problem of how to really identify objects.  Also,
>>> > with surrogates, true uniqueness is always in question.  Hence
>>> > arguments for "globally" unique identifiers versus "universally"
>>> > unique identifiers, and so on.
>>> >
>>> > > If the key itself has information then you will inevitably run into
>>> > > a situation where the key will need to be changed because something
>>> > > about the information represented by the key value has changed or is
>>> > > in doubt or is a matter of interpretation, (thus losing the temporal
>>> > > uniqueness of the GUID).
>>> >
>>> > Again, then the information used as the key isn't really
>>> > "identifying" information, and you have a problem anyway.
>>> >
>>> > There is a very interesting article that people may want to read
>>> > concerning properties of things and classification, including identity
>>> > and unity, that may be relevant to what taxon is trying to accomplish
>>> > with concepts.
>>> >
>>> > The paper can be found here, and was published in the Communications
>>> > of the ACM in 2002.  There are longer, more detailed versions
>>> > available, but this is a good primer.
>>> >
>>> > http://www.loa-cnr.it/Papers/CACM2002.pdf
>>> >
>>> >
>>> > > If for example, we decide to embed version numbers within the GUID,
>>> > > then there will be relationships between GUIDs that need to be
>>> > > maintained and respected and modeled as a consequence of the version
>>> > > numbers themselves (sort of an embedded data model within the ID),
>>> > > which adds another layer of abstraction to the whole enterprise of
>>> > > managing concepts.  Instead of just worrying about mapping the
>>> > > taxonomic relationships among concepts using unique IDs as the
>>> > > handles, such as in the recent examples, one now has to verify that
>>> > > the subkey/version identifiers are accurate (and that may be a matter
>>> > > of differing interpretations) and related in the appropriate way that
>>> > > corresponds to the taxonomy.
>>> > >
>>> > > I would recommend that versioning be handled outside of the key or
>>> > > ID. Let resolver services deal with version differences based on the
>>> > > metadata; don't hard-code relationships among concept versions in
>>> > > the identifier.
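
One way to picture that recommendation is the Python sketch below. The `Resolver` class, its method names, and the identifier strings are hypothetical, and a real resolver service would be far richer. The point is only that the "is superseded by" relationship lives in the resolver's metadata, so identifiers stay opaque and never need a version suffix.

```python
class Resolver:
    """Hypothetical resolver: version links are metadata, not part of the ID."""

    def __init__(self):
        self._superseded_by = {}

    def record_supersession(self, old_id, new_id):
        # Neither identifier changes; only the metadata linking them does.
        self._superseded_by[old_id] = new_id

    def current(self, concept_id):
        """Follow supersession links to the most recent identifier."""
        seen = set()
        while concept_id in self._superseded_by and concept_id not in seen:
            seen.add(concept_id)  # guard against accidental cycles
            concept_id = self._superseded_by[concept_id]
        return concept_id
```

If an interpretation of which version supersedes which turns out to be wrong, the link is edited in the resolver and no identifier held by a local database has to be reissued.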
>>> >
>>> >
>>> >
>>> > >
>>> > > _____________________________
>>> > > James H. Beach
>>> > > Biodiversity Research Center
>>> > > University of Kansas
>>> > > 1345 Jayhawk Boulevard
>>> > > Lawrence, KS 66045, USA
>>> > > T 785 864-4645, F 785 864-5335
>>> > >
>>> > >
>>> > >
>>> >
>>> > _______________________________________________
>>> > seek-taxon mailing list
>>> > seek-taxon at ecoinformatics.org
>>> > http://www.ecoinformatics.org/mailman/listinfo/seek-taxon
>>>
>>>
>>
>>
>
> --
> Robert A. Morris
> Professor of Computer Science
> UMASS-Boston
> http://www.cs.umb.edu/~ram
> phone (+1)617 287 6466
>
>



