[SEEK-Taxon] identifiers for taxonomic concepts

Richard Pyle deepreef at bishopmuseum.org
Fri Nov 28 10:44:10 PST 2003


Hi Dave,

I think the solution to the user-unfriendliness of UUID's is what you allude
to in your last sentence:  hide them from the humans.  The real purpose of
the UUID's, from my perspective, is to allow computers to quickly,
efficiently, and unambiguously identify a particular concept (for the
purpose of establishing informational associations).  The human should only
need to see the more "friendly" information-bearing elements of a concept
(e.g., the taxon name as uniquely qualified by the authorship and
publication details of its original description; and the context of how it
was used to express a particular concept such as the agents and date of the
concept creators/asserters).  In other words, the UUID should have a
human-friendly face at the data presentation/manipulation layer.

But I would take it even further.  I don't think the UUID's should ever be
listed in printed form, or even seen by humans (except by the DB managers
who perform maintenance, etc.), and they should never be typed on a
keyboard. In that paradigm, it matters not how much they disrupt a human
sense of aesthetic, nor how easy it would be to mis-type them, because they
are never seen nor typed.
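
To make the "human-friendly face" idea concrete, here is a minimal sketch of a lookup layer in which people only ever read and search on labels, while the software passes UUIDs around. Everything named here (the ConceptRegistry class, its methods, and the sample label) is invented for illustration and is not part of SEEK. The choice of random (version 4) UUIDs is also just an assumption; any UUID version would serve.

```python
import uuid


class ConceptRegistry:
    """Hypothetical two-way lookup between concept UUIDs and display labels."""

    def __init__(self):
        self._label_by_id = {}
        self._id_by_label = {}

    def register(self, label):
        """Return the UUID for a label, generating one the first time it is seen."""
        if label in self._id_by_label:
            return self._id_by_label[label]
        concept_id = str(uuid.uuid4())  # assumption: random UUIDs are acceptable
        self._label_by_id[concept_id] = label
        self._id_by_label[label] = concept_id
        return concept_id

    def label_for(self, concept_id):
        """What a person sees: the label, never the UUID."""
        return self._label_by_id[concept_id]

    def id_for(self, label):
        """What the computer uses: the UUID behind a label a person picked."""
        return self._id_by_label[label]


registry = ConceptRegistry()
# The label text is a made-up example of a "friendly" rendering.
wolf_id = registry.register("Canis lupus Linnaeus 1758 (original circumscription)")
print(registry.label_for(wolf_id))
```

The same two dictionaries would also cover the kind of resolution service Dave describes below, since the mapping runs in both directions.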

That's my perspective anyway (speaking as someone who has been pushing for
taxonomic GUID/UUIDs for a while now...)

Aloha,
Rich

Richard L. Pyle
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html

> -----Original Message-----
> From: seek-taxon-admin at ecoinformatics.org
> [mailto:seek-taxon-admin at ecoinformatics.org]On Behalf Of Dave Vieglais
> Sent: Friday, November 28, 2003 6:43 AM
> To: seek-taxon at ecoinformatics.org
> Subject: Re: [SEEK-Taxon] identifiers for taxonomic concepts
>
>
> A number of people have been chatting about the concept of using UUID's
> (or at least some form of ID) for taxonomic concepts for a while now-
> Thanks for providing a good summary of why UUIDs are probably the way to
> go (at least for now).
>
> The only thing against UUID's is that they are not nice for people to
> use - it's easy to make a mistake writing them down or keying in, and I
> imagine that a publication with a bunch of UUID's for cross reference to
> concepts will look pretty ugly.  Oh, and I suspect that taxonomists will
> be concerned that UUIDs can only be generated for 2400 years or so
> before they start repeating.
>
> That said, since SEEK is a research project, it makes a lot of sense to
> go ahead and use such a system- just to see how it works out.  If UUIDs
> are too unpleasant then it would not be that hard to build into the SEEK
> architecture a UUID resolution service that took a more human friendly
> rendering of a UUID and returns the actual UUID and vice versa.
>
> Dave V.
>
> thau at learningsite.com wrote:
>
> > Hello everyone,
> >
> > I've been thinking a bit about what kind of identifiers to use for
> > representing taxonomic concepts throughout SEEK.  I have a longish note,
> > which is attached here as a text file and as a word document.
> >
> > Here's the summary:
> >
> > -----
> >
> > SEEK is storing data using grid technologies.  The draft mechanism for
> > identifying resources in the ecogrid uses URIs of the form:
> >
> > ecogrid://registered.naming.authority/local_id
> >
> > I think the local_ids for taxonomic concepts should be UUIDs -
> > semantic-free, globally unique strings generated following a specific
> > algorithm.  UUIDs look like this:
> >
> > 5c2775f0-1f59-11d8-a2da-b8a03c50a862
> >
> > Libraries for generating UUIDs already exist in most major languages.  If
> > UUIDs are too ugly, ids should at least be semantics free, all lower
> > case, and draw from the following character set if we want the flexibility
> > of using the ids for systems outside of the ecogrid:
> >
> > | "(" | ")" | "-" | "." |
> > | "_" | "!" | "*" |
> > | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> > | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> > | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> > | "y" | "z" |
> > | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> > | "8" | "9" |
> >
> > -----
> >
> > The attached document has justifications for, and expansions of, these
> > opinions, which you may want to digest on Friday, along with your
> > Thanksgiving and/or Eid leftovers, or wait until next week if you're
> > afraid of upsetting your stomach.
> >
> > Dave
> >
> >
> > ------------------------------------------------------------------------
> >
> > Identifiers in the SEEK Taxonomic Concept Repository
> >
> >
> > 0.  The Short Story
> >
> > SEEK is storing data using grid technologies.  The draft
> mechanism for identifying resources in the ecogrid uses URIs of the form:
> >
> > ecogrid://registered.naming.authority/local_id
> >
> > I think the local_ids for taxonomic concepts should be UUIDs -
> semantic-free, globally unique strings generated following a
> specific algorithm.  UUIDs look like this:
> >
> > 5c2775f0-1f59-11d8-a2da-b8a03c50a862
> >
> > Libraries for generating UUIDs already exist in most major
> languages.  If UUIDs are too ugly, ids should at least be
> semantics free, all lower case, and draw from the following
> character set if we want the flexibility of using the ids for
> systems outside of the ecogrid:
> >
> > | "(" | ")" | "-" | "." |
> > | "_" | "!" | "*" |
> > | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> > | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> > | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> > | "y" | "z" |
> > | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> > | "8" | "9" |
> >
> >
> > I.  Introduction
> >
> > There is a clear need for taxonomic concept identifiers in
> SEEK.  Taxonomic concepts are complicated data objects, and most
> components of SEEK need only store a simple identifier for a
> concept, as long as the full representation is available if
> necessary.  I'll be calling the system used to issue taxonomic
> concept identifiers, and tie them to their fuller
> representations, a taxonomic concept repository (TCR).   I'm
> going to argue here that UUIDs (as defined by the Internet
> Engineering Task Force) are a good candidate for internal
> identifiers in the TCR, but, really, almost any internal
> identifier is fine as long as it follows a small set of rules
> (listed in Appendix A).
> >
> >
> > II.  The Wonderful World of Unique Identifiers
> >
> > There are currently a number of different schemes for issuing
> unique, persistent, location independent resource locators.  Some
> biggies are:
> >
> > Life Science Identifiers (LSID, a type of URN with associated
> protocols for id resolution)
> > Digital Object Identifiers (DOI)
> > Uniform Resource Identifiers (URI)
> >
> > If location independence is relaxed, URLs and RDF IDs count as
> well - they're a specific type of URI.
> >
> > All of these schemes are comprised of a way of indicating a
> naming authority and a way of locating a data resource within
> that naming authority.  I'll call the naming authority the global
> part of the id, and the part locating the resource inside that
> authority the local part of the id.
> >
> > Let's say each naming authority assigns each taxonomic concept
> an integer.  For example, Kansas may assign the taxonomic concept
> id 3214 to "Canis lupus as described by Linnaeus in 1758 with the
> original specimen circumscription."  That number could be mapped
> onto the identifiers described above like this:
> >
> > LSID - urn:lsid:ku.edu:ecogrid:3214
> >
> > DOI - 10.1202/3214  (where 1202 is an IDF-assigned number)
> >
> > URI - ecogrid://ku.edu/3214
> >
> > URL - http://seek.ku.edu/taxonomy.owl#3214
> >
> > As you can see, all these cases have an id representing the
> item, and then one or more terms describing a naming authority.
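
The mapping above can be sketched as a small function that takes a local id and produces each of the four forms. This is only an illustration: the function name and its parameters are hypothetical, and the default values are simply the ones used in the examples above.

```python
def identifier_forms(local_id, authority="ku.edu", namespace="ecogrid",
                     doi_prefix="10.1202",
                     url_base="http://seek.ku.edu/taxonomy.owl"):
    """Render one local id in each identifier scheme discussed above.

    All parameter names and defaults are assumptions taken from the
    examples in the note, not from any SEEK interface.
    """
    local_id = str(local_id)
    return {
        "LSID": f"urn:lsid:{authority}:{namespace}:{local_id}",
        "DOI": f"{doi_prefix}/{local_id}",
        "URI": f"ecogrid://{authority}/{local_id}",
        "URL": f"{url_base}#{local_id}",
    }


for scheme, rendered in identifier_forms(3214).items():
    print(scheme, "-", rendered)
```

In every form the local id is the last component, and everything before it names the authority.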
> >
> > In all of these cases, the local id is fairly unconstrained.
> Each scheme has its own set of forbidden characters, and some are
> case sensitive while others are not.  Each scheme also allows for
> multiple naming authorities, because the full unique id includes
> the name of the naming authority.  Of the examples listed, the
> LSID has the most atomic pattern.  It has separate fields for the
> URN name space (lsid), the authority id (ku.edu), the name space
> inside that authority (ecogrid) and the id of the item.
> >
> > Because SEEK is using grid technologies for data storage, the
> URI standard makes the most sense for identifying taxonomic
> concepts in SEEK.  Inside the TCR, the local ids can be simple
> numbers or strings as long as none of the characters forbidden in
> URIs are used.
> >
> >
> > III.  Semantic Identifiers vs Opaque Strings
> >
> > So, what should these IDs look like?  One of the main decisions
> about identifiers is whether or not they should have any semantic
> value.  Identifiers with semantic value make nice addresses, and
> might help users determine the relevance of a taxon based on the
> identifier, without having to go an extra step to resolve the
> identifier to its fuller concept.  The task of determining the
> format of semantically meaningful identifiers involves
> determining the information necessary to create unique
> identifiers, deciding what information should be packed into the
> identifier, and determining the steps necessary to create the
> unique id based on the relevant information.  One possibility
> would be to include the name of the taxon and a number:
> canis_lupus_103.  This would let a user see what taxon is being
> identified, but not necessarily which version of that taxon.
> >
> > The downside of semantically informative identifiers is they
> may perpetuate the problem of people misusing taxonomic names.
> Someone might think bluebird_102 and bluebird_232 are equivalent
> because they're both bluebirds.  Only going the extra step of
> resolving the concepts will determine whether the taxa are
> equivalent or not, and a user may skip the step based on the
> content of the identifier.
> >
> > Unless there is a very good reason to include semantics, it's
> probably best to leave them out of the identifiers to avoid
> possible confusion and to simplify their generation.
> >
> >
> > IV.  Unique Within a Naming Authority vs Globally Unique
> >
> > All of the schemes described above allow for duplication of
> local ids as long as they're individuated by their naming
> authority.  For example, there could be two different concepts
> with id 3214 as long as they are stored and shared outside of the
> local databases as
> >
> > ecogrid://ku.edu/3214
> > ecogrid://nceas.edu/3214
> >
> > In the world of the grid, these handles would be registered
> somewhere as separate entities, so there's no conflict.  Outside
> the grid, however, the numbers 3214 do conflict.  If people
> outside of SEEK wanted to use the identifiers for some reason, it
> would be odd for them to have to refer to the taxonomic concepts
> as "ecogrid://ku.edu/3214."  This might be irrelevant - perhaps
> we only want SEEK to have access to these taxonomic concepts.
> But, it wouldn't be hard to create globally unique IDs which
> could be used outside of SEEK.
> >
> > A UUID is a string generated according to a specific mechanism
> which guarantees that it's unique among all other UUIDs for another
> 2400 years. See
> http://www.ietf.org/internet-drafts/draft-mealling-uuid-urn-01.txt
>  for the full specification.  UUIDs are being used in many places
> (Web Services, Bluetooth and Microsoft all use them), there are
> code libraries already written in Perl, Java, Python, C, and C++,
> and there are classes built into Globus, which is what ecogrid is
> being developed on.
> >
> > Here's a typical looking UUID: 5c2775f0-1f59-11d8-a2da-b8a03c50a862
> > (This is the id for Missouri Botanical's DiGIR service in the
> GBIF UDDI registry)
> >
> > It isn't pretty but it does have the benefits of being globally
> unique, not needing to be issued by a central authority, and it
> is valid in all the identifier types described above.
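
As a small illustration of the library support mentioned above, the sketch below uses Python's standard uuid module to parse the example UUID and to generate a new time-based one without any central authority. The variable names are arbitrary. The timestamp decoding assumes the version 1 layout, which counts 100-nanosecond intervals since 15 October 1582.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Parse the example UUID from the note.
example = uuid.UUID("5c2775f0-1f59-11d8-a2da-b8a03c50a862")
print("version:", example.version)

# Version 1 UUIDs embed a timestamp in 100-nanosecond ticks since
# the start of the Gregorian calendar (1582-10-15, UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)
created = GREGORIAN_EPOCH + timedelta(microseconds=example.time // 10)
print("embedded timestamp:", created.isoformat())

# Generate a new time-based UUID locally; no registration step is needed.
new_id = uuid.uuid1()
print("new id:", new_id)
```

Note that a version 1 UUID includes a node field that is usually derived from the generating machine's network address, so a random (version 4) UUID from uuid.uuid4() may be preferable where that matters.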
> >
> > Using UUIDs provides the extra flexibility of not needing to
> tie the naming authority to the id.  Suppose at some point we
> wanted the TCR to operate with systems outside the ecogrid.  It
> would be a neat trick, and it would simplify implementation if
> all of the below pointed to the same taxonomic concept:
> >
> > ecogrid://ku.edu/3214
> > urn:lsid:ku.edu:ecogrid:3214
> > http://seek.ku.edu/taxonomy.owl#3214
> > doi:10.1202/3214
> >
> > Except for DOI, this is straightforward for all the schemes
> without resorting to UUIDs because the naming authority is at
> more-or-less the same level of resolution in the others.  In the
> case of DOI, the number 1202 stands for the naming authority.
> For DOI to work the same as the others, each naming authority
> would have to be issued its own number.  This would restrict the
> ability to add new naming authorities by necessitating that each
> new authority register with some central agency for their own DOI
> prefix.  For example, without using UUIDs we might have this:
> >
> > ecogrid://ku.edu/3214
> > ecogrid://nceas.edu/3214
> >
> > To map that onto DOI, ku.edu and nceas.edu would need their own
> DOI prefixes.  However, if we used UUIDs, we could have one DOI
> prefix and all the local IDs could be registered to that prefix.
> >
> > If we do plan on using the unique ids as a way to work with
> LSIDs and DOIs, decoupling the naming authority from the local ID
> would make it easier to determine when two system IDs point to
> the same concept.  Without using UUIDs these two ids may or may
> not point to the same taxonomic concept:
> >
> > doi:10.1202/3214
> > ecogrid://ku.edu/3214
> >
> > The only way to know is to resolve both and compare the
> concepts.  If, however, we use UUIDs, then
> >
> > doi:10.1202/5c2775f0-1f59-11d8-a2dc-b8a03c50a862
> > ecogrid://ku.edu/5c2775f0-1f59-11d8-a2dc-b8a03c50a862
> >
> > would by definition point to the same taxonomic concept.
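
A rough sketch of that comparison: strip the naming authority from each identifier and compare the local parts, but only trust the answer when both local parts are UUIDs. The function names are hypothetical, and the splitting rule assumes local ids never contain "/", ":" or "#", which holds for the character set recommended in Appendix A.

```python
import re
import uuid


def local_part(identifier):
    """Return the text after the last '/', ':' or '#' in an identifier."""
    return re.split(r"[/:#]", identifier)[-1]


def as_uuid(text):
    """Return a uuid.UUID if text parses as one, else None."""
    try:
        return uuid.UUID(text)
    except ValueError:
        return None


def same_concept_by_id(id_a, id_b):
    """True/False when both local parts are UUIDs; None when the ids
    alone cannot settle it and both would have to be resolved."""
    a, b = as_uuid(local_part(id_a)), as_uuid(local_part(id_b))
    if a is None or b is None:
        return None
    return a == b


print(same_concept_by_id(
    "doi:10.1202/5c2775f0-1f59-11d8-a2dc-b8a03c50a862",
    "ecogrid://ku.edu/5c2775f0-1f59-11d8-a2dc-b8a03c50a862"))
print(same_concept_by_id("doi:10.1202/3214", "ecogrid://ku.edu/3214"))
```

With integer local ids the function declines to answer, which mirrors the point above: without UUIDs the only way to know is to resolve both identifiers.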
> >
> >
> > V.  Conclusion
> >
> > As stated at the beginning, any string will work as a valid
> local id as long as none of the rules of the various schemes are
> violated.  This is because URIs, and the other schemes, all
> ensure uniqueness by prepending the local ids with disambiguating
> global ids.  This practice means that the unique identifier for a
> concept must carry with it the name of some authority, which may
> be problematic.
> >
> > Decoupling the naming authority from the local ID seems like
> good practice and it may make TCR identifiers more relevant to
> systems outside of SEEK.
> >
> >
> > Appendix A.   Rules valid for URIs, DOIs and LSIDs
> >
> > In order to create identifiers which will work under all of
> these schemes the following rules should be followed.
> >
> > 1.  Some systems are case sensitive, while others are not.  To
> prevent conflicts, all alpha characters should be in the same
> case - I recommend lower case as that seems to be the W3C naming
> convention for URLs.
> >
> > 2.   The following characters are legal in all the described systems:
> >
> > | "(" | ")" | "-" | "." |
> > | "_" | "!" | "*" | "'" |
> > | "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
> > | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
> > | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
> > | "Y" | "Z"
> > | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> > | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> > | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> > | "y" | "z"
> > | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> > | "8" | "9"
> >
> >
> > 3.  Given #1, #2 and my opinion that "'" should not be used in
> identifiers because people often incorrectly use single quotes in
> HTML, here are the characters I recommend using for unique identifiers:
> >
> > | "(" | ")" | "-" | "." |
> > | "_" | "!" | "*" |
> > | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> > | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> > | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> > | "y" | "z"
> > | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> > | "8" | "9"
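
The recommended character set is easy to check mechanically. The sketch below is one possible validator. The function and constant names are made up for this example, and it tests only the character set, not whether an id is actually unique.

```python
import re

# Lower-case letters, digits, and the punctuation recommended above:
# ( ) - . _ ! *
RECOMMENDED_ID = re.compile(r"[a-z0-9()\-._!*]+")


def is_recommended_id(candidate):
    """True if candidate is non-empty and uses only the recommended characters."""
    return RECOMMENDED_ID.fullmatch(candidate) is not None


for candidate in ["5c2775f0-1f59-11d8-a2da-b8a03c50a862",
                  "canis_lupus_103",
                  "Canis_lupus_103",   # upper case is excluded by rule 1
                  "wolf's"]:           # single quote is excluded by rule 3
    print(candidate, is_recommended_id(candidate))
```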
> >
> >
> > Appendix B.   The rules for local identifiers in various systems.
> >
> >
> > 1. URIs
> >
> > Full Specification: http://www.ietf.org/rfc/rfc2396.txt
> >
> > URIs are described as follows
> >
> > <scheme>://<authority><path>?<query>#<fragment>
> >
> > The path, query and fragment are all case sensitive.  Reserved
> and "unwise" characters are:
> >
> > | ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
> > | "$" | "," | "<" | ">" | "#" | "%" | """ | "{" |
> > | "}" | "|" | "\" | "^" | "[" | "]" | "`"
> >
> > also, no spaces and no control characters (hex 00-1F and 7F)
> >
> > Oddly, the "'" (single-quote) character is not considered
> unwise, although people often (incorrectly) use ' in HTML, so I
> did not include it in the recommendation in Appendix A.
> >
> >
> > 2.  DOIs
> >
> > Information here: http://www.doi.org/hb.html
> >
> > DOIs are described as follows
> >
> > 10.<authority>/<suffix string>
> >
> > DOIs are case sensitive.  Any Unicode 2.0 character is legal
> for the suffix string; however, it can't start with a character
> followed by a "/".
> >
> > Because DOIs are often found inside URIs, it's best to follow
> the restrictions of URIs.
> >
> >
> > 3.  LSID
> >
> > Information here:
> http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp
> > Specification here:
> http://www.i3c.org/wgr/ta/resources/lsid/docs/LSIDSyntax9-20-02.htm
> >
> > LSIDs are described as follows
> >
> > URN:LSID:<authority>:<namespace>:<objectID>:<revisionID>
> >
> > They are case insensitive.
> >
> > LSIDs restrict their characters to:
> >
> > | "(" | ")" | "+" | "," | "-" | "." |
> > | "=" | "@" | ";" | "$" | """ |
> > | "_" | "!" | "*" | "'" |
> > | "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
> > | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
> > | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
> > | "Y" | "Z"
> > | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> > | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> > | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> > | "y" | "z"
> > | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> > | "8" | "9"
> >
> _______________________________________________
> seek-taxon mailing list
> seek-taxon at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/seek-taxon




