[SEEK-Taxon] identifiers for taxonomic concepts

Fri Nov 28 08:42:31 PST 2003

A number of people have been chatting about the concept of using UUIDs 
(or at least some form of ID) for taxonomic concepts for a while now. 
Thanks for providing a good summary of why UUIDs are probably the way to 
go (at least for now).

The only thing against UUIDs is that they are not nice for people to 
use - it's easy to make a mistake writing them down or keying them in, and I 
imagine that a publication with a bunch of UUIDs for cross reference to 
concepts will look pretty ugly.  Oh, and I suspect that taxonomists will 
be concerned that UUIDs can only be generated for 2400 years or so 
before they start repeating.

That said, since SEEK is a research project, it makes a lot of sense to 
go ahead and use such a system, just to see how it works out.  If UUIDs 
are too unpleasant then it would not be that hard to build into the SEEK 
architecture a UUID resolution service that takes a more human friendly 
rendering of a UUID and returns the actual UUID, and vice versa.
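
A rough sketch of what such a resolution service might look like, as a
toy in-memory lookup in Python.  The class name, the method names and the
alias "concept-0001" are all made up for illustration; nothing like this
exists in SEEK as far as this thread says, and a real service would need
persistent storage and a network interface.

```python
import uuid


class UuidAliasResolver:
    """Toy two-way lookup between human-friendly aliases and UUIDs.

    Hypothetical: the name and interface are assumptions for this sketch.
    """

    def __init__(self):
        self._alias_to_uuid = {}
        self._uuid_to_alias = {}

    def register(self, alias, concept_uuid):
        # Refuse to silently rebind either side of an existing mapping.
        if alias in self._alias_to_uuid or concept_uuid in self._uuid_to_alias:
            raise ValueError("alias or UUID is already registered")
        self._alias_to_uuid[alias] = concept_uuid
        self._uuid_to_alias[concept_uuid] = alias

    def to_uuid(self, alias):
        # Friendly rendering -> actual UUID.
        return self._alias_to_uuid[alias]

    def to_alias(self, concept_uuid):
        # Actual UUID -> friendly rendering.
        return self._uuid_to_alias[concept_uuid]


resolver = UuidAliasResolver()
example_uuid = uuid.UUID("5c2775f0-1f59-11d8-a2da-b8a03c50a862")
resolver.register("concept-0001", example_uuid)  # alias is hypothetical
```

The alias is kept semantics free on purpose, in line with the argument in
the note below.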

Dave V.

thau at learningsite.com wrote:

> Hello everyone,
> 
> I've been thinking a bit about what kind of identifiers to use for
> representing taxonomic concepts throughout SEEK.  I have a longish note,
> which is attached here as a text file and as a word document.  
> 
> Here's the summary:
> 
> -----
> 
> SEEK is storing data using grid technologies.  The draft mechanism for
> identifying resources in the ecogrid uses URIs of the form: 
> 
> ecogrid://registered.naming.authority/local_id
> 
> I think the local_ids for taxonomic concepts should be UUIDs -
> semantic-free, globally unique strings generated following a specific
> algorithm.  UUIDs look like this:
> 
> 5c2775f0-1f59-11d8-a2da-b8a03c50a862
> 
> Libraries for generating UUIDs already exist in most major languages.  If
> UUIDs are too ugly, ids should at least be semantics free, all lower
> case, and draw from the following character set if we want the flexibility
> of using the ids for systems outside of the ecogrid:
> 
> | "(" | ")" | "-" | "." |
> | "_" | "!" | "*" | 
> | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> | "y" | "z" |
> | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> | "8" | "9" |
> 
> -----
> 
> The attached document has justifications for, and expansions of, these
> opinions, which you may want to digest on Friday, along with your
> Thanksgiving and/or Eid leftovers, or wait until next week if you're
> afraid of upsetting your stomach.
> 
> Dave
> 
> 
> ------------------------------------------------------------------------
> 
> Identifiers in the SEEK Taxonomic Concept Repository
> 
> 
> 0.  The Short Story
> 
> SEEK is storing data using grid technologies.  The draft mechanism for identifying resources in the ecogrid uses URIs of the form: 
> 
> ecogrid://registered.naming.authority/local_id
> 
> I think the local_ids for taxonomic concepts should be UUIDs - semantic-free, globally unique strings generated following a specific algorithm.  UUIDs look like this:
> 
> 5c2775f0-1f59-11d8-a2da-b8a03c50a862
> 
> Libraries for generating UUIDs already exist in most major languages.  If UUIDs are too ugly, ids should at least be semantics free, all lower case, and draw from the following character set if we want the flexibility of using the ids for systems outside of the ecogrid:
> 
> | "(" | ")" | "-" | "." |
> | "_" | "!" | "*" | 
> | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> | "y" | "z" |
> | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> | "8" | "9" |
> 
> 
> I.  Introduction
>                         
> There is a clear need for taxonomic concept identifiers in SEEK.  Taxonomic concepts are complicated data objects, and most components of SEEK need only store a simple identifier for a concept, as long as the full representation is available if necessary.  I'll be calling the system used to issue taxonomic concept identifiers, and tie them to their fuller representations, a taxonomic concept repository (TCR).   I'm going to argue here that UUIDs (as defined by the Internet Engineering Task Force) are a good candidate for internal identifiers in the TCR, but, really, almost any internal identifier is fine as long as it follows a small set of rules (listed in Appendix A).
> 
> 
> II.  The Wonderful World of Unique Identifiers
> 
> There are currently a number of different schemes for issuing unique, persistent, location independent resource locators.  Some biggies are:
> 
> Life Science Identifiers (LSID, a type of URN with associated protocols for id resolution)
> Digital Object Identifiers (DOI)
> Uniform Resource Identifiers (URI)
> 
> If location independence is relaxed, URLs and RDF IDs count as well - they're a specific type of URI.
> 
> All of these schemes consist of a way of indicating a naming authority and a way of locating a data resource within that naming authority.  I'll call the naming authority the global part of the id, and the part locating the resource inside that authority the local part of the id.  
> 
> Let's say each naming authority assigns each taxonomic concept an integer.  For example, Kansas may assign the taxonomic concept id 3214 to "Canis lupus as described by Linnaeus in 1758 with the original specimen circumscription."  That number could be mapped onto the identifiers described above like this:
> 
> LSID - urn:lsid:ku.edu:ecogrid:3214
> 
> DOI - 10.1202/3214  (where 1202 is an IDF-assigned number)
> 
> URI - ecogrid://ku.edu/3214
> 
> URL - http://seek.ku.edu/taxonomy.owl#3214
> 
> As you can see, all these cases have an id representing the item, and then one or more terms describing a naming authority.  
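
As a minimal sketch of the mapping just described, the Python below builds
the four forms from one local id.  The function name is hypothetical; the
authority strings (ku.edu, the 1202 prefix, the taxonomy.owl URL) are
simply the example values used in the text above, not registered names.

```python
def identifier_forms(local_id):
    """Render one local id in each of the example schemes above."""
    local_id = str(local_id)
    return {
        # global part (naming authority) + local part
        "LSID": "urn:lsid:ku.edu:ecogrid:" + local_id,
        "DOI": "10.1202/" + local_id,
        "URI": "ecogrid://ku.edu/" + local_id,
        "URL": "http://seek.ku.edu/taxonomy.owl#" + local_id,
    }


forms = identifier_forms(3214)
for scheme, identifier in forms.items():
    print(scheme, "-", identifier)
```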
> 
> In all of these cases, the local id is fairly unconstrained.  Each scheme has its own set of forbidden characters, and some are case sensitive while others are not.  Each scheme also allows for multiple naming authorities, because the full unique id includes the name of the naming authority.  Of the examples listed, the LSID has the most atomic pattern.  It has separate fields for the URN name space (lsid), the authority id (ku.edu), the name space inside that authority (ecogrid) and the id of the item.  
> 
> Because SEEK is using grid technologies for data storage, the URI standard makes the most sense for identifying taxonomic concepts in SEEK.  Inside the TCR, the local ids can be simple numbers or strings as long as none of the characters forbidden in URIs are used.  
> 
> 
> III.  Semantic Identifiers vs Opaque Strings
> 
> So, what should these IDs look like?  One of the main decisions about identifiers is whether or not they should have any semantic value.  Identifiers with semantic value make nice addresses, and might help users determine the relevance of a taxon based on the identifier, without having to go an extra step to resolve the identifier to its fuller concept.  The task of determining the format of semantically meaningful identifiers involves determining the information necessary to create unique identifiers, deciding what information should be packed into the identifier, and determining the steps necessary to create the unique id based on the relevant information.  One possibility would be to include the name of the taxon and a number: canis_lupus_103.  This would let a user see what taxon is being identified, but not necessarily which version of that taxon.  
> 
> The downside of semantically informative identifiers is they may perpetuate the problem of people misusing taxonomic names.  Someone might think bluebird_102 and bluebird_232 are equivalent because they're both bluebirds.  Only by going the extra step of resolving the concepts can a user determine whether the taxa are equivalent or not, and a user may skip that step based on the content of the identifier.   
> 
> Unless there is a very good reason to include semantics, it's probably best to leave them out of the identifiers to avoid possible confusion and to simplify their generation.
> 
> 
> IV.  Unique Within a Naming Authority vs Globally Unique
> 
> All of the schemes described above allow for duplication of local ids as long as they're individuated by their naming authority.  For example, there could be two different concepts with id 3214 as long as they are stored and shared outside of the local databases as
> 
> ecogrid://ku.edu/3214
> ecogrid://nceas.edu/3214
> 
> In the world of the grid, these handles would be registered somewhere as separate entities, so there's no conflict.  Outside the grid, however, the numbers 3214 do conflict.  If people outside of SEEK wanted to use the identifiers for some reason, it would be odd for them to have to refer to the taxonomic concepts as "ecogrid://ku.edu/3214."  This might be irrelevant - perhaps we only want SEEK to have access to these taxonomic concepts.  But, it wouldn't be hard to create globally unique IDs which could be used outside of SEEK.  
> 
> A UUID is a string generated according to a specific mechanism which guarantees that it's unique among all UUIDs for another 2400 years. See http://www.ietf.org/internet-drafts/draft-mealling-uuid-urn-01.txt for the full specification.  UUIDs are being used in many places (Web Services, Bluetooth and Microsoft all use them), there are code libraries already written in Perl, Java, Python, C, and C++, and there are classes built into Globus, which is what ecogrid is being developed on.
> 
> Here's a typical looking UUID: 5c2775f0-1f59-11d8-a2da-b8a03c50a862
> (This is the id for Missouri Botanical's DiGIR service in the GBIF UDDI registry)
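
For illustration only, here is how a UUID might be generated with Python's
standard uuid module and dropped into an ecogrid URI.  The choice of Python
and of ku.edu as the authority are assumptions for the example; the note
above does not prescribe a library.  uuid1() is the time-based variant the
note is describing; uuid4() is random, so its uniqueness is a matter of
overwhelming probability rather than a guarantee.

```python
import uuid

# Time-based UUID (version 1): built from a timestamp plus a node id.
time_based = uuid.uuid1()

# Random UUID (version 4): no timestamp or node id involved.
random_based = uuid.uuid4()

# str() gives the familiar lower-case, hyphenated 36-character form.
concept_uri = "ecogrid://ku.edu/" + str(time_based)
print(concept_uri)

# The example UUID from the text parses as a version 1 UUID.
example = uuid.UUID("5c2775f0-1f59-11d8-a2da-b8a03c50a862")
print(example.version)
```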
> 
> It isn't pretty but it does have the benefits of being globally unique, not needing to be issued by a central authority, and being valid in all the identifier types described above.  
> 
> Using UUIDs provides the extra flexibility of not needing to tie the naming authority to the id.  Suppose at some point we wanted the TCR to operate with systems outside the ecogrid.  It would be a neat trick, and it would simplify implementation if all of the below pointed to the same taxonomic concept:
> 
> ecogrid://ku.edu/3214
> urn:lsid:ku.edu:ecogrid:3214 
> http://seek.ku.edu/taxonomy.owl#3214
> doi:10.1202/3214 
> 
> Except for DOI, this is straightforward for all the schemes without resorting to UUIDs because the naming authority is at more-or-less the same level of resolution in the others.  In the case of DOI, the number 1202 stands for the naming authority.  For DOI to work the same as the others, each naming authority would have to be issued its own number.  This would restrict the ability to add new naming authorities by necessitating that each new authority register with some central agency for their own DOI prefix.  For example, without using UUIDs we might have this 
> 
> ecogrid://ku.edu/3214
> ecogrid://nceas.edu/3214
> 
> To map that onto DOI, ku.edu and nceas.edu would need their own DOI prefixes.  However, if we used UUIDs, we could have one DOI prefix and all the local IDs could be registered to that prefix.
> 
> If we do plan on using the unique ids as a way to work with LSIDs and DOIs, decoupling the naming authority from the local ID would make it easier to determine when two system IDs point to the same concept.  Without using UUIDs these two ids may or may not point to the same taxonomic concept:
> 
> doi:10.1202/3214 
> ecogrid://ku.edu/3214
> 
> The only way to know is to resolve both and compare the concepts.  If, however, we use UUIDs, these two ids would by definition point to the same taxonomic concept:
> 
> doi:10.1202/5c2775f0-1f59-11d8-a2dc-b8a03c50a862 
> ecogrid://ku.edu/5c2775f0-1f59-11d8-a2dc-b8a03c50a862
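
The comparison above could be sketched roughly as follows in Python.  The
helper names local_part and same_concept are hypothetical, and the parsing
only covers the four identifier shapes shown in this note; it is not a
general parser for any of these schemes.

```python
import uuid


def local_part(identifier):
    """Strip the naming authority and return the local id (lower-cased)."""
    lowered = identifier.strip().lower()
    if lowered.startswith("urn:lsid:"):
        # urn:lsid:<authority>:<namespace>:<object id>
        return lowered.split(":")[4]
    if lowered.startswith("doi:"):
        # doi:<prefix>/<suffix>
        return lowered.split("/", 1)[1]
    if "#" in lowered:
        # URL with the id in the fragment
        return lowered.rsplit("#", 1)[1]
    # ecogrid://<authority>/<local id>
    return lowered.rsplit("/", 1)[1]


def same_concept(id_a, id_b):
    """True/False if both local ids are UUIDs; None if we cannot tell."""
    try:
        return uuid.UUID(local_part(id_a)) == uuid.UUID(local_part(id_b))
    except ValueError:
        # Local ids such as 3214 are only unique within an authority,
        # so both ids would have to be resolved and compared.
        return None


with_uuids = same_concept(
    "doi:10.1202/5c2775f0-1f59-11d8-a2dc-b8a03c50a862",
    "ecogrid://ku.edu/5c2775f0-1f59-11d8-a2dc-b8a03c50a862",
)
without_uuids = same_concept("doi:10.1202/3214", "ecogrid://ku.edu/3214")
print(with_uuids, without_uuids)
```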
> 
> 
> V.  Conclusion
> 
> As stated at the beginning, any string will work as a valid local id as long as none of the rules of the various schemes are violated.  This is because URIs, and the other schemes, all ensure uniqueness by prepending the local ids with disambiguating global ids.  This practice means that the unique identifier for a concept must carry with it the name of some authority, which may be problematic.
> 
> Decoupling the naming authority from the local ID seems like good practice and it may make TCR identifiers more relevant to systems outside of SEEK.
> 
> 
> Appendix A.   Rules valid for URIs, DOIs and LSIDs
> 
> In order to create identifiers which will work under all of these schemes the following rules should be followed.
> 
> 1.  Some systems are case sensitive, while others are not.  To prevent conflicts, all alpha characters should be in the same case - I recommend lower case as that seems to be the W3C naming convention for URLs.
> 
> 2.   The following characters are legal in all the described systems:
> 
> | "(" | ")" | "-" | "." |
> | "_" | "!" | "*" | "'" |
> | "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
> | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
> | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
> | "Y" | "Z"
> | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> | "y" | "z"
> | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> | "8" | "9"
> 
> 
> 3.  Given #1, #2 and my opinion that "'" should not be used in identifiers because people often incorrectly use single quotes in HTML, here are the characters I recommend using for unique identifiers:
> 
> | "(" | ")" | "-" | "." |
> | "_" | "!" | "*" | 
> | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> | "y" | "z"
> | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> | "8" | "9"
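
A small check for rule 3 might look like the Python below.  The names
RECOMMENDED_CHARS and is_recommended_local_id are made up for this sketch;
the character set is copied from the list just above.

```python
import string

# Recommended set from rule 3: seven punctuation marks, lower-case
# letters and digits.  Upper case and the single quote are left out.
RECOMMENDED_CHARS = frozenset(
    "()-._!*" + string.ascii_lowercase + string.digits
)


def is_recommended_local_id(candidate):
    """True if candidate is non-empty and uses only recommended characters."""
    return bool(candidate) and all(ch in RECOMMENDED_CHARS for ch in candidate)


print(is_recommended_local_id("5c2775f0-1f59-11d8-a2da-b8a03c50a862"))
print(is_recommended_local_id("Canis_lupus_103"))
```

The first call passes because a UUID in its usual form uses only lower-case
hex digits and hyphens; the second fails on the upper-case "C".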
> 
> 
> Appendix B.   The rules for local identifiers in various systems.
> 
> 
> 1. URIs
> 
> Full Specification: http://www.ietf.org/rfc/rfc2396.txt
> 
> URIs are described as follows
> 
> <scheme>://<authority><path>?<query>#<fragment>
> 
> The path, query and fragment are all case sensitive.  Reserved and "unwise" characters are:
> 
> | ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
> | "$" | "," | "<" | ">" | "#" | "%" | """ | "{" |
> | "}" | "|" | "\" | "^" | "[" | "]" | "`"
> 
> also, no spaces and no control characters (hex 00-1F and 7F)
> 
> Oddly, the "'" (single-quote) character is not considered unwise, although people often (incorrectly) use ' in HTML, so I did not include it in the recommendation in Appendix A.
> 
> 
> 2.  DOIs
> 
> Information here: http://www.doi.org/hb.html
> 
> DOIs are described as follows
> 
> 10.<authority>/<suffix string>
> 
> DOIs are case sensitive.  Any Unicode 2.0 character is legal in the suffix string; however, the suffix can't start with a character followed by a "/".
> 
> Because DOIs are often found inside URIs, it's best to follow the restrictions of URIs.
> 
> 
> 3.  LSID
> 
> Information here: http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp
> Specification here: http://www.i3c.org/wgr/ta/resources/lsid/docs/LSIDSyntax9-20-02.htm
> 
> LSIDs are described as follows
> 
> URN:LSID:<authority>:<namespace>:<objectID>:<revisionID>
> 
> They are case insensitive.
> 
> LSIDs restrict their characters to:
> 
> | "(" | ")" | "+" | "," | "-" | "." |
> | "=" | "@" | ";" | "$" | """ |
> | "_" | "!" | "*" | "'" |
> | "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
> | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
> | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
> | "Y" | "Z"
> | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
> | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
> | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
> | "y" | "z"
> | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> | "8" | "9"
> 
> 