[seek-dev] looking into global unique identifier systems

Wed Mar 10 14:21:43 PST 2004

Hello everyone,

I've been looking into a couple of the popular systems for assigning and
resolving globally unique identifiers on the internet.  My main interest
has been in assigning and resolving unique ids for the taxonomic concepts
being devised by the taxon group, but this probably has impact on other
areas of SEEK.

Specifically, I've been looking at various versions of LSIDs
(http://www-124.ibm.com/developerworks/oss/lsid/) and the
Handle System (www.handle.net, which underlies DOI, www.doi.org).

This email is a bit long - if you skip to the bottom you can see a
shortish summary of the pros and cons of these systems.  Alternatively, if
you find this email pleasingly concise and are a glutton for even more GUID
joy, I've put two documents in CVS:
http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
which give a bit of background on what I'm talking about here.

First I'll talk about my experiences with the Handle System.

I've had a great time with the Handle System.  Setting it up was pretty
simple, and it comes with handy GUI and command line tools for entering,
deleting and modifying handles.  I had to receive a handle registry number
from CNRI, and that took about a day, which isn't bad.  

The Handle System has a bulk registration facility in which you list a
bunch of handles in a text file and process it with a Java program.
It's fairly slow - about 1.5 seconds per handle.  That may not seem bad,
but it means that registering all 400,000+ items already in the Kansas
database will take about a week of continuous running (400,000 x 1.5
seconds is roughly 7 days).  I'm not sure why it takes as long as it
does... it might be worth trying to speed it up, and I think there's a
pretty simple way to do that.
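
As a quick sanity check on that estimate, here is the arithmetic as a
small Python sketch.  The 1.5 seconds and 400,000 figures are the ones
quoted above; the function name is made up for illustration, and the
result assumes the rate stays constant and the job runs nonstop.

```python
SECONDS_PER_HANDLE = 1.5   # observed bulk-registration rate quoted above
ITEM_COUNT = 400_000       # lower bound on the size of the Kansas database
SECONDS_PER_DAY = 24 * 60 * 60


def bulk_registration_days(items: int, seconds_per_item: float) -> float:
    """Estimate wall-clock days for a sequential bulk registration run."""
    return items * seconds_per_item / SECONDS_PER_DAY


if __name__ == "__main__":
    days = bulk_registration_days(ITEM_COUNT, SECONDS_PER_HANDLE)
    print(f"{days:.1f} days")
```

Halving the per-handle time halves the total, so even a modest speedup
in the registration step would matter at this scale.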

Right now I have registered the first 1100 taxonomic concepts from the
taxon concept database at Kansas.

When you register a handle, you can attach a bunch of information to it.
For each handle, I've attached an email address (mine) and three URLs.
The first URL leads to a web page which describes the concept and tells
you how to get more information about it.  The second URL leads to an OWL
representation of the concept.  Well... it will when I've finished that
part :)  The third URL leads to an XML version of the concept.

You can view these pages by going to 
http://plato.learningsite.com:8080/handleresolve/index.jsp
and filling out the form.  Be nice, I haven't bulletproofed anything.
Remember I only have the first 1100 taxonomic concepts, and I started at
#2 so handles like 1883/24, 1883/345, and 1883/2 are ok but anything under
2 or over 1101 will get you ugly results.
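
If you want to script against the form, a check like the following
would keep you inside the registered range.  This is only a sketch: the
prefix 1883 and the bounds 2 and 1101 come from the paragraph above, and
the function name is hypothetical (it is not part of the Handle System
tools).

```python
PREFIX = "1883"                 # the handle prefix used in this email
FIRST_ID, LAST_ID = 2, 1101     # registered range described above


def is_registered(handle: str) -> bool:
    """Return True if handle looks like 1883/<n> with n in the registered range."""
    prefix, sep, suffix = handle.partition("/")
    if prefix != PREFIX or not sep:
        return False
    # Only plain ASCII digits count as a valid numeric suffix here.
    if not (suffix.isascii() and suffix.isdigit()):
        return False
    return FIRST_ID <= int(suffix) <= LAST_ID


if __name__ == "__main__":
    for h in ("1883/2", "1883/345", "1883/1", "1883/1102"):
        print(h, is_registered(h))
```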

Alternatively, you can visit the first page (the ugly web page) by going
to: http://hdl.handle.net/1883/582

This works because when the thing running at hdl.handle.net is given a
handle it resolves it, grabs the first URL it can find in the handle
record and forwards to that.  It could just as easily automatically
forward to the XML version, the OWL version, or whatever.

Likewise, you can visit the XML page by going to
http://hdl.handle.net/1883/582?index=3

and the OWL page by going to
http://hdl.handle.net/1883/582?index=2
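
The URL patterns above are easy to generate in code.  Here's a minimal
Python sketch; the helper name proxy_url is made up, and the only thing
it assumes about the proxy is the ?index=N form shown in the URLs above.

```python
from typing import Optional
from urllib.parse import urlencode

PROXY = "http://hdl.handle.net"   # the public proxy used in the URLs above


def proxy_url(handle: str, index: Optional[int] = None) -> str:
    """Build a proxy URL for a handle, optionally selecting one stored value."""
    url = f"{PROXY}/{handle}"
    if index is not None:
        url += "?" + urlencode({"index": index})
    return url


if __name__ == "__main__":
    print(proxy_url("1883/582"))      # default: first URL in the record
    print(proxy_url("1883/582", 3))   # XML version, per the text above
    print(proxy_url("1883/582", 2))   # OWL version, per the text above
```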

The page on learningsite just gives you a web interface to these things.

The handle system resolves handles very quickly.  Almost all of the
(small) delay you get after putting in a handle comes from the code I use
to generate the html and xml.  That code... for reasons of sheer
perversity, is based on the code for resolving LSIDs.  Actually, there's a
good reason I used LSIDs and that leads me to...

The LSID code.

I had less fun playing with the LSID code.  There are several versions of
both the code to serve up LSIDs (the server part) and the client used to
communicate with the server (the client part).  I have the 1.0.1 version
of the server, which works well, but doesn't seem to work with Launch Pad,
which is the cool Windows Internet Explorer plugin client you can use to
automatically resolve LSIDs.  

Setting up the LSID server wasn't as straightforward as the handle server
stuff.  It doesn't have a handy installer, and the documentation isn't the
greatest (unless there's some I couldn't find...).  

One nice thing about the LSID code is that it plugs right into your
database.  That means you don't have to actually register anything to get
the LSIDs to work.  (There looks to be a pretty straightforward way to do
this with the handle system too, it's just not the default and I haven't
tried it yet.)

For example, if you ask for the LSID
urn:lsid:plato.learningsite.com:seek:582 the code just looks in the
database for a taxon concept with id 582 and delivers it.  Asking for
urn:lsid:plato.learningsite.com:seek:582.xml will serve up the XML version
of the concept, and urn:lsid:plato.learningsite.com:seek:582.owl will
serve up the OWL version.
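
To make the structure of those identifiers concrete, here's a rough
Python sketch that splits one into its parts.  It only handles the
five-part form used in this email (no revision field), it treats the
".xml" / ".owl" suffix as a format hint because that's how my setup
uses it, and the names Lsid and parse_lsid are invented for the example
rather than taken from the IBM LSID code.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str          # e.g. plato.learningsite.com
    namespace: str          # e.g. seek
    object_id: str          # e.g. 582
    fmt: Optional[str]      # e.g. "xml", "owl", or None for the default page


def parse_lsid(lsid: str) -> Lsid:
    """Split urn:lsid:<authority>:<namespace>:<object>[.<fmt>] into its parts."""
    parts = lsid.split(":")
    if len(parts) != 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a five-part LSID: {lsid!r}")
    authority, namespace, obj = parts[2], parts[3], parts[4]
    object_id, _, ext = obj.partition(".")
    return Lsid(authority, namespace, object_id, ext or None)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:plato.learningsite.com:seek:582"))
    print(parse_lsid("urn:lsid:plato.learningsite.com:seek:582.owl"))
```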

Don't be thrown by the existence of plato.learningsite.com in the LSID -
the software doesn't have to run on the machine called
plato.learningsite.com.  However, I, as owner of the learningsite.com
domain, have the responsibility to point to where the software is running.
This is done in learningsite's DNS - it says, if you get a request for an
LSID with plato.learningsite.com in it, forward that request to wherever
the correct LSID server is running.
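
As I understand the LSID resolution convention, the DNS entry involved
is an SRV record named _lsid._tcp.<authority>.  Treat that naming as my
assumption rather than gospel, and check it against the LSID spec.  The
sketch below only builds the name a client would look up; it does no
network I/O, and the function name is hypothetical.

```python
def srv_query_name(lsid: str) -> str:
    """Return the DNS SRV name a client would query for this LSID's authority.

    Assumes the _lsid._tcp.<authority> convention described in the lead-in.
    """
    parts = lsid.split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    authority = parts[2]
    return f"_lsid._tcp.{authority}"


if __name__ == "__main__":
    print(srv_query_name("urn:lsid:plato.learningsite.com:seek:582"))
```

The point is that only the authority part of the LSID takes part in the
lookup, so the server can move without any of the identifiers changing.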

You can see these LSIDs working by visiting 

http://plato.learningsite.com/authority/data/?lsid=urn:lsid:plato.learningsite.com:seek:582
http://plato.learningsite.com/authority/data/?lsid=urn:lsid:plato.learningsite.com:seek:582.xml
http://plato.learningsite.com/authority/data/?lsid=urn:lsid:plato.learningsite.com:seek:582.owl

These are exactly the pages you got with the Handle System.  The Handle
System just points to those URLs.  That's mostly what the Handle System
does - it points to things (URLs, email addresses, and URNs, for example).

The nice thing is, those URLs will work for all 400,000+ concepts in the
Kansas database - try
http://plato.learningsite.com/authority/data/?lsid=urn:lsid:plato.learningsite.com:seek:407825.xml
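
Since the URLs follow a fixed pattern, you can generate one for any
concept id.  A small Python sketch, with a made-up helper name; the
base URL, authority, and namespace are just the ones from the examples
above, and the colons are left unescaped to match those examples.

```python
from typing import Optional

BASE = "http://plato.learningsite.com/authority/data/"
AUTHORITY = "plato.learningsite.com"
NAMESPACE = "seek"


def data_url(concept_id: int, fmt: Optional[str] = None) -> str:
    """Build the data URL for a concept; fmt may be 'xml', 'owl', or None."""
    lsid = f"urn:lsid:{AUTHORITY}:{NAMESPACE}:{concept_id}"
    if fmt:
        lsid += f".{fmt}"
    return f"{BASE}?lsid={lsid}"


if __name__ == "__main__":
    print(data_url(582))
    print(data_url(407825, "xml"))
```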

So... what are the differences?

Handle System Pros

* Easier to install
* Easier for non-programmers to manage
* Handles look like this: 1883/the_id - which is pretty simple
* It's sort of like DOI, so people who are favorably inclined toward 
  DOIs will like these handles
* Big current user base

Handle System Cons

* Registering handles is a bit slow - which may be simple to solve
* These handles aren't part of any other internet standard
* Dependence on the greater Handle System network.  If it goes down, 
  the handles can't be resolved.
* No standards for metadata queries.
* Using the standard Handle System, there can be only one SEEK registrar -
  all SEEK handles would be 1883/whatever.  This means that registration
  of SEEK handles will have to be centralized. (some may say this is a
  pro)

LSID Pros

* They use the internet standard URN notation
* Rather than depending on the Handle System running, LSIDs depend on
  the internet running (more precisely, DNS)
* LSIDs have built-in metadata support 
  (standard calls to retrieve RDF-formatted information) 
* LSIDs natively support SOAP calls 
* LSIDs are less centralized, so there can be more than one issuer
  (some may say this is a con)

LSID Cons

* Less mature - seems to be still evolving a bit
* Not as simple to install
* IDs are a bit uglier, and they do include domain information, which
  may or may not be a big deal.
* Not sure about the current user base, but I think it's pretty small

So.... questions.

1.  Anyone else mess with this stuff?  If so, how does the above gel with
your experience?

2.  Which of these seems like a better route to take to uniquely identify
taxonomic concepts (among other things in SEEK)?  

3.  Are there other alternatives which ought to be explored?

Dave