[SEEK-Taxon] guids

dave thau thau at learningsite.com
Mon May 24 09:34:38 PDT 2004


Hello everybody,

I think discussions of GUIDs in Edinburgh went quite well.  A number of
conversations I had before and after the presentations, as well as some of
the demonstrations of what myGrid has been doing, have led me to
reconsider the idea of using the handle system to create an initial
prototype of the taxon GUID server and instead go with an initial LSID
implementation.  The main reasons for going with LSIDs are:

1.  Explicit versioning
2.  Explicit metadata
3.  Interoperability with other systems
4.  Interoperability with GBIF

1.  Explicit versioning

In conversations with Bob Peet, and in numerous contexts throughout the SEEK
meetings, the need for versioning taxonomic concepts became clear.  LSIDs
have an explicit mechanism for versioning, which grew directly out of
consultation with people in the life sciences.  An LSID looks like this:
urn:LSID:authority:context:localId:version

So, the first version of a taxon might be

urn:LSID:taxaserver.org:taxon:3432:1

This notation permits different versions of a taxonomic concept to have
different GUIDs, while still making it easy to recognize that two GUIDs
are versions of the same concept.  Because the version is an explicit part
of the LSID, systems using LSIDs will know that these are two versions of
the same thing.  Although versions could be added to the handles of a
handle server, the notation for doing so would be specific to SEEK, rather
than an explicit part of the handle system.  Therefore, the handle system
version of versioning would not be useful to third party systems which use
handles.
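
To make the versioning point concrete, here is a minimal Python sketch of
how a client might split an LSID into its parts and check whether two LSIDs
are versions of the same concept.  The names parse_lsid and same_concept are
hypothetical, the field names follow the layout given above, and the version
field is assumed to be optional.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    context: str
    local_id: str
    version: Optional[str]  # assumed optional; None when absent


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:LSID:authority:context:localId[:version]' into fields."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)


def same_concept(a: str, b: str) -> bool:
    """True when two LSIDs differ at most in their version field."""
    x, y = parse_lsid(a), parse_lsid(b)
    return x[:3] == y[:3]
```

Under these assumptions, urn:LSID:taxaserver.org:taxon:3432:1 and a
hypothetical later revision urn:LSID:taxaserver.org:taxon:3432:2 compare as
the same concept, without any SEEK-specific convention.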

2.  Explicit metadata

LSID provides an explicit way to retrieve metadata about a service.  The
standard for the metadata is now RDF - which is the backbone of OWL. 
Although there is no standard for what the metadata contains, there are
standard calls for retrieving the data.  The handle system has no such
standard, so any client using handles to retrieve data would have to know
our specific way of retrieving metadata about the handle.  Such metadata
might include the last time this record was updated, contact information
for the person responsible for maintaining the record, or it could even
include relationships to other LSIDs.
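
As a rough illustration of the kind of metadata described above, the
following Python sketch models one record and flattens it into
subject/predicate/object statements, which is the general shape RDF uses.
It is plain Python rather than real RDF, and every field and predicate name
(lastUpdated, contact, and so on) is an assumption, since there is no
standard for what the metadata contains.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class TaxonMetadata:
    lsid: str
    last_updated: date
    contact: str
    # (relation name, other LSID) pairs; relation names are hypothetical
    related: List[Tuple[str, str]] = field(default_factory=list)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Flatten the record into (subject, predicate, object) statements."""
        out = [
            (self.lsid, "lastUpdated", self.last_updated.isoformat()),
            (self.lsid, "contact", self.contact),
        ]
        out.extend((self.lsid, relation, other) for relation, other in self.related)
        return out
```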

3.  Interoperability with other systems

Because LSIDs use WSDL to announce which services they provide, and SOAP,
FTP, HTTP and other internet standards for providing information, LSIDs
embed easily in third party systems which use the standards.  For example,
the myGrid workflow software Taverna has a way to hook up with servers
providing LSIDs and use the LSIDs in those servers as input to actions. 
Apparently (I haven't tried this yet) the LSID server can provide services
beyond the standard ones.  These services would be announced in the WSDL
document.  If this is the case, the services could make the API to the
taxon server available to all LSID clients.  This would mean that a system
like myGrid, or Ptolemy (I suppose), could call the taxon server API
through the LSID server.  This allows the following scenario.

A user sees a taxaserver LSID somewhere and resolves it, getting the
data behind it and the services available for that LSID.  One of those
services might be to get synonyms.  The user can then create a workflow
actor which takes as input an LSID and outputs a list of synonyms that may
then be inputs to another actor.  This ties the taxon server directly into
workflow systems like Taverna and Kepler.  Now, it's true that a user
could do this with the taxon server regardless of whether or not it is
plugged into the LSID server.  However, having the LSID server interface
directly with the taxon server means a user can go directly from an LSID to
the services we support.  Nothing like this is available if we use the
handle server.
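
The scenario above can be sketched in a few lines of Python.  Everything
here is a stand-in: the synonym table, the service name getSynonyms, and the
functions resolve_services and make_actor are hypothetical, and a real
client would read the available services from the WSDL document rather than
from a dictionary.

```python
from typing import Callable, Dict, List

Service = Callable[[str], List[str]]

# Hypothetical in-memory stand-in for the taxon server: LSID -> synonym LSIDs.
_SYNONYMS: Dict[str, List[str]] = {
    "urn:LSID:taxaserver.org:taxon:3432:1": [
        "urn:LSID:taxaserver.org:taxon:77:1",
        "urn:LSID:taxaserver.org:taxon:912:1",
    ],
}


def get_synonyms(lsid: str) -> List[str]:
    """Return the synonyms recorded for an LSID, or an empty list."""
    return list(_SYNONYMS.get(lsid, []))


def resolve_services(lsid: str) -> Dict[str, Service]:
    """Stand-in for reading the services a resolver announces for an LSID."""
    return {"getSynonyms": get_synonyms}


def make_actor(service_name: str) -> Service:
    """Build a workflow step that looks up a service by name and calls it."""
    def actor(lsid: str) -> List[str]:
        return resolve_services(lsid)[service_name](lsid)
    return actor


def count_actor(lsids: List[str]) -> int:
    """A trivial downstream step that consumes the synonym list."""
    return len(lsids)
```

A workflow would then chain the steps, for example
count_actor(make_actor("getSynonyms")(some_lsid)), with the LSID as the only
thing the user needs to start from.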

4.  Interoperability with GBIF

Donald Hobern is recommending that GBIF use LSIDs for a variety of data
objects.  I don't think he has explicitly considered the handle system,
but he has considered DOIs.  He feels that LSIDs are the best identifiers
for interoperability and have broad enough support to use with comfort. 
Aligning SEEK's integration efforts with GBIF's and agreeing to a common
standard would do a great deal for interoperability between SEEK and many
other biodiversity informatics resources.


The downsides of LSIDs are that the domain information is in the identifier,
that they're not simple to reassign, and that publishers are less likely
to accept them.  Here are some arguments against these objections.

1.  Domain information in the handle.  I recommend registering something
neutral, like taxaserver.org, and registering information through that. 
Eventually, if things take off, a body composed of major stakeholders
can take it over.  If a functioning system is in place and being used, it
will be easy to find a host.

2.  Not simple to reassign.  This isn't strictly true: the information can
be hosted anywhere.  The only caveat is that an LSID assigned by
taxaserver.org will forever have the taxaserver.org domain.

3.  Publishers are less likely to accept them.  Although many publishers
have accepted DOIs, it's unclear that a handle would be preferable to an
LSID.  Publishers already accept GenBank accession numbers, which are
quite long and very different from DOIs.  If a publisher felt strongly
about only accepting a handle, it would be quite straightforward to issue
a handle which resolves to an LSID.  The only downside to this is that
people might start putting the handle in their databases rather than the
LSID.  If they do that, their database won't be immediately interoperable
with systems using LSIDs.  However, because the handle resolves to an
LSID, a database using the handles would simply have to resolve them to
their attached LSIDs to become interoperable with LSID systems.
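
As a minimal sketch of that last step, the Python below converts a column of
mixed identifiers to LSIDs.  The function to_lsids, the lookup table, and
the example handle are all hypothetical; a real system would replace the
table with a call to a handle resolver.

```python
from typing import Callable, Dict, Iterable, List


def to_lsids(identifiers: Iterable[str],
             resolve_handle: Callable[[str], str]) -> List[str]:
    """Return LSIDs, resolving anything that is not already an LSID."""
    result = []
    for ident in identifiers:
        if ident.lower().startswith("urn:lsid:"):
            result.append(ident)  # already an LSID, keep as-is
        else:
            result.append(resolve_handle(ident))  # assumed to be a handle
    return result


# Hypothetical lookup table standing in for a real handle resolution call.
HANDLE_TABLE: Dict[str, str] = {
    "hdl:0000.0/taxon-3432": "urn:LSID:taxaserver.org:taxon:3432:1",
}
```

Passing HANDLE_TABLE.__getitem__ as the resolver is enough to try this out;
an unknown handle raises KeyError rather than being silently dropped.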

Although these three objections are quite valid, I feel that LSID's
satisfaction of the SEEK goal of interoperability with networked
resources, its usage of internet standards, and its native treatment of
versioning argue in favor of using LSIDs at least for the first prototype.

I've attached the start of a document outlining what features the
prototype might contain.  This is very rough, but I think it's something
to start from.

Dave
-------------- next part --------------
A non-text attachment was scrubbed...
Name: guid_design.ppt
Type: application/vnd.ms-powerpoint
Size: 41984 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/seek-taxon/attachments/20040524/6c0c76dd/guid_design.ppt

