[tcs-lc] Misspelling db response

Tue May 3 03:22:18 PDT 2005

Charles,

I fully agree, and see this as one of the key things that GBIF for one will
need to address as we get a little further.

Considering the situation for a DiGIR-DwC or BioCASe-ABCD provider (since it
is a little simpler to describe - but I hope that this will simplify down to
TAPIR-DwC and TAPIR-ABCD in time), I hope that we will be able to get to the
following situation.

When a data provider registers a collection database (in the UDDI registry),
a set of alternative bindings for the same data set is automatically
registered.  These bindings have different service characteristics.  In the
case of a DiGIR-DwC provider, a BioCASe-ABCD endpoint could be registered
(pointing to a proxy service capable of performing the mappings).  More
interestingly there may be endpoints offering combinations of the following
features:

1. Expand search requests with (unambiguous) synonyms known from any source.
2. Expand search requests with synonyms from a specific (parameter-defined)
data resource (e.g. a particular GSD or NBN Gateway).
3. Expand search requests with alternative geospatial filters (e.g. ISO
country codes or coordinate rectangles to supplement country names?).
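
As a rough illustration of the three features above, here is a minimal
Python sketch of an expansion step that a proxy endpoint might apply before
forwarding a request. Everything in it is an assumption made for
illustration: the SearchRequest type, the expand_request function, the
source names, and the synonym and country tables are hypothetical and are
not part of DiGIR, BioCASe, TAPIR or any GBIF service. The sketch also does
not try to judge whether a synonym is unambiguous. Expansion is off by
default, so the caller explicitly chooses an expanded search.

```python
from dataclasses import dataclass, field

# Hypothetical synonym tables keyed by data resource. The resource names and
# entries are illustrative placeholders, not an authoritative checklist.
SYNONYMS_BY_SOURCE = {
    "source_a": {"Picea abies": ["Pinus abies", "Abies abies"]},
    "source_b": {"Picea abies": ["Picea excelsa"]},
}

# Hypothetical mapping from country names to ISO 3166-1 alpha-2 codes.
COUNTRY_CODES = {"Denmark": "DK", "United Kingdom": "GB"}


@dataclass
class SearchRequest:
    names: list
    countries: list = field(default_factory=list)
    country_codes: list = field(default_factory=list)


def expand_request(request, expand_synonyms=False, source=None, expand_geo=False):
    """Return a new request; the original request is left unchanged."""
    names = list(request.names)
    if expand_synonyms:
        # Feature 1: use every known source; feature 2: use only the named one.
        sources = [source] if source else sorted(SYNONYMS_BY_SOURCE)
        for name in request.names:
            for src in sources:
                for synonym in SYNONYMS_BY_SOURCE.get(src, {}).get(name, []):
                    if synonym not in names:
                        names.append(synonym)

    codes = list(request.country_codes)
    if expand_geo:
        # Feature 3: supplement country names with ISO country codes.
        for country in request.countries:
            code = COUNTRY_CODES.get(country)
            if code and code not in codes:
                codes.append(code)

    return SearchRequest(names, list(request.countries), codes)


request = SearchRequest(["Picea abies"], ["Denmark"])
plain = expand_request(request)                      # non-expanded search
expanded = expand_request(request, expand_synonyms=True, expand_geo=True)
```

Passing source="source_b" would restrict the expansion to that one
resource, which corresponds to feature 2.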

As we develop further, I am sure that we are going to recognise more areas
in which we should offer the possibility of expanding requests.  I think
though that users should always have the choice between expanded and
non-expanded searches.

Donald

---------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Programme Officer for Data Access and Database Interoperability 
Global Biodiversity Information Facility Secretariat 
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------

-----Original Message-----
From: tcs-lc-bounces at ecoinformatics.org
[mailto:tcs-lc-bounces at ecoinformatics.org] On Behalf Of Charles Hussey
Sent: 03 May 2005 12:02
To: tcs-lc at ecoinformatics.org
Subject: Re: [tcs-lc] Misspelling db response

This thread, for me, has homed in on a very real issue that will affect the
usefulness of any project that attempts to integrate different databases
(either through a portal or a datawarehouse approach).

The key is to provide a "query expansion tool" - otherwise relevant data is
likely not to be picked up through a simple query.

The problem in serving up data is in different renditions of a name +
authority string (let alone distinguishing different taxonomic concepts) and
the effort that needs to be put into mapping equivalences. Some mapping will
be obvious and trivial; others will need expert scrutiny. When one moves
from published records to online access to collections and observations
records, this problem is going to increase.
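
To make the "obvious and trivial" end of that mapping concrete, here is a
minimal Python sketch that groups renditions of a name + authority string
which differ only in case, spacing or punctuation. The function names and
the sample strings are assumptions for illustration; this is not how the
Species Dictionary does it. Anything the normalisation does not merge (an
abbreviated versus an expanded author, for instance) falls into its own
group and would still need expert scrutiny.

```python
import re
from collections import defaultdict


def normalise(name_string):
    """Reduce a name + authority string to a crude comparison key."""
    key = name_string.lower()
    key = re.sub(r"[().,]", " ", key)       # treat punctuation as spacing
    return re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace


def group_variants(name_strings):
    """Group raw strings that share the same normalised key."""
    groups = defaultdict(list)
    for raw in name_strings:
        groups[normalise(raw)].append(raw)
    return dict(groups)


# Illustrative renditions only; not taken from any particular data source.
variants = [
    "Picea abies (L.) Karst.",
    "Picea abies (L.) Karst",
    "picea  abies (L) Karst.",
    "Picea abies (L.) H. Karst.",
]
groups = group_variants(variants)
# The first three strings share one key; the fourth stays separate.
```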

Who should put in this effort?

1) Data provider (list compiler)
2) Data collator (manager of database incorporating several lists, e.g.
Fauna Europaea, ERMS)
3) Third party body (a nameserver e.g. GBIF ECAT)
4) Individual user (up to user to research possible alternatives that they
may need to use as search terms)
5) Users collectively (through online editing tool - e.g. IPNI project)

In the UK, the National Biodiversity Network has already had to tackle this
problem in its Gateway project which gives access to over 18 million species
observation records. I run the Species Dictionary project which manages
nomenclature for the NBN and we have started to map equivalences for
priority groups. Having all the observation records in one datawarehouse
(the Gateway) and all the name checklists in another warehouse (the
Dictionary) helps in the capture of all actually occurring name + author
strings and in the mapping of equivalences.

Here is how a search result for "Picea abies" is currently presented:
http://www.searchnbn.net/speciesInfo/taxonomy.jsp?searchTerm=Abies%20abies&spKey=NHMSYS0000461247

another example:
http://www.searchnbn.net/speciesInfo/taxonomy.jsp?searchTerm=Myotis&spKey=NHMSYS0000528026

There is a whole set of dangers associated with aggregating data that need
to be spelled out to users, and ours is a very simplistic approach
(pragmatism over purism).

My concern is that unless all name variants actually present in data sources
are captured and mapped, the user is going to only get a partial return and
will not even know that they are getting a partial return.

This is indeed a separate issue from constructing data exchange schemas, but
it should influence this debate.

Cheers,

Charles Hussey,

Science Data Co-ordinator,
Data and Digital Systems Team,
Library and Information Services,
Natural History Museum,
Cromwell Road,
London SW7 5BD
United Kingdom

Tel. +44 (0)207 942 5213
Fax. +44 (0)207 942 5559
e-mail c.hussey at nhm.ac.uk
Species Dictionary project: www.nhm.ac.uk/nbn/
Nature Navigator: www.nhm.ac.uk/naturenavigator/
