[seek-taxon] SEEK_Feb_21-23_WG_Meeting_Notes.doc

Wed Apr 9 07:02:18 PDT 2003

Here at last is the summary from the brief breakout session we had on taxa
at the Albuquerque PI meeting in late February.  There are also two summary
PowerPoint slides located at: 

http://speciesanalyst.net/seek/ow.asp?SEEKPiFeb21%5F2003

Notes from the Taxon Working Group Breakout Session at the SEEK PIs meeting
Feb 21-23, 2003, Albuquerque, N.M.

The breakout session wrestled with the basic issue of how to
disambiguate concepts with ambiguous names.  With enough information,
disambiguation seems possible and reasonable using richer semantics about
the relationships between concepts and names, but the key issue is
how to get that information from existing databases.  In other words, for
all of the existing classifications that need to be mapped, how would you
practically and automatically extract enough semantics to make a concept
mapping model work?  The WG came around to the view that, without unlimited
resources and time, this was not likely to be a practical approach for
the SEEK project.

The suggestion from the group was to start with the data that are available,
just as they are, and start mapping from there.  This is similar to the
conclusion reached in January by the WG.

For queries and retrieval of taxonomic concepts in the short term, in the
absence of any semantics, Susan Gauch suggested using IR (information
retrieval) techniques to assess what the query was asking, together with
frequency statistics on the occurrence of the name or concept in the
database, to return a probabilistic answer, i.e. these are the taxa or
concepts that are likely to be what you are looking for.  These approaches
can benefit from the way that ambiguous words (and essentially all words are
ambiguous) are handled by text processing systems.

Some examples and points from the discussion:

Ambiguity in queries

The goal here is to map from ambiguous user query terms (names) to unique
concepts.

One approach is to use the context in which the name string is used.  From
the nature or construction of the query you can assess what people are
looking for with ambiguous names.  The query "Which anemones grow in cedar
swamps?" indicates anemones as plants, not as marine invertebrates, so the
plant concepts would be assigned a higher probability of being correct.
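
As a rough illustration of the context idea above, the sketch below scores
each candidate concept by how many of its context words appear in the query.
The concept labels, the context word lists, and the function name
score_by_context are all made up for this example; a real IR system would
use proper term weighting and stemming rather than exact word overlap.

```python
# Hypothetical context words for two senses of "anemone" (illustrative only).
CONCEPT_CONTEXT = {
    "Anemone (plant)": {"swamps", "cedar", "forest", "grow", "flower"},
    "sea anemone": {"reef", "marine", "tide", "ocean", "coral"},
}

def score_by_context(query, concept_context=CONCEPT_CONTEXT):
    """Return {concept: probability} based on query/context word overlap."""
    words = {w.strip("?.,!\"'").lower() for w in query.split()}
    # Add-one smoothing keeps every concept possible even with no overlap.
    raw = {c: 1 + len(words & ctx) for c, ctx in concept_context.items()}
    total = sum(raw.values())
    return {c: n / total for c, n in raw.items()}

scores = score_by_context("Which anemones grow in cedar swamps?")
for concept, p in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{concept}: {p:.2f}")
```

With these made-up word lists the plant sense comes out ahead, which is the
behaviour the cedar swamp example calls for.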

Frequency of use is also a good way to distinguish which concept was meant.
If 90% of all people looking for anemones are looking for sea anemones, then
we can guess that this user is most likely looking for sea anemones.

User profiles can be used to help remove ambiguity: if I'm a marine
biologist, or have previously searched or used many marine biology data
sets, then I probably mean sea anemones when I use "anemones" in a query.
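
The frequency and user-profile ideas above could be combined along the
following lines.  This is only a sketch: the usage counts, the profile
weights, and the function name rank_concepts are assumptions for
illustration, not anything agreed in the session.

```python
def rank_concepts(usage_counts, profile_weights=None):
    """Rank concepts by past usage, optionally reweighted by a user profile.

    usage_counts:    {concept: number of past queries that meant it}
    profile_weights: {concept: multiplier for this user}, default 1.0
    """
    profile_weights = profile_weights or {}
    raw = {c: n * profile_weights.get(c, 1.0) for c, n in usage_counts.items()}
    total = sum(raw.values())
    probs = {c: v / total for c, v in raw.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical log: 90 of 100 past "anemone" searches meant sea anemones.
counts = {"sea anemone": 90, "plant anemone": 10}
general = rank_concepts(counts)

# Hypothetical botanist profile that strongly favours plant concepts.
botanist = rank_concepts(counts, {"plant anemone": 20.0})

print(general[0])   # most likely concept for an anonymous user
print(botanist[0])  # most likely concept for the botanist
```

Multiplying counts by a per-user weight is just one simple way to let a
profile override the population-wide frequency; how such weights would be
learned from search history is left open here.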

The simplest, initial approach for ambiguous query terms is to explicitly
ask the user which of a list of options they mean.  

Finally, we can ignore the ambiguity in the query and search for all data
sets that have anemone information, regardless of which concept the user
wants.  Or, if the data sets are identified by concepts, we map from the
name to all concepts that are known by that name, then search all data sets
relevant to any of those concepts.
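
A minimal sketch of that last option, assuming two lookup tables exist (name
to concepts, and concept to data sets).  The table contents and identifiers
below are hypothetical.

```python
# Hypothetical lookup tables; the identifiers are illustrative only.
NAME_TO_CONCEPTS = {
    "anemone": ["plant-anemone", "sea-anemone"],
}
CONCEPT_TO_DATASETS = {
    "plant-anemone": ["cedar_swamp_survey"],
    "sea-anemone": ["reef_transects", "tidepool_counts"],
}

def datasets_for_name(name):
    """Expand a name to every concept known by it, then to every data set."""
    found = set()
    for concept in NAME_TO_CONCEPTS.get(name.lower(), []):
        found.update(CONCEPT_TO_DATASETS.get(concept, []))
    return sorted(found)

results = datasets_for_name("Anemone")
print(results)
```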

Ambiguity in data sets

If I have two ecological data sets, how confident can I be that named
concepts "foo.1" and "foo.2" are the same
thing/taxon/circumscription/species?

The most accurate approach is to have the creator of the data set, or
another individual, provide meta-data that explicitly indicates which
concept is relevant to the data set.

If we do not have manually provided meta-data, then the user of the data set
(human or program) must guess which concept is being used.   The simplest
and least accurate approach is to map from the name in the data set to all
concepts it could be, assigning them as equally probable in the data set.
I.e., if there are n concepts that are known by the name "foo", then the
probability that a particular data set that is about "foo" is about one of
the possible concepts is 1/n.  This is a uniform distribution of
probabilities across the known possibilities.

We can do a somewhat better job of assigning the probabilities to the
possible concepts by examining one or more of the following: 

*	Dates are one way to disambiguate individual taxon names. 

*	Use geographic range to disambiguate. 

*	Use a tree of associations of authors. 

*	As data sets begin to be marked up by concept rather than by name,
we can use this information to make better guesses (assign better
probabilities) for names that have not been disambiguated:

o	If 80% of the "foo" data sets are identified with a particular
concept (e.g., "Foo" sensu Peet), then we would guess that an ambiguous "foo"
has an 80% chance of being "sensu Peet", even though there may be 5
possible concepts and a uniform distribution would have predicted 20%.
o	We could also look at the credibility/expertise level of those using
a particular concept and give a higher weight to particular groups' use of
taxa.
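
The probability assignments discussed above might be computed roughly as
follows.  The sketch falls back to the uniform 1/n distribution when no
marked-up data sets exist, and otherwise uses the (optionally weighted)
share of annotations per concept.  The concept labels, annotator names, and
the function name concept_prior are hypothetical.

```python
from collections import Counter

def concept_prior(candidates, annotations, annotator_weights=None):
    """Estimate P(concept | ambiguous name).

    candidates:        list of concepts known by the name
    annotations:       (concept, annotator) pairs from marked-up data sets
    annotator_weights: optional {annotator: credibility weight}, default 1.0
    """
    annotator_weights = annotator_weights or {}
    totals = Counter()
    for concept, annotator in annotations:
        if concept in candidates:
            totals[concept] += annotator_weights.get(annotator, 1.0)
    mass = sum(totals.values())
    if mass == 0:
        # No marked-up data sets yet: uniform distribution, 1/n each.
        return {c: 1 / len(candidates) for c in candidates}
    return {c: totals[c] / mass for c in candidates}

# Five hypothetical concepts that all go by the name "foo".
foo = ["Foo sensu Peet", "Foo sensu A", "Foo sensu B", "Foo sensu C", "Foo sensu D"]
# Ten hypothetical marked-up data sets, eight of them using "sensu Peet".
marked = [("Foo sensu Peet", "user1")] * 8 + [("Foo sensu A", "user2"),
                                             ("Foo sensu B", "user3")]

uniform = concept_prior(foo, [])
informed = concept_prior(foo, marked)
print(uniform["Foo sensu Peet"], informed["Foo sensu Peet"])
```

Dates, geographic range, and author associations could feed into the same
function as further weights, but how to score them is not something the
session settled.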

Summary

The basic idea is to begin simply and build a system that searches for all
data sets by name since that's what we have.  As we start to create
concepts, we have two basic problems to solve:
	Mapping from concepts to names (done manually by the concept
builders)
	Mapping from names to concepts
		In queries
		In data sets

In both cases, the mappings in queries and data sets can initially be done
manually, and we could build a small demo that works accurately for a small
number of manually created concepts, manually annotated data sets, and a
query window in which users query by concept rather than name alone.

As we improve the system, we could start to map from names to concepts for a
small number of data sets and assign probabilities to the different possible
concepts uniformly.  Then users could search by concept as before, but they
could control the level of uncertainty they are willing to accept.  If they
set the uncertainty level low, we would only query data sets that have been
accurately (manually) disambiguated (i.e., have a particular concept
assigned to them).  If they are willing to accept a certain level of
uncertainty, we would broaden the potential data sets to those that might
contain information about the concept.  
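
One possible reading of the user-controlled uncertainty level is a cutoff on
the probability that a data set refers to the requested concept.  The sketch
below takes that reading; the data set names, the probabilities, and the
max_uncertainty parameter are assumptions for illustration.

```python
def select_datasets(concept, dataset_probs, max_uncertainty=0.0):
    """Return (data set, probability) pairs the user is willing to accept.

    dataset_probs:   {data set: {concept: probability}}
    max_uncertainty: 0.0 accepts only manually disambiguated data sets
                     (probability 1.0); larger values admit guessed ones.
    """
    threshold = 1.0 - max_uncertainty
    hits = []
    for name, probs in dataset_probs.items():
        p = probs.get(concept, 0.0)
        if p > 0 and p >= threshold:
            hits.append((name, p))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical catalogue: one manually annotated data set, two guessed.
catalogue = {
    "manual_survey": {"sea-anemone": 1.0},
    "two_way_guess": {"sea-anemone": 0.5, "plant-anemone": 0.5},
    "plants_only": {"plant-anemone": 1.0},
}

strict = select_datasets("sea-anemone", catalogue)
relaxed = select_datasets("sea-anemone", catalogue, max_uncertainty=0.5)
print(strict)
print(relaxed)
```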

Then, we could begin to add intelligence to the concept disambiguation
process as described above so that the concept guesses are better.  

Finally, we could begin to address disambiguating the queries, but this
seems a lower priority since it is easy and more accurate to ask the user
what they want.

Notes:

Two solutions to achieve: 

1) 1:M: one word has many meanings, so you disambiguate. 

2) M:1: someone used "bar" for "foo", so you have an M:1 mapping; this is
where you need the classification and taxonomy (thesaurus, ontology). 

Handling ambiguity of terms: come back with a ranked list of terms from
fuzzy matching.  Disambiguation would be a formal process to make the
ambiguity go away; you must reveal yourself and tell us who you are. 
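
A ranked list of fuzzy matches could be produced with nothing more than
Python's standard difflib module, as sketched here.  The name list, the
cutoff value, and the function name ranked_matches are assumptions; the
similarity score is difflib's generic string ratio, not a taxonomy-aware
measure.

```python
import difflib

def ranked_matches(term, known_names, limit=5, cutoff=0.6):
    """Return up to `limit` (name, similarity) pairs, best match first."""
    scored = []
    for name in known_names:
        ratio = difflib.SequenceMatcher(None, term.lower(), name.lower()).ratio()
        if ratio >= cutoff:
            scored.append((name, ratio))
    return sorted(scored, key=lambda s: s[1], reverse=True)[:limit]

# Hypothetical name list; "anemona" stands in for a misspelled query term.
names = ["Anemone", "Anemonella", "Actiniaria", "Quercus"]
matches = ranked_matches("anemona", names)
for name, ratio in matches:
    print(f"{name}: {ratio:.2f}")
```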

Errors would be tracked as the probability that you have the right name.
"Your input data may contain as much as 50% of the taxa that you really
want." 

This doesn't deal with the issue of how to map classifications. Also changes
in classification need to be explicitly described, taxonomic changes through
time need to be explicit. 

Tree distance measures? How similar are two trees? How many transformations
are required to get from tree a to tree b? 
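
As one concrete possibility for a tree distance, the sketch below counts the
groupings (sets of leaves under an internal node) that appear in one tree
but not the other, in the style of a Robinson-Foulds symmetric difference.
Note that this is a swapped-in measure of how much two trees disagree, not a
count of the transformations needed to turn one into the other.  The
nested-tuple tree encoding and the taxon labels are assumptions.

```python
def clades(tree):
    """Return one frozenset of leaf names per internal node of the tree.

    A tree is a nested tuple; leaves are strings.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        under = frozenset().union(*(leaves(child) for child in node))
        found.add(under)
        return under

    leaves(tree)
    return found

def clade_distance(tree_a, tree_b):
    """Number of groupings present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

# Three hypothetical classifications of the same four taxa.
tree_a = (("sp1", "sp2"), ("sp3", "sp4"))
tree_b = (("sp1", "sp3"), ("sp2", "sp4"))
tree_c = (("sp1", "sp2"), "sp3", "sp4")

print(clade_distance(tree_a, tree_a))
print(clade_distance(tree_a, tree_b))
print(clade_distance(tree_a, tree_c))
```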

Draft working group work plan:

May 2003 Deliverable: Demo an IR approach to retrieving (Aimee Stewart, KU,
assigned to do this)

Develop a paper on this IR initial approach to the problem.

Longer term: Develop a way to represent concept relationships (yr 1), then a
way to author them over the net (yr 2), then ways to infer them (yr
3-4-5); the latter requires semantics.
