[seek-dev] update on TOS Kepler actors

Thu Sep 8 00:07:39 PDT 2005

Hi Matt,

> Thanks for the summary.  We agreed on the approach of using simpleTypes,
> and I still think that is the best approach, so no changes there.  Could
> you send seek-dev a configured workflow that exercises all of the TOS
> functions via the web services actor?

Sure.  I think I'll implement a stubbed server that returns realistic 
canned results (including at least some names that are present in MaNIS) 
until the implementation of TOS is done and fully tested.  If nothing 
else, it will at least be a proof of concept that the TOS portion of the 
ENM workflow will work.  I head out for the TDWG conference on Friday 
and will be out of the office tomorrow, but I will build the stub and 
put together the workflow while I'm at the conference and mail it to 
seek-dev as soon as possible.
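
For illustration only, a stub of this kind might be sketched roughly as 
follows.  The method names come from this thread, but the signatures, 
the use of plain strings for GUIDs, and every canned value are 
assumptions made for the example, not the actual TOS SOAP API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the TOS service; all data below is placeholder. */
class StubTos {
    // Canned data: species concept GUID -> names associated with that concept.
    private final Map<String, String[]> synonyms = new LinkedHashMap<>();

    StubTos() {
        synonyms.put("guid:species-1", new String[] {"Genus alpha", "Genus alfa"});
        synonyms.put("guid:species-2", new String[] {"Genus beta"});
    }

    /** Canned: every lookup resolves to the single stub root concept. */
    String getBestConcept(String name, String authority) {
        return "guid:root";
    }

    /** Canned: returns the GUIDs of all stub species regardless of arguments. */
    String[] getAuthoritativeList(String rootGuid, String rank, String authority) {
        return synonyms.keySet().toArray(new String[0]);
    }

    /** Returns the canned names for a concept, or an empty array if unknown. */
    String[] getSynonymousNames(String conceptGuid) {
        String[] names = synonyms.get(conceptGuid);
        return names == null ? new String[0] : names.clone();
    }

    /** Walks the three calls in workflow order, collecting names per species. */
    static Map<String, String[]> namesPerSpecies(StubTos tos) {
        Map<String, String[]> result = new LinkedHashMap<>();
        String root = tos.getBestConcept("Mammalia", "ITIS");
        for (String guid : tos.getAuthoritativeList(root, "species", "ITIS")) {
            result.put(guid, tos.getSynonymousNames(guid));
        }
        return result;
    }
}
```

Each array in the returned map would then feed one DiGIR query.  Only 
simple types and arrays of simple types cross the method boundaries, 
which is what the web services actor needs.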

> That way Dan and I can see it in
> action using Kepler.  I'll probably check it in so Dan can see how to
> use it within the ENM workflow.  Thanks.
> 
> I mostly agreed with the ENM workflow you described below, but I think
> there is a critical modification that you missed.  We need to
> make sure name strings are not used multiply among different concepts,
> i.e. each name is assigned to only one concept.  Procedurally, once we
> have the list of accepted species-concepts from TOS, we get the list of
> synonym names that have been used for that concept.  Then for each
> synonym name, we see what the most likely concept is for that name, then
> assign it to that (and only that) concept.  Then you move onto querying
> DiGIR for each concept using the associated set of names we've established.

This was left out of the workflow because it should not be necessary 
given the manner in which we have decided to handle MaNIS names.  For 
every name we harvest, we are creating a "calculated synonymy" 
relationship to concepts within all authoritative lists.  In other 
words, we assign a relationship between that name and its best 
concept(s) for every authoritative list.  So the basic algorithm would 
look like:

for (String name : namesImported)
{
   // relate each harvested name to its best concept(s) in every authority
   for (Authority auth : authorities)
   {
      Concept[] concs = getBestConcept(name, auth);
      for (Concept c : concs)
      {
         addRelationship(name, c, "calculated synonymy");
      }
   }
}

So getSynonymousNames will return names of two types:

1)  Names present in concepts that the authority defines as synonymous, 
for which a call to getBestConcept will return the GUID of the concept 
defined with that name.

or

2)  Names harvested from MaNIS, for which a call to getBestConcept will 
simply echo the GUID of the concept with which getSynonymousNames was 
originally called.

However unlikely it is that a name string will map to multiple concepts 
within a single authoritative list (particularly if that list is for a 
single class), there is really no feasible way that TOS can guarantee 
that a name string will not map to multiple concepts with equal scores. 
This is due to the limitation of having only string matching algorithms 
at our disposal when matching a name string to a concept.  Regardless of 
the algorithm chosen (currently n-gram based) or the set of algorithms 
combined through a voting or priority system, there will always be the 
possibility that a given name string will match equally to more than one 
concept within a particular authoritative list.  The end result is that 
the results of getSynonymousNames may overlap between concepts.  If this 
were to occur (and it will be interesting to see if it does with the 
MaNIS data), it seems that the individual doing the analysis should have 
a few options.  Given that name x matches concepts 1 and 2 equally, the 
following options may be appropriate:

1)  Remove x from the synonymy lists of both concepts 1 and 2, because 
the data resulting from queries using x cannot be reliably associated 
with either concept.

2)  Leave the overlap on x, because the data resulting from queries 
using x is equally likely to be applicable to concepts 1 and 2 (a bit 
risky in my opinion).

3)  Remove x selectively from one or the other of the synonymy lists. 
In other words, the individual doing the study assumes that the data 
resulting from a query on x is more likely associated with one of the 
concepts than the other.  This seems the most dangerous because the 
individual could be removing the name from the wrong synonymy list 
(e.g. removing x from concept 1 when the identifier of the specimen 
meant concept 2).
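
As a minimal sketch of option 1, assuming the synonymy lists have 
already been fetched into a map from concept identifier to a set of 
names (the class, the method names and the map layout are hypothetical, 
not part of TOS):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SynonymOverlap {
    /** Returns the names that occur in the list of more than one concept. */
    static Set<String> overlappingNames(Map<String, Set<String>> namesByConcept) {
        Set<String> seen = new HashSet<>();
        Set<String> overlap = new TreeSet<>();
        for (Set<String> names : namesByConcept.values()) {
            for (String n : names) {
                // add() returns false when the name was already seen elsewhere
                if (!seen.add(n)) overlap.add(n);
            }
        }
        return overlap;
    }

    /** Option 1: drop overlapping names from every list; input is not modified. */
    static Map<String, Set<String>> removeOverlaps(Map<String, Set<String>> namesByConcept) {
        Set<String> overlap = overlappingNames(namesByConcept);
        Map<String, Set<String>> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : namesByConcept.entrySet()) {
            Set<String> kept = new TreeSet<>(e.getValue());
            kept.removeAll(overlap);
            cleaned.put(e.getKey(), kept);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> lists = new LinkedHashMap<>();
        lists.put("concept1", new TreeSet<>(Arrays.asList("x", "a")));
        lists.put("concept2", new TreeSet<>(Arrays.asList("x", "b")));
        System.out.println(overlappingNames(lists));
        System.out.println(removeOverlaps(lists));
    }
}
```

Option 2 amounts to skipping removeOverlaps entirely, and option 3 
would need a per-name decision from the person doing the study, which 
is why it is not automated here.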

That is my humble opinion on the issue of synonymy lists with 
overlapping names.  Until data sources consistently provide more 
information with which TOS can accomplish finer-grained concept 
matching, this will continue to be a problem.
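
To illustrate how a tie can arise from string matching alone, here is a 
small sketch that scores names with a Dice coefficient over character 
bigrams.  This is an assumed stand-in for "n-gram based" matching, not 
the algorithm TOS actually uses, and the names are made-up placeholders:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NGramTie {
    /** Character bigrams of the lower-cased string, as a set. */
    static Set<String> bigrams(String s) {
        String t = s.toLowerCase();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= t.length(); i++) {
            grams.add(t.substring(i, i + 2));
        }
        return grams;
    }

    /** Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|). */
    static double dice(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0;
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return 2.0 * shared.size() / (ga.size() + gb.size());
    }

    /** All candidates sharing the top score; more than one entry means a tie. */
    static List<String> bestMatches(String name, List<String> candidates) {
        double best = -1.0;
        List<String> winners = new ArrayList<>();
        for (String c : candidates) {
            double score = dice(name, c);
            if (score > best) {
                best = score;
                winners.clear();
                winners.add(c);
            } else if (score == best) {
                winners.add(c);
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        // The misspelled "Genus albo" differs from two candidates by one letter each.
        System.out.println(bestMatches("Genus albo",
                Arrays.asList("Genus alba", "Genus albi", "Genus rufus")));
    }
}
```

Here "Genus alba" and "Genus albi" each share eight of nine bigrams with 
the query, so both come back as best matches and the string alone cannot 
break the tie.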

Cheers,
Rob

> 
> This addition is important to make sure that we are not re-using
> collection records in multiple species runs of the ENM.
> 
> Thanks.
> Matt
> 
> Robert Gales wrote:
> 
>>Hi Matt,
>>
>>After our phone discussion, I thought we had agreed that the best route
>>would be for Taxon to modify (or add to) our SOAP API such that
>>simple types and arrays of simple types could be used to take advantage
>>of the web services actor.  So after the phone call, I did exactly that,
>>adding any methods that the niche modelling case would require so that
>>they act on only simple types.  During that time, I also completely
>>isolated the SOAP layer from the business logic, which has been a
>>maintenance pain for quite some time.
>>
>>I ran quite a number of tests with the web service actor to become
>>familiar with it and to ensure that everything we needed was working
>>correctly.  I found and fixed a few bugs in the handling of arrays and
>>sent a patch to Ilkay, who forwarded it to Nandita.  There is still
>>one noticeable problem, which I found recently, in how it deals with
>>in/out parameters: it assigns the input and output ports the same
>>name, which causes a failure because multiple ports with the same name
>>are not allowed in Kepler/Ptolemy.
>>
>>As of now, we have all but one of the necessary API methods implemented,
>>some of which (getAuthoritativeList in particular) required some changes
>>in the database/hibernate model.  The one that is still under
>>investigation is getting the best concept from a name.  Currently Aimee
>>is working on a dictionary system to handle misspellings and match them
>>to the most appropriate concepts.  Once it has been completed, the
>>remainder of the implementation should be trivial.  This is the key
>>API method
>>however, because it will be the entry point into the TOS for the niche
>>modeling case.
>>
>>The workflow for the TOS portion of the niche modeling use case as I
>>understood it from Estes Park looks like the following:
>>
>>get the best concept according to ITIS for Mammalia (TOS responsibility)
>>  * returns the guid of the Mammalia concept from ITIS
>>
>>given the returned guid, get the authoritative list at the level of
>>species from the subtree rooted at the guid and within ITIS'
>>classification (TOS responsibility)
>>  * returns a list of the guids of the species that are descendants
>>    of Mammalia according to ITIS
>>
>>using the array actors in kepler, iterate through the list, calling
>>get synonymous names with each guid (TOS responsibility)
>>  * returns a list of all names that TOS knows about that have been
>>    associated with that guid, this includes the names pulled from
>>    MaNIS
>>
>>use those names to create the DiGIR query
>>
>>All of those methods have been implemented such that they take simple
>>types and return one or more simple types or arrays of simple types in
>>order to work with the web services actor.
>>
>>This is what I understood needed to be done from the Estes Park meeting
>>and our phone call.  If something has changed since then that requires
>>custom TOS actors that use our complicated object model then I'll
>>redirect my code cleanup/maintenance enhancing/testing efforts back
>>towards the Kepler actor issue.
>>
>>- Rob
>>
>>Matt Jones wrote:
>>
>>
>>>Hi Rob,
>>>
>>>I was wondering what the status is on this project.  Have you started
>>>work on the Kepler actors?  If not, has something been getting in the
>>>way in the TOS work?  Thanks for the info.
>>>
>>>Matt
>>>
>>
> 
