[seek-dev] update on TOS Kepler actors
Robert Gales
rgales at ku.edu
Thu Sep 8 00:07:39 PDT 2005
Hi Matt,
> Thanks for the summary. We agreed on the approach of using simpleTypes,
> and I still think that is the best approach, so no changes there. Could
> you send seek-dev a configured workflow that exercises all of the TOS
> functions via the web services actor?
Sure. I think I'll implement a stubbed server with realistic canned
results (ones that return at least some names present in MaNIS) until
the implementation of TOS is done and fully tested. If nothing
else, it will at least be a proof of concept that the TOS portion of the
ENM workflow will work. I head out for the TDWG conference on Friday
and will be out of the office tomorrow, but will build the stub and put
together the workflow while I'm at the conference and mail it to
seek-dev as soon as possible.
>That way Dan and I can see it in
> action using Kepler. I'll probably check it in so Dan can see how to
> use it within the ENM workflow. Thanks.
>
> I mostly agreed with the ENM workflow you described below, but I think
> there is a critical modification that I think you missed. We need to
> make sure name strings are not used multiply among different concepts,
> i.e. each name is assigned to only one concept. Procedurally, once we
> have the list of accepted species-concepts from TOS, we get the list of
> synonym names that have been used for that concept. Then for each
> synonym name, we see what the most likely concept is for that name, then
> assign it to that (and only that) concept. Then you move onto querying
> DiGIR for each concept using the associated set of names we've established.
This was left out of the workflow because it should not be necessary
given the manner in which we have decided to handle MaNIS names. For
every name we harvest, we are creating a "calculated synonymy"
relationship to concepts within all authoritative lists. In other
words, we assign a relationship between that name and its best
concept(s) for every authoritative list. So the basic algorithm would
look like
for (String name : namesImported)
{
    for (Authority auth : authorities)
    {
        // getBestConcept may return more than one concept when scores tie
        Concept[] concs = getBestConcept(name, auth);
        for (Concept c : concs)
        {
            addRelationship(name, c, "calculated synonymy");
        }
    }
}
So the results of getSynonymousNames will return names of two types:
1) Those names present in concepts defined as synonymous from the
authority, for which a call to getBestConcept will return the GUID of
the concept defined with that name.
or
2) A name harvested from MaNIS, for which a call to getBestConcept will
end up simply echoing the GUID of the concept with which
getSynonymousNames was originally called.
However unlikely it is that a name string will map to multiple concepts
within a single authoritative list (particularly if that list is for a
single class), there is really no feasible way that TOS can guarantee
that a name string will not map to multiple concepts with equal scores.
This is due to the limitation of having only string matching
algorithms at our disposal when matching a name string to a concept.
Regardless of the algorithm chosen (currently n-gram based) or set of
algorithms with a voting or priority system, there will always be the
possibility that a given name string will match equally to more than one
concept within a particular authoritative list. The end result is that
it is at least feasible that getSynonymousNames will have some overlap
between concepts. If this were to occur (and it will be interesting to
see if it does with the MaNIS data), it seems that the individual doing
the analysis should have a few options. Given name x matches equally to
concepts 1 and 2 the following options may be appropriate:
1) remove x from the synonymy lists of concepts 1 and 2 because the
data resulting from queries using x cannot be reliably associated with
either concept 1 or 2.
2) leave the overlap on x because the data resulting from queries using
x is equally likely to be applicable for concepts 1 and 2 (a bit risky
in my opinion)
3) remove x selectively from one or the other of the synonymy lists.
In other words, the individual doing the study is assuming that the data
resulting from a query on x is more likely associated with one or the
other of the concepts. This seems the most dangerous because the
individual could be selectively removing the name from the wrong
synonymy list (ex. remove x from concept 1, but the identifier of the
specimen meant concept 2).
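The three options above can be sketched roughly as follows. This is only
an illustration, not TOS code: the class and method names
(SynonymOverlap, overlappingNames, removeOverlaps, keepOnlyFor) and the
representation of synonymy lists as a map from concept GUID to a set of
names are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SynonymOverlap {

    // Names that appear in the synonymy list of more than one concept.
    public static Set<String> overlappingNames(Map<String, Set<String>> namesByConcept) {
        Set<String> seen = new HashSet<>();
        Set<String> overlapping = new TreeSet<>();
        for (Set<String> names : namesByConcept.values()) {
            for (String name : names) {
                // add() returns false if another concept already listed this name
                if (!seen.add(name)) {
                    overlapping.add(name);
                }
            }
        }
        return overlapping;
    }

    // Option 1: drop every overlapping name from all synonymy lists.
    public static Map<String, Set<String>> removeOverlaps(Map<String, Set<String>> namesByConcept) {
        Set<String> overlapping = overlappingNames(namesByConcept);
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : namesByConcept.entrySet()) {
            Set<String> kept = new TreeSet<>(e.getValue());
            kept.removeAll(overlapping);
            result.put(e.getKey(), kept);
        }
        return result;
    }

    // Option 3: the analyst decides that a name belongs to one concept only.
    public static Map<String, Set<String>> keepOnlyFor(
            Map<String, Set<String>> namesByConcept, String name, String chosenGuid) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : namesByConcept.entrySet()) {
            Set<String> kept = new TreeSet<>(e.getValue());
            if (!e.getKey().equals(chosenGuid)) {
                kept.remove(name);
            }
            result.put(e.getKey(), kept);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical GUIDs and names; "x" matches both concepts equally.
        Map<String, Set<String>> lists = new LinkedHashMap<>();
        lists.put("concept-1", new TreeSet<>(Arrays.asList("x", "a")));
        lists.put("concept-2", new TreeSet<>(Arrays.asList("x", "b")));

        System.out.println("Overlap:  " + overlappingNames(lists));
        System.out.println("Option 1: " + removeOverlaps(lists));
        System.out.println("Option 2: " + lists); // leave the overlap in place
        System.out.println("Option 3: " + keepOnlyFor(lists, "x", "concept-2"));
    }
}
```

Option 2 needs no code because the lists are used as returned. Whichever
option is chosen, detecting the overlap first at least makes the decision
an explicit one for the person running the analysis.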
That is my humble opinion on the issue of synonymy lists with
overlapping names. Until data sources consistently provide more
information with which TOS can accomplish finer-grained concept matching,
this will continue to be a problem.
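To illustrate why string matching alone can produce ties, here is a
minimal sketch. It assumes a Dice coefficient over character trigrams,
which is only a stand-in: the actual n-gram scoring used in TOS is not
described here. The class name NGramTie, its methods, and the name
strings are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NGramTie {

    // Set of character n-grams of a lower-cased string (no padding).
    public static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + n <= t.length(); i++) {
            grams.add(t.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2 * |shared n-grams| / (|n-grams of a| + |n-grams of b|).
    public static double dice(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        int total = ga.size() + gb.size();
        if (total == 0) {
            return 0.0; // both strings shorter than n
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return 2.0 * shared.size() / total;
    }

    // All candidates sharing the highest score; more than one means a tie.
    public static List<String> bestMatches(String query, List<String> candidates, int n) {
        double best = -1.0;
        List<String> winners = new ArrayList<>();
        for (String candidate : candidates) {
            double score = dice(query, candidate, n);
            if (score > best) {
                best = score;
                winners.clear();
                winners.add(candidate);
            } else if (score == best) {
                winners.add(candidate);
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        // Made-up name strings: the query differs from each candidate by one letter.
        List<String> candidates = Arrays.asList("Aus bas", "Aus bus", "Cus dus");
        System.out.println(bestMatches("Aus bxs", candidates, 3));
    }
}
```

In this example the misspelled query shares the same number of trigrams
with both "Aus bas" and "Aus bus", so both come back as best matches.
Without additional information such as an author or date, the string
alone gives no basis for preferring one over the other.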
Cheers,
Rob
>
> This addition is important to make sure that we are not re-using
> collection records in multiple species runs of the ENM.
>
> Thanks.
> Matt
>
> Robert Gales wrote:
>
>>Hi Matt,
>>
>>After our phone discussion, I thought we had agreed that the best route
>>would be for Taxon to modify (or add to) our SOAP API such that
>>simple types and arrays of simple types could be used to take advantage
>>of the web services actor. So after the phone call, I did exactly that,
>>adding any methods that the niche modelling case would require so that
>>they act on only simple types. During that time, I also completely
>>isolated the SOAP layer from the business logic, which has been a
>>maintenance pain for quite some time.
>>
>>I ran quite a number of tests with the web service actor to become
>>familiar with it and to ensure that everything we needed was working
>>correctly. I found a few bugs in the handling of arrays, which I fixed,
>>and sent a patch to Ilkay, who forwarded it to Nandita. There is still
>>one noticeable problem I found recently in how it deals with in/out
>>parameters: it assigns the ports the same name, which
>>causes a failure because multiple ports with the same name are not
>>allowed in Kepler/Ptolemy.
>>
>>As of now, we have all but one of the necessary API methods implemented,
>>some of which (getAuthoritativeList in particular) required some changes
>>in the database/hibernate model. The one that is still under
>>investigation is get best concept from a name. Currently Aimee's
>>working on a dictionary system to handle and match misspellings to the
>>most appropriate concepts; once it has been completed, the remainder of
>>the implementation should be trivial. This is the key API method
>>however, because it will be the entry point into the TOS for the niche
>>modeling case.
>>
>>The workflow for the TOS portion of the niche modeling use case as I
>>understood it from Estes Park looks like the following:
>>
>>get the best concept according to ITIS for Mammalia (TOS responsibility)
>> * returns the guid of the Mammalia concept from ITIS
>>
>>given the returned guid, get the authoritative list at the level of
>>species from the subtree rooted at the guid and within ITIS'
>>classification (TOS responsibility)
>> * returns a list of the guids of the species that are descendants
>> of Mammalia according to ITIS
>>
>>using the array actors in kepler, iterate through the list, calling
>>get synonymous names with each guid (TOS responsibility)
>> * returns a list of all names that TOS knows about that have been
>> associated with that guid, this includes the names pulled from
>> MaNIS
>>
>>use those names to create the DiGIR query
>>
>>All of those methods have been implemented such that they take simple
>>types and return one or more simple types or arrays of simple types in
>>order to work with the web services actor.
>>
>>This is what I understood needed to be done from the Estes Park meeting
>>and our phone call. If something has changed since then that requires
>>custom TOS actors that use our complicated object model, then I'll
>>redirect my code cleanup/maintenance enhancing/testing efforts back
>>towards the Kepler actor issue.
>>
>>- Rob
>>
>>Matt Jones wrote:
>>
>>
>>>Hi Rob,
>>>
>>>I was wondering what the status is on this project. Have you started
>>>work on the Kepler actors? If not, has something been getting in the
>>>way in the TOS work? Thanks for the info.
>>>
>>>Matt
>>>
>>
>