[kepler-dev] Simple Search

Fri Nov 4 07:16:11 PST 2005

Bertram et al.,

I'll answer a couple of these in terms of the Digir service.

Bertram Ludaescher wrote:

>>>>On Thu, 03 Nov 2005 14:19:49 -0900
>>>>Matt Jones <jones at nceas.ucsb.edu> wrote: 
>>>>        
>>>>
>MJ> 
>MJ> Yeah, we debated whether or not to have the source button so visible or 
>MJ> whether to bury it in prefs.  Burying it means few or no users would 
>MJ> regularly use it.  As the distributed searches can take a while, 
>
>Is someone looking at the issue why?  And how to solve it?
>(a special kind of bug for sure -- make the system faster ;-)
>
>MJ> sometimes it's nice to disable several EcoGrid nodes so that the results 
>MJ> from a single node return faster.  
>
>I think that's only a workaround, not a solution. Could we change (not
>today, but in the future) the result return protocol of the EcoGrid to
>a "streaming mode"? That is, results appear as they come in? 
>  
>
This could actually be pretty difficult.  For Digir, results are 
organized by species scientific name, but the results coming back 
from the service are unordered (well, actually they are ordered by 
service provider).  The other problem is the notorious threading 
issues found throughout the code.  We'd essentially have to use a 
SAX-type parser (the code currently uses a DOM) and push results into 
the individual result sets at the appropriate times.  Of course, this 
is what we should be doing anyway for memory reasons.  The reason this 
is such a big change is that we are currently using JAX-RPC bindings 
in the client code, so the client completely parses the SOAP result 
into JAX-RPC binding classes before any processing. 
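
As a rough illustration of the SAX-style approach described above, here is 
a minimal sketch in Python (chosen for brevity; the Kepler client itself is 
Java).  The element names (response, record, name, locality) and the 
provider attribute are hypothetical and are not the real Digir or resultset 
schema.  The point is only that each record is pushed into its per-species 
result set the moment its closing tag is seen, without building a full tree 
first.

```python
import xml.sax
from collections import defaultdict


class RecordHandler(xml.sax.ContentHandler):
    """Groups hypothetical <record> elements by their <name> child as parsed."""

    def __init__(self):
        super().__init__()
        self.by_species = defaultdict(list)  # scientific name -> list of records
        self._record = None   # record being built, or None outside <record>
        self._field = None    # name of the field element currently open
        self._text = []       # text chunks collected for the open field

    def startElement(self, name, attrs):
        if name == "record":
            self._record = {"provider": attrs.get("provider", "")}
        elif self._record is not None:
            self._field = name
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "record":
            # Record complete: push it into its result set right away.
            self.by_species[self._record.get("name", "")].append(self._record)
            self._record = None
        elif self._field == name and self._record is not None:
            self._record[name] = "".join(self._text).strip()
            self._field = None


# Made-up sample, ordered by provider rather than by species.
SAMPLE = b"""<response>
  <record provider="A"><name>Mephitis mephitis</name><locality>CA</locality></record>
  <record provider="A"><name>Mephitis macroura</name><locality>AZ</locality></record>
  <record provider="B"><name>Mephitis mephitis</name><locality>OR</locality></record>
</response>"""

handler = RecordHandler()
xml.sax.parseString(SAMPLE, handler)
for species, records in sorted(handler.by_species.items()):
    print(species, [r["locality"] for r in records])
```

Only one record is held in flight at a time, which is where the memory 
benefit comes from.  It does not address the JAX-RPC binding issue, since 
the bindings parse the whole SOAP body before handing anything over.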

>And why do EG-searches take so long in the first place (first-time or
>not..)? Do we have an analysis where the time is spent? 
>
>Seems we might have an indexing challenge somewhere.. Our searches are
>by keyword or concept name it seems, but not complex nested-queries
>with cyclic, multi-way joins and aggregations ...  So why aren't they
>blindingly fast? 
>  
>
Digir queries are actually distributed.  When the client submits an 
EcoGrid-Digir query, the query string is essentially sent to *all* 
available Digir providers.  At any given time there are 300 or more 
Digir providers available.  The EcoGrid service code waits for all 
providers to return results before collating them into the complete 
result set.  This complete result set is then transformed from the Digir 
schema into the resultset schema before being returned to the Kepler 
client.
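
The difference between waiting for every provider and handing results back 
as they arrive can be sketched as follows.  This is a toy model in Python, 
not the EcoGrid service code: the provider names, the delays, and 
query_provider are all made up, and the "network call" is just a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical providers mapped to a simulated response delay in seconds.
PROVIDERS = {"provider-a": 0.05, "provider-b": 0.30, "provider-c": 0.15}


def query_provider(name, query):
    """Stand-in for a remote provider call; sleeps instead of using the network."""
    time.sleep(PROVIDERS[name])
    return name, [f"{query} record from {name}"]


def search_wait_for_all(query):
    """Return one collated list only after every provider has answered."""
    with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
        futures = [pool.submit(query_provider, p, query) for p in PROVIDERS]
        collated = []
        for future in futures:  # blocks on each future in submission order
            collated.extend(future.result()[1])
    return collated


def search_streaming(query):
    """Yield each provider's records as soon as that provider answers."""
    with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
        futures = [pool.submit(query_provider, p, query) for p in PROVIDERS]
        for future in as_completed(futures):
            yield future.result()


all_results = search_wait_for_all("Mephitis")
arrival_order = [name for name, _ in search_streaming("Mephitis")]
print(len(all_results), arrival_order)
```

In the first function the caller sees nothing until the slowest provider 
has replied; in the second, fast providers are visible immediately.  With 
300 or more providers, one slow provider dominates the first pattern.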

You can imagine this takes some time.  In fact, the query as a whole is 
severely network bound -- it takes 3 minutes or more just to transfer the 
Mephitis result set from one host to another over a reliable 10 Mb/s 
local Ethernet connection (the switch is under my desk).
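
For a sense of scale, the figures above put a ceiling on how much data a 
3-minute transfer could have moved.  This back-of-the-envelope calculation 
assumes the link was fully saturated for exactly 3 minutes; the actual size 
of the result set is not stated, and it would be smaller if the link was 
not saturated.

```python
LINK_MBIT_PER_S = 10        # link speed quoted above
TRANSFER_SECONDS = 3 * 60   # "3 minutes or more", taken as exactly 3 minutes

max_megabits = LINK_MBIT_PER_S * TRANSFER_SECONDS
max_megabytes = max_megabits / 8   # 8 bits per byte
print(max_megabytes)  # 225.0
```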

The old Digir service (which everyone but me is using) had 12 or so 
hard-coded providers, so it was surprisingly fast.  One could argue that 
not all available providers should be queried, but then the question is 
which ones should be.

I do plan on trying to remove one network hop from the EcoGrid service 
by essentially embedding the Digir provider client code into the EcoGrid 
server code.  This will give me control over the threading and allow me 
to parallelize the code more effectively.

I am uncertain at this time whether I can stream results back to the 
client code because of some impedance mismatch between the resultset 
schema and the Digir processing.

>MJ> This will be more needed as the 
>MJ> number of nodes grows -- but it also places a burden onto the user to 
>MJ> know what nodes they want or need to search.  Today it's pragmatically 
>
>right; we can't outsource/work-around the problem that way.. 
>
>MJ> useful, but it would be nice if we could eliminate it altogether and 
>MJ> simply direct searches to the most appropriate nodes automatically.  For 
>
>.. and use indexing. Maybe even a warehousing approach. One big
>warehouse with efficient index structures for the kinds of searches we
>do.. can't beat that one I think ;)
>  
>
Having a Digir warehouse would probably be a big benefit to the 
scientific community as a whole.  This is something Dave should consider.

Kevin