[kepler-dev] Simple Search
Kevin Ruland
kruland at ku.edu
Fri Nov 4 07:16:11 PST 2005
Bertram et al.,
I'll answer a couple of these in terms of the Digir service.
Bertram Ludaescher wrote:
>>>>On Thu, 03 Nov 2005 14:19:49 -0900
>>>>Matt Jones <jones at nceas.ucsb.edu> wrote:
>>>>
>>>>
>MJ>
>MJ> Yeah, we debated whether or not to have the source button so visible or
>MJ> whether to bury it in prefs. Burying it means few or no users would
>MJ> regularly use it. As the distributed searches can take a while,
>
>Is someone looking at the issue why? And how to solve it?
>(a special kind of bug for sure -- make the system faster ;-)
>
>MJ> sometimes its nice to disable several EcoGrid nodes so that the results
>MJ> from a single node return faster.
>
>I think that's only a workaround, not a solution. Could we change (not
>today, but in the future) the result return protocol of the EcoGrid to
>a "streaming mode"? That is, results appear as they come in?
>
>
This could actually be pretty difficult. For Digir, results are
organized by species scientific name; however, the results coming back
from the service are unordered (well, actually they are ordered by
service provider). The other problem is the notorious threading
issues found throughout the code. We'd essentially have to use a
SAX-type parser (currently it uses a DOM) and at the appropriate times
push results into the individual result sets. Of course, this is what
we should be doing anyway for memory considerations. The reason this is
such a big change is that we are currently using JAX-RPC bindings in the
client code, so the client completely parses the SOAP result into
JAX-RPC binding classes prior to any processing.
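As a rough illustration of the SAX-style approach described above, the
sketch below reads a response as a stream and pushes each record into a
per-species result set as soon as the record closes, without building a
DOM first. It is only a sketch under assumptions: the element names
(`record`, `ScientificName`), the `id` attribute, the class name and the
sample data are all hypothetical and are not taken from the Digir or
resultset schemas.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingGrouper {

    /** Parses the stream with SAX and groups record ids by scientific name. */
    static Map<String, List<String>> group(InputStream in) {
        final Map<String, List<String>> bySpecies = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String id;    // id of the record currently being read
            private String name;  // scientific name seen inside that record

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0);  // start collecting text for this element
                if ("record".equals(qName)) {  // hypothetical element name
                    id = atts.getValue("id");
                    name = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("ScientificName".equals(qName)) {  // hypothetical element name
                    name = text.toString().trim();
                } else if ("record".equals(qName) && name != null) {
                    // The record is complete: push it into its result set now.
                    bySpecies.computeIfAbsent(name, k -> new ArrayList<>()).add(id);
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse response", e);
        }
        return bySpecies;
    }

    public static void main(String[] args) {
        // Made-up sample response, ordered by provider rather than by species.
        String xml = "<response>"
                + "<record id=\"a1\"><ScientificName>Mephitis mephitis</ScientificName></record>"
                + "<record id=\"a2\"><ScientificName>Mephitis macroura</ScientificName></record>"
                + "<record id=\"b7\"><ScientificName>Mephitis mephitis</ScientificName></record>"
                + "</response>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(group(in));
    }
}
```

In a real client the handler would hand each finished record to the UI
or to a result-set object rather than to an in-memory map, and doing so
would mean bypassing the generated JAX-RPC binding classes.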
>And why do EG-searches take so long in the first place (first-time or
>not..)? Do we have an analysis where the time is spent?
>
>Seems we might have an indexing challenge somewhere.. Our searches are
>by keyword or concept name it seems, but not complex nested-queries
>with cyclic, multi-way joins and aggregations ... So why aren't they
>blindingly fast?
>
>
Digir queries are actually distributed. When the user submits an
Ecogrid-Digir query, the query string is essentially sent to *all*
available digir providers. At any given time there are 300 or more
digir providers available. The EcoGrid service code waits for all
providers to return results before collating them into the complete
result set. This complete result set is then transformed from the Digir
schema into the resultset schema before being returned to the Kepler
client. You can imagine this takes some time. In fact, the query as a
whole is severely network bound -- it takes 3 minutes or more just to
transfer the Mephitis result set from one host to another over a
reliable 10 Mb/s local Ethernet connection (the switch is under my desk).
The old digir service (which everyone but me is using) had 12 or so
hard-coded providers so it was surprisingly fast. One could argue that not
all available providers should be queried, but then the question is
which ones should be?
I do plan on trying to remove one network hop from the Ecogrid service
by essentially embedding the digir provider client code into the Ecogrid
server code. This will give me control over the threading and allow me
to parallelize the code more effectively.
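For illustration only, here is one way the provider fan-out could be
parallelized once the threading is under the server's control. It uses
a fixed thread pool and an ExecutorCompletionService so that each
provider's reply is handled as soon as it arrives instead of after the
slowest provider finishes. The class name, the `query` function, the
provider names and the pool size are hypothetical stand-ins, not the
actual Ecogrid or digir client API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Function;

class ProviderFanOut {

    /**
     * Queries every provider in parallel and passes each reply to onResult
     * in completion order. onResult runs on the calling thread only.
     * Returns the number of providers whose query failed.
     */
    static int queryAll(List<String> providers,
                        Function<String, List<String>> query,
                        Consumer<List<String>> onResult,
                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<List<String>> done = new ExecutorCompletionService<>(pool);
        try {
            for (String provider : providers) {
                done.submit(() -> query.apply(provider));
            }
            int failures = 0;
            for (int i = 0; i < providers.size(); i++) {
                try {
                    // take() blocks until the next provider (any one) has finished.
                    onResult.accept(done.take().get());
                } catch (ExecutionException e) {
                    failures++;  // one bad provider does not stop the others
                }
            }
            return failures;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for providers", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical provider names; the query function fakes one failure.
        List<String> providers = Arrays.asList("providerA", "providerB", "providerDown");
        int failed = queryAll(
                providers,
                p -> {
                    if (p.equals("providerDown")) {
                        throw new IllegalStateException("no response");
                    }
                    return Collections.singletonList(p + ": 1 record");
                },
                records -> System.out.println("got " + records),
                2);
        System.out.println("failed providers: " + failed);
    }
}
```

Because the callback runs only on the calling thread, the collating
code does not need to be thread-safe. A production version would also
want a per-provider timeout so that one unresponsive provider cannot
hold up the whole query.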
I am uncertain at this time if I can stream results back to the client
code because of some impedance problems between the resultset schema and
the digir processing.
>MJ> This will be more needed as the
>MJ> number of nodes grows -- but it also places a burden onto the user to
>MJ> know what nodes they want or need to search. Today its pragmatically
>
>right; we can't outsource/work-around the problem that way..
>
>MJ> useful, but it would be nice if we could eliminate it altogether and
>MJ> simply direct searches to the most appropriate nodes automatically. For
>
>.. and use indexing. Maybe even a warehousing approach. One big
>warehouse with efficient index structures for the kinds of searches we
>do.. can't beat that one I think ;)
>
>
Having a Digir warehouse would probably be a big benefit to the
scientific community as a whole. This is something Dave should consider.
Kevin