[kepler-dev] [Bug 3578] New: - optimize timing of data download by EML and other data source actors
bugzilla-daemon at ecoinformatics.org
Wed Oct 29 11:58:39 PDT 2008
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=3578
Summary: optimize timing of data download by EML and other data
source actors
Product: Kepler
Version: 1.0.0
Platform: Other
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: data access
AssignedTo: tao at nceas.ucsb.edu
ReportedBy: jones at nceas.ucsb.edu
QAContact: kepler-dev at ecoinformatics.org
The EML actor and some other data source actors download data when the actor is
dragged and dropped onto the workflow canvas or, for a previously saved
workflow, when the workflow is opened. For the EML actor, the Object Manager is
first checked to see whether the data file is already cached locally; if it is
not, the file is retrieved from the server. This timing is not always optimal,
because in some situations it makes sense to delay the download until later in
the workflow execution cycle.
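The cache-then-fetch behavior described above can be sketched roughly as
follows. This is only an illustration, not the Object Manager's actual API:
RemoteSource, LocalDataCache, and resolve are hypothetical names, and the data
identifier is assumed to be a simple file name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for whatever actually retrieves data from the server.
interface RemoteSource {
    InputStream open(String dataId) throws IOException;
}

// Hypothetical local cache; not the real Object Manager.
final class LocalDataCache {
    private final Path cacheDir;
    private final RemoteSource remote;

    LocalDataCache(Path cacheDir, RemoteSource remote) {
        this.cacheDir = cacheDir;
        this.remote = remote;
    }

    // Returns the local path for dataId, downloading only on a cache miss.
    // Assumes dataId is a plain file name without path separators.
    Path resolve(String dataId) throws IOException {
        Path cached = cacheDir.resolve(dataId);
        if (Files.exists(cached)) {
            return cached; // cache hit: no network access
        }
        Files.createDirectories(cacheDir);
        try (InputStream in = remote.open(dataId)) {
            Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
        }
        return cached;
    }
}
```

The point at which resolve is called (canvas drop, workflow open, execution
start, or actor firing) is exactly the timing question this bug is about.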
Case 1: Optimal download at workflow loading time
When data objects are large but must be fully retrieved, it is best to
retrieve the object as early as possible to avoid delaying the workflow
execution.
Case 2: Optimal loading upon execution
When data objects are large but might not need to be fully downloaded (e.g.,
because a filtering SQL query can be run on the remote host), it is better to
postpone the download until after the user has fully configured the actor,
which should be complete by the time of workflow execution. Unfortunately, the
EML actor does not yet support remote data subsetting, so there is currently no
mechanism to support this case. When the Data Manager library is reincorporated
in Kepler, this should become both possible and desirable.
Case 3: Optimal loading upon actor firing
When data objects are large and the actor is part of a distributed workflow, it
is better to postpone loading data until the actor fires, because the actor may
actually execute on a slave node rather than on the master. Downloading
prematurely may cause the master to fetch data that only one or more slave
nodes actually need locally.
There are probably other cases as well. The hard part is how Kepler can
differentiate these cases with minimal user input in order to decide which case
applies and therefore optimize the timing of the download via appropriate
default behaviors for each case.
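As a starting point for discussion, the three cases could map onto a default
policy along the following lines. This is a minimal sketch under stated
assumptions: DownloadTiming, DownloadTimingChooser, and the two boolean inputs
are hypothetical and do not exist in Kepler, and the precedence (distributed
execution first, then remote subsetting) is one possible choice rather than a
settled design.

```java
// Hypothetical lifecycle points at which a data source actor could download.
enum DownloadTiming {
    ON_WORKFLOW_LOAD,     // Case 1: fetch as early as possible
    ON_WORKFLOW_EXECUTE,  // Case 2: fetch once the actor is fully configured
    ON_ACTOR_FIRE         // Case 3: fetch on the node where the actor runs
}

final class DownloadTimingChooser {
    private DownloadTimingChooser() {}

    // Picks a default timing from two assumed inputs:
    //   distributed       - the actor may run on a slave node
    //   remoteSubsetting  - the source can filter data on the remote host
    static DownloadTiming choose(boolean distributed, boolean remoteSubsetting) {
        if (distributed) {
            return DownloadTiming.ON_ACTOR_FIRE;
        }
        if (remoteSubsetting) {
            return DownloadTiming.ON_WORKFLOW_EXECUTE;
        }
        return DownloadTiming.ON_WORKFLOW_LOAD;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, false));
        System.out.println(choose(false, true));
        System.out.println(choose(true, true));
    }
}
```

Both inputs would still have to be inferred from the workflow or supplied by
the user, which is the open problem noted above.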