[kepler-users] Stop and resume execution...

Bertram Ludaescher ludaesch at ucdavis.edu
Tue Aug 26 05:59:27 PDT 2008


Hi Quentin:

Interesting question! There are several answers to this.

First, "knowing which actor was executing" is generally not enough to resume
execution:
Consider a workflow executing with a PN (process network) director. Then all
actors execute as independent processes (Java threads really), so all are
executing simultaneously.
In contrast, a (sub-)workflow executing under SDF or DDF will be executed
within a single thread, so at most one actor is executing at a given time in
such a workflow.
(SDF creates a schedule "statically", i.e., prior to workflow execution,
while DDF figures out at runtime which actors are ready to fire, then
selects one, fires it, and repeats.)
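
To make the DDF-style loop concrete, here is a toy sketch in Python (not
Kepler or Ptolemy code, which is Java). The function `run_ddf`, the layout of
the `actors` dictionary, and the rule "pick the first ready actor" are all
assumptions made for illustration; a real director is considerably more
involved.

```python
from collections import deque

def run_ddf(actors, queues, max_firings=100):
    """Toy dynamic scheduler (hypothetical, not the Ptolemy DDF director).

    actors: name -> (input_queue_names, output_queue_names, func)
    queues: queue name -> deque of tokens
    Returns the list of actor names in the order they fired.
    """
    fired = []
    for _ in range(max_firings):
        # An actor is ready when every one of its input queues holds a token.
        ready = [name for name, (ins, _, _) in actors.items()
                 if ins and all(queues[q] for q in ins)]
        if not ready:
            break  # nothing can fire any more
        name = ready[0]  # assumption: simply pick the first ready actor
        ins, outs, func = actors[name]
        args = [queues[q].popleft() for q in ins]
        result = func(*args)
        for q in outs:
            queues[q].append(result)
        fired.append(name)
    return fired
```

Only one actor fires per loop iteration, so "which actor was executing" is
well defined here, unlike under PN where every actor has its own thread.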

But what you really need is to maintain the "workflow state" (or some part
of it) persistently, so that you can resume a stopped or failed workflow.
One general way to do this is checkpointing, i.e., writing relevant
information out to disk at certain times. While checkpointing can be very
costly in general applications, it is often easier in scientific workflows:
components are usually loosely coupled, all information flow is visible via
the channels (unless you do some side-effects outside the model), and actors
are often (but not always) stateless.
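
A minimal sketch of the checkpointing idea, in Python and not using any
Kepler API. The names `save_checkpoint`, `load_checkpoint` and `run_pipeline`
are hypothetical. It assumes stateless steps connected in a straight line and
JSON-serializable tokens, so the only state worth saving is the index of the
next step and the token currently on the channel.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

def run_pipeline(steps, initial_token, path):
    """Run stateless steps in order, checkpointing after each completed step."""
    state = load_checkpoint(path) or {"next_step": 0, "token": initial_token}
    for i in range(state["next_step"], len(steps)):
        # If this step raises, the checkpoint still describes the last good state.
        state = {"next_step": i + 1, "token": steps[i](state["token"])}
        save_checkpoint(path, state)
    return state["token"]
```

Calling `run_pipeline` again with the same checkpoint path after a crash
skips the steps that already completed. With stateful actors this would not
be enough, since their internal state would have to be saved as well.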

I'm aware of several extensions that allow one to resume Kepler workflows (I
think Ptolemy might have further ways):

-- One system has been called "smart rerun" (e.g. Ilkay Altintas or Dan
Crawl can point you to it) and allows you to rerun a workflow with modified
inputs and/or parameter settings, without re-executing parts that are
"unchanged". I don't recall whether it handles only successful workflow runs
(and optimizes their re-execution under change) or also partial (aborted)
runs.

-- Norbert Podhorszki has developed workflows where the actors themselves
write a small amount of information to disk (in his case: the remote commands
that terminated successfully). When the workflow is re-run, this information
is used to execute only the commands that have not yet completed
successfully. Call this the "custom checkpointing" solution (instead of a
general system extension, individual actors or workflows decide what to
checkpoint; more work, but it can be more efficient to know what is needed to
rerun).

-- One new director and workflow programming model called COMAD makes most
if not all of the execution state visible "on the wire" by streaming nested
data collections between actors. As in the other approaches, the information
on the wire can be written to disk and the workflow resumed based on this
info.
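
The "custom checkpointing" idea from the second item can be sketched roughly
as follows. This is Python written for illustration, not Norbert's actual
actors: `completed_commands`, `run_commands` and the one-command-per-line log
format are assumptions, and commands are assumed to be unique single-line
strings.

```python
import os

def completed_commands(log_path):
    """Return the set of commands recorded as successfully completed."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as f:
        return {line.rstrip("\n") for line in f if line.strip()}

def run_commands(commands, execute, log_path):
    """Execute only commands not yet in the log; log each one after success.

    `execute` is a caller-supplied function that raises on failure.
    Returns the list of commands actually executed in this run.
    """
    done = completed_commands(log_path)
    executed = []
    for cmd in commands:
        if cmd in done:
            continue  # finished in an earlier run, skip it
        execute(cmd)  # may raise; a failed command is never logged
        with open(log_path, "a") as f:
            f.write(cmd + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the record survives a crash
        executed.append(cmd)
    return executed
```

Re-running the same command list with the same log file after a failure
picks up at the first command that is not in the log.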

All these approaches are based on recording information on disk during
runtime (sometimes called 'provenance information'), which is then used when
resuming the workflow.

The above options are not the only ones (e.g. Ptolemy probably has
additional ways to restart a failed model). Which variant to choose (or
which new variant to develop) may depend on, among other things:
-- the size of data flowing through channels (or the availability of
persistent ids to large chunks of data)
-- whether actors are stateful or stateless
-- the director(s) programming/execution model being used

Bertram


On Tue, Aug 26, 2008 at 2:22 AM, Quentin BEY <quentin.bey at onera.fr> wrote:

> Hi all,
>
> Once again I need help about Kepler's possibilities.
>
> I wonder if we can stop a workflow, quit Kepler, then reopen Kepler and
> resume the workflow. For instance, suppose a workflow which takes a long
> time to execute stops because the computer shuts down (for a reason we
> don't know). If we know which actor was executing, is there a simple way
> to resume execution from this actor?
>
>
> Thanks in advance,
>
>
> Quentin BEY -ONERA- France
>
> _______________________________________________
> Kepler-users mailing list
> Kepler-users at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-users
>
>