[kepler-users] Stop and resume execution

Wed Sep 17 17:06:17 PDT 2008

Hi,

First, please note that while COMAD has some promise as a technology  
for supporting stopping and resuming of workflows, capabilities for  
doing this with COMAD currently do not exist.  It is a (proposed)  
research area where COMAD is concerned.  In short, COMAD does not  
solve this problem at this time.  If you still want to try out COMAD  
please read on and feel free to contact me for additional information.

You can find the COMAD actors and support classes in a couple of  
different ways, depending on how you are working with Kepler.

If you are developing with Kepler you can find the source files for  
the COMAD-related capabilities in Kepler under src/org/nddp.  This is  
where you will find them if you have, say, checked out the Kepler 1.0  
release from the subversion repository.  There are sample workflows  
and data sets under workflows/nddp.

A newer version of this code is stored in the part of the subversion  
repository set aside for Kepler extensions:  
https://code.kepler-project.org/code/kepler/modules/.  If you want to build this code the  
best way to do this for now is to follow the instructions in this  
tutorial on using a new build system for developing Kepler  
extensions.  The code in the extension area will supplant the code in  
the main source area for Kepler as soon as we have a way of easily  
packaging and sharing extensions to Kepler, at which point the  
procedure will be to download and install Kepler, then download and  
install the comad extension.

However, if you just want to try out some sample Kepler workflows  
based on COMAD, the easiest thing to do is to download and run the  
Kepler/pPOD preview release for OS X.  Again, I'll be happy to answer  
questions about the COMAD capabilities demonstrated in this "preview  
release".

You can also read about the pPOD preview release and COMAD in the  
latest Kepler newsletter.

Cheers,

Tim

On Sep 10, 2008, at 8:45 AM, Josep Maria Campanera Alsina wrote:

> Hi all,
> I'm back again, I'm extremely interested also in these extension
> actors that are able to manage the execution of a Kepler workflow.
>
> But where can I find the COMAD and "Smart rerun" actors? I haven't
> been able to locate them in the Kepler repository.
>
> All the best,
>
> Josep Maria,
>
>> Date: Tue, 26 Aug 2008 05:59:27 -0700
>> From: "Bertram Ludaescher" <ludaesch at ucdavis.edu>
>> Subject: Re: [kepler-users] Stop and resume execution...
>> To: "Quentin BEY" <quentin.bey at onera.fr>
>> Cc: kepler-users at ecoinformatics.org
>> Message-ID:
>>        <657a810a0808260559x305d349ahc64dcdf60e95fd88 at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hi Quentin:
>>
>> Interesting question! There are several answers to this.
>>
>> First, "knowing which actor was executing" is generally not enough to
>> resume execution:
>> Consider a workflow executing with a PN (process network) director.
>> Then all actors execute as independent processes (Java threads
>> really), so all are executing simultaneously.
>> In contrast, a (sub-)workflow executing under SDF or DDF will be
>> executed within a single thread, so at most one actor is executing at
>> a given time in such a workflow.
>> (SDF creates a schedule "statically", i.e., prior to workflow
>> execution, while DDF figures out which actors are ready to fire at
>> runtime, then selects one and repeats)
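
The difference between a schedule fixed before execution and one worked out at runtime can be sketched in a few lines of Python. This is a toy illustration only, not Kepler or Ptolemy code: the three-actor pipeline, `make_pipeline`, `run_static`, and `run_dynamic` are all hypothetical names, and the "pick the first enabled actor" rule is an assumption made for simplicity.

```python
from collections import deque

def make_pipeline(inputs):
    """Build a toy source -> double -> sink pipeline (all names are illustrative)."""
    pending = deque(inputs)      # tokens the source has yet to emit
    a, b = deque(), deque()      # channels: source->double and double->sink
    results = []
    actors = {
        "source": lambda: a.append(pending.popleft()),
        "double": lambda: b.append(2 * a.popleft()),
        "sink": lambda: results.append(b.popleft()),
    }
    ready = {
        "source": lambda: bool(pending),
        "double": lambda: bool(a),
        "sink": lambda: bool(b),
    }
    return actors, ready, results

def run_static(schedule, actors, iterations):
    """SDF-like: the firing order is decided before execution and simply repeated."""
    trace = []
    for _ in range(iterations):
        for name in schedule:
            actors[name]()
            trace.append(name)
    return trace

def run_dynamic(actors, ready, max_firings):
    """DDF-like: at each step find the actors able to fire, pick one, repeat."""
    trace = []
    for _ in range(max_firings):
        enabled = [name for name in actors if ready[name]()]
        if not enabled:
            break                # nothing can fire, so the run is over
        actors[enabled[0]]()     # assumption: always pick the first enabled actor
        trace.append(enabled[0])
    return trace
```

Both runners fire one actor at a time in a single thread, so the returned trace always identifies a single "current" actor. Under a PN-style director, where every actor runs in its own thread, no such single position exists.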
>>
>> But what you really need is to maintain the "workflow state" (or some
>> part of it) persistently, so that you can resume a stopped or failed
>> workflow. One general way to do this is checkpointing, i.e., writing
>> relevant information out to disk at certain times. While checkpointing
>> can be very costly in general applications, in scientific workflows it
>> is often easier, since components are usually loosely coupled, all
>> information flow is visible via the channels (unless you cause side
>> effects outside the model), and actors are often (but not always)
>> stateless.
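
A minimal checkpointing sketch in Python, assuming a linear sequence of items processed by one stateless step. It is not a Kepler feature: `run_with_checkpoint`, the JSON file layout, and the `next`/`outputs` keys are hypothetical choices made for this illustration.

```python
import json
import os

def run_with_checkpoint(items, step, path):
    """Apply step to each item, saving progress to path after every item."""
    state = {"next": 0, "outputs": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)           # resume from the last saved state

    for i in range(state["next"], len(items)):
        state["outputs"].append(step(items[i]))
        state["next"] = i + 1
        # Write to a temporary file and rename it, so that a crash during
        # the write does not leave a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    return state["outputs"]
```

Calling the function again with the same `path` after a failure skips the items already recorded and continues with the first unfinished one. This only works because `step` is assumed to be stateless; a stateful step would also need its internal state saved.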
>>
>> I'm aware of several extensions that allow one to resume Kepler
>> workflows (I think Ptolemy might have further ways):
>>
>> -- One system has been called "smart rerun" (e.g. Ilkay Altintas or
>> Dan Crawl can point you to it) and allows you to rerun a workflow with
>> modified inputs and/or parameter settings, avoiding re-execution of
>> parts that are "unchanged". I don't recall whether it handles only
>> successful workflow runs (and optimizes their re-execution under
>> change) or also partial (aborted) runs.
>>
>> -- Norbert Podhorszki has developed workflows where actors themselves
>> write out to disk some small information (in his case: remote commands
>> that successfully terminated) which is used upon re-running the
>> workflow to only execute the commands not yet successfully completed
>> previously. Call this the "custom checkpointing" solution (instead of
>> a general system extension, individual actors or workflows decide what
>> to checkpoint; more work, but it can be more efficient to know what is
>> needed to rerun).
>>
>> -- One new director and workflow programming model called COMAD makes
>> most if not all of the execution state visible "on the wire" by
>> streaming nested data collections between actors. Like in other
>> approaches, the information on the wire can be written to disk and the
>> workflow resumed based on this info.
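
The "custom checkpointing" idea from the list above (log each command that terminated successfully, and skip it on the next run) might look roughly like this in Python. This is a sketch under assumptions, not Norbert Podhorszki's actual actors: `run_commands`, the one-command-per-line log format, and the requirement that commands be unique single-line strings are all invented for the illustration.

```python
import os

def run_commands(commands, execute, log_path):
    """Run every command not already listed in the completion log at log_path."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path) as f:
            done = {line.rstrip("\n") for line in f}

    executed = []
    for cmd in commands:
        if cmd in done:
            continue                  # completed in an earlier run, so skip it
        execute(cmd)                  # if this raises, the command is not logged
        with open(log_path, "a") as f:
            f.write(cmd + "\n")       # record success only after execute() returns
        executed.append(cmd)
    return executed
```

The log holds only command names rather than data, which is why this kind of checkpoint can stay small; the cost is that each workflow has to decide for itself what counts as "done".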
>>
>> All these approaches are based on recording information on disk
>> during runtime (sometimes called 'provenance information'), which is
>> then used when resuming the workflow.
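
One way to picture how recorded information can let a rerun skip "unchanged" parts is to key each step's saved result on its inputs and parameters. The Python sketch below is an assumption-laden illustration and does not describe how the "smart rerun" system is actually implemented: `cached_step`, the SHA-256 key, and the one-JSON-file-per-result layout are hypothetical.

```python
import hashlib
import json
import os

def cached_step(name, func, inputs, params, cache_dir):
    """Run func(inputs, params) unless an identical call was already recorded on disk.

    Returns (result, reused), where reused is True if the result came from disk.
    """
    key_src = json.dumps({"step": name, "inputs": inputs, "params": params},
                         sort_keys=True)
    key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")

    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True      # same step, inputs, and parameters as before
    result = func(inputs, params)
    with open(path, "w") as f:
        json.dump(result, f)               # record the result for later runs
    return result, False
```

Changing either the inputs or a parameter produces a different key, so only the affected step is recomputed. The sketch assumes inputs, parameters, and results are JSON-serializable and small; for large data, the persistent ids mentioned later in the message would be stored in place of the data itself.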
>>
>> The above options are not the only ones (e.g. Ptolemy probably has
>> additional ways to restart a failed model). Which variant to choose
>> (or which new variant to develop) may depend on, among other things:
>> -- the size of data flowing through channels (or the availability of
>> persistent ids to large chunks of data)
>> -- whether actors are stateful or stateless
>> -- the director(s) programming/execution model being used
>>
>> Bertram
>>
>>
>> On Tue, Aug 26, 2008 at 2:22 AM, Quentin BEY  
>> <quentin.bey at onera.fr> wrote:
>>
>>> Hi all,
>>>
>>> Once again I need help about Kepler's possibilities.
>>>
>>> I wonder if we can stop a workflow, quit Kepler, then reopen Kepler
>>> and resume the workflow. For instance, if a workflow that takes a
>>> long time to execute stops because the computer shuts down (for
>>> whatever reason), and we know which actor was executing, is there a
>>> simple way to resume execution from this actor?
>>>
>>>
>>> Thanks in advance,
>>>
>>>
>>> Quentin BEY -ONERA- France
>>>
>>>