[kepler-dev] extending kepler with workflow optimization algorithms

Fri Jul 8 02:22:30 PDT 2011

Dear Christopher

I was actually thinking of developing a new director for kepler that  
optimizes a workflow using an algorithm that I have recently developed  
as part of my PhD. The characteristics of the optimization algorithm  
are the following:

1.The algorithm takes into account both the cost of the services and  
the number of tokens passed between actors to produce an optimal (in  
terms of cost) workflow.

2.It also accounts decentralized data transfers between the actors,  
i.e., it is assumed that the actors can be deployed anywhere in a  
wide-area infrastructure, while they can exchange data directly with  
varying data transferring costs.

3.Finally, it assumes that the data are exchanged among the actors in  
a pipelined fashion, so that the tuples already processed by an actor  
are processed by the subsequent actor in the workflow at the same time  
as the former processes new input data.

(If you are interested, a technical report presenting the algorithm  
can be found at:   
http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)

Although I have seen that kepler does not support decentralized data  
transfers but only centralized ones (through the DistributedComposite  
actor), i prefer kepler for experimentation because it is extremely  
well documented and can be easily extended to support new functionality.

So, I was actually wondering if types of workflows, such as the ones  
presented in the example of my first email, are met. I have seen many  
scientific workflows but in the majority of them the workflow tasks  
must be in a predefined order, while altering the task ordering either  
does not produce the correct result, or does not change the amount of  
tokens passed between the actors.

I want to discover such workflow types/examples, in order to see if  
such an optimization algorithm would be useful for the community.

Thank you very much for the quick reply

Best regards
Efi

Quoting Christopher Brooks <cxh at eecs.berkeley.edu>:

> Hi Efthymia,
> Scheduling of actor models is a deep topic.
> Kepler uses Ptolemy II as its execution engine.
> Ptolemy II is based on Ptolemy Classic.
> In Ptolemy Classic, we had a number of different synchronous  
> dataflow scheduling
> algorithms, many of these were oriented towards clustering for parallel
> processing.
>
> In your example model, I see there being two costs:
> 1) The cost of the service.  For example FICO might charge more money
> per query than an email lookup service.
> 2) The number of tokens passed between actors varies.
>
> There is a relationship between these two costs where #1 is usually the
> more important cost, but #2 can eventually overrule #1.
>
> Offhand, I don't know of a model of computation that does exactly what
> you want.
>
> In Ptolemy II, models like your example model are typically Kahn
> Process Network (PN) models where each actor is a separate process.
>
> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/
> says:
>
>   "The following are the most important features of the operational  
> semantics of
>   PN as proposed by Kahn and MacQueen:
>
>   * This is a network of sequential processes.
>   * The processes do not share memory. Instead they communicate with  
> each other
>   only through unidirectional FIFO channels.
>   * Communication between processes is asynchronous.
>   * Processes block on a read from a FIFO channel if the channel is empty but
>   can write to a channel whenever they want to."
>
> In PN, there is not really a predefined schedule, the threads  
> operate and then block.
>
> Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
> the number of tokens is known in advance and a schedule is defined  
> in advance.
>
> Dynamic Dataflow (DDF) is between the two, where the number of tokens passed
> between actors is not known in advance.
>
> There is another trivial director called the LeftToRightDirector  
> that fires the actors
> in order from LeftToRight.  This would allow you to drag actors  
> around and try
> different firings.  That director is at
> ptII/doc/tutorial/domains/LeftRightDirector.java
>
> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
> says
> "If you want to use a director not in Kepler tree, you have to use  
> the "Tools/Instantiate Attribute" menu. I use it to get a DDF  
> director frequently (class ptolemy.domains.ddf.kernel.DDFDirector). "
>
> So, in Kepler, you would do Tools-> Instantiate Attribute and then
> enter doc.tutorial.domains.LeftRightDirector, but that does
> not seem to work in Kepler, so you would need to download Ptolemy II via
> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/
>
> The Timed Multitasking model (TM) is somewhat close to what you want
> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/
> says
> --start--
> The timed multitasking (TM) domain, created by Jie Liu, offers a  
> model of computation based on priority-driven multitasking, as  
> common in real-time operating systems (RTOSs), but with more  
> deterministic behavior. In TM, actors (conceptually) execute as  
> concurrent threads in reaction to inputs. The domain provides an  
> event dispatcher, which maintains a prioritized event queue. The  
> execution of an actor is triggered by the event dispatcher by  
> invoking first its prefire() method. The actor may begin execution  
> of a concurrent thread at this time. Some time later, the dispatcher  
> will invoke the fire() and postfire() methods of the actor (unless  
> prefire() returns false).
>
> The amount of time that elapses between the invocation of prefire()  
> and fire() depends on the declared /executionTime/ and /priority/ of  
> the actor (or more specifically, of the port of the port receiving  
> the triggering event). The domain assumes there is a single  
> resource, the CPU, shared by the execution of all actors. At one  
> particular time, only one of the actors can get the resource and  
> execute. Execution of one actor may be preempted by another eligible  
> actor with a higher priority input event. If an actor is not  
> preempted, then the amount of time that elapses between prefire()  
> and fire() equals the declared. /executionTime/. If it is preempted,  
> then it equals the sum of the /executionTime/ and the execution  
> times of the actors that preempt it. The model of computation is  
> more deterministic than the usual priority-driven multitasking  
> because the actor produces outputs (in its fire() method) only after  
> it has been assured access to the CPU for its declared  
> /executionTime/. In this domain, the model time may be synchronized  
> to real time or not.
>
> --end--
>
>
>
> For an overview of the models of computation, see the "Domain  
> Overview" link at
> the top of
> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm
>
> or the first chapter of
>
>
>      http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf
>
>
> _Christopher
>
>
>
>
>
>
> On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:
>> Hello
>>
>> I am a phd student and during this period i am dealing with  
>> workflow optimization problems in distributed environments.  I  
>> would like to ask, if there are exist any cases where if the order  
>> of task invocation in a scientific workflow changes its performance  
>> changes too without, however, affecting the produced results. In  
>> the following, a present a small use case of the problem i am  
>> interested in:
>>
>> Suppose that a company wants to obtain a list of email addresses of  
>> potential customers selecting only those who have a good payment  
>> history for at least one card and a credit rating above some  
>> threshold. The company has the right to use the following web  
>> services
>>
>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>> WS3 : card number (ccn, good) -> good history (gph)
>> WS4 : SSN id (ssn) -> email addresses (ea)
>>
>> The input data containing customer identifiers (ssn) and other  
>> relevant information is stored in a local data resource. Two  
>> possible web service linear workflows that can be formed to process  
>> the input data using the above services are C1 = WS2,WS3,WS1,WS4  
>> and C2 = WS1,WS2,WS3,WS4. In the first workflow, first, the  
>> customers having a good payment history are initially selected  
>> (WS2,WS3), and then, the remaining customers whose credit history  
>> is below some threshold are filtered out (through WS1). The C2  
>> workflow performs the same tasks in a reverse order. The above  
>> linear workflows may have different performance; if WS3 filters out  
>> more data than WS1, then it will be more beneficial to invoke WS3  
>> before WS1 in order for the subsequent web services in the workflow  
>> to process less data.
>>
>> It would be very useful to know if there exist similar scientific  
>> workflow examples (where the order of task invocation can change  
>> and it is not known a-priori by the user, while the workflow  
>> performance depends on the workflow task invocation order) and if  
>> you are interested in extending kepler with optimization algorithms  
>> for such workflows.
>>
>> I am asking because i have recently developed an optimization  
>> algorithm for this problem and i would like to test its performance  
>> in a real-world workflow management system with real-world workflows.
>>
>> P.S.: references to publications or any other information dealing  
>> with scientific workflows of the above rationale will be extremely  
>> useful.
>>
>> Thank you very much for your time
>>
>>
>>
>> _______________________________________________
>> Kepler-dev mailing list
>> Kepler-dev at kepler-project.org
>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>
> -- 
> Christopher Brooks, PMP                       University of California
> CHESS Executive Director                      US Mail: 337 Cory Hall
> Programmer/Analyst CHESS/Ptolemy/Trust        Berkeley, CA 94720-1774
> ph: 510.643.9841      	                      (Office: 545Q Cory)
> home: (F-Tu) 707.665.0131 cell: 707.332.0670
>
>