[kepler-dev] extending kepler with workflow optimization algorithms

Christopher Brooks cxh at eecs.berkeley.edu
Mon Jul 11 10:05:08 PDT 2011


Hi Efthymia,
I just stumbled across ptolemy/domains/sdf/optimize,
which has an OptimizingSDFDirector:
"OptimizingSDFScheduler is the scheduler companion to the OptimizingSDFDirector
It works with the synchronous dataflow (SDF) model of computation to find
an optimized schedule according to a defined criterion. "

This is the work of Marc Geilen, who visited Berkeley awhile back.
The code might be an interesting starting point for some of your ideas.

_Christopher

On 7/11/11 6:13 AM, Efthymia Tsamoura wrote:
> Dear Edward
>
> Thank you very much for the valuable feedback.
>
> Best Regards,
> Efi
>
>
> Quoting "Edward A. Lee" <eal at eecs.berkeley.edu>:
>
>>
>> Dear Efi,
>>
>> The question of whether order of invocation affects the computed
>> results is the question of determinacy. Here are a few papers
>> that address this:
>>
>>
>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-7.html
>>
>> http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/
>>
>> http://ptolemy.eecs.berkeley.edu/publications/papers/95/processNets/
>>
>> There are many other papers on the Ptolemy website that talk some
>> about the issue, but I think the ones above are examples
>> where it is a major focus.
>>
>> Edward
>>
>>
>> On 7/8/11 2:22 AM, Efthymia Tsamoura wrote:
>>> Dear Christopher
>>>
>>> I was actually thinking of developing a new director for kepler that
>>> optimizes a workflow using an algorithm that I have recently developed
>>> as part of my PhD. The characteristics of the optimization algorithm are
>>> the following:
>>>
>>> 1.The algorithm takes into account both the cost of the services and the
>>> number of tokens passed between actors to produce an optimal (in terms
>>> of cost) workflow.
>>>
>>> 2.It also accounts decentralized data transfers between the actors,
>>> i.e., it is assumed that the actors can be deployed anywhere in a
>>> wide-area infrastructure, while they can exchange data directly with
>>> varying data transferring costs.
>>>
>>> 3.Finally, it assumes that the data are exchanged among the actors in a
>>> pipelined fashion, so that the tuples already processed by an actor are
>>> processed by the subsequent actor in the workflow at the same time as
>>> the former processes new input data.
>>>
>>> (If you are interested, a technical report presenting the algorithm can
>>> be found at:
>>> http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)
>>>
>>> Although I have seen that kepler does not support decentralized data
>>> transfers but only centralized ones (through the DistributedComposite
>>> actor), i prefer kepler for experimentation because it is extremely well
>>> documented and can be easily extended to support new functionality.
>>>
>>> So, I was actually wondering if types of workflows, such as the ones
>>> presented in the example of my first email, are met. I have seen many
>>> scientific workflows but in the majority of them the workflow tasks must
>>> be in a predefined order, while altering the task ordering either does
>>> not produce the correct result, or does not change the amount of tokens
>>> passed between the actors.
>>>
>>> I want to discover such workflow types/examples, in order to see if such
>>> an optimization algorithm would be useful for the community.
>>>
>>> Thank you very much for the quick reply
>>>
>>>
>>> Best regards
>>> Efi
>>>
>>>
>>> Quoting Christopher Brooks <cxh at eecs.berkeley.edu>:
>>>
>>>> Hi Efthymia,
>>>> Scheduling of actor models is a deep topic.
>>>> Kepler uses Ptolemy II as its execution engine.
>>>> Ptolemy II is based on Ptolemy Classic.
>>>> In Ptolemy Classic, we had a number of different synchronous dataflow
>>>> scheduling
>>>> algorithms, many of these were oriented towards clustering for parallel
>>>> processing.
>>>>
>>>> In your example model, I see there being two costs:
>>>> 1) The cost of the service. For example FICO might charge more money
>>>> per query than an email lookup service.
>>>> 2) The number of tokens passed between actors varies.
>>>>
>>>> There is a relationship between these two costs where #1 is usually the
>>>> more important cost, but #2 can eventually overrule #1.
>>>>
>>>> Offhand, I don't know of a model of computation that does exactly what
>>>> you want.
>>>>
>>>> In Ptolemy II, models like your example model are typically Kahn
>>>> Process Network (PN) models where each actor is a separate process.
>>>>
>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/
>>>>
>>>> says:
>>>>
>>>> "The following are the most important features of the operational
>>>> semantics of
>>>> PN as proposed by Kahn and MacQueen:
>>>>
>>>> * This is a network of sequential processes.
>>>> * The processes do not share memory. Instead they communicate with
>>>> each other
>>>> only through unidirectional FIFO channels.
>>>> * Communication between processes is asynchronous.
>>>> * Processes block on a read from a FIFO channel if the channel is
>>>> empty but
>>>> can write to a channel whenever they want to."
>>>>
>>>> In PN, there is not really a predefined schedule, the threads operate
>>>> and then block.
>>>>
>>>> Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
>>>> the number of tokens is known in advance and a schedule is defined in
>>>> advance.
>>>>
>>>> Dynamic Dataflow (DDF) is between the two, where the number of tokens
>>>> passed
>>>> between actors is not known in advance.
>>>>
>>>> There is another trivial director called the LeftToRightDirector that
>>>> fires the actors
>>>> in order from LeftToRight. This would allow you to drag actors around
>>>> and try
>>>> different firings. That director is at
>>>> ptII/doc/tutorial/domains/LeftRightDirector.java
>>>>
>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
>>>> says
>>>> "If you want to use a director not in Kepler tree, you have to use the
>>>> "Tools/Instantiate Attribute" menu. I use it to get a DDF director
>>>> frequently (class ptolemy.domains.ddf.kernel.DDFDirector). "
>>>>
>>>> So, in Kepler, you would do Tools-> Instantiate Attribute and then
>>>> enter doc.tutorial.domains.LeftRightDirector, but that does
>>>> not seem to work in Kepler, so you would need to download Ptolemy II via
>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/
>>>>
>>>> The Timed Multitasking model (TM) is somewhat close to what you want
>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/
>>>>
>>>> says
>>>> --start--
>>>> The timed multitasking (TM) domain, created by Jie Liu, offers a model
>>>> of computation based on priority-driven multitasking, as common in
>>>> real-time operating systems (RTOSs), but with more deterministic
>>>> behavior. In TM, actors (conceptually) execute as concurrent threads
>>>> in reaction to inputs. The domain provides an event dispatcher, which
>>>> maintains a prioritized event queue. The execution of an actor is
>>>> triggered by the event dispatcher by invoking first its prefire()
>>>> method. The actor may begin execution of a concurrent thread at this
>>>> time. Some time later, the dispatcher will invoke the fire() and
>>>> postfire() methods of the actor (unless prefire() returns false).
>>>>
>>>> The amount of time that elapses between the invocation of prefire()
>>>> and fire() depends on the declared /executionTime/ and /priority/ of
>>>> the actor (or more specifically, of the port of the port receiving the
>>>> triggering event). The domain assumes there is a single resource, the
>>>> CPU, shared by the execution of all actors. At one particular time,
>>>> only one of the actors can get the resource and execute. Execution of
>>>> one actor may be preempted by another eligible actor with a higher
>>>> priority input event. If an actor is not preempted, then the amount of
>>>> time that elapses between prefire() and fire() equals the declared.
>>>> /executionTime/. If it is preempted, then it equals the sum of the
>>>> /executionTime/ and the execution times of the actors that preempt it.
>>>> The model of computation is more deterministic than the usual
>>>> priority-driven multitasking because the actor produces outputs (in
>>>> its fire() method) only after it has been assured access to the CPU
>>>> for its declared /executionTime/. In this domain, the model time may
>>>> be synchronized to real time or not.
>>>>
>>>> --end--
>>>>
>>>>
>>>>
>>>> For an overview of the models of computation, see the "Domain
>>>> Overview" link at
>>>> the top of
>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm
>>>>
>>>>
>>>> or the first chapter of
>>>>
>>>>
>>>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf
>>>>
>>>>
>>>> _Christopher
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:
>>>>> Hello
>>>>>
>>>>> I am a phd student and during this period i am dealing with workflow
>>>>> optimization problems in distributed environments. I would like to
>>>>> ask, if there are exist any cases where if the order of task
>>>>> invocation in a scientific workflow changes its performance changes
>>>>> too without, however, affecting the produced results. In the
>>>>> following, a present a small use case of the problem i am interested in:
>>>>>
>>>>> Suppose that a company wants to obtain a list of email addresses of
>>>>> potential customers selecting only those who have a good payment
>>>>> history for at least one card and a credit rating above some
>>>>> threshold. The company has the right to use the following web services
>>>>>
>>>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>>>> WS3 : card number (ccn, good) -> good history (gph)
>>>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>>>
>>>>> The input data containing customer identifiers (ssn) and other
>>>>> relevant information is stored in a local data resource. Two possible
>>>>> web service linear workflows that can be formed to process the input
>>>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>>>>> WS1,WS2,WS3,WS4. In the first workflow, first, the customers having a
>>>>> good payment history are initially selected (WS2,WS3), and then, the
>>>>> remaining customers whose credit history is below some threshold are
>>>>> filtered out (through WS1). The C2 workflow performs the same tasks
>>>>> in a reverse order. The above linear workflows may have different
>>>>> performance; if WS3 filters out more data than WS1, then it will be
>>>>> more beneficial to invoke WS3 before WS1 in order for the subsequent
>>>>> web services in the workflow to process less data.
>>>>>
>>>>> It would be very useful to know if there exist similar scientific
>>>>> workflow examples (where the order of task invocation can change and
>>>>> it is not known a-priori by the user, while the workflow performance
>>>>> depends on the workflow task invocation order) and if you are
>>>>> interested in extending kepler with optimization algorithms for such
>>>>> workflows.
>>>>>
>>>>> I am asking because i have recently developed an optimization
>>>>> algorithm for this problem and i would like to test its performance
>>>>> in a real-world workflow management system with real-world workflows.
>>>>>
>>>>> P.S.: references to publications or any other information dealing
>>>>> with scientific workflows of the above rationale will be extremely
>>>>> useful.
>>>>>
>>>>> Thank you very much for your time
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Kepler-dev mailing list
>>>>> Kepler-dev at kepler-project.org
>>>>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>>>>
>>>> -- 
>>>> Christopher Brooks, PMP University of California
>>>> CHESS Executive Director US Mail: 337 Cory Hall
>>>> Programmer/Analyst CHESS/Ptolemy/Trust Berkeley, CA 94720-1774
>>>> ph: 510.643.9841 (Office: 545Q Cory)
>>>> home: (F-Tu) 707.665.0131 cell: 707.332.0670
>>>>
>>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Kepler-dev mailing list
>>> Kepler-dev at kepler-project.org
>>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>>
>
>
>
>

-- 
Christopher Brooks, PMP                       University of California
CHESS Executive Director                      US Mail: 337 Cory Hall
Programmer/Analyst CHESS/Ptolemy/Trust        Berkeley, CA 94720-1774
ph: 510.643.9841      	                      (Office: 545Q Cory)
home: (F-Tu) 707.665.0131 cell: 707.332.0670



More information about the Kepler-dev mailing list