[kepler-dev] extending kepler with workflow optimization algorithms

Wed Jul 13 05:55:59 PDT 2011

Dear Christopher

I will check the source code you have posted

Thank you very much for the assistance

Efi

Quoting Christopher Brooks <cxh at eecs.berkeley.edu>:

> Hi Efthymia,
> I just stumbled across ptolemy/domains/sdf/optimize,
> which has an OptimizingSDFDirector:
> "OptimizingSDFScheduler is the scheduler companion to the  
> OptimizingSDFDirector
> It works with the synchronous dataflow (SDF) model of computation to find
> an optimized schedule according to a defined criterion. "
>
> This is the work of Marc Geilen, who visited Berkeley awhile back.
> The code might be an interesting starting point for some of your ideas.
>
> _Christopher
>
> On 7/11/11 6:13 AM, Efthymia Tsamoura wrote:
>> Dear Edward
>>
>> Thank you very much for the valuable feedback.
>>
>> Best Regards,
>> Efi
>>
>>
>> Quoting "Edward A. Lee" <eal at eecs.berkeley.edu>:
>>
>>>
>>> Dear Efi,
>>>
>>> The question of whether order of invocation affects the computed
>>> results is the question of determinacy. Here are a few papers
>>> that address this:
>>>
>>>
>>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-7.html
>>>
>>> http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/
>>>
>>> http://ptolemy.eecs.berkeley.edu/publications/papers/95/processNets/
>>>
>>> There are many other papers on the Ptolemy website that talk some
>>> about the issue, but I think the ones above are examples
>>> where it is a major focus.
>>>
>>> Edward
>>>
>>>
>>> On 7/8/11 2:22 AM, Efthymia Tsamoura wrote:
>>>> Dear Christopher
>>>>
>>>> I was actually thinking of developing a new director for kepler that
>>>> optimizes a workflow using an algorithm that I have recently developed
>>>> as part of my PhD. The characteristics of the optimization algorithm are
>>>> the following:
>>>>
>>>> 1.The algorithm takes into account both the cost of the services and the
>>>> number of tokens passed between actors to produce an optimal (in terms
>>>> of cost) workflow.
>>>>
>>>> 2.It also accounts decentralized data transfers between the actors,
>>>> i.e., it is assumed that the actors can be deployed anywhere in a
>>>> wide-area infrastructure, while they can exchange data directly with
>>>> varying data transferring costs.
>>>>
>>>> 3.Finally, it assumes that the data are exchanged among the actors in a
>>>> pipelined fashion, so that the tuples already processed by an actor are
>>>> processed by the subsequent actor in the workflow at the same time as
>>>> the former processes new input data.
>>>>
>>>> (If you are interested, a technical report presenting the algorithm can
>>>> be found at:
>>>> http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)
>>>>
>>>> Although I have seen that kepler does not support decentralized data
>>>> transfers but only centralized ones (through the DistributedComposite
>>>> actor), i prefer kepler for experimentation because it is extremely well
>>>> documented and can be easily extended to support new functionality.
>>>>
>>>> So, I was actually wondering if types of workflows, such as the ones
>>>> presented in the example of my first email, are met. I have seen many
>>>> scientific workflows but in the majority of them the workflow tasks must
>>>> be in a predefined order, while altering the task ordering either does
>>>> not produce the correct result, or does not change the amount of tokens
>>>> passed between the actors.
>>>>
>>>> I want to discover such workflow types/examples, in order to see if such
>>>> an optimization algorithm would be useful for the community.
>>>>
>>>> Thank you very much for the quick reply
>>>>
>>>>
>>>> Best regards
>>>> Efi
>>>>
>>>>
>>>> Quoting Christopher Brooks <cxh at eecs.berkeley.edu>:
>>>>
>>>>> Hi Efthymia,
>>>>> Scheduling of actor models is a deep topic.
>>>>> Kepler uses Ptolemy II as its execution engine.
>>>>> Ptolemy II is based on Ptolemy Classic.
>>>>> In Ptolemy Classic, we had a number of different synchronous dataflow
>>>>> scheduling
>>>>> algorithms, many of these were oriented towards clustering for parallel
>>>>> processing.
>>>>>
>>>>> In your example model, I see there being two costs:
>>>>> 1) The cost of the service. For example FICO might charge more money
>>>>> per query than an email lookup service.
>>>>> 2) The number of tokens passed between actors varies.
>>>>>
>>>>> There is a relationship between these two costs where #1 is usually the
>>>>> more important cost, but #2 can eventually overrule #1.
>>>>>
>>>>> Offhand, I don't know of a model of computation that does exactly what
>>>>> you want.
>>>>>
>>>>> In Ptolemy II, models like your example model are typically Kahn
>>>>> Process Network (PN) models where each actor is a separate process.
>>>>>
>>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/
>>>>>
>>>>> says:
>>>>>
>>>>> "The following are the most important features of the operational
>>>>> semantics of
>>>>> PN as proposed by Kahn and MacQueen:
>>>>>
>>>>> * This is a network of sequential processes.
>>>>> * The processes do not share memory. Instead they communicate with
>>>>> each other
>>>>> only through unidirectional FIFO channels.
>>>>> * Communication between processes is asynchronous.
>>>>> * Processes block on a read from a FIFO channel if the channel is
>>>>> empty but
>>>>> can write to a channel whenever they want to."
>>>>>
>>>>> In PN, there is not really a predefined schedule, the threads operate
>>>>> and then block.
>>>>>
>>>>> Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
>>>>> the number of tokens is known in advance and a schedule is defined in
>>>>> advance.
>>>>>
>>>>> Dynamic Dataflow (DDF) is between the two, where the number of tokens
>>>>> passed
>>>>> between actors is not known in advance.
>>>>>
>>>>> There is another trivial director called the LeftToRightDirector that
>>>>> fires the actors
>>>>> in order from LeftToRight. This would allow you to drag actors around
>>>>> and try
>>>>> different firings. That director is at
>>>>> ptII/doc/tutorial/domains/LeftRightDirector.java
>>>>>
>>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
>>>>> says
>>>>> "If you want to use a director not in Kepler tree, you have to use the
>>>>> "Tools/Instantiate Attribute" menu. I use it to get a DDF director
>>>>> frequently (class ptolemy.domains.ddf.kernel.DDFDirector). "
>>>>>
>>>>> So, in Kepler, you would do Tools-> Instantiate Attribute and then
>>>>> enter doc.tutorial.domains.LeftRightDirector, but that does
>>>>> not seem to work in Kepler, so you would need to download Ptolemy II via
>>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/
>>>>>
>>>>> The Timed Multitasking model (TM) is somewhat close to what you want
>>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/
>>>>>
>>>>> says
>>>>> --start--
>>>>> The timed multitasking (TM) domain, created by Jie Liu, offers a model
>>>>> of computation based on priority-driven multitasking, as common in
>>>>> real-time operating systems (RTOSs), but with more deterministic
>>>>> behavior. In TM, actors (conceptually) execute as concurrent threads
>>>>> in reaction to inputs. The domain provides an event dispatcher, which
>>>>> maintains a prioritized event queue. The execution of an actor is
>>>>> triggered by the event dispatcher by invoking first its prefire()
>>>>> method. The actor may begin execution of a concurrent thread at this
>>>>> time. Some time later, the dispatcher will invoke the fire() and
>>>>> postfire() methods of the actor (unless prefire() returns false).
>>>>>
>>>>> The amount of time that elapses between the invocation of prefire()
>>>>> and fire() depends on the declared /executionTime/ and /priority/ of
>>>>> the actor (or more specifically, of the port of the port receiving the
>>>>> triggering event). The domain assumes there is a single resource, the
>>>>> CPU, shared by the execution of all actors. At one particular time,
>>>>> only one of the actors can get the resource and execute. Execution of
>>>>> one actor may be preempted by another eligible actor with a higher
>>>>> priority input event. If an actor is not preempted, then the amount of
>>>>> time that elapses between prefire() and fire() equals the declared.
>>>>> /executionTime/. If it is preempted, then it equals the sum of the
>>>>> /executionTime/ and the execution times of the actors that preempt it.
>>>>> The model of computation is more deterministic than the usual
>>>>> priority-driven multitasking because the actor produces outputs (in
>>>>> its fire() method) only after it has been assured access to the CPU
>>>>> for its declared /executionTime/. In this domain, the model time may
>>>>> be synchronized to real time or not.
>>>>>
>>>>> --end--
>>>>>
>>>>>
>>>>>
>>>>> For an overview of the models of computation, see the "Domain
>>>>> Overview" link at
>>>>> the top of
>>>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm
>>>>>
>>>>>
>>>>> or the first chapter of
>>>>>
>>>>>
>>>>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf
>>>>>
>>>>>
>>>>> _Christopher
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:
>>>>>> Hello
>>>>>>
>>>>>> I am a phd student and during this period i am dealing with workflow
>>>>>> optimization problems in distributed environments. I would like to
>>>>>> ask, if there are exist any cases where if the order of task
>>>>>> invocation in a scientific workflow changes its performance changes
>>>>>> too without, however, affecting the produced results. In the
>>>>>> following, a present a small use case of the problem i am interested in:
>>>>>>
>>>>>> Suppose that a company wants to obtain a list of email addresses of
>>>>>> potential customers selecting only those who have a good payment
>>>>>> history for at least one card and a credit rating above some
>>>>>> threshold. The company has the right to use the following web services
>>>>>>
>>>>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>>>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>>>>> WS3 : card number (ccn, good) -> good history (gph)
>>>>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>>>>
>>>>>> The input data containing customer identifiers (ssn) and other
>>>>>> relevant information is stored in a local data resource. Two possible
>>>>>> web service linear workflows that can be formed to process the input
>>>>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>>>>>> WS1,WS2,WS3,WS4. In the first workflow, first, the customers having a
>>>>>> good payment history are initially selected (WS2,WS3), and then, the
>>>>>> remaining customers whose credit history is below some threshold are
>>>>>> filtered out (through WS1). The C2 workflow performs the same tasks
>>>>>> in a reverse order. The above linear workflows may have different
>>>>>> performance; if WS3 filters out more data than WS1, then it will be
>>>>>> more beneficial to invoke WS3 before WS1 in order for the subsequent
>>>>>> web services in the workflow to process less data.
>>>>>>
>>>>>> It would be very useful to know if there exist similar scientific
>>>>>> workflow examples (where the order of task invocation can change and
>>>>>> it is not known a-priori by the user, while the workflow performance
>>>>>> depends on the workflow task invocation order) and if you are
>>>>>> interested in extending kepler with optimization algorithms for such
>>>>>> workflows.
>>>>>>
>>>>>> I am asking because i have recently developed an optimization
>>>>>> algorithm for this problem and i would like to test its performance
>>>>>> in a real-world workflow management system with real-world workflows.
>>>>>>
>>>>>> P.S.: references to publications or any other information dealing
>>>>>> with scientific workflows of the above rationale will be extremely
>>>>>> useful.
>>>>>>
>>>>>> Thank you very much for your time
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Kepler-dev mailing list
>>>>>> Kepler-dev at kepler-project.org
>>>>>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>>>>>
>>>>> -- 
>>>>> Christopher Brooks, PMP University of California
>>>>> CHESS Executive Director US Mail: 337 Cory Hall
>>>>> Programmer/Analyst CHESS/Ptolemy/Trust Berkeley, CA 94720-1774
>>>>> ph: 510.643.9841 (Office: 545Q Cory)
>>>>> home: (F-Tu) 707.665.0131 cell: 707.332.0670
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Kepler-dev mailing list
>>>> Kepler-dev at kepler-project.org
>>>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>>>
>>
>>
>>
>>
>
> -- 
> Christopher Brooks, PMP                       University of California
> CHESS Executive Director                      US Mail: 337 Cory Hall
> Programmer/Analyst CHESS/Ptolemy/Trust        Berkeley, CA 94720-1774
> ph: 510.643.9841      	                      (Office: 545Q Cory)
> home: (F-Tu) 707.665.0131 cell: 707.332.0670
>
>