[kepler-dev] extending kepler with workflow optimization algorithms

Mon Jul 11 06:13:36 PDT 2011

Dear Edward

Thank you very much for the valuable feedback.

Best Regards,
Efi

Quoting "Edward A. Lee" <eal at eecs.berkeley.edu>:

>
> Dear Efi,
>
> The question of whether order of invocation affects the computed
> results is the question of determinacy. Here are a few papers
> that address this:
>
>
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-7.html
>
> http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/
>
> http://ptolemy.eecs.berkeley.edu/publications/papers/95/processNets/
>
> There are many other papers on the Ptolemy website that talk some
> about the issue, but I think the ones above are examples
> where it is a major focus.
>
> Edward
>
>
> On 7/8/11 2:22 AM, Efthymia Tsamoura wrote:
>> Dear Christopher
>>
>> I was actually thinking of developing a new director for kepler that
>> optimizes a workflow using an algorithm that I have recently developed
>> as part of my PhD. The characteristics of the optimization algorithm are
>> the following:
>>
>> 1.The algorithm takes into account both the cost of the services and the
>> number of tokens passed between actors to produce an optimal (in terms
>> of cost) workflow.
>>
>> 2.It also accounts decentralized data transfers between the actors,
>> i.e., it is assumed that the actors can be deployed anywhere in a
>> wide-area infrastructure, while they can exchange data directly with
>> varying data transferring costs.
>>
>> 3.Finally, it assumes that the data are exchanged among the actors in a
>> pipelined fashion, so that the tuples already processed by an actor are
>> processed by the subsequent actor in the workflow at the same time as
>> the former processes new input data.
>>
>> (If you are interested, a technical report presenting the algorithm can
>> be found at:
>> http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)
>>
>> Although I have seen that kepler does not support decentralized data
>> transfers but only centralized ones (through the DistributedComposite
>> actor), i prefer kepler for experimentation because it is extremely well
>> documented and can be easily extended to support new functionality.
>>
>> So, I was actually wondering if types of workflows, such as the ones
>> presented in the example of my first email, are met. I have seen many
>> scientific workflows but in the majority of them the workflow tasks must
>> be in a predefined order, while altering the task ordering either does
>> not produce the correct result, or does not change the amount of tokens
>> passed between the actors.
>>
>> I want to discover such workflow types/examples, in order to see if such
>> an optimization algorithm would be useful for the community.
>>
>> Thank you very much for the quick reply
>>
>>
>> Best regards
>> Efi
>>
>>
>> Quoting Christopher Brooks <cxh at eecs.berkeley.edu>:
>>
>>> Hi Efthymia,
>>> Scheduling of actor models is a deep topic.
>>> Kepler uses Ptolemy II as its execution engine.
>>> Ptolemy II is based on Ptolemy Classic.
>>> In Ptolemy Classic, we had a number of different synchronous dataflow
>>> scheduling
>>> algorithms, many of these were oriented towards clustering for parallel
>>> processing.
>>>
>>> In your example model, I see there being two costs:
>>> 1) The cost of the service. For example FICO might charge more money
>>> per query than an email lookup service.
>>> 2) The number of tokens passed between actors varies.
>>>
>>> There is a relationship between these two costs where #1 is usually the
>>> more important cost, but #2 can eventually overrule #1.
>>>
>>> Offhand, I don't know of a model of computation that does exactly what
>>> you want.
>>>
>>> In Ptolemy II, models like your example model are typically Kahn
>>> Process Network (PN) models where each actor is a separate process.
>>>
>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/
>>>
>>> says:
>>>
>>> "The following are the most important features of the operational
>>> semantics of
>>> PN as proposed by Kahn and MacQueen:
>>>
>>> * This is a network of sequential processes.
>>> * The processes do not share memory. Instead they communicate with
>>> each other
>>> only through unidirectional FIFO channels.
>>> * Communication between processes is asynchronous.
>>> * Processes block on a read from a FIFO channel if the channel is
>>> empty but
>>> can write to a channel whenever they want to."
>>>
>>> In PN, there is not really a predefined schedule, the threads operate
>>> and then block.
>>>
>>> Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
>>> the number of tokens is known in advance and a schedule is defined in
>>> advance.
>>>
>>> Dynamic Dataflow (DDF) is between the two, where the number of tokens
>>> passed
>>> between actors is not known in advance.
>>>
>>> There is another trivial director called the LeftToRightDirector that
>>> fires the actors
>>> in order from LeftToRight. This would allow you to drag actors around
>>> and try
>>> different firings. That director is at
>>> ptII/doc/tutorial/domains/LeftRightDirector.java
>>>
>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
>>> says
>>> "If you want to use a director not in Kepler tree, you have to use the
>>> "Tools/Instantiate Attribute" menu. I use it to get a DDF director
>>> frequently (class ptolemy.domains.ddf.kernel.DDFDirector). "
>>>
>>> So, in Kepler, you would do Tools-> Instantiate Attribute and then
>>> enter doc.tutorial.domains.LeftRightDirector, but that does
>>> not seem to work in Kepler, so you would need to download Ptolemy II via
>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/
>>>
>>> The Timed Multitasking model (TM) is somewhat close to what you want
>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/
>>>
>>> says
>>> --start--
>>> The timed multitasking (TM) domain, created by Jie Liu, offers a model
>>> of computation based on priority-driven multitasking, as common in
>>> real-time operating systems (RTOSs), but with more deterministic
>>> behavior. In TM, actors (conceptually) execute as concurrent threads
>>> in reaction to inputs. The domain provides an event dispatcher, which
>>> maintains a prioritized event queue. The execution of an actor is
>>> triggered by the event dispatcher by invoking first its prefire()
>>> method. The actor may begin execution of a concurrent thread at this
>>> time. Some time later, the dispatcher will invoke the fire() and
>>> postfire() methods of the actor (unless prefire() returns false).
>>>
>>> The amount of time that elapses between the invocation of prefire()
>>> and fire() depends on the declared /executionTime/ and /priority/ of
>>> the actor (or more specifically, of the port of the port receiving the
>>> triggering event). The domain assumes there is a single resource, the
>>> CPU, shared by the execution of all actors. At one particular time,
>>> only one of the actors can get the resource and execute. Execution of
>>> one actor may be preempted by another eligible actor with a higher
>>> priority input event. If an actor is not preempted, then the amount of
>>> time that elapses between prefire() and fire() equals the declared.
>>> /executionTime/. If it is preempted, then it equals the sum of the
>>> /executionTime/ and the execution times of the actors that preempt it.
>>> The model of computation is more deterministic than the usual
>>> priority-driven multitasking because the actor produces outputs (in
>>> its fire() method) only after it has been assured access to the CPU
>>> for its declared /executionTime/. In this domain, the model time may
>>> be synchronized to real time or not.
>>>
>>> --end--
>>>
>>>
>>>
>>> For an overview of the models of computation, see the "Domain
>>> Overview" link at
>>> the top of
>>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm
>>>
>>>
>>> or the first chapter of
>>>
>>>
>>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf
>>>
>>>
>>> _Christopher
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:
>>>> Hello
>>>>
>>>> I am a phd student and during this period i am dealing with workflow
>>>> optimization problems in distributed environments. I would like to
>>>> ask, if there are exist any cases where if the order of task
>>>> invocation in a scientific workflow changes its performance changes
>>>> too without, however, affecting the produced results. In the
>>>> following, a present a small use case of the problem i am interested in:
>>>>
>>>> Suppose that a company wants to obtain a list of email addresses of
>>>> potential customers selecting only those who have a good payment
>>>> history for at least one card and a credit rating above some
>>>> threshold. The company has the right to use the following web services
>>>>
>>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>>> WS3 : card number (ccn, good) -> good history (gph)
>>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>>
>>>> The input data containing customer identifiers (ssn) and other
>>>> relevant information is stored in a local data resource. Two possible
>>>> web service linear workflows that can be formed to process the input
>>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>>>> WS1,WS2,WS3,WS4. In the first workflow, first, the customers having a
>>>> good payment history are initially selected (WS2,WS3), and then, the
>>>> remaining customers whose credit history is below some threshold are
>>>> filtered out (through WS1). The C2 workflow performs the same tasks
>>>> in a reverse order. The above linear workflows may have different
>>>> performance; if WS3 filters out more data than WS1, then it will be
>>>> more beneficial to invoke WS3 before WS1 in order for the subsequent
>>>> web services in the workflow to process less data.
>>>>
>>>> It would be very useful to know if there exist similar scientific
>>>> workflow examples (where the order of task invocation can change and
>>>> it is not known a-priori by the user, while the workflow performance
>>>> depends on the workflow task invocation order) and if you are
>>>> interested in extending kepler with optimization algorithms for such
>>>> workflows.
>>>>
>>>> I am asking because i have recently developed an optimization
>>>> algorithm for this problem and i would like to test its performance
>>>> in a real-world workflow management system with real-world workflows.
>>>>
>>>> P.S.: references to publications or any other information dealing
>>>> with scientific workflows of the above rationale will be extremely
>>>> useful.
>>>>
>>>> Thank you very much for your time
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Kepler-dev mailing list
>>>> Kepler-dev at kepler-project.org
>>>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>>>
>>> --
>>> Christopher Brooks, PMP University of California
>>> CHESS Executive Director US Mail: 337 Cory Hall
>>> Programmer/Analyst CHESS/Ptolemy/Trust Berkeley, CA 94720-1774
>>> ph: 510.643.9841 (Office: 545Q Cory)
>>> home: (F-Tu) 707.665.0131 cell: 707.332.0670
>>>
>>>
>>
>>
>>
>>
>> _______________________________________________
>> Kepler-dev mailing list
>> Kepler-dev at kepler-project.org
>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>