[kepler-dev] extending kepler with workflow optimization algorithms

Fri Jul 8 13:39:12 PDT 2011

Dear Efi,

The question of whether order of invocation affects the computed
results is the question of determinacy. Here are a few papers
that address this:

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-7.html

http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/

http://ptolemy.eecs.berkeley.edu/publications/papers/95/processNets/

There are many other papers on the Ptolemy website that talk some
about the issue, but I think the ones above are examples
where it is a major focus.

Edward

On 7/8/11 2:22 AM, Efthymia Tsamoura wrote:
> Dear Christopher
>
> I was actually thinking of developing a new director for kepler that
> optimizes a workflow using an algorithm that I have recently developed
> as part of my PhD. The characteristics of the optimization algorithm are
> the following:
>
> 1.The algorithm takes into account both the cost of the services and the
> number of tokens passed between actors to produce an optimal (in terms
> of cost) workflow.
>
> 2.It also accounts decentralized data transfers between the actors,
> i.e., it is assumed that the actors can be deployed anywhere in a
> wide-area infrastructure, while they can exchange data directly with
> varying data transferring costs.
>
> 3.Finally, it assumes that the data are exchanged among the actors in a
> pipelined fashion, so that the tuples already processed by an actor are
> processed by the subsequent actor in the workflow at the same time as
> the former processes new input data.
>
> (If you are interested, a technical report presenting the algorithm can
> be found at:
> http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)
>
> Although I have seen that kepler does not support decentralized data
> transfers but only centralized ones (through the DistributedComposite
> actor), i prefer kepler for experimentation because it is extremely well
> documented and can be easily extended to support new functionality.
>
> So, I was actually wondering if types of workflows, such as the ones
> presented in the example of my first email, are met. I have seen many
> scientific workflows but in the majority of them the workflow tasks must
> be in a predefined order, while altering the task ordering either does
> not produce the correct result, or does not change the amount of tokens
> passed between the actors.
>
> I want to discover such workflow types/examples, in order to see if such
> an optimization algorithm would be useful for the community.
>
> Thank you very much for the quick reply
>
>
> Best regards
> Efi
>
>
> Quoting Christopher Brooks <cxh at eecs.berkeley.edu>:
>
>> Hi Efthymia,
>> Scheduling of actor models is a deep topic.
>> Kepler uses Ptolemy II as its execution engine.
>> Ptolemy II is based on Ptolemy Classic.
>> In Ptolemy Classic, we had a number of different synchronous dataflow
>> scheduling
>> algorithms, many of these were oriented towards clustering for parallel
>> processing.
>>
>> In your example model, I see there being two costs:
>> 1) The cost of the service. For example FICO might charge more money
>> per query than an email lookup service.
>> 2) The number of tokens passed between actors varies.
>>
>> There is a relationship between these two costs where #1 is usually the
>> more important cost, but #2 can eventually overrule #1.
>>
>> Offhand, I don't know of a model of computation that does exactly what
>> you want.
>>
>> In Ptolemy II, models like your example model are typically Kahn
>> Process Network (PN) models where each actor is a separate process.
>>
>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/
>>
>> says:
>>
>> "The following are the most important features of the operational
>> semantics of
>> PN as proposed by Kahn and MacQueen:
>>
>> * This is a network of sequential processes.
>> * The processes do not share memory. Instead they communicate with
>> each other
>> only through unidirectional FIFO channels.
>> * Communication between processes is asynchronous.
>> * Processes block on a read from a FIFO channel if the channel is
>> empty but
>> can write to a channel whenever they want to."
>>
>> In PN, there is not really a predefined schedule, the threads operate
>> and then block.
>>
>> Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
>> the number of tokens is known in advance and a schedule is defined in
>> advance.
>>
>> Dynamic Dataflow (DDF) is between the two, where the number of tokens
>> passed
>> between actors is not known in advance.
>>
>> There is another trivial director called the LeftToRightDirector that
>> fires the actors
>> in order from LeftToRight. This would allow you to drag actors around
>> and try
>> different firings. That director is at
>> ptII/doc/tutorial/domains/LeftRightDirector.java
>>
>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
>> says
>> "If you want to use a director not in Kepler tree, you have to use the
>> "Tools/Instantiate Attribute" menu. I use it to get a DDF director
>> frequently (class ptolemy.domains.ddf.kernel.DDFDirector). "
>>
>> So, in Kepler, you would do Tools-> Instantiate Attribute and then
>> enter doc.tutorial.domains.LeftRightDirector, but that does
>> not seem to work in Kepler, so you would need to download Ptolemy II via
>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/
>>
>> The Timed Multitasking model (TM) is somewhat close to what you want
>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/
>>
>> says
>> --start--
>> The timed multitasking (TM) domain, created by Jie Liu, offers a model
>> of computation based on priority-driven multitasking, as common in
>> real-time operating systems (RTOSs), but with more deterministic
>> behavior. In TM, actors (conceptually) execute as concurrent threads
>> in reaction to inputs. The domain provides an event dispatcher, which
>> maintains a prioritized event queue. The execution of an actor is
>> triggered by the event dispatcher by invoking first its prefire()
>> method. The actor may begin execution of a concurrent thread at this
>> time. Some time later, the dispatcher will invoke the fire() and
>> postfire() methods of the actor (unless prefire() returns false).
>>
>> The amount of time that elapses between the invocation of prefire()
>> and fire() depends on the declared /executionTime/ and /priority/ of
>> the actor (or more specifically, of the port of the port receiving the
>> triggering event). The domain assumes there is a single resource, the
>> CPU, shared by the execution of all actors. At one particular time,
>> only one of the actors can get the resource and execute. Execution of
>> one actor may be preempted by another eligible actor with a higher
>> priority input event. If an actor is not preempted, then the amount of
>> time that elapses between prefire() and fire() equals the declared.
>> /executionTime/. If it is preempted, then it equals the sum of the
>> /executionTime/ and the execution times of the actors that preempt it.
>> The model of computation is more deterministic than the usual
>> priority-driven multitasking because the actor produces outputs (in
>> its fire() method) only after it has been assured access to the CPU
>> for its declared /executionTime/. In this domain, the model time may
>> be synchronized to real time or not.
>>
>> --end--
>>
>>
>>
>> For an overview of the models of computation, see the "Domain
>> Overview" link at
>> the top of
>> http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm
>>
>>
>> or the first chapter of
>>
>>
>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf
>>
>>
>> _Christopher
>>
>>
>>
>>
>>
>>
>> On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:
>>> Hello
>>>
>>> I am a phd student and during this period i am dealing with workflow
>>> optimization problems in distributed environments. I would like to
>>> ask, if there are exist any cases where if the order of task
>>> invocation in a scientific workflow changes its performance changes
>>> too without, however, affecting the produced results. In the
>>> following, a present a small use case of the problem i am interested in:
>>>
>>> Suppose that a company wants to obtain a list of email addresses of
>>> potential customers selecting only those who have a good payment
>>> history for at least one card and a credit rating above some
>>> threshold. The company has the right to use the following web services
>>>
>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>> WS3 : card number (ccn, good) -> good history (gph)
>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>
>>> The input data containing customer identifiers (ssn) and other
>>> relevant information is stored in a local data resource. Two possible
>>> web service linear workflows that can be formed to process the input
>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>>> WS1,WS2,WS3,WS4. In the first workflow, first, the customers having a
>>> good payment history are initially selected (WS2,WS3), and then, the
>>> remaining customers whose credit history is below some threshold are
>>> filtered out (through WS1). The C2 workflow performs the same tasks
>>> in a reverse order. The above linear workflows may have different
>>> performance; if WS3 filters out more data than WS1, then it will be
>>> more beneficial to invoke WS3 before WS1 in order for the subsequent
>>> web services in the workflow to process less data.
>>>
>>> It would be very useful to know if there exist similar scientific
>>> workflow examples (where the order of task invocation can change and
>>> it is not known a-priori by the user, while the workflow performance
>>> depends on the workflow task invocation order) and if you are
>>> interested in extending kepler with optimization algorithms for such
>>> workflows.
>>>
>>> I am asking because i have recently developed an optimization
>>> algorithm for this problem and i would like to test its performance
>>> in a real-world workflow management system with real-world workflows.
>>>
>>> P.S.: references to publications or any other information dealing
>>> with scientific workflows of the above rationale will be extremely
>>> useful.
>>>
>>> Thank you very much for your time
>>>
>>>
>>>
>>> _______________________________________________
>>> Kepler-dev mailing list
>>> Kepler-dev at kepler-project.org
>>> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
>>
>> --
>> Christopher Brooks, PMP University of California
>> CHESS Executive Director US Mail: 337 Cory Hall
>> Programmer/Analyst CHESS/Ptolemy/Trust Berkeley, CA 94720-1774
>> ph: 510.643.9841 (Office: 545Q Cory)
>> home: (F-Tu) 707.665.0131 cell: 707.332.0670
>>
>>
>
>
>
>
> _______________________________________________
> Kepler-dev mailing list
> Kepler-dev at kepler-project.org
> http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eal.vcf
Type: text/x-vcard
Size: 330 bytes
Desc: not available
URL: <http://lists.nceas.ucsb.edu/kepler/pipermail/kepler-dev/attachments/20110708/c6c226eb/attachment.vcf>