[kepler-dev] [kepler-users] Parallel Pipelines

Chad Berkley berkley at nceas.ucsb.edu
Fri Jul 13 09:09:37 PDT 2007


Hey,

I'll just give a brief update on the work that Lucas and I have been 
doing, since there seem to be a few people interested.

We've got a working prototype going right now.  What is currently 
possible is to build a workflow with a DistributedCompositeActor (DCA), 
inside which you place the workflow that you want executed remotely. 
You then put the host names of the distributed hosts you want to run on 
in a config file.  When Kepler executes the workflow, it sends the DCA 
and its components out to the slaves for execution.  The results are 
then sent back to the master when execution is finished.  It's still 
pretty rough and we're working out the details, but it does work for 
limited inputs and outputs (actually, now it should work for multiple 
inputs, but we're still working on sending back multiple outputs). 
We're also still working on error handling, which requires that the slave 
report back to the master via RMI when an exception occurs on the remote 
workflow.
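To make the master/slave flow above a little more concrete, here's a rough local sketch of the dispatch pattern (class and method names here are my own invention, not our actual API; in the real system the slave interface extends java.rmi.Remote and the master looks slaves up through rmiregistry):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: a master reading host names (as from the
// config file) and sending the DCA's inner workflow to each slave.
public class DistributedSketch {

    // Stand-in for the remote slave interface.  The real one would
    // extend java.rmi.Remote and throw RemoteException.
    interface Slave {
        String runSubworkflow(String subworkflowMoml) throws Exception;
    }

    // Stand-in for the SlaveController process on each remote host.
    static class EchoSlave implements Slave {
        private final String host;
        EchoSlave(String host) { this.host = host; }
        public String runSubworkflow(String moml) {
            // Pretend "result": host name plus size of the shipped workflow.
            return host + ":" + moml.length();
        }
    }

    // Master side: dispatch the subworkflow to every slave in parallel
    // and gather the results in host order.
    static List<String> dispatch(List<String> hosts, String moml) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(hosts.size());
        List<Future<String>> futures = new ArrayList<>();
        for (String h : hosts) {
            // Real code would do an RMI lookup here instead, e.g.
            // Naming.lookup("//" + h + "/SlaveController").
            Slave s = new EchoSlave(h);
            futures.add(pool.submit(() -> s.runSubworkflow(moml)));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());  // an exception here is the slave's error report
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<String> out = dispatch(List.of("node1", "node2"), "<entity/>");
        System.out.println(out);
    }
}
```

The `Future.get()` calls are where remote exceptions would surface on the master, which is roughly the error-handling path we're still working out.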

If you want to try it for yourself, there is a readme 
(kepler/distributed-readme.txt) that outlines the process for getting it 
going.  Basically you need to run rmiregistry and our SlaveController on 
each of the remote machines you want to access, then you need to build 
your workflow using the DCA.
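The slave-side setup amounts to something like the following (the jar and class names below are guesses for illustration; the readme has the exact invocation):

```shell
# On each slave host: start the RMI registry (default port 1099),
# then start the SlaveController so the master can find it there.
rmiregistry &
java -cp kepler.jar org.kepler.distributed.SlaveController &
```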

I'd be happy to answer any questions or try to help people get it 
running.  Just find me on IRC.

chad

Matthew Jones wrote:
> I've asked Chad and Lucas to document their design -- I'm sure they'll 
> get there eventually -- it is based on the use of HigherOrder Component 
> actors as Edward outlined in our kepler meeting in Davis over a year 
> ago.  The crude whiteboard pictures and notes from the day we spent on 
> this in Santa Barbara last December (Norbert, Tim, and Daniel were there 
> too, so they should have a good idea of this stuff) are here:
> 
> http://www.kepler-project.org/Wiki.jsp?page=DistributedKepler
> 
> Hope this helps.  Chad and Lucas can clarify further.
> 
> Matt
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Matthew B. Jones
> Director of Informatics Research and Development
> National Center for Ecological Analysis and Synthesis (NCEAS)
> UC Santa Barbara
> jones at nceas.ucsb.edu
> http://www.nceas.ucsb.edu/ecoinformatics
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
> Bertram Ludaescher wrote:
>> Hi Matt:
>>
>> Good to see that you're online again. I was a bit concerned after not 
>> having heard from you for a while.
>>
>>  From a research POV I'd be interested in the capabilities being 
>> implemented. Can you point me (again, sorry if I've asked this before) 
>> to any info on the design Chad and Lucas are working on?
>>
>> thanks
>>
>> Bertram
>>
>> On 7/12/07, Matthew Jones <jones at nceas.ucsb.edu> wrote:
>>
>>     Hi Kyle,
>>
>>     That indeed is a common need -- we are currently working on 
>> implementing
>>     just the capabilities you describe.  Chad Berkley and Lucas 
>> Gilbert are
>>     doing the work, and it is based on using a higher-order composite 
>> actor
>>     to designate the subworkflows to be distributed and then the system
>>     handles scheduling and communication across nodes without further
>>     intervention from the user.  I think the system we are designing 
>> would
>>     be ideal for your case.  I recommend that you talk with Chad and
>>     Lucas to be sure that the solution they are developing will work
>>     for you, and if not, to consider whether they should be managing
>>     additional requirements introduced by your case.
>>
>>     There is also other work going on in distributed execution in 
>> Kepler and
>>     Ptolemy, some of which is focused on leveraging existing grid engines
>>     (e.g., Globus, Nimrod).
>>
>>     Cheers,
>>     Matt
>>
>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     Matthew B. Jones
>>     Director of Informatics Research and Development
>>     National Center for Ecological Analysis and Synthesis (NCEAS)
>>     UC Santa Barbara
>>     jones at nceas.ucsb.edu
>>     http://www.nceas.ucsb.edu/ecoinformatics
>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>
>>     Kyle wrote:
>>      > I'm looking to move some previously hard-coded workflow-style 
>> analysis
>>      > of protein sequences (reference links below), to something like
>>      > Kepler, so that it can be easily modified and expanded.  The 
>> biggest
>>      > problem is that there are a lot of proteins to be processed 
>> (about 1.5
>>      > million for 537 bacterial genomes), each requiring 
>> various
>>      > tasks.  (If you know bioinformatics, it's stuff like membrane 
>> helix
>>      > prediction, running BLAST against the NR database, secondary
>>      > structure prediction, protein threading, running modeller).  
>> This is
>>      > all more work than I would want one computer to do.
>>      > Has anybody done any work on parallel workflows?  What is the 
>> best
>>      > way to handle a workflow of this scope?
>>      > I could try to set it up so that Kepler merely manages the 
>> workflow
>>      > of coordinating web services and queue submissions.  But that 
>> would
>>      > introduce a lot of extra 'lag' for communication time and 
>> submission
>>      > to busy queues.  I would prefer some method where I got a block of
>>      > computers on a cluster and the various actors would fire off on 
>> the
>>      > free nodes, and the director/management system would coordinate 
>> the
>>      > actors and move data around in-between the different computers.
>>      >
>>      > Has there been any research into this sort of thing?  Does anybody
>>      > have any ideas on the best way to tackle this sort of thing?
>>      >
>>      > Kyle Ellrott
>>      >
>>      >
>>      > PROSPECT-PSPP: an automatic computational pipeline for protein
>>      > structure prediction.
>>      > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=15215441&ordinalpos=5&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
>>      >
>>      > A computational pipeline for protein structure prediction and
>>      > analysis at genome scale.
>>      > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=14555633&ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
>>      > _______________________________________________
>>      > Kepler-users mailing list
>>      > Kepler-users at ecoinformatics.org
>>      > http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-users
>>
>>
>>
>>
>>
>> -- 
>> Bertram Ludaescher, Assoc. Professor
>> Dept of Computer Science & Genome Center
>> University of California, Davis
>> One Shields Avenue, Davis, CA 95616
>> Ph: (530) 754-8576
>> ludaesch at ucdavis.edu

