[kepler-users] Parallel Pipelines

Thu Jul 12 15:56:29 PDT 2007

Hi Kyle,

That is indeed a common need, and we are currently implementing 
just the capabilities you describe.  Chad Berkley and Lucas Gilbert are 
doing the work.  It is based on using a higher-order composite actor 
to designate the subworkflows to be distributed; the system then 
handles scheduling and communication across nodes without further 
intervention from the user.  I think the system we are designing would 
be ideal for your case.  I recommend that you talk with Chad and Lucas 
to make sure that the solution they are developing will work for your 
case and, if not, to consider whether they should take on the 
additional requirements your case introduces.
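
To give a rough sense of the idea, here is an illustrative sketch in 
Python.  It is not Kepler code and not the design Chad and Lucas are 
building: a higher-order wrapper takes a sub-workflow and runs it once 
per input on a pool of workers.  The names distribute and 
analyze_sequence are made up for illustration, and a local thread pool 
stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(subworkflow, max_workers=4):
    """Wrap a sub-workflow so it runs once per input on a pool of workers.

    Hypothetical helper: a local thread pool stands in for cluster nodes.
    """
    def run(inputs):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() schedules the calls on free workers and returns
            # the results in the same order as the inputs.
            return list(pool.map(subworkflow, inputs))
    return run


def analyze_sequence(sequence):
    # Placeholder for a real per-protein task (assumption for illustration).
    return {"sequence": sequence, "length": len(sequence)}


# The user only designates what to distribute; scheduling is hidden.
analyze_all = distribute(analyze_sequence, max_workers=2)
results = analyze_all(["MKT", "MSLLA", "MA"])
```

The point of the wrapper is that the caller marks which piece of the 
workflow should be distributed and never deals with scheduling directly.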

There is also other work going on in distributed execution in Kepler and 
Ptolemy, some of which is focused on leveraging existing grid engines 
(e.g., Globus, Nimrod).

Cheers,
Matt

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matthew B. Jones
Director of Informatics Research and Development
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara
jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Kyle wrote:
> I'm looking to move some previously hard-coded, workflow-style analysis
> of protein sequences (reference links below) to something like
> Kepler, so that it can be easily modified and expanded.  The biggest
> problem is that there are a lot of proteins to be processed (about 1.5
> million for 537 bacterial genomes), each requiring various
> tasks.  (If you know bioinformatics, it's stuff like membrane helix
> prediction, running BLAST against the NR database, secondary
> structure prediction, protein threading, running MODELLER.)  This is
> all more work than I would want one computer to do.
> Has anybody done any work on parallel workflows?  What is the best
> way to handle a workflow of this scope?
> I could try to set it up so that Kepler merely manages the workflow
> by coordinating web services and queue submissions.  But that would
> introduce a lot of extra 'lag' from communication time and submission
> to busy queues.  I would prefer some method where I get a block of
> computers on a cluster, the various actors fire off on the
> free nodes, and the director/management system coordinates the
> actors and moves data between the different computers.
> 
> Has there been any research into this sort of thing?  Does anybody  
> have any ideas on the best way to tackle this sort of thing?
> 
> Kyle Ellrott
> 
> 
> PROSPECT-PSPP: an automatic computational pipeline for protein  
> structure prediction.
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=15215441&ordinalpos=5&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
> 
> A computational pipeline for protein structure prediction and  
> analysis at genome scale.
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=14555633&ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum