[kepler-users] Parallel Pipelines
Kyle
kellrott at csbl.bmb.uga.edu
Thu Jul 12 15:42:12 PDT 2007
I'm looking to move some previously hard-coded, workflow-style analysis
of protein sequences (reference links below) to something like Kepler,
so that it can be easily modified and expanded. The biggest problem is
that there are a lot of proteins to be processed (about 1.5 million
from 537 bacterial genomes), each of which requires various tasks. (If
you know bioinformatics, it's things like membrane helix prediction,
running BLAST against the NR database, secondary structure prediction,
protein threading, and running MODELLER.) This is all more work than I
would want one computer to do.
Has anybody done any work on parallel workflows? What is the best way
to handle a workflow of this scope?
I could try to set it up so that Kepler merely manages the workflow,
coordinating web services and queue submissions (a rough sketch of what
such a submission wrapper might look like is below). But that would
introduce a lot of extra lag for communication time and for submissions
to busy queues. I would prefer some method where I reserve a block of
computers on a cluster, the various actors fire off on the free nodes,
and the director/management system coordinates the actors and moves
data around between the different machines.
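To make that first option concrete, here is a minimal sketch of the kind of
command-line wrapper an external-execution actor could invoke to push one
BLAST task onto a PBS/SGE-style queue and block until it finishes. This is a
hypothetical helper, not an existing Kepler actor; the queue name, the
blastall arguments, and the paths are illustrative assumptions only.

#!/usr/bin/env python
# Hypothetical helper, not part of Kepler: submits one BLAST task to a
# PBS-style queue with qsub and blocks until qstat no longer lists the job.
# Queue name, blastall arguments, and paths are illustrative assumptions.
import os
import subprocess
import sys
import tempfile
import time

def submit_blast(fasta_path, out_path, queue="batch"):
    """Write a one-task job script, submit it with qsub, return the job id."""
    script = tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False)
    script.write("#!/bin/sh\n")
    script.write("#PBS -q %s\n" % queue)
    script.write("blastall -p blastp -d nr -i %s -o %s\n" % (fasta_path, out_path))
    script.close()
    job_id = subprocess.check_output(["qsub", script.name]).decode().strip()
    os.unlink(script.name)
    return job_id

def wait_for(job_id, poll_seconds=30):
    """Poll qstat until the job leaves the queue (qstat returns non-zero)."""
    devnull = open(os.devnull, "w")
    while subprocess.call(["qstat", job_id], stdout=devnull, stderr=devnull) == 0:
        time.sleep(poll_seconds)
    devnull.close()

if __name__ == "__main__":
    job = submit_blast(sys.argv[1], sys.argv[2])
    wait_for(job)

The drawback is exactly the one described above: every task pays the
submission and polling overhead of the shared queue, which is why a
reserved block of nodes with actors dispatched directly onto them seems
more attractive.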
Has there been any research into this sort of thing? Does anybody have
ideas on the best way to tackle it?
Kyle Ellrott
PROSPECT-PSPP: an automatic computational pipeline for protein
structure prediction.
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=15215441&ordinalpos=5&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
A computational pipeline for protein structure prediction and
analysis at genome scale.
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=14555633&ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum