[kepler-users] Parallel Pipelines
Kyle
kellrott at csbl.bmb.uga.edu
Thu Jul 12 15:42:12 PDT 2007
I'm looking to move some previously hard-coded, workflow-style analysis
of protein sequences (reference links below) to something like Kepler,
so that it can be easily modified and expanded. The biggest problem is
that there are a lot of proteins to be processed (about 1.5 million
from 537 bacterial genomes), each of which requires various tasks. (If
you know bioinformatics, it's things like membrane helix prediction,
running BLAST against the NR database, secondary structure prediction,
protein threading, and running MODELLER.) This is all more work than I
would want one computer to do.
Has anybody done any work on parallel workflows? What is the best way
to handle a workflow of this scope?
I could try to set it up so that Kepler merely manages the workflow,
coordinating web services and queue submissions (a rough sketch of what
such a submission wrapper might look like is below). But that would
introduce a lot of extra lag for communication time and for submissions
to busy queues. I would prefer some method where I reserve a block of
computers on a cluster, the various actors fire off on the free nodes,
and the director/management system coordinates the actors and moves
data around between the different machines.
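To make that first option concrete, here is a minimal sketch of the kind of
command-line wrapper an external-execution actor could invoke to push one
BLAST task onto a PBS/SGE-style queue and block until it finishes. This is a
hypothetical helper, not an existing Kepler actor; the queue name, the
blastall arguments, and the paths are illustrative assumptions only.

#!/usr/bin/env python
# Hypothetical helper, not part of Kepler: submits one BLAST task to a
# PBS-style queue with qsub and blocks until qstat no longer lists the job.
# Queue name, blastall arguments, and paths are illustrative assumptions.
import os
import subprocess
import sys
import tempfile
import time

def submit_blast(fasta_path, out_path, queue="batch"):
    """Write a one-task job script, submit it with qsub, return the job id."""
    script = tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False)
    script.write("#!/bin/sh\n")
    script.write("#PBS -q %s\n" % queue)
    script.write("blastall -p blastp -d nr -i %s -o %s\n" % (fasta_path, out_path))
    script.close()
    job_id = subprocess.check_output(["qsub", script.name]).decode().strip()
    os.unlink(script.name)
    return job_id

def wait_for(job_id, poll_seconds=30):
    """Poll qstat until the job leaves the queue (qstat returns non-zero)."""
    devnull = open(os.devnull, "w")
    while subprocess.call(["qstat", job_id], stdout=devnull, stderr=devnull) == 0:
        time.sleep(poll_seconds)
    devnull.close()

if __name__ == "__main__":
    job = submit_blast(sys.argv[1], sys.argv[2])
    wait_for(job)

The drawback is exactly the one described above: every task pays the
submission and polling overhead of the shared queue, which is why a
reserved block of nodes with actors dispatched directly onto them seems
more attractive.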
Has there been any research into this sort of thing? Does anybody have
ideas on the best way to tackle it?
Kyle Ellrott
PROSPECT-PSPP: an automatic computational pipeline for protein
structure prediction.
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=15215441&ordinalpos=5&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
A computational pipeline for protein structure prediction and
analysis at genome scale.
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=14555633&ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum