[kepler-users] Parallel Pipelines
Matthew Jones
jones at nceas.ucsb.edu
Thu Jul 12 15:56:29 PDT 2007
Hi Kyle,
That is indeed a common need -- we are currently implementing just the
capabilities you describe. Chad Berkley and Lucas Gilbert are doing the
work. Their approach uses a higher-order composite actor to designate the
subworkflows to be distributed; the system then handles scheduling and
communication across nodes without further intervention from the user.
I think the system we are designing would be ideal for your case. I
recommend that you talk with Chad and Lucas to make sure the solution
they are developing will work for you and, if not, to consider whether
they should take on the additional requirements your case introduces.

There is also other work going on in distributed execution in Kepler and
Ptolemy, some of which is focused on leveraging existing grid engines
(e.g., Globus, Nimrod).
Cheers,
Matt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matthew B. Jones
Director of Informatics Research and Development
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara
jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kyle wrote:
> I'm looking to move some previously hard-coded, workflow-style analysis
> of protein sequences (reference links below) to something like
> Kepler, so that it can be easily modified and expanded. The biggest
> problem is that there are a lot of proteins to be processed (about 1.5
> million for 537 bacterial genomes), each requiring various
> tasks. (If you know bioinformatics, it's stuff like membrane helix
> prediction, running BLAST against the NR database, secondary
> structure prediction, protein threading, running MODELLER.) This is
> all more work than I would want one computer to do.
> Has anybody done any work on parallel workflows? What is the best
> way to handle a workflow of this scope?
> I could try to set it up so that Kepler merely manages the workflow
> of coordinating web services and queue submissions. But that would
> introduce a lot of extra 'lag' for communication time and submission
> to busy queues. I would prefer some method where I got a block of
> computers on a cluster and the various actors would fire off on the
> free nodes, and the director/management system would coordinate the
> actors and move data around between the different computers.
>
> Has there been any research into this sort of thing? Does anybody
> have any ideas on the best way to tackle this sort of thing?
>
> Kyle Ellrott
>
>
> PROSPECT-PSPP: an automatic computational pipeline for protein
> structure prediction.
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=15215441&ordinalpos=5&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
>
> A computational pipeline for protein structure prediction and
> analysis at genome scale.
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=14555633&ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
> _______________________________________________
> Kepler-users mailing list
> Kepler-users at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-users