[seek-dev] Re: Some thoughts on ENM/FullGARPP workflow

Matt Jones jones at nceas.ucsb.edu
Fri Sep 17 14:43:17 PDT 2004


Hi Dan,

Nice summary of the approach we proposed on the conference call. 
Thanks.  One of the issues I have been batting around is how distributed 
processes interact with the workflow director.  Most distributed systems 
(e.g., Web Services) have a stateless call/return model, and so they 
do not communicate with the calling node until they have completely 
finished the computation.  Grid services add state and lifetime 
management to this model, but only in a crude way: it's up to each 
service how it works, so there's no consistency in state management 
or communication.

I think an interesting extension would be to create a 'Kepler Grid 
Service' that is distinct from the vergil GUI and allows for distributed 
communication among running kepler nodes.  This is basically possible 
in Ptolemy now as long as you don't use graphical actors -- we would 
need to extend it so that graphical actors were permitted but 
communicated through another mechanism.  Over the short-term we could 
have it so that only non-graphical workflows (or sub-workflows) ran. 
And then we would need to embed it in a Grid Service framework.

What I have been thinking of is a Kepler Grid Service that contains the 
ptolemy runtime engine and is listening for events and data flows that 
are sent from a director on a controlling node.  The service would 
probably have its own director that knew how to communicate with the 
controlling director via the grid service or over another channel -- this 
needs more thought, but I think the following components would be involved:
* vergil would run on the client node and be the controller,
* a copy of the kepler runtime service would run on the local node for 
executing local tasks (e.g., graphical tasks) -- this essentially makes 
up the current kepler system, but refactored
* a copy of the kepler runtime service would run on one or more 
distributed nodes and would receive events and data from the controlling node
* actors and workflows could be passed around nodes in signed archive 
files (jars) that contain the java executable code, a moml description, 
and any semantic annotations -- this archive could be loaded at runtime 
by a custom classloader to make newly developed actors available on 
distributed nodes
* a security mechanism would need to be established to control who could 
run code on distributed nodes -- this would probably leverage the Grid 
GSI system of proxy certificates -- a sandbox excluding certain types of 
operations (or only permitting some) would also probably be needed
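The component layout above could be sketched as a minimal service interface. 
This is a hypothetical sketch only -- the names (KeplerRuntimeService, submit, 
pollEvents, LocalStubRuntime) are illustrative and are not existing Kepler or 
Ptolemy API; the stub just makes the call shape concrete:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for the proposed Kepler Grid Service.
interface KeplerRuntimeService {
    // Submit a sub-workflow as a MoML description plus a signed jar
    // carrying any actor classes the node does not already have.
    String submit(String momlDescription, byte[] signedActorJar);

    // Poll for events from a running workflow (a real service would
    // push these back to the controlling director instead).
    List<String> pollEvents(String runId);
}

// Toy in-memory stand-in for one distributed node.
class LocalStubRuntime implements KeplerRuntimeService {
    private final List<String> events = new ArrayList<>();

    public String submit(String moml, byte[] signedActorJar) {
        // A real node would verify the jar signature, load the actor
        // classes with a custom classloader, and start its own director.
        events.add("accepted: " + moml);
        events.add("finished: " + moml);
        return "run-1";
    }

    public List<String> pollEvents(String runId) {
        return new ArrayList<>(events);
    }
}
```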

A hard part would be separating out the Vergil UI from the Ptolemy 
runtime more completely (including for Graphical models).  UI events 
would need to be propagated back to the controlling vergil instance. 
You could even have a publish/subscribe sort of mechanism so that an 
authenticated user could log in with vergil on Node A, start a 
distributed run, and then log in at Node B and subscribe to the UI 
events for the workflow to monitor its progress.  Such a mechanism would 
also support multiple principals seeing (and possibly steering) the same 
workflow executions, so it would make a nice distance collaboration tool.
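The subscribe-from-anywhere idea boils down to fanning one run's UI events 
out to several subscribers. A minimal in-memory sketch (WorkflowEventBus and 
its methods are made-up names; a real system would use authenticated network 
channels, not shared lists):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event bus: each subscriber to a run gets its own inbox,
// and publishing an event delivers it to every inbox for that run.
class WorkflowEventBus {
    private final Map<String, List<List<String>>> subscribers = new HashMap<>();

    // Subscribe to a run (e.g., from vergil on Node B) and receive an
    // inbox that accumulates that run's UI events.
    List<String> subscribe(String runId) {
        List<String> inbox = new ArrayList<>();
        subscribers.computeIfAbsent(runId, k -> new ArrayList<>()).add(inbox);
        return inbox;
    }

    // Fan one UI event out to all subscribers of the run.
    void publish(String runId, String event) {
        for (List<String> inbox : subscribers.getOrDefault(runId, new ArrayList<>())) {
            inbox.add(event);
        }
    }
}
```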

What this whole system allows is tight semantic knowledge of the data 
flows because all running components, whether local or remote, would 
have the right semantic annotations associated and would be passing data 
under the control of a director.  Right now kicking off a web service in 
the middle of an execution completely steps outside the Director's 
knowledge of what is happening.  This would return much of the 
distributed computation back into the control of the director, so we 
might be able to get domains other than PN working with the distributed 
computations.  Which means much more integrated models and simulations 
could be run (e.g., ptolemy's continuous time) on a distributed network.

We would want to accommodate the 'third party transfer' idea so that data 
doesn't need to flow through the controller to get from one node to 
another in the distributed system.  Bertram's recent writeup of this 
outlines the issues there well.

Finally, I could also envision a peer-to-peer sort of architecture where 
the workflow isn't actually centrally controlled. Instead, a single 
vergil GUI kicks off a workflow, sending out subworkflows to distributed 
nodes, who in turn can send out sub-sub workflows to further nodes.  I'm 
not sure how an optimizer and scheduler would interact with such a 
system (the computational power vs data transfer tradeoff can be 
significant), but it's interesting to think about.

These are just some initial thoughts I am throwing out there seeing as 
you brought it up.  Our initial prototype of a distributed system should 
be far simpler, especially for our Oct 15 release. But I wanted to throw 
out some more information so you could chew on it.  We should probably 
make this into a Wiki page at some point so that it can become a living 
design document -- email threads eventually get lost.

Matt


Dan Higgins wrote:
> Hi All,
> 
>    Some thoughts related to the question of multiple species 
> calculations are below:
> 
> Environmental Niche Modeling (ENM/GARP) Pipeline - Repeated Calculations
> 
> 1) For a given species, DIGR actor should determine known distribution. 
> This is input into the ENM/GARP/OpenModeler to create a set of 'rules' 
> for species distribution depending on current environmental factors as a 
> function of geographic location. Reportedly, the desire is to run the 
> calculation ~500 times for a given species and then save ~20 of the 
> 'best' rule sets (based on genetic algorithms).
> 2) This set of 500 runs/species requires about 1 day of computer time on 
> a typical desktop PC.
> 
> 3) Once the 20 or so rule sets per species exist, alternate 
> environmental layers (~5) can be evaluated. These calculations require 
> roughly 1 minute each.
> 
> 4) There is a desire to run the ENM pipeline for a number of species. If 
> the number of species is ~1000, then approximately 1000 computer-days 
> would be needed (~3 years)! This obviously implies that the calculations 
> should be done in parallel on multiple computers.
> 
> It appears that the workflow 
> for each species is complete by itself, depending only on the species 
> name. One could thus just iterate over a species list (or, equivalently, 
> apply a 'map' function to the calculation of a single species). Summary 
> calculations would then be applied to the overall set of results.
> 
> So assume that the environment layers are stored on the ecogrid and a 
> list of all species to be considered are also stored. A number of 
> independent, parallel calculations could be carried out with a single 
> species pipeline that reads the species list, removes a species name 
> that it is going to run a calculation for, downloads the environment 
> layers, executes the 500 runs for that species, and then stores the 
> resulting rule set on the ecogrid. (If the run were not completed, the 
> species name would be added back to the ecogrid list so some other 
> computer could run it.) This would allow any number of parallel 
> computations.
> 
> After a number of species were considered, a separate workflow could 
> retrieve the collection of stored GARP/OM rule sets for carrying out 
> whatever summary calculations are desired.
> 
> So I suggest that we modify the Full_GARP/OM workflow to only work with 
> a single species and add I/O from the Ecogrid to get species/environment 
> layers and save results. A second summary workflow would then be created 
> (perhaps later, after a number of individual species calculations).
> 
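The claim-and-requeue pattern Dan describes (take a species name off the 
shared list, run it, and put it back if the run fails) could be sketched as 
follows. SpeciesWorkQueue, claim, and requeue are made-up names, and an 
in-memory queue stands in for the species list stored on the ecogrid:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical work queue over the shared species list.
class SpeciesWorkQueue {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();

    SpeciesWorkQueue(Iterable<String> speciesList) {
        for (String s : speciesList) pending.add(s);
    }

    // A worker removes one species name to claim it for the 500-run
    // calculation; returns null when the list is empty.
    String claim() {
        return pending.poll();
    }

    // If the run did not complete, the species name goes back on the
    // list so some other computer can pick it up.
    void requeue(String species) {
        pending.add(species);
    }
}
```

Because each per-species workflow depends only on the species name, any 
number of such workers can pull from the list in parallel.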

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------


