[kepler-users] Kepler and R (bioinformatics)

Matthew Jones jones at nceas.ucsb.edu
Mon Oct 29 12:26:17 PDT 2007


Some additional thoughts on R and Kepler...

I'm one of the people who complains about the 'out-of-band' passing of
data via the R workspace.  My main issue is this -- Kepler is being
engineered to allow components to be run in many environments.  One of
them is a distributed mode where actors might be moved to a completely
different host and run remotely, with their inputs and outputs being
properly redirected by the system.  If an RExpression actor passes data
out-of-band (using mechanisms other than the established ports and
parameters, such as files or the R workspace), then the distributed
execution system does not know that this data needs to be moved and
staged, and the workflow execution will fail.  Thus, it is only through
consistent use of the published I/O interfaces that we can reliably
develop workflows that can run in multiple execution contexts.  This is
why I think passing all data through ports is important.
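
A minimal sketch of the staging problem described above, in Java since
Kepler itself is written in Java.  It is only an illustration: the class
StagingSketch and the method unstaged() are hypothetical names and are
not part of Kepler or of any distributed execution system.  It assumes a
scheduler that plans transfers only for data announced on declared output
ports, so anything an actor leaves in a file or in the R workspace never
reaches the remote host.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class StagingSketch {

    /**
     * Returns the items a downstream actor needs that were never declared
     * on an output port. A scheduler that only sees ports cannot stage these.
     */
    static Set<String> unstaged(Collection<String> declaredPortOutputs,
                                Collection<String> neededDownstream) {
        Set<String> missing = new LinkedHashSet<>(neededDownstream);
        missing.removeAll(declaredPortOutputs);
        return missing;
    }

    public static void main(String[] args) {
        // Out-of-band style: only a trigger goes through the port,
        // the real data sits in a saved workspace file.
        System.out.println(unstaged(
                Arrays.asList("trigger"),
                Arrays.asList("trigger", ".RData")));   // prints [.RData]

        // Port style: everything the next actor needs is declared.
        System.out.println(unstaged(
                Arrays.asList("trigger", "fitted"),
                Arrays.asList("trigger", "fitted")));   // prints []
    }
}
```

In the first call the workspace file is exactly the piece a remote host
would be missing; in the second nothing is left behind.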

That said, it would also be nice to be able to use R fully, including
passing some of R's more complex, customized data structures.  A new
TokenType in Kepler for 'RObject' could encapsulate such a structure as
a binary object in the workflow and pass it from one R actor to another
without modification.  That would let us use the more complex data
structures, but it would of course only work with another RActor
downstream, as that is likely the only actor that would know how to
process this custom RObject TokenType.  It would mean, though, that
everything could be passed via ports, eliminating out-of-band transfers.
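
One way the opaque-token idea might look, as a rough Java sketch.  The
class RObjectTokenSketch is hypothetical and does not extend or imitate
any real Kepler or Ptolemy token class.  It assumes the bytes come from
something like R's serialize() on the producing side and are handed to
unserialize() on the consuming side; the sketch itself never interprets
them.

```java
import java.util.Arrays;

/** Hypothetical immutable wrapper for an R object carried as opaque bytes. */
public final class RObjectTokenSketch {

    private final String rClass;   // informational label only, e.g. "lm"
    private final byte[] payload;  // opaque bytes; only an R actor can decode them

    public RObjectTokenSketch(String rClass, byte[] payload) {
        this.rClass = rClass;
        this.payload = payload.clone();   // defensive copy keeps the token immutable
    }

    public String rClass() {
        return rClass;
    }

    /** Returns a copy so callers cannot alter the token after it is sent. */
    public byte[] payload() {
        return payload.clone();
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof RObjectTokenSketch)) {
            return false;
        }
        RObjectTokenSketch that = (RObjectTokenSketch) other;
        return rClass.equals(that.rClass) && Arrays.equals(payload, that.payload);
    }

    @Override
    public int hashCode() {
        return 31 * rClass.hashCode() + Arrays.hashCode(payload);
    }

    public static void main(String[] args) {
        byte[] fakeSerializedModel = {0x58, 0x0a, 0x00};  // placeholder bytes, not real R output
        RObjectTokenSketch token = new RObjectTokenSketch("lm", fakeSerializedModel);
        // An intermediate, non-R actor can forward the token unchanged.
        System.out.println(token.rClass() + " carrying " + token.payload().length + " bytes");
    }
}
```

Because the payload is copied on the way in and on the way out, the token
can travel through a port, or across hosts, without any actor in between
needing to understand R.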

It would also be good to be able to chain R actors together without
having to start, execute, and stop each one in its own R session, as this
is very inefficient.  I did a little experimentation with the JRI
library to start an R session in Kepler and keep it running for multiple
actors to use.  JRI also provides more robust data
marshalling/unmarshalling between R and Java data structures.  I think
this might work better than our current approach, but it would need more
experimentation.
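
The shared-session idea can be sketched roughly as follows.  This is not
the experiment described above and it does not call JRI.  RBackend,
SharedSession, and RecordingBackend are hypothetical names, and the
recording backend merely stands in for an embedded R engine so that the
example runs without R installed.  The assumption is that each actor
acquires the session before firing and releases it when the workflow
finishes, so R starts once and stops once.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedSessionSketch {

    /** Hypothetical abstraction over an embedded R engine. */
    interface RBackend {
        void start();
        String eval(String code);
        void stop();
    }

    /** Starts the backend on the first acquire and stops it on the last release. */
    static final class SharedSession {
        private final RBackend backend;
        private int users = 0;

        SharedSession(RBackend backend) {
            this.backend = backend;
        }

        synchronized void acquire() {
            if (users == 0) {
                backend.start();
            }
            users++;
        }

        synchronized String eval(String code) {
            if (users == 0) {
                throw new IllegalStateException("session not acquired");
            }
            return backend.eval(code);
        }

        synchronized void release() {
            if (users == 0) {
                throw new IllegalStateException("release without acquire");
            }
            users--;
            if (users == 0) {
                backend.stop();
            }
        }
    }

    /** Stand-in backend that only records lifecycle events. */
    static final class RecordingBackend implements RBackend {
        final List<String> log = new ArrayList<>();

        public void start() { log.add("start"); }
        public String eval(String code) { log.add("eval " + code); return "ok"; }
        public void stop() { log.add("stop"); }
    }

    public static void main(String[] args) {
        RecordingBackend backend = new RecordingBackend();
        SharedSession session = new SharedSession(backend);

        session.acquire();              // first actor: backend starts
        session.acquire();              // second actor: reuses the running session
        session.eval("x <- 1:10");
        session.release();
        session.eval("y <- mean(x)");   // state from the first eval is still there
        session.release();              // last actor done: backend stops

        System.out.println(backend.log);
    }
}
```

Running it shows one start and one stop around both evaluations, which is
the saving over launching R once per actor.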

My 0.02...

Matt

Dan Higgins wrote:
> Hi Bruno,
> 
>     I am the person who wrote the existing version of the RExpression 
> actor in Kepler, so I can probably supply some answers for your 
> questions ;-)
> 
>    There are a number of limitations in the current implementation on 
> using multiple R actors in Kepler. First of all, each R actor runs in a 
> separate process. In other words, R is started up, the commands in the 
> actor R-script execute, and R then shuts down. If you create an output 
> port that has the same name as an R variable, the R variable's value is 
> 'translated' to a Kepler variable (with limits discussed below) and sent 
> through the port to any connected actor.
> 
>    This output of values is limited to a few simple R data structures due 
> to problems of converting arbitrary R classes to some equivalent in 
> Kepler. Strings, numbers, arrays, matrices, and dataframes can be 
> output, but arbitrary classes cannot. If you check out the nightly build 
> zip file of Kepler (which is a build of the latest version), you will 
> see a workflow called $Kepler/demos/R/ReadTable.xml that demos 
> passing a dataframe from one R actor to another.
> 
>     As I said, by default one R actor does not inherit the workspace of 
> another. However, there is a parameter of the RExpression actor called 
> 'save or Not' that has the default value of '--no-save'. This can be 
> changed to '--save'. Doing so saves the R workspace. Another actor can 
> be connected to this first actor and the script of the second actor set 
> to load the saved workspace. Thus one can pass a complete R workspace 
> from one RExpression actor to another. A series of 4 R actors use this 
> technique in the example workflow "$Kepler/demos/R/BEAM_$_1.xml". In 
> this approach, data is NOT passed through the ports; instead, an output 
> port just passes a 'trigger' to the input port of the next RExpression 
> actor and this trigger controls the sequencing of how the actors fire. 
> Some people have complained that this technique 'subverts' the basic 
> Kepler concept of passing all data through the ports. It also does 
> create some potential problems with workflows where actors are not 
> configured to fire sequentially. Nevertheless, it does allow modularized 
> R scripts to be used in a sequential manner.
> 
>     Hopefully these comments will be useful. Let me know if you have 
> additional questions.
> 
> Dan Higgins
> NCEAS/Ecoinformatics
> 
> 
> ------------------------
> 
> Bruno Yoshimura wrote:
>> I am writing a paper about Visual Programming for bioinformatics and 
>> it would be wonderful if I could find somebody who knows how to use R 
>> with Kepler. The main objective of this study is to decide whether it 
>> is worth using Kepler in this area.
>>
>> More specifically, I would like to know the challenges and problems in 
>> using R inside Kepler. I had some problems trying to run some R 
>> Expressions, and I could not find a solution. The problem is the 
>> following: When I try to link two "R Expressions" ("a" and "b"), the R 
>> Expression "b" does not inherit the classes and imports of the R 
>> Expression "a" (using the ports).
>>
>> Is it possible to link two or more R Expressions? How should I do it?
>> Is it possible to send an entire class using ports?
>>
>> Thanks,
>> Bruno
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Kepler-users mailing list
>> Kepler-users at ecoinformatics.org
>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-users
>>   
> 
> 
> -- 
> *******************************************************************
> Dan Higgins                                  higgins at nceas.ucsb.edu
> http://www.nceas.ucsb.edu/    Ph: 805-893-5127
> National Center for Ecological Analysis and Synthesis (NCEAS)
> Marine Science Building - Room 3405
> Santa Barbara, CA 93195
> *******************************************************************
> 
> 


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matthew B. Jones
Director of Informatics Research and Development
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara
jones at nceas.ucsb.edu                       Ph: 1-907-523-1960
http://www.nceas.ucsb.edu/ecoinfo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


