[kepler-dev] [Bug 3173] - Improve data frame handling between RExpression actors
regetz at nceas.ucsb.edu
Thu Mar 13 17:30:57 PDT 2008
A few more thoughts from this Johnny-come-lately. To clarify the subject
line, this is now really about _arbitrary R object_ handling between
RExpression actors, not just dataframes.
I'm happy to continue the debate about whether Kepler should provide an
explicit mechanism for R actors to faithfully exchange arbitrary R
objects. But for the purposes of this message, I'm taking it as given
that many real-world workflows involving multiple R actors will require
this. The ones I would like to develop sure do! And we even have at
least one example among the demos: BEAM_4_1.xml.
Lacking alternatives, I've previously used a workaround similar to what
appears in BEAM_4_1.xml: saving the R workspace in one actor, then
reloading it in a subsequent actor.
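Concretely, the workaround amounts to something like this (a minimal
sketch; the object and file names are illustrative, not the actual
BEAM_4_1.xml contents, and in the actor the save normally happens via
the "save workspace" option rather than an explicit call):

```r
## Actor 1's R script: create an object, then save the whole workspace.
## Calling save.image() directly just makes the mechanics explicit.
x <- lm(mpg ~ wt, data = mtcars)
save.image(".RData")

## Actor 2's R script: restore the workspace at the top, then use x.
load(".RData")
summary(x)
```

Note that save.image() dumps *everything* in the session, not just the
objects the downstream actor actually needs.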
However, I see a major risk associated with transferring data via
workspaces. I can illustrate this with the following example involving
two R actors, the first one set to save the R workspace, and the second
one containing a load(".RData") statement at the start of the R script.
Joe Scientist executes the workflow:
1. Actor 1 fires up R
a. R script creates object x
b. When script completes, object x is saved to .RData file
2. Actor 2 fires up R
a. R script loads .RData file, which includes x
b. R script does something with object x
All is well. But now Joe modifies the workflow, unwittingly doing
something that will generate an R error in Actor 1. Maybe he introduces
a bug into the R script, but maybe he just changes an upstream parameter
to an inappropriate value that gets passed into the R actor.
Joe now executes the workflow a second time:
1. Actor 1 fires up R
a. R script may or may not create object x (depends on the error)
b. Error occurs, halting R execution. Workspace is *not* saved.
2. Actor 2 fires up R
a. The *original* .RData file is loaded, with the *old* object x
b. Script now works on *old* object x
In the best case, Joe gets lucky and notices that the new workflow
output is the same as the old, when in fact he expected it to be
different. In the worst case, Joe assumes the workflow executed
properly, and does something embarrassing with his erroneous results. Of
course, the story would be even more complicated in a large workflow
with many R actors saving and restoring workspaces. Who knows what the
.RData file actually contains at any given point along the workflow?
IMHO, this is a Bad Thing. I think loading .RData files should be
discouraged in workflows (and I'm glad the --no-restore option is
currently enforced in the R actors).
While I'm at it, two other disadvantages of passing data via workspaces:
- It doesn't use ports & connectors. The only way to keep up the
workflow illusion is to use dummy ports/connectors, which seems to be
what BEAM_4_1.xml does. And it really is an illusion, as Kepler has no
control over what's going on.
- Object names must necessarily be the same in both the upstream and
downstream actors, breaking modularity.
Seems to me the new serialization approach is an improvement on all
counts: specific R objects are bound to specific output ports during
execution (and not passed if they are not created, or if there is an
error), communication to downstream actors requires a connector, and
object names are happily decoupled between actors.
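In code, my understanding of the new mechanism is roughly the following
(a sketch only; the actor generates the file names and port bindings
itself, so the script author never writes these calls, and "port_x.sav"
is a made-up name):

```r
## Upstream actor: the object bound to output port "x" gets serialized.
x <- lm(mpg ~ wt, data = mtcars)
con <- file("port_x.sav", "wb")   # hypothetical file name chosen by the actor
serialize(x, con)
close(con)

## Downstream actor: the connector delivers the file, and the local
## name is whatever the script author chooses -- decoupled from 'x'.
con <- file("port_x.sav", "rb")
model <- unserialize(con)
close(con)
summary(model)
```

Because serialize()/unserialize() round-trip the object exactly, the
downstream actor sees a faithful copy, not a lossy text rendering.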
Lastly, I do agree with Dan that it's good for actors to output data in
an application-neutral (or at least Kepler-general) form when
appropriate. But that won't always suffice. Moreover, adding this R
serialization capability hasn't actually diminished Kepler's usefulness
at all. If someone wants to save a dataframe to a tabular ASCII format
that can be ingested by another actor, they can do so using another R
actor that invokes write.table() or some such. In fact, I would humbly
suggest that exposing this step on the canvas is an improvement! Aren't
the data translation steps as much a part of the actual scientific
workflow as anything else?
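For instance, the translation step could be an R actor containing little
more than the following (a sketch; the file name and options are up to
the workflow author):

```r
## Translation actor: emit the dataframe as tab-delimited ASCII that
## non-R actors can ingest.
write.table(mtcars, file = "mtcars.txt", sep = "\t",
            quote = FALSE, row.names = TRUE)

## The reverse direction, for completeness: text back to a dataframe.
## (With row names written above, read.table detects them because the
## header row has one fewer field than the data rows.)
df <- read.table("mtcars.txt", header = TRUE, sep = "\t")
```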
ben leinfelder wrote:
> That's certainly a good point - and one that Matt and Jim (Regetz) and
> I were discussing at some length yesterday.
> I think the end goal is to have a true dataframe type within Kepler,
> but until that time it seems reasonable that dataframes (and other
> complex R objects) can only be passed between R actors
> (automagically). If a workflow author wants to pass the object to non-
> R actors then they can do that particular [text serialization] within
> their custom R script. Reasonable?
> On Mar 13, 2008, at 8:41 AM, Dan Higgins wrote:
>> Ben (and others)
>> I deliberately avoided the R serialization approach when I
>> originally implemented the RExpression actor because the serialized
>> dataframe cannot be accessed by non-R actors, while the text file
>> table that was previously written could be read by other Kepler
>> actors. I fear making Kepler too R-centric will lessen its general
>> usefulness.
>> Dan Higgins
>> bugzilla-daemon at ecoinformatics.org wrote:
>>> leinfelder at nceas.ucsb.edu changed:
>>> What |Removed |Added
>>> Status|NEW |RESOLVED
>>> Resolution| |FIXED
>>> ------- Comment #2 from leinfelder at nceas.ucsb.edu 2008-03-12 23:57
>>> The R actor now uses the serialize/unserialize method described by
>>> Jim for data frames and also for other complex R objects (the result
>>> of lm(), for example).
>>> The sample workflow mentioned in this bug also works (due to
>>> previous changes to support floating point numbers)
>>> Kepler-dev mailing list
>>> Kepler-dev at ecoinformatics.org