[kepler-dev] [Bug 3173] - Improve data frame handling between RExpression actors

Jim Regetz regetz at nceas.ucsb.edu
Fri Mar 14 14:16:54 PDT 2008


Hi all,

Thanks for the feedback Dan!

Dan Higgins wrote:
> After thinking over Jim's comments, I decided that he definitely has 
> several good points, and I have changed my mind about the serialization 
> of arbitrary R objects - I now agree that this addition is a nice 
> improvement!
> 
> With respect to saving workspaces in a '.RData' file, I assume that this 
> uses exactly the same serialization to a file as was added to 
> RExpression. So what is the difference? The code that was added to 
> RExpression writes a file BUT THE NAME OF THE FILE IS DIFFERENT EVERY 
> TIME! That small difference is what avoids the problems that Jim 
> described. I ran into lots of problems in the past with passing data in 
> a file, most of them related to re-using the same file name.

Quite right, the serialize function and the save.image function (which R 
uses to save its workspace) both eventually call the same C function 
R_Serialize internally. So, the solutions are similar. As you mentioned, 
one nice aspect of the new approach is explicit control over the 
filename, though in principle the RExpression actor could handle this by 
explicitly calling save.image(file=<whatever>) when wrapping up rather 
than using the built-in 'R --save' mechanism. But I think another big 
advantage is something I vaguely alluded to yesterday: save.image() 
saves all objects with their names; subsequently calling load() makes 
the same named objects "reappear" in the workspace. In contrast, 
serialize() only saves the underlying object, not its bound name; 
calling unserialize() returns this unnamed object, which you must assign to a 
name if you want to retain it. I'd say the latter behavior is certainly 
preferable in the context of Kepler, insofar as it allows object names 
within a particular actor to be dictated by its ports rather than the 
upstream actor.
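
To make the distinction concrete, here is a minimal R sketch (the file 
names 'ws.RData' and 'obj.bin' are just placeholders for illustration):

  ## save.image() keeps the bound names; load() restores them as-is
  x <- data.frame(a = 1:3, b = letters[1:3])
  save.image(file = "ws.RData")
  rm(x)
  load("ws.RData")        # 'x' reappears under its original name

  ## serialize() writes only the object itself; the reader picks the name
  con <- file("obj.bin", "wb")
  serialize(x, con)
  close(con)
  con <- file("obj.bin", "rb")
  y <- unserialize(con)   # must be assigned; 'y' is whatever we choose
  close(con)

With serialize(), it's the downstream code (the assignment to 'y' above) 
that decides the name, which is exactly the sort of decoupling the ports 
give us.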

> I would like to point out that Kepler/PTII is designed so that ALL data 
> need NOT be passed through ports and connections. In particular, 
> parameters on the workflow screen are directly available to all actors 
> on that screen (and to actors inside composites). These parameters act 
> as a form of 'global' variable, much like globals in other languages 
> that need not be passed into functions/subroutines. BEAM_4_1.xml passes 
> data between the R actors this way and uses the connections just to 
> determine the sequence of firing. This is problematic because the data 
> is not static like normal parameters. [Actually, the data is static - 
> just the name '.RData' - but this is a reference to a file that is 
> changing!]

Thanks for this. I haven't completely worked through the implications 
yet, but I definitely want to think more about it!

> One controversial question from someone who has spent a lot of time 
> working on Kepler - If a workflow is all R actors, why put it in Kepler? 
> An R script itself documents the actions just as completely as adding 
> this script inside a Kepler workflow. The BEAM_4_1.xml workflow shows 
> that one can connect R actors, but why bother? [I was asked this at a 
> training session and I don't have a good answer.]

Shoot, as someone who has spent countless hours conversing with the R 
interpreter and sending it jobs, I totally understand where this 
question comes from! For my own purposes, there's usually not enough 
incentive to incur the various costs of Keplerizing an analysis that I 
can already run using R alone. Even when my workflow involves non-R 
procedures, I will often just use a shell script or make a system call 
from within R. But in cases like this, I reckon Kepler is not really 
"for me" -- it's for the non-programming collaborators, reviewers, and 
other folks who want to replicate, reuse, or even just understand the 
analysis. In that regard, two Kepler advantages pop to mind:

* Graphical depiction of workflow. Practically speaking, most 
non-programmers are not going to take the time to inspect code. Everyone 
can look at the graphical workflow and see what it does. Moreover, using 
composite actors provides a nice way to expose levels of complexity in a 
nested fashion.

* Consistent delivery platform. "Start Kepler, open the workflow file, 
and press the run button". Not bad! Yes, it can be almost as easy in R, 
but not quite. Now add in some complexities: a couple of adjustable 
model parameters, user-specified input/output file names, etc. Heck, 
even the need to rearrange some of the steps. R gets harder much faster 
for those who don't know it. And of course, the investment a user makes 
in learning how to manipulate and run this particular Kepler workflow 
will apply to any other type of workflow, involving R or not.

Beyond that, even for me there is the promise of cool Kepler features: 
Opportunity to incorporate non-R actors that do something better or 
faster. Ecogrid data access. Distributed execution. Probably more that 
I'm forgetting???

I realize I probably haven't said anything new here, but I'm glad Dan 
raised the question. Being on the front lines here at NCEAS, I 
definitely want to have a clear answer in my head!

Cheers,
Jim


> Dan Higgins
>
> Jim Regetz wrote:
>> A few more thoughts from this Johnny-come-lately. To clarify the 
>> subject line, this is really now about _arbitrary R object_ handling 
>> between RExpression actors, not just dataframes.
>>
>> I'm happy to continue the debate about whether Kepler should provide 
>> an explicit mechanism for R actors to faithfully exchange arbitrary R 
>> objects. But for the purposes of this message, I'm taking it as given 
>> that many real-world workflows involving multiple R actors will 
>> require this. The ones I would like to develop sure do! And we even 
>> have at least one example among the demos: BEAM_4_1.xml.
>>
>> Lacking alternatives, I've previously used a workaround similar to 
>> what appears in BEAM_4_1.xml: saving the R workspace in one actor, 
>> then reloading it in a subsequent actor.
>>
>> However, I see a major risk associated with transferring data via 
>> workspaces. I can illustrate this with the following example involving 
>> two R actors, the first one set to save the R workspace, and the 
>> second one containing a load(".RData") statement at the start of the R 
>> script.
>>
>> Joe Scientist executes the workflow:
>>  1. Actor 1 fires up R
>>   a. R script creates object x
>>   b. When the script completes, object x is saved to the .RData file
>>  2. Actor 2 fires up R
>>   a. R script loads the .RData file, which includes x
>>  b. R script does something with object x
>>
>> All is well. But now Joe modifies the workflow, unwittingly doing 
>> something that will generate an R error in Actor 1. Maybe he 
>> introduces a bug into the R script, but maybe he just changes an 
>> upstream parameter to an inappropriate value that gets passed into the 
>> R actor.
>>
>> Joe now executes the workflow a second time:
>>  1. Actor 1 fires up R
>>   a. R script may or may not create object x (depends on the error)
>>   b. Error occurs, halting R execution. Workspace is *not* saved.
>>  2. Actor 2 fires up R
>>   a. The *original* .RData file is loaded, with the *old* object x
>>   b. Script now works on *old* object x
>>
>> In the best case, Joe gets lucky and notices that the new workflow 
>> output is the same as the old, when in fact he expected it to be 
>> different. In the worst case, Joe assumes the workflow executed 
>> properly, and does something embarrassing with his erroneous results. 
>> Of course, the story would be even more complicated in a large 
>> workflow with many R actors saving and restoring workspaces. Who knows 
>> what the .RData file actually contains at any given point along the 
>> workflow?
>>
>> IMHO, this is a Bad Thing. I think loading .RData files in workflows 
>> should be discouraged (and I'm glad the --no-restore option is 
>> currently enforced in the R actors).
>>
>> While I'm at it, two other disadvantages of passing data via workspaces:
>>
>> - It doesn't use ports & connectors. The only way to keep up the 
>> workflow illusion is to use dummy ports/connectors, which seems to be 
>> what BEAM_4_1.xml does. And it really is an illusion, as Kepler 
>> has no control over what's going on.
>>
>> - Object names will necessarily be the same in both the upstream and 
>> downstream actors, breaking the modularity.
>>
>> Seems to me the new serialization approach is an improvement on all 
>> counts: specific R objects are bound to specific output ports during 
>> execution (and not passed if they are not created, or if there is an 
>> error), communication to downstream actors requires a connector, and 
>> object names are happily decoupled between actors.
>>
>> Lastly, I do agree with Dan that it's good for actors to output data 
>> in an application-neutral (or at least Kepler-general) form when 
>> appropriate. But that won't always suffice. Moreover, adding this R 
>> serialization capability hasn't actually diminished Kepler's 
>> usefulness at all. If someone wants to save a dataframe to a tabular 
>> ASCII format that can be ingested by another actor, they can do so 
>> using another R actor that invokes write.table() or some such. In 
>> fact, I would humbly suggest that exposing this step on the canvas is 
>> an improvement! Aren't the data translation steps as much a part of 
>> the actual scientific workflow as anything else?
>>
>> Cheers,
>> Jim
>>
>> ben leinfelder wrote:
>>> Dan,
>>> That's certainly a good point - and one that Matt and Jim (Regetz) 
>>> and  I were discussing at some length yesterday.
>>> I think the end goal is to have a true dataframe type within Kepler,  
>>> but until that time it seems reasonable that dataframes (and other  
>>> complex R objects) can only be passed between R actors  
>>> (automagically).  If a workflow author wants to pass the object to 
>>> non-R actors then they can do that particular [text serialization] 
>>> within their custom R script.  Reasonable?
>>>
>>> -ben
>>>
>>> On Mar 13, 2008, at 8:41 AM, Dan Higgins wrote:
>>>
>>>> Ben (and others)
>>>>
>>>>   I deliberately avoided the R serialization approach when I  
>>>> originally implemented the RExpression actor because the serialized  
>>>> dataframe cannot be accessed by non-R actors, while the text file  
>>>> table that was previously written could be read by other Kepler  
>>>> actors. I fear making Kepler too R-centric will lessen its general  
>>>> usefulness.
>>>>
>>>>
>>>> Dan Higgins
>>>>
>>>> bugzilla-daemon at ecoinformatics.org wrote:
>>>>> http://bugzilla.ecoinformatics.org/show_bug.cgi?id=3173
>>>>>
>>>>>
>>>>> leinfelder at nceas.ucsb.edu changed:
>>>>>
>>>>>           What    |Removed                     |Added
>>>>> ---------------------------------------------------------------------------- 
>>>>>
>>>>>             Status|NEW                         |RESOLVED
>>>>>         Resolution|                            |FIXED
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------- Comment #2 from leinfelder at nceas.ucsb.edu  2008-03-12 
>>>>> 23:57  -------
>>>>> The R actor now uses the serialize/unserialize method described by 
>>>>> Jim for data frames and also for other complex R objects (the 
>>>>> result of lm() for example).
>>>>>
>>>>> The sample workflow mentioned in this bug also works (due to 
>>>>> previous changes to support floating-point numbers).
>>>>
>>>
>>
> 

