[kepler-dev] [Bug 3173] - Improve data frame handling between RExpression actors
higgins at nceas.ucsb.edu
Fri Mar 14 11:28:14 PDT 2008
After thinking over Jim's comments, I decided that he definitely has
several good points, and I have changed my mind about the serialization
of arbitrary R objects - I now agree that this addition is a nice
With respect to saving workspaces in a '.RData' file, I assume that this
uses exactly the same serialization to a file as was added to
RExpression. So what is the difference? The code that was added to
RExpression writes a file BUT THE NAME OF THE FILE IS DIFFERENT EVERY
TIME! That small difference is what avoids the problems that Jim
described. I ran into lots of problems in the past with passing data in
a file, most of them related to re-using the same file name.
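A minimal sketch of the unique-file-name idea follows. It is not the actual RExpression code: the helper names write_object and read_object, the "rexpr_" prefix and the ".sav" extension are made up for illustration. It only assumes base R's tempfile(), serialize() and unserialize().

```r
# Hypothetical helpers (not the RExpression source): pass one object through
# a file whose name is freshly generated for every transfer.
write_object <- function(obj) {
  path <- tempfile(pattern = "rexpr_", fileext = ".sav")  # new name on each call
  con <- file(path, open = "wb")
  on.exit(close(con))
  serialize(obj, con)   # write exactly this one object, nothing else
  path                  # the downstream actor is handed only this path
}

read_object <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  unserialize(con)
}

# A complex object, like the lm() result mentioned in the bug report.
fit <- lm(dist ~ speed, data = cars)

p1 <- write_object(fit)
p2 <- write_object(fit)       # a second transfer gets a different file
restored <- read_object(p1)
```

Because a file is only ever read through the path that the writer returned, a leftover file from an earlier run cannot be picked up by accident.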
I would like to point out that Kepler/PTII is designed so that ALL data
need NOT be passed through ports and connections. In particular,
parameters on the workflow screen are directly available to all actors
also on that screen (and to actors inside composites). These parameters
are a form of 'global' variable, like the globals in other languages
that need not be passed as input to functions/subroutines. BEAM_4_1.xml
passes data between the R actors this way and uses the connections just
to determine the sequence of firing. This is problematic because the
data is not static like normal parameters. [Actually the data is static
- it is just the name '.RData' - but this is a reference to a file that
is changing!]
One controversial question from someone who has spent a lot of time
working on Kepler - If a workflow is all R actors, why put it in Kepler?
An R script itself documents the actions just as completely as adding
this script inside a Kepler workflow. The BEAM_4_1.xml workflow shows
that one can connect R actors, but why bother? [I was asked this at a
training session and I don't have a good answer.]
Jim Regetz wrote:
> A few more thoughts from this Johnny-come-lately. To clarify the
> subject line, this is really now about _arbitrary R object_ handling
> between RExpression actors, not just dataframes.
> I'm happy to continue the debate about whether Kepler should provide
> an explicit mechanism for R actors to faithfully exchange arbitrary R
> objects. But for the purposes of this message, I'm taking it as given
> that many real-world workflows involving multiple R actors will
> require this. The ones I would like to develop sure do! And we even
> have at least one example among the demos: BEAM_4_1.xml.
> Lacking alternatives, I've previously used a workaround similar to
> what appears in BEAM_4_1.xml: saving the R workspace in one actor,
> then reloading it in a subsequent actor.
> However, I see a major risk associated with transferring data via
> workspaces. I can illustrate this with the following example involving
> two R actors, the first one set to save the R workspace, and the
> second one containing a load(".RData") statement at the start of the R
> script.
> Joe Scientist executes the workflow:
> 1. Actor 1 fires up R
> a. R script creates object x
> b. When script completes, object x is saved to .RData file
> 2. Actor 2 fires up R
> a. R script loads .RData file, which includes x
> b. R script does something with object x
> All is well. But now Joe modifies the workflow, unwittingly doing
> something that will generate an R error in Actor 1. Maybe he
> introduces a bug into the R script, but maybe he just changes an
> upstream parameter to an inappropriate value that gets passed into the
> R actor.
> Joe now executes the workflow a second time:
> 1. Actor 1 fires up R
> a. R script may or may not create object x (depends on the error)
> b. Error occurs, halting R execution. Workspace is *not* saved.
> 2. Actor 2 fires up R
> a. The *original* .RData file is loaded, with the *old* object x
> b. Script now works on *old* object x
> In the best case, Joe gets lucky and notices that the new workflow
> output is the same as the old, when in fact he expected it to be
> different. In the worst case, Joe assumes the workflow executed
> properly, and does something embarrassing with his erroneous results.
> Of course, the story would be even more complicated in a large
> workflow with many R actors saving and restoring workspaces. Who knows
> what the .RData file actually contains at any given point along the way?
> IMHO, this is a Bad Thing. I think loading .RData files should be
> recommended against in workflows (and I'm glad the --no-restore option
> is currently enforced in the R actors).
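The stale-workspace scenario above can be reproduced outside Kepler with a few lines of plain R. This is only a sketch under assumptions: actor1 and its scale parameter are hypothetical stand-ins for the first R actor and an upstream parameter, and a file in tempdir() stands in for the shared .RData file.

```r
ws <- file.path(tempdir(), "demo.RData")   # stand-in for the shared .RData

# Hypothetical "Actor 1": fails on a bad upstream parameter, and in that
# case never reaches the save() call.
actor1 <- function(scale) {
  if (scale <= 0) stop("scale must be positive")
  x <- 10 * scale
  save(x, file = ws)
}

# First execution: all is well, x = 20 is written to the workspace file.
actor1(2)

# Second execution: the parameter is now inappropriate, R halts with an
# error, and the workspace file is left untouched.
run2 <- try(actor1(-1), silent = TRUE)

# "Actor 2" has no way to know that; it loads whatever file is there.
env <- new.env()
load(ws, envir = env)
stale_x <- get("x", envir = env)   # still the value from the first execution
```

Nothing in the second actor signals that x belongs to an earlier run, which is the silent failure described above.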
> While I'm at it, two other disadvantages of passing data via workspaces:
> - It doesn't use ports & connectors. The only way to keep up the
> workflow illusion is to use dummy ports/connectors, which seems to be
> what BEAM_4_1.xml does. And it really is an illusion, as Kepler
> has no control over what's going on.
> - Object names will necessarily be the same in both the upstream and
> downstream actors, breaking the modularity.
> Seems to me the new serialization approach is an improvement on all
> counts: specific R objects are bound to specific output ports during
> execution (and not passed if they are not created, or if there is an
> error), communication to downstream actors requires a connector, and
> object names are happily decoupled between actors.
> Lastly, I do agree with Dan that it's good for actors to output data
> in an application-neutral (or at least Kepler-general) form when
> appropriate. But that won't always suffice. Moreover, adding this R
> serialization capability hasn't actually diminished Kepler's
> usefulness at all. If someone wants to save a dataframe to a tabular
> ASCII format that can be ingested by another actor, they can do so
> using another R actor that invokes write.table() or some such. In
> fact, I would humbly suggest that exposing this step on the canvas is
> an improvement! Aren't the data translation steps as much a part of
> the actual scientific workflow as anything else?
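As a rough illustration of that explicit translation step, an R actor's script could write a dataframe as tab-delimited ASCII along these lines. The dataframe contents, the tab separator and the temporary output file are assumptions made for the example, not anything prescribed by Kepler.

```r
# Example data only; any dataframe produced upstream would do.
df <- data.frame(site = c("A", "B"), count = c(3L, 7L))

# Write plain tab-delimited text that a non-R actor could parse.
out <- tempfile(fileext = ".txt")
write.table(df, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# What another actor would see, line by line.
lines <- readLines(out)

# Reading it back in R, just to show the round trip.
back <- read.table(out, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
```

The text form keeps only what a table can express; attributes of richer R objects would not survive it, which is where the serialized form remains useful.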
> ben leinfelder wrote:
>> That's certainly a good point - and one that Matt and Jim (Regetz)
>> and I were discussing at some length yesterday.
>> I think the end goal is to have a true dataframe type within Kepler,
>> but until that time it seems reasonable that dataframes (and other
>> complex R objects) can only be passed between R actors
>> (automagically). If a workflow author wants to pass the object to
>> non-R actors then they can do that particular [text serialization]
>> within their custom R script. Reasonable?
>> On Mar 13, 2008, at 8:41 AM, Dan Higgins wrote:
>>> Ben (and others)
>>> I deliberately avoided the R serialization approach when I
>>> originally implemented the RExpression actor because the serialized
>>> dataframe cannot be accessed by non-R actors, while the text file
>>> table that was previously written could be read by other Kepler
>>> actors. I fear making Kepler too R-centric will lessen its general
>>> Dan Higgins
>>> bugzilla-daemon at ecoinformatics.org wrote:
>>>> leinfelder at nceas.ucsb.edu changed:
>>>>            What|Removed   |Added
>>>>          Status|NEW       |RESOLVED
>>>>      Resolution|          |FIXED
>>>> ------- Comment #2 from leinfelder at nceas.ucsb.edu 2008-03-12 23:57 -------
>>>> The R actor now uses the serialize/unserialize method described by Jim
>>>> for data frames and also for other complex R objects (the result of
>>>> lm(), for example).
>>>> The sample workflow mentioned in this bug also works (due to previous
>>>> changes to support floating point numbers).
>>>> Kepler-dev mailing list
>>>> Kepler-dev at ecoinformatics.org