[kepler-dev] [Bug 3173] - Improve data frame handling between RExpression actors
regetz at nceas.ucsb.edu
Thu Mar 13 17:30:57 PDT 2008
A few more thoughts from this Johnny-come-lately. To clarify the subject
line, this is now really about _arbitrary R object_ handling between
RExpression actors, not just dataframes.
I'm happy to continue the debate about whether Kepler should provide an
explicit mechanism for R actors to faithfully exchange arbitrary R
objects. But for the purposes of this message, I'm taking it as given
that many real-world workflows involving multiple R actors will require
this. The ones I would like to develop sure do! And we even have at
least one example among the demos: BEAM_4_1.xml.
Lacking alternatives, I've previously used a workaround similar to what
appears in BEAM_4_1.xml: saving the R workspace in one actor, then
reloading it in a subsequent actor.
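Concretely, the workaround amounts to something like this (a minimal
sketch; the object and file names are illustrative, not the actual
BEAM_4_1.xml contents, and in the actor the save normally happens via
the "save workspace" option rather than an explicit call):

```r
## Actor 1's R script: create an object, then save the whole workspace.
## Calling save.image() directly just makes the mechanics explicit.
x <- lm(mpg ~ wt, data = mtcars)
save.image(".RData")

## Actor 2's R script: restore the workspace at the top, then use x.
load(".RData")
summary(x)
```

Note that save.image() dumps *everything* in the session, not just the
objects the downstream actor actually needs.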
However, I see a major risk associated with transferring data via
workspaces. I can illustrate this with the following example involving
two R actors, the first one set to save the R workspace, and the second
one containing a load(".RData") statement at the start of the R script.
Joe Scientist executes the workflow:
1. Actor 1 fires up R
a. R script creates object x
b. When script completes, object x is saved to .RData file
2. Actor 2 fires up R
a. R script loads .RData file, which includes x
b. R script does something with object x
All is well. But now Joe modifies the workflow, unwittingly doing
something that will generate an R error in Actor 1. Maybe he introduces
a bug into the R script, but maybe he just changes an upstream parameter
to an inappropriate value that gets passed into the R actor.
Joe now executes the workflow a second time:
1. Actor 1 fires up R
a. R script may or may not create object x (depends on the error)
b. Error occurs, halting R execution. Workspace is *not* saved.
2. Actor 2 fires up R
a. The *original* .RData file is loaded, with the *old* object x
b. Script now works on *old* object x
In the best case, Joe gets lucky and notices that the new workflow
output is the same as the old, when in fact he expected it to be
different. In the worst case, Joe assumes the workflow executed
properly, and does something embarrassing with his erroneous results. Of
course, the story would be even more complicated in a large workflow
with many R actors saving and restoring workspaces. Who knows what the
.RData file actually contains at any given point along the workflow?
IMHO, this is a Bad Thing. I think loading .RData files should be
discouraged in workflows (and I'm glad the --no-restore option is
currently enforced in the R actors).
While I'm at it, two other disadvantages of passing data via workspaces:
- It doesn't use ports & connectors. The only way to keep up the
workflow illusion is to use dummy ports/connectors, which seems to be
what BEAM_4_1.xml does. And it really is an illusion, as Kepler has no
control over what's going on.
- Object names must necessarily be the same in both the upstream and
downstream actors, breaking modularity.
Seems to me the new serialization approach is an improvement on all
counts: specific R objects are bound to specific output ports during
execution (and not passed if they are not created, or if there is an
error), communication to downstream actors requires a connector, and
object names are happily decoupled between actors.
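In code, my understanding of the new mechanism is roughly the following
(a sketch only; the actor generates the file names and port bindings
itself, so the script author never writes these calls, and "port_x.sav"
is a made-up name):

```r
## Upstream actor: the object bound to output port "x" gets serialized.
x <- lm(mpg ~ wt, data = mtcars)
con <- file("port_x.sav", "wb")   # hypothetical file name chosen by the actor
serialize(x, con)
close(con)

## Downstream actor: the connector delivers the file, and the local
## name is whatever the script author chooses -- decoupled from 'x'.
con <- file("port_x.sav", "rb")
model <- unserialize(con)
close(con)
summary(model)
```

Because serialize()/unserialize() round-trip the object exactly, the
downstream actor sees a faithful copy, not a lossy text rendering.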
Lastly, I do agree with Dan that it's good for actors to output data in
an application-neutral (or at least Kepler-general) form when
appropriate. But that won't always suffice. Moreover, adding this R
serialization capability hasn't actually diminished Kepler's usefulness
at all. If someone wants to save a dataframe to a tabular ASCII format
that can be ingested by another actor, they can do so using another R
actor that invokes write.table() or some such. In fact, I would humbly
suggest that exposing this step on the canvas is an improvement! Aren't
the data translation steps as much a part of the actual scientific
workflow as anything else?
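For instance, the translation step could be an R actor containing little
more than the following (a sketch; the file name and options are up to
the workflow author):

```r
## Translation actor: emit the dataframe as tab-delimited ASCII that
## non-R actors can ingest.
write.table(mtcars, file = "mtcars.txt", sep = "\t",
            quote = FALSE, row.names = TRUE)

## The reverse direction, for completeness: text back to a dataframe.
## (With row names written above, read.table detects them because the
## header row has one fewer field than the data rows.)
df <- read.table("mtcars.txt", header = TRUE, sep = "\t")
```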
ben leinfelder wrote:
> That's certainly a good point - and one that Matt and Jim (Regetz) and
> I were discussing at some length yesterday.
> I think the end goal is to have a true dataframe type within Kepler,
> but until that time it seems reasonable that dataframes (and other
> complex R objects) can only be passed between R actors
> (automagically). If a workflow author wants to pass the object to non-
> R actors then they can do that particular [text serialization] within
> their custom R script. Reasonable?
> On Mar 13, 2008, at 8:41 AM, Dan Higgins wrote:
>> Ben (and others)
>> I deliberately avoided the R serialization approach when I
>> originally implemented the RExpression actor because the serialized
>> dataframe cannot be accessed by non-R actors, while the text file
>> table that was previously written could be read by other Kepler
>> actors. I fear making Kepler too R-centric will lessen its general
>> usefulness.
>> Dan Higgins
>> bugzilla-daemon at ecoinformatics.org wrote:
>>> leinfelder at nceas.ucsb.edu changed:
>>> What |Removed |Added
>>> Status|NEW |RESOLVED
>>> Resolution| |FIXED
>>> ------- Comment #2 from leinfelder at nceas.ucsb.edu 2008-03-12 23:57
>>> The R actor now uses the serialize/unserialize method described by
>>> Jim for data frames and also for other complex R objects (the result
>>> of lm(), for example).
>>> The sample workflow mentioned in this bug also works (due to
>>> previous changes to support floating point numbers)
>>> Kepler-dev mailing list
>>> Kepler-dev at ecoinformatics.org