[kepler-dev] [Bug 1342] - need R actor

Thu Mar 3 11:41:34 PST 2005

http://bugzilla.ecoinformatics.org/show_bug.cgi?id=1342


------- Additional Comments From higgins at nceas.ucsb.edu  2005-03-03 11:41 -------
    In his recent comments added to the R-bug in kepler/Bugzilla (#1342), Matt
suggests that we need to create some sort of interface definition language for R
scripts (a sort of WSDL for R). His suggestions triggered some additional
thoughts on the subject, which I will attempt to document here.

    Consider first R and just what data types it uses. In general, R is a
functional language that operates on named data structures. Probably the most
common data structure is the R vector, which is a sequence of numbers, booleans,
or strings. R is dynamically typed, meaning that the content type of a vector
need not be declared prior to use; the type is determined by the values. All
sorts of functions can be applied to these vectors, and the ability to apply
mathematical operations to a whole vector eliminates many of the explicit loops
required in lower-level languages (C, Java, etc.). Of course, R also has other
data types like the data frame, which is basically a collection of vectors
(think of columns in a table). Often, an analysis starts by importing a table
from a file into a data frame; individual columns are then pulled from the
data frame as vectors and operated on with R functions. (Incidentally, the R
Reference Manual Base Package is a 700-page book just documenting functions in
the R base package; R has a lot of functions!)

    A common R input is a vector (or table) that is assigned a name when
imported; that name is then passed to various functions. The actual input is
typically a string entered from the keyboard or a data file (table) read from
the file system.

    R output is usually text displayed on the screen or a plot created by R in a
graphics device. Functions often display some summary results; plots are the
results of specific functions and appear on 'graphics devices'. To export to
other systems, the plots generally need to be written to graphics files.

    So what type of ports would an R actor have if used in Kepler? Tables could
of course be referenced as filenames or text streams. But R vectors might well
be input as what are called 'arrays' in the Kepler/Ptolemy expression language.
And output ports would presumably be text strings or file names, or perhaps
Kepler arrays/records.

    Note that some of the Kepler token types are similar to R data types. The
Kepler array token corresponds to the R vector. A Kepler record can attach a
name to a Kepler array, so an array of Kepler records could be converted to/from
an R data frame.
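
    As a rough illustration of that correspondence, here is a minimal Python
sketch that renders array-like and record-like values as R source text. Python
lists and dicts merely stand in for Kepler array and record tokens, and the
helper names (r_literal, r_vector, r_data_frame) are made up for this example;
none of this is existing Kepler code.

```python
def r_literal(value):
    """Render one scalar as an R literal (assumed mapping, not Kepler's)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"no R mapping assumed for {type(value).__name__}")


def r_vector(values):
    """A Kepler-style array (here a Python list) becomes an R c(...) vector."""
    return "c(" + ", ".join(r_literal(v) for v in values) + ")"


def r_data_frame(columns):
    """Named arrays (here a dict of lists) become an R data.frame(...) call."""
    parts = [f"{name} = {r_vector(vals)}" for name, vals in columns.items()]
    return "data.frame(" + ", ".join(parts) + ")"
```

    For example, r_data_frame({"site": ["a", "b"], "count": [3, 5]}) produces
the text data.frame(site = c("a", "b"), count = c(3, 5)).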

    Now, an interesting 'feature' of Kepler (inherited from Ptolemy) is that one
can add ports to any(?) actor by simply using the 'Configure Ports' menu item of
the context (popup) menu. In
fact, adding ports to many actors is meaningless because the actor is not
written to respond to new ports! [And it has been noted that adding ports,
configuring the actor, changing the actor name, etc. should probably all be done
in a single dialog.]

    The one actor where adding a number of new input ports is very useful and
meaningful is the Expression actor. When new input ports are added and named,
the name can then be used in the 'expression' that appears inside the actor! The
expression is just a function that operates on the named input parameters and
creates a result. [Note that the Expression actor has a single output, although
this may be a vector, matrix, etc. One can add additional output ports to an
Expression actor, but they really are meaningless.]
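
    The idea of port names becoming variables in an expression can be sketched
in a few lines of Python. This only imitates the behaviour described above: the
function name evaluate_expression is hypothetical, and the expression here uses
Python syntax rather than the Ptolemy expression language.

```python
def evaluate_expression(expression, inputs):
    """Evaluate `expression` with each input port name bound to its value.

    `inputs` maps port names to the values that arrived on those ports.
    Built-ins are blanked out to keep the sketch small; this is not a
    security boundary.
    """
    return eval(expression, {"__builtins__": {}}, dict(inputs))
```

    With ports named a, b and offset, evaluate_expression("a * b + offset",
{"a": 2, "b": 3, "offset": 1}) produces the single result 7.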

    One might think of an R script as an extension of a Kepler expression.
Certainly the input data could be treated as in the Expression actor. Simply
create input ports, and give these ports a name and (perhaps) a type. For
example, Kepler arrays could be mapped directly to R vectors and the R vector
given the port name. That port name would be used in the R script to operate on
the input data. [Actually, the R actor could automatically examine the token
type to determine how to convert to an R type. A string could just be entered as
an R command; a fileParameter could be read as a data frame; an array as an R
vector.] Do we really need a special interface definition language to describe
the Kepler inputs to an R actor? My thought is 'No': the MoML port
specification is sufficient.
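
    To make the token-to-R conversion a little more concrete, here is a minimal
Python sketch of how an actor might turn its named input ports into the opening
lines of an R script. Everything in it is an assumption made for illustration:
the FileInput class stands in for a file-valued port, build_r_script is a
made-up name, and header = TRUE presumes the table has a header row.

```python
from dataclasses import dataclass


@dataclass
class FileInput:
    """Hypothetical stand-in for a port whose token names a data file."""
    path: str


def _r_scalar(value):
    """Render one scalar as an R literal."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, str):
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return repr(value)


def build_r_script(ports, script_body):
    """Prefix `script_body` with one R statement per input port.

    `ports` maps port names to tokens; the token type picks the conversion.
    """
    lines = []
    for name, token in ports.items():
        if isinstance(token, FileInput):
            # File name: read the table into a data frame named after the port.
            path = _r_scalar(token.path)
            lines.append(f"{name} <- read.table({path}, header = TRUE)")
        elif isinstance(token, (list, tuple)):
            # Array: becomes an R vector named after the port.
            items = ", ".join(_r_scalar(v) for v in token)
            lines.append(f"{name} <- c({items})")
        elif isinstance(token, str):
            # Plain string: entered as an R command, unchanged.
            lines.append(token)
        else:
            raise TypeError(f"no conversion assumed for port {name!r}")
    lines.append(script_body)
    return "\n".join(lines)
```

    A port named heights carrying [1.5, 2.0] would thus contribute the line
heights <- c(1.5, 2.0), after which the script body can refer to heights
directly.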

    But what about output? Much of the output of R is just text which appears
when a command (or script) is entered and executed. One output port could thus
just transmit the R text output stream (either as a single token or as a series
of tokens - see the ExpressionReader actor). Another type of output is an image
(plot) created by R. This would probably take the form of an image file name
(e.g. a JPEG file); R creates the image file and its name is returned to
Kepler. (Most likely, Kepler then uses another actor to display the result.) In
some cases, the desired output may be in the form of Kepler arrays, matrices,
or records.
These might be needed for further processing in a workflow. The actor thus needs
a method for converting R objects (especially vectors) to Kepler arrays, etc.
So, one could consider simply mapping R objects to Kepler ports. The port name
would be the R object name and the port type would determine how the R object is
converted to a Kepler token.
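
    One possible way to realize that mapping, sketched in Python under several
assumptions: the actor appends a cat() call per output port so that each named
R object is printed on a tagged line, then splits R's text output back into
per-port values. The marker string, the function names, and the 'output' key
for the leftover text are all invented for this example, and values containing
spaces or commas are not handled.

```python
MARKER = "#KEPLER-OUT"  # hypothetical tag for lines meant for output ports


def r_export_lines(output_ports):
    """R statements that print each named object on one tagged line."""
    return [
        f'cat("{MARKER}", "{name}", paste({name}, collapse = ","), "\\n")'
        for name in output_ports
    ]


def parse_r_output(text, port_types):
    """Split R's text output into per-port values plus the remaining text.

    `port_types` maps a port name to a converter (e.g. float), so the port
    type decides how the R object is converted.
    """
    tokens = {}
    plain = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == MARKER:
            name = fields[1]
            raw = fields[2].split(",") if len(fields) > 2 else []
            convert = port_types.get(name, str)
            tokens[name] = [convert(item) for item in raw]
        else:
            plain.append(line)
    # Everything untagged goes to a standard text port.
    tokens["output"] = "\n".join(plain)
    return tokens
```

    The untagged lines are kept together so that a standard text output port
still carries whatever R printed on its own.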

    So I guess that I am not convinced that we need a 'formal' WSDL for R, at
least for use only within Kepler. I think that any data needed to define I/O can
be included in Port definitions, thus avoiding the needless complexity of
another interface definition language. (We might need a better dialog for
setting port parameters, however!) As I see it, the creator of a general-purpose
R actor understands R well enough to write an R script that plays the role of
the expression in an Expression actor. Input ports are added to represent inputs
and port names can be used in the script. Output ports would be similarly
defined, with perhaps a standard port for text output.