[kepler-dev] Thoughts on an IDL for R in Kepler
Dan Higgins
higgins at nceas.ucsb.edu
Thu Mar 3 11:41:45 PST 2005
Hi All,
Some somewhat rambling thoughts on how to interface R with Kepler
are presented below.
In his recent comments added to the R-bug in kepler/Bugzilla
(#1342), Matt suggests that we need to create some sort of interface
definition language for R scripts (a sort of WSDL for R). His
suggestions triggered some additional thought(s) about this subject
which I will attempt to document here.
Consider first R and just what data types it uses. In general, R is
a functional language that operates on named data structures. Probably
the most common data structure is the R vector which is a sequence of
numbers, booleans, or strings. R is dynamically typed, meaning that the
content type of a vector need not be declared before use; types are
determined by the values. All sorts of functions can be applied to these
vectors and the ability to apply mathematical operations to a vector
eliminates the need for many explicit looping operations required in
lower level languages (C, Java, etc.). Of course, R also has other data
types like the data frame, which is basically a collection of vectors
(think of columns in a table). Often, an analysis starts by importing
a table from a file into a data frame; individual columns are then pulled
from the data frame as vectors and operated on with R functions.
(Incidentally, the R Reference Manual Base Package is a 700 page book just
documenting functions in the R base package; R has a 'lot' of functions!)
A common R input is a vector (or table) which is assigned a name
when imported and the name is later manipulated by various functions.
The actual import is typically a string entered from the keyboard or a
datafile (table) read from the file system.
R output is usually text displayed on the screen or a plot created
by R in a graphics device. Functions often display some summary results;
plots are the results of specific functions and appear on 'graphic
devices'. To export to other systems, the plots generally need to be
written to graphic files.
So what type of ports would an R actor have if used in Kepler?
Tables could of course be referenced as filenames or text streams. But R
vectors might well be input as what are called 'arrays' in the
Kepler/Ptolemy expression language. And output ports would presumably be
text strings or file names, or perhaps Kepler arrays/records.
Note that some of the Kepler token types are similar to R data
types. The Kepler array token corresponds to the R vector. A Kepler
record can attach a name to a Kepler array, so an array of Kepler records
could be converted to/from an R data frame.
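The correspondence above can be sketched in a few lines. This is only an
illustration under stated assumptions, not Kepler code: Python lists stand
in for Kepler arrays, Python dicts stand in for Kepler records (each
mapping a name to one array), and the helper names r_literal, r_vector and
r_data_frame are hypothetical. The sketch just renders the R expression
that would build the matching R object.

```python
def r_literal(value):
    """Render one scalar as an R literal (numbers, booleans, strings only)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    raise TypeError(f"no R literal for {type(value).__name__}")


def r_vector(values):
    """Stand-in for a Kepler array -> R vector expression, e.g. c(1, 2, 3)."""
    return "c(" + ", ".join(r_literal(v) for v in values) + ")"


def r_data_frame(records):
    """Stand-in for an array of Kepler records -> R data.frame expression.

    Assumption: each record is a dict that attaches a name to one array,
    so every record contributes one or more named columns.
    """
    columns = []
    for record in records:
        for name, values in record.items():
            columns.append(f"{name} = {r_vector(values)}")
    return "data.frame(" + ", ".join(columns) + ")"


if __name__ == "__main__":
    print(r_vector([1, 2.5, 3]))
    print(r_data_frame([{"site": ["A", "B"]}, {"count": [3, 5]}]))
```

The reverse direction (R data frame back to an array of records) would
need R to report column names and values in some agreed form, which this
sketch does not attempt.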
Now, an interesting 'feature' of Kepler (inherited from Ptolemy) is
that one can add ports to any(?) actor simply by using the 'Configure
Ports' menu item of the context (popup) menu. In fact, adding ports to
many actors is meaningless because the
actor is not written to respond to new ports! [And it has been noted
that adding ports, configuring the actor, changing the actor name, etc.
should probably all be done in a single dialog.]
The one actor where adding a number of new input ports is very
useful and meaningful is the Expression actor. When new input ports are
added and named, the name can then be used in the 'expression' that
appears inside the actor! The expression is just a function that
operates on the named input parameters and creates a result. [Note that
the Expression actor has a single resulting output, although this may be
a vector, matrix, etc. One can add additional output ports to an
Expression actor, but they are meaningless.]
One might think of an R script as an extension of a Kepler
expression. Certainly the input data could be treated as in the
Expression actor. Simply create input ports, and give these ports a name
and (perhaps) a type. For example, Kepler arrays could be mapped
directly to R vectors and the R vector given the port name. That port
name would be used in the R script to operate on the input data.
[Actually, the R actor could automatically examine the token type to
determine how to convert to an R type. A string could just be entered as
an R command; a fileParameter could be read as a data frame; an array as
an R vector.] Do we really need a special interface definition language
to describe the Kepler inputs to an R actor? My thought is 'No' --- the
MOML port specification is sufficient.
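To make the input side concrete, here is a minimal sketch of how an R
actor might turn its input ports into the opening lines of an R script.
Everything in it is an assumption for illustration: the token kinds are
plain strings ("string", "file", "array"), arrays are limited to numbers,
files are read with R's read.table using header = TRUE, and the function
names port_to_r and build_script are hypothetical.

```python
def r_quote(text):
    """Quote a Python string as an R string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def port_to_r(name, kind, value):
    """Return the R statement for one input port, chosen by token kind."""
    if kind == "string":
        # A string token is passed through as an R command.
        return value
    if kind == "file":
        # A file token is read into a data frame named after the port.
        # header = TRUE is an assumption about the file layout.
        return f"{name} <- read.table({r_quote(value)}, header = TRUE)"
    if kind == "array":
        # An array token becomes an R vector named after the port.
        for v in value:
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                raise TypeError("this sketch handles numeric arrays only")
        return f"{name} <- c({', '.join(repr(v) for v in value)})"
    raise ValueError(f"unsupported token kind: {kind}")


def build_script(ports, body):
    """Prefix the user's R script with one statement per input port.

    ports maps port name -> (kind, value); body is the R script itself.
    """
    lines = [port_to_r(name, kind, value) for name, (kind, value) in ports.items()]
    return "\n".join(lines + [body])


if __name__ == "__main__":
    ports = {
        "x": ("array", [1, 2, 3]),
        "obs": ("file", "data/obs.txt"),
        "setup": ("string", "library(stats)"),
    }
    print(build_script(ports, "result <- mean(x)"))
```

Nothing beyond the port name and the token type is consulted here, which
is the point of the argument: the port list already carries what the
actor needs to know about its inputs.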
But what about output? Much of the output of R is just text which
appears when a command (or script) is entered and executed. One output
port could thus just transmit the R text output stream (either as a
single token or as a series of tokens - see the ExpressionReader actor).
Another type of output is an image (plot) created by R. This would
probably be passed as an image file name (e.g. a JPEG file): R creates the
image file and its name is returned to Kepler. (Most likely, Kepler then
uses another actor to display the result.) In some cases, the desired
output may be in the form of Kepler arrays, matrices, or records. These
might be needed for further processing in a workflow. The actor thus
needs a method for converting R objects (especially vectors) to Kepler
arrays, etc. So, one could consider simply mapping R objects to Kepler
ports. The port name would be the R object name and the port type would
determine how the R object is converted to a Kepler token.
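One possible shape for the output side is sketched below, again purely as
an illustration. It assumes the actor appends one R cat() call per output
port to the script, so that each named R object is printed as a line of
the form name=v1,v2,..., and then picks those lines out of R's text
output. The port type labels ("double", "int", "string", "boolean"), the
line format, and the function names are all assumptions, and values that
themselves contain commas or newlines are not handled.

```python
# Converters keyed by (assumed) port type label.
CONVERTERS = {
    "double": float,
    "int": int,
    "string": str,
    "boolean": lambda item: item == "TRUE",
}


def r_epilogue(port_types):
    """R statements that print each output object as 'name=v1,v2,...'."""
    return "\n".join(
        f'cat("{name}=", paste({name}, collapse = ","), "\\n", sep = "")'
        for name in port_types
    )


def parse_outputs(r_text, port_types):
    """Split R's text output into per-port values and leftover text.

    Returns (tokens, text): tokens maps port name -> list of converted
    values; text holds every line that did not belong to an output port
    and could be sent to a general text output port.
    """
    tokens = {}
    text_lines = []
    for line in r_text.splitlines():
        name, sep, payload = line.partition("=")
        if sep and name in port_types:
            convert = CONVERTERS[port_types[name]]
            items = payload.split(",") if payload else []
            tokens[name] = [convert(item) for item in items]
        else:
            text_lines.append(line)
    return tokens, "\n".join(text_lines)


if __name__ == "__main__":
    types = {"means": "double", "flags": "boolean"}
    print(r_epilogue(types))
    print(parse_outputs("[1] 2\nmeans=1.5,2.5\nflags=TRUE,FALSE\n", types))
```

Here the port name selects the R object and the port type selects the
conversion, with anything unclaimed falling through to the text stream.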
So I guess that I am not convinced that we need a 'formal' WSDL for
R, at least for use only within Kepler. I think that any data needed to
define I/O can be included in Port definitions, thus avoiding the
needless complexity of another interface definition language. (We might
need a better dialog for setting port parameters, however!). As I see
it, the creator of a general purpose R actor understands R well enough
to create an R script which is like the expression in an Expression
actor. Input ports are added to represent inputs and port names can be
used in the script. Output ports would be similarly defined, with
perhaps a standard port for text output.
Dan
--
*******************************************************************
Dan Higgins higgins at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Ph: 805-893-5127
National Center for Ecological Analysis and Synthesis (NCEAS)
Marine Science Building - Room 3405
Santa Barbara, CA 93195
*******************************************************************