[kepler-dev] Thoughts on an IDL for R in Kepler

Thu Mar 3 11:41:45 PST 2005

Hi All,
    Some somewhat rambling thoughts on how to interface R with Kepler 
are presented below.

    In his recent comments on the R bug in the Kepler Bugzilla 
(#1342), Matt suggests that we need to create some sort of interface 
definition language for R scripts (a sort of WSDL for R). His 
suggestions triggered some additional thoughts on the subject, 
which I will attempt to document here.

    Consider first R and just what data types it uses. In general, R is 
a functional language that operates on named data structures. Probably 
the most common data structure is the R vector, which is a sequence of 
numbers, booleans, or strings. R is dynamically typed, meaning that the 
content type of a vector need not be declared prior to use; types are 
determined by the values. All sorts of functions can be applied to these 
vectors, and the ability to apply mathematical operations to a whole 
vector eliminates the need for many of the explicit loops required in 
lower-level languages (C, Java, etc.). Of course, R also has other data 
types like the data frame, which is basically a collection of vectors 
(think of columns in a table). Often, an analysis starts by importing 
a table from a file into a data frame; individual columns are then 
pulled from the data frame as vectors and operated on with R functions. 
(Incidentally, the R Reference Manual Base Package is a 700-page book 
just documenting the functions in the R base package; R has a lot of 
functions!)

    A common R input is a vector (or table) that is assigned a name 
when imported; that name is later manipulated by various functions. 
The actual input is typically a string entered from the keyboard or a 
data file (table) read from the file system.

    R output is usually text displayed on the screen or a plot created 
by R on a graphics device. Functions often display some summary results; 
plots are the results of specific functions and appear on 'graphics 
devices'. To export to other systems, plots generally need to be 
written to graphics files.

    So what type of ports would an R actor have if used in Kepler? 
Tables could of course be referenced as file names or text streams. But 
R vectors might well be input as what are called 'arrays' in the 
Kepler/Ptolemy expression language. And output ports would presumably 
be text strings or file names, or perhaps Kepler arrays/records.

    Note that some of the Kepler token types are similar to R data 
types. The Kepler array token corresponds to the R vector. A Kepler 
record can attach a name to a Kepler array, so an array of Kepler 
records could be converted to/from an R data frame.
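
    To make the array-of-records idea concrete, here is a rough sketch 
in Python (not Kepler or R code). The stand-ins are assumptions for 
illustration only: a Kepler array is modeled as a Python list, a Kepler 
record as a dict, and an R data frame as a dict of equal-length columns. 
The function names are hypothetical.

```python
# Hypothetical stand-ins: Kepler array -> list, Kepler record -> dict,
# R data frame -> dict mapping column name to a list of values.

def records_to_columns(records):
    """Turn an array of records (rows) into named columns (data-frame style)."""
    if not records:
        return {}
    names = list(records[0].keys())
    columns = {name: [] for name in names}
    for record in records:
        if set(record.keys()) != set(names):
            raise ValueError("all records must have the same field names")
        for name in names:
            columns[name].append(record[name])
    return columns


def columns_to_records(columns):
    """Turn named columns back into an array of records, one per row."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    names = list(columns.keys())
    n_rows = lengths.pop() if lengths else 0
    return [{name: columns[name][i] for name in names} for i in range(n_rows)]
```

The round trip only works when every record carries the same field 
names, which is the same constraint a data frame places on its rows.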

    Now, an interesting 'feature' of Kepler (inherited from Ptolemy) is 
that one can add ports to any(?) actor simply by using the 'Configure 
Ports' item of the context (popup) menu. In fact, adding ports to many 
actors is meaningless because the actor is not written to respond to 
new ports! [And it has been noted that adding ports, configuring the 
actor, changing the actor name, etc. should probably all be done in a 
single dialog.]

    The one actor where adding a number of new input ports is very 
useful and meaningful is the Expression actor. When new input ports are 
added and named, the names can then be used in the 'expression' that 
appears inside the actor! The expression is just a function that 
operates on the named input parameters and creates a result. [Note that 
the Expression actor has a single output, although this may be a 
vector, matrix, etc. One can add additional output ports to an 
Expression actor, but they really are meaningless.]

    One might think of an R script as an extension of a Kepler 
expression. Certainly the input data could be treated as in the 
Expression actor. Simply create input ports, and give each port a name 
and (perhaps) a type. For example, Kepler arrays could be mapped 
directly to R vectors, with each R vector given the name of its port. 
That port name would be used in the R script to operate on the input 
data. [Actually, the R actor could automatically examine the token type 
to determine how to convert to an R type. A string could just be entered 
as an R command; a fileParameter could be read as a data frame; an array 
as an R vector.] Do we really need a special interface definition 
language to describe the Kepler inputs to an R actor? My thought is 
'No': the MoML port specification is sufficient.
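
    As a rough illustration of the input side, the Python sketch below 
turns a list of named ports into the R statements that would precede 
the user's script. It is not Kepler code. The port kinds ("command", 
"file", "array") and the function names are hypothetical, and the sketch 
assumes that each port name is already a valid R identifier and that 
array elements are finite numbers, booleans, or strings. The only R 
calls it emits are `c()` and `read.table()`.

```python
import json


def r_literal(value):
    """Format one Python scalar as an R literal (assumes finite numbers)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return json.dumps(value)  # double-quoted, with backslash escapes
    raise TypeError(f"unsupported element type: {type(value).__name__}")


def port_to_r(name, kind, value):
    """Return the R statement for one input port (kinds are hypothetical)."""
    if kind == "command":  # a string token is passed through as R code
        return value
    if kind == "file":     # a file name is read as a data frame
        return f"{name} <- read.table({json.dumps(value)}, header = TRUE)"
    if kind == "array":    # an array becomes an R vector named after the port
        items = ", ".join(r_literal(v) for v in value)
        return f"{name} <- c({items})"
    raise ValueError(f"unknown port kind: {kind}")


def build_script(ports, body):
    """Prefix the user's R script with one statement per input port."""
    lines = [port_to_r(name, kind, value) for name, kind, value in ports]
    lines.append(body)
    return "\n".join(lines)
```

For example, `build_script([("x", "array", [1, 2])], "mean(x)")` yields 
the two-line R script `x <- c(1, 2)` followed by `mean(x)`.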

    But what about output? Much of the output of R is just text which 
appears when a command (or script) is entered and executed. One output 
port could thus just transmit the R text output stream (either as a 
single token or as a series of tokens - see the ExpressionReader actor). 
Another type of output is an image (plot) created by R. This would 
probably take the form of an image file name (e.g. a JPEG file): R 
creates the image file and its name is returned to Kepler. (Most likely, 
Kepler then uses another actor to display the result.) In some cases, 
the desired 
output may be in the form of Kepler arrays, matrices, or records. These 
might be needed for further processing in a workflow. The actor thus 
needs a method for converting R objects (especially vectors) to Kepler 
arrays, etc. So, one could consider simply mapping R objects to Kepler 
ports. The port name would be the R object name and the port type would 
determine how the R object is converted to a Kepler token.
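
    One possible way to handle the output side is sketched below in 
Python; it is an assumption for illustration, not a description of any 
existing actor. The user's script is wrapped so that plots go to a JPEG 
file (via R's `jpeg()` and `dev.off()`), and each R object named by an 
output port is printed with `cat()` on a tagged line that can be picked 
out of the text stream afterwards. The `KEPLER_OUT` tag and the 
function names are hypothetical, and the parser assumes numeric vectors 
with no NA values.

```python
import json

TAG = "KEPLER_OUT"  # hypothetical marker for lines that carry port values


def wrap_script(body, image_file, vector_ports):
    """Wrap an R script so plots go to a file and named vectors are tagged."""
    lines = [f"jpeg({json.dumps(image_file)})", body, "invisible(dev.off())"]
    for name in vector_ports:
        # Emits R code such as: cat("KEPLER_OUT", "x", x, "\n")
        lines.append(f'cat("{TAG}", "{name}", {name}, "\\n")')
    return "\n".join(lines)


def parse_output(text):
    """Split R's text output into plain text and per-port numeric arrays."""
    plain, ports = [], {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == TAG:
            ports[parts[1]] = [float(p) for p in parts[2:]]
        else:
            plain.append(line)
    return "\n".join(plain), ports
```

Everything that is not a tagged line stays in the plain text, which 
could feed the standard text output port.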

    So I guess that I am not convinced that we need a 'formal' WSDL for 
R, at least for use only within Kepler. I think that any data needed to 
define I/O can be included in port definitions, thus avoiding the 
needless complexity of another interface definition language. (We might 
need a better dialog for setting port parameters, however!) As I see 
it, the creator of a general-purpose R actor understands R well enough 
to write an R script that plays the role of the expression in an 
Expression actor. Input ports are added to represent inputs, and the 
port names can be used in the script. Output ports would be similarly 
defined, with perhaps a standard port for text output.
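
    If port definitions are the whole interface, about the only check 
left is that the script actually uses the names of its input ports. A 
minimal sketch in Python follows. The `Port` class, its fields, and 
`unused_input_ports` are hypothetical, and the identifier match is 
deliberately crude: it also matches names that appear inside R strings 
or comments.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical port description: a name, a direction, and a kind."""
    name: str
    direction: str  # "input" or "output"
    kind: str       # e.g. "array", "file", "string", "image"


def unused_input_ports(ports, script):
    """Return names of input ports that never appear in the R script."""
    # Crude approximation of R identifiers: a letter or dot, then
    # letters, digits, dots, or underscores.
    identifiers = set(re.findall(r"[A-Za-z.][A-Za-z0-9._]*", script))
    return [
        port.name
        for port in ports
        if port.direction == "input" and port.name not in identifiers
    ]
```

A port list like this carries the same information an IDL file would, 
but it lives with the actor rather than in a separate document.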


Dan

-- 
*******************************************************************
Dan Higgins                                  higgins at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Ph: 805-893-5127
National Center for Ecological Analysis and Synthesis (NCEAS) 
Marine Science Building - Room 3405
Santa Barbara, CA 93195
*******************************************************************