[seek-dev] Re: Thoughts on data input into Kepler: and 'R'
Bertram Ludaescher
ludaesch at sdsc.edu
Fri Jul 9 10:37:44 PDT 2004
>>>>> "MJ" == Matt Jones <jones at nceas.ucsb.edu> writes:
MJ>
MJ> Dan,
MJ> Thanks. I agree with your assessment, and we have plans in motion for
MJ> substantially upgrading the EML data access. Currently it is really
MJ> just a proof of concept. The way it is implemented (loading all data into
MJ> RAM, for example) is not scalable and will need to be fixed. Other
MJ> inefficiencies abound as well, such as retrieving data and metadata from
MJ> the server multiple times (i.e., no caching). Jing is actively in the
MJ> process of redesigning the data access mechanisms in Kepler and EcoGrid so
MJ> that standard query (e.g., joins) and resultset (e.g., cursors)
MJ> operations are available. This should give us pretty good scalability.
Matt:
Scalability and the handling of database access in Kepler and the
EcoGrid are interesting points. I'm very interested in how you guys
design this.
A simple way to avoid loading all the data into memory, and/or
shipping data between a data producer P and a data sink S through the
Kepler client C, is to use a handle mechanism (we've discussed this
before). Has that mechanism been nailed down for Kepler/SEEK?
Moreover, I think there is also room for doing some database mediation
and query rewriting work here (this is implicit in the SEEK
architecture).
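A minimal sketch of such a handle mechanism, in Python for illustration. None of these names exist in Kepler or the EcoGrid; the point is only that the client ships an opaque reference from P to S, and only the sink dereferences it, one slice at a time:

```python
# Hypothetical handle mechanism: the client passes a small handle object
# between actors instead of the data itself. All names are illustrative.

class Producer:
    """Stands in for a data producer P holding the actual tables."""
    def __init__(self, tables):
        self.tables = tables

    def read(self, dataset_id, offset, limit):
        # Serve only the requested slice, never the whole table.
        return self.tables[dataset_id][offset:offset + limit]

class DataHandle:
    """Opaque reference to a dataset; cheap to pass through the client C."""
    def __init__(self, producer, dataset_id):
        self.producer = producer
        self.dataset_id = dataset_id

    def fetch_rows(self, offset=0, limit=100):
        # The sink S dereferences the handle on demand.
        return self.producer.read(self.dataset_id, offset, limit)

producer = Producer({"obs": [{"site": i, "area": i * 1.5} for i in range(1000)]})
handle = DataHandle(producer, "obs")   # this is all the client ships around
first_page = handle.fetch_rows(0, 10)  # sink pulls only what it needs
```

With cursor-style `offset`/`limit` paging like this, memory use at the client stays bounded regardless of table size.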
Bertram
MJ> I also agree that we'll need alternative ways to view the data beyond just
MJ> chunking it up to one record per fire. We've discussed 'table at a
MJ> time' delivery, and your suggestion of being able to output 'vector at a
MJ> time' delivery is also a good idea. Both R and Matlab are more
MJ> vector-oriented languages, and could benefit from this sort of data delivery.
MJ> Most traditional stats programs such as SAS and SPSS take more of a
MJ> relational view of data, and so that is the initial perspective we
MJ> took. We'll also have to deal with some real mismatches in how
MJ> relational data might be processed and how spatial data would fit into
MJ> such a flow.
MJ>
MJ> Thanks for the thoughts,
MJ>
MJ> Matt
MJ>
MJ> Daniel Higgins wrote:
>> Thoughts on data input into Kepler: and 'R'
>>
>> Looking into the use of 'R' inside Kepler has resulted in some
>> thoughts/questions regarding just where we get the data that R (and
>> Kepler) processes. I am presenting some of these ideas here for
>> discussion/comments.
>>
>> First, consider the EML200DataSource actor. This actor uses an EML
>> description to locate a data source and configures itself to have one
>> output port for each column (attribute) in the data table (the entity).
>> A sequence of tokens is then output through these ports, one for each
>> row in the data table. The sequence of tokens out of each port is a
>> data stream that could come from any of a variety of sources (database,
>> file, etc.) and could conceptually handle very large data sources.
>> Currently, however, the whole data table is read before the output
>> stream is created. All the information ends up in local RAM, limiting
>> the amount of information that can exist in a table. Also, the
>> attributes are output on different ports, so the very concept of a
>> table is effectively lost (not to mention the possibility of a very
>> large and confusing number of potential ports).
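The per-column streaming Dan describes might be sketched as follows (illustrative Python, not the actual Kepler/Ptolemy code). One generator per column plays the role of an output port, emitting one "token" per row; note that each column stream here re-reads the source independently, echoing the no-caching problem:

```python
# Sketch of one-port-per-column delivery: each column becomes its own
# stream of per-row tokens. Names and structure are illustrative only.

import csv, io

def column_streams(csv_text, fieldnames):
    """Return one generator per column; each yields one value per row."""
    def stream_for(col):
        # Each 'port' scans the source on its own -- no shared cache.
        reader = csv.DictReader(io.StringIO(csv_text))
        for row in reader:
            yield row[col]
    return {col: stream_for(col) for col in fieldnames}

data = "site,area\nA,1.5\nB,2.0\n"
ports = column_streams(data, ["site", "area"])
sites = list(ports["site"])   # the 'site' port's token sequence
areas = list(ports["area"])   # the 'area' port's token sequence
```

Because the columns travel on separate streams, nothing downstream knows which `site` value belongs with which `area` value except by position, which is exactly the "concept of a table is lost" problem.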
>>
>> It would seem that alternative outputs would sometimes be useful. For
>> example, one could output a Ptolemy record for each row in the table. In
>> this row-oriented output, a column name would be associated with each
>> value. Another possibility would be a column-oriented approach
>> which would create an array for each column and then a record
>> associating a name with each column array. A single record would thus
>> represent each table. (Note that we could do this within Kepler by
>> adding a SequenceToArray actor on the output of the existing
>> EML200DataSource and then creating a Record associating each array with
>> a name.)
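The two alternatives can be sketched with plain dicts and lists standing in for Ptolemy records and arrays (purely illustrative; the real tokens would be Ptolemy RecordToken/ArrayToken objects):

```python
# Row-oriented vs. column-oriented delivery of the same small table.
rows = [("A", 1.5), ("B", 2.0)]
columns = ["site", "area"]

# Row-oriented: one record per table row; each value carries its column name.
row_records = [dict(zip(columns, r)) for r in rows]

# Column-oriented: one array per column, and a single record for the
# whole table that maps each column name to its array.
table_record = {col: [r[i] for r in rows] for i, col in enumerate(columns)}
```

The column-oriented form is the one that maps naturally onto R's data frame, since each named array is ready to become an R vector.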
>>
>> This last idea is suggested by the way R reads data. Typically, a
>> read.table() function is applied to a text file (or URL) to create a
>> 'data frame' object. Each column in the data frame is a vector that can
>> be individually manipulated. R "operates on named data structures",
>> usually a 'vector', which is an ordered collection (most often of
>> numbers). This corresponds to the Ptolemy 'array', which is an ordered
>> collection of tokens of the same type. Thus, it would be nice if we
>> could just hand an 'R' actor a PTII array and have it converted to an
>> 'R' vector. However, in an interactive 'R' session, data is usually
>> either entered as command-line strings or read from a file or URL.
>> (Connections to databases or binary files/connections are also
>> possible.) So, it might be useful to have an EMLDataSource that either
>> created a local file or returned a URL for the data. An 'R' actor
>> (script) could then just read this file/URL as its data source.
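A sketch of that proposed variant, in Python: materialize the table as a local delimited file and hand R only the path. The function name is hypothetical, not an existing Kepler actor, and the R call in the comment is just the standard read.table() idiom:

```python
# Hypothetical EMLDataSource variant: write the data to a local file and
# return the path, so an R actor can read it with read.table().

import csv, os, tempfile

def write_table_for_r(columns, rows):
    """Write a header row plus data rows; return the file path for R."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)   # header gives R the column names
        writer.writerows(rows)
    return path

path = write_table_for_r(["site", "area"], [["A", 1.5], ["B", 2.0]])
# The R script would then do something like:
#   df <- read.table(path, header = TRUE, sep = ",")
```

This sidesteps any token-type conversion entirely: R's own reader builds the data frame, and Kepler only brokers the location of the data.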
>>
>> [We might also consider reusing some code from Morpho which stored
>> large data tables as random-access files, allowing us to display very
>> large tables without having everything in RAM.]
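The Morpho idea might look roughly like this (an illustrative Python sketch, not Morpho's actual code): index the byte offset of each row once, then seek() directly to any row on demand, so display memory stays constant no matter how large the table is:

```python
# Random-access table rows: index byte offsets once, then seek to any row.

import os, tempfile

def build_row_index(path):
    """Record the byte offset of each line so rows can be fetched by number."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets

def read_row(path, offsets, n):
    """Fetch row n directly, without reading the rows before it."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode().rstrip("\n")

# Demo: write a tiny table, then jump straight to row 2.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("site,area\nA,1.5\nB,2.0\n")
offsets = build_row_index(path)
row = read_row(path, offsets, 2)   # "B,2.0"
```

Only the index (one offset per row) lives in RAM; a table viewer can fetch just the rows currently on screen.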
>>
MJ>
MJ> --
MJ> -------------------------------------------------------------------
MJ> Matt Jones jones at nceas.ucsb.edu
MJ> http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
MJ> National Center for Ecological Analysis and Synthesis (NCEAS)
MJ> University of California Santa Barbara
MJ> Interested in ecological informatics? http://www.ecoinformatics.org
MJ> -------------------------------------------------------------------
MJ> _______________________________________________
MJ> seek-dev mailing list
MJ> seek-dev at ecoinformatics.org
MJ> http://www.ecoinformatics.org/mailman/listinfo/seek-dev