[seek-dev] Re: Thoughts on data input into Kepler: and 'R'
Bertram Ludaescher
ludaesch at sdsc.edu
Fri Jul 9 10:37:44 PDT 2004
>>>>> "MJ" == Matt Jones <jones at nceas.ucsb.edu> writes:
MJ>
MJ> Dan,
MJ> Thanks. I agree with your assessment, and we have plans in motion for
MJ> substantially upgrading the EML data access. Currently it is really
MJ> just a proof of concept. The way it is implemented (loading all data into
MJ> RAM, for example) is not scalable and will need to be fixed. Other
MJ> inefficiencies abound as well, such as retrieving data and metadata from
MJ> the server multiple times (i.e., no caching). Jing is actively in the
MJ> process of redesigning the data access mechanisms in Kepler and EcoGrid so
MJ> that standard query (e.g., joins) and resultset (e.g., cursors)
MJ> operations are available. This should give us pretty good scalability.
Matt:
Scalability and the handling of database access in Kepler and the
EcoGrid are interesting points. I'm very interested in how you guys
design this.
A simple way to avoid loading all the data into memory, and/or
shipping data between a data producer P and a data sink S through the
Kepler client C, is to use a handle mechanism (we've discussed this
before). Has that mechanism been nailed down for Kepler/SEEK?
Moreover, I think there is also room for doing some database mediation
and query rewriting work here (this is implicit in the SEEK
architecture).
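A minimal sketch of such a handle mechanism, in Python for illustration. None of these names exist in Kepler or the EcoGrid; the point is only that the client ships an opaque reference from P to S, and only the sink dereferences it, one slice at a time:

```python
# Hypothetical handle mechanism: the client passes a small handle object
# between actors instead of the data itself. All names are illustrative.

class Producer:
    """Stands in for a data producer P holding the actual tables."""
    def __init__(self, tables):
        self.tables = tables

    def read(self, dataset_id, offset, limit):
        # Serve only the requested slice, never the whole table.
        return self.tables[dataset_id][offset:offset + limit]

class DataHandle:
    """Opaque reference to a dataset; cheap to pass through the client C."""
    def __init__(self, producer, dataset_id):
        self.producer = producer
        self.dataset_id = dataset_id

    def fetch_rows(self, offset=0, limit=100):
        # The sink S dereferences the handle on demand.
        return self.producer.read(self.dataset_id, offset, limit)

producer = Producer({"obs": [{"site": i, "area": i * 1.5} for i in range(1000)]})
handle = DataHandle(producer, "obs")   # this is all the client ships around
first_page = handle.fetch_rows(0, 10)  # sink pulls only what it needs
```

With cursor-style `offset`/`limit` paging like this, memory use at the client stays bounded regardless of table size.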
Bertram
MJ> I also agree that we'll need alternative ways to view the data beyond just
MJ> chunking it up to one record per fire. We've discussed 'table at a
MJ> time' delivery, and your suggestion of being able to output 'vector at a
MJ> time' delivery is also a good idea. Both R and Matlab are more
MJ> vector-oriented languages, and could benefit from this sort of data delivery.
MJ> Most traditional stats programs such as SAS and SPSS take more of a
MJ> relational view of data, and so that is the initial perspective we
MJ> took. We'll also have to deal with some real mismatches in how
MJ> relational data might be processed and how spatial data would fit into
MJ> such a flow.
MJ>
MJ> Thanks for the thoughts,
MJ>
MJ> Matt
MJ>
MJ> Daniel Higgins wrote:
>> Thoughts on data input into Kepler: and 'R'
>>
>> Looking into the use of 'R' inside Kepler has resulted in some
>> thoughts/questions regarding just where we get the data that R (and
>> Kepler) processes. I am presenting some of these ideas here for
>> discussion/comments.
>>
>> First, consider the EML200DataSource actor. This actor uses an EML
>> description to locate a data source and configures itself to have one
>> output port for each column (attribute) in the data table (the entity).
>> A sequence of tokens is then output through these ports, one for each
>> row in the data table. The sequence of tokens out of each port is a
>> data stream that could come from any of a variety of sources (database,
>> file, etc.) and could conceptually handle very large data sources.
>> Currently, however, the whole data table is read before the output
>> stream is created. All the information ends up in local RAM, limiting
>> the amount of information that can exist in a table. Also, the
>> attributes are output on different ports, so the very concept of a
>> table is effectively lost (not to mention the possibility of a very
>> large and confusing number of potential ports).
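The per-column streaming Dan describes might be sketched as follows (illustrative Python, not the actual Kepler/Ptolemy code). One generator per column plays the role of an output port, emitting one "token" per row; note that each column stream here re-reads the source independently, echoing the no-caching problem:

```python
# Sketch of one-port-per-column delivery: each column becomes its own
# stream of per-row tokens. Names and structure are illustrative only.

import csv, io

def column_streams(csv_text, fieldnames):
    """Return one generator per column; each yields one value per row."""
    def stream_for(col):
        # Each 'port' scans the source on its own -- no shared cache.
        reader = csv.DictReader(io.StringIO(csv_text))
        for row in reader:
            yield row[col]
    return {col: stream_for(col) for col in fieldnames}

data = "site,area\nA,1.5\nB,2.0\n"
ports = column_streams(data, ["site", "area"])
sites = list(ports["site"])   # the 'site' port's token sequence
areas = list(ports["area"])   # the 'area' port's token sequence
```

Because the columns travel on separate streams, nothing downstream knows which `site` value belongs with which `area` value except by position, which is exactly the "concept of a table is lost" problem.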
>>
>> It would seem that alternative outputs would sometimes be useful. For
>> example, one could output a Ptolemy record for each row in the table. In
>> this row-oriented output, a column name would be associated with each
>> value. Another possibility would be a column-oriented approach
>> which would create an array for each column and then a record
>> associating a name with each column array. A single record would thus
>> represent each table. (Note that we could do this within Kepler by
>> adding a SequenceToArray actor on the output of the existing
>> EML200DataSource and then creating a Record associating each array with
>> a name.)
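The two alternatives can be sketched with plain dicts and lists standing in for Ptolemy records and arrays (purely illustrative; the real tokens would be Ptolemy RecordToken/ArrayToken objects):

```python
# Row-oriented vs. column-oriented delivery of the same small table.
rows = [("A", 1.5), ("B", 2.0)]
columns = ["site", "area"]

# Row-oriented: one record per table row; each value carries its column name.
row_records = [dict(zip(columns, r)) for r in rows]

# Column-oriented: one array per column, and a single record for the
# whole table that maps each column name to its array.
table_record = {col: [r[i] for r in rows] for i, col in enumerate(columns)}
```

The column-oriented form is the one that maps naturally onto R's data frame, since each named array is ready to become an R vector.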
>>
>> This last idea is suggested by the way R reads data. Typically, a
>> read.table() function is applied to a text file (or URL) to create a
>> 'data frame' object. Each column in the data frame is a vector that can
>> be individually manipulated. R "operates on named data structures",
>> usually a 'vector', which is an ordered collection (most often of
>> numbers). This corresponds to the Ptolemy 'array', which is an ordered
>> collection of tokens of the same type. Thus, it would be nice if we
>> could just hand an 'R' actor a PTII array and have it converted to an
>> 'R' vector. However, in an interactive 'R' session, data is usually
>> either entered as command-line strings or read from a file or URL.
>> (Connections to databases or binary files/connections are also
>> possible.) So, it might be useful to have an EMLDataSource that either
>> created a local file or returned a URL for the data. An 'R' actor
>> (script) could then just read this file/URL as its data source.
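A sketch of that proposed variant, in Python: materialize the table as a local delimited file and hand R only the path. The function name is hypothetical, not an existing Kepler actor, and the R call in the comment is just the standard read.table() idiom:

```python
# Hypothetical EMLDataSource variant: write the data to a local file and
# return the path, so an R actor can read it with read.table().

import csv, os, tempfile

def write_table_for_r(columns, rows):
    """Write a header row plus data rows; return the file path for R."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)   # header gives R the column names
        writer.writerows(rows)
    return path

path = write_table_for_r(["site", "area"], [["A", 1.5], ["B", 2.0]])
# The R script would then do something like:
#   df <- read.table(path, header = TRUE, sep = ",")
```

This sidesteps any token-type conversion entirely: R's own reader builds the data frame, and Kepler only brokers the location of the data.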
>>
>> [We might also consider reusing some code from Morpho which stored
>> large data tables as random-access files, allowing us to display very
>> large tables without having everything in RAM.]
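The Morpho idea might look roughly like this (an illustrative Python sketch, not Morpho's actual code): index the byte offset of each row once, then seek() directly to any row on demand, so display memory stays constant no matter how large the table is:

```python
# Random-access table rows: index byte offsets once, then seek to any row.

import os, tempfile

def build_row_index(path):
    """Record the byte offset of each line so rows can be fetched by number."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets

def read_row(path, offsets, n):
    """Fetch row n directly, without reading the rows before it."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode().rstrip("\n")

# Demo: write a tiny table, then jump straight to row 2.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("site,area\nA,1.5\nB,2.0\n")
offsets = build_row_index(path)
row = read_row(path, offsets, 2)   # "B,2.0"
```

Only the index (one offset per row) lives in RAM; a table viewer can fetch just the rows currently on screen.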
>>
MJ>
MJ> --
MJ> -------------------------------------------------------------------
MJ> Matt Jones jones at nceas.ucsb.edu
MJ> http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
MJ> National Center for Ecological Analysis and Synthesis (NCEAS)
MJ> University of California Santa Barbara
MJ> Interested in ecological informatics? http://www.ecoinformatics.org
MJ> -------------------------------------------------------------------
MJ> _______________________________________________
MJ> seek-dev mailing list
MJ> seek-dev at ecoinformatics.org
MJ> http://www.ecoinformatics.org/mailman/listinfo/seek-dev