[seek-dev] Today's Ecogrid Call

Fri Sep 17 12:04:43 PDT 2004

Dan Higgins wrote:
> Hi Shawn,
>    I had been thinking along the same lines of using records or arrays 
> of records. This is similar to dataframes in R. However, there are some 
> desired capabilities that would be nice that I don't see how to carry 
> out without some work.
> 
> Say a table is a collection of records (columns). How do I sort the 
> whole table based on the sort of one column? Or how do I subset the 
> table based on values in one column?  For example, in R one can subset a 
> datatable with a command like
> 
> d[d$col1>1000,]
> 
> where 'd' is a datatable name and 'col1' is the name of a column. The 
> result is all rows in the table with values in col1 greater than 1000.
> I can, of course, write code to do this by examining the record values 
> individually and building new records, but it sure would be nice to have 
> some simpler expressions for such things. [And we can always use the 
> HSQL engine for all such operations, even locally]
> 

I think the array-of-records thing is useful for passing tables around, 
not necessarily for querying them (i.e., the R expression is really 
select * from d where col1 > 1000).
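
As a rough sketch of what the hand-written version looks like, here is 
the same subset (plus a sort on one column) over an array of records. 
This is Python rather than Ptolemy or R, the table is modeled as a list 
of dicts, and the sample rows and the helper names subset() and sort_by() 
are made up for illustration:

```python
# Hypothetical stand-in for an array of records: a list of dicts.
d = [
    {"col1": 500, "name": "a"},
    {"col1": 1500, "name": "b"},
    {"col1": 1200, "name": "c"},
]


def subset(table, predicate):
    """Keep the rows for which predicate(row) is true, like d[d$col1>1000,] in R."""
    return [row for row in table if predicate(row)]


def sort_by(table, column):
    """Return a new table with whole rows reordered by the values in one column."""
    return sorted(table, key=lambda row: row[column])


big = subset(d, lambda row: row["col1"] > 1000)
ordered = sort_by(d, "col1")
```

It works, but every new kind of query needs another helper, which is 
the argument for handing the table to a query engine instead.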

If you need to query it, then I would think that is best done using an 
SQL query engine, which, as you say, could be done quickly with the 
HSQL engine.  For example, have an HSQL actor that takes as a parameter 
(or as an input) an SQL query expression and one or more input tables, 
and outputs a result table.
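
A minimal sketch of that actor idea, under some loud assumptions: it 
is plain Python, it uses the standard sqlite3 module as a stand-in for 
HSQL, and sql_actor() is a hypothetical name, not anything that exists 
in Kepler or Ptolemy. Input tables are lists of records keyed by table 
name, and the output is again a list of records:

```python
import sqlite3


def sql_actor(query, tables):
    """Run `query` over in-memory copies of `tables` and return a list of records.

    `tables` maps a table name to a non-empty list of dicts that share the same keys.
    Table and column names are pasted into the SQL unescaped, so they must be trusted.
    """
    conn = sqlite3.connect(":memory:")  # SQLite here is only a stand-in for HSQL
    try:
        for name, rows in tables.items():
            columns = list(rows[0].keys())
            col_list = ", ".join(columns)
            placeholders = ", ".join("?" for _ in columns)
            conn.execute(f"CREATE TABLE {name} ({col_list})")
            conn.executemany(
                f"INSERT INTO {name} ({col_list}) VALUES ({placeholders})",
                [tuple(row[c] for c in columns) for row in rows],
            )
        cursor = conn.execute(query)
        names = [desc[0] for desc in cursor.description]
        return [dict(zip(names, values)) for values in cursor.fetchall()]
    finally:
        conn.close()


# Made-up sample data for the table 'd' from the R example.
d = [{"col1": 500}, {"col1": 1500}, {"col1": 1200}]
result = sql_actor("SELECT * FROM d WHERE col1 > 1000 ORDER BY col1", {"d": d})
```

The whole table gets copied into the engine on every call, which is 
only reasonable under the small-table assumption.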

I think all of this is predicated on the tables being smallish.

For large tables (that won't fit in main memory reasonably), you need a 
real database :)  -- Out of curiosity, does R use a db backend?

shawn

> 
> Dan
> 
> 
> Shawn Bowers wrote:
> 
>>
>> I think that Ptolemy actually supports tables, through complex 
>> structures, pretty well. In particular, every table is simply an array 
>> of records.
>>
>> Let's say I have the following relation schema:
>>
>> CREATE TABLE ds1
>> (
>> age int,
>> weight double,
>> plot int,
>> species string
>> )
>>
>> (I'm fudging a bit on the domains since these aren't valid SQL, but 
>> that is a minor detail.)
>>
>>
>> This can be represented in Ptolemy as the following type definition:
>>
>> {{age=int, weight=double, plot=int, species=string}}
>>
>> That is, as a list of 4-tuples.
>>
>> Of course, this definition doesn't explicitly state that the structure 
>> is a table. One could introduce a convention for representing tables a 
>> la XML (i.e., through tags), or else could introduce an explicit 
>> Ptolemy data type to support tables (not hard given that the data 
>> structures exist and in principle Ptolemy's type system is extensible).
>>
>> For the convention approach, we could just wrap the whole structure in 
>> a record:
>>
>> { sql_tbl = { { _attributes here_ } } }
>>
>> So, for example, the above def would be:
>>
>> { sql_tbl = { { age=int, weight=double, plot=int, species=string } } }
>>
>> And the actual table would be passed as:
>>
>> { sql_tbl = {
>>     {age=1, weight=50.0, plot=1, species="ABCD" },
>>     {age=1, weight=49.9, plot=1, species="ABCE" },
>>     {age=2, weight=50.1, plot=2, species="ABCD" }
>>   }
>> }
>>
>>
>> Shawn
>>
>>
>>
>>
>> Dan Higgins wrote:
>>
>>> Rod,
>>>     I haven't been following very closely all the work you and Jing 
>>> (and others) are doing, so this may be a silly question. I notice 
>>> that you have referred to the EML200DataSource returning a table. 
>>> Just what sort of data structure are you referring to?  As far as I 
>>> know, Kepler doesn't have a 'TableToken'. Is this a Java class or 
>>> some array/vector of column data (strings?) in the Java code? It 
>>> would be nice if we could create some table structure like the 'data 
>>> frames' of 'R' that could be passed between actors in Kepler and 
>>> easily manipulated.
>>>
>>> Dan
>>>
>>> Rod Spears wrote:
>>>
>>>> Things to think about before we meet:
>>>>
>>>> 1) The Eml200DataSource uses the Ecogrid to get Metadata about an 
>>>> item and then returns the data for that item as a single table. The 
>>>> QueryBuilder can be used to reduce the number of columns that are 
>>>> passed through the ports, but is not necessarily a required part of 
>>>> this data object.
>>>>
>>>> 2) What else will we be using the generic QueryBuilder for? Meaning 
>>>> what kind of data object will be returning more than one table that 
>>>> is not an Ecogrid Query?
>>>>     2.1) I think we have talked about this when the user will be 
>>>> accessing local data files; thru HSQL?  JDBC?
>>>>         2.1.1) If so, then how do they discover and get their local 
>>>> data into Kepler?
>>>>
>>>> 3) Do we need a more generic EcogridDataSource object that can 
>>>> execute generic Ecogrid Queries? And if so, do we need an Ecogrid 
>>>> Query specific QueryBuilder instead of a generic one?
>>>>
>>>> 4) Do we need a DiGIR Data Source object, or would this be covered 
>>>> by #2? If it were DiGIR specific, then we could get data from nodes 
>>>> that may not be registered? (I am not sure)
>>>>
>>>> Rod
>>>>
>>>> -- 
>>>> Rod Spears
>>>> Biodiversity Research Center
>>>> University of Kansas
>>>> 1345 Jayhawk Boulevard
>>>> Lawrence, KS 66045, USA
>>>> Tel: 785 864-4082, Fax: 785 864-5335
>>>>
>>>
>>>
>>> -- 
>>> *******************************************************************
>>> Dan Higgins                                  higgins at nceas.ucsb.edu
>>> http://www.nceas.ucsb.edu/    Ph: 805-892-2531
>>> National Center for Ecological Analysis and Synthesis (NCEAS) 735 
>>> State Street - Room 205
>>> Santa Barbara, CA 93195
>>> *******************************************************************
>>>
>>