[seek-dev] Today's Ecogrid Call
Shawn Bowers
bowers at sdsc.edu
Fri Sep 17 12:04:43 PDT 2004
Dan Higgins wrote:
> Hi Shawn,
> I had been thinking along the same lines of using records or arrays
> of records. This is similar to dataframes in R. However, there are some
> desired capabilities that would be nice that I don't see how to carry
> out without some work.
>
> Say a table is a collection of records (columns). How do I sort the
> whole table based on the sort of one column? Or how do I subset the
> table based on values in one column? For example, in R one can subset a
> datatable with a command like
>
> d[d$col1>1000,]
>
> where 'd' is a datatable name and 'col1' is the name of a column. The
> result is all rows in the table with values in col1 greater than 1000.
> I can, of course, write code to do this by examining the record values
> individually and building new records, but it sure would be nice to have
> some simpler expressions for such things. [And we can always use the
> HSQL engine for all such operations, even locally]
>
I think the array of record thing is useful for passing tables around,
not necessarily for querying them (i.e., the R expression is really
select * from d where col1 > 1000).
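
To make that equivalence concrete, here is a minimal sketch in Python (not Ptolemy) of a table held as a list of records, using the example rows from the ds1 table quoted further down. The helper names `subset` and `sort_by` are hypothetical, made up for illustration; they are not Kepler or Ptolemy functions.

```python
# A table as a list of records (dicts), mirroring the ds1 example rows.
ds1 = [
    {"age": 1, "weight": 50.0, "plot": 1, "species": "ABCD"},
    {"age": 1, "weight": 49.9, "plot": 1, "species": "ABCE"},
    {"age": 2, "weight": 50.1, "plot": 2, "species": "ABCD"},
]


def subset(table, predicate):
    """Return the rows for which predicate(row) is true (hypothetical helper)."""
    return [row for row in table if predicate(row)]


def sort_by(table, column):
    """Return a new table with whole rows ordered by one column (hypothetical helper)."""
    return sorted(table, key=lambda row: row[column])


# Roughly: select * from ds1 where weight > 50.0
heavy = subset(ds1, lambda row: row["weight"] > 50.0)

# Roughly: select * from ds1 order by weight
by_weight = sort_by(ds1, "weight")
```

Because each record keeps all of its fields together, sorting or filtering on one column carries the rest of the row along; the cost is that every such operation has to be written by hand.
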
If you need to query it, then I would think that is best done using an
SQL query engine, which, as you say, could be done quickly with the
HSQL engine. For example, have an HSQL actor that takes as a parameter
(or as an input) an SQL query expression and one or more input tables,
and outputs a result table.
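
A rough sketch of such an actor's core, written in Python rather than as a Kepler actor. Two assumptions to flag: Python's built-in sqlite3 module stands in for the HSQL engine, and the function name `sql_actor` is hypothetical. Tables go in and come out as lists of records.

```python
import sqlite3


def sql_actor(query, tables):
    """Run `query` over in-memory tables and return the result as a list of records.

    `tables` maps a table name to a non-empty list of records (dicts that all
    share the same keys). Hypothetical function; sqlite3 stands in for HSQL.
    """
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    try:
        for name, rows in tables.items():
            columns = list(rows[0].keys())
            col_list = ", ".join(columns)
            placeholders = ", ".join("?" for _ in columns)
            # Table and column names are interpolated directly, so they must
            # come from trusted input; only the values are bound as parameters.
            conn.execute(f"CREATE TABLE {name} ({col_list})")
            conn.executemany(
                f"INSERT INTO {name} ({col_list}) VALUES ({placeholders})",
                [tuple(row[c] for c in columns) for row in rows],
            )
        return [dict(r) for r in conn.execute(query).fetchall()]
    finally:
        conn.close()


ds1 = [
    {"age": 1, "weight": 50.0, "plot": 1, "species": "ABCD"},
    {"age": 1, "weight": 49.9, "plot": 1, "species": "ABCE"},
    {"age": 2, "weight": 50.1, "plot": 2, "species": "ABCD"},
]

result = sql_actor(
    "SELECT species, weight FROM ds1 WHERE age = 1 ORDER BY weight",
    {"ds1": ds1},
)
```

The sketch copies every input table into the engine on each call, which is only reasonable for small tables.
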
I think all of this is predicated on the tables being smallish.
For large tables (that won't fit in main memory reasonably), you need a
real database :) -- Out of curiosity, does R use a db backend?
shawn
>
> Dan
>
>
> Shawn Bowers wrote:
>
>>
>> I think that Ptolemy actually supports tables, through complex
>> structures, pretty well. In particular, every table is simply an array
>> of records.
>>
>> Let's say I have the following relation schema:
>>
>> CREATE TABLE ds1
>> (
>> age int,
>> weight double,
>> plot int,
>> species string
>> )
>>
>> (I'm fudging a bit on the domains since these aren't valid SQL types,
>> but that is a minor detail.)
>>
>>
>> This can be represented in Ptolemy as the following type definition:
>>
>> {{age=int, weight=double, plot=int, species=string}}
>>
>> That is, as a list of 4-tuples.
>>
>> Of course, this definition doesn't explicitly state that the structure
>> is a table. One could introduce a convention for representing tables a
>> la xml (i.e., through tags), or else, could introduce an explicit
>> ptolemy data type to support tables (not hard given that the data
>> structures exist and in principle ptolemy's type system is extensible).
>>
>> For the convention approach, we could just wrap the whole structure in
>> a record:
>>
>> { sql_tbl = { { _attributes here_ } } }
>>
>> So, for example, the above def would be:
>>
>> { sql_tbl = { { age=int, weight=double, plot=int, species=string } } }
>>
>> And the actual table would be passed as:
>>
>> { sql_tbl = {
>> {age=1, weight=50.0, plot=1, species="ABCD" },
>> {age=1, weight=49.9, plot=1, species="ABCE" },
>> {age=2, weight=50.1, plot=2, species="ABCD" }
>> }
>> }
>>
>>
>> Shawn
>>
>>
>>
>>
>> Dan Higgins wrote:
>>
>>> Rod,
>>> I haven't been following very closely all the work you and Jing
>>> (and others) are doing, so this may be a silly question. I notice
>>> that you have referred to the EML200DataSource returning a table.
>>> Just what sort of data structure are you referring to? As far as I
>>> know, Kepler doesn't have a 'TableToken'. Is this a Java class or
>>> some array/vector of column data (strings?) in the Java code? It
>>> would be nice if we could create some table structure like the 'data
>>> frames' of 'R' that could be passed between actors in Kepler and
>>> easily manipulated.
>>>
>>> Dan
>>>
>>> Rod Spears wrote:
>>>
>>>> Things to think about before we meet:
>>>>
>>>> 1) The Eml200DataSource uses the Ecogrid to get Metadata about an
>>>> item and then returns the data for that item as a single table. The
>>>> QueryBuilder can be used to reduce the number of columns that are
>>>> passed through the ports, but is not necessarily a required part of
>>>> this data object.
>>>>
>>>> 2) What else will we be using the generic QueryBuilder for? Meaning
>>>> what kind of data object will be returning more than one table that
>>>> is not an Ecogrid Query?
>>>> 2.1) I think we have talked about this when the user will be
>>>> accessing local data files; thru HSQL? JDBC?
>>>> 2.1.1) If so, then how do they discover and get their local
>>>> data into Kepler?
>>>>
>>>> 3) Do we need a more generic EcogridDataSource object that can
>>>> execute generic Ecogrid Queries? And if so, do we need an Ecogrid
>>>> Query specific QueryBuilder instead of a generic one?
>>>>
>>>> 4) Do we need a DiGIR Data Source object, or would this be covered
>>>> by #2? If it were DiGIR specific then we could get data from nodes
>>>> that may not be registered???? (I am not sure)
>>>>
>>>> Rod
>>>>
>>>> --
>>>> Rod Spears
>>>> Biodiversity Research Center
>>>> University of Kansas
>>>> 1345 Jayhawk Boulevard
>>>> Lawrence, KS 66045, USA
>>>> Tel: 785 864-4082, Fax: 785 864-5335
>>>>
>>>
>>>
>>> --
>>> *******************************************************************
>>> Dan Higgins higgins at nceas.ucsb.edu
>>> http://www.nceas.ucsb.edu/ Ph: 805-892-2531
>>> National Center for Ecological Analysis and Synthesis (NCEAS) 735
>>> State Street - Room 205
>>> Santa Barbara, CA 93195
>>> *******************************************************************
>>>
>>
>> _______________________________________________
>> seek-dev mailing list
>> seek-dev at ecoinformatics.org
>> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
>
>
>
>