[seek-dev] data

Thu Mar 18 10:38:14 PST 2004

Hi Shawn,

Those requirements have not been formally defined.  To use data 
generally within the context of SEEK and Kepler, we almost certainly 
need a good metadata description in order to interpret the data 
correctly.  I think it makes sense for that to be an EML description, or 
maybe EML with some SEEK extensions (e.g., for semantic labeling).  EML 
of course is quite loose about which metadata are required.  If someone 
omits the physical and logical descriptions of the data, it would be 
hard to build automated ingestion tools.  Our work so far in Kepler on 
automatically ingesting arbitrary data sources assumes that they have a 
complete entity/attribute description in EML.  There are certainly other 
ways to provide this information that I would not want to rule out as 
options, but I think the EML route is a sensible one for SEEK.
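
To illustrate why the physical and logical descriptions matter for
automated ingestion, here is a minimal sketch in Python.  The
dictionary is a hypothetical stand-in for a metadata description: its
keys are made up for this example and are not the EML schema, and the
sample data are invented.

```python
import csv
import io

# Hypothetical stand-in for a metadata description (NOT the EML schema).
# "physical" says how the bytes are laid out; "logical" names and types
# each attribute (column).
description = {
    "physical": {"field_delimiter": ",", "header_lines": 1},
    "logical": [
        {"name": "site", "type": str},
        {"name": "year", "type": int},
        {"name": "temp_c", "type": float},
    ],
}

# Invented sample data, for illustration only.
raw_text = "site,year,temp_c\nA,2001,12.5\nB,2002,13.1\n"


def ingest(text, desc):
    """Parse delimited text into typed records using the description."""
    physical = desc["physical"]
    attributes = desc["logical"]
    reader = csv.reader(io.StringIO(text),
                        delimiter=physical["field_delimiter"])
    # Skip the header lines declared in the physical description.
    rows = list(reader)[physical["header_lines"]:]
    records = []
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError(
                f"expected {len(attributes)} fields, got {len(row)}")
        # Convert each raw string with the type the description declares.
        records.append(
            {a["name"]: a["type"](v) for a, v in zip(attributes, row)})
    return records


records = ingest(raw_text, description)
```

Without the description, the tool would have to guess the delimiter,
the number of header lines, and whether "2001" is a number or a label.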

One of the other Kepler developers (Efrat from GEON) created a data 
ingestion actor based on JDBC.  You provide an endpoint and a SQL query 
as input and it exposes the records as output.  This is another way of 
getting data into Kepler.  Its problem is that there is no formal 
relationship between the SQL query and the datatypes of the output 
port(s).  I think we could consolidate some of this code with other code 
(such as Chad's EML ingestion actor) and come up with a more general 
approach that is extensible.  Here's what I've been thinking...

Using data in Kepler involves 1) transporting the data to the machine, 
2) filtering the data to produce a subset (potentially on a remote 
machine before (1)), and 3) exposing the resulting data as 
strongly-typed ports in Kepler.  The first (1) is accomplished now 
through JDBC, file system access, grid access, and (soon) EcoGrid 
access.  The second (2) is part of the proposal we've made for EcoGrid 
access (a generic means of expressing filter conditions) and is part of 
Efrat's JDBC actor (via SQL).  The third (3) is currently handled by 
Chad's (somewhat incomplete) EML ingestion actor, although I think it 
could be generalized to support other metadata sources as well.  We (the 
EcoGrid team) will be continuing to explore these issues in more detail 
as Jing and Rod continue working on incorporating the EcoGrid client 
into Kepler.  Comments appreciated, especially on the proposed data 
access changes to Kepler (see kepler/docs/dev/screenshots and 
kepler/docs/dev/EcoGrid* in CVS).
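
As a rough illustration of the three steps, here is a sketch in Python
(not Kepler's Java).  An in-memory SQLite database stands in for a
remote endpoint, and the function names, table, and port types are all
hypothetical; none of this is existing Kepler or EcoGrid code.

```python
import sqlite3


def transport():
    """Step 1: get the data to this machine.

    An in-memory SQLite database stands in for a remote source; the
    table and its rows are invented for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (site TEXT, year INTEGER, temp_c REAL)")
    conn.executemany(
        "INSERT INTO obs VALUES (?, ?, ?)",
        [("A", 2001, 12.5), ("B", 2002, 13.1), ("A", 2003, 14.0)],
    )
    return conn


def filter_data(conn, site):
    """Step 2: produce a subset, here with a SQL filter condition."""
    cur = conn.execute(
        "SELECT year, temp_c FROM obs WHERE site = ? ORDER BY year",
        (site,),
    )
    names = [col[0] for col in cur.description]  # column names only
    return names, cur.fetchall()


def expose_ports(names, rows, port_types):
    """Step 3: expose each column as a typed port.

    The declared types come from a metadata description (port_types),
    not from the query text, and each value is checked against them.
    """
    ports = {}
    for i, name in enumerate(names):
        expected = port_types[name]
        values = [row[i] for row in rows]
        for value in values:
            if not isinstance(value, expected):
                raise TypeError(
                    f"port {name!r}: {value!r} is not {expected.__name__}")
        ports[name] = (expected, values)
    return ports


conn = transport()
names, rows = filter_data(conn, "A")
ports = expose_ports(names, rows, {"year": int, "temp_c": float})
conn.close()
```

The point of the sketch is that step 3 takes the port types from a
separate description and checks the records against it, which is the
formal link that a bare SQL query does not provide.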

Cheers,
Matt

Shawn Bowers wrote:
> 
> Out of curiosity, what exactly is the "SEEK requirement" for dataset use 
> in analytical pipelines? For example, your email below seems to suggest 
> detailed EML metadata and placement in a catalog service (EcoGrid), 
> which involves placement in a SEEK-aware catalog system (Metacat or SRB, 
> e.g.) that I am assuming is (or eventually will be) curated.
> 
> Are there other requirements? Are these or additional requirements 
> captured somewhere?
> 
> Shawn
> 
> 
> 
> Matt Jones wrote:
> 
>> Deana,
>>
>> I took a look at the site containing the data.  In order to get it 
>> into EcoGrid reasonably, we really should develop some EML metadata 
>> descriptions of the products you are interested in from that site.  
>> I'm not sure how much work that would be -- depends on how complex and 
>> variable the different data sources are.  Once we have an EML 
>> description of each source, we can add them to the EcoGrid (currently 
>> that means manually adding the EML and the data to one of the EcoGrid 
>> systems).  My guess is that Metacat and SRB could be used for this 
>> one, but DiGIR is probably not appropriate for this data type.  Jing 
>> is working on putting EcoGrid access capabilities into Kepler, so once 
>> the data sets are accessible in EcoGrid you should be able to use them 
>> in Kepler in the workflow Chad is developing.
>>
>> Matt
>>
>> Deana Pennington wrote:
>>
>>> At the BEAM/AMS/KR meeting in early Feb, we designed a first 
>>> application for the ecological niche modelling community that 
>>> involves analyzing the effect of various modeled climate change 
>>> scenarios on mammal populations.  To do the analysis, we need to use 
>>> climate data from the following site:
>>>
>>> IPCC climate change:    http://ipcc-ddc.cru.uea.ac.uk/
>>>
>>> There will be other sites as well; I'll let you know when I find out 
>>> what they are.  We will need to either set these up as nodes on the 
>>> EcoGrid, or mirror the sites on one of our nodes.  Could someone 
>>> please take a look at this site, and let me know if that is possible 
>>> any time in the near future?  I am currently trying to figure out 
>>> exactly which data are needed, and what we will have to do to them to 
>>> get them into the workflow Chad is constructing.
>>> Thanks,
>>> Deana
>>>
>>>
>>
>>
>> _______________________________________________
>> seek-dev mailing list
>> seek-dev at ecoinformatics.org
>> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
> 
> 

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------