[seek-dev] data

Matt Jones jones at nceas.ucsb.edu
Thu Apr 8 11:51:48 PDT 2004

Thanks for the update, Deana.

One important thing for Kepler is to be able to do the preprocessing you 
need to get from raw data to the data you want to use in an analysis. 
Thus, if you need to clip, reproject, change resolution, etc., then we'd 
like you to be able to script those operations within Kepler.  Chad was 
working on actors that provide at least some of this functionality by 
wrapping GRASS.  Do you think he has a complete list of the operations 
you might need to perform?  If not, compiling one would be extremely 
valuable.  Otherwise, we'll still be in the same situation as now, where 
a bunch of manual preprocessing is required before ingesting data into a 
workflow, whereas we want that preprocessing to be part of the workflow.
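To make the idea concrete, here is a minimal Python sketch of scripting those 
preprocessing steps as chained workflow stages rather than doing them by hand. 
All names here are hypothetical (this is not Kepler's actual actor API, and the 
functions only record the operations, standing in for real GRASS calls):

```python
# Hypothetical sketch: chain raster preprocessing steps (clip, reproject,
# resample) as scripted workflow stages.  Each step returns a new raster
# description; a real implementation would call out to GRASS instead.

def clip(raster, bbox):
    """Restrict the raster to a bounding box (stand-in for a GRASS clip)."""
    return {**raster, "bbox": bbox, "ops": raster["ops"] + ["clip"]}

def reproject(raster, crs):
    """Change the coordinate reference system (stand-in for r.proj-style work)."""
    return {**raster, "crs": crs, "ops": raster["ops"] + ["reproject"]}

def resample(raster, cell_size):
    """Change the grid cell size (stand-in for a GRASS resampling step)."""
    return {**raster, "cell_size": cell_size, "ops": raster["ops"] + ["resample"]}

def preprocess(raster, steps):
    """Apply each (function, kwargs) step in order, like actors in a workflow."""
    for fn, kwargs in steps:
        raster = fn(raster, **kwargs)
    return raster

# A raw climate raster as it might arrive from a provider (values illustrative).
raw = {"name": "ipcc_climate", "crs": "EPSG:4326", "cell_size": 0.5,
       "bbox": None, "ops": []}

ready = preprocess(raw, [
    (clip, {"bbox": (-125, 25, -66, 50)}),
    (reproject, {"crs": "EPSG:5070"}),
    (resample, {"cell_size": 0.1}),
])
print(ready["ops"])  # ['clip', 'reproject', 'resample']
```

The point is only that once each operation is a scripted step with explicit 
parameters, the whole chain can live inside the workflow and be rerun when the 
source data change, instead of being a one-off manual job.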


Deana Pennington wrote:
> Matt,
> There is quite a bit of work that will have to be done to the climate 
> data, before it can be used in Chad's pipeline.  I'm still trying to get 
> a final decision from the ENM community about which datasets they want 
> to use (sigh... no wonder you guys get frustrated working with 
> ecologists; response time is slow).  We'll have to write scripts to 
> integrate the different climate scenario data, because they have 
> different projections, different grid cell sizes and shapes, etc, plus 
> we will have to do the EML.  I think we can do all of the tasks here in 
> ABQ, once I get our DARPA postdoc hired (hopefully he will be here by 
> late June).  I'm attaching a task doc, if you're interested in exactly 
> what I think needs to be done to the data.  If all goes well, Chad's 
> prototype will be done by Scotland, and we can have the data ready by 
> the end of July, which means we can have the project results by the end 
> of the summer.  It will take some time after that, for Town to analyze 
> the results.
> BTW, who is the owner of the crosswalk tool that Blankman was using, and 
> where could I get it?  I want to try to automate the EML creation for 
> these datasets as much as possible.
> FYI, also attaching the postdoc's CV & letters (3 pdf files), which I 
> think you said were corrupted when I sent them to you earlier.
> Coming next... Use case for the mammal project, to incorporate into 
> Shawn's write-up.  Coming next week... EOT management plan to set context 
> for Samantha's performance plan, for Jim Reichman.  Would like you to 
> look at it & comment before I send it to Jim.
> Deana
> Matt Jones wrote:
>> Hi Shawn,
>> Those requirements have not been formally defined.  To use data 
>> generally within the context of SEEK and Kepler, we almost certainly 
>> need a good metadata description in order to interpret the data 
>> correctly.  I think it makes sense for that to be an EML description, 
>> or maybe EML with some SEEK extensions (e.g., for semantic labeling).  
>> EML of course is quite loose about which metadata are required.  If 
>> someone omits the physical and logical descriptions of the data, it 
>> would be hard to build automated ingestion tools.  Our work so far in 
>> Kepler on automatically ingesting arbitrary data sources assumes that 
>> they have a complete entity/attribute description in EML.  There are 
>> certainly other ways to provide this information that I would not want 
>> to rule out as options, but I think the EML route is a sensible one for 
>> SEEK.
>> One of the other Kepler developers (Efrat from GEON) created a data 
>> ingestion actor based on JDBC.  You provide an endpoint and a SQL 
>> query as input and it exposes the records as output.  This is another 
>> way of getting data into Kepler.  Its problem is that there is no 
>> formal relationship between the SQL query and the datatypes of the 
>> output port(s).  I think we could consolidate some of this code with 
>> other code (such as Chad's EML ingestion actor) and come up with a 
>> more general approach that is extensible.  Here's what I've been 
>> thinking...
>> Using data in Kepler involves 1) transporting the data to the machine, 
>> 2) filtering the data to produce a subset (potentially on a remote 
>> machine before (1)), and 3) exposing the resulting data as 
>> strongly-typed ports in Kepler.  The first (1) is accomplished now 
>> through jdbc, file system access, grid access, and (soon) ecogrid 
>> access.  The second (2) is part of the proposal we've made for ecogrid 
>> access (a generic means of expressing filter conditions) and is part 
>> of Efrat's jdbc actor (via sql).  The third (3) is currently handled 
>> by Chad's (somewhat incomplete) EML ingestion actor, although I think 
>> it could be generalized to support other metadata sources as well.  We 
>> (the EcoGrid team) will be continuing to explore these issues in more 
>> detail as Jing and Rod continue working on incorporating the EcoGrid 
>> client into Kepler.  Comments appreciated, especially on the proposed 
>> data access changes to Kepler (see kepler/docs/dev/screenshots and 
>> kepler/docs/dev/EcoGrid* in CVS).
>> Cheers,
>> Matt
>> Shawn Bowers wrote:
>>> Out of curiosity, what exactly is the "SEEK requirement" for dataset 
>>> use in analytical pipelines? For example, your email below seems to 
>>> suggest detailed EML metadata and placement in a catalog service 
>>> (EcoGrid), which involves placement in a SEEK-aware catalog system 
>>> (e.g., Metacat or SRB) that I assume is (or will eventually be) 
>>> curated.
>>> Are there other requirements? Are these or additional requirements 
>>> captured somewhere?
>>> Shawn
>>> Matt Jones wrote:
>>>> Deana,
>>>> I took a look at the site containing the data.  In order to get it 
>>>> into EcoGrid reasonably, we really should develop some EML metadata 
>>>> descriptions of the products you are interested in from that site.  
>>>> I'm not sure how much work that would be -- depends on how complex 
>>>> and variable the different data sources are.  Once we have an EML 
>>>> description of each source, we can add them to the EcoGrid 
>>>> (currently that means manually adding the EML and the data to one of 
>>>> the EcoGrid systems).  My guess is that Metacat and SRB could be 
>>>> used for this one, but DiGIR is probably not appropriate for this 
>>>> data type.  Jing is working on putting EcoGrid access capabilities 
>>>> into Kepler, so once the data sets are accessible in EcoGrid you 
>>>> should be able to use them in Kepler in the workflow Chad is 
>>>> developing.
>>>> Matt
>>>> Deana Pennington wrote:
>>>>> At the BEAM/AMS/KR meeting in early Feb, we designed a first 
>>>>> application for the ecological niche modelling community that 
>>>>> involves analyzing the effect of various modeled climate change 
>>>>> scenarios on mammal populations.  To do the analysis, we need to 
>>>>> use climate data from the following site:
>>>>> IPCC climate change:    http://ipcc-ddc.cru.uea.ac.uk/
>>>>> There will be other sites as well; I'll let you know when I find 
>>>>> out what they are.  We will need to either set these up as nodes on 
>>>>> the EcoGrid, or mirror the sites on one of our nodes.  Could 
>>>>> someone please take a look at this site, and let me know if that is 
>>>>> possible any time in the near future?  I am currently trying to 
>>>>> figure out exactly which data are needed, and what we will have to 
>>>>> do to them to get them into the workflow Chad is constructing.
>>>>> Thanks,
>>>>> Deana
>>>> _______________________________________________
>>>> seek-dev mailing list
>>>> seek-dev at ecoinformatics.org
>>>> http://www.ecoinformatics.org/mailman/listinfo/seek-dev

Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
