[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Tue Apr 20 10:15:10 PDT 2004

Comments on one of your comments :-)

(Also, I CC'd seek-dev in case anyone else is interested in the thread)

Deana Pennington wrote:

>> - In the scenario, interestingly, the researcher first searches for 
>> appropriate workflows and once found, searches for the data.  It seems 
>> like it could go either way: the researcher may have found roughly the 
>> data to support their hypothesis, and then wants to find the right 
>> workflow/analysis to use on the data.  The latter seems like it has 
>> more potential need for data integration/transformation in that as a 
>> researcher looking for data, you wouldn't be restricted by everything 
>> being "uniform" just so you could plug it into the right model (of 
>> course, you wouldn't necessarily be limited in this way by finding 
>> analyses first, but I think it becomes a much harder problem).  
>> Instead, you would be looking for "good" data, regardless of whether 
>> it is nicely formatted (which seems to be true for the mammal case -- 
>> I believe that is the motivation for using IPCC data).
> 
> 
> Yes, it could go either way.  However, I think for most scientists, they 
> think of the problem first, then look for data.  The order is more 
> likely to be, "I want to compare NPP at grassland sites around the 
> world, and there are 4 different ways I could calculate NPP, and each of 
> those ways requires different types of data".  The "4 different ways" 
> would be expressed as analytical workflows.  It is possible, though, 
> that after framing the question "I want to compare NPP", then they would 
> decide to look and see what data are available before thinking about the 
> appropriate analysis.  In fact, the whole idea of data-driven analyses 
> is a new one in ecology (and science in general), and there are whole 
> groups of people who think it is a completely wrong approach.
> 

I take back my original statement that the problem is harder in one 
direction than in the other. Basically, our problem is to match datasets 
with services.  There is a set of constraints Dq (the query) that we 
want the datasets to satisfy, and a set of implied constraints Dc (e.g., 
structural and semantic constraints) in the datasets found by Dq.  
Similarly, there is a set of constraints Sq that we want the services to 
satisfy (e.g., that the service/workflow computes NPP), and a set of 
implied constraints Sc (e.g., structural and semantic constraints on 
inputs) in the services found by Sq. So generally, regardless of whether 
we search for datasets first or services first, our goal is to figure 
out a way to transform and group the datasets to make the implied 
constraints on the datasets fit with the implied constraints on the 
services.  The problem changes (which is what my original point was 
trying to say) if we assume that the datasets we look for *must* match 
the service constraints without any transformation.  That assumption is 
the current motivation for choosing the IPCC data, which isn't 
necessarily a bad thing; it just isn't general, as we already know.
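
To make the matching problem a bit more concrete, here is a rough Python 
sketch, not a proposal for the actual prototype. Everything in it is an 
assumption for illustration: the implied constraints of a dataset and 
the input constraints of a service are each reduced to a hypothetical 
(format, unit) pair, and the TRANSFORMS table, the dataset names, and 
the service name are all made up. Real structural and semantic 
constraints would be much richer.

```python
from collections import deque

# Hypothetical one-step transformations between (format, unit) descriptions.
TRANSFORMS = {
    ("csv", "g/m2/day"): [("csv", "g/m2/yr")],    # unit conversion
    ("netcdf", "g/m2/yr"): [("csv", "g/m2/yr")],  # format conversion
}


def transform_path(dataset_desc, service_req, transforms=TRANSFORMS):
    """Breadth-first search for a chain of transformations that turns the
    dataset's description into the one the service requires.
    Returns the list of descriptions visited, or None if no chain exists."""
    queue = deque([[dataset_desc]])
    seen = {dataset_desc}
    while queue:
        path = queue.popleft()
        if path[-1] == service_req:
            return path
        for nxt in transforms.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def match(datasets, services, allow_transform=True):
    """Pair every dataset with every service it can feed.
    Each result is (dataset name, service name, number of transformations)."""
    pairs = []
    for dname, desc in datasets.items():
        for sname, req in services.items():
            if allow_transform:
                path = transform_path(desc, req)
            else:
                # Restricted case: the dataset must fit the service as-is.
                path = [desc] if desc == req else None
            if path is not None:
                pairs.append((dname, sname, len(path) - 1))
    return pairs


# Made-up candidates: the datasets found by the data query and the
# services found by the service query.
datasets = {
    "uniform_grid": ("csv", "g/m2/yr"),
    "model_output": ("netcdf", "g/m2/yr"),
    "field_plots": ("csv", "g/m2/day"),
}
services = {"compute_npp": ("csv", "g/m2/yr")}

general = match(datasets, services)
restricted = match(datasets, services, allow_transform=False)
```

With allow_transform=False only the dataset that already fits the 
service survives, which is the restricted version of the problem; with 
transformations allowed, the order in which datasets and services were 
found makes no difference to the matching step.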

Also, for data-driven analysis, what is the argument as to why people 
say it is the wrong approach?  Aren't there other "scientific" fields, 
such as medicine or psychology (I consider these scientific, but that 
probably isn't the general classification), that are very much "data 
driven" in this way? I am just curious, and am interested in hearing 
your opinion ...

Shawn