[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Bertram Ludaescher ludaesch at sdsc.edu
Tue Apr 20 18:55:08 PDT 2004

Interesting discussion!

Btw: I think of two very different things when I hear data-driven:

(a) what you guys say below: I have some data, let's see what I can do 
with it, and 

(b) some scientific workflow that is data-driven (optionally in a very
technical sense as in Ptolemy/Kepler)


>>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
SB> Comments on one of your comments :-)
SB> (Also, I CC'd seek-dev in case anyone else is interested in the thread)
SB> Deana Pennington wrote:
>>> - In the scenario, interestingly, the researcher first searches for 
>>> appropriate workflows and once found, searches for the data.  It seems 
>>> like it could go either way: the researcher may have found roughly the 
>>> data to support their hypothesis, and then wants to find the right 
>>> workflow/analysis to use on the data.  The latter seems like it has 
>>> more potential need for data integration/transformation in that as a 
>>> researcher looking for data, you wouldn't be restricted by everything 
>>> being "uniform" just so you could plug it into the right model (of 
>>> course, you wouldn't necessarily be limited in this way by finding 
>>> analyses first, but I think it becomes a much harder problem).  
>>> Instead, you would be looking for "good" data, regardless of whether 
>>> it is nicely formatted (which seems to be true for the mammal case -- 
>>> I believe that is the motivation for using IPCC data).
>> Yes, it could go either way.  However, I think for most scientists, they 
>> think of the problem first, then look for data.  The order is more 
>> likely to be, "I want to compare NPP at grassland sites around the 
>> world, and there are 4 different ways I could calculate NPP, and each of 
>> those ways requires different types of data".  The "4 different ways" 
>> would be expressed as analytical workflows.  It is possible, though, 
>> that after framing the question "I want to compare NPP", then they would 
>> decide to look and see what data are available before thinking about the 
>> appropriate analysis.  In fact, the whole idea of data-driven analyses 
>> is a new one in ecology (and science in general), and there are whole 
>> groups of people who think it is a completely wrong approach.
SB> I take back my original statement that the problem is harder in one 
SB> direction than in the other. Basically, our problem is to match datasets 
SB> with services.  There is a set of constraints we want the datasets to 
SB> satisfy Dq (the query) and a set of implied constraints in the datasets 
SB> found Dc (e.g., structural and semantic constraints) by Dq.  Similarly, 
SB> there is a set of constraints we want the services to satisfy Sq (e.g., 
SB> that the services/workflow computes NPP), and a set of implied 
SB> constraints in the services found Sc (structural and semantic 
SB> constraints on inputs, e.g.) by Sq. So generally, regardless of whether 
SB> we search for datasets first or services first, our goal is to figure 
SB> out a way to transform and group the datasets to make the implied 
SB> constraints on the datasets fit with the implied constraints on the 
SB> services.  The problem changes (which is what my original point was 
SB> trying to say) if we assume that the datasets we look for *must* match 
SB> (without any transformation) the service constraints, which is the 
SB> current motivation for choosing the IPCC data (which isn't necessarily a 
SB> bad thing, it just isn't general, which we already know).
SB> Also, for data-driven analysis, what is the argument as to why people 
SB> say it is the wrong approach?  Aren't there other "scientific" fields, 
SB> such as medicine or psychology (I consider these scientific, but that 
SB> probably isn't the general classification), that are very much "data 
SB> driven" in this way? I am just curious, and am interested in hearing 
SB> your opinion ...
SB> Shawn
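
To make the matching problem above concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions, not how SEEK or Kepler does it: constraints are reduced to flat property/value pairs, and the names `TRANSFORMS`, `plan_match`, and the example units and formats are all hypothetical. The function takes the implied constraints of a found dataset and the implied input constraints of a found service. It returns the transformation steps needed to make them fit, or None if they cannot be made to fit. Setting `allow_transform=False` models the restricted case where the data must match the service as-is.

```python
from typing import Dict, List, Optional, Tuple

Constraints = Dict[str, str]
Step = Tuple[str, Optional[str], str]  # (property, actual value, required value)

# Hypothetical registry of known transformations (an assumption for this sketch).
TRANSFORMS = {
    ("unit", "g/m2", "kg/ha"),
    ("format", "csv", "netcdf"),
}

def plan_match(dataset_constraints: Constraints,
               service_constraints: Constraints,
               allow_transform: bool = True) -> Optional[List[Step]]:
    """Return the steps needed to fit the dataset to the service, or None."""
    steps: List[Step] = []
    for prop, required in service_constraints.items():
        actual = dataset_constraints.get(prop)
        if actual == required:
            continue  # this constraint is already satisfied
        step = (prop, actual, required)
        # A mismatch is fatal if transformation is disallowed or none is known.
        if not allow_transform or step not in TRANSFORMS:
            return None
        steps.append(step)
    return steps

dataset = {"unit": "g/m2", "format": "csv"}
service = {"unit": "kg/ha", "format": "csv"}
print(plan_match(dataset, service))
print(plan_match(dataset, service, allow_transform=False))
```

The order in which datasets and services were found does not appear anywhere in the function, which is the point made above: either way, the work is in reconciling the two sets of implied constraints.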
