[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Deana Pennington dpennington at lternet.edu
Tue Apr 27 14:14:22 PDT 2004

Sorry so long to reply...I've been at a conference without e-mail...

The entire scientific process is designed around testing hypotheses.  
You come up with a research question of interest, then create an 
analysis to test it.  NSF funding (and funding from other sources) is 
based entirely on the strength (scientific merit) of the question and 
how well thought out the proposed methodology is.  The idea of 
integrating data simply to see if anything comes out of it is strongly 
resisted, as is the idea of tool-driven science.  The general argument 
is that science should be directed and focused along paths that have 
been rationally determined.  Occasionally a tool comes along that 
changes the way we can think about science (like the microscope, for 
example), and for a short time, some exploratory analysis is funded.  
But that is the exception, not the norm.  The synthetic work that is 
being encouraged may depend on data integration, but it will have to be 
proposed as a traditional research question to get funded.  It's the 
difference between saying you want to put climate and hydrology data 
together over time to look for interesting patterns, and having a 
focused question that requires data integration to do the analysis 
(hypothesis: drought in the western US has resulted in reduced 
evapotranspiration in high elevation forests, which should result in an 
increase in runoff for a given increase in precipitation).

Actually, this seems to me to be a fundamental difference in the way 
CIS/IM and domain scientists approach problems.  I've been having a 
long-term discussion about this with Samantha.  The RCN classes have 
presented a data-centric view that works well with information managers, 
but did not work well with the domain scientists at the new fac/postdoc 
workshop.  They kept wondering what the goals/objectives were of the 
information that was presented early in the week (Why are we doing 
this?).  For the distributed graduate seminar, we have intentionally 
changed that order around to a research question focus.  We'll see what 
kind of response we get, but I think it will resonate with them.  
Formulating your ideas through knowledge representation, pulling 
together concepts, creating approaches to workflows...those are early in 
the seminar, and would occur early in the scientific process, long 
before a scientist thinks about data models, structures, or metadata.


Bertram Ludaescher wrote:

>Interesting discussion!
>Btw: I think of two very different things when I hear data-driven:
>(a) what you guys say below: I have some data, let's see what I can do 
>with it, and 
>(b) some scientific workflow that is data-driven (optionally in a very
>technical sense as in Ptolemy/Kepler)
>>>>>>"SB" == Shawn Bowers <bowers at sdsc.edu> writes:
>SB> Comments on one of your comments :-)
>SB> (Also, I CC'd seek-dev in case anyone else is interested in the thread)
>SB> Deana Pennington wrote:
>>>>- In the scenario, interestingly, the researcher first searches for 
>>>>appropriate workflows and once found, searches for the data.  It seems 
>>>>like it could go either way: the researcher may have found roughly the 
>>>>data to support their hypothesis, and then wants to find the right 
>>>>workflow/analysis to use on the data.  The latter seems like it has 
>>>>more potential need for data integration/transformation in that as a 
>>>>researcher looking for data, you wouldn't be restricted by everything 
>>>>being "uniform" just so you could plug it into the right model (of 
>>>>course, you wouldn't necessarily be limited in this way by finding 
>>>>analyses first, but I think it becomes a much harder problem).  
>>>>Instead, you would be looking for "good" data, regardless of whether 
>>>>it is nicely formatted (which seems to be true for the mammal case -- 
>>>>I believe that is the motivation for using IPCC data).
>>>Yes, it could go either way.  However, I think for most scientists, they 
>>>think of the problem first, then look for data.  The order is more 
>>>likely to be, "I want to compare NPP at grassland sites around the 
>>>world, and there are 4 different ways I could calculate NPP, and each of 
>>>those ways requires different types of data".  The "4 different ways" 
>>>would be expressed as analytical workflows.  It is possible, though, 
>>>that after framing the question "I want to compare NPP", then they would 
>>>decide to look and see what data are available before thinking about the 
>>>appropriate analysis.  In fact, the whole idea of data-driven analyses 
>>>is a new one in ecology (and science in general), and there are whole 
>>>groups of people who think it is a completely wrong approach.
>SB> I take back my original statement that the problem is harder in one 
>SB> direction than in the other. Basically, our problem is to match datasets 
>SB> with services.  There is a set of constraints Dq (the query) that we want 
>SB> the datasets to satisfy, and a set of implied constraints Dc (e.g., 
>SB> structural and semantic constraints) in the datasets found by Dq.  
>SB> Similarly, there is a set of constraints Sq that we want the services to 
>SB> satisfy (e.g., that the services/workflow computes NPP), and a set of 
>SB> implied constraints Sc (e.g., structural and semantic constraints on 
>SB> inputs) in the services found by Sq.  So generally, regardless of whether 
>SB> we search for datasets first or services first, our goal is to figure 
>SB> out a way to transform and group the datasets to make the implied 
>SB> constraints on the datasets fit with the implied constraints on the 
>SB> services.  The problem changes (which is what my original point was 
>SB> trying to say) if we assume that the datasets we look for *must* match 
>SB> (without any transformation) the service constraints, which is the 
>SB> current motivation for choosing the IPCC data (which isn't necessarily a 
>SB> bad thing, it just isn't general, which we already know).
>SB> Also, for data-driven analysis, what is the argument as to why people 
>SB> say it is the wrong approach?  Aren't there other "scientific" fields, 
>SB> such as medicine or psychology (I consider these scientific, but that 
>SB> probably isn't the general classification), that are very much "data 
>SB> driven" in this way? I am just curious, and am interested in hearing 
>SB> your opinion ...
>SB> Shawn
>SB> _______________________________________________
>SB> seek-dev mailing list
>SB> seek-dev at ecoinformatics.org
>SB> http://www.ecoinformatics.org/mailman/listinfo/seek-dev


Deana D. Pennington, PhD
Long-term Ecological Research Network Office

UNM Biology Department
MSC03  2020
1 University of New Mexico
Albuquerque, NM  87131-0001

505-272-7288 (office)
505-272-7080 (fax)
