[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Tue Apr 27 15:46:22 PDT 2004

(Note that I moved this thread to kr-sms ... and off of seek-dev)

 > Actually, this seems to me to be a fundamental difference in the way
 > CIS/IM and domain scientists approach problems.

Even in computer science, research follows the model you describe 
directly below. In addition, I believe that most fields in CS are not 
data-driven -- even in database fields (we don't care about the data, we 
care about the algorithms and their generality).  There are CS fields 
that are more closely related to branches of psychology (e.g., 
human-computer interaction, and natural language processing) that are 
exceptions.  Typically the hypothesis testing in these more 
"touchy-feely" CS fields strongly depends on the experimental data, and 
uses standard techniques to evaluate the hypotheses ... These may use 
available data, or they may require designing new experiments to get 
data.  I would consider both of these data-driven.  And, these are 
definitely useful endeavors -- e.g., in the field of medicine.

The way you characterize data-driven below reminds me of data mining -- 
you have some data and you try to find patterns in the data.  I 
definitely don't advocate this approach in SEEK ... and this isn't 
really what I was suggesting before.

So, I think we agree that pure tool-driven (not sure of an example) and 
data-driven approaches (data mining in the traditional sense) are out (I 
don't think we ever thought they were in, but anyway ...), and that users 
of SEEK technology generally will have a hypothesis in mind when they 
interact with the system.  Given that, is it useful to try to capture / 
represent hypotheses in the system, and if so, how could they be 
exploited and how could they be practically represented?

For example, could workflows be organized based on their applicability 
to certain styles of hypothesis?  Or, as a holy grail, you could imagine 
a scientist entering a hypothesis and the system actually trying to 
organize data and services that could be used to test the hypothesis 
(where the hypothesis is like a query, I suppose). For the latter, GEON 
is actually designing, and has designed, many of their test cases and 
use cases around specific hypotheses ... as opposed to the approach in 
SEEK of focusing test cases on a tool (GARP).
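
As a rough illustration of the "hypothesis is like a query" idea, here is 
a minimal Python sketch of matching datasets to a service.  Everything in 
it is hypothetical and made up for this example (the Dataset and Service 
classes, the concept names, the transforms table, and plan_inputs); it is 
not SEEK or GEON code.  It treats a service's input requirements as 
constraints and looks for datasets that satisfy them either directly or 
through a known transformation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    provides: frozenset  # concepts the dataset offers (hypothetical labels)


@dataclass(frozen=True)
class Service:
    name: str
    computes: str        # what the service/workflow computes, e.g. "NPP"
    requires: frozenset  # input concepts the service needs


def plan_inputs(service, datasets, transforms):
    """Map each required input to (dataset name, transform or None).

    `transforms` maps a source concept to the concept it can be converted
    into.  Returns None if some required input cannot be supplied.
    """
    plan = {}
    for need in sorted(service.requires):
        source = None
        # First look for a dataset that provides the concept as-is.
        for d in datasets:
            if need in d.provides:
                source = (d.name, None)
                break
        # Otherwise look for a dataset that fits after one transformation.
        if source is None:
            for d in datasets:
                for have in sorted(d.provides):
                    if transforms.get(have) == need:
                        source = (d.name, f"{have}->{need}")
                        break
                if source is not None:
                    break
        if source is None:
            return None
        plan[need] = source
    return plan


# Made-up example: an NPP workflow and two datasets in different units.
npp = Service("npp_method_1", "NPP",
              frozenset({"precipitation_mm", "biomass_g_m2"}))
datasets = [
    Dataset("site_a", frozenset({"biomass_g_m2"})),
    Dataset("site_b", frozenset({"rainfall_in"})),
]
transforms = {"rainfall_in": "precipitation_mm"}

strict_plan = plan_inputs(npp, datasets, {})            # no transformation allowed
flexible_plan = plan_inputs(npp, datasets, transforms)  # transformation allowed
```

With no transformations allowed the match fails, which is the restricted 
case where datasets must already fit the service.  Allowing 
transformations lets the second dataset be used as well.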

Shawn

Deana Pennington wrote:

> Sorry so long to reply...I've been at a conference without e-mail...
> 
> The entire scientific process is designed around testing hypotheses.  
> You come up with a research question of interest, then create an 
> analysis to test it.  NSF funding (and other funding sources) are 
> completely based on the strength (scientific merit) of the question and 
> how well thought out the proposed methodology is.  The idea of 
> integrating data simply to see if anything comes out of it is strongly 
> resisted, as is the idea of tool-driven science.  The general argument 
> is that science should be directed and focused along paths that have 
> been rationally determined.  Occasionally a tool comes along that 
> changes the way we can think about science (like the microscope, for 
> example), and for a short time, some exploratory analysis is funded.  
> But that is the exception, not the norm.  The synthetic work that is 
> being encouraged may depend on data integration, but it will have to be 
> proposed as a traditional research question to get funded.  It's the 
> difference between saying you want to put climate and hydrology data 
> together over time to look for interesting patterns, and having a 
> focused question that requires data integration to do the analysis 
> (hypothesis: drought in the western US has resulted in reduced 
> evapotranspiration in high elevation forests, which should result in an 
> increase in runoff for a given increase in precipitation).
> 
> Actually, this seems to me to be a fundamental difference in the way 
> CIS/IM and domain scientists approach problems.  I've been having a 
> long-term discussion about this with Samantha.  The RCN classes have 
> presented a data-centric view that works well with information managers, 
> but did not work well with the domain scientists at the new fac/postdoc 
> workshop.  They kept wondering what the goals/objectives were of the 
> information that was presented early in the week (Why are we doing 
> this?).  For the distributed graduate seminar, we have intentionally 
> changed that order around to a research question focus.  We'll see what 
> kind of response we get, but I think it will resonate with them.  
> Formulating your ideas through knowledge representation, pulling 
> together concepts, creating approaches to workflows...those are early in 
> the seminar, and would occur early in the scientific process, long 
> before a scientist thinks about data models, structures, or metadata.
> 
> Deana
> 
> 
> Bertram Ludaescher wrote:
> 
>> Interesting discussion!
>>
>> Btw: I think of two very different things when I hear data-driven:
>>
>> (a) what you guys say below: I have some data, let's see what I can do 
>> with it, and
>> (b) some scientific workflow that is data-driven (optionally in a very
>> technical sense as in Ptolemy/Kepler)
>>
>> Bertram
>>
>>>>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
>>
>> SB> Comments on one of your comments :-)
>> SB>
>> SB> (Also, I CC'd seek-dev in case anyone else is interested in the
>> SB> thread)
>> SB>
>> SB> Deana Pennington wrote:
>> SB>
>>>>> - In the scenario, interestingly, the researcher first searches for 
>>>>> appropriate workflows and once found, searches for the data.  It 
>>>>> seems like it could go either way: the researcher may have found 
>>>>> roughly the data to support their hypothesis, and then wants to 
>>>>> find the right workflow/analysis to use on the data.  The latter 
>>>>> seems like it has more potential need for data 
>>>>> integration/transformation in that as a researcher looking for 
>>>>> data, you wouldn't be restricted by everything being "uniform" just 
>>>>> so you could plug it into the right model (of course, you wouldn't 
>>>>> necessarily be limited in this way by finding analyses first, but I 
>>>>> think it becomes a much harder problem).  Instead, you would be 
>>>>> looking for "good" data, regardless of whether it is nicely 
>>>>> formatted (which seems to be true for the mammal case -- I believe 
>>>>> that is the motivation for using IPCC data).
>>>>>       
>>>>
>>>> Yes, it could go either way.  However, I think for most scientists, 
>>>> they think of the problem first, then look for data.  The order is 
>>>> more likely to be, "I want to compare NPP at grassland sites around 
>>>> the world, and there are 4 different ways I could calculate NPP, and 
>>>> each of those ways requires different types of data".  The "4 
>>>> different ways" would be expressed as analytical workflows.  It is 
>>>> possible, though, that after framing the question "I want to 
>>>> compare NPP", then they would decide to look and see what data are 
>>>> available before thinking about the appropriate analysis.  In fact, 
>>>> the whole idea of data-driven analyses is a new one in ecology (and 
>>>> science in general), and there are whole groups of people who think 
>>>> it is a completely wrong approach.
>>>>
>>>>     
>>
>> SB> I take back my original statement that the problem is harder in one
>> SB> direction than in the other. Basically, our problem is to match datasets
>> SB> with services.  There is a set of constraints we want the datasets to
>> SB> satisfy Dq (the query) and a set of implied constraints in the datasets
>> SB> found Dc (e.g., structural and semantic constraints) by Dq.  Similarly,
>> SB> there is a set of constraints we want the services to satisfy Sq (e.g.,
>> SB> that the services/workflow computes NPP), and a set of implied
>> SB> constraints in the services found Ds (structural and semantic
>> SB> constraints on inputs, e.g.) by Dq. So generally, regardless of whether
>> SB> we search for datasets first or services first, our goal is to figure
>> SB> out a way to transform and group the datasets to make the implied
>> SB> constraints on the datasets fit with the implied constraints on the
>> SB> services.  The problem changes (which is what my original point was
>> SB> trying to say) if we assume that the datasets we look for *must* match
>> SB> (without any transformation) the service constraints, which is the
>> SB> current motivation for choosing the IPCC data (which isn't necessarily a
>> SB> bad thing, it just isn't general, which we already know).
>> SB>
>> SB> Also, for data-driven analysis, what is the argument as to why people
>> SB> say it is the wrong approach?  Aren't there other "scientific" fields,
>> SB> such as medicine or psychology (I consider these scientific, but that
>> SB> probably isn't the general classification), that are very much "data
>> SB> driven" in this way? I am just curious, and am interested in hearing
>> SB> your opinion ...
>> SB>
>> SB> Shawn
>> SB>
>> SB> _______________________________________________
>> SB> seek-dev mailing list
>> SB> seek-dev at ecoinformatics.org
>> SB> http://www.ecoinformatics.org/mailman/listinfo/seek-dev