[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Tue Apr 27 15:50:30 PDT 2004

Shawn Bowers wrote:

> 
> (Note that I moved this thread to kr-sms ... and off of seek-dev)
> 
>  > Actually, this seems to me to be a fundamental difference in the way
>  > CIS/IM and domain scientists approach problems.
> 
> Even in computer science, research follows the model you describe 
> directly below. In addition, I believe that most fields in CS are not 
> data-driven -- even in database fields (we don't care about the data, we 
> care about the algorithms and their generality).  There are CS fields 
> that are more closely related to branches of psychology (e.g., 
> human-computer interaction, and natural language processing) that are 
> exceptions.  Typically the hypothesis testing in these more 
> "touchy-feely" CS fields strongly depends on the experimental data, and 
> uses standard techniques to evaluate their hypotheses ... These may use 
> available data, or they may require designing new experiments to get 
> data.  I would consider these both data-driven.  And, these are 
> definitely useful endeavors -- e.g., in the field of medicine.

I meant to say that medicine is another field that I would call "data 
driven" in this way.

> 
> The way you characterize data-driven below reminds me of data mining -- 
> you have some data and you try to find patterns in the data.  I 
> definitely don't advocate this approach in SEEK ... and this isn't 
> really what I was suggesting before.
> 
> So, I think we agree that pure tool-driven (not sure of an example) and 
> data-driven approaches (data mining in the traditional sense) are out (I 
> don't think we ever thought they were in, but anyway ...), and that users 
> of SEEK technology generally will have a hypothesis in mind when they 
> interact with the system.  Given that, is it useful to try to capture / 
> represent hypotheses in the system, and if so, how could they be exploited 
> and how could they be practically represented?
> 
> For example, could workflows be organized based on their applicability 
> to certain styles of hypothesis?  Or, as a holy grail, you could imagine 
> a scientist entering a hypothesis and the system actually trying to 
> organize data and services that could be used to test the hypothesis 
> (where the hypothesis is like a query, I suppose). For the latter, GEON 
> is actually designing, and has designed, many of their test cases and 
> use cases around specific hypotheses ... as opposed to the approach in 
> SEEK of focusing test cases on a tool (GARP).
> 
> Shawn
> 
> 
> 
> Deana Pennington wrote:
> 
>> Sorry so long to reply...I've been at a conference without e-mail...
>>
>> The entire scientific process is designed around testing hypotheses.  
>> You come up with a research question of interest, then create an 
>> analysis to test it.  NSF funding (and other funding sources) are 
>> completely based on the strength (scientific merit) of the question 
>> and how well thought out the proposed methodology is.  The idea of 
>> integrating data simply to see if anything comes out of it is strongly 
>> resisted, as is the idea of tool-driven science.  The general argument 
>> is that science should be directed and focused along paths that have 
>> been rationally determined.  Occasionally a tool comes along that 
>> changes the way we can think about science (like the microscope, for 
>> example), and for a short time, some exploratory analysis is funded.  
>> But that is the exception, not the norm.  The synthetic work that is 
>> being encouraged may depend on data integration, but it will have to 
>> be proposed as a traditional research question to get funded.  It's the 
>> difference between saying you want to put climate and hydrology data 
>> together over time to look for interesting patterns, and having a 
>> focused question that requires data integration to do the analysis 
>> (hypothesis: drought in the western US has resulted in reduced 
>> evapotranspiration in high elevation forests, which should result in an 
>> increase in runoff for a given increase in precipitation).
>>
>> Actually, this seems to me to be a fundamental difference in the way 
>> CIS/IM and domain scientists approach problems.  I've been having a 
>> long-term discussion about this with Samantha.  The RCN classes have 
>> presented a data-centric view that works well with information 
>> managers, but did not work well with the domain scientists at the new 
>> fac/postdoc workshop.  They kept wondering what the goals/objectives 
>> were of the information that was presented early in the week (Why are 
>> we doing this?).  For the distributed graduate seminar, we have 
>> intentionally changed that order around to a research question focus.  
>> We'll see what kind of response we get, but I think it will resonate 
>> with them.  Formulating your ideas through knowledge representation, 
>> pulling together concepts, creating approaches to workflows...those 
>> are early in the seminar, and would occur early in the scientific 
>> process, long before a scientist thinks about data models, 
>> structures, or metadata.
>>
>> Deana
>>
>>
>> Bertram Ludaescher wrote:
>>
>>> Interesting discussion!
>>>
>>> Btw: I think of two very different things when I hear data-driven:
>>>
>>> (a) what you guys say below: I have some data, let's see what I can 
>>> do with it, and
>>> (b) some scientific workflow that is data-driven (optionally in a very
>>> technical sense as in Ptolemy/Kepler)
>>>
>>> Bertram
>>>
>>>
>>>
>>>  
>>>
>>>>>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
>>>>>>>>           
>>>
>>>
>>> SB> Comments on one of your comments :-)
>>> SB>
>>> SB> (Also, I CC'd seek-dev in case anyone else is interested in the
>>> SB> thread)
>>> SB>
>>> SB> Deana Pennington wrote:
>>> SB>
>>>
>>>>>> - In the scenario, interestingly, the researcher first searches 
>>>>>> for appropriate workflows and once found, searches for the data.  
>>>>>> It seems like it could go either way: the researcher may have 
>>>>>> found roughly the data to support their hypothesis, and then wants 
>>>>>> to find the right workflow/analysis to use on the data.  The 
>>>>>> latter seems like it has more potential need for data 
>>>>>> integration/transformation in that as a researcher looking for 
>>>>>> data, you wouldn't be restricted by everything being "uniform" 
>>>>>> just so you could plug it into the right model (of course, you 
>>>>>> wouldn't necessarily be limited in this way by finding analyses 
>>>>>> first, but I think it becomes a much harder problem).  Instead, 
>>>>>> you would be looking for "good" data, regardless of whether it is 
>>>>>> nicely formatted (which seems to be true for the mammal case -- I 
>>>>>> believe that is the motivation for using IPCC data).
>>>>>>       
>>>>>
>>>>>
>>>>> Yes, it could go either way.  However, I think for most scientists, 
>>>>> they think of the problem first, then look for data.  The order is 
>>>>> more likely to be, "I want to compare NPP at grassland sites around 
>>>>> the world, and there are 4 different ways I could calculate NPP, 
>>>>> and each of those ways requires different types of data".  The "4 
>>>>> different ways" would be expressed as analytical workflows.  It is 
>>>>> possible, though, that after framing the question "I want to 
>>>>> compare NPP", then they would decide to look and see what data are 
>>>>> available before thinking about the appropriate analysis.  In fact, 
>>>>> the whole idea of data-driven analyses is a new one in ecology (and 
>>>>> science in general), and there are whole groups of people who think 
>>>>> it is a completely wrong approach.
>>>>>
>>>>>     
>>>
>>>
>>> SB> I take back my original statement that the problem is harder in one
>>> SB> direction than in the other. Basically, our problem is to match datasets
>>> SB> with services.  There is a set of constraints we want the datasets to
>>> SB> satisfy Dq (the query) and a set of implied constraints in the datasets
>>> SB> found Dc (e.g., structural and semantic constraints) by Dq.  Similarly,
>>> SB> there is a set of constraints we want the services to satisfy Sq (e.g.,
>>> SB> that the services/workflow computes NPP), and a set of implied
>>> SB> constraints in the services found Ds (structural and semantic
>>> SB> constraints on inputs, e.g.) by Dq. So generally, regardless of whether
>>> SB> we search for datasets first or services first, our goal is to figure
>>> SB> out a way to transform and group the datasets to make the implied
>>> SB> constraints on the datasets fit with the implied constraints on the
>>> SB> services.  The problem changes (which is what my original point was
>>> SB> trying to say) if we assume that the datasets we look for *must* match
>>> SB> (without any transformation) the service constraints, which is the
>>> SB> current motivation for choosing the IPCC data (which isn't necessarily a
>>> SB> bad thing, it just isn't general, which we already know).
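
The dataset/service matching problem described in the paragraph above can be
illustrated with a small sketch. It is only one possible reading of that
description, not SEEK code: the Dataset and Service classes, the TRANSFORMS
table, the match() function, and all sample records are hypothetical names and
data introduced here for illustration. The two query predicates play the role
of Dq and Sq, and the measure/unit fields stand in for the implied constraints
(Dc and Ds in the message).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    measures: str  # semantic constraint implied by the dataset
    unit: str      # structural constraint implied by the dataset


@dataclass(frozen=True)
class Service:
    name: str
    computes: str        # what the service/workflow produces
    input_measures: str  # implied semantic constraint on its input
    input_unit: str      # implied structural constraint on its input


# Hypothetical transformations that reconcile mismatched implied constraints.
# 1 g/m2 equals 10 kg/ha.
TRANSFORMS = {("g/m2", "kg/ha"): lambda value: value * 10.0}


def match(datasets, services, dataset_query, service_query, allow_transform=True):
    """Pair datasets with services whose implied constraints fit.

    Returns (dataset name, service name, transform key or None) tuples.
    The search order does not matter: both queries are applied first,
    then the implied constraints of each candidate pair are compared.
    """
    found_datasets = [d for d in datasets if dataset_query(d)]
    found_services = [s for s in services if service_query(s)]
    pairs = []
    for d in found_datasets:
        for s in found_services:
            if d.measures != s.input_measures:
                continue  # semantic mismatch, no transformation assumed
            if d.unit == s.input_unit:
                pairs.append((d.name, s.name, None))
            elif allow_transform and (d.unit, s.input_unit) in TRANSFORMS:
                pairs.append((d.name, s.name, (d.unit, s.input_unit)))
    return pairs


# Made-up sample records, for illustration only.
datasets = [
    Dataset("grassland_a", "biomass", "g/m2"),
    Dataset("grassland_b", "biomass", "kg/ha"),
    Dataset("rain", "precipitation", "mm"),
]
services = [Service("npp_harvest", "NPP", "biomass", "kg/ha")]


def dataset_query(d):
    return d.measures == "biomass"


def service_query(s):
    return s.computes == "NPP"


print(match(datasets, services, dataset_query, service_query))
print(match(datasets, services, dataset_query, service_query, allow_transform=False))
```

Calling match() with allow_transform=False corresponds to the restricted case
in which datasets *must* fit the service constraints without any
transformation; in this sketch, fewer pairs survive under that assumption.
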
>>> SB>
>>> SB> Also, for data-driven analysis, what is the argument as to why people
>>> SB> say it is the wrong approach?  Aren't there other "scientific" fields,
>>> SB> such as medicine or psychology (I consider these scientific, but that
>>> SB> probably isn't the general classification), that are very much "data
>>> SB> driven" in this way? I am just curious, and am interested in hearing
>>> SB> your opinion ...
>>> SB>
>>> SB> Shawn
>>> SB>
>>> SB> _______________________________________________
>>> SB> seek-dev mailing list
>>> SB> seek-dev at ecoinformatics.org
>>> SB> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
>>>  
>>>
>>
> 
> _______________________________________________
> seek-kr-sms mailing list
> seek-kr-sms at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms