[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Deana Pennington dpennington at lternet.edu
Wed Apr 28 09:04:01 PDT 2004

It seems that if we have ontologies that represent what is known, and 
explicitly link analyses and data to those ontologies, we could 
represent a hypothesis in terms of the formalized ontology, then drive 
the data and analysis discovery from that.  This seems like a very 
intuitive approach to me, and doesn't seem like it would be that 
difficult to do, but then, I'm not the one who has to make it happen :)
I'm thinking, for example, that if I generated a hypothesis that 
linked sinkhole occurrence with some environmental variables, and 
formalized that hypothesis in terms of an existing geologic ontology, 
the geologic ontology would link with statistical and measurement 
ontologies from which it could be reasoned that a logistic model of 
occurrence data (dependent variable) and environmental layers 
(independent variables) would be relevant.  The geologic ontology would 
link with a statistical ontology, which would link with an ecological 
ontology, where the GARP model could be found, then used in an entirely 
different domain than the one for which it was constructed. 
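
A minimal sketch of that chaining, just to make the idea concrete. It assumes each ontology is reduced to a set of named concepts with directed links between them; every concept name and link below is hypothetical and not taken from any existing geologic, statistical, or ecological ontology. The hypothesis is a list of concepts, and the candidate analyses are those reachable from all of them.

```python
from collections import deque

# Hypothetical cross-ontology links (geologic -> statistical -> ecological).
# These names are illustrative assumptions, not real ontology terms.
LINKS = {
    "geo:SinkholeOccurrence": ["stat:BinaryOccurrenceData"],
    "geo:EnvironmentalVariable": ["stat:IndependentVariable"],
    "stat:BinaryOccurrenceData": ["stat:LogisticModel", "eco:SpeciesOccurrence"],
    "stat:IndependentVariable": ["stat:LogisticModel", "eco:EnvironmentalLayer"],
    "eco:SpeciesOccurrence": ["eco:GARP"],
    "eco:EnvironmentalLayer": ["eco:GARP"],
}

# Concepts that stand for analyses rather than data or phenomena.
ANALYSES = {"stat:LogisticModel", "eco:GARP"}


def reachable(start):
    """Return every concept reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in LINKS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def candidate_analyses(hypothesis_concepts):
    """Analyses reachable from every concept named in the hypothesis."""
    sets = [reachable(c) & ANALYSES for c in hypothesis_concepts]
    return sorted(set.intersection(*sets)) if sets else []


# Hypothesis: sinkhole occurrence is linked to environmental variables.
hypothesis = ["geo:SinkholeOccurrence", "geo:EnvironmentalVariable"]
print(candidate_analyses(hypothesis))
```

A real system would reason over typed relations rather than plain reachability, but even this simple traversal shows how a model built for one domain could surface in another.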

We talked about this some in our breakout group in Santa Barbara, but 
ontologically constructed hypothesis generation is something I have been 
thinking quite a bit about. In fact, when I was at NSF in January, I was 
asked what one thing I would recommend to enable synthetic research, and 
I responded that I would force would-be multidisciplinary teams to 
construct ontologies around all of the concepts relevant to their 
question of interest then explicitly show how their proposed research 
was going to extend/clarify the ontologies.  I have also suggested to 
Bob Waide that we go for funding to hold some working meetings with some 
of the LTER groups who are trying to come up with proposals for 
synthetic research, to do exactly that.  I haven't heard back from him, 
but I think this approach might be an opportunity to do some very 
interesting, new research within the kr/sms group.


Shawn Bowers wrote:

> Shawn Bowers wrote:
>> (Note that I moved this thread to kr-sms ... and off of seek-dev)
>>  > Actually, this seems to me to be a fundamental difference in the way
>>  > CIS/IM and domain scientists approach problems.
>> Even in computer science, research follows the model you describe 
>> directly below. In addition, I believe that most fields in CS are not 
>> data-driven -- even in database fields (we don't care about the data, 
>> we care about the algorithms and their generality).  There are CS 
>> fields that are more closely related to branches of psychology (e.g., 
>> human-computer interaction, and natural language processing) that are 
>> exceptions.  Typically the hypothesis testing in these more 
>> "touchy-feely" CS fields strongly depends on the experimental data, 
>> and uses standard techniques to evaluate the hypotheses ... These 
>> may use available data, or they may require designing new experiments 
>> to get data.  I would consider these both data-driven.  And, these 
>> are definitely useful endeavors -- e.g., in the field of medicine.
> I meant to say, that medicine is another field that I would say is 
> "data driven" in this way.
>> The way you characterize data-driven below reminds me of data mining 
>> -- you have some data and you try to find patterns in the data.  I 
>> definitely don't advocate this approach in SEEK ... and this isn't 
>> really what I was suggesting before.
>> So, I think we agree that pure tool-driven (not sure of an example) 
>> and data-driven approaches (data mining in the traditional sense) are 
>> out (I don't think we ever thought they were in, but anyway ...), and 
>> that users of SEEK technology generally will have a hypothesis in mind 
>> when they interact with the system. Given that, is it useful to try to 
>> capture / represent hypotheses in the system, and if so, how could they 
>> be exploited and how could they be practically represented?
>> For example, could workflows be organized based on their 
>> applicability to certain styles of hypothesis?  Or, as a holy grail, 
>> you could imagine a scientist entering a hypothesis and the system 
>> actually trying to organize data and services that could be used to 
>> test the hypothesis (where the hypothesis is like a query, I 
>> suppose). For the latter, GEON is actually designing, and has 
>> designed, many of their test cases and use cases around specific 
>> hypotheses ... as opposed to the approach in SEEK of focusing test 
>> cases on a tool (GARP).
>> Shawn
>> Deana Pennington wrote:
>>> Sorry so long to reply...I've been at a conference without e-mail...
>>> The entire scientific process is designed around testing 
>>> hypotheses.  You come up with a research question of interest, then 
>>> create an analysis to test it.  NSF funding (like that from other funding 
>>> sources) is completely based on the strength (scientific merit) of 
>>> the question and how well thought out the proposed methodology is.  
>>> The idea of integrating data simply to see if anything comes out of 
>>> it is strongly resisted, as is the idea of tool-driven science.  The 
>>> general argument is that science should be directed and focused 
>>> along paths that have been rationally determined.  Occasionally a 
>>> tool comes along that changes the way we can think about science 
>>> (like the microscope, for example), and for a short time, some 
>>> exploratory analysis is funded.  But that is the exception, not the 
>>> norm.  The synthetic work that is being encouraged may depend on 
>>> data integration, but it will have to be proposed as a traditional 
>>> research question to get funded.  It's the difference between saying 
>>> you want to put climate and hydrology data together over time to 
>>> look for interesting patterns, and having a focused question that 
>>> requires data integration to do the analysis (hypothesis: drought in 
>>> the western US has resulted in reduced evapotranspiration in high 
>>> elevation forests, which should result in an increase in runoff for 
>>> a given increase in precipitation).
>>> Actually, this seems to me to be a fundamental difference in the way 
>>> CIS/IM and domain scientists approach problems.  I've been having a 
>>> long-term discussion about this with Samantha.  The RCN classes have 
>>> presented a data-centric view that works well with information 
>>> managers, but did not work well with the domain scientists at the 
>>> new fac/postdoc workshop.  They kept wondering what the 
>>> goals/objectives were of the information that was presented early in 
>>> the week (Why are we doing this?).  For the distributed graduate 
>>> seminar, we have intentionally changed that order around to a 
>>> research question focus.  We'll see what kind of response we get, 
>>> but I think it will resonate with them.  Formulating your ideas 
>>> through knowledge representation, pulling together concepts, 
>>> creating approaches to workflows...those are early in the seminar, 
>>> and would occur early in the scientific process, long before a 
>>> scientist thinks about data models, structures, or metadata.
>>> Deana
>>> Bertram Ludaescher wrote:
>>>> Interesting discussion!
>>>> Btw: I think of two very different things when I hear data-driven:
>>>> (a) what you guys say below: I have some data, let's see what I can 
>>>> do with it, and
>>>> (b) some scientific workflow that is data-driven (optionally in a very
>>>> technical sense as in Ptolemy/Kepler)
>>>> Bertram
>>>> >>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
>>>> SB> Comments on one of your comments :-)
>>>> SB>
>>>> SB> (Also, I CC'd seek-dev in case anyone else is interested in the thread)
>>>> SB>
>>>> SB> Deana Pennington wrote:
>>>> SB>
>>>>>>> - In the scenario, interestingly, the researcher first searches 
>>>>>>> for appropriate workflows and once found, searches for the 
>>>>>>> data.  It seems like it could go either way: the researcher may 
>>>>>>> have found roughly the data to support their hypothesis, and 
>>>>>>> then wants to find the right workflow/analysis to use on the 
>>>>>>> data.  The latter seems like it has more potential need for data 
>>>>>>> integration/transformation in that as a researcher looking for 
>>>>>>> data, you wouldn't be restricted by everything being "uniform" 
>>>>>>> just so you could plug it into the right model (of course, you 
>>>>>>> wouldn't necessarily be limited in this way by finding analyses 
>>>>>>> first, but I think it becomes a much harder problem).  Instead, 
>>>>>>> you would be looking for "good" data, regardless of whether it 
>>>>>>> is nicely formatted (which seems to be true for the mammal case 
>>>>>>> -- I believe that is the motivation for using IPCC data).
>>>>>> Yes, it could go either way.  However, I think for most 
>>>>>> scientists, they think of the problem first, then look for data.  
>>>>>> The order is more likely to be, "I want to compare NPP at 
>>>>>> grassland sites around the world, and there are 4 different ways 
>>>>>> I could calculate NPP, and each of those ways requires different 
>>>>>> types of data".  The "4 different ways" would be expressed as 
>>>>>> analytical workflows.  It is possible, though, that after 
>>>>>> framing the question "I want to compare NPP", then they would 
>>>>>> decide to look and see what data are available before thinking 
>>>>>> about the appropriate analysis.  In fact, the whole idea of 
>>>>>> data-driven analyses is a new one in ecology (and science in 
>>>>>> general), and there are whole groups of people who think it is a 
>>>>>> completely wrong approach.
>>>> SB> I take back my original statement that the problem is harder in one
>>>> SB> direction than in the other. Basically, our problem is to match datasets
>>>> SB> with services.  There is a set of constraints we want the datasets to
>>>> SB> satisfy Dq (the query) and a set of implied constraints in the datasets
>>>> SB> found Dc (e.g., structural and semantic constraints) by Dq.  Similarly,
>>>> SB> there is a set of constraints we want the services to satisfy Sq (e.g.,
>>>> SB> that the services/workflow computes NPP), and a set of implied
>>>> SB> constraints in the services found Ds (structural and semantic
>>>> SB> constraints on inputs, e.g.) by Sq. So generally, regardless of whether
>>>> SB> we search for datasets first or services first, our goal is to figure
>>>> SB> out a way to transform and group the datasets to make the implied
>>>> SB> constraints on the datasets fit with the implied constraints on the
>>>> SB> services.  The problem changes (which is what my original point was
>>>> SB> trying to say) if we assume that the datasets we look for *must* match
>>>> SB> (without any transformation) the service constraints, which is the
>>>> SB> current motivation for choosing the IPCC data (which isn't necessarily a
>>>> SB> bad thing, it just isn't general, which we already know).
>>>> SB>
>>>> SB> Also, for data-driven analysis, what is the argument as to why people
>>>> SB> say it is the wrong approach?  Aren't there other "scientific" fields,
>>>> SB> such as medicine or psychology (I consider these scientific, but that
>>>> SB> probably isn't the general classification), that are very much "data
>>>> SB> driven" in this way? I am just curious, and am interested in hearing
>>>> SB> your opinion ...
>>>> SB>
>>>> SB> Shawn
>>>> SB>
>>>> SB> _______________________________________________
>>>> SB> seek-dev mailing list
>>>> SB> seek-dev at ecoinformatics.org
>>>> SB> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
>> _______________________________________________
>> seek-kr-sms mailing list
>> seek-kr-sms at ecoinformatics.org
>> http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms
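
The dataset/service matching problem Shawn describes in the quoted thread above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dataset and service records, the single "format" constraint standing in for all structural and semantic constraints, and the TRANSFORMS table are all hypothetical. The two queries play the roles of Dq and Sq, and the format fields stand in for the implied constraints (Dc and Ds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    topic: str  # what the dataset query (Dq) selects on
    fmt: str    # stands in for the implied dataset constraints (Dc)


@dataclass(frozen=True)
class Service:
    name: str
    computes: str   # what the service query (Sq) selects on
    input_fmt: str  # stands in for the implied input constraints (Ds)


# Hypothetical one-step conversions: source format -> reachable formats.
TRANSFORMS = {"csv": {"grid"}, "shapefile": {"grid"}}


def fits(ds, svc, allow_transform=True):
    """True if the dataset satisfies the service's input constraint,
    either directly or (optionally) after one known transformation."""
    if ds.fmt == svc.input_fmt:
        return True
    return allow_transform and svc.input_fmt in TRANSFORMS.get(ds.fmt, set())


def match(datasets, services, dq, sq, allow_transform=True):
    """Apply both queries independently, then pair up whatever fits.
    The order in which the two searches run does not affect the result."""
    found_d = [d for d in datasets if dq(d)]
    found_s = [s for s in services if sq(s)]
    return [(d.name, s.name)
            for d in found_d for s in found_s
            if fits(d, s, allow_transform)]


# Made-up example records.
DATASETS = [
    Dataset("plots_a", "biomass", "csv"),
    Dataset("plots_b", "biomass", "grid"),
    Dataset("rain_c", "precipitation", "grid"),
]
SERVICES = [
    Service("npp_from_biomass", "NPP", "grid"),
    Service("runoff_model", "runoff", "grid"),
]


def wants_biomass(d):
    return d.topic == "biomass"


def computes_npp(s):
    return s.computes == "NPP"


# General case: transformations allowed.
print(match(DATASETS, SERVICES, wants_biomass, computes_npp))
# Restricted case: datasets must already match the service inputs.
print(match(DATASETS, SERVICES, wants_biomass, computes_npp,
            allow_transform=False))
```

The restricted call corresponds to the case where datasets must match the service constraints without any transformation, which returns fewer pairings than the general case.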


Deana D. Pennington, PhD
Long-term Ecological Research Network Office

UNM Biology Department
MSC03  2020
1 University of New Mexico
Albuquerque, NM  87131-0001

505-272-7288 (office)
505 272-7080 (fax)
