[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]

Bertram Ludaescher ludaesch at sdsc.edu
Wed Apr 28 09:37:13 PDT 2004


Your examples and vision are very interesting! Hopefully we get a
chance to talk about those in Edinburgh (I'm sure we will ;-)



>>>>> "DP" == Deana Pennington <dpennington at lternet.edu> writes:
DP> Seems like if we have ontologies that represent what is known, and 
DP> explicitly link analyses and data to those ontologies, that we could 
DP> represent a hypothesis in terms of the formalized ontology, then drive 
DP> the data and analysis discovery from that.  This seems like a very 
DP> intuitive approach to me, and doesn't seem like it would be that 
DP> difficult to do, but then, I'm not the one that has to make it happen :)
DP> I'm thinking that, for example, if I generated a hypothesis that 
DP> linked sinkhole occurrence with some environmental variables, and 
DP> formalized that hypothesis in terms of an existing geologic ontology, 
DP> that the geologic ontology would link with statistical and measurement 
DP> ontologies from which it could be reasoned that a logistic model of 
DP> occurrence data (dependent var) and environmental layers (independent) 
DP> would be relevant.  The geologic ontology would link with a stat 
DP> ontology, which would link with an ecological ontology, where the garp 
DP> model could be found, then used in an entirely different domain than the 
DP> one for which it was constructed. 
DP> We talked about this some in our breakout group in Santa Barbara, but 
DP> ontologically-constructed hypothesis generation is something I have been 
DP> thinking quite a bit about. In fact, when I was at NSF in January, I was 
DP> asked what one thing I would recommend to enable synthetic research, and 
DP> I responded that I would force would-be multidisciplinary teams to 
DP> construct ontologies around all of the concepts relevant to their 
DP> question of interest then explicitly show how their proposed research 
DP> was going to extend/clarify the ontologies.  I have also suggested to 
DP> Bob Waide that we go for funding to hold some working meetings with some 
DP> of the LTER groups who are trying to come up with proposals for 
DP> synthetic research, to do exactly that.  I haven't heard back from him, 
DP> but I think this approach might be an opportunity to do some very 
DP> interesting, new research within the kr/sms group.
DP> Deana
DP> Shawn Bowers wrote:
>> Shawn Bowers wrote:
>>> (Note that I moved this thread to kr-sms ... and off of seek-dev)
>>> > Actually, this seems to me to be a fundamental difference in the way
>>> > CIS/IM and domain scientists approach problems.
>>> Even in computer science, research follows the model you describe 
>>> directly below. In addition, I believe that most fields in CS are not 
>>> data-driven -- even in database fields (we don't care about the data, 
>>> we care about the algorithms and their generality).  There are CS 
>>> fields that are more closely related to branches of psychology (e.g., 
>>> human-computer interaction, and natural language processing) that are 
>>> exceptions.  Typically the hypothesis testing in these more 
>>> "touchy-feely" CS fields strongly depends on the experimental data, 
>>> and uses standard techniques to evaluate the hypotheses ... These 
>>> may use available data, or they may require designing new experiments 
>>> to get data.  I would consider these both data-driven.  And, these 
>>> are definitely useful endeavors -- e.g., in the field of medicine.
>> I meant to say, that medicine is another field that I would say is 
>> "data driven" in this way.
>>> The way you characterize data-driven below reminds me of data mining 
>>> -- you have some data and you try to find patterns in the data.  I 
>>> definitely don't advocate this approach in SEEK ... and this isn't 
>>> really what I was suggesting before.
>>> So, I think we agree that pure tool-driven (not sure of an example) 
>>> and data-driven approaches (data mining in the traditional sense) are 
>>> out (I don't think we ever thought they were in, but anyway ...), and 
>>> that users of SEEK technology generally will have a hypothesis in mind 
>>> when they interact with the system.  Given that, is it useful to try to 
>>> capture / represent hypotheses in the system, and if so, how could they 
>>> be exploited and how could they be practically represented?
>>> For example, could workflows be organized based on their 
>>> applicability to certain styles of hypothesis?  Or, as a holy grail, 
>>> you could imagine a scientist entering a hypothesis and the system 
>>> actually trying to organize data and services that could be used to 
>>> test the hypothesis (where the hypothesis is like a query, I 
>>> suppose). For the latter, GEON is actually designing, and has 
>>> designed, many of their test cases and use cases around specific 
>>> hypotheses ... as opposed to the approach in SEEK of focusing test 
>>> cases on a tool (GARP).
>>> Shawn
>>> Deana Pennington wrote:
>>>> Sorry so long to reply...I've been at a conference without e-mail...
>>>> The entire scientific process is designed around testing 
>>>> hypotheses.  You come up with a research question of interest, then 
>>>> create an analysis to test it.  NSF funding (and other funding 
>>>> sources) are completely based on the strength (scientific merit) of 
>>>> the question and how well thought out the proposed methodology is.  
>>>> The idea of integrating data simply to see if anything comes out of 
>>>> it is strongly resisted, as is the idea of tool-driven science.  The 
>>>> general argument is that science should be directed and focused 
>>>> along paths that have been rationally determined.  Occasionally a 
>>>> tool comes along that changes the way we can think about science 
>>>> (like the microscope, for example), and for a short time, some 
>>>> exploratory analysis is funded.  But that is the exception, not the 
>>>> norm.  The synthetic work that is being encouraged may depend on 
>>>> data integration, but it will have to be proposed as a traditional 
>>>> research question to get funded.  It's the difference between saying 
>>>> you want to put climate and hydrology data together over time to 
>>>> look for interesting patterns, and having a focused question that 
>>>> requires data integration to do the analysis (hypothesis: drought in 
>>>> the western US has resulted in reduced evapotranspiration in high 
>>>> elevation forests, which should result in an increase in runoff for 
>>>> a given increase in precipitation).
>>>> Actually, this seems to me to be a fundamental difference in the way 
>>>> CIS/IM and domain scientists approach problems.  I've been having a 
>>>> long-term discussion about this with Samantha.  The RCN classes have 
>>>> presented a data-centric view that works well with information 
>>>> managers, but did not work well with the domain scientists at the 
>>>> new fac/postdoc workshop.  They kept wondering what the 
>>>> goals/objectives were of the information that was presented early in 
>>>> the week (Why are we doing this?).  For the distributed graduate 
>>>> seminar, we have intentionally changed that order around to a 
>>>> research question focus.  We'll see what kind of response we get, 
>>>> but I think it will resonate with them.  Formulating your ideas 
>>>> through knowledge representation, pulling together concepts, 
>>>> creating approaches to workflows...those are early in the seminar, 
>>>> and would occur early in the scientific process, long before a 
>>>> scientist thinks about data models, structures, or metadata.
>>>> Deana
>>>> Bertram Ludaescher wrote:
>>>>> Interesting discussion!
>>>>> Btw: I think of two very different things when I hear data-driven:
>>>>> (a) what you guys say below: I have some data, let's see what I can 
>>>>> do with it, and
>>>>> (b) some scientific workflow that is data-driven (optionally in a very
>>>>> technical sense as in Ptolemy/Kepler)
>>>>> Bertram
>>>>>>>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
>>>>> SB> Comments on one of your comments :-)
>>>>> SB> (Also, I CC'd seek-dev in case anyone else is interested in
>>>>> SB> the thread)
>>>>> SB> Deana Pennington wrote:
>>>>>>>> - In the scenario, interestingly, the researcher first searches 
>>>>>>>> for appropriate workflows and once found, searches for the 
>>>>>>>> data.  It seems like it could go either way: the researcher may 
>>>>>>>> have found roughly the data to support their hypothesis, and 
>>>>>>>> then wants to find the right workflow/analysis to use on the 
>>>>>>>> data.  The latter seems like it has more potential need for data 
>>>>>>>> integration/transformation in that as a researcher looking for 
>>>>>>>> data, you wouldn't be restricted by everything being "uniform" 
>>>>>>>> just so you could plug it into the right model (of course, you 
>>>>>>>> wouldn't necessarily be limited in this way by finding analyses 
>>>>>>>> first, but I think it becomes a much harder problem).  Instead, 
>>>>>>>> you would be looking for "good" data, regardless of whether it 
>>>>>>>> is nicely formatted (which seems to be true for the mammal case 
>>>>>>>> -- I believe that is the motivation for using IPCC data).
>>>>>>> Yes, it could go either way.  However, I think for most 
>>>>>>> scientists, they think of the problem first, then look for data.  
>>>>>>> The order is more likely to be, "I want to compare NPP at 
>>>>>>> grassland sites around the world, and there are 4 different ways 
>>>>>>> I could calculate NPP, and each of those ways requires different 
>>>>>>> types of data".  The "4 different ways" would be expressed as 
>>>>>>> analytical workflows.  It is possible, though, that after 
>>>>>>> framing the question "I want to compare NPP", then they would 
>>>>>>> decide to look and see what data are available before thinking 
>>>>>>> about the appropriate analysis.  In fact, the whole idea of 
>>>>>>> data-driven analyses is a new one in ecology (and science in 
>>>>>>> general), and there are whole groups of people who think it is a 
>>>>>>> completely wrong approach.
>>>>> SB> I take back my original statement that the problem is harder in
>>>>> SB> one direction than in the other.  Basically, our problem is to
>>>>> SB> match datasets with services.  There is a set of constraints we
>>>>> SB> want the datasets to satisfy Dq (the query) and a set of implied
>>>>> SB> constraints in the datasets found Dc (e.g., structural and
>>>>> SB> semantic constraints) by Dq.  Similarly, there is a set of
>>>>> SB> constraints we want the services to satisfy Sq (e.g., that the
>>>>> SB> services/workflow computes NPP), and a set of implied constraints
>>>>> SB> in the services found Ds (structural and semantic constraints on
>>>>> SB> inputs, e.g.) by Sq.  So generally, regardless of whether we
>>>>> SB> search for datasets first or services first, our goal is to
>>>>> SB> figure out a way to transform and group the datasets to make the
>>>>> SB> implied constraints on the datasets fit with the implied
>>>>> SB> constraints on the services.  The problem changes (which is what
>>>>> SB> my original point was trying to say) if we assume that the
>>>>> SB> datasets we look for *must* match (without any transformation)
>>>>> SB> the service constraints, which is the current motivation for
>>>>> SB> choosing the IPCC data (which isn't necessarily a bad thing, it
>>>>> SB> just isn't general, which we already know).
>>>>> SB> Also, for data-driven analysis, what is the argument as to why
>>>>> SB> people say it is the wrong approach?  Aren't there other
>>>>> SB> "scientific" fields, such as medicine or psychology (I consider
>>>>> SB> these scientific, but that probably isn't the general
>>>>> SB> classification), that are very much "data driven" in this way?
>>>>> SB> I am just curious, and am interested in hearing your opinion ...
>>>>> SB> Shawn
>>>>> SB> _______________________________________________
>>>>> SB> seek-dev mailing list
>>>>> SB> seek-dev at ecoinformatics.org
>>>>> SB> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
>>> _______________________________________________
>>> seek-kr-sms mailing list
>>> seek-kr-sms at ecoinformatics.org
>>> http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms
DP> -- 
DP> ********
DP> Deana D. Pennington, PhD
DP> Long-term Ecological Research Network Office
DP> UNM Biology Department
DP> MSC03  2020
DP> 1 University of New Mexico
DP> Albuquerque, NM  87131-0001
DP> 505-272-7288 (office)
DP> 505 272-7080 (fax)
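
The dataset/service matching problem described in the quoted thread (a
dataset query Dq, a service query Sq, and the implied constraints on each
side that have to be made to fit) can be illustrated with a small Python
sketch. Everything in it is an assumption made for illustration only: the
Dataset and Service classes, the TRANSFORMS table, the property names, and
the example records are hypothetical and do not come from SEEK, Kepler, or
GARP. It only shows that the same pairs come out whichever side is searched
first, and that forbidding transformations shrinks the set of usable data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    topics: frozenset      # what the data is about (checked against Dq)
    properties: frozenset  # implied structural/semantic constraints (Dc)


@dataclass(frozen=True)
class Service:
    name: str
    computes: str          # what the service produces (checked against Sq)
    requires: frozenset    # implied constraints on its inputs (Ds)


# Hypothetical transformations: a dataset with the key property can be
# converted so that it also offers the value property.
TRANSFORMS = {"csv": "table", "monthly": "annual"}


def reachable_properties(properties, allow_transform):
    """Properties a dataset can offer, optionally after one transformation step."""
    if not allow_transform:
        return frozenset(properties)
    derived = {TRANSFORMS[p] for p in properties if p in TRANSFORMS}
    return frozenset(properties) | derived


def match(datasets, services, dq, sq, allow_transform=True):
    """Return sorted (dataset, service) name pairs whose constraints fit."""
    pairs = []
    for service in services:
        if service.computes != sq:   # service query Sq
            continue
        for dataset in datasets:
            if dq not in dataset.topics:   # dataset query Dq
                continue
            offered = reachable_properties(dataset.properties, allow_transform)
            if service.requires <= offered:   # implied constraints fit
                pairs.append((dataset.name, service.name))
    return sorted(pairs)


# Made-up example records.
datasets = [
    Dataset("site_a_biomass", frozenset({"biomass"}), frozenset({"csv", "monthly"})),
    Dataset("site_b_biomass", frozenset({"biomass"}), frozenset({"table", "annual"})),
]
services = [
    Service("npp_from_biomass", "NPP", frozenset({"table", "annual"})),
]

with_transform = match(datasets, services, "biomass", "NPP")
without_transform = match(datasets, services, "biomass", "NPP", allow_transform=False)
print(with_transform)
print(without_transform)
```

With transformations allowed, both hypothetical datasets pair with the
service; with allow_transform=False only the dataset that already satisfies
the service's input constraints remains, which mirrors the restricted case
discussed in the thread.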
