[seek-dev] Re: [Fwd: [seek-kr-sms] Taxon/KR integration prototype proposal]
Shawn Bowers
bowers at sdsc.edu
Tue Apr 27 15:50:30 PDT 2004
Shawn Bowers wrote:
>
> (Note that I moved this thread to kr-sms ... and off of seek-dev)
>
> > Actually, this seems to me to be a fundamental difference in the way
> > CIS/IM and domain scientists approach problems.
>
> Even in computer science, research follows the model you describe
> directly below. In addition, I believe that most fields in CS are not
> data-driven -- even the database fields (we don't care about the data,
> we care about the algorithms and their generality). There are CS fields
> more closely related to branches of psychology (e.g., human-computer
> interaction and natural language processing) that are exceptions.
> Typically, hypothesis testing in these more "touchy-feely" CS fields
> strongly depends on the experimental data and uses standard techniques
> to evaluate the hypotheses ... These may use available data, or they
> may require designing new experiments to get data. I would consider
> both data-driven. And these are definitely useful endeavors -- e.g.,
> in the field of medicine.
I meant to say that medicine is another field I would describe as
"data-driven" in this way.
>
> The way you characterize data-driven below reminds me of data mining --
> you have some data and you try to find patterns in the data. I
> definitely don't advocate this approach in SEEK ... and this isn't
> really what I was suggesting before.
>
> So, I think we agree that pure tool-driven approaches (not sure of an
> example) and data-driven approaches (data mining in the traditional
> sense) are out (I don't think we ever thought they were in, but anyway
> ...). Given that users of SEEK technology will generally have a
> hypothesis in mind when they interact with the system: is it useful to
> try to capture/represent hypotheses in the system, and if so, how
> could they be exploited and how could they be practically represented?
>
> For example, could workflows be organized based on their applicability
> to certain styles of hypothesis? Or, as a holy grail, you could imagine
> a scientist entering a hypothesis and the system actually trying to
> organize data and services that could be used to test the hypothesis
> (where the hypothesis is like a query, I suppose). For the latter,
> GEON has designed, and is designing, many of its test cases and use
> cases around specific hypotheses ... as opposed to the approach in
> SEEK of focusing test cases on a tool (GARP).
>
> Shawn
>
>
>
> Deana Pennington wrote:
>
>> Sorry it's taken so long to reply...I've been at a conference without e-mail...
>>
>> The entire scientific process is designed around testing hypotheses.
>> You come up with a research question of interest, then create an
>> analysis to test it. NSF funding (and funding from other sources) is
>> based entirely on the strength (scientific merit) of the question
>> and on how well thought out the proposed methodology is. The idea of
>> integrating data simply to see if anything comes out of it is strongly
>> resisted, as is the idea of tool-driven science. The general argument
>> is that science should be directed and focused along paths that have
>> been rationally determined. Occasionally a tool comes along that
>> changes the way we can think about science (like the microscope, for
>> example), and for a short time, some exploratory analysis is funded.
>> But that is the exception, not the norm. The synthetic work that is
>> being encouraged may depend on data integration, but it will have to
>> be proposed as a traditional research question to get funded. It's the
>> difference between saying you want to put climate and hydrology data
>> together over time to look for interesting patterns, and having a
>> focused question that requires data integration to do the analysis
>> (hypothesis: drought in the western US has resulted in reduced
>> evapotranspiration in high elevation forests, which should result in an
>> increase in runoff for a given increase in precipitation).
>>
>> Actually, this seems to me to be a fundamental difference in the way
>> CIS/IM and domain scientists approach problems. I've been having a
>> long-term discussion about this with Samantha. The RCN classes have
>> presented a data-centric view that works well with information
>> managers, but did not work well with the domain scientists at the new
>> fac/postdoc workshop. They kept wondering what the goals/objectives
>> were of the information that was presented early in the week (Why are
>> we doing this?). For the distributed graduate seminar, we have
>> intentionally changed that order around to a research question focus.
>> We'll see what kind of response we get, but I think it will resonate
>> with them. Formulating your ideas through knowledge representation,
>> pulling together concepts, creating approaches to workflows...those
>> are early in the seminar, and would occur early in the scientific
>> process, long before a scientist thinks about data models,
>> structures, or metadata.
>>
>> Deana
>>
>>
>> Bertram Ludaescher wrote:
>>
>>> Interesting discussion!
>>>
>>> Btw: I think of two very different things when I hear data-driven:
>>>
>>> (a) what you guys say below: I have some data, let's see what I can
>>> do with it, and
>>> (b) some scientific workflow that is data-driven (optionally in a very
>>> technical sense as in Ptolemy/Kepler)
>>>
>>> Bertram
>>>
>>>
>>>
>>>
>>>
>>>>>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
>>>>>>>>
>>>
>>>
>>> SB> Comments on one of your comments :-)
>>> SB>
>>> SB> (Also, I CC'd seek-dev in case anyone else is interested in the
>>> SB> thread)
>>> SB>
>>> SB> Deana Pennington wrote:
>>> SB>
>>>
>>>>>> - In the scenario, interestingly, the researcher first searches
>>>>>> for appropriate workflows and once found, searches for the data.
>>>>>> It seems like it could go either way: the researcher may have
>>>>>> found roughly the data to support their hypothesis, and then wants
>>>>>> to find the right workflow/analysis to use on the data. The
>>>>>> latter seems like it has more potential need for data
>>>>>> integration/transformation in that as a researcher looking for
>>>>>> data, you wouldn't be restricted by everything being "uniform"
>>>>>> just so you could plug it into the right model (of course, you
>>>>>> wouldn't necessarily be limited in this way by finding analyses
>>>>>> first, but I think it becomes a much harder problem). Instead,
>>>>>> you would be looking for "good" data, regardless of whether it is
>>>>>> nicely formatted (which seems to be true for the mammal case -- I
>>>>>> believe that is the motivation for using IPCC data).
>>>>>>
>>>>>
>>>>>
>>>>> Yes, it could go either way. However, I think for most scientists,
>>>>> they think of the problem first, then look for data. The order is
>>>>> more likely to be, "I want to compare NPP at grassland sites around
>>>>> the world, and there are 4 different ways I could calculate NPP,
>>>>> and each of those ways requires different types of data". The "4
>>>>> different ways" would be expressed as analytical workflows. It is
>>>>> possible, though, that after framing the question "I want to
>>>>> compare NPP", then they would decide to look and see what data are
>>>>> available before thinking about the appropriate analysis. In fact,
>>>>> the whole idea of data-driven analyses is a new one in ecology (and
>>>>> science in general), and there are whole groups of people who think
>>>>> it is a completely wrong approach.
>>>>>
>>>>>
>>>
>>>
>>> SB> I take back my original statement that the problem is harder in
>>> SB> one direction than in the other. Basically, our problem is to
>>> SB> match datasets with services. There is a set of constraints we
>>> SB> want the datasets to satisfy, Dq (the query), and a set of
>>> SB> implied constraints, Dc (e.g., structural and semantic
>>> SB> constraints), in the datasets found by Dq. Similarly, there is a
>>> SB> set of constraints we want the services to satisfy, Sq (e.g.,
>>> SB> that the services/workflow computes NPP), and a set of implied
>>> SB> constraints, Ds (e.g., structural and semantic constraints on
>>> SB> inputs), in the services found by Sq. So generally, regardless
>>> SB> of whether we search for datasets first or services first, our
>>> SB> goal is to figure out a way to transform and group the datasets
>>> SB> so that the implied constraints on the datasets fit the implied
>>> SB> constraints on the services. The problem changes (which is what
>>> SB> my original point was trying to say) if we assume that the
>>> SB> datasets we look for *must* match the service constraints
>>> SB> without any transformation, which is the current motivation for
>>> SB> choosing the IPCC data (which isn't necessarily a bad thing, it
>>> SB> just isn't general, which we already know).
>>> SB>
>>> SB> Also, for data-driven analysis, what is the argument as to why
>>> SB> people say it is the wrong approach? Aren't there other
>>> SB> "scientific" fields, such as medicine or psychology (I consider
>>> SB> these scientific, but that probably isn't the general
>>> SB> classification), that are very much "data-driven" in this way?
>>> SB> I am just curious, and am interested in hearing your opinion ...
>>> SB>
>>> SB> Shawn
>>> SB>
>>> SB> _______________________________________________
>>> SB> seek-dev mailing list
>>> SB> seek-dev at ecoinformatics.org
>>> SB> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
>>>
>>>
>>
>
> _______________________________________________
> seek-kr-sms mailing list
> seek-kr-sms at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms