[seek-dev] Re: update

Thu Aug 19 09:29:36 PDT 2004

Hi Dave and Susan,

I agree that 'data registration' is an important issue that we have not 
managed to fully address yet.  For ecological data, we currently 
'register' data by providing an EML description of the data, which 
includes a statement about the taxonomic species covered by the data. 
That field is optional, and few people have filled it out yet.  The 
major exception is the 181 data sets in EcoGrid from the GCE LTER site, 
each of which has extensive taxonomic information in the EML.  We're 
hoping that more sites will make their metadata comprehensive.  In the 
absence of that, or in addition to it, Dave's approach using taxonomic 
guesswork will be extremely valuable.  We also need sites to be more 
comprehensive with metadata other than taxon metadata, such as spatial 
and temporal coverage, and entity and attribute descriptions.  The 
spatial and temporal metadata may be amenable to a guessing approach 
similar to the one Dave is taking for taxa, but other parts of the 
metadata (e.g., attribute descriptions) will not be.  For those we need 
the scientists to provide the metadata.
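
To make the guessing approach concrete, here is a minimal sketch in 
Python of the kind of lookup Dave describes in his message below: pull 
out anything that looks like a word and check it against a set of known 
taxonomic names.  This is not Dave's perl script.  The function name 
extract_taxa, the tiny name sets, and the sample data row are all 
hypothetical stand-ins for illustration.

```python
import re

# Anything that "looks like a word": a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def extract_taxa(text, names):
    """Return the sorted, lowercased tokens in `text` that occur in `names`."""
    tokens = {m.group(0).lower() for m in WORD.finditer(text)}
    return sorted(tokens & names)


# Hypothetical stand-ins for the real name database (about a million
# entries).  The "noisy" set adds one junk entry to mimic a noisy source.
clean_names = {"uca", "littoraria", "aves"}
noisy_names = clean_names | {"sample"}

# Made-up data row, not taken from any real data set.
row = "2003-07-14,Uca pugnax,plot 3,sample 12,count 41"

print(extract_taxa(row, clean_names))  # ['uca']
print(extract_taxa(row, noisy_names))  # ['sample', 'uca']
```

The second call shows why a noisier name list yields more hits, not all 
of them real taxa: the junk entry matches an ordinary column label.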

A further level of metadata annotation about a data set would be 
'semantic registration' of the data.  This includes associating the data 
and components of the data such as attributes with semantically precise 
terms from our budding measurement ontology.  Shawn and Bertram and the 
rest of the SMS/KR group are working out how to create and reason over 
these semantic registrations.  Their system would benefit from an index 
just as you propose for the taxon concept annotations.  I actually view 
these things as closely related, almost identical in terms of their 
needs for an indexing and aggregation service.  So EcoGrid seems like 
the appropriate place to handle it.  One hard part will be figuring out 
how to keep aggregated, central indices up-to-date when the underlying 
data are widely distributed and change over time.
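
As a starting point for that discussion, here is a rough sketch in 
Python of an index that maps taxonomic concept -> data set and handles 
the add, modify, and delete cases Susan lists below.  It is only an 
in-memory illustration under my own assumptions, not a proposed EcoGrid 
design.  The class name TaxonIndex, its method names, and the taxa 
assigned to each data set are hypothetical.

```python
from collections import defaultdict


class TaxonIndex:
    """In-memory map from taxon concept id to the data sets that mention it."""

    def __init__(self):
        self._by_taxon = defaultdict(set)  # taxon id -> set of data set ids
        self._by_dataset = {}              # data set id -> set of taxon ids

    def register(self, dataset_id, taxa):
        """Add a data set, or replace its entry if it was already indexed."""
        self.unregister(dataset_id)
        taxa = set(taxa)
        self._by_dataset[dataset_id] = taxa
        for taxon in taxa:
            self._by_taxon[taxon].add(dataset_id)

    def unregister(self, dataset_id):
        """Remove a data set and drop any taxon entries left empty."""
        for taxon in self._by_dataset.pop(dataset_id, set()):
            self._by_taxon[taxon].discard(dataset_id)
            if not self._by_taxon[taxon]:
                del self._by_taxon[taxon]

    def datasets_for(self, taxon):
        """Return the sorted data set ids indexed under `taxon`."""
        return sorted(self._by_taxon.get(taxon, ()))


idx = TaxonIndex()
idx.register("CrabPopulations_GCE", ["uca"])              # new data set
idx.register("BirdCensus_CAP", ["aves"])
idx.register("CrabPopulations_GCE", ["uca", "littoraria"])  # new data added
idx.unregister("BirdCensus_CAP")                          # data set deleted

print(idx.datasets_for("littoraria"))  # ['CrabPopulations_GCE']
print(idx.datasets_for("aves"))        # []
```

Re-registering a data set covers both "new data in an indexed data set" 
and "modify".  Adding a new taxonomic concept is not covered here, since 
it would mean re-running extraction over data sets already indexed.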

I think we need to talk about this further.  Thanks for kicking off the 
discussion.  I'm cc'ing seek-dev so that other relevant people can 
participate in the conversation.

Matt

sgauch at ittc.ku.edu wrote:
> Dave,
> 
> That's great.  I think that the annotation of the data sets is an
> important component that no one is working on.  Even if we PERFECTLY
> resolve the user queries with our new system (oh yes, I can dream), it
> only works if we can then find data sets that relate to those organisms.
> This problem is nice and big, so it can keep you very busy, but it is
> quite independent, so we won't trip over shared modules.
> 
> Good progress.  I think that having a basic system that does little
> guessing is a good start.  There are 2 potential next steps - working out
> an architecture with the EcoGrid folks and Matt about how data
> registration will work and where the index (yes index!) that maps from
> taxonomic concept -> data set will be stored and how it will be updated
> when
>    - we add a new data set
>    - we add a new taxonomic concept
>    - we add new data in a previously "indexed" dataset
>    - consider modify/delete of all 3 above
> 
> This may be a bit much to design on your own (who makes the decision? who
> implements what pieces of it?).  I picture an explicit data registration
> module initially that fires up your extraction piece and adds the data set
> to the index.  No updates from anyone.  Then, we can work on harder
> things.
> 
> It may be easiest to work on a "Data Set Registration" tool by which
> people fill in metadata, your extraction runs, then it is enhanced to make
> "guesses" and get user confirmation.
> 
> Susan.
> 
> 
> 
>>Howdy!
>>
>>Here's an update of what I've been doing lately.  The LSID stuff is
>>installed, but final touches await word from Rob that he's done with the
>>big changes he's been making.  I think he's either done, or close.  I'll
>>ask him.
>>
>>Meanwhile, picking up on a suggestion of Susan's, and following up on the
>>work I've been doing with Shawn, I'm looking at how much taxonomic
>>information can be extracted automatically from EML and datasets.  My
>>notion is that we could use some automatic extraction to help index or
>>annotate datasets for later discovery.
>>
>>I've spent a couple of hours working on a perl script which goes through a
>>bunch of datasets and EML markups I received from Shawn
>>(ArthropodDensities_AND, BirdCensus_CAP, CrabPopulations_GCE, and several
>>others).  The script, which is ignorant of EML, does a pretty good job
>>picking out taxonomic names.  It picks out many more from the datasets
>>than it does from the associated EML.  The numbers are like this:
>>
>>clean version
>>from datasets: 319 taxonomic terms
>>from eml: 86 taxonomic terms
>>
>>noisy version
>>from datasets: 392 taxonomic terms
>>from eml: 141 taxonomic terms
>>
>>The script basically pulls out anything that looks like a word and runs it
>>against a database I have of around a million taxonomic groups.  The
>>database was compiled from about 10 sources about two years ago.  One of
>>the sources is NCBI and that data is pretty noisy - all sorts of garbage
>>taxonomic names, but also some names which aren't in the other sources.
>>So, the noisy version includes the NCBI data, and the clean version does
>>not.
>>
>>I have two conclusions at this point:
>>
>>1. it's not hard to automatically extract some taxonomic information from
>>data sets
>>
>>2. the EML I have is pretty sparse in terms of taxonomic markup
>>
>>And a few observations:
>>
>>* I haven't checked to see what % of taxonomic terms are extracted -
>>that's an obvious next step - which I'm doing now.
>>
>>* The script is dumb - it does very little guesswork.  There are a number
>>of tricks I can do to increase the % extracted (whatever it is).
>>
>>* The EML and the datasets are chock full of common names.  The database I
>>have has common names from Species 2000, so I can leverage those too.  I
>>believe that the current notion is that SEEK isn't going to worry about
>>common names in the short term.  However, for data set indexing, it might
>>be critical.
>>
>>And that's the update.  Shawn's on vacation this week, but next week we're
>>going to start talking about where to go next with our data discovery
>>stuff.  I think this kind of data extraction will be an important
>>component of any data set annotation tool we might build - at least to
>>give users a heads up ("I see you have these taxonomic groups.... are
>>there others you'd like to mention in your metadata?")
>>
>>Dave
>>
>>p.s. attached is a clean run of the datasets and eml combined.

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------