[obs] flexible database schemas

Law, Jason Jason.Law at portlandoregon.gov
Fri May 6 13:27:11 PDT 2011


Thanks for the comments.  

I think most of your questions can be addressed by saying that the fields are not really there, just the tables.  I've been working on this project mainly through a Python proof-of-concept toy application, so the ERD I sent was hastily thrown together for explanatory purposes.

* I'm thinking that, for our purposes, the domain feature would be stored as references to other enterprise GIS layers (e.g., for a sampling station on a stream, we would encode the actual reference to a reach on our stream layer).  For cases where the domain feature doesn't have a clear representation in our information systems, I'd like to use something like a reference to an ontology (e.g., SWEET's GeoMagneticField).

* Sample is Specimen.  I'm trying to stay consistent with the data systems already in place here, since this is meant as a proof of concept to show my peers that we can do better than what we're currently doing.

* I struggled with the observation and result structure for quite a long time, mainly with how to create a relatively simple relational model using the coverage observations.  Finally, I decided that for my current purposes I would stick to a simple model where observation and result are one table, and deal with the replication that entails for some result types (e.g., repeating the phenomenon time for every analyte in an analytical chemistry result).  Result-time and phenomenon-time are missing only because this ERD was a rush job to show my thoughts on the overall design.  They are in my proof-of-concept application.
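
As a rough illustration of the single-table approach in the last point, here is a minimal sketch using Python's built-in sqlite3 module.  It is not the schema from the proof-of-concept application: the table name, column names, and sample values are all assumptions made up for the example.  It only shows how one phenomenon time ends up repeated on every analyte row.

```python
import sqlite3

# Hypothetical single-table layout: each row is both the observation and its result.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observation (
        observation_id    INTEGER PRIMARY KEY,
        sampling_feature  TEXT NOT NULL,  -- e.g. the specimen that was analysed
        observed_property TEXT NOT NULL,  -- e.g. the analyte
        phenomenon_time   TEXT NOT NULL,  -- when the result applies in the world
        result_time       TEXT NOT NULL,  -- when the result became available
        result_value      REAL,
        result_unit       TEXT
    )
""")

# One chemistry analysis of one specimen yields one row per analyte, so the
# phenomenon time and result time are repeated on every row (made-up values).
rows = [
    ("specimen-001", "copper", "2011-05-02T09:30", "2011-05-04T14:00", 3.1, "ug/L"),
    ("specimen-001", "zinc",   "2011-05-02T09:30", "2011-05-04T14:00", 12.4, "ug/L"),
    ("specimen-001", "lead",   "2011-05-02T09:30", "2011-05-04T14:00", 0.8, "ug/L"),
]
conn.executemany(
    "INSERT INTO observation (sampling_feature, observed_property, "
    "phenomenon_time, result_time, result_value, result_unit) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# Count all rows versus distinct phenomenon times to make the replication visible.
row_count, distinct_times = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT phenomenon_time) FROM observation"
).fetchone()
print(row_count, distinct_times)
```

The trade-off is the one described above: queries stay simple because there is nothing to join, at the cost of storing the same times once per analyte.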

Do other folks see Specimen as the best way to model biological data rather than a separate entity?  I modeled it separately because there are lots of ecological data where an organism is measured in-situ.  Would these be samplingFeatures as opposed to Specimens?  Organisms seem to be noticeably absent from the examples that O&M gives for sampling features.

Thanks again for the help,

Jason Law

-----Original Message-----
From: Simon.Cox at csiro.au [mailto:Simon.Cox at csiro.au] 
Sent: Thursday, May 05, 2011 9:50 PM
To: Law, Jason; obs at ecoinformatics.org
Subject: RE: flexible database schemas

Jason - 

This is interesting work. Good to see. 

Looking through my O&M spectacles, I have to ask:
* where is the domain feature? (i.e. the sampled-feature)
* is your 'sample' intended to be equivalent to the O&M Specimen class? 
* it looks like you have focussed on what O&M calls the 'result' instead of the observation event. But as a consequence, I can't see where the key temporal properties of the observation act are found - result-time, phenomenon-time. Are they unimportant in your applications? Note that separation of these times is the key to having a system that can report on forecasts as well as estimates of phenomena from the deep past. 

Simon Cox

-----Original Message-----
From: obs-bounces at ecoinformatics.org [mailto:obs-bounces at ecoinformatics.org] On Behalf Of Law, Jason
Sent: Friday, 6 May 2011 4:11 AM
To: 'obs at ecoinformatics.org'
Subject: [obs] flexible database schemas

Hello everyone,

I'm going to apologize immediately for the long e-mail. Thanks to anyone who has the patience to read and offer an opinion.

I'm in the midst of modeling a database schema for environmental and ecological observation data.  As an organization, the city I work for has run up against the common problem of an inflexible data management system specifically designed for one type of observational data.  New programs and the new methods and data that they generate have fallen outside this enterprise system and end up being managed by individuals and small groups all over our agency.  We collect a wide variety of environmental information: hydrology, weather (mostly rain), geological (boreholes, soil cores), avian point counts, stream macroinvertebrate data, habitat (from simple surveys to EPA EMAP stream habitat protocols), and a large amount of analytical chemistry data (river sediment, soil samples, water quality).  As someone who is trying to integrate data from many sources, I'm necessarily trying to come up with a better solution.

I've tried to do as much research as possible into how others have solved the problem.  I've looked at other government agencies with similar data (USGS and US EPA), commercial systems, and things like O&M, OBOE, etc.  I think combining ideas from O&M with concrete ideas from actual database schemas like ODM version 2 (AKA EnviroDB) might be a good route.  For example, the translation layer and collections layer from EnviroDB seem like great ways to approach translating data from multiple sources.  However, the concrete database systems I've looked at always seem to fall short of encompassing the wide range of data sources that we have.  For example, putting biological data like toxicity data into EnviroDB seems like a stretch unless you define variable names like "P. promelas Survival LC50".  Doing things like that is basically where we are.

I've tried to combine what I see as good ideas from a bunch of places and have attached an ERD of my ideas.  The fields are dummied in and are by no means complete.  Two submodels don't show any relationships (Activity and Controlled Vocabulary) because they would be connected in a bunch of other places.  The core features are to use the O&M sampling feature model to model observed real-world entities as subclasses of a general sampling feature entity and to allow arbitrary relationships between these entities.  This would allow us to do things like separate the 'P. promelas' entity whose survival we've observed from the physical sample that was used for the toxicity test.  My 'Activity' is essentially a combination of SF_Process and OM_Process from O&M and is supposed to represent activities done by people (making measurements, collecting samples, performing a point count, etc).
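
To make the core idea concrete, here is a minimal sketch of a general sampling feature table plus a table of arbitrary relationships between features, again using Python's sqlite3 module.  It is not taken from the attached ERD: the table names, the feature types, and the relation names ("collected_at", "exposed_to") are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- General entity; feature_type stands in for the subclass (hypothetical names).
    CREATE TABLE sampling_feature (
        feature_id   INTEGER PRIMARY KEY,
        feature_type TEXT NOT NULL,
        label        TEXT NOT NULL
    );
    -- Arbitrary, named relationships between any two sampling features.
    CREATE TABLE feature_relationship (
        subject_id INTEGER NOT NULL REFERENCES sampling_feature(feature_id),
        relation   TEXT NOT NULL,
        object_id  INTEGER NOT NULL REFERENCES sampling_feature(feature_id),
        PRIMARY KEY (subject_id, relation, object_id)
    );
""")

conn.executemany(
    "INSERT INTO sampling_feature VALUES (?, ?, ?)",
    [
        (1, "station", "Stream station A"),
        (2, "specimen", "Water sample collected at station A"),
        (3, "organism", "P. promelas test population"),
    ],
)
# The organism is kept separate from the physical sample it was tested with.
conn.executemany(
    "INSERT INTO feature_relationship VALUES (?, ?, ?)",
    [
        (2, "collected_at", 1),
        (3, "exposed_to", 2),
    ],
)

def related(conn, feature_id):
    """Return (relation, label) pairs for features that feature_id points to."""
    return conn.execute(
        "SELECT r.relation, o.label "
        "FROM feature_relationship r "
        "JOIN sampling_feature o ON o.feature_id = r.object_id "
        "WHERE r.subject_id = ? "
        "ORDER BY r.relation",
        (feature_id,),
    ).fetchall()

organism_links = related(conn, 3)
print(organism_links)
```

With this layout, a survival observation could point at the organism feature (id 3) while chemistry observations point at the specimen (id 2), and the relationship table records how the two are connected.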

My questions for folks are:

In general, do I seem to be on the right track?

Is there an existing example of the type of system I'm envisioning?

Is there a better way to integrate biological data into an O&M like schema?

I'm trying to cram a wide variety of data into a single system because a lot of our projects are pretty multidisciplinary and our limited resources mean that we can probably only get enough IT support to create a single new system.  I'm mostly trying to come up with a plan so that I can guide our IT folks into coming up with the right solution (getting the best people to do the work, making the specifications, etc).  Any thoughts, pointers to other resources, or opportunities for collaboration with other cash-strapped organizations are greatly appreciated.

Thanks again,

Jason Law
Statistician
City of Portland, Bureau of Environmental Services Water Pollution Control Laboratory
6543 N Burlington Avenue
Portland, OR 97203-5452
503-823-1038
jason.law at portlandoregon.gov

