Measurement scale in EML

Mon Feb 21 08:03:14 PST 2005

This exchange served to fire up the IMexec meeting last week so we all
owe John thanks (blame?) for launching it. It gave us a venue to vent
similar frustrations many of us have had and inspired us to ask
ourselves - if semantic ontologies (as opposed to gross statistical
categories) are required to really make use of a data variable, then why
not investigate whether we can build an LTER library of variable
descriptions in OWL? Again, the STORET code system was cited as a
precedent for how we could use these. We defined a working group task
which (assuming it draws interest from IM) will attempt to inventory the
potential for generating a limited set of ontologies for a suite of
commonly occurring measures across LTER sites. This is potentially a
significant leverage to SEEK which, up to now at least, has focused on
ontologies that map ecological knowledge at a very high level with
little thought yet to how these will link to actual online datasets. A
few of us spent some time after hours looking at GEON's use of
ontologies and felt that a community process similar to that which has
generated our unit definitions could begin to generate some variable
definitions that can be shared and linked to EML as external domain
definitions. 

That said, we still have a long way to go to figure out how we will
start, so votes of interest and suggestions from the floor are welcome!

On Fri, 2005-02-18 at 16:07 -0900, Matt Jones wrote:
> Dear John,
> 
> Sorry for the delayed response.  I found myself nodding in agreement as 
> I read through Peter's response.  As Peter mentioned, we did extensively 
> debate the issue you raised and Wilkinson's papers specifically during 
> the development of EML 2 drafts (you can read the emails on the list 
> archives), so I wanted to clarify some of our motivation.  We all agree 
> there are many measurement scale classifications, that the Stevens 
> typology has its limitations, and that the Stevens typology does not 
> truly distinguish what one *can* do with the data in an analysis.
> 
> That said, we had a very specific and pragmatic reason for including the 
> typology in EML: it is the simplest well-known classification that 
> allowed us to distinguish the types of additional information that 
> should be associated with each attribute.  So, for example, we felt it 
> was inappropriate to provide a 'unit' for a categorical variable (units 
> and other properties only apply to quantities).  We also wanted to be 
> able to distinguish when one should provide the definitions for 
> categories (such as "P = phosphorus treatment, Z = no treatment"), which 
> are inappropriate for numerical quantities.  The list of properties for 
> each attribute was important to get from the original data provider, so 
> we wanted to make some of the fields required (e.g., we wanted unit to 
> be required for numerical quantities).  These desires led us to the need 
> to classify attributes into groups that allowed us to apply the right 
> set of requirements on the additional information (such as units). 
> Stevens typology was standard, broadly understood, and had about the 
> right level of granularity.  With the addition of datetime, which is an 
> entirely other can of worms that we debated ad nauseam, we thought the 
> typology would serve us nicely in distinguishing which information 
> should be provided.  But to be clear, that we ask the metadata provider 
> to classify attributes and provide the different additional information 
> depending on the measurementScale in no way limits the analysis of the 
> data by a later user.  People can analyze the data any way they see fit 
> -- we just want to make sure that if the original collector thought that 
> a particular variable had unit 'meter' that this information was 
> recorded correctly.
> 
> With more experience, I now think we could probably have done with one 
> fewer measurement scales.  In particular, the difference between 
> interval and ratio is subtle, and the domain specification for them is 
> the same -- they both are numeric quantities with units and precision, 
> etc.  This seems to be where almost all of the confusion over 
> measurement scale arises.  At this point, we could probably get away 
> with 4 measurement scales:
>    * nominal
>      -- information provided defines meaning of categories
>    * ordinal
>      -- information provided defines meaning of categories and
>         order of values
>    * numeric (includes interval and ratio)
>      -- information provided defines numeric domain, precision, unit, etc
>    * datetime
>      -- information provided defines format and date domain
> 
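The partitioning described above (each scale decides which additional information must, or must not, accompany an attribute) can be sketched roughly as follows. This is only an illustration, not EML's actual schema logic: the `RULES` table, the field names, and `check_attribute` are all hypothetical, and the choice of which fields count as forbidden is an assumption drawn from the four-scale list.

```python
# Hypothetical rules table: which extra fields each scale requires or forbids.
# Field names are illustrative and do not mirror the EML schema.
RULES = {
    "nominal":  {"required": {"categories"},
                 "forbidden": {"unit", "precision"}},
    "ordinal":  {"required": {"categories", "order"},
                 "forbidden": {"unit", "precision"}},
    "numeric":  {"required": {"unit"},
                 "forbidden": {"categories", "order"}},
    "datetime": {"required": {"format"},
                 "forbidden": {"unit", "categories", "order"}},
}


def check_attribute(name, scale, info):
    """Return a list of problems found in one attribute description."""
    if scale not in RULES:
        return [f"{name}: unknown scale '{scale}'"]
    rule = RULES[scale]
    present = set(info)
    problems = []
    # Fields the scale demands but the provider left out.
    for field in sorted(rule["required"] - present):
        problems.append(f"{name}: '{field}' is required for {scale}")
    # Fields that make no sense for this scale (e.g. a unit on a name).
    for field in sorted(rule["forbidden"] & present):
        problems.append(f"{name}: '{field}' is not allowed for {scale}")
    return problems


print(check_attribute("body_length", "numeric", {"unit": "meter"}))
print(check_attribute("species_name", "nominal", {"unit": "meter"}))
```

Under these assumed rules, body_length passes with only a unit, while species_name is flagged twice: once for the missing category definitions and once for the unit it should not carry.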
> That set of 4, however, is an entirely non-standard typology, so a 
> little harder to defend, but it would suffice for partitioning up our 
> attributes into the groups needed to set the domains and other 
> information properly.  I think there is something to be gained by 
> sticking with the well-known Stevens typology, despite its known 
> shortcomings.  Nevertheless, we could open this issue up again for 
> another revision of EML.  I'm not sure it's worth the trouble of the 
> change (there would be backwards compatibility issues), but we can 
> certainly discuss it on eml-dev.  What I probably wouldn't go along with 
> is eliminating measurementScale altogether because I think it is needed to 
> properly segregate metadata about numeric quantities from metadata 
> needed for nominals, ordinals, and dates.  Do you have some other 
> mechanism than classifying into measurementScales that would allow EML 
> to require the 'unit' field for certain variables (e.g., body_length) 
> but not allow 'unit' to be filled in for others (e.g., species_name)?
> 
> I recommend that you send any future emails to 
> eml-dev at ecoinformatics.org, as that is the list with the appropriate EML 
> expertise to address questions such as these. Thanks a ton for your 
> thoughtful comments. I'll watch for further discussions on eml-dev.
> 
> Matt
> 
> Peter McCartney wrote:
> > John, thanks for your input on what I think is an important issue in the
> > future development of EML. 
> > 
> > We actually did debate this issue a lot during the final months of EML 2
> > (even enduring Leland Wilkinson's usual treatises on why most of us are
> > doing statistics wrong). We went ahead with it anyway because some
> > compelling arguments were made that, right or wrong, the information
> > could empower us to do some useful things. Unfortunately, the design of
> > the complex type requires that it be provided in order to convey other
> > information like precision, enumerations, or units so that we could
> > define rules for when such information was appropriate to include.
> > 
> > I'm not second-guessing anything we did in EML, but with the benefit of
> > time, we can see two facts: first, the applications that really do
> > anything for us using the measurement scale information are, I think,
> > either experimental, trivial, or non-existent.  Other than being able
> > to validate one's EML or create a semi-intelligent editor, there is as yet
> > no compelling research-related reward that is realized by those who put
> > all this information in to EML.
> > 
> > Second, through the work under SEEK and GEON, we now know a great deal
> > more about ontologies than we did when we wrote EML 2. Looking at the
> > demo apps GEON has made (GEONgrid.org) for example, it is obvious that
> > while knowing a variable is nominal or ratio can help you make a better
> > "guess" at how to use it, linking it to an ontology can let you
> > explicitly inform an app on exactly what you can and should do with the
> > data. Not only can ontologies provide all the domain, units, and other
> > information, they can be linked to explicit application metadata such as
> > really useful display rules in this example. 
> > 
> > So given that ontologies give us much more info for making automated
> > processing decisions, we should be asking 1) does measurement scale (or
> > even the slightly richer but still very abstract scale of Velleman and
> > Wilkinson) still give us any useful info once a field has been mapped to
> > an ontology? and 2) will the measurement scale construct help in future
> > links between EML and an ontology knowledge base for ecology or will it
> > actually confuse things as John is suggesting? If I know that a column
> > is a personal name, or an SCS classification, or a measurement of tree
> > diameter, I automatically know the answers to most of what we are
> > putting into EML anyway, except that now I could actually make use of
> > that info. 
> > 
> > So perhaps rather than thrashing further with measurement scale or
> > considering alternate systems like V&W (that I would contend are also
> > limited in how much they really tell me about the data), we should start
> > doing some investigation as to how many shared measurement domains we
> > really have in LTER and how feasible would it be to start working in
> > that direction, using them as external libraries similar to how we
> > handle projection systems and units now. We may find, if we did a survey,
> > that a large number of datasets could be accommodated with a smaller
> > number of explicit ontologies (the 80/20 rule). 
> > 
> > If they could, the relative effort to define and publish a limited
> > number of (for example) shared climate measurement ontologies that we
> > can reference in our metadata could be less than what we are doing now
> > by filling many precip datasets with lots of redundant information on everything
> > except what is really being measured. A good example of this is the
> > STORET code list commonly used by USGS and other water quality
> > databases. By referencing a single 5-digit code, one has access to all
> > the methods, domain, units, etc, without putting any of that in the
> > metadata for each individual dataset.
> > 
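A minimal sketch of the "external library" idea above: a dataset's metadata carries only a short code, and the units, domain, and method come from one shared definition. Everything here is hypothetical, including the `SHARED_DEFINITIONS` registry, the field names, and the placeholder code "00000", which is not a real STORET entry.

```python
# Hypothetical shared library of variable definitions, keyed by a short code.
# "00000" is a placeholder, not a real STORET code.
SHARED_DEFINITIONS = {
    "00000": {
        "name": "example precipitation amount",
        "unit": "millimeter",
        "domain": (0.0, None),  # lower bound 0, no upper bound
        "method": "hypothetical gauge method",
    },
}


def resolve(attribute):
    """Merge a dataset attribute with the shared definition it references."""
    code = attribute["definition_code"]
    if code not in SHARED_DEFINITIONS:
        raise KeyError(f"no shared definition for code {code!r}")
    resolved = dict(SHARED_DEFINITIONS[code])  # copy; leave the library intact
    resolved["column"] = attribute["column"]
    return resolved


# The per-dataset metadata stays small: a column name and a code.
precip_column = {"column": "precip", "definition_code": "00000"}
print(resolve(precip_column))
```

The point of the design is that many datasets can reference the same entry, so the unit and method are written once rather than repeated in each dataset's metadata.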
> > PS. Eda's email just came in as I was pressing send. I think my answer
> > to her comments is implicit in what I've written, but an explicit
> > response would be to say that the best I think we can hope from
> > automated processing based on what is currently in EML are products that
> > may be statistically valid (in some sense) but scientifically
> > meaningless. To use EML encoded data to answer a scientific question
> > (let's leave a few examples of simple QA/QC checks aside), you have to
> > read the text to find out what the data mean - none of the other
> > information there can help you without that. The point John raised (I
> > think) is - once you know the meaning, you can choose to categorize that
> > meaning in any kind of measurement scale system you want, without any
> > need for the data provider to do it for you.
> > 
> > On Fri, 2005-02-11 at 18:42 -0700, John Anderson wrote:
> > 
> >>Matt,
> >>
> >>I suspect I'm stepping in a big cowpie here since it's commonly known
> >>that a little knowledge can be a dangerous thing.  My involvement with
> >>EML being recent I hope you'll forgive what is sure to be considered
> >>belated comment better contributed to long ago discussion.  You may
> >>recall there was some discussion on measurement scale at the KNB data
> >>management workshop last week.  I read an article yesterday that has
> >>clarified a number of issues for me, caused many others to roll their
> >>eyes, but stimulated me to write to you nonetheless (Velleman PF and
> >>Wilkinson L (1993), "Nominal, Ordinal, Interval, and Ratio Typologies
> >>Are Misleading, " The American Statistician, 47, 65-72.).  I'm not a
> >>statistician, don't profess to be, and I recognize that there is much
> >>controversy over measurement scale typologies, including that generated
> >>by this paper.  However, having read Velleman's paper and some counter
> >>arguments, I would submit that use of Stevens' measurement scale of
> >>nominal, ordinal, interval, and ratio is counterproductive to what I
> >>understand to be the intent of its use in EML as a tool to facilitate
> >>data analysis and synthesis and should not only not be required, but
> >>that another classification be used in its place.  
> >>
> >>You are probably already familiar with it, but if not, I invite you to
> >>read the paper by Velleman and Wilkinson at
> >>http://www.spss.com/research/wilkinson/Publications/Stevens.pdf and
> >>perhaps you'll be able to empathize with my viewpoint.  If data are to
> >>be used in ways different from the initial intent and to address new
> >>questions unformed at the time of its collection, it is dangerous to put
> >>potential limits on the type of data analyses that can be done when this
> >>classification is based on the original question(s).  Even if
> >>considering only the original question, it is possible that the scale
> >>type would change dependent on how one views it.  Examples of this can
> >>be found in the referenced paper and I've certainly had similar
> >>discussions with others at this end. I think this excerpt says it rather
> >>nicely, "Scale type, as defined by Stevens, is not an attribute of the
> >>data, but rather depends upon the questions we intend to ask of the data
> >>and upon any additional information we may have. It may change due to
> >>transformation of the data, it may change with the addition of new
> >>information that helps us to interpret the data differently, or it may
> >>change simply because of the questions we choose to ask."  "...Scale
> >>types are not fundamental attributes of the data, but rather, derive
> >>from both how the data were measured and what we conclude from the
> >>data."  Tools such as Kepler will encourage more automated solutions to
> >>data analysis and synthesis efforts, all fine and good, but if there is
> >>automated variable selection that weights appropriateness for an
> >>analysis based on a variable's measurement scale in the EML metadata for
> >>a particular dataset, then the constraints are potentially artificial
> >>and overly restrictive.
> >>
> >>I don't argue that Stevens' measurement scale isn't useful when
> >>appropriately applied based on the data and the question asked, but that
> >>assignment in EML is premature.  Assignment to measurement scale should
> >>be done by the end user at the time of analysis when the current
> >>question being asked of a dataset is known.  The questions to be asked
> >>secondarily to why the data was originally collected cannot be
> >>anticipated.  Stevens' measurement scale appears to both prescribe and
> >>proscribe statistical methods that may be used, and the timing of a
> >>variable's entry into EML is too early to pigeonhole it when the intent
> >>is to present it for potential use in ways that the original
> >>investigator never envisioned. Alternative taxonomies for data types may
> >>be more useful that don't necessarily categorize based on original
> >>meaning or intent of data but simply on the form of the data itself.
> >>Velleman and Wilkinson present one option by another pair of authors:
> >>- names
> >>- grades (ordered labels such as freshman, sophomore, junior, senior)
> >>- ranks (starting from 1, which may represent either the largest or
> >>smallest)
> >>- counted fractions (bounded by zero and one; e.g., percentages)
> >>- counts (non-negative integers)
> >>- amounts (non-negative real numbers)
> >>- balances (unbounded, positive or negative values)
> >>
> >>From eml-2.0.1/docs/eml-2.0.1/eml-attribute.html:
> >>"The authors decided to sharpen the model of attribute by nesting unit
> >>under measurementScale. Measurement Scale is a data typology, borrowed
> >>from Statistics, that was introduced in the 1940's. Under the adopted
> >>model, attributes are classified as nominal, ordinal, interval, and
> >>ratio. Though widely criticized, this classification is well-known and
> >>provides at least first-order utility in EML. For example, nesting unit
> >>under measurementScale allows EML to prevent its meaningless inclusion
> >>for categorical data -- an approach judged superior to making unit
> >>universally required or universally optional."
> >>
> >>I believe much of the confusion people have in assigning a measurement
> >>scale to a variable is because it may be viewed as belonging to
> >>alternate scales depending on how one views the question that it is
> >>addressing.  I don't believe that the benefits of "first-order utility"
> >>as noted above are tangible but instead serve as a placeholder that may
> >>hinder rather than advance data analysis efforts.  Note that I make
> >>these comments in the context of EML and its potential for expanding the
> >>meaning and usefulness of historical and archived data through
> >>appropriate metadata.
> >>
> >>Thanks for hearing me out, Matt.  
> >>
> >>All the best,
> >>
> >>John
> >>
> >>John Anderson
> >>P.O. Box 30003, MSC 3JER
> >>New Mexico State University
> >>Las Cruces, NM 88003
> >>
> >>voice: 505-646-5818
> >>fax: 505-646-5889
> >>e-mail: janderso at nmsu.edu
> >>
> >>
> >>
> >>-------------------------------------------------
> >>Long-Term Ecological Research Network Mailing List
> >>im at LTERnet.edu
> >>http://sql.lternet.edu/cgi/mailgroups_view.pl?im
>