Measurement scale in EML

Matt Jones jones at nceas.ucsb.edu
Fri Feb 18 17:07:27 PST 2005


Dear John,

Sorry for the delayed response.  I found myself nodding in agreement as 
I read through Peter's response.  As Peter mentioned, we did extensively 
debate the issue you raised and Wilkinson's papers specifically during 
the development of EML 2 drafts (you can read the emails on the list 
archives), so I wanted to clarify some of our motivation.  We all agree 
there are many measurement scale classifications, that the Stevens 
typology has its limitations, and that the Stevens typology does not 
truly distinguish what one *can* do with the data in an analysis.

That said, we had a very specific and pragmatic reason for including the 
typology in EML: it is the simplest well-known classification that 
allowed us to distinguish the types of additional information that 
should be associated with each attribute.  So, for example, we felt it 
was inappropriate to provide a 'unit' for a categorical variable (units 
and other properties only apply to quantities).  We also wanted to be 
able to distinguish when one should provide the definitions for
categories (such as "P = phosphorus treatment, Z = no treatment"), which 
are inappropriate for numerical quantities.  The list of properties for 
each attribute was important to get from the original data provider, so 
we wanted to make some of the fields required (e.g., we wanted unit to 
be required for numerical quantities).  These desires led us to the need 
to classify attributes into groups that allowed us to apply the right 
set of requirements on the additional information (such as units). 
The Stevens typology was standard, broadly understood, and had about 
the right level of granularity.  With the addition of datetime, which 
is an entirely different can of worms that we debated ad nauseam, we 
thought the typology would serve us nicely in distinguishing which 
information should be provided.  But to be clear: asking the metadata 
provider to classify attributes and supply different additional 
information depending on the measurementScale in no way limits the 
analysis of the data by a later user.  People can analyze the data any 
way they see fit -- we just want to make sure that if the original 
collector thought a particular variable had unit 'meter', that 
information was recorded correctly.
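To make this concrete, here is a rough sketch of the two cases as EML 
2.0.1 attributes (element names are from my recollection of the 
attribute module, and the attribute names, codes, and values are purely 
illustrative):

```xml
<!-- Sketch only: a numeric (ratio) attribute, where unit and
     precision are required -->
<attribute>
  <attributeName>body_length</attributeName>
  <attributeDefinition>Body length of the organism</attributeDefinition>
  <measurementScale>
    <ratio>
      <unit><standardUnit>meter</standardUnit></unit>
      <precision>0.001</precision>
      <numericDomain><numberType>real</numberType></numericDomain>
    </ratio>
  </measurementScale>
</attribute>

<!-- Sketch only: a categorical (nominal) attribute, where code
     definitions are required and unit is not even allowed -->
<attribute>
  <attributeName>treatment</attributeName>
  <attributeDefinition>Nutrient treatment applied</attributeDefinition>
  <measurementScale>
    <nominal>
      <nonNumericDomain>
        <enumeratedDomain>
          <codeDefinition>
            <code>P</code>
            <definition>phosphorus treatment</definition>
          </codeDefinition>
          <codeDefinition>
            <code>Z</code>
            <definition>no treatment</definition>
          </codeDefinition>
        </enumeratedDomain>
      </nonNumericDomain>
    </nominal>
  </measurementScale>
</attribute>
```

The point is structural: the information required of the provider 
follows from which measurementScale branch the attribute sits in, not 
from any claim about how the data must later be analyzed.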

With more experience, I now think we could probably have done with one 
fewer measurement scale.  In particular, the difference between
interval and ratio is subtle, and the domain specification for them is 
the same -- they both are numeric quantities with units and precision, 
etc.  This seems to be where almost all of the confusion over 
measurement scale arises.  At this point, we could probably get away 
with 4 measurement scales:
   * nominal
     -- information provided defines meaning of categories
   * ordinal
     -- information provided defines meaning of categories and
        order of values
   * numeric (includes interval and ratio)
     -- information provided defines numeric domain, precision, unit, etc
   * datetime
     -- information provided defines format and date domain

That set of 4, however, is an entirely non-standard typology, and so a 
little harder to defend, but it would suffice for partitioning our 
attributes into the groups needed to set the domains and other 
information properly.  I think there is something to be gained by
sticking with the well-known Stevens typology, despite its known 
shortcomings.  Nevertheless, we could open this issue up again for 
another revision of EML.  I'm not sure it's worth the trouble of the 
change (there would be backwards-compatibility issues), but we can 
certainly discuss it on eml-dev.  What I probably wouldn't go along 
with is eliminating measurementScale altogether, because I think it is 
needed to properly segregate metadata about numeric quantities from 
metadata needed for nominals, ordinals, and dates.  Do you have a 
mechanism other than classifying into measurementScales that would 
allow EML to require the 'unit' field for certain variables (e.g., 
body_length) but not allow 'unit' to be filled in for others (e.g., 
species_name)?
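In XML Schema terms, the mechanism is simply a choice: each branch of 
measurementScale carries its own content model, so 'unit' can be 
required in the numeric branches and absent from the categorical ones. 
A rough sketch (the type names here are invented for illustration, not 
the actual EML schema types):

```xml
<!-- Illustrative sketch; type names are hypothetical -->
<xs:element name="measurementScale">
  <xs:complexType>
    <xs:choice>
      <!-- categorical branches: no unit element in their content model -->
      <xs:element name="nominal"  type="NonNumericDomainType"/>
      <xs:element name="ordinal"  type="NonNumericDomainType"/>
      <!-- numeric branches: unit and precision required -->
      <xs:element name="interval" type="NumericDomainWithUnitType"/>
      <xs:element name="ratio"    type="NumericDomainWithUnitType"/>
      <xs:element name="datetime" type="DateTimeDomainType"/>
    </xs:choice>
  </xs:complexType>
</xs:element>
```

Any replacement for measurementScale would need an equivalent 
discriminator to drive that kind of conditional requirement.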

I recommend that you send any future emails to 
eml-dev at ecoinformatics.org, as that is the list with the appropriate EML 
expertise to address questions such as these. Thanks a ton for your 
thoughtful comments. I'll watch for further discussions on eml-dev.

Matt

Peter McCartney wrote:
> John, thanks for your input on what I think is an important issue in the
> future development of EML. 
> 
> We actually did debate this issue a lot during the final months of EML 2
> (even enduring Leland Wilkinson's usual treatises on why most of us are
> doing statistics wrong). We went ahead with it anyway because some
> compelling arguments were made that, right or wrong, the information
> could empower us to do some useful things. Unfortunately, the design of
> the complex type requires that it be provided in order to convey other
> information like precision, enumerations, or units so that we could
> define rules for when such information was appropriate to include.
> 
> I'm not second-guessing anything we did in EML, but with the benefit of
> time, we can see two facts: first, the applications that really do
> anything for us using the measurement scale information are, I think,
> either experimental, trivial, or non-existent.  Other than being able to
> validate one's EML or create a semi-intelligent editor, there is as yet
> no compelling research-related reward that is realized by those who put
> all this information into EML.
> 
> Second, through the work under SEEK and GEON, we now know a great deal
> more about ontologies than we did when we wrote EML 2. Looking at the
> demo apps GEON has made (GEONgrid.org), for example, it is obvious that
> while knowing a variable is nominal or ratio can help you make a better
> 'guess' at how to use it, linking it to an ontology can let you
> explicitly inform an app on exactly what you can and should do with the
> data. Not only can ontologies provide all the domain, units, and other
> information, they can be linked to explicit application metadata, such
> as really useful display rules in this example. 
> 
> So given that ontologies give us much more info for making automated
> processing decisions, we should be asking 1) does measurement scale (or
> even the slightly richer but still very abstract scale of Velleman and
> Wilkinson) still give us any useful info once a field has been mapped to
> an ontology? and 2) will the measurement scale construct help in future
> links between EML and an ontology knowledgebase for ecology, or will it
> actually confuse things, as John is suggesting? If I know that a column
> is a personal name, or an SCS classification, or a measurement of tree
> diameter, I automatically know the answers to most of what we are
> putting into EML anyway, except that now I could actually make use of
> that info. 
> 
> So perhaps rather than thrashing further with measurement scale or
> considering alternate systems like V&W (which I would contend are also
> limited in how much they really tell me about the data), we should start
> doing some investigation as to how many shared measurement domains we
> really have in LTER and how feasible it would be to start working in
> that direction, using them as external libraries similar to how we
> handle projection systems and units now. We may find, if we did a
> survey, that a large number of datasets could be accommodated with a
> smaller number of explicit ontologies (the 80/20 rule). 
> 
> If they could, the relative effort to define and publish a limited
> number of (for example) shared climate measurement ontologies that we
> can reference in our metadata could be less than what we are doing now:
> many precip datasets, each with lots of redundant information on
> everything except what is really being measured. A good example of this
> is the STORET code list commonly used by USGS and other water quality
> databases. By referencing a single 5-digit code, one has access to all
> the methods, domain, units, etc., without putting any of that in the
> metadata for each individual dataset.
> 
> PS. Eda's email just came in as I was pressing send. I think my answer
> to her comments is implicit in what I've written, but an explicit
> response would be to say that the best I think we can hope for from
> automated processing based on what is currently in EML is products that
> may be statistically valid (in some sense) but scientifically
> meaningless. To use EML-encoded data to answer a scientific question
> (let's leave a few examples of simple QA/QC checks aside), you have to
> read the text to find out what the data mean - none of the other
> information there can help you without that. The point John raised (I
> think) is: once you know the meaning, you can choose to categorize that
> meaning in any kind of measurement scale system you want, without any
> need for the data provider to do it for you.
> 
> On Fri, 2005-02-11 at 18:42 -0700, John Anderson wrote:
> 
>>Matt,
>>
>>I suspect I'm stepping in a big cowpie here since it's commonly known
>>that a little knowledge can be a dangerous thing.  My involvement with
>>EML being recent, I hope you'll forgive what is sure to be considered
>>a belated comment better contributed to long-ago discussion.  You may
>>recall there was some discussion on measurement scale at the KNB data
>>management workshop last week.  I read an article yesterday that has
>>clarified a number of issues for me, caused many others to roll their
>>eyes, but stimulated me to write to you nonetheless (Velleman PF and
>>Wilkinson L (1993), "Nominal, Ordinal, Interval, and Ratio Typologies
>>Are Misleading, " The American Statistician, 47, 65-72.).  I'm not a
>>statistician, don't profess to be, and I recognize that there is much
>>controversy over measurement scale typologies, including that generated
>>by this paper.  However, having read Velleman's paper and some counter
>>arguments, I would submit that use of Steven's measurement scale of
>>nominal, ordinal, interval, and ratio are counterproductive to what I
>>understand to be the intent of its use in EML as a tool to facilitate
>>data analysis and synthesis and should not only not be required, but
>>that another classification be used in its place.  
>>
>>You are probably already familiar with it, but if not, I invite you to
>>read the paper by Velleman and Wilkinson at
>>http://www.spss.com/research/wilkinson/Publications/Stevens.pdf and
>>perhaps you'll be able to empathize with my viewpoint.  If data are to
>>be used in ways different from the initial intent and to address new
>>questions unformed at the time of their collection, it is dangerous to put
>>potential limits on the type of data analyses that can be done when this
>>classification is based on the original question(s).  Even if
>>considering only the original question, it is possible that the scale
>>type would change dependent on how one views it.  Examples of this can
>>be found in the referenced paper and I've certainly had similar
>>discussions with others at this end. I think this excerpt says it rather
>>nicely, "Scale type, as defined by Stevens, is not an attribute of the
>>data, but rather depends upon the questions we intend to ask of the data
>>and upon any additional information we may have. It may change due to
>>transformation of the data, it may change with the addition of new
>>information that helps us to interpret the data differently, or it may
>>change simply because of the questions we choose to ask."  "...Scale
>>types are not fundamental attributes of the data, but rather, derive
>>from both how the data were measured and what we conclude from the
>>data."  Tools such as Kepler will encourage more automated solutions to
>>data analysis and synthesis efforts, all fine and good, but if there is
>>automated variable selection that weights appropriateness for an
>>analysis based on a variable's measurement scale in the EML metadata for
>>a particular dataset, then the constraints are potentially artificial
>>and overly restrictive.
>>
>>I don't argue that Stevens's measurement scale isn't useful when
>>appropriately applied based on the data and the question asked, but
>>that assignment in EML is premature.  Assignment to a measurement
>>scale should be done by the end user at the time of analysis, when the
>>current question being asked of a dataset is known.  The questions
>>that will be asked beyond those for which the data were originally
>>collected cannot be anticipated.  Stevens's measurement scale appears
>>to both prescribe and proscribe statistical methods that may be used,
>>and the timing of a variable's entry into EML is too early to
>>pigeonhole it when the intent is to present it for potential use in
>>ways that the original investigator never envisioned.  Alternative
>>taxonomies for data types that don't categorize based on the original
>>meaning or intent of the data, but simply on its form, may be more
>>useful.  Velleman and Wilkinson present one option by another pair of
>>authors:
>>- names
>>- grades (ordered labels such as freshman, sophomore, junior, senior)
>>- ranks (starting from 1, which may represent either the largest or
>>smallest)
>>- counted fractions (bounded by zero and one; e.g., percentages)
>>- counts (non-negative integers)
>>- amounts (non-negative real numbers)
>>- balances (unbounded, positive or negative values)
>>
>>From eml-2.0.1/docs/eml-2.0.1/eml-attribute.html:
>>"The authors decided to sharpen the model of attribute by nesting unit
>>under measurementScale. Measurement Scale is a data typology, borrowed
>>from Statistics, that was introduced in the 1940's. Under the adopted
>>model, attributes are classified as nominal, ordinal, interval, and
>>ratio. Though widely criticized, this classification is well-known and
>>provides at least first-order utility in EML. For example, nesting unit
>>under measurementScale allows EML to prevent its meaningless inclusion
>>for categorical data -- an approach judged superior to making unit
>>universally required or universally optional."
>>
>>I believe much of the confusion people have in assigning a measurement
>>scale to a variable is because it may be viewed as belonging to
>>alternate scales depending on how one views the question that it is
>>addressing.  I don't believe that the benefits of "first-order utility"
>>noted above are tangible; instead, the classification serves as a
>>placeholder that may hinder rather than advance data analysis efforts.
>>Note that I make
>>these comments in the context of EML and its potential for expanding the
>>meaning and usefulness of historical and archived data through
>>appropriate metadata.
>>
>>Thanks for hearing me out, Matt.  
>>
>>All the best,
>>
>>John
>>
>>John Anderson
>>P.O. Box 30003, MSC 3JER
>>New Mexico State University
>>Las Cruces, NM 88003
>>
>>voice: 505-646-5818
>>fax: 505-646-5889
>>e-mail: janderso at nmsu.edu
>>
>>
>>
>>-------------------------------------------------
>>Long-Term Ecological Research Network Mailing List
>>im at LTERnet.edu
>>http://sql.lternet.edu/cgi/mailgroups_view.pl?im

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------


