measurementScale Was: [VOTE] should we release eml-2.0.0 now?

Thu Jan 30 11:43:00 PST 2003

Hi All,
    I am currently working on a stylesheet to transform eml-beta6 to 
eml2, so I am coming up against several of the problems discussed here. 
For whatever it may be worth, I wrote down some of my thoughts in the 
following paragraph. Maybe it will help provide some different 
perspectives on the issues.

---
Dan Higgins, 30 Jan 2003

Comments on Stevens' hierarchy of measurement scales (nominal, ordinal, 
interval, ratio)

EML 2.0 has added a required element to the attribute module called 
'measurementScale'. One must choose one of 5 child elements under 
'measurementScale': namely, 'nominal', 'ordinal', 'interval', 'ratio', 
or 'datetime'. The first four in this list are based on ideas first 
presented in the early 1940s by Harvard psychologist S. S. Stevens.

First, I should give some personal background that probably flavors my 
views on this subject. I was trained as a physicist and worked for many 
years on physics and engineering problems and datasets. I am thus not an 
ecologist and am not aware of all the traditions of that field. However, 
physics/engineering works with measured data and has many of the same 
problems with regard to handling data and datasets.

I should note that the idea of classifying attributes using the terms 
'nominal', 'ordinal', 'interval', 'ratio' was entirely new to me. I 
checked several statistics books aimed at engineers and physicists and 
could not find ANY references to these 'measurement scales'. I would thus 
claim that the use of these scales is not essential to applying 
statistics to measured data. (I do understand that the concepts are 
common in BioStatistics texts.)

A concept can be useful, however, even if not essential. Let me try to 
explain what I think these scales mean in order to determine if they are 
useful.

nominal - I think of 'nominal' as referring to a set of named items. A 
column of nominal data in a table is a set of names, mathematically a 
'set' which can be finite, infinite, enumerated, etc. Anything listed in 
a table fits in the 'nominal' category (since it is listed by a name 
symbol) but some lists have other characteristics that move them to 
other categories. Unordered non-numeric strings are a practical example 
of this scale. I think most physicists/engineers would simply call this 
category a 'set' or 'list'.

ordinal - I think of an ordinal scale as simply an 'ordered set'. The 
labels 'first', 'second', and 'third' are a set of three symbols which 
have an 'order' implied by the common usage of the symbols as English 
words.

Both 'nominal' and 'ordinal' seem to be lists of data items that are 
mathematical sets that do not have a mapping to real numbers. One thus 
cannot apply numerical operations (other than sorting) to these sets, 
even if the symbols used appear as numbers (e.g. one might use 1='high', 
2='medium', 3='low'). I would argue that it would be better to simply 
call both nominal and ordinal data 'set' data and then further 
categorize the set as finite, infinite, ordered, enumerated, etc. A 
further problem that I have with these 2 current scales is that both are 
handled in eml2 as NonNumericTypes. 'ordinal' is an ordered set, but 
there is no information in its metadata to indicate how it is ordered. 
This seems to be implied by the linguistic meaning of the items in the 
set. (Perhaps one needs to state that the order of an enumeration is the 
ordering of the ordinal elements).
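
The parenthetical suggestion above can be sketched in a few lines of Python. This is only an illustration under an assumption: the enumeration and its codes are hypothetical, and the listed order is taken to be the ordinal order.

```python
# Hypothetical enumeration; its listed order is assumed to be the ordinal order.
ENUMERATION = ["low", "medium", "high"]

# Map each code to its position in the enumeration.
RANK = {code: position for position, code in enumerate(ENUMERATION)}


def sort_ordinal(values):
    """Sort ordinal codes by enumeration order rather than alphabetically."""
    return sorted(values, key=RANK.__getitem__)


observed = ["high", "low", "medium", "low"]
print(sort_ordinal(observed))  # ['low', 'low', 'medium', 'high']
print(sorted(observed))        # alphabetical: ['high', 'low', 'low', 'medium']
```

Without such a convention, the only ordering a program can fall back on is the alphabetical one, which is usually not the intended one.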

How else might one characterize a nominal or ordinal data column? For 
human use, information in the attributeDefinition metadata element 
should say that the column is a set of names or an ordered set. For 
machine based parsing, it may be much easier to specifically include a 
statement that the attribute is a set, or nominal or ordinal.

interval, ratio - These two measurement scales both refer to attributes 
that can be mapped to real numbers (or perhaps a subset like the 
integers). Thus most numerical/algebraic operations can be applied to 
interval or ratio scale items. The difference between the two seems 
somewhat fuzzy (as illustrated by arguments about how dates/times should 
be categorized). It seems that the addition/subtraction of interval 
scale items is allowed, while multiplicative scaling of ratio items is 
meaningful. The classic example seems to be that temperature in degrees 
Celsius is 'interval', but temperature in degrees Kelvin is 'ratio'. It 
is often stated that a 'ratio' scale has a "meaningful zero point" (and 
this seems to mean that negative values are not allowed).    

The first mental block that I ran into in trying to understand the 
difference between interval and ratio is the example of temperature 
measurements. If I have a column of temperature data in degrees Celsius 
it is an interval scale, but if I make a simple unit conversion to 
degrees Kelvin the data changes to a ratio scale? Is the zero implied by 
kinetic theory for the Kelvin scale really "meaningful", while the 
temperature where water turns to ice is not? Stevens' measurement scales 
may have a use in determining what statistics can be applied, but they 
don't seem useful for helping with simple unit transformations which are 
likely to be required for automatic data comparisons (since changing a 
unit can change the Stevens type). It would seem better for unit 
conversions to have data about the unitType (e.g. temperature) higher in 
the eml2 hierarchy.
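
The temperature puzzle can be made concrete with a small Python sketch. The two sample values are hypothetical; the only fact assumed is the standard offset of 273.15 between the Celsius and Kelvin scales.

```python
def celsius_to_kelvin(t_c):
    """Unit conversion that only shifts the zero point."""
    return t_c + 273.15


# Two hypothetical temperature readings in degrees Celsius.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are the same in both units.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios are not: "twice as large" in Celsius is not twice as large in Kelvin.
ratio_c = b_c / a_c
ratio_k = b_k / a_k

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 4))
```

The same column of measurements therefore supports the same differences in either unit, while the ratios depend on which zero point the unit happens to use.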

Also, there are some things that would seem to always be non-negative 
(e.g. size or length) and thus are ratio values. But how do we measure 
size? One often would place a specimen next to a ruler and simply take 
the absolute value of the difference between the reading where one end 
hits the ruler and the reading on the other end. The zero on the ruler 
can be quite arbitrary (interval), but when one takes the difference of 
two interval values and then applies an absolute value function the 
result would seem to be ratio? Does this mean that any set of interval 
numbers can be made into a set of ratio values by a simple 
transformation that eliminates all negative values?
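
The arithmetic of the ruler example can be written out as a short Python sketch. The readings and the shift are hypothetical numbers; the sketch only shows that the derived length does not depend on where the ruler's zero sits, and it does not settle the question just asked.

```python
import math


def length_from_readings(start, end):
    """Length as the absolute difference of two ruler readings."""
    return abs(end - start)


# Hypothetical readings (cm) where each end of the specimen meets the ruler.
start, end = 3.2, 7.9

# Slide the ruler so that its zero sits elsewhere (hypothetical 10 cm shift).
shift = -10.0
shifted_start, shifted_end = start + shift, end + shift

length = length_from_readings(start, end)
shifted_length = length_from_readings(shifted_start, shifted_end)

# The individual readings change (and go negative); the length does not.
print(shifted_start < 0, math.isclose(length, shifted_length))
```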

I guess that as I currently understand the concepts, both interval and 
ratio are measurements that can be mapped to real numbers and any 
operation that can be applied to real numbers is meaningful. It seems 
that ratio measurements are always mapped to non-negative real numbers 
while interval measurements may be mapped to both negative and positive 
numbers.

I thus conclude that if we know an attribute is numeric and that its 
minimum bound is non-negative, we can conclude that it is a ratio 
measurement. If it is numeric and it has a lower bound that is negative, 
then one can assume that it is interval. In other words, simply 
establishing that an attribute is a NumericDomainType and having a 
required minimum element would eliminate the need for having interval or 
ratio elements.
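
The rule just stated can be sketched as a small Python function. The function name, argument names, and return strings are hypothetical and are not part of any EML schema or tool; the result is a guess from metadata, not a determination.

```python
from typing import Optional


def guess_scale(is_numeric: bool, minimum: Optional[float]) -> str:
    """Guess a Stevens scale from a numeric flag and a declared minimum."""
    if not is_numeric:
        # Non-numeric domains cannot be split further by this rule.
        return "nominal-or-ordinal"
    if minimum is None:
        # No declared lower bound: the rule has nothing to go on.
        return "interval-or-ratio"
    # Non-negative lower bound -> ratio; negative lower bound -> interval.
    return "ratio" if minimum >= 0 else "interval"


print(guess_scale(True, 0.0))     # ratio
print(guess_scale(True, -40.0))   # interval
print(guess_scale(False, None))   # nominal-or-ordinal
```

Note that the rule uses only the declared minimum, so it relies on that bound describing what values are allowed rather than what values happen to occur.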

Can one determine the Stevens measurement scale from the data? Well, 
numbers can be distinguished from non-numeric strings, but sometimes 
numeric strings (particularly integers) are used to simply indicate 
items in a list and sometimes item ordering. Also, a list of floating 
point numbers may happen to be non-negative even though negative values 
are still allowed. 
There can thus be problems in determining measurement scales from data only.
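
A minimal Python sketch of this ambiguity follows. The three columns are hypothetical; they hold similar-looking integer strings but are meant as site labels, severity codes, and counts respectively.

```python
def looks_numeric(values):
    """Return True if every string in the column parses as a number."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True


# Hypothetical columns as they might appear in a data file.
site_labels = ["1", "2", "3", "2"]   # names for sites (nominal)
severity = ["1", "3", "2", "1"]      # 1=high, 2=medium, 3=low (ordinal)
counts = ["1", "2", "3", "2"]        # number of individuals (ratio)

for column in (site_labels, severity, counts):
    print(looks_numeric(column))     # True for all three
```

A check on the data alone gives the same answer for all three columns, so the intended scale has to come from somewhere else.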

But what if one looks not at the data but at a different set of 
metadata, such as that defined by eml-beta6 (which does not include the 
Stevens scales). That schema does include an attributeDomain element 
which is either enumerated/text or numeric. That immediately says that 
the attribute can be divided into nominal/ordinal or interval/ratio 
categories. It does lack information on ordering of text, but eml2 also 
has no information on how items are ordered. And one can guess whether 
an attribute is interval or ratio by looking at the minimum numeric 
value set in the metadata.

I would thus favor removing the Stevens measurement scales from eml, or 
at least not making the values required and in-line for all attributes. 
I think set theory terms might be better for characterizing non-numeric 
data (ordered, unordered, finite, infinite, etc) and numeric bounds, 
unitTypes, etc. used to characterize numerical values. But then I am not 
an ecologist. Still, recent discussions about how to handle dates/times 
seem to indicate that the Stevens scales are confusing to ecologists also!
---

Peter McCartney wrote:

> Scott et al.
>
> We have put some simple rules in xylographa for making best guesses 
> since, as we all realize, you must make them if you want to be able to put 
> in the information that you DO know such as domain info. Given that 
> interval and ordinal scales are much less common than ratio and 
> nominal scales in ecological data, we have made the assumption that 
> unless the data contain negative values or the imported metadata 
> indicates a range that includes negative values, the data are probably 
> ratio. If the data are alpha numeric, then we assume they are nominal. 
> None of these rules are reliable of course, and you have to be 
> committed to checking them over after the fact. However, since we can 
> often glean something about domain in an automated way, you have to 
> make a choice if you want to be able to put it in the document. In 
> retrospect, it was bad design (I can say this because I was one of the 
> ones that suggested doing it this way!).
>
> Frankly, I'm more bothered by our failure to fix temporal coverage so 
> that it will accept something other than a date type. We did this for 
> other places where dates were used but not here. This has turned out 
> to be the most aggravating flaw we've had to deal with. Any feedback 
> on how people are handling that? The fix would be to make its domain a 
> union of xs:string and xs:date.
>
>
>
> Peter McCartney (peter.mccartney at asu.edu)
> Center for Environmental-Studies
> Arizona State University
>  
>
>
> -----Original Message-----
> From: Tim Bergsma [mailto:tbergsma at kbs.msu.edu]
> Sent: Thursday, January 23, 2003 10:24 AM
> To: Matt Jones
> Cc: Scott Chapal; eml-dev
> Subject: Re: measurementScale Was: [VOTE] should we release eml-2.0.0 
> now?
>
>
> Scott,
>
> from a philosophical point of view...
>
> No, there is no way to automate choice of measurement scale.  Someone 
> has pointed out choice of measurement scale on the Stevens typology 
> depends in part on what you want to do with the data.  So, for 
> instance, measurement scale can't be predicted from, say, 
> DictionaryUnit.  This is probably a flaw in the typology, and as Matt 
> points out, will probably persist for 2.X.  Let me know if I've 
> misunderstood your question.
>
> Tim.
>
>
> Matt Jones wrote:
> >
> > Scott,
> >
> > I have no intention of making any changes to EML whatsoever. There
> > might be some bug fixes if they arise (e.g., spelling, mismatched
> > element names, etc.), but I would personally oppose any structural
> > change that broke backwards compatibility over the next several years.
> > That includes changes to measurementScale and ID's and references
> > elements.  At this point our focus should be on adoption and use of
> > the agreed EML 2.0.0 specification.  Let's produce some metadata and
> > make the data available!
> >   I know that's what we're focusing on now at NCEAS.
> >
> > Matt
> >
> > Scott Chapal wrote:
> > > Peter McCartney <peter.mccartney at asu.edu> writes:
> > >
> > >
> > >>for example, I'm not sure we won't just decide upon more feedback that
> > >>classifying attributes into Stevens measurement scales was a
> > >>mistake. if so, I'd rather scrap it than struggle to make it work.
> > >
> > >
> > > OK, here's some feedback.
> > >
> > > Has anyone proposed a way to automate the choice of
> > > measurementScale? Ironically, as a result of the 11th hour dateTime
> > > debate, dates and times are the easiest!
> > >
> > > We are now faced with encoding NOMINAL, ORDINAL, INTERVAL and RATIO
> > > into the attribute (variable) definitions of the data
> > > themselves...in order to provide a mechanism for auto-generating EML
> > > via XSL.
> > >
> > > If, however, the 2.0 version of eml-attribute (measurementScale)
> > > will not persist in future versions, I don't want to bother going to
> > > all that work.  Is measurementScale, in its current form here to
> > > stay?
> > >
> > > Comments?
> > >
> > >
> > >>I'm also not sure that we won't decide to return to triples as better
> > >>tools for working with them are developed in SEEK.
> > >
> > >
> > >
> >
> > --
> > *******************************************************************
> > Matt Jones                                    jones at nceas.ucsb.edu
> > http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
> > National Center for Ecological Analysis and Synthesis (NCEAS)
> >
> > Interested in ecological informatics? http://www.ecoinformatics.org
> > *******************************************************************
> >
> > _______________________________________________
> > eml-dev mailing list
> > eml-dev at ecoinformatics.org
> > http://www.ecoinformatics.org/mailman/listinfo/eml-dev
>
> -- 
> Tim Bergsma
> LTER Information Manager
> W.K. Kellogg Biological Station
> Michigan State University
> Hickory Corners, MI   49060
> 629/671-2337
> tbergsma at kbs.msu.edu
> http://lter.kbs.msu.edu
>

-- 
*******************************************************************
Dan Higgins                                  higgins at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Ph: 805-892-2531
National Center for Ecological Analysis and Synthesis (NCEAS) 
735 State Street - Room 205
Santa Barbara, CA 93195
*******************************************************************