measurementScale Was: [VOTE] should we release eml-2.0.0 now?

Fri Jan 31 09:04:37 PST 2003

Well sorry guys, my comments obviously came off more contentious than
intended. My point wasn't to argue that factions were controlling the
process, but simply to remind the thread that the current design is indeed
the result of compromise. I should have used language that referenced the
conflicting design goals and not the people behind them. My other remarks
aside, I thought that the gist of my message was pretty much in line with the
spirit that debating these decisions won't make or break the acceptance of
EML.

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental-Studies
Arizona State University

-----Original Message-----
From: Matt Jones [mailto:jones at nceas.ucsb.edu] 
Sent: Thursday, January 30, 2003 6:03 PM
To: eml-dev
Subject: Re: measurementScale Was: [VOTE] should we release eml-2.0.0 now?

Hey,

Despite Dan's most recent treatise (with its mistakes) on the Stevens 
typology, I think we should just focus on getting EML 2.0.0 deployed. 
It is counter-productive to the community to debate it yet again.  I 
agree with Peter that the typology is not really a problem.

On Peter's note about text data, we do specifically have a place in EML 
for them: nominal, with a domain of textDomain.  We were specifically 
addressing the idea of free-form comments and unrestrained value spaces 
when we decided to add the textDomain element as an alternative to 
enumeratedDomain.  For more on this, see [1].

As one of those "certain individuals" that Peter refers to, I agree with 
Chad that I found Peter's comments to be debilitating rather than 
constructive with respect to community-building.  If you want to have a 
shared metadata standard, then use the EML 2.0.0 that we all 
democratically approved after several years of technical discussion -- 
this was not last minute, there was ample opportunity for dissent on 
technical merits, and there was no rigid timeline on the release, so 
nobody can claim to not have had the option of changing it back then.

Matt

[1] And, just because a text field is not an enumeration does not mean 
that it isn't nominal data.  In the case of comments, they just happen 
to be nominal attributes with a unique classification value for every 
observation.  For example, a comment field containing the value "I got 
hit by a wave when I took this measurement" is just as much a 
classification, and therefore nominal, as an enumerated field containing 
the value "wave-interference".  Bottom line: we DO accommodate text 
fields, in the nominal measurement scale with a textDomain domain.

Peter McCartney wrote:
> Ain't this fun?
> 
> I have no love lost for Stevens scale in EML. It was introduced in an
> unfortunate flurry of last minute changes because some members would not 
> live with making certain domain and unit-related elements optional, 
> which forced us to enter nonsense information in those places. 
> Introducing the Stevens scale enabled us to make these required when 
> appropriate. To drop them means to readdress the issue of optionality. 
> In my opinion, EML itself is still considered "optional" by the 
> community, so I think we might be a bit anal worrying about it at this 
> level.
> 
> The ten cent rule remains:
> 
> Nominal data are categorical. You can say that A is not B.
> 
> Ordinal data are ranked. You can say that A precedes B, which precedes
> C according to some rule of order, but you cannot place any value on the
> relative differences beyond their order.
> 
> Interval data are measured. You can say that the difference between
> -20 and -10 is the same as the difference between 40 and 50.
> 
> Ratio data are measured from an origin. You can say 40 cm is twice as
> long as 20 cm.
> 
> Datetime is interval, but because it's so inextricably bound to its
> physical encoding formats, we punted on that issue.
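
The ten cent rule above can be read as a ladder: each scale permits the
statements of the scales below it plus one more. A minimal Python sketch of
that reading follows. The names Scale, REQUIRED_SCALE and is_meaningful are
made up for illustration and are not part of EML or of any tool mentioned in
this thread.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Stevens scales, ordered so that each level adds to the ones below."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Lowest scale at which each kind of statement becomes meaningful.
# The statement names are illustrative labels, not EML vocabulary.
REQUIRED_SCALE = {
    "equal": Scale.NOMINAL,        # "A is not B"
    "order": Scale.ORDINAL,        # "A precedes B"
    "difference": Scale.INTERVAL,  # "-10 - (-20) equals 50 - 40"
    "ratio": Scale.RATIO,          # "40 cm is twice 20 cm"
}


def is_meaningful(statement: str, scale: Scale) -> bool:
    """Return True if this kind of statement makes sense on this scale."""
    return scale >= REQUIRED_SCALE[statement]
```

Under this sketch is_meaningful("ratio", Scale.INTERVAL) is False, while
is_meaningful("difference", Scale.INTERVAL) is True.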
> 
> I find it annoying that we failed to provide an element for Text data
> (such as descriptive paragraphs and so on), so we are forced to just 
> class those as nominal even though we would never do any kind of 
> statistical analysis using those values as categories within a set.
> 
> I don't find it particularly difficult or confusing for people to
> understand data that has been classified in this way - what's more 
> difficult is dealing with the frequent situation where we know something 
> about the domain, but have to guess the scale. In principle, if we know 
> the units, then we probably know the Stevens classification, even though 
> STMML doesn't recognize this concept in the unit definitions (all length 
> measurements are ratio).
> 
> In retrospect, I agree it was a dumb idea, but it was the only way to
> achieve the rigid schema control that certain individuals wanted. I 
> don't think this is the thing that will make or break EML
> 
> Peter McCartney (peter.mccartney at asu.edu)
> Center for Environmental-Studies
> Arizona State University
>  
> 
> 
> -----Original Message-----
> From: Dan Higgins [mailto:higgins at nceas.ucsb.edu]
> Sent: Thursday, January 30, 2003 12:43 PM
> To: Peter McCartney
> Cc: 'Tim Bergsma'; Matt Jones; Scott Chapal; eml-dev
> Subject: Re: measurementScale Was: [VOTE] should we release eml-2.0.0 
> now?
> 
> 
> Hi All,
>     I am currently working on a stylesheet to transform eml-beta6 to 
> eml2, so I am coming up against several of the problems discussed 
> here. For whatever it may be worth, I wrote down some of my thoughts 
> in the following paragraph. Maybe it will help provide some different 
> perspectives on the issues.
> 
> ---
> Dan Higgins, 30 Jan 2003
> 
> Comments on Stevens hierarchy of measurement scales (nominal, ordinal,
> interval, ratio)
> 
> EML 2.0 has added a required element to the attribute module called 
> 'measurementScale'. One must choose one of 5 child elements under
> 'measurementScale': namely, 'nominal', 'ordinal', 'interval', 'ratio', 
> or 'datetime'. The first four of this list are based on ideas first 
> presented in the early 1940s by Harvard psychologist S. S. Stevens.
> 
> First, I should give some personal background that probably flavors my 
> views on this subject. I was trained as a physicist and worked for 
> many years on physics and engineering problems and datasets. I am thus 
> not an ecologist and am not aware of all the traditions of that field. 
> However, physics/engineering works with measured data and has many of 
> the same problems with regard to handling data and datasets.
> 
> I should note that the idea of classifying attributes using the terms 
> 'nominal', 'ordinal', 'interval', 'ratio' was entirely new to me. I 
> checked several statistics books aimed at engineers and physicists and 
> could not find ANY references to these 'measurement scales'. I would 
> thus claim that the use of these scales is not essential to applying 
> statistics to measured data. (I do understand that the concepts are 
> common in BioStatistics texts.)
> 
> A concept can be useful, however, even if not essential. Let me try to 
> explain what I think these scales mean in order to determine if they 
> are useful.
> 
> nominal - I think of 'nominal' as referring to a set of named items. A 
> column of nominal data in a table is a set of names, mathematically a 
> 'set' which can be finite, infinite, enumerated, etc. Anything listed 
> in a table fits in the 'nominal' category (since it is listed by a 
> name symbol) but some lists have other characteristics that move them
> to other categories. Unordered non-numeric strings are a practical
> example of this scale. I think most physicists/engineers would simply
> call this category a 'set' or 'list'.
> 
> ordinal - I think of an ordinal scale as simply an 'ordered set'. The 
> labels 'first', 'second', and 'third' are a set of three symbols which 
> have an 'order' implied by the common usage of the symbols as English 
> words.
> 
> Both 'nominal' and 'ordinal' seem to be lists of data items that are 
> mathematical sets that do not have a mapping to real numbers. One thus 
> cannot apply numerical operations (other than sorting) to these sets, 
> even if the symbols used appear as numbers (e.g. one might use 
> 1='high', 2='medium', 3='low'). I would argue that it would be better 
> to simply call both nominal and ordinal data 'set' data and then 
> further categorize the set as finite, infinite, ordered, enumerated, 
> etc. A further problem that I have with these 2 current scales is that 
> both are handled in eml2 as NonNumericTypes. 'ordinal' is an ordered 
> set, but there is no information in its metadata to indicate how it is 
> ordered. This seems to be implied by the linguistic meaning of the 
> items in the set. (Perhaps one needs to state that the order of an 
> enumeration is the ordering of the ordinal elements).
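
The parenthetical suggestion above, that the order of an enumeration could
stand in for the ordering of an ordinal attribute, can be sketched in a few
lines of Python. The code list and the helper sort_ordinal are hypothetical;
this shows only the convention, not anything EML 2.0.0 itself specifies.

```python
# Hypothetical enumerated domain for an ordinal attribute.
# Assumption: position in this list is taken to be the rank.
ENUMERATION = ["low", "medium", "high"]

# Map each code to its position so that values can be ranked.
RANK = {code: position for position, code in enumerate(ENUMERATION)}


def sort_ordinal(values):
    """Sort values by their position in the enumeration, not alphabetically."""
    return sorted(values, key=RANK.__getitem__)


observations = ["high", "low", "medium", "low"]
print(sort_ordinal(observations))  # ['low', 'low', 'medium', 'high']
```

A plain alphabetical sort of the same values would put 'high' first, which
is why the ordering has to come from somewhere other than the strings
themselves.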
> 
> How else might one characterize a nominal or ordinal data column? For 
> human use, information in the attributeDefinition metadata element 
> should say that the column is a set of names or an ordered set. For 
> machine based parsing, it may be much easier to specifically include a 
> statement that the attribute is a set, or nominal or ordinal.
> 
> interval, ratio - These two measurement scales both refer to 
> attributes that can be mapped to real numbers (or perhaps a subset 
> like the integers). Thus most numerical/algebraic operations can be 
> applied to interval or ratio scale items. The difference between the 
> two seems somewhat fuzzy (as illustrated by arguments about how 
> dates/times should be categorized). It seems that the 
> addition/subtraction of interval scale items is allowed, while 
> multiplicative scaling of ratio items is meaningful. The classic 
> example seems to be that temperature in degrees Celsius is 'interval', 
> but temperature in degrees Kelvin is 'ratio'. It is often stated that a
> 'ratio' scale has a "meaningful zero point" (and this seems to mean
> that negative values are not allowed).
> 
> The first mental block that I ran into in trying to understand the 
> difference between interval and ratio is the example of temperature 
> measurements. If I have a column of temperature data in degrees 
> Celsius it is an interval scale, but if I make a simple unit 
> conversion to degrees Kelvin the data changes to a ratio scale? Is the 
> zero implied by kinetic theory for the Kelvin scale really 
> "meaningful", while the temperature where water turns to ice is not? 
> Stevens measurement scales may have a use in determining what 
> statistics can be applied, but they don't seem useful for helping with 
> simple unit transformations which are likely to be required for 
> automatic data comparisons (since changing a unit can change the 
> Stevens type). It would seem better for unit conversions to have data 
> about the unitType (e.g. temperature) higher in the eml2 hierarchy.
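
The Celsius/Kelvin question above can be made concrete with a small worked
example in Python. It shows only the arithmetic: shifting the zero point
leaves differences unchanged but changes ratios, which is the usual reason
given for calling Celsius 'interval' and Kelvin 'ratio'. The function name
c_to_k and the sample readings are illustrative.

```python
import math


def c_to_k(celsius: float) -> float:
    """Convert degrees Celsius to kelvins (a shift of the zero point)."""
    return celsius + 273.15


a_c, b_c = 10.0, 20.0                  # two sample readings in Celsius
a_k, b_k = c_to_k(a_c), c_to_k(b_c)    # the same readings in kelvins

diff_c, diff_k = b_c - a_c, b_k - a_k      # differences between readings
ratio_c, ratio_k = b_c / a_c, b_k / a_k    # ratios between readings

# Differences survive the unit conversion; ratios do not.
print("differences agree:", math.isclose(diff_c, diff_k))
print("ratios agree:", math.isclose(ratio_c, ratio_k))
```

In Celsius the second reading looks like "twice" the first; in kelvins the
same two readings differ by only a few percent, so the "twice" statement
depended on where the zero was placed.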
> 
> Also, there are some things that would seem to always be non-negative 
> (e.g. size or length) and thus are ratio values. But how do we measure 
> size? One often would place a specimen next to a ruler and simply take 
> the absolute value of the difference between the reading where one end 
> hits the ruler and the reading on the other end. The zero on the ruler 
> can be quite arbitrary (interval), but when one takes the difference 
> of two interval values and then applies an absolute value function the 
> result would seem to be ratio? Does this mean that any set of interval 
> numbers can be made into a set of ratio values by a simple 
> transformation that eliminates all negative values?
> 
> I guess that as I currently understand the concepts, both interval and 
> ratio are measurements that can be mapped to real numbers and any 
> operation that can be applied to real numbers is meaningful. It seems 
> that ratio measurements are always mapped to non-negative real numbers 
> while interval measurements may be mapped to both negative and 
> positive numbers.
> 
> I thus conclude that if we know an attribute is numeric and that its 
> minimum bound is non-negative, we can conclude that it is a ratio 
> measurement. If it is numeric and it has a lower bound that is 
> negative, then one can assume that it is interval. In other words, simply 
> establishing that an attribute is a NumericDomainType and having a 
> required minimum element would eliminate the need for having interval
> or ratio elements.
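
The rule proposed in the paragraph above could be sketched roughly as
follows. This is a guess at how one might code it, not the logic of any
existing tool. The function name guess_scale is hypothetical, and the
"nominal" fallback for non-numeric attributes is an added assumption (it
mirrors the xylographa rule quoted further down the thread).

```python
from typing import Optional


def guess_scale(is_numeric: bool, minimum: Optional[float] = None) -> str:
    """Guess a Stevens scale from the domain type and declared minimum."""
    if not is_numeric:
        # Assumption: treat anything non-numeric as nominal.
        return "nominal"
    if minimum is None:
        # The rule needs a declared lower bound to say anything.
        raise ValueError("numeric attribute needs a declared minimum")
    # Non-negative lower bound -> ratio; negative lower bound -> interval.
    return "ratio" if minimum >= 0 else "interval"
```

A column of Celsius temperatures whose declared minimum happens to be 2.0
would come out as "ratio" under this rule, which illustrates the caveat
raised in the next paragraph about non-negative data where negative values
are still allowed.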
> 
> Can one determine the Stevens measurement scale from the data? Well, 
> numbers can be distinguished from non-numeric strings, but sometimes 
> numeric strings (particularly integers) are used to simply indicate 
> items in a list and sometimes item ordering. Also, a list of floating 
> point numbers may be non-negative, but negative numbers still allowed. 
> There can thus be problems in determining measurement scales from data 
> only.
> 
> But what if one looks not at the data but at a different set of 
> metadata, such as that defined by eml-beta6 (which does not include 
> the Stevens scales). That schema does include an attributeDomain 
> element which is either enumerated/text or numeric. That immediately 
> says that the attribute can be divided into nominal/ordinal or 
> interval/ratio categories. It does lack information on ordering of 
> text, but eml2 also has no information on how items are ordered. And 
> one can guess whether an attribute is interval or ratio by looking at 
> the minimum numeric value set in the metadata.
> 
> I would thus favor removing the Stevens measurement scales from eml, 
> or at least not making the values required and in-line for all 
> attributes. I think set theory terms might be better for 
> characterizing non-numeric data (ordered, unordered, finite, infinite, 
> etc) and numeric bounds, unitTypes, etc. used to characterize 
> numerical values. But then I am not an ecologist. But recent 
> discussions about how to handle dates/times seem to indicate that the 
> Stevens scales are confusing to ecologists also!
> ---
> 
> Peter McCartney wrote:
> 
>  > Scott et al.
>  >
>  > We have put some simple rules in xylographa for making best guesses
>  > since, as we all realize, you must make them if you want to be able
>  > to put in the information that you DO know such as domain info. Given
>  > that interval and ordinal scales are much less common than ratio and
>  > nominal scales in ecological data, we have made the assumption that
>  > unless the data contain negative values or the imported metadata
>  > indicates a range that includes negative values, the data are probably
>  > ratio. If the data are alphanumeric, then we assume they are nominal.
>  > None of these rules are reliable of course, and you have to be
>  > committed to checking them over after the fact. However, since we can
>  > often glean something about domain in an automated way, you have to
>  > make a choice if you want to be able to put it in the document. In
>  > retrospect, it was bad design (I can say this because I was one of the
>  > ones that suggested doing it this way!).
>  >
>  > Frankly, I'm more bothered by our failure to fix temporal coverage so
>  > that it will accept something other than a date type. We did this for
>  > other places where dates were used but not here. This has turned out
>  > to be the most aggravating flaw we've had to deal with. Any feedback
>  > on how people are handling that? The fix would be to make its domain a
>  > union of xs:string and xs:date.
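
The effect of the proposed union of xs:string and xs:date can be imitated
on the consuming side with a short Python sketch: try the value as a
calendar date first and fall back to the raw string. The function name
parse_temporal is hypothetical. As a simplification, a date carrying an
xs:date timezone suffix is not recognized as a date here and falls through
to the string branch.

```python
from datetime import date
from typing import Union


def parse_temporal(value: str) -> Union[date, str]:
    """Return a date for YYYY-MM-DD input, otherwise the original string."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        # Not a plain calendar date (e.g. a named period): keep the text.
        return value
```

With this, parse_temporal("2003-01-31") gives a date object, while a value
such as "late Holocene" is passed through unchanged as text.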
>  >
>  >
>  >
>  > Peter McCartney (peter.mccartney at asu.edu)
>  > Center for Environmental-Studies
>  > Arizona State University
>  > 
>  >
>  >
>  > -----Original Message-----
>  > From: Tim Bergsma [mailto:tbergsma at kbs.msu.edu]
>  > Sent: Thursday, January 23, 2003 10:24 AM
>  > To: Matt Jones
>  > Cc: Scott Chapal; eml-dev
>  > Subject: Re: measurementScale Was: [VOTE] should we release eml-2.0.0
>  > now?
>  >
>  >
>  > Scott,
>  >
>  > from a philosophical point of view...
>  >
>  > No, there is no way to automate choice of measurement scale.  Someone
>  > has pointed out choice of measurement scale on the Stevens typology
>  > depends in part on what you want to do with the data.  So, for
>  > instance, measurement scale can't be predicted from, say,
>  > DictionaryUnit.  This is probably a flaw in the typology, and as Matt
>  > points out, will probably persist for 2.X.  Let me know if I've
>  > misunderstood your question.
>  >
>  > Tim.
>  >
>  >
>  > Matt Jones wrote:
>  > >
>  > > Scott,
>  > >
>  > > I have no intention of making any changes to EML whatsoever. There
>  > > might be some bug fixes if they arise (e.g., spelling, mismatched
>  > > element names, etc.), but I would personally oppose any structural
>  > > change that broke backwards compatibility over the next several
>  > > years. That includes changes to measurementScale and ID's and
>  > > references elements.  At this point our focus should be on adoption
>  > > and use of the agreed EML 2.0.0 specification.  Let's produce some
>  > > metadata and make the data available!
>  > >   I know that's what we're focusing on now at NCEAS.
>  > >
>  > > Matt
>  > >
>  > > Scott Chapal wrote:
>  > > > Peter McCartney <peter.mccartney at asu.edu> writes:
>  > > >
>  > > >
>  > > >>for example, I'm not sure we won't just decide upon more feedback
>  > > >>that classifying attributes into Stevens measurement scales was a
>  > > >>mistake. if so, I'd rather scrap it than struggle to make it work.
>  > > >
>  > > >
>  > > > OK, here's some feedback.
>  > > >
>  > > > Has anyone proposed a way to automate the choice of
>  > > > measurementScale? Ironically, as a result of the 11th hour
>  > > > dateTime debate, dates and times are the easiest!
>  > > >
>  > > > We are now faced with encoding NOMINAL, ORDINAL, INTERVAL and
>  > > > RATIO into the attribute (variable) definitions of the data
>  > > > themselves...in order to provide a mechanism for auto-generating
>  > > > EML via XSL.
>  > > >
>  > > > If, however, the 2.0 version of eml-attribute (measurementScale)
>  > > > will not persist in future versions, I don't want to bother going
>  > > > to all that work.  Is measurementScale, in its current form here
>  > > > to stay?
>  > > >
>  > > > Comments?
>  > > >
>  > > >
>  > > >>I'm also not sure that we won't decide to return to triples as
>  > > >>better tools for working with them are developed in SEEK.
>  > > >
>  > > >
>  > > >
>  > >
>  > > --
>  > > *******************************************************************
>  > > Matt Jones                                    jones at nceas.ucsb.edu
>  > > http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
>  > > National Center for Ecological Analysis and Synthesis (NCEAS)
>  > >
>  > > Interested in ecological informatics? http://www.ecoinformatics.org
>  > > *******************************************************************
>  > >
>  > > _______________________________________________
>  > > eml-dev mailing list
>  > > eml-dev at ecoinformatics.org
>  > > http://www.ecoinformatics.org/mailman/listinfo/eml-dev
>  >
>  > --
>  > Tim Bergsma
>  > LTER Information Manager
>  > W.K. Kellogg Biological Station
>  > Michigan State University
>  > Hickory Corners, MI   49060
>  > 629/671-2337
>  > tbergsma at kbs.msu.edu
>  > http://lter.kbs.msu.edu
>  >
> 
> 
> --
> *******************************************************************
> Dan Higgins                                  higgins at nceas.ucsb.edu
> http://www.nceas.ucsb.edu/    Ph: 805-892-2531
> National Center for Ecological Analysis and Synthesis (NCEAS)
> 735 State Street - Room 205
> Santa Barbara, CA 93195
> *******************************************************************
> 
> 

-- 
*******************************************************************
Matt Jones                                    jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)

Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************
