more questions on eml2rc2

Matt Jones jones at nceas.ucsb.edu
Mon Oct 28 14:09:11 PST 2002


Barbara,

Thanks once again for excellent comments.  You saved us again!

Barbara Benson wrote:
> 
> 
> Hi Matt,
> 
> More issues and questions have arisen as we have worked responding to 
> your feedback and to bug reports.  We would appreciate more comments 
> from you.
> 
> (1) We have been interpreting measurement scale categories from standard 
> definitions used discussing measurement scales in statistics books.  The 
> term “ratio-scale level” implies a certain relationship between measured 
> values and true magnitudes.  Let t(o) = the true magnitude of object o 
> and m(o) = the measured value.  Then measurement m is ratio-scale if it 
> satisfies the following condition:
> t(o) = x if and only if m(o) = ax, where a>0
> 
> Thus, a measurement of the length of a fish in millimeters is ratio 
> scale (whereas a temperature reading is interval scale).  The discussion 
> in bugzilla:  “Srictly speaking only interval scales have units, the 
> rest are dimensionless. In practice, there is still some value of 
> knowing the units of the denominator and/or numerator in ratios of two 
> dimensions,…” seems to be implying that ratio scale means that the value 
> involved is a ratio.  I don’t think this interpretation is a standard 
> use of the term.

You're absolutely right.  We had the definition wrong.  I found a good, 
approachable summary on the following web site: 
http://www.math.sfu.ca/~cschwarz/Stat-301/Handouts/node5.html
This means to me that ratio scale attributes should indeed have unit, 
precision, and attributeDomain specified.  We will make this change.
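
The distinction can be checked numerically.  The following is a minimal 
Python sketch, not anything from EML itself; the function names are 
hypothetical and chosen only for illustration.  It relies on the 
definition quoted above: a ratio-scale measurement is a pure rescaling 
m(o) = a * t(o) with a > 0, so the ratio of two measurements survives a 
change of units, whereas an interval-scale conversion also shifts the 
zero point and so changes the ratio.

```python
import math


def preserves_ratios(convert, x, y):
    """Return True if applying `convert` leaves the ratio x / y unchanged."""
    return math.isclose(convert(x) / convert(y), x / y)


def mm_to_cm(mm):
    # Pure rescaling, m = a * t with a = 0.1: ratio scale.
    return 0.1 * mm


def celsius_to_fahrenheit(c):
    # Rescaling plus an offset: the zero point is arbitrary, so interval scale.
    return 1.8 * c + 32.0


# A 200 mm fish is twice as long as a 100 mm fish in any length unit.
print(preserves_ratios(mm_to_cm, 200.0, 100.0))

# 20 C is not "twice as warm" as 10 C: 68 F / 50 F is not 2.
print(preserves_ratios(celsius_to_fahrenheit, 20.0, 10.0))
```

The first check passes and the second fails, which is the sense in which 
fish length is ratio scale and a Celsius temperature reading is interval 
scale.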

> (2) We have been thinking about the physical module.  I am realizing 
> that we may have many uses for describing data in EML.  (1) In some 
> cases such as the collaboration with SDSC on ClimDB what we want to 
> describe is an actual Oracle table that will be queried by a harvester 
> web service.  This dataFormat is an ExternallyDefinedFormat.  (2) For 
> our dynamic database query system that provides access to data via the 
> NTL-LTER web page, the user has a choice of output formats for the 
> query.  I think we want to describe the data format option for EML 
> output of the data along with the metadata.  The data will be 
> serialized.  Would this description fit under ExternallyDefinedFormat?  
> (3) In the case where our EML metadata reside in Metacat, ideally we 
> would want to point people to our dynamic database query system for 
> distribution of the data.  We could provide an URL to get people into 
> the system but this system allows the end user to choose between csv 
> text format, Excel, and EML as output formats.  How would our metadata 
> in Metacat accommodate this richness?  Would we be forced to point 
> people at a fixed output file?

Distribution is repeatable specifically to allow someone to indicate 
that a given physical stream is available in multiple locations. 
However, in the current model, only one physical format can be described 
for each entity. So, if you have two ways of getting a logical entity, 
one of which produces a text file and one produces an excel file, the 
current structure is not sufficient.  This was always intended as a 
possibility, and was simply lost in the conversion along the way.  I 
will recommend that "physical" be made repeatable to support the 
multiple physical incarnations of an entity.

In terms of describing the NTL-LTER web page, you've got some choices. 
The NTL web site is an 'application' that determines a specific set of 
interactions between the client and your server needed to access a 
specific data set.  You can either, 1) describe that connection in a 
general way, making it essentially an informative description but not 
intended for machine processing, or 2) provide all of the info needed to 
directly connect and download a data stream.  If you do (2), then the 
information about the data format should precisely describe what the 
user gets back.  If it is a well-known binary format like Excel, you'll 
probably want to use externallyDefinedFormat.  If it's some text stream, 
use textFormat.  If the user can get back two different physical 
formats, you really need to provide different physical descriptors for 
each (which isn't possible now but will be by the end of today :)
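
To make the "one entity, several physical descriptors" idea concrete, 
here is a rough Python sketch of the data model being proposed.  It is 
only an illustration: the class names, field names, and example.org URLs 
are assumptions made up for this sketch and are not the EML schema.  
Only the terms textFormat and externallyDefinedFormat come from the 
discussion above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Physical:
    """One physical incarnation of an entity (hypothetical structure)."""
    format_kind: str   # "textFormat" or "externallyDefinedFormat"
    format_name: str   # human-readable name of the format
    # Distribution is repeatable: the same stream may live in several places.
    distribution_urls: List[str] = field(default_factory=list)


@dataclass
class Entity:
    """A logical entity that can be obtained in more than one physical form."""
    name: str
    physicals: List[Physical] = field(default_factory=list)


def formats_available(entity):
    """List the format names a client could request for this entity."""
    return [p.format_name for p in entity.physicals]


# Placeholder entity with two physical descriptors, one per output format.
query_result = Entity(
    name="example query result",
    physicals=[
        Physical("textFormat", "csv", ["http://example.org/data.csv"]),
        Physical("externallyDefinedFormat", "Excel",
                 ["http://example.org/data.xls"]),
    ],
)

print(formats_available(query_result))
```

Each Physical carries its own format description and its own list of 
locations, so a text stream and an Excel file for the same logical 
entity never have to share a single format description.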

> (3) We are trying to define our custom units according to stmml.  Could 
> you provide an example as to how the stmml unit definitions should be 
> included and referenced within the additional metadata section?

There's an example shipped with EML in the lib/sample/eml-sample.xml 
document that includes the definition of custom units in 
additionalMetadata.  Our recent discussions on STMML indicate it may 
need revision, but the basic idea is there in the example.

> (4) We are grappling with the numeric attribute domain.  Some 
> measurements have a logical range over which they are defined like pH.  
> How do you recommend dealing with something like average daily air 
> temperature?

I think you are referring to the definition of the domain.  There are 
two aspects of domain you should consider.  First, what is the domain of 
the quantity in the general sense.  For example, no matter what you are 
measuring, all length measures should be a real number >= 0, and it is 
highly likely that all velocities will be <= speed of light.  Other 
quantities might be less restrictive (e.g., all real numbers, or all 
integer numbers), or more restrictive.  The second part is to consider 
the quantity in the particular sense.  Here you can restrict the values 
to the actual domain of the measurement, rather than the general 
measurement type.  For example, although length measures are simple real 
numbers greater than 0, a reasonable domain for measurements of the 
length of ant bodies might be more like 0 cm <= x <= 10 cm.  This is, of 
course, an approximation.  You should be sure that all conceivable (and 
some inconceivable) values for that measure fall within your domain, not 
just highly likely ones.  For example, although most ants are < 1 cm, 
it's not reasonable to set the domain maximum at 1 cm because there is a 
small possibility of there being one > 1 cm.  This of course requires 
substantial scientific knowledge.  It's better to leave the domain at the 
level of measurementType if you can't accurately restrict it.

So, for air temperature, I would look for a domain that includes all 
values that are even remotely possible under most or all global warming 
scenarios, etc.  You might come up with something on the order of -50 C 
to 50 C for Wisconsin, although I don't really know.  After 100 years of 
global warming from today, what is the highest conceivable air 
temperature? That should get you in the ballpark.  But it's really a 
scientific question, not a technology question.
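
The two levels of domain can be sketched in a few lines of Python.  This 
is an illustrative sketch only: NumericDomain and out_of_domain are 
hypothetical names, not part of EML, and the bounds are the ant-length 
numbers used as an example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NumericDomain:
    """Closed numeric range; None means unbounded on that side."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def contains(self, value):
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


def out_of_domain(values, domain):
    """Return the values that fall outside the given domain."""
    return [v for v in values if not domain.contains(v)]


# General sense: any length is a real number >= 0.
any_length_cm = NumericDomain(minimum=0.0)

# Particular sense: ant body lengths, generously bounded at 10 cm.
ant_length_cm = NumericDomain(minimum=0.0, maximum=10.0)

measurements_cm = [0.4, 1.3, 12.0]
print(out_of_domain(measurements_cm, any_length_cm))
print(out_of_domain(measurements_cm, ant_length_cm))
```

Note that the 1.3 cm ant is accepted by the particular domain.  A 
maximum of 1 cm would have flagged it, which is why the bound should 
cover every conceivable value rather than just the likely ones.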

> Thank you for your help.  I really appreciated your thoughtful replies 
> to our last set of questions.
> 
> Barbara
> 
> 

Sure thing. And thank you for your excellent and thorough review of EML. 
It has been extremely valuable.

Matt

-- 
*******************************************************************
Matt Jones                                    jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)

Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************



