more questions on eml2rc2
Matt Jones
jones at nceas.ucsb.edu
Mon Oct 28 14:09:11 PST 2002
Barbara,
Thanks once again for excellent comments. You saved us again!
Barbara Benson wrote:
>
>
> Hi Matt,
>
> More issues and questions have arisen as we have worked on responding to
> your feedback and to bug reports. We would appreciate more comments
> from you.
>
> (1) We have been interpreting measurement scale categories from standard
> definitions used in discussing measurement scales in statistics books. The
> term “ratio-scale level” implies a certain relationship between measured
> values and true magnitudes. Let t(o) = the true magnitude of object o
> and m(o) = the measured value. Then measurement m is ratio-scale if it
> satisfies the following condition:
> t(o) = x if and only if m(o) = ax, where a>0
>
> Thus, a measurement of the length of a fish in millimeters is ratio
> scale (whereas a temperature reading is interval scale). The discussion
> in bugzilla: “Strictly speaking only interval scales have units, the
> rest are dimensionless. In practice, there is still some value of
> knowing the units of the denominator and/or numerator in ratios of two
> dimensions,…” seems to be implying that ratio scale means that the value
> involved is a ratio. I don’t think this interpretation is a standard
> use of the term.
You're absolutely right. We had the definition wrong. I found a good,
approachable summary on the following web site:
http://www.math.sfu.ca/~cschwarz/Stat-301/Handouts/node5.html
This means to me that ratio scale attributes should indeed have unit,
precision, and attributeDomain specified. We will make this change.
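The distinction Barbara draws can be checked numerically. The following is a minimal Python sketch, not part of EML; the function names are made up for illustration. It assumes that a ratio-scale change of unit is a pure multiplication (m = a*x with a > 0), while an interval-scale change of unit also adds an offset, so ratios of values survive only in the ratio-scale case.

```python
def length_mm_to_cm(mm):
    # Ratio-scale change of unit: pure scaling, m = a * x with a > 0.
    return mm / 10.0


def celsius_to_fahrenheit(c):
    # Interval-scale change of unit: scaling plus an offset,
    # so the zero point is arbitrary.
    return c * 9.0 / 5.0 + 32.0


def ratio_preserved(convert, x1, x2, tol=1e-9):
    """Return True if x1/x2 is unchanged after converting both values."""
    return abs(x1 / x2 - convert(x1) / convert(x2)) < tol


# A 200 mm fish is twice as long as a 100 mm fish in any length unit.
print(ratio_preserved(length_mm_to_cm, 200.0, 100.0))      # True
# 20 C is not "twice as warm" as 10 C: the ratio changes in Fahrenheit.
print(ratio_preserved(celsius_to_fahrenheit, 20.0, 10.0))  # False
```

Both kinds of measurement carry a unit here, which is consistent with requiring unit, precision, and attributeDomain for ratio-scale attributes.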
> (2) We have been thinking about the physical module. I am realizing
> that we may have many uses for describing data in EML. (1) In some
> cases such as the collaboration with SDSC on ClimDB what we want to
> describe is an actual Oracle table that will be queried by a harvester
> web service. This dataFormat is an ExternallyDefinedFormat. (2) For
> our dynamic database query system that provides access to data via the
> NTL-LTER web page, the user has a choice of output formats for the
> query. I think we want to describe the data format option for EML
> output of the data along with the metadata. The data will be
> serialized. Would this description fit under ExternallyDefinedFormat?
> (3) In the case where our EML metadata reside in Metacat, ideally we
> would want to point people to our dynamic database query system for
> distribution of the data. We could provide a URL to get people into
> the system but this system allows the end user to choose between csv
> text format, Excel, and EML as output formats. How would our metadata
> in Metacat accommodate this richness? Would we be forced to point
> people at a fixed output file?
Distribution is repeatable specifically to allow someone to indicate
that a given physical stream is available in multiple locations.
However, in the current model, only one physical format can be described
for each entity. So, if you have two ways of getting a logical entity,
one of which produces a text file and one produces an excel file, the
current structure is not sufficient. This was always intended as a
possibility, and was simply lost in the conversion along the way. I
will recommend that "physical" be made repeatable to support the
multiple physical incarnations of an entity.
In terms of describing the NTL-LTER web page, you've got some choices.
The NTL web site is an 'application' that determines a specific set of
interactions between the client and your server needed to access a
specific data set. You can either, 1) describe that connection in a
general way, making it essentially an informative description but not
intended for machine processing, or 2) provide all of the info needed to
directly connect and download a data stream. If you do (2), then the
information about the data format should precisely describe what the
user gets back. If it is a well-known binary format like Excel, you'll
probably want to use externallyDefinedFormat. If it's some text stream,
use textFormat. If the user can get back two different physical
formats, you really need to provide different physical descriptors for
each (which isn't possible now but will be by the end of today :)
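To make the proposed structure concrete, here is a rough Python sketch of one logical entity carrying several physical descriptions, each with its own format and its own repeatable distribution locations. It is only a data-model illustration, not EML syntax: the class names, field names, entity name, and example.org URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Physical:
    # Hypothetical stand-in for one "physical" description of an entity.
    format_kind: str   # "textFormat" or "externallyDefinedFormat"
    format_name: str   # e.g. "csv" or "Microsoft Excel"
    urls: List[str] = field(default_factory=list)  # distribution is repeatable


@dataclass
class Entity:
    # One logical entity with zero or more physical incarnations.
    name: str
    physicals: List[Physical] = field(default_factory=list)

    def formats(self):
        """List the format names in which this entity can be obtained."""
        return [p.format_name for p in self.physicals]


entity = Entity(
    name="example_table",
    physicals=[
        Physical("textFormat", "csv",
                 ["http://example.org/query?format=csv"]),
        Physical("externallyDefinedFormat", "Microsoft Excel",
                 ["http://example.org/query?format=xls"]),
    ],
)
print(entity.formats())  # ['csv', 'Microsoft Excel']
```

Each Physical describes exactly what the user gets back from one access path, which is the condition for option (2) above.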
> (3) We are trying to define our custom units according to stmml. Could
> you provide an example as to how the stmml unit definitions should be
> included and referenced within the additional metadata section?
There's an example shipped with EML in the lib/sample/eml-sample.xml
document that includes the definition of custom units in
additionalMetadata. Our recent discussions on STMML indicate it may
need revision, but the basic idea is there in the example.
> (4) We are grappling with the numeric attribute domain. Some
> measurements have a logical range over which they are defined like pH.
> How do you recommend dealing with something like average daily air
> temperature?
I think you are referring to the definition of the domain. There are
two aspects of domain you should consider. First, what is the domain of
the quantity in the general sense. For example, no matter what you are
measuring, all length measures should be a real number >= 0, and it is
highly likely that all velocities will be <= speed of light. Other
quantities might be less restrictive (e.g., all real numbers, or all
integer numbers), or more restrictive. The second part is to consider
the quantity in the particular sense. Here you can restrict the values
to the actual domain of the measurement, rather than the general
measurement type. For example, although length measures are simple real
numbers greater than 0, a reasonable domain for measurements of the
length of ant bodies might be more like 0 cm <= x <= 10 cm. This is, of
course, an approximation. You should be sure that all conceivable (and
some inconceivable) values for that measure fall within your domain, not
just highly likely ones. For example, although most ants are < 1 cm,
it's not reasonable to set the domain maximum at 1 cm because there is a
low possibility of there being one > 1 cm. This of course requires
substantial scientific knowledge. It's better to leave the domain at the
level of measurementType if you can't accurately restrict it.
So, for air temperature, I would look for a domain that includes all
values that are even remotely possible under most or all global warming
scenarios, etc. You might come up with something on the order of -50 C
to 50 C for Wisconsin, although I don't really know. After 100 years of
global warming from today, what is the highest conceivable air
temperature? That should get you in the ballpark. But it's really a
scientific question, not a technology question.
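The two levels of domain can be sketched in a few lines of Python. This is only an illustration of the idea, not EML syntax; the NumericDomain class is hypothetical, and the bounds are the rough figures mentioned above (0 to 10 cm for ant bodies, -50 C to 50 C as an admitted guess for Wisconsin air temperature).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericDomain:
    # Hypothetical helper: a closed interval, with None meaning "unbounded".
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def contains(self, value):
        """Return True if value lies within the domain bounds."""
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# General sense: any length is a real number >= 0.
length_general = NumericDomain(minimum=0.0)
# Particular sense: ant body length in cm, generous enough for outliers.
ant_length_cm = NumericDomain(minimum=0.0, maximum=10.0)
# Particular sense: daily air temperature in C (a guess, per the text).
air_temp_c = NumericDomain(minimum=-50.0, maximum=50.0)

print(ant_length_cm.contains(1.5))    # True: an unusually large ant still fits
print(ant_length_cm.contains(12.0))   # False: outside the particular domain
print(length_general.contains(12.0))  # True: still a valid length in general
print(air_temp_c.contains(60.0))      # False
```

A value rejected by the particular domain but accepted by the general one is the case to worry about: it means either a data error or a domain that was set too tightly.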
> Thank you for your help. I really appreciated your thoughtful replies
> to our last set of questions.
>
> Barbara
>
>
Sure thing. And thank you for your excellent and thorough review of EML.
It has been extremely valuable.
Matt
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************