Measurement scale in EML

Fri Mar 4 15:26:45 PST 2005

Hi,

Your proposed structure seems acceptable.  It's not ideal, but it's better 
than nothing.  None of the existing metadata standards are structured to 
accommodate such an abstracted relational model as yours, so there's not 
much more you can do if you don't want to produce views in a more 
typical data format.  The real shame is that you aren't using the EML 
unit dictionary, so you aren't linking your units tightly to the 
semantically defined units in EML -- this significantly reduces the 
utility of the metadata for automated processing.  A mapping between 
your unit names and the standardized EML unit names would help, although 
it would probably still be hard to make use of through automated processing.
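
A mapping of that kind might look roughly like the sketch below, which 
reuses your varUnit attribute and states the EML name in each 
<definition>.  The names celsius and kilogramsPerCubicMeter are given 
from memory and should be checked against the unit dictionary; the PPT 
entry is only a placeholder, on the assumption that the dictionary has 
no entry for parts per thousand.

```xml
<attribute>
   <attributeName>varUnit</attributeName>
   <attributeDefinition>Unit of chemical or physical property measured</attributeDefinition>
   <storageType>String</storageType>
   <measurementScale>
       <nominal>
           <nonNumericDomain>
               <enumeratedDomain>
                   <codeDefinition>
                       <code>C</code>
                       <!-- Assumed dictionary name; verify before use -->
                       <definition>EML standard unit: celsius</definition>
                   </codeDefinition>
                   <codeDefinition>
                       <code>KG/M**3</code>
                       <!-- Assumed dictionary name; verify before use -->
                       <definition>EML standard unit: kilogramsPerCubicMeter</definition>
                   </codeDefinition>
                   <codeDefinition>
                       <code>PPT</code>
                       <!-- Placeholder: no standard unit name is assumed for this code -->
                       <definition>Parts per thousand (no EML standard unit assumed)</definition>
                   </codeDefinition>
               </enumeratedDomain>
           </nonNumericDomain>
       </nominal>
   </measurementScale>
</attribute>
```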

Matt

Xiaoping Wang wrote:
> Dear Matt and Peter,
> 
> Thank you very much for your useful input.  The following is the 
> revised piece of my EML document.  Please note that I am not using the 
> values from standardUnit/unitDictionary to describe the units of the 
> variables.  I think the unitDictionary is used when you use the <unit> 
> element to describe the unit.  Here I use whatever actually appears 
> in the varUnit column of the table as the "code" (for example, PPT as 
> the code for the unit of Salinity) for the <code> element and give 
> further explanation of the unit in the <codeDefinition> element.  
> Please let me know if you have further advice.
> 
> <attribute>
>    <attributeName>varName</attributeName>
>    <attributeDefinition>Name of chemical or physical property 
> measured</attributeDefinition>
>    <storageType>String</storageType>
>    <measurementScale>
>        <nominal>
>            <nonNumericDomain>
>                <enumeratedDomain>
>                    <codeDefinition>
>                        <code>T</code>
>                        <definition>Temperature</definition>
>                    </codeDefinition>
>                    <codeDefinition>
>                        <code>S</code>
>                        <definition>Salinity</definition>
>                    </codeDefinition>
>                    <codeDefinition>
>                        <code>ST</code>
>                        <definition>Sigma-T</definition>
>                    </codeDefinition>
>                </enumeratedDomain>
>            </nonNumericDomain>
>        </nominal>
>    </measurementScale>
> </attribute>
> <attribute>
>    <attributeName>varUnit</attributeName>
>    <attributeDefinition>Unit of chemical or physical property 
> measured</attributeDefinition>
>    <storageType>String</storageType>
>    <measurementScale>
>        <nominal>
>            <nonNumericDomain>
>                <enumeratedDomain>
>                    <codeDefinition>
>                        <code>C</code>
>                        <definition>Degree, the unit for 
> Temperature</definition>
>                    </codeDefinition>
>                    <codeDefinition>
>                        <code>PPT</code>
>                        <definition>Unit for Salinity</definition>
>                    </codeDefinition>
>                    <codeDefinition>
>                        <code>KG/M**3</code>
>                        <definition>Kilogram per cubic meter, the unit 
> for Sigma-T</definition>
>                    </codeDefinition>
>                </enumeratedDomain>
>            </nonNumericDomain>
>        </nominal>
>    </measurementScale>
> </attribute>
> 
> Thank you!
> 
> Xiaoping Wang
> 
> PMEL / NOAA
> 
> Matt Jones wrote:
> 
>> Hi Xiaoping,
>>
>> As Peter mentioned, your problems have arisen before.  See below for 
>> some additional recommendations beyond Peter's from my personal 
>> perspective.
>>
>> Xiaoping Wang wrote:
>>
>>> Dear Matt and Peter:
>>>
>>> I have seen a lot of discussion recently about measurement scale 
>>> and temporal coverage.  It has been very helpful for our 
>>> understanding of EML.  The following are questions and concerns that 
>>> came up during my work on our EML-based metadata.
>>>
>>> 1. About the Measurement scale
>>>
>>> The measurementScale is a little bit confusing.  I spent a lot of 
>>> time working on the measurementScale for nominal data.  Here I want 
>>> to give you an example of how I use the measurementScale to 
>>> describe nominal data in our dataset, so you can see whether my 
>>> implementation is based on a correct understanding of this element.
>>>
>>> We have a data table with four columns (attributes): recordID, 
>>> variable_name, variable_unit, and variable_value.  The values in the 
>>> variable_name column name the measured chemical and physical 
>>> properties of sea water, such as temperature, salinity, 
>>> nitrate...  The following is a sample piece of my EML file for 
>>> this dataset.
>>> <attribute>
>>>      <attributeName>varName</attributeName>
>>>      <attributeDefinition>Name of chemical or physical property 
>>> measured</attributeDefinition>
>>>      <storageType>String</storageType>
>>>      <measurementScale>
>>>          <nominal>
>>>              <nonNumericDomain>
>>>                  <enumeratedDomain>
>>>                      <codeDefinition>
>>>                          <code>T</code>
>>>                          <definition>Temperature, unit: C</definition>
>>>                      </codeDefinition>
>>>                      <codeDefinition>
>>>                          <code>S</code>
>>>                          <definition>Salinity, unit: PPT</definition>
>>>                      </codeDefinition>
>>>                      <codeDefinition>
>>>                          <code>ST</code>
>>>                          <definition>Sigma-T, unit: KG/M**3</definition>
>>>                      </codeDefinition>
>>>                  </enumeratedDomain>
>>>              </nonNumericDomain>
>>>          </nominal>
>>>      </measurementScale>
>>> </attribute>
>>> <attribute>
>>>      <attributeName>varUnit</attributeName>
>>>      <attributeDefinition>Unit of chemical or physical property 
>>> measured</attributeDefinition>
>>>      <storageType>String</storageType>
>>>      <measurementScale>
>>>          <nominal>
>>>              <nonNumericDomain>
>>>                  <textDomain>
>>>                      <definition>*</definition>
>>>                  </textDomain>
>>>              </nonNumericDomain>
>>>          </nominal>
>>>      </measurementScale>
>>> </attribute>
>>>
>>> My questions / concerns are:
>>> (1) Is it suitable to use enumeratedDomain element to describe varName?
>>
>>
>> Yes, that is fine, although if you wanted it to be free text that 
>> would be ok too (just use textDomain instead of enumeratedDomain).  
>> Encoding the unit information in the variable name is somewhat 
>> repetitive if you have the same unit information in the varUnit column.
>>
>>>
>>> (2) For the varUnit, I don't think it is necessary to include a 
>>> measurementScale element.  However, since measurementScale is a 
>>> required field, I have to put something there in order to pass 
>>> EML validation.  So I put a "*" in the definition element.  I 
>>> have seen other similar cases in which EML metadata 
>>> developers use a "*" for the definition element.  Obviously, the 
>>> measurementScale content described here conveys no useful information 
>>> about the varUnit.
>>
>>
>> The use of the '*' is inappropriate.  The field is required because 
>> the authors of EML thought the information was important.  In this 
>> case, I think you should put in the definition something that 
>> indicates that the values are names of units.  One major thing that is 
>> missing here is that you don't use the EML Unit Dictionary when 
>> choosing your unit definitions.  This eliminates the major advantage 
>> of EML in being able to provide quantitative information about units.  
>> If there is a 1:1 correspondence between your units and the EML unit 
>> dictionary, I think it would be good if you defined varUnit as an 
>> enumerated domain and, for each of your units, provided the EML 
>> standard name for the unit in the definition.  This would help in 
>> translating, although it is unlikely that anyone could use this in 
>> automated systems because it's such a non-standard use of the EML 
>> descriptors.
>>
>> In general, this model of variablename, varunit, value is a 
>> non-standard use of the relational model as the attributes do not 
>> really represent a single type.  The relational model is generally 
>> intended to have attributes that contain a semantically homogeneous set 
>> of values.  In your case this is not true, unless considered from a 
>> meta-level.  So, I think you are using the relational model as a 
>> schema language itself. This significantly complicates use of the data 
>> in standard analytical systems (e.g., SAS, Splus, R, Matlab) -- they 
>> basically all require different views of the data as described in 
>> Peter's note.  Personally I think that documenting these more 
>> traditional views if you have them would be far more useful to 
>> scientists who wish to analyze the data. That would have the added 
>> benefit of being better described by EML structures.  Documenting your 
>> "meta-level" schema isn't particularly informative because the 
>> information in one attribute is so heterogeneous.
>>
>>>
>>> 2. About the information of metadata itself
>>>
>>> Based on my understanding of the EML schemas, the only information 
>>> associated with the metadata itself is the information about the 
>>> metadata provider(s).  However, my supervisors and I think that it is 
>>> important to provide other metadata information, such as when the 
>>> metadata document was created, whether further updates of the 
>>> metadata are needed, and if so, the metadata update frequency and 
>>> the date of the last update.  Those pieces of information are 
>>> particularly important when the endDate value for a 
>>> dataset from an ongoing project is going to change: first, they 
>>> can remind metadata providers / developers when they should update 
>>> their metadata, and second, they can tell metadata users whether the 
>>> metadata document provides the most current information about the 
>>> dataset described.
>>
>>
>> Sure.  In hindsight, I think we should have included these metadata 
>> information fields, particularly the timestamp fields.  But we do have 
>> some related fields that describe ongoing data collection.  Take a 
>> look at /eml/dataset/maintenance/description and 
>> /eml/dataset/maintenance/maintenanceUpdateFrequency.  The latter is 
>> probably what you want.  Any fields that you want but that don't exist 
>> in the schema can be put in the "/eml/additionalMetadata" field, so 
>> you always have that as a recourse.  If you have specific 
>> recommendations for fields that are needed you could send them to 
>> eml-dev at ecoinformatics.org and we'll try to get them into plans for a 
>> future release.
>>
>>>
>>> 3. About the temporal coverage
>>>
>>> We have many metadata records with an uncertain endDate because new 
>>> data are continuously being loaded into the dataset.  Whenever new 
>>> data are loaded, we have to change the values for end date, number of 
>>> records, and/or size of table...  I am wondering when you can 
>>> provide a solution for this issue.
>>
>>
>> Personally I think this is a good thing.  At any given point in time 
>> there is a finite amount of data available, and the metadata should 
>> describe that.  If you have an automated data collection process, then 
>> you would simply have to update your metadata as part of that process. 
>> The number of records, table size, and checksum are useful when people 
>> get your data to validate that they got the data without error.  The 
>> end date for temporal coverage provides valuable discovery 
>> information, and should simply be made to match the data that you 
>> release.
>>
>>>
>>> In addition, I learned from John's email that you held a KNB data 
>>> management workshop early this year.  I am very interested in this 
>>> kind of workshop, particularly workshops on the use of 
>>> Metacat.  If you hold this type of workshop in the future, please let 
>>> me know.
>>
>>
>> Yeah, we had one in February.  We announce these opportunities on 
>> various web sites and mailing lists.  You should subscribe to 
>> ecoinfo at ecoinformatics.org and watch http://seek.ecoinformatics.org in 
>> particular for announcements.
>>
>> Like Peter I also recommend that you get involved in the ongoing 
>> improvements related to EML.  Your feedback and contributions would be 
>> extremely valuable.  Good luck.  Let us know if you have more questions.
>>
>> Matt
>>
>>>
>>> Thank you very much for your support!
>>>
>>> Xiaoping Wang
>>>
>>> PMEL /NOAA
>>>
>>>
>>>
>>>
>>>
>>>
>>

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------