Recommendation about information associated with EML metadata document itself

Mon Feb 28 12:13:33 PST 2005

Dear eml-dev:

As mentioned in my email to Matt and Peter (see below), providing
the necessary timestamp information for the EML metadata document is
important not only to metadata generators but also to metadata users.
Both /eml/dataset/maintenance/description and
/eml/dataset/maintenance/maintenanceUpdateFrequency in the EML schemas
describe the dataset, not the metadata document itself.
Although we can use /eml/additionalMetadata to say something about the
metadata document, I believe that the timestamp information about the
EML metadata document is so important that it needs to be highlighted.
The following is my recommendation for how more information about the
metadata itself could be provided.

In the /eml/dataset/res:ResourceGroup, instead of using metadataProvider 
as one of the elements, use metadataInformation as suggested below:
<metadataInformation>
    A sequence of
        <metadataProvider>        required
            (comment: information about the metadata providers is
            listed here)
        <metadataCreationDate>    required
            (comment: the date when the metadata document was
            originally created)
        <metadataMaintenance>     optional
            (comment: this element is used when the metadata document
            will need to be updated in the future)
            A sequence of
                <lastUpdateDate>      required
                    (comment: the date of the last metadata update)
                <oldValue>            required
                    (comment: for example, the endDate for
                    rangeOfDates, numberOfRecords for an entity
                    (table), size of an entity (table), etc.  These
                    values change after new data are loaded into the
                    dataset)
                <updateFrequency>     required
                    (comment: by comparing updateFrequency and
                    lastUpdateDate, metadata developers know when they
                    need to update their metadata document, and
                    metadata users know whether the metadata document
                    describes the most current information about the
                    dataset)
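
As an illustration of the last comment, the comparison of the update
frequency with the last update date could be sketched in Python as
follows.  This is only a sketch under assumptions: the frequency
keywords, the intervals assigned to them, and the function names are
hypothetical and are not part of any EML schema.

```python
from datetime import date, timedelta

# Hypothetical mapping from an update-frequency keyword to the longest
# interval allowed between two metadata updates.  Both the keywords and
# the intervals are assumptions made for this illustration.
FREQUENCY_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=31),
    "annually": timedelta(days=366),
}


def next_update_due(last_update: date, frequency: str) -> date:
    """Return the date by which the metadata should be updated again."""
    return last_update + FREQUENCY_INTERVALS[frequency]


def is_metadata_stale(last_update: date, frequency: str, today: date) -> bool:
    """Return True if the due date for the next update has already passed."""
    return today > next_update_due(last_update, frequency)
```

A metadata developer could run such a check on a schedule to find
documents that are due for an update, and a metadata user could run it
to judge whether a document may lag behind the dataset it describes.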

These are the elements that I think should be provided in an
EML metadata document.  I hope my recommendation helps.

Thank you very much for your support.

Xiaoping Wang

PMEL /NOAA

Matt Jones wrote:

> Hi Xiaoping,
>
> As Peter mentioned, your problems have arisen before.  See below for 
> some additional recommendations beyond Peter's from my personal 
> perspective.
>
> Xiaoping Wang wrote:
>
>> Dear Matt and Peter:
>>
>> I have seen a lot of discussion recently about measurement
>> scale and temporal coverage.  It is very helpful for our
>> understanding of EML.  The following are the questions and concerns
>> that arose during my work on our EML-based metadata.
>>
>> 1. About the Measurement scale
>>
>> The measurementScale is a little bit confusing.  I spent a lot of
>> time working on the measurementScale for nominal data.  Here I want
>> to give you an example of how I use the measurementScale to
>> describe nominal data in our dataset, so you can see whether my
>> implementation is based on a correct understanding of this element.
>>
>> We have a data table with four columns (attributes): recordID,
>> variable_name, variable_unit, and variable_value.  The values for the
>> variable_name column include certain measurements of the chemical
>> and physical properties of sea water such as temperature, salinity,
>> nitrate, etc.  The following is a sample piece of my EML file for
>> this dataset.
>> <attribute>
>>     <attributeName>varName</attributeName>
>>     <attributeDefinition>Name of chemical or physical property
>>     measured</attributeDefinition>
>>     <storageType>String</storageType>
>>     <measurementScale>
>>         <nominal>
>>             <nonNumericDomain>
>>                 <enumeratedDomain>
>>                     <codeDefinition>
>>                         <code>T</code>
>>                         <definition>Temperature, unit: C</definition>
>>                     </codeDefinition>
>>                     <codeDefinition>
>>                         <code>S</code>
>>                         <definition>Salinity, unit: PPT</definition>
>>                     </codeDefinition>
>>                     <codeDefinition>
>>                         <code>ST</code>
>>                         <definition>Sigma-T, unit: KG/M**3</definition>
>>                     </codeDefinition>
>>                 </enumeratedDomain>
>>             </nonNumericDomain>
>>         </nominal>
>>     </measurementScale>
>> </attribute>
>> <attribute>
>>     <attributeName>varUnit</attributeName>
>>     <attributeDefinition>Unit of chemical or physical property
>>     measured</attributeDefinition>
>>     <storageType>String</storageType>
>>     <measurementScale>
>>         <nominal>
>>             <nonNumericDomain>
>>                 <textDomain>
>>                     <definition>*</definition>
>>                 </textDomain>
>>             </nonNumericDomain>
>>         </nominal>
>>     </measurementScale>
>> </attribute>
>>
>> My questions / concerns are:
>> (1) Is it suitable to use the enumeratedDomain element to describe varName?
>
> Yes, that is fine, although if you wanted it to be free text that 
> would be ok too (just use textDomain instead of enumeratedDomain).  
> Encoding the unit information in the variable name is somewhat 
> repetitive if you have the same unit information in the varUnit column.
>
>>
>> (2) For the varUnit, I don't think it is necessary to include the
>> measurementScale element.  However, since the measurementScale is a
>> required field, I have to put something there in order to pass
>> EML validation.  So I put a "*" sign in the definition element.  I
>> have seen some other similar cases in which EML metadata
>> developers use a "*" for the definition element.  Obviously, the
>> measurementScale content described here conveys no useful information
>> about the varUnit.
>
> The use of the '*' is inappropriate.  The field is required because 
> the authors of EML thought the information was important.  In this 
> case, I think you should put in the definition something that 
> indicates that the values are names of units.  One major thing that is
> missing here is that you don't use the EML Unit Dictionary when
> choosing your unit definitions.  This eliminates the major advantage
> of EML in being able to provide quantitative information about units.
> If there is a 1:1 correspondence between your units and the EML unit
> dictionary, I think it would be good if you defined varUnit as an
> enumerated domain and for each of your units provided the EML standard
> name for the unit in the definition.  This would help in translating,
> although it is unlikely that anyone could use this in automated
> systems because it's such a non-standard use of the EML descriptors.
>
> In general, this model of variablename, varunit, value is a 
> non-standard use of the relational model as the attributes do not 
> really represent a single type.  The relational model is generally 
> intended to have attributes that contain a semantically homogeneous set
> of values.  In your case this is not true, unless considered from a
> meta-level.  So, I think you are using the relational model as a
> schema language itself. This significantly complicates use of the data
> in standard analytical systems (e.g., SAS, Splus, R, Matlab) -- they
> basically all require different views of the data as described in 
> Peter's note.  Personally I think that documenting these more 
> traditional views if you have them would be far more useful to 
> scientists who wish to analyze the data. That would have the added 
> benefit of being better described by EML structures.  Documenting your 
> "meta-level" schema isn't particularly informative because the 
> information in one attribute is so heterogeneous.
>
>>
>> 2. About the information of metadata itself
>>
>> Based on my understanding of the EML schemas, the only information
>> associated with the metadata itself is the information about the
>> metadata provider(s).  However, my supervisors and I think that it is
>> important to provide other metadata information, such as when the
>> metadata document was created, whether further updates of the
>> metadata are needed, and, if so, the metadata update frequency and
>> the date of the last update.  Those pieces of information are
>> particularly important when the endDate value for a
>> dataset from an ongoing project is going to change: first, they
>> remind metadata providers / developers when they should update
>> their metadata, and second, they tell metadata users whether the
>> metadata document provides the most current information about the
>> dataset described.
>
> Sure.  In hindsight, I think we should have included these metadata 
> information fields, particularly the timestamp fields.  But we do have 
> some related fields that describe ongoing data collection.  Take a 
> look at /eml/dataset/maintenance/description and 
> /eml/dataset/maintenance/maintenanceUpdateFrequency.  The latter is 
> probably what you want.  Any fields that you want but that don't exist
> in the schema can be put in the "/eml/additionalMetadata" field, so 
> you always have that as a recourse.  If you have specific 
> recommendations for fields that are needed you could send them to 
> eml-dev at ecoinformatics.org and we'll try to get them into plans for a 
> future release.
>
>>
>> 3. About the temporal coverage
>>
>> We have many metadata records with an uncertain endDate because new
>> data are being continuously loaded into the dataset.  Whenever new
>> data are loaded, we have to change the values for the end date, the
>> number of records, and/or the size of the table.  I am wondering when
>> you can provide a solution for this issue.
>
> Personally I think this is a good thing.  At any given point in time 
> there is a finite amount of data available, and the metadata should 
> describe that.  If you have an automated data collection process, then 
> you would simply have to update your metadata as part of that process. 
> The number of records, table size, and checksum are useful when people 
> get your data to validate that they got the data without error.  The 
> end date for temporal coverage provides valuable discovery 
> information, and should simply be made to match the data that you 
> release.
>
>>
>> In addition, I found from John's email that you had a KNB data
>> management workshop early this year.  I am very interested in this
>> kind of workshop, particularly workshops associated with the use of
>> metacat.  If you have this type of workshop in the future, please let
>> me know.
>
> Yeah, we had one in February.  We announce these opportunities on 
> various web sites and mailing lists.  You should subscribe to 
> ecoinfo at ecoinformatics.org and watch http://seek.ecoinformatics.org in 
> particular for announcements.
>
> Like Peter I also recommend that you get involved in the ongoing 
> improvements related to EML.  Your feedback and contributions would be 
> extremely valuable.  Good luck.  Let us know if you have more questions.
>
> Matt
>
>>
>> Thank you very much for your support!
>>
>> Xiaoping Wang
>>
>> PMEL /NOAA