Recommendation about information associated with EML metadata document itself

Matt Jones jones at nceas.ucsb.edu
Mon Feb 28 13:08:27 PST 2005


Xiaoping,

I entered your request into our bug tracking system and targeted it at 
EML 2.1.0.  You can follow its progress and our decisions on it here:
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=1991

Thanks for the feedback.

Matt

Xiaoping Wang wrote:
> Dear eml-dev:
> 
> As mentioned in my email to Matt and Peter (see below), providing 
> the necessary timestamp information for the EML metadata document is 
> important not only to metadata generators but also to metadata users.  
> Both /eml/dataset/maintenance/description and 
> /eml/dataset/maintenance/maintenanceUpdateFrequency in the EML schemas 
> describe the dataset, not the metadata document itself.  
> Although we can use /eml/additionalMetadata to say something about the 
> metadata document, I believe that the timestamp information about the 
> EML metadata document is so important that it needs to be highlighted.  
> The following is my recommendation for how you can provide more 
> information about the metadata itself.
> 
> In the /eml/dataset/res:ResourceGroup, instead of using metadataProvider 
> as one of the elements, use metadataInformation as suggested below:
> <metadataInformation>
>   A sequence of
>     <metadataProvider>      required
>       (comment: information about metadata providers is listed here)
>     <metadataCreationDate>  required
>       (comment: the date when the metadata document was originally
>       created)
>     <metadataMaintenance>   optional
>       (comment: this element is used when the metadata document needs
>       to be updated in the future)
>       A sequence of
>         <lastUpdateDate>    required
>           (comment: the date of the last metadata update)
>         <oldValue>          required
>           (comment: for example, the endDate for rangeOfDates,
>           numberOfRecords for an entity (table), size of an entity
>           (table), ...  These values will change after new data are
>           loaded into the dataset)
>         <updateFrequency>   required
>           (comment: by comparing updateFrequency and lastUpdateDate,
>           metadata developers know when they need to update their
>           metadata document, and metadata users know whether the
>           metadata document describes the most current information
>           about the dataset)
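
The comparison of updateFrequency and lastUpdateDate proposed above could be sketched roughly as follows. This is only an illustration: the frequency keywords, the intervals assigned to them, and the function name are assumptions made for the example and are not defined by EML.

```python
from datetime import date, timedelta

# Hypothetical mapping from an update-frequency keyword to a maximum age.
# The keywords and intervals are illustrative assumptions, not EML definitions.
MAX_AGE = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=31),
    "annually": timedelta(days=366),
}


def metadata_is_stale(last_update: date, update_frequency: str, today: date) -> bool:
    """Return True when more than one update interval has passed since last_update."""
    return today - last_update > MAX_AGE[update_frequency]
```

A metadata provider might run such a check on a schedule, and a metadata user might run it before trusting values such as endDate.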
> 
> These are the elements that I think should be provided in the 
> EML metadata document.  I hope my recommendation helps.
> 
> Thank you very much for your support.
> 
> Xiaoping Wang
> 
> PMEL /NOAA
>                 
> Matt Jones wrote:
> 
>> Hi Xiaoping,
>>
>> As Peter mentioned, your problems have arisen before.  See below for 
>> some additional recommendations beyond Peter's from my personal 
>> perspective.
>>
>> Xiaoping Wang wrote:
>>
>>> Dear Matt and Peter:
>>>
>>> I have seen a lot of discussions recently on issues about measurement 
>>> scale and temporal coverage.  They are very helpful for our better 
>>> understanding of EML.  The following are my questions and concerns I 
>>> raised during my work on our EML-based metadata.
>>>
>>> 1. About the Measurement scale
>>>
>>> The measurementScale is a little bit confusing.  I spent a lot of 
>>> time working on the measurementScale for nominal data.  Here I want 
>>> to give you an example of how I use the measurementScale to 
>>> describe nominal data in our dataset, so you can see whether my 
>>> implementation is based on a correct understanding of this element.
>>>
>>> We have a data table with four columns (attributes): recordID, 
>>> variable_name, variable_unit, and variable_value.  The values for the 
>>> variable_name column include certain measurements of the chemical 
>>> and physical properties of sea water such as temperature, salinity, 
>>> nitrate, ...  The following is a sample piece of my EML file for 
>>> this dataset.
>>> <attribute>
>>>   <attributeName>varName</attributeName>
>>>   <attributeDefinition>Name of chemical or physical property measured</attributeDefinition>
>>>   <storageType>String</storageType>
>>>   <measurementScale>
>>>     <nominal>
>>>       <nonNumericDomain>
>>>         <enumeratedDomain>
>>>           <codeDefinition>
>>>             <code>T</code>
>>>             <definition>Temperature, unit: C</definition>
>>>           </codeDefinition>
>>>           <codeDefinition>
>>>             <code>S</code>
>>>             <definition>Salinity, unit: PPT</definition>
>>>           </codeDefinition>
>>>           <codeDefinition>
>>>             <code>ST</code>
>>>             <definition>Sigma-T, unit: KG/M**3</definition>
>>>           </codeDefinition>
>>>         </enumeratedDomain>
>>>       </nonNumericDomain>
>>>     </nominal>
>>>   </measurementScale>
>>> </attribute>
>>> <attribute>
>>>   <attributeName>varUnit</attributeName>
>>>   <attributeDefinition>Unit of chemical or physical property measured</attributeDefinition>
>>>   <storageType>String</storageType>
>>>   <measurementScale>
>>>     <nominal>
>>>       <nonNumericDomain>
>>>         <textDomain>
>>>           <definition>*</definition>
>>>         </textDomain>
>>>       </nonNumericDomain>
>>>     </nominal>
>>>   </measurementScale>
>>> </attribute>
>>>
>>> My questions / concerns are:
>>> (1) Is it suitable to use enumeratedDomain element to describe varName?
>>
>>
>> Yes, that is fine, although if you wanted it to be free text that 
>> would be ok too (just use textDomain instead of enumeratedDomain).  
>> Encoding the unit information in the variable name is somewhat 
>> repetitive if you have the same unit information in the varUnit column.
>>
>>>
>>> (2) For the varUnit, I don't think it is necessary to include the 
>>> measurementScale element.  However, since measurementScale is a 
>>> required field, I have to put something there in order to pass 
>>> EML validation.  So I put a "*" sign in the definition element.  I 
>>> have seen some other similar cases in which EML metadata 
>>> developers use a "*" for the definition element.  Obviously, the 
>>> measurementScale content described here gives no useful information 
>>> about the varUnit.
>>
>>
>> The use of the '*' is inappropriate.  The field is required because 
>> the authors of EML thought the information was important.  In this 
>> case, I think you should put in the definition something that 
>> indicates that the values are names of units.  One major thing that is 
>> missing here is that you don't use the EML Unit Dictionary when 
>> choosing your unit definitions.  This eliminates the major advantage 
>> of EML in being able to provide quantitative information about units.  
>> If there is a 1:1 correspondence between your units and the EML unit 
>> dictionary, I think it would be good if you defined varUnit as an 
>> enumerated domain and, for each of your units, provided the EML 
>> standard name for the unit in the definition.  This would help in 
>> translating, although it is unlikely that anyone could use this in 
>> automated systems because it's such a non-standard use of the EML 
>> descriptors.
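
One way to generate the enumerated domain for varUnit suggested above is sketched here. The mapping, the function name, and the wording of the definition text are hypothetical, and the unit names shown should be checked against the unit dictionary shipped with the EML release in use before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: unit string as stored in the varUnit column -> the
# EML unit dictionary name believed to correspond to it. Verify each name
# against the dictionary that ships with your EML release.
UNIT_NAMES = {
    "C": "celsius",
    "KG/M**3": "kilogramsPerCubicMeter",
}


def unit_enumerated_domain(unit_names: dict) -> ET.Element:
    """Build an enumeratedDomain with one codeDefinition per local unit string."""
    domain = ET.Element("enumeratedDomain")
    for code, eml_name in unit_names.items():
        code_def = ET.SubElement(domain, "codeDefinition")
        ET.SubElement(code_def, "code").text = code
        ET.SubElement(code_def, "definition").text = f"EML standard unit: {eml_name}"
    return domain


print(ET.tostring(unit_enumerated_domain(UNIT_NAMES), encoding="unicode"))
```

The resulting element would take the place of the textDomain inside nonNumericDomain for the varUnit attribute.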
>>
>> In general, this model of variablename, varunit, value is a 
>> non-standard use of the relational model as the attributes do not 
>> really represent a single type.  The relational model is generally 
>> intended to have attributes that contain a semantically homogeneous 
>> set of values.  In your case this is not true, unless considered from 
>> a meta-level.  So, I think you are using the relational model as a 
>> schema language itself.  This significantly complicates use of the 
>> data in standard analytical systems (e.g., SAS, Splus, R, Matlab); 
>> they basically all require different views of the data as described 
>> in Peter's note.  Personally I think that documenting these more 
>> traditional views, if you have them, would be far more useful to 
>> scientists who wish to analyze the data.  That would have the added 
>> benefit of being better described by EML structures.  Documenting your 
>> "meta-level" schema isn't particularly informative because the 
>> information in one attribute is so heterogeneous.
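
The more traditional view mentioned above, with one column per variable, can be derived from the four-column table with a simple pivot. The sketch below uses made-up sample values and assumes that rows belonging to the same sample share a recordID. If recordID is unique per row, a different sample key would be needed.

```python
from collections import defaultdict

# Rows in the (recordID, variable_name, variable_unit, variable_value) layout
# from the thread; the values themselves are made up for illustration.
long_rows = [
    (1, "T", "C", 12.5),
    (1, "S", "PPT", 33.1),
    (2, "T", "C", 12.9),
    (2, "S", "PPT", 33.0),
]


def pivot_to_wide(rows):
    """Group rows by recordID so that each variable becomes its own column."""
    wide = defaultdict(dict)
    for record_id, name, _unit, value in rows:
        wide[record_id][name] = value
    return [{"recordID": rid, **values} for rid, values in sorted(wide.items())]
```

In the wide view each column holds a single kind of measurement, so it could be documented with its own attribute, measurement scale, and unit.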
>>
>>>
>>> 2. About the information of metadata itself
>>>
>>> Based on my understanding of the EML schemas, the only information 
>>> associated with the metadata itself is the information about the 
>>> metadata provider(s).  However, my supervisors and I think that it is 
>>> important to provide other metadata information, such as when the 
>>> metadata document was created, whether further updates of the 
>>> metadata are needed, and if so, what the metadata update frequency 
>>> and the date of the last update are.  Those pieces of information are 
>>> particularly important when the endDate value for a dataset from an 
>>> on-going project is going to change, because first they 
>>> can remind metadata providers / developers when they should update 
>>> their metadata, and second they can tell metadata users whether the 
>>> metadata document provides the most current information about the 
>>> dataset described.
>>
>>
>> Sure.  In hindsight, I think we should have included these metadata 
>> information fields, particularly the timestamp fields.  But we do have 
>> some related fields that describe ongoing data collection.  Take a 
>> look at /eml/dataset/maintenance/description and 
>> /eml/dataset/maintenance/maintenanceUpdateFrequency.  The latter is 
>> probably what you want.  Any fields that you want but that don't exist 
>> in the schema can be put in the "/eml/additionalMetadata" field, so 
>> you always have that as a recourse.  If you have specific 
>> recommendations for fields that are needed you could send them to 
>> eml-dev at ecoinformatics.org and we'll try to get them into plans for a 
>> future release.
>>
>>>
>>> 3. About the temporal coverage
>>>
>>> We have many metadata records with uncertain endDate because the new 
>>> data are being continuously loaded into the dataset.  Whenever new 
>>> data are loaded, we have to change the values for end date, number of 
>>> records, and /or size of table......  I am wondering when you can 
>>> provide a solution for this issue.
>>
>>
>> Personally I think this is a good thing.  At any given point in time 
>> there is a finite amount of data available, and the metadata should 
>> describe that.  If you have an automated data collection process, then 
>> you would simply have to update your metadata as part of that process. 
>> The number of records, table size, and checksum are useful when people 
>> get your data to validate that they got the data without error.  The 
>> end date for temporal coverage provides valuable discovery 
>> information, and should simply be made to match the data that you 
>> release.
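
If the data load is automated, the values that change with each load could be recomputed in the same step. The sketch below assumes a CSV table with an ISO 8601 date column. The function name and the "md5" key are illustrative, and writing the results back into the EML document is left out.

```python
import csv
import hashlib
import io


def summarize_table(data: bytes, date_column: str) -> dict:
    """Compute the values that would be refreshed in the metadata after a load."""
    rows = list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
    return {
        "numberOfRecords": len(rows),
        "size": len(data),  # size of the table file in bytes
        "md5": hashlib.md5(data).hexdigest(),  # checksum for download validation
        # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
        "endDate": max(row[date_column] for row in rows),
    }
```

Someone who downloads the data can recompute the same numbers and compare them with the metadata to confirm the transfer was complete.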
>>
>>>
>>> In addition, I found from John's email that you had a KNB data 
>>> management workshop early this year.  I am very interested in this 
>>> kind of workshop, particularly workshops associated with the use of 
>>> Metacat.  If you have this type of workshop in the future, please let 
>>> me know.
>>
>>
>> Yeah, we had one in February.  We announce these opportunities on 
>> various web sites and mailing lists.  You should subscribe to 
>> ecoinfo at ecoinformatics.org and watch http://seek.ecoinformatics.org in 
>> particular for announcements.
>>
>> Like Peter I also recommend that you get involved in the ongoing 
>> improvements related to EML.  Your feedback and contributions would be 
>> extremely valuable.  Good luck.  Let us know if you have more questions.
>>
>> Matt
>>
>>>
>>> Thank you very much for your support!
>>>
>>> Xiaoping Wang
>>>
>>> PMEL /NOAA
>>>
>>>
>>>
>>>
>>>
>>>
>>
> 
> _______________________________________________
> eml-dev mailing list
> eml-dev at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/eml-dev

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------


