provisional GCE-LTER eml available, comments appreciated

Matt Jones jones at nceas.ucsb.edu
Mon Oct 20 10:42:22 PDT 2003


Wade,

Congratulations.  I think you now have the distinction of being the 
first LTER site to expose fully validating, EML 2.0.0-compliant metadata 
on the web!  Excellent.  I looked over a few of your entries, read and 
validated them -- what you provide looks nicely complete.  I look 
forward to seeing your data table and attribute metadata when you get 
the issues worked out.

I'll ask David and James to start harvesting your metadata into the 
LNO Metacat to get the centralized search via the KNB started.  I think 
Andrews (AND) is also close to ready with this, and probably NTL too. 
Thanks for all of your hard work on this.

As far as your questions go:

> 1) Some method description text in our database includes high-bit
> ASCII codes (degrees, microns, etc) which were throwing validation
> errors with a UTF-8 encoding directive; I changed the encoding to
> ISO-8859-1 and that fixed the validation problem, but is that
> acceptable as a general practice or is there a way to escape
> individual characters without converting the whole doc to UTF-16
> (which would require a different programming approach)?

UTF-8 is your best choice for representing all Unicode characters. 
Whatever system you are using is probably writing its output as 
ISO-8859-1, so the UTF-8 declaration in the XML prolog was wrong. 
If you truly generate UTF-8 encoded documents and declare them as such, 
you will be able to use any special characters.
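
To make the mismatch concrete, here is a minimal Python sketch.  The 
sample text and the variable names are my own illustrative assumptions, 
not anything taken from the GCE system.  It shows the same document 
serialized as UTF-8 and as ISO-8859-1, why the second fails under a 
UTF-8 declaration, and one way to escape individual characters as 
numeric character references so the file stays plain ASCII:

```python
import xml.etree.ElementTree as ET

# Illustrative text containing a degree sign (U+00B0) and a micro sign (U+00B5).
text = "Water temperature (\u00b0C), filter pore size 0.2 \u00b5m"

prolog = '<?xml version="1.0" encoding="UTF-8"?>\n'
doc = prolog + "<para>" + text + "</para>"

# The same characters produce different bytes in the two encodings.
utf8_bytes = doc.encode("utf-8")          # degree sign becomes b"\xc2\xb0"
latin1_bytes = doc.encode("iso-8859-1")   # degree sign becomes b"\xb0"

# Option 1: really emit UTF-8, matching the declaration in the prolog.
parsed_utf8 = ET.fromstring(utf8_bytes)

# ISO-8859-1 bytes under a UTF-8 declaration are rejected by the parser.
try:
    ET.fromstring(latin1_bytes)
    latin1_parses = True
except ET.ParseError:
    latin1_parses = False

# Option 2: escape each non-ASCII character as a numeric character
# reference (&#176; and &#181;).  The result is pure ASCII, which is
# also valid UTF-8, so the declaration stays truthful.
escaped_bytes = doc.encode("ascii", errors="xmlcharrefreplace")
parsed_escaped = ET.fromstring(escaped_bytes)

print(latin1_parses)                # False
print(parsed_utf8.text == text)     # True
print(parsed_escaped.text == text)  # True
```

Either option yields the same parsed text; the second may be the 
smaller change if the generating code cannot easily switch its output 
encoding.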

> 2) I included dataset/method/sampling to incorporate sampling
> design info we store, but the required studyExtent tag isn't an exact
> semantic match to our sampling design field. I will probably have to
> revise the way we store this info in the future to comply with eml.

That's interesting.  Do you think there would be any information loss or 
gain in such a conversion?
>  
> 3) We store a lot of granular information about instrumentation (nested
> under method steps) which I wanted to include, but it wasn't clear how to
> best format these in instrumentation elements. For now I kludged it by
> stringing together the various fields, delimiting sections with
> parentheses and semicolons.

I guess we didn't envision a need for formatting in instrumentation.  I 
suppose it could be of type TextType rather than string if you think the 
formatting provided in TextType would be useful.  If so, could you 
submit a feature change request to eml-dev?
>  
> 4) I plan to include taxonomicCoverage info as well (our data sets are
> linked to our taxonomic database), but I need to study the eml
> implementation a bit more. Pointers to specific examples would be helpful.

I looked at your taxon coverage fields in this doc:
   http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INS-GCEM-0310

Overall, it looks great.  A couple of minor issues, though.  First, when 
you list rank=Species, most taxonomists would give the value as the full 
binomial (e.g., Orchelimum fidicinum rather than just fidicinum). 
Although this repeats the genus information and so is anathema to us 
db types, I think it is technically how taxonomists do it.  We had this 
discussion in the VegBank project, and I lost the battle over the value 
that goes in this field.  I would still prefer to do it the way that you 
did, but just wanted to mention that biologists have rejected this in 
the past for me.  Second, you frequently list commonName=none, when I 
think it would be better to omit the commonName field in this case, 
because "none" is not the common name of the species.  Third, I notice 
that you have a separate taxonomicCoverage section for each species, and 
you don't use the nesting of taxonomicClassification elements.  The 
documentation may not be clear, but when you are providing higher-level 
taxa for a species, EML generally assumes you will nest the values to 
indicate the parent-child relations.  Is there a reason you didn't do it 
this way?  One of the interpretations of your listing is that you 
"cover" all species within the kingdom "Animalia" in one of your 
datasets (as it isn't qualified by lower-level restrictions).
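
As a rough illustration of the nesting, here is a minimal Python sketch 
that builds one taxonomicCoverage with each lower rank nested inside 
its parent.  The function name and the abbreviated three-rank lineage 
are my own assumptions for the example; a real record would carry the 
intermediate ranks as well.  The commonName element is written only 
when a real common name is supplied:

```python
import xml.etree.ElementTree as ET

def build_taxonomic_coverage(lineage, common_name=None):
    """Nest taxonomicClassification elements from highest to lowest rank.

    lineage is a list of (rank name, rank value) pairs, highest rank first.
    common_name, if given, is attached to the lowest rank only.
    """
    coverage = ET.Element("taxonomicCoverage")
    parent = coverage
    for position, (rank, value) in enumerate(lineage):
        node = ET.SubElement(parent, "taxonomicClassification")
        ET.SubElement(node, "taxonRankName").text = rank
        ET.SubElement(node, "taxonRankValue").text = value
        is_lowest = position == len(lineage) - 1
        if is_lowest and common_name:
            # Omitted entirely when there is no common name,
            # rather than writing a placeholder such as "none".
            ET.SubElement(node, "commonName").text = common_name
        parent = node  # the next rank becomes a child of this one
    return coverage

# Abbreviated lineage for illustration only.
lineage = [
    ("Kingdom", "Animalia"),
    ("Genus", "Orchelimum"),
    ("Species", "Orchelimum fidicinum"),
]
coverage = build_taxonomic_coverage(lineage)
xml_text = ET.tostring(coverage, encoding="unicode")
print(xml_text)
```

With this shape, "Animalia" appears only as the ancestor of the species 
entry, so it no longer reads as coverage of the whole kingdom.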

> 5) In order to make our eml roll-out more manageable, I am considering
> providing an eml-optimized comma-delimited format as an additional
> static or dynamically-generated data format option for all tabular data
> sets. I'm not sure if I can accurately (or more importantly, usefully)
> describe some of our customizable data formats in eml (e.g. MATLAB
> arrays and matrices), so focusing on one eml-friendly format will
> simplify the problem space and speed up implementation of entity and
> attribute-level metadata. (Whether we continue to provide
> non-eml-described custom data sets down the road depends on where this
> whole thing goes in LTER).

If you are using matrices in MATLAB, maybe the simplest thing would be 
to add a Matrix entity type to EML.  Informally, a data table is a 
collection of attributes with potentially different types, while a matrix 
is an n-dimensional collection of values of the same type.  Each 
dimension is indexed, and the index values may correspond to indices in 
other vectors or matrices.  This is a type of relation among the 
matrices that is similar to a foreign key in the relational model, but 
it is not formally declared anywhere.  Is this how you model your data in 
MATLAB?  Would such an entityType be useful to you?  How do you formally 
describe the relationships among matrices?
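
To show what I mean by the table/matrix distinction, here is a small 
Python sketch.  All names and numbers in it are made up for 
illustration, and nothing here is an existing EML structure.  The 
table mixes types across attributes, while the matrix holds values of 
one type and relies on separate vectors to give meaning to its row and 
column positions:

```python
# A data table: each record has attributes of different types.
table = [
    {"site": "A", "depth_m": 0.5, "temperature_c": 24.5},
    {"site": "A", "depth_m": 2.0, "temperature_c": 23.9},
]

# Matrix style: two index vectors plus one matrix of same-typed values.
site_names = ["A", "B", "C"]   # labels for the rows (dimension 1)
depths_m = [0.5, 2.0]          # labels for the columns (dimension 2)
temperature_c = [              # 3 x 2 matrix, all floats
    [24.5, 23.9],
    [25.1, 24.0],
    [24.8, 23.7],
]

def lookup(matrix, row_labels, col_labels, row, col):
    """Resolve labels to positions, then read the matrix cell.

    The link between the label vectors and the matrix dimensions is
    only a convention here, much like an undeclared foreign key.
    """
    return matrix[row_labels.index(row)][col_labels.index(col)]

value = lookup(temperature_c, site_names, depths_m, "B", 2.0)
print(value)
```

A Matrix entity type would presumably need to record exactly the part 
that is implicit above: which vector indexes which dimension.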

That's about it.  Thanks again,

Matt

Wade Sheldon wrote:
> David and all,
>  
> OK, I spent some additional time this weekend refining our eml 
> implementation to chase down a bunch of schema validation errors and add 
> full taxonomic coverage. I had screwed up on some element nesting in 
> coverage and methods and had neglected some <para> tags in various 
> descriptive sections. I also had to omit the study descriptors I was 
> putting in methods/methodStep/sampling, because they just didn't fit 
> (logically or structurally). I'll look over the schema and normative 
> docs again and come back to that issue after my head has cleared a bit 
> (I'll probably bug you and Janine about that next week).
>  
> Now at least our eml docs all have geographic, temporal, and taxonomic 
> coverage (when relevant) plus methods and instrumentation and they are 
> all validating -- at least the representative data sets I've tried so 
> far. The taxonomic coverage info is being pulled directly from our 
> taxonomic database via data set cross-references, so I'm happy about 
> that from a maintenance standpoint. All other info except for 
> boilerplate project and contact descriptors are live from our existing 
> metadata database. I was able to produce this level of eml from our 
> existing database schema without modification so far -- I just used 
> views to generate xml fragments for each section and server-side ASP 
> scripts to query the database to retrieve the relevant records from 
> the views, and then wrap and nest the fragments to return the xml.
>  
> As far as content goes I guess we're between 2.5 and 3 on the 'lter eml 
> tiers' scale until I get dataTable worked out. I also want to add some 
> more info to the <intellectualRights> section to cover data release time 
> tables, etc. as discussed in the best practices doc.
>  
> If you want a good feel for some of our major metadata types, here are 
> some direct links:
>  
> Met data for 1 station (simple):
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=MET-GCEM-0109
>  
>  
> Oceanography data set for 1 site with extensive methods, instrumentation 
> but no taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=PHY-GCEM-0310c1
>  
> Marsh insect sampling data set for multiple sites with minimal methods, 
> multiple taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INS-GCEM-0310
>  
> Aquatic invertebrate sampling data set for multiple sites with 
> methodology, multiple taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INV-GCEM-0301b
>  
> Fungal data set with multiple sites, methodology and multiple taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=FNG-GCEM-0301
>  
> Chemistry study with complex methodology, instrumentation:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=POR-GCED-0210
>  
>  
> Regards,
>  
> Wade Sheldon
> GCE-LTER Information Manager
>  
> ___________________________________________________________
>  
>  Wade M. Sheldon
>  Management Information Specialist
>  Department of Marine Sciences
>  University of Georgia
>  Athens, GA  30602-3636
>  http://gce-lter.marsci.uga.edu/lter/bios/wsheldon.htm
>  
> "I love deadlines. I like the whooshing sound they make as they fly 
> by."  -- Douglas Adams

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------