provisional GCE-LTER eml available, comments appreciated
Matt Jones
jones at nceas.ucsb.edu
Mon Oct 20 10:42:22 PDT 2003
Wade,
Congratulations. I think you now have the distinction of being the
first LTER site to expose fully-validating EML2.0.0 compliant metadata
on the web! Excellent. I looked over a few of your entries, read and
validated them -- what you provide looks nicely complete. I'll look
forward to seeing your data table and attribute metadata when you get
the issues worked out.
I'll ask David and James to start trying to harvest your metadata into
the LNO metacat to start the centralized search via the KNB. I think
Andrews (AND) is also close to ready with this, and probably NTL too.
Thanks for all of your hard work on this.
As far as your questions go:
> 1) Some method description text in our database includes high-bit
>ASCII codes (degrees, microns, etc) which were throwing validation
>errors with a UTF-8 encoding directive; I changed the encoding to
>ISO-8859-1 and that fixed the validation problem, but is that
>acceptable as a general practice or is there a way to escape
>individual characters without converting the whole doc to UTF-16
>(which would require a different programming approach)?
UTF-8 is your best choice for representing all Unicode characters.
Whatever system you are using is probably generating an encoding of
ISO-8859-1, and so the UTF-8 declaration in the XML prolog was wrong.
If you truly generate UTF-8 encoded documents and declare them as such,
you will be able to use any special characters.
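To make the encoding point concrete, here is a minimal Python sketch (not
your ASP code; the sample text and the helper names wrap and is_valid_utf8
are made up for illustration). It shows that ISO-8859-1 bytes for the
degree and micro signs are not valid UTF-8, that the same text parses
once the bytes and the declaration agree, and that numeric character
references are one way to escape individual characters while keeping the
bytes pure ASCII.

```python
import xml.etree.ElementTree as ET

# Hypothetical method text containing a degree sign and a micro sign.
text = "Incubated at 25\u00b0C, 0.2 \u00b5m filter"


def wrap(body: bytes, encoding: str) -> bytes:
    """Put an XML declaration and a <para> element around already-encoded bytes."""
    prolog = '<?xml version="1.0" encoding="%s"?>' % encoding
    return prolog.encode("ascii") + b"<para>" + body + b"</para>"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = text.encode("iso-8859-1")   # one byte per special character
utf8_bytes = text.encode("utf-8")          # two bytes per special character
# Numeric character references: pure ASCII, valid under any declared encoding.
escaped = text.encode("ascii", "xmlcharrefreplace")

# ISO-8859-1 bytes under a UTF-8 declaration are the mismatch described above.
print("latin1 bytes valid as UTF-8:", is_valid_utf8(latin1_bytes))

# Matching bytes and declaration parse and round-trip the text.
for body, encoding in [(utf8_bytes, "UTF-8"),
                       (latin1_bytes, "ISO-8859-1"),
                       (escaped, "UTF-8")]:
    parsed = ET.fromstring(wrap(body, encoding)).text
    print(encoding, parsed == text)

print(escaped.decode("ascii"))
```

Either fix works for a parser: declare what you really emit, or emit
what you declare. The character-reference form is the per-character
escape you asked about, at the cost of less readable source text.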
> 2) I included dataset/method/sampling to incorporate sampling
>design info we store, but the required studyExtent tag isn't an exact
>semantic match to our sampling design field. I will probably have to
>revise the way we store this info in the future to comply with eml.
That's interesting. Do you think there would be any information loss or
gain in such a conversion?
>
> 3) We store a lot of granular information about instrumentation (nested
>under method steps) which I wanted to include, but it wasn't clear how to
>best format these in instrumentation elements. For now I kludged it by
>stringing together the various fields, delimiting sections with
>parentheses and semicolons.
I guess we didn't envision a need for formatting in instrumentation. I
suppose it could be of type TextType rather than string if you think the
formatting provided in TextType would be useful. If so, could you
submit a feature change request to eml-dev?
>
> 4) I plan to include taxonomicCoverage info as well (our data sets are
>linked to our taxonomic database), but I need to study the eml
>implementation a bit more. Pointers to specific examples would be helpful.
I looked at your taxon coverage fields in this doc:
http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INS-GCEM-0310
Overall, it looks great. A couple of minor issues, though. First, when
you list rank=Species, most taxonomists would put the value as a
binomial name (i.e., Orchelimum fidicinum rather than just fidicinum).
Although this repeats the genus information and so is anathema to us
db types, I think it is technically how taxonomists do it. We had this
discussion in the VegBank project, and I lost this battle over the value
that goes in this field. I would still prefer to do it the way that you
did, but just wanted to mention that biologists have rejected this in
the past for me. Second, you frequently list commonName=none, when I
think it would just be better to omit the commonName field in this case
because "none" is not the common name of the species. Third, I notice
that you have a separate taxonomicCoverage section for each species, and
you don't use the nesting of taxonomicClassification elements. The
documentation may not be clear, but when you are providing higher-level
taxa for a species, EML generally assumes you will nest the values to
indicate the parent-child relations. Is there a reason you didn't do it
this way? One of the interpretations of your listing is that you
"cover" all species within the kingdom "Animalia" in one of your
datasets (as it isn't qualified by lower level restrictions).
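As a rough illustration of the nesting, here is a small Python sketch
that builds one taxonomicCoverage with each rank nested inside its
parent and leaves out commonName when there is none. The function name
nest_classification is hypothetical, and the rank list is trimmed to
Kingdom, Genus and Species for brevity; a real record would carry
whatever intermediate ranks your taxonomic database holds.

```python
import xml.etree.ElementTree as ET


def nest_classification(parent, ranks, common_name=None):
    """Nest one taxonomicClassification per (rank, value) pair.

    Each classification is created inside the previous one, so the
    parent-child relation between ranks is explicit in the XML.
    commonName is written on the lowest rank only when one is given.
    """
    node = parent
    for rank_name, rank_value in ranks:
        node = ET.SubElement(node, "taxonomicClassification")
        ET.SubElement(node, "taxonRankName").text = rank_name
        ET.SubElement(node, "taxonRankValue").text = rank_value
    if common_name:
        ET.SubElement(node, "commonName").text = common_name
    return parent


coverage = ET.Element("taxonomicCoverage")
nest_classification(
    coverage,
    [
        ("Kingdom", "Animalia"),
        ("Genus", "Orchelimum"),
        ("Species", "Orchelimum fidicinum"),  # binomial, as discussed above
    ],
    common_name=None,  # no common name known, so the element is omitted
)

print(ET.tostring(coverage, encoding="unicode"))
```

Read this way, "Animalia" is only the top of one lineage and is
qualified by the species nested beneath it, rather than standing alone
as a claim to cover the whole kingdom.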
> 5) In order to make our eml roll-out more manageable, I am considering
>providing an eml-optimized comma-delimited format as an additional
>static or dynamically-generated data format option for all tabular data
>sets. I'm not sure if I can accurately (or more importantly, usefully)
>describe some of our customizable data formats in eml (e.g. MATLAB
>arrays and matrices), so focusing on one eml-friendly format will
>simplify the problem space and speed up implementation of entity and
>attribute-level metadata. (Whether we continue to provide
>non-eml-described custom data sets down the road depends on where this
>whole thing goes in LTER).
If you are using matrices in Matlab, maybe the simplest thing would be
to add a Matrix entity type in EML. Informally, a data table is a
collection of attributes with potentially different types while a matrix
is an n-dimensional collection of values of the same type. Each
dimension is indexed, and the index values may correspond to indices in
other vectors or matrices. This is a type of relation among the
matrices that is similar to a foreign key in the relational model, but
not formally specified as existing. Is this how you model your data in
MATLAB? Would such an entityType be useful to you? How do you formally
describe the relationships among matrices?
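To show what I mean by the informal relation, here is a toy sketch in
Python rather than MATLAB. Everything in it is made up for illustration
(the site labels, depths, temperature values and the function name
labelled); it is not a proposed EML structure. A 2-D matrix of
same-typed values is indexed by position, and two separate vectors
supply the meaning of each position, which is the foreign-key-like link
that nothing formally declares.

```python
# Index vectors: position i along a dimension refers to entry i here.
sites = ["site_A", "site_B"]          # labels for dimension 0 (rows)
depths_m = [0.5, 1.0, 2.0]            # labels for dimension 1 (columns)

# The matrix itself: every cell has the same type (float), unlike a data
# table whose attributes may each have a different type.
temperature = [
    [21.3, 20.9, 20.1],
    [22.0, 21.4, 20.6],
]


def labelled(matrix, row_index, col_index):
    """Yield (row_label, col_label, value) by resolving positions
    through the index vectors, checking that the shapes agree."""
    if len(matrix) != len(row_index):
        raise ValueError("row index length does not match matrix")
    for row in matrix:
        if len(row) != len(col_index):
            raise ValueError("column index length does not match matrix")
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            yield row_index[i], col_index[j], value


for site, depth, value in labelled(temperature, sites, depths_m):
    print(site, depth, value)
```

The shape check is the only thing tying the matrix to its index vectors
here, so a metadata description would have to state that link
explicitly for someone else to reconstruct it.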
That's about it. Thanks again,
Matt
Wade Sheldon wrote:
> David and all,
>
> OK, I spent some additional time this weekend refining our eml
> implementation to chase down a bunch of schema validation errors and add
> full taxonomic coverage. I had screwed up on some element nesting in
> coverage and methods and had neglected some <para> tags in various
> descriptive sections. I also had to omit the study descriptors I was
> putting in methods/methodStep/sampling, because they just didn't fit
> (logically or structurally). I'll look over the schema and normative
> docs again and come back to that issue after my head has cleared a bit
> (I'll probably bug you and Janine about that next week).
>
> Now at least our eml docs all have geographic, temporal, and taxonomic
> coverage (when relevant) plus methods and instrumentation and they are
> all validating -- at least the representative data sets I've tried so
> far. The taxonomic coverage info is being pulled directly from our
> taxonomic database via data set cross-references, so I'm happy about
> that from a maintenance standpoint. All other info except for
> boilerplate project and contact descriptors are live from our existing
> metadata database. I was able to produce this level of eml from our
> existing database schema without modification so far -- I just used
> views to generate xml fragments for each section and server-side ASP
> scripts to query the database to retrieve the relevant records from
> the views, and then wrap and nest the fragments to return the xml.
>
> As far as content goes I guess we're between 2.5 and 3 on the 'lter eml
> tiers' scale until I get dataTable worked out. I also want to add some
> more info to the <intellectualRights> section to cover data release time
> tables, etc. as discussed in the best practices doc.
>
> If you want a good feel for some of our major metadata types, here are
> some direct links:
>
> Met data for 1 station (simple):
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=MET-GCEM-0109
>
>
> Oceanography data set for 1 site with extensive methods, instrumentation
> but no taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=PHY-GCEM-0310c1
>
> Marsh insect sampling data set for multiple sites with minimal methods,
> multiple taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INS-GCEM-0310
>
> Aquatic invertebrate sampling data set for multiple sites with
> methodology, multiple taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INV-GCEM-0301b
>
> Fungal data set with multiple sites, methodology and multiple taxa:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=FNG-GCEM-0301
>
> Chemistry study with complex methodology, instrumentation:
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=POR-GCED-0210
>
>
> Regards,
>
> Wade Sheldon
> GCE-LTER Information Manager
>
> ___________________________________________________________
>
> Wade M. Sheldon
> Management Information Specialist
> Department of Marine Sciences
> University of Georgia
> Athens, GA 30602-3636
> http://gce-lter.marsci.uga.edu/lter/bios/wsheldon.htm
>
> "I love deadlines. I like the whooshing sound they make as they fly
> by." -- Douglas Adams
--
-------------------------------------------------------------------
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------