provisional GCE-LTER eml available, comments appreciated

Mon Oct 20 11:50:41 PDT 2003

Matt,

Thanks for the comments. Responses are listed below each one.

----- Original Message -----
From: "Matt Jones" <jones at nceas.ucsb.edu>
To: "Wade Sheldon" <sheldon at uga.edu>
Cc: <eml-dev at ecoinformatics.org>; "Jim Reichman"
<reichman at nceas.ucsb.edu>
Sent: Monday, October 20, 2003 1:42 PM
Subject: Re: provisional GCE-LTER eml available, comments appreciated

> Wade,
>
> Congratulations.  I think you now have the distinction of being the
> first LTER site to expose fully-validating EML2.0.0 compliant metadata
> on the web!  Excellent.  I looked over a few of your entries, read and
> validated them -- what you provide looks nicely complete.  I'll look
> forward to seeing your data table and attribute metadata when you get
> the issues worked out.
>
> I'll ask David and James to start trying to harvest your metadata into
> the LNO metacat to start the centralized search via the KNB.  I think
> Andrew's (AND) is also close to ready with this, and probably NTL too.
> Thanks for all of your hard work on this.
>

Thanks, but it was mostly just a matter of getting around to it. My
metadata RDBMS model was designed to support granular ESA-FLED metadata,
so it just took a little time to settle on a strategy to mark up our
existing contents.

Regarding bulk harvesting of our metadata into a metacat, note that I
also added support for numerical dataset ids as an alternative to
accession strings to support easier ingestion, e.g.
  http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?dataset=153
is exactly equivalent to:

  http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?accession=INS-GCEM-0310
but much easier to script blindly. Our data catalog currently contains 177
sequential data sets, so it would be a lot more efficient to grab the XML
files by number with a script.
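
As a rough sketch of what such a harvest script might look like, the Python below builds one request URL per numeric dataset id. It assumes the ids run contiguously from 1 to 177, which the message implies but does not state outright; the function names are illustrative only.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs in this message.
BASE_URL = "http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp"


def eml_url(dataset_id: int) -> str:
    """Build the send_eml.asp URL for one numeric dataset id."""
    return f"{BASE_URL}?{urlencode({'dataset': dataset_id})}"


def harvest_urls(first: int = 1, last: int = 177) -> list[str]:
    """List the URLs for every id in [first, last].

    Assumption: ids are contiguous; a real harvester should tolerate gaps.
    """
    return [eml_url(i) for i in range(first, last + 1)]


if __name__ == "__main__":
    for url in harvest_urls():
        print(url)
```

Each URL could then be fetched with any HTTP client (for example urllib.request.urlopen) and the response saved as one XML file per data set.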

> As far as your questions go:
>
> > 1) Some method description text in our database includes high-bit
> > ASCII codes (degrees, microns, etc) which were throwing validation
> > errors with a UTF-8 encoding directive; I changed the encoding to
> > ISO-8859-1 and that fixed the validation problem, but is that
> > acceptable as a general practice or is there a way to escape
> > individual characters without converting the whole doc to UTF-16
> > (which would require a different programming approach)?
>
> UTF-8 is your best choice for representing all Unicode characters.
> Whatever system you are using is probably generating an encoding of
> ISO-8859-1, and so the UTF-8 declaration in the XML prolog was wrong.
> If you truly generate UTF-8 encoded documents and declare them as
> such, you will be able to use any special characters.

Thanks for the clarification. I'll look into this some more and try to
provide UTF-8 encoding in the future.
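
To illustrate the mismatch Matt describes, here is a small Python sketch. The sample string is made up for illustration. It shows that the same text yields different bytes in ISO-8859-1 and UTF-8, that the ISO-8859-1 bytes are not valid UTF-8, and that XML numeric character references are one way to escape individual characters while keeping the byte stream pure ASCII.

```python
# Hypothetical method text containing a degree sign and a micro sign.
text = "Water temperature (\u00b0C), particle size (\u00b5m)"

latin1_bytes = text.encode("iso-8859-1")  # degree sign -> single byte 0xB0
utf8_bytes = text.encode("utf-8")         # degree sign -> two bytes 0xC2 0xB0


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Escape every non-ASCII character as an XML numeric character reference,
# e.g. the degree sign becomes "&#176;". The result is valid under either
# a UTF-8 or an ISO-8859-1 declaration.
ascii_escaped = text.encode("ascii", errors="xmlcharrefreplace")

if __name__ == "__main__":
    print(is_valid_utf8(latin1_bytes))  # ISO-8859-1 bytes labelled as UTF-8
    print(is_valid_utf8(utf8_bytes))
    print(ascii_escaped.decode("ascii"))
```

Serving ISO-8859-1 bytes under a UTF-8 declaration is what a validating parser would reject; either declaring the true encoding or emitting real UTF-8 avoids the error.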

> > 2) I included dataset/method/sampling to incorporate sampling
> > design info we store, but the required studyExtent tag isn't an
> > exact semantic match to our sampling design field. I will probably
> > have to revise the way we store this info in the future to comply
> > with eml.
>
> That's interesting.  Do you think there would be any information loss
> or gain in such a conversion?

I'm not sure I understand your question. To clarify my problem, my
existing metadata model associates multiple independent sets of study
descriptors (e.g. survey legs of a cruise, field excursions) which
contain details about sampling and statistical design plus date info for
each study element, and multiple independent sets of method descriptors
(with associated instrumentation) with each data set. I could try to
couple these elements in our database, but the eml schema seems to nest
sampling design info under methods, as I understand it, and in our
implementation it would be more natural to go the other way (studies
contain methods). As a work-around I could include multiple
temporalCoverage elements to store the granular date time info (I'm
using SQL aggregation across all studies for this now), and add sampling
descriptors as separate methodStep descriptions.

> > 3) We store a lot of granular information about instrumentation
> > (nested under method steps) which I wanted to include, but it wasn't
> > clear how to best format these in instrumentation elements. For now
> > I kludged it by stringing together the various fields, delimiting
> > sections with parentheses and semicolons.
>
> I guess we didn't envision a need for formatting in instrumentation.
> I suppose it could be of type TextType rather than string if you think
> the formatting provided in TextType would be useful.  If so, could you
> submit a feature change request to eml-dev?

OK, I'll consider that, although I don't think there's much to be
gained.

> >
> > 4) I plan to include taxonomicCoverage info as well (our data sets
> > are linked to our taxonomic database), but I need to study the eml
> > implementation a bit more. Pointers to specific examples would be
> > helpful.
>
> I looked at your taxon coverage fields in this doc:
>
> http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?id=INS-GCEM-0310
>
> Overall, it looks great.  A couple of minor issues, though.  First,
> when you list rank=Species, most taxonomists would put the value as a
> binomial epithet (i.e., Orchelimum fidicinum rather than just
> fidicinum).  Although this repeats the genus information and so is
> anathema to us db types, I think it is technically how taxonomists do
> it.  We had this discussion in the VegBank project, and I lost this
> battle over the value that goes in this field.  I would still prefer
> to do it the way that you did, but just wanted to mention that
> biologists have rejected this in the past for me.  Second, you
> frequently list commonName=none, when I think it would just be better
> to omit the commonName field in this case because "none" is not the
> common name of the species.  Third, I notice that you have a separate
> taxonomicCoverage section for each species, and you don't use the
> nesting of taxonomicClassification elements.  The documentation may
> not be clear, but when you are providing higher-level taxa for a
> species, EML generally assumes you will nest the values to indicate
> the parent-child relations.  Is there a reason you didn't do it this
> way?  One of the interpretations of your listing is that you "cover"
> all species within the kingdom "Animalia" in one of your datasets (as
> it isn't qualified by lower level restrictions).

I made some changes to the tag nesting this morning to make
taxonomicCoverage make more sense (e.g. I am now nesting within
taxonomicClassification), and after receiving your reply I removed the
<commonName>none</commonName> for null fields and added genus to the
species element (inelegant, but you've got a lot more experience working
with systematists and I'll take your word for it). I am still wrapping
each species record in its own <taxonomicCoverage> tagset to avoid
repeating blocks of <taxonomicClassification> running together, but I
could throw them all into a single set of coverage tags if you think
that would be better (I tested it and it does validate, but is a bit
ugly). See the attached doc for an example of my revised implementation.
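
For readers without the attachment, the nesting discussed above can be sketched in Python as follows. This is only an illustration, not the code used to generate the GCE documents: the helper names are hypothetical, and the rank list is limited to the taxa named in this thread (intermediate ranks are omitted). The element names are those of the EML taxonomicCoverage module.

```python
import xml.etree.ElementTree as ET


def nested_classification(ranks, common_name=None):
    """Nest taxonomicClassification elements from highest to lowest rank.

    ranks: list of (rank_name, rank_value) pairs, highest rank first.
    common_name: added to the lowest rank only; omitted entirely when
    None, rather than written as a placeholder such as "none".
    """
    root = None
    parent = None
    for rank_name, rank_value in ranks:
        if parent is None:
            node = ET.Element("taxonomicClassification")
            root = node
        else:
            node = ET.SubElement(parent, "taxonomicClassification")
        ET.SubElement(node, "taxonRankName").text = rank_name
        ET.SubElement(node, "taxonRankValue").text = rank_value
        parent = node
    if common_name:
        ET.SubElement(parent, "commonName").text = common_name
    return root


def build_coverage(ranks, common_name=None):
    """Wrap one nested classification in its own taxonomicCoverage element."""
    coverage = ET.Element("taxonomicCoverage")
    coverage.append(nested_classification(ranks, common_name))
    return coverage


if __name__ == "__main__":
    cov = build_coverage([
        ("Kingdom", "Animalia"),
        ("Genus", "Orchelimum"),
        ("Species", "Orchelimum fidicinum"),  # binomial, as Matt suggests
    ])
    ET.indent(cov)  # pretty-printing; requires Python 3.9 or later
    print(ET.tostring(cov, encoding="unicode"))
```

Because each lower rank sits inside its parent, the coverage is read as the single species rather than everything within Animalia.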

>
> > 5) In order to make our eml roll-out more manageable, I am
> > considering providing an eml-optimized comma-delimited format as an
> > additional static or dynamically-generated data format option for
> > all tabular data sets. I'm not sure if I can accurately (or more
> > importantly, usefully) describe some of our customizable data
> > formats in eml (e.g. MATLAB arrays and matrices), so focusing on one
> > eml-friendly format will simplify the problem space and speed up
> > implementation of entity and attribute-level metadata. (Whether we
> > continue to provide non-eml-described custom data sets down the road
> > depends on where this whole thing goes in LTER).
>
> If you are using matrices in Matlab, maybe the simplest thing would be
> to add a Matrix entity type in EML.  Informally, a data table is a
> collection of attributes with potentially different types while a
> matrix is an n-dimensional collection of values of the same type.
> Each dimension is indexed, and the index values may correspond to
> indices in other vectors or matrices.  This is a type of relation
> among the matrices that is similar to a foreign key in the relational
> model, but not formally specified as existing.  Is this how you model
> your data in matlab?  Would such an entityType be useful to you?  How
> do you formally describe the relationships among matrices?

The way I'm dealing with matrices now for documentation purposes is to
treat them as a specific instantiation of a conventional table, with
columns as attributes and rows as records. I don't offer matrix-based
MATLAB files on our public web site any more to prevent confusion, but
it is still an option for dynamically-generated files on our private
site. The binaries for these files contain a single matrix named 'data'
(with any textual columns automatically encoded as unique integers and
documented in the metadata as coded numerical columns), along with
matching arrays of column names, column units, a corresponding flag
array, and a character array containing formatted text metadata. I also
generate more conventional array-based MATLAB binaries which would be
easier to describe in eml (1 column = 1 array, plus a metadata character
array).
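
The idea of encoding textual columns as unique integers, with the code list documented in the metadata, can be sketched as below. This Python is only an illustration of the general technique; the actual GCE coding scheme is not described in this message, so the numbering (codes assigned from 1 in order of first appearance) and the sample values are assumptions.

```python
def encode_text_column(values):
    """Replace each distinct string with an integer code.

    Returns (encoded, codes): the list of integer codes in row order, and
    the value-to-code mapping that would be documented in the metadata.
    Assumption: codes start at 1 and follow order of first appearance.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes


if __name__ == "__main__":
    # Hypothetical text column from a tabular data set.
    habitat = ["marsh", "creek", "marsh", "upland"]
    encoded, codes = encode_text_column(habitat)
    print(encoded)
    print(codes)
```

With every column numeric, the whole table fits in a single homogeneous matrix, and the mapping lets a reader recover the original text.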

I agree it may be possible to describe such binary files in eml, but I'm
still not convinced it's useful. I would rather concentrate on providing
delimited ASCII files thoroughly described in eml which would be more
amenable to long-term archiving and mediation apps, and link those
procedures to my MATLAB web server engine to support the kind of basic
grid-centric functionality you and Peter have been describing for
picking attributes, resampling, sub-selecting, etc in response to web
service queries ('send eml', 'send data', 'query data', etc). That's
very practical with my current set-up, and I could provide MATLAB
binaries as a custom 'boutique' data product until I figure out a
worthwhile way to describe them in eml.

Thanks for your comments.

Regards,

Wade
-------------- next part --------------
A non-text attachment was scrubbed...
Name: INS-GCEM-0310.1.0.xml
Type: text/xml
Size: 21694 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/attachments/20031020/afc47966/INS-GCEM-0310.1.0.xml