[eml-dev] Documenting simulation data and selecting appropriate attributes

Matt Jones jones at nceas.ucsb.edu
Sat Dec 28 16:11:48 PST 2013


Carl --

Sorry I'm just getting back to this from my email backlog.  You raise a
number of important issues that again would represent areas where we could
improve EML.  Currently, EML documents data but, outside of the methods
section, has no machine-readable way of distinguishing the data source
(e.g., observation versus simulation).  Such a flag would be valuable, and
even more so if it followed a community-standard way of flagging the source,
i.e., a standard vocabulary.  I haven't searched for such a vocabulary, but
I'll bet there has been some work done.  Once we had a vocabulary, we could
decide if it merits a new field in EML, should be put in
'additionalMetadata' (with its drawbacks), or if there is an existing field
that is appropriate.

For the case of 'arbitrary abundance units', I think your model has some
form of abundance unit, implied by the unit of the initial abundance input
(X_t where t=0) -- everything else derives from that, right?  In any case,
either the derived values have physical units (e.g., kg), or they are
dimensionless numbers -- both of which can be indicated with EML's standard
units.  I don't think you should need any custom units just because it's a
simulation result.
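
To illustrate the point about units, here is a minimal Python sketch of the
Ricker model from the original question.  It assumes Z_t is log-normal noise
with log-standard-deviation sigma (the thread does not specify the noise
distribution), and the function name and parameter values are made up for
illustration.  It runs the model once with abundance in the same unit as K
(say, tonnes) and once with abundance expressed as a fraction of K, which is
a dimensionless number:

```python
import math
import random


def simulate_ricker(x0, r, K, sigma, steps, seed):
    """Simulate X_{t+1} = Z_t * X_t * exp(r * (1 - X_t / K)).

    Z_t is ASSUMED to be log-normal with log-sd `sigma`; this choice and
    all parameter values below are illustrative, not from the thread.
    """
    rng = random.Random(seed)  # fixed seed so the run can be regenerated
    xs = [x0]
    for _ in range(steps):
        z = math.exp(rng.gauss(0.0, sigma))
        x = xs[-1]
        xs.append(z * x * math.exp(r * (1.0 - x / K)))
    return xs


# Abundance carried in the unit of the initial input and of K (e.g. tonnes).
in_tonnes = simulate_ricker(x0=50.0, r=1.0, K=100.0, sigma=0.1, steps=20, seed=42)

# The same run with abundance as a fraction of K: a dimensionless number.
scaled = simulate_ricker(x0=0.5, r=1.0, K=1.0, sigma=0.1, steps=20, seed=42)

# The two series differ only by the constant factor K.
same = all(math.isclose(a / 100.0, b, rel_tol=1e-9)
           for a, b in zip(in_tonnes, scaled))
```

Either series can be described with an ordinary unit: the first inherits
whatever unit X_0 and K were given in, and the second is dimensionless.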

Whether the data should be archived is largely a pragmatic issue.  For
complex models, it's certainly valuable to archive the sim output and cite
it in your papers along with the code that generated it.  Some simulation
data is too large to archive conveniently, but my guess is that is not the
case for most fisheries simulations.
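
As a rough sketch of what archiving the output alongside the code might look
like, the Python snippet below writes a simulated series to CSV text with the
model parameters in header rows and computes a SHA-256 checksum that could be
recorded with the citation.  The function name, the header layout, and the
parameter dictionary are all hypothetical; none of this is an EML or archive
convention.

```python
import csv
import hashlib
import io


def archive_series(values, params):
    """Return (csv_text, sha256_hex) for a simulated series.

    `params` is a hypothetical dict of model parameters, written as
    '# name,value' header rows so the file carries its own provenance.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for name in sorted(params):
        writer.writerow([f"# {name}", params[name]])
    writer.writerow(["t", "X"])
    for t, x in enumerate(values):
        # repr() keeps full float precision so a rerun can be compared exactly
        writer.writerow([t, repr(x)])
    text = buf.getvalue()
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


# Illustrative values only.
csv_text, checksum = archive_series(
    [50.0, 82.4, 98.1],
    {"r": 1.0, "K": 100.0, "seed": 42},
)
```

Regenerating the series from the code and comparing checksums is one simple
way to notice when a rerun on a different architecture or compiler no longer
matches the archived copy.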

Good luck!

Matt



On Fri, Nov 1, 2013 at 2:56 PM, Carl Boettiger <cboettig at gmail.com> wrote:

> Hi eml-dev,
>
> I am looking to document a simple simulated time series appropriately in
> EML and am unclear how to best describe units used.  For example, perhaps
> one data set generates stock abundance from a stochastic Ricker model:
>
> X_{t+1} = Z_t X_t exp(r (1 - X_t / K))
>
> and the data consist then of time t, abundance X, and metadata stating the
> parameters.
>
> One approach is to document the data in the same way comparable
> 'real-world' data might be documented.  For instance, declaring that the
> simulated dynamics pertain to a particular species such as Anchovy,
> declaring that the abundance is in units of tonnes, time in years, the
> geography restricted to the Peruvian current, etc.  This raises several
> issues.
>
> While such assignments may best reflect the intended representation of the
> data, perhaps we want to be able to still distinguish these from real
> observations (programmatically, without consulting the methods section of
> the EML).  What would be the best way to indicate that these are simulated
> instead of true observations?
>
> Second, for many simulations such specificity is mere fiction, as the
> example may be intended to represent a certain species but the abundance
> has not been calibrated to any meaningful unit scale and would be better
> described merely as "arbitrary abundance units".  On the other hand, a
> proliferation of customUnits seems undesirable as well.
>
> Third, one might argue that such data should not be archived at all, but
> documented only by the code required to re-generate it (including random
> number seed, etc).  Obviously this approach faces non-trivial challenges of
> computational replication (architecture, compiler, etc).
>
>
> Thanks for any input and suggestions.
>
>
> Cheers,
>
> Carl
>
> --
> Carl Boettiger
> UC Santa Cruz
> http://carlboettiger.info/
>