[seek-kr-sms] Re: [SEEK-Taxon] Question about EML
Matt Jones
jones at nceas.ucsb.edu
Fri Mar 5 19:18:23 PST 2004
Hi Shawn,
I am not a taxonomist, and I have only a superficial view of these
issues, but I think I can help clarify with a highly simplistic
exposition of the issues. So here's a simplified version of the issue
and how concept info helps to resolve the problems. I'm going to mostly
ignore the idea of type specimens, even though it is actually central to
the discussion. It'll still be a tome, even though simplistic. Others
on seek-taxon can point out my mistakes :) and hopefully clarify the
utility of the approach.
--- Taxonomy ---
Taxonomists collect specimens from the field and use them to classify
clusters of organisms into groups at various levels (aka Ranks) in a
hierarchy (e.g., at the Species rank). These groups are generally
defined by a description of the characters from the specimens that can
be used to distinguish the groups from each other. The suite of
characters used to distinguish one group from another need not match
(i.e., overall_length might distinguish species A from species B but
number_of_Hairs might distinguish species B from species C). So, the
taxonomists have a "concept" of the group in mind when they write the
description of the species in a manuscript -- the description is the
manifestation of the taxonomic concept the person had in mind. The
taxonomist who first writes the description that defines a concept can
be called the "concept author". There are specific rules about how to
create the scientific name when a taxonomist wants to create a new
grouping (aka concept). Upon first creation of a concept A with name a,
the person creating the name a is usually called the name author and is
credited with the 'discovery'. Taxonomists generally try to preserve
this precedence, but don't worry too much about the concept author.
You might see a species written as 'Acer rubrum L.'; the 'L.' stands for
Linnaeus who is the name author (there are many abbreviations used). For
the very first definition of a concept and name, the name author and the
concept author are the same. Later on they are not the same. So far so
good.
Taxonomists usually disagree about how to classify (e.g., what
characters are important), and so they want to change the concepts as
time progresses and new information surfaces. They generally try to
distinguish the new concepts from some existing concepts by splitting or
lumping various concepts using new descriptions based on the earlier
descriptions. The nomenclature rules for those groups say how the new
concepts should be named. Usually, if a concept (A) with name (a) is
split into two new concepts (A' and A''), then one of those
retains the original name (a) and the other gets a new name (a'). So
now there are 3 concepts in existence (A, A', and A''), but only two
names (a and a'). Thus, the name 'a' actually can be used to refer to
two distinct concepts (A and A') with distinct definitions. The name
author for 'a' is still the same, but the concept author is different
for A and A'.
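
As a rough illustration only, the split described above can be sketched
in Python. The class, field names, and the authors "Smith" and "Jones"
are all hypothetical placeholders, not part of any seek-taxon schema, and
the sketch covers only the split case (one name per concept).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str      # hypothetical identifier, e.g. "A"
    name: str            # scientific name attached to this concept
    name_author: str     # who first published the name
    concept_author: str  # who wrote this particular description


# Original concept A with name "a", described by a hypothetical "Smith".
A = Concept("A", "a", name_author="Smith", concept_author="Smith")

# A hypothetical later reviser "Jones" splits A into A' and A''.
# A' keeps the name "a"; A'' gets the new name "a'".
A1 = Concept("A'", "a", name_author="Smith", concept_author="Jones")
A2 = Concept("A''", "a'", name_author="Jones", concept_author="Jones")

concepts = [A, A1, A2]


def concepts_for_name(name, concepts):
    """Return the ids of every concept that a bare name could refer to."""
    return [c.concept_id for c in concepts if c.name == name]


names = sorted({c.name for c in concepts})

print(len(concepts), "concepts,", len(names), "names")
print(concepts_for_name("a", concepts))  # ["A", "A'"]
```

Three concepts share two names, and looking up the bare name "a"
returns two candidate concepts with different concept authors.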
So the current situation is that one name can refer to many concepts,
AND that one concept can have many names. Quite ambiguous. There are
millions of species concepts in most views of things, and many of them
have been revised multiple times over a several-hundred-year history of
classification. Also, the species are organized into higher-level ranks
(e.g., genera), and these have the same name/concept issues as the
lowest-level ranks. Egad.
--- Biology using taxonomy ---
Biologists use scientific names to identify organisms in the field and
elsewhere. When they collect data, they use a field guide or otherwise
learn to identify species according to the descriptions of the species,
usually provided in a field guide or other authority. Thus, if you know
the name that the biologist used to identify an organism and the
reference that contains the description of the concept that that name
refers to, you have a good idea of exactly what concept the biologist
thought the organism was. Unfortunately, most biologists do not write
down the authoritative reference that they were using to identify
species, instead providing the name only in their data sets (and
sometimes they provide the name author, especially for plants). Thus, a
biologist who references name 'a' in a dataset in 1950 might be
referring to a different taxonomic concept than another biologist who
references name 'a' later, say in 2000. Thus, if you were to do a
retrospective analysis of the properties of 'a' over time without
resolving the concepts that 'a' refers to, you'd be comparing apples and
oranges (or, more likely, apples and McIntosh apples). In studies like
biodiversity studies, this could result in inflation or deflation of
changes in species abundance simply based on changes in the predominant
view of species classifications.
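
A small worked example of that inflation/deflation effect, sketched in
Python. The counts are invented purely for illustration, and the sketch
assumes that concept A (name 'a' in 1950) was later split into A' (still
named 'a') and A'' (named "a'").

```python
# Invented survey counts, keyed by the name written in each data set.
counts_1950 = {"a": 100}            # 'a' meant concept A
counts_2000 = {"a": 60, "a'": 40}   # 'a' now means A'; "a'" means A''

# Names in the 2000 survey whose concepts are included in the 1950 concept A
# (an assumption of this example, not something the data sets state).
PARTS_OF_A = ["a", "a'"]

# Comparing by name alone suggests a decline.
naive_change = counts_2000["a"] - counts_1950["a"]

# Comparing by concept (A against A' plus A'') shows no change at all.
aware_change = sum(counts_2000[n] for n in PARTS_OF_A) - counts_1950["a"]

print(naive_change)  # -40
print(aware_change)  # 0
```

The apparent loss of 40 individuals comes entirely from the
reclassification, not from any change in the organisms counted.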
--- Taxon databases ---
Typical taxonomic databases today use the taxon name as a surrogate for
the concept, and so are actually just lists of names. The seek-taxon
group (along with others in the taxonomic community) is proposing that a
better approach is to explicitly model the distinction between taxonomic
concepts and taxonomic names by specifying a unique identifier for
every concept ever created based on the reference in which it was
described (yep, BIG task), and then associating with it the many names
that have been used to refer to that concept. They also propose that
relationships among concepts can be mapped (e.g., that concept A defines
a superset of concept A'). The relationships between concepts
recognized now are: congruent, includes, included in, overlaps, excludes
(see
http://www.bgbm.org/BioDivInf/Projects/MoreTax/standard_liste_en.htm for
details).
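
One way to picture the five relationships is to treat each concept as
the set of specimens it would cover. That is a simplifying assumption
made here for illustration (real concepts are defined by descriptions,
not enumerated sets), and the specimen labels below are made up.

```python
def relationship(a: frozenset, b: frozenset) -> str:
    """Name the relationship of concept a to concept b, with each concept
    modelled (as an assumption) as the set of specimens it covers."""
    if a == b:
        return "congruent"
    if a > b:            # a is a proper superset of b
        return "includes"
    if a < b:            # a is a proper subset of b
        return "included in"
    if a & b:            # some shared specimens, neither contains the other
        return "overlaps"
    return "excludes"    # nothing in common


# Made-up specimen sets: A is split into A1 (A') and A2 (A'').
A = frozenset({"s1", "s2", "s3", "s4"})
A1 = frozenset({"s1", "s2"})
A2 = frozenset({"s3", "s4"})
B = frozenset({"s2", "s3", "s5"})

print(relationship(A, A))    # congruent
print(relationship(A, A1))   # includes
print(relationship(A1, A))   # included in
print(relationship(A, B))    # overlaps
print(relationship(A1, A2))  # excludes
```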
--- Using taxon concepts ---
OK, so to your example data set. Your dataset has a series of taxonomic
names at various ranks. If you take the 'sci_name' column as
representing the species rank, you have a dataset that identifies only a
name, not a concept. This alone does not unambiguously tell you what
the biologist was referring to. The metadata provides references to the
authorities that they used for species identification, and so
(theoretically at least :), you could find each name in those references
and get an exact concept definition that the biologist meant. Of
course, this requires matching the name and reference in the metadata to
a corresponding name/concept in the seek-taxon concept database.
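
A minimal sketch of that matching step in Python, assuming a lookup
keyed on (name, reference). The registry contents, the concept ids, and
the short label "USFWS 1991" for the checklist cited in the metadata are
placeholders invented for this example, not entries from a real
seek-taxon database.

```python
# Hypothetical concept registry: (scientific name, reference) -> concept id.
CONCEPTS = {
    ("aix sponsa", "USFWS 1991"): "concept-0001",
    ("ardea alba", "USFWS 1991"): "concept-0002",
}


def resolve(sci_name, reference, registry=CONCEPTS):
    """Look up the concept for a name as used in a given reference.

    Returns None when the name/reference pair is not registered.
    """
    # Normalise case and whitespace so "Ardea  alba" matches "ardea alba".
    key = (" ".join(sci_name.lower().split()), reference)
    return registry.get(key)


# Names from the data set, all identified against the same reference.
rows = ["aix sponsa", "Ardea  alba", "chaetura vauxi"]
resolved = {name: resolve(name, "USFWS 1991") for name in rows}

for name, concept_id in resolved.items():
    print(name, "->", concept_id)
```

A name that is missing from the registry, or the same name paired with
a different reference, resolves to None rather than to a guess.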
Once you've done this, you have more ability to reason about the
relationships among data items in different data sets. For example, if
I want to search for data about name "a", a taxon concept resolution
service might tell me that the name has been used to represent concepts
A and A' at different times, and that searching for *all* names for both
A and A' might be what the user wants (a type of query expansion). Thus
a better query can be defined semi-automatically. Also, if a user wants
to combine two datasets from different times that use taxonomic names,
the concept database might be used to reason about the relationships
between concepts used in the different data sources (e.g., that the name
'a' used in dataset 1 refers to concept A but the name 'a' used in dataset
2 refers to A', and so the data MAY not be comparable). I say MAY
because this is an extremely subjective and subtle decision -- it really
depends on what kinds of measurements the scientist is comparing and why
they are comparing them. So scientific judgement will be critical here.
Sometimes we'll be able to tell a nice tidy relationship among concepts
(for congruence, superset, subset) but other times it might be ambiguous
(intersection, disjunction). Finally, when we know only a taxon name
and have no info about the concept, some contextual information about
the data may allow us to assign a probabilistic estimate of the name
representing a series of concepts. For example, if the data contained
name 'a' and was collected in 1900, and the only concept that used name
'a' in 1900 was A, then we might assign a high probability that 'a'
represented A in that data. Later on, we might see that 'a' is used in
data collected in 2004, and we might know that everyone has thought that
A' and A'' are the right concepts to use, so no data has referred to A
since 1930, and there is a high probability that 'a' references A'.
Part of the seek-taxon work is to work on probabilistic approaches to
making these judgements based on various corpora.
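
Both ideas (query expansion and date-based guessing) can be sketched in
a few lines of Python. Everything here is an assumption made for
illustration: the usage table, the synonym 'b', the year ranges, and
especially the uniform weighting, which is a crude stand-in for whatever
probabilistic model the seek-taxon work ends up with.

```python
# Hypothetical usage records: (name, concept id, first year, last year).
# A last year of None means the usage is still current.
USAGE = [
    ("a", "A", 1850, 1930),
    ("a", "A'", 1931, None),
    ("b", "A'", 1960, None),    # invented synonym for concept A'
    ("a'", "A''", 1931, None),
]


def expand_query(name, usage=USAGE):
    """Return all names attached to any concept that `name` has denoted."""
    concepts = {c for n, c, _, _ in usage if n == name}
    return sorted({n for n, c, _, _ in usage if c in concepts})


def concept_probabilities(name, year, usage=USAGE):
    """Spread probability evenly over the concepts `name` denoted in `year`.

    Uniform weights over a hard date cut-off are an assumption; a real
    estimate would be softer than 1.0 for a single candidate.
    """
    candidates = [
        c for n, c, start, end in usage
        if n == name and start <= year and (end is None or year <= end)
    ]
    if not candidates:
        return {}
    weight = 1.0 / len(candidates)
    return {c: weight for c in candidates}


print(expand_query("a"))                 # ['a', 'b']
print(concept_probabilities("a", 1900))  # {'A': 1.0}
print(concept_probabilities("a", 2004))  # {"A'": 1.0}
```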
To address your last question (what needs to be in a concept schema),
you might want to review the current concept schema that the seek-taxon
group (and J. Kennedy in particular) has been developing (it's in CVS).
Hope this has helped. It certainly took a while to write, even though I
know it's pretty sloppy in some places. Unfortunately, I'm in an NSF
site review all next week so won't be able to respond to any questions,
but I hope the rest of the seek-taxon group can follow up and correct my
mistakes, and I will try to do the same the week after next.
Cheers,
Matt
Shawn Bowers wrote:
>
> Hi,
>
> I recently found this statement in the methods section of an EML document:
>
> "Nomenclature for common names follow the 1987 edition of the
> National Geographic Society's field guide, 'Bird's of North
> America'. Species codes used are those of the American
> Ornithologist's Union. The USFWS Checklist OF Vertebrates, 1991,
> was used to quantify scientific names from the common names."
>
> Can anyone help me interpret what these two sentences mean, and how I
> might use the Taxon-group work to "understand/resolve" the actual
> species references in a dataset based on the above sentence? Here is a
> snippet from the corresponding dataset with only those columns that
> refer to something "taxonomic". (Note that there are actually 23 columns
> in the dataset and about 165 rows.)
>
>
> class  tax_order      family    sci_name        aoucode  commonname
> -----  -------------  --------  --------------  -------  -----------
> aves   anseriformes   anatidae  aix sponsa      wodu     wood duck
> aves   apodiformes    apodidae  chaetura vauxi  vasw     vauxs swift
> aves   ciconiiformes  ardeidae  ardea alba      greg     great egret
> ...
>
> I am particularly interested in understanding the relationship between
> the concept XML schema and its use for "registering" or "mapping" this
> data set to information captured in the concept work. For example, if I
> want to search for datasets based on taxonomic concepts.
>
> (You have to be patient with me because I am clueless about these
> issues), but it seems like the common name and aoucode represent
> redundant information: the aoucode is some kind of convention for
> representing the common name, and the class/tax_order/family/sci_name
> uniquely identifies the common name? How would one align this dataset
> with an instantiated taxon concept schema -- in particular, what
> information would need to be available in the instantiated concept schema?
>
> Any help is greatly appreciated,
>
> Shawn