[seek-kr-sms] Re: [SEEK-Taxon] Question about EML

Matt Jones jones at nceas.ucsb.edu
Fri Mar 5 19:18:23 PST 2004


Hi Shawn,

I am not a taxonomist, and I have only a superficial view of these 
issues, but I think I can help clarify with a highly simplistic 
exposition of the issues.  So here's a simplified version of the issue 
and how concept info helps to resolve the problems. I'm going to mostly 
ignore the idea of type specimens, even though it is actually central to 
the discussion. It'll still be a tome, even though simplistic.  Others 
on seek-taxon can point out my mistakes :) and hopefully clarify the 
utility of the approach.

--- Taxonomy ---
Taxonomists collect specimens from the field and use them to classify 
clusters of organisms into groups at various levels (aka Ranks) in a 
hierarchy (e.g., at the Species rank).  These groups are generally 
defined by a description of the characters from the specimens that can 
be used to distinguish the groups from each other.  The suite of 
characters used to distinguish one group from another need not match 
(i.e., overall_length might distinguish species A from species B but 
number_of_Hairs might distinguish species B from species C).  So, the 
taxonomists have a "concept" of the group in mind when they write the 
description of the species in a manuscript -- the description is the 
manifestation of the taxonomic concept the person had in mind.  The 
taxonomist who first writes the description that defines a concept can 
be called the "concept author".  There are specific rules about how to 
create the scientific name when a taxonomist wants to create a new 
grouping (aka concept).  Upon first creation of a concept A with name a, 
the person creating the name a is usually called the name author and is 
credited with the 'discovery'. Taxonomists generally try to preserve 
this precedence, but don't worry too much about the concept author. 
You might see a species written as 'Acer rubrum L.'; the 'L.' stands for 
Linnaeus who is the name author (there are many abbreviations used). For 
the very first definition of a concept and name, the name author and the 
concept author are the same.  Later on they are not the same.  So far so 
good.

Taxonomists usually disagree about how to classify (e.g., what 
characters are important), and so they want to change the concepts as 
time progresses and new information surfaces.  They generally try to 
distinguish the new concepts from some existing concepts by splitting or 
lumping various concepts using new descriptions based on the earlier 
descriptions.  The nomenclature rules for those groups say how the new 
concepts should be named.  Usually, if a concept (A) with name (a) is 
split into two new concepts (A' and A''), one of them retains the 
original name (a) and the other gets a new name (a').  So now there 
are 3 concepts in existence (A, A', and A''), but only two 
names (a and a').  Thus, the name 'a' actually can be used to refer to 
two distinct concepts (A and A') with distinct definitions.  The name 
author for 'a' is still the same, but the concept author is different 
for A and A'.
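
To make the bookkeeping concrete, here is a minimal Python sketch of the 
split just described.  The class name Concept, its fields, and the 
placeholder authors "Author1" and "Author2" are all assumptions made up 
for illustration; none of this comes from the seek-taxon schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A taxonomic concept: a name as circumscribed by one description."""
    concept_id: str      # hypothetical identifier, e.g. "A"
    name: str            # scientific name attached to the concept
    name_author: str     # who first published the name
    concept_author: str  # who wrote this particular description


# Original concept A with name 'a': name author and concept author coincide.
A = Concept("A", "a", name_author="Author1", concept_author="Author1")

# A later taxonomist (Author2) splits A into A' and A''.
# A' retains the name 'a' and its name author; A'' gets the new name a'.
A1 = Concept("A'", "a", name_author="Author1", concept_author="Author2")
A2 = Concept("A''", "a'", name_author="Author2", concept_author="Author2")

# Index the concepts by name to expose the ambiguity.
concepts_by_name = defaultdict(list)
for c in (A, A1, A2):
    concepts_by_name[c.name].append(c.concept_id)

print(dict(concepts_by_name))
```

Three concepts end up indexed under only two names, and the name 'a' 
points at both A and A'.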

So the current situation is that one name can refer to many concepts, 
AND that one concept can have many names.  Quite ambiguous.  There are 
millions of species concepts in most views of things, and many of them 
have been revised multiple times over a several hundred year history of 
classification.  Also, the species are organized into higher level ranks 
(e.g., genera), and these have the same name/concept issues as the 
lowest level ranks.  Egad.

--- Biology using taxonomy ---
Biologists use scientific names to identify organisms in the field and 
elsewhere.  When they collect data, they use a field guide or otherwise 
learn to identify species according to the descriptions of the species, 
usually provided in a field guide or other authority.  Thus, if you know 
the name that the biologist used to identify an organism and the 
reference that contains the description of the concept that that name 
refers to, you have a good idea of exactly what concept the biologist 
thought the organism was.   Unfortunately, most biologists do not write 
down the authoritative reference that they were using to identify 
species, instead providing the name only in their data sets (and 
sometimes they provide the name author, especially for plants).  Thus, a 
biologist who references name 'a' in a dataset in 1950 might be 
referring to a different taxonomic concept than another biologist who 
references name 'a' later, say in 2000.  Thus, if you were to do a 
retrospective analysis of the properties of 'a' over time without 
resolving the concepts that 'a' refers to, you'd be comparing apples and 
oranges (or, more likely, apples and McIntosh apples).  In studies like 
biodiversity studies, this could result in inflation or deflation of 
changes in species abundance simply based on changes in the predominant 
view of species classifications.

--- Taxon databases ---
Typical taxonomic databases today use the taxon name as a surrogate for 
the concept, and so are actually just lists of names.  The seek-taxon 
group (along with others in the taxonomic community) is proposing that a 
better approach is to explicitly model the distinction between taxonomic 
concepts and taxonomic names by specifying a unique identifier for 
every concept ever created based on the reference in which it was 
described (yep, BIG task), and then associate the many names that have 
been used to refer to that concept.  They also propose that 
relationships among concepts can be mapped (e.g., that concept A defines 
a superset of concept A').  The relationships between concepts 
recognized now are: congruent, includes, included in, overlaps, excludes 
(see 
http://www.bgbm.org/BioDivInf/Projects/MoreTax/standard_liste_en.htm for 
details).
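
One rough way to picture the five relationships is to pretend each 
concept is simply the set of specimens it covers.  That is a big 
simplification (in practice an expert asserts the relationship rather 
than computing it), and the function name and specimen labels below are 
hypothetical, but it shows how the terms line up with set comparisons:

```python
def relationship(x: frozenset, y: frozenset) -> str:
    """Name the relationship of concept x to concept y, treating each
    concept as the set of specimens it circumscribes (a simplification)."""
    if x == y:
        return "congruent"
    if x > y:          # x is a proper superset of y
        return "includes"
    if x < y:          # x is a proper subset of y
        return "included in"
    if x & y:          # some shared specimens, neither contains the other
        return "overlaps"
    return "excludes"  # no specimens in common


# Hypothetical specimen sets: A was split into A' and A''.
A = frozenset({"s1", "s2", "s3", "s4"})
A_prime = frozenset({"s1", "s2"})
A_double_prime = frozenset({"s3", "s4"})
B = frozenset({"s2", "s3"})  # an unrelated, partly overlapping concept

print(relationship(A, A_prime))
print(relationship(A_prime, A))
print(relationship(A_prime, A_double_prime))
print(relationship(B, A_prime))
```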

--- Using taxon concepts ---
OK, so to your example data set.  Your dataset has a series of taxonomic 
names at various ranks.  If you take the 'sci_name' column as 
representing the species rank, you have a dataset that identifies only a 
name, not a concept.  This alone does not unambiguously tell you what 
the biologist was referring to.  The metadata provides references to the 
authorities that they used for species identification, and so 
(theoretically at least :), you could find each name in those references 
and get an exact concept definition that the biologist meant.  Of 
course, this requires matching the name and reference in the metadata to 
a corresponding name/concept in the seek-taxon concept database.
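
As a sketch of that matching step, assume the concept database can be 
queried by a (name, reference) pair.  The registry contents, the 
reference labels, and the function names below are invented for 
illustration and do not describe the actual seek-taxon service:

```python
from typing import Optional

# Hypothetical concept registry: (normalized name, reference) -> concept id.
# The entries follow the running example: name 'a' meant concept A in an
# early reference and concept A' in a later one.
CONCEPTS = {
    ("a", "reference-1900"): "A",
    ("a", "reference-1930"): "A'",
    ("a'", "reference-1930"): "A''",
}


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def resolve(sci_name: str, reference: str) -> Optional[str]:
    """Return the concept id for a name as used in a reference, or None
    when the pair is not in the registry."""
    return CONCEPTS.get((normalize(sci_name), normalize(reference)))


print(resolve("A", "Reference-1900"))
print(resolve("a", "Reference-1930"))
print(resolve("a", "some other guide"))
```

The last call returns None: a name alone, without a reference the 
registry knows about, does not pin down a concept.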

Once you've done this, you have more ability to reason about the 
relationships among data items in different data sets.  For example, if 
I want to search for data about name "a", a taxon concept resolution 
service might tell me that the name has been used to represent concepts 
A and A' at different times, and that searching for *all* names for both 
A and A' might be what the user wants (a type of query expansion).  Thus 
a better query can be defined semi-automatically.  Also, if a user wants 
to combine two datasets from different times that use taxonomic names, 
the concept database might be used to reason about the relationships 
between concepts used in the different data sources (e.g., that the name 
'a' used in dataset 1 refers to concept A but the name 'a' used in dataset 
2 refers to A', and so the data MAY not be comparable).  I say MAY 
because this is an extremely subjective and subtle decision -- it really 
depends on what kinds of measurements the scientist is comparing and why 
they are comparing them.  So scientific judgement will be critical here. 
Sometimes we'll be able to tell a nice tidy relationship among concepts 
(for congruence, superset, subset) but other times it might be ambiguous 
(intersection, disjunction).  Finally, when we know only a taxon name 
and have no info about the concept, some contextual information about 
the data may allow us to assign a probabilistic estimate of the name 
representing a series of concepts.  For example, if the data contained 
name 'a' and was collected in 1900, and the only concept that used name 
a in 1900 was A, then we might assign a high probability that a 
represented A in that data.  Later on, we might see that a is used in 
data collected in 2004, and we might know that everyone has thought that 
A' and A'' are the right concepts to use so no data has referred to A 
since 1930, so there is a high probability that 'a' references A'. 
Part of the seek-taxon work is to work on probabilistic approaches to 
making these judgements based on various corpora.
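
Two of the ideas above, query expansion and date-based guessing, can be 
sketched in a few lines of Python.  Everything here is an assumption for 
illustration: the usage records, the start year 1850, the synonym 'b', 
and especially the even weighting, which is a crude stand-in for 
whatever probabilistic model the seek-taxon group ends up with.

```python
from typing import Dict, NamedTuple, Optional, Set


class Usage(NamedTuple):
    concept_id: str
    name: str
    first_year: int            # first year the name was used for the concept
    last_year: Optional[int]   # None means the usage is still current


# Hypothetical usage records following the example in the text: name 'a'
# meant concept A until 1930 and concept A' from then on.
USAGES = [
    Usage("A", "a", 1850, 1930),
    Usage("A'", "a", 1930, None),
    Usage("A'", "b", 1930, None),   # invented synonym for A'
    Usage("A''", "a'", 1930, None),
]


def expand_names(name: str) -> Set[str]:
    """Query expansion: every name used for any concept `name` has denoted."""
    concepts = {u.concept_id for u in USAGES if u.name == name}
    return {u.name for u in USAGES if u.concept_id in concepts}


def concept_probabilities(name: str, year: int) -> Dict[str, float]:
    """Spread probability evenly over the concepts that `name` could have
    denoted in `year` (even weighting is an assumption, not a real model)."""
    candidates = sorted({
        u.concept_id for u in USAGES
        if u.name == name
        and u.first_year <= year
        and (u.last_year is None or year <= u.last_year)
    })
    if not candidates:
        return {}
    weight = 1.0 / len(candidates)
    return {c: weight for c in candidates}


print(expand_names("a"))
print(concept_probabilities("a", 1900))
print(concept_probabilities("a", 2004))
```

With these made-up records, data collected in 1900 resolves to A and 
data collected in 2004 resolves to A'.  A name used in the changeover 
year 1930 stays split between the two.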

To address your last question (what needs to be in a concept schema), 
you might want to review the current concept schema that the seek-taxon 
group (and J. Kennedy in particular) has been developing (it's in CVS).

Hope this has helped.  It certainly took a while to write, even though I 
know it's pretty sloppy in some places.  Unfortunately, I'm in an NSF 
site review all next week so won't be able to respond to any questions, 
but I hope the rest of the seek-taxon group can follow up and correct my 
mistakes, and then I will try to follow up myself the week after next.

Cheers,
Matt

Shawn Bowers wrote:
> 
> Hi,
> 
> I recently found this statement in the methods section of an EML document:
> 
>       "Nomenclature for common names follow the 1987 edition of the
>     National Geographic Society's field guide, 'Bird's of North
>     America'. Species codes used are those of the American
>     Ornithologist's Union. The USFWS Checklist OF Vertebrates, 1991,
>     was used to quantify scientific names from the common names."
> 
> Can anyone help me interpret what these two sentences mean, and how I 
> might use the Taxon-group work to "understand/resolve" the actual 
> species references in a dataset based on the above sentence? Here is a 
> snippet from the corresponding dataset with only those columns that 
> refer to something "taxonomic". (Note that there are actually 23 columns 
> in the dataset and about 165 rows.)
> 
> 
> class  tax_order      family    sci_name        aoucode  commonname
> -----  ---------      ------    --------        -------  ----------
> aves   anseriformes   anatidae  aix sponsa      wodu     wood duck
> aves   apodiformes    apodidae  chaetura vauxi  vasw     vauxs swift
> aves   ciconiiformes  ardeidae  ardea alba      greg     great egret
> ...
> 
> I am particularly interested in understanding the relationship between 
> the concept XML schema and its use for "registering" or "mapping" this 
> data set to information captured in the concept work.  For example, if I 
> want to search for datasets based on taxonomic concepts.
> 
> (You have to be patient with me because I am clueless about these 
> issues), but it seems like the common name and aoucode represent 
> redundant information: the aoucode is some kind of convention for 
> representing the common name, and the class/tax_order/family/sci_name 
> uniquely identifies the common name? How would one align this dataset 
> with an instantiated taxon concept schema -- in particular, what 
> information would need to be available in the instantiated concept schema?
> 
> Any help is greatly appreciated,
> 
> Shawn
> 
> 
> 
> 
> 




