No subject

Tue Mar 22 16:43:01 PST 2005

agreed on the model of concepts to use then the relevant bit of EML will be
changed. Eventually the EML section will allow ecologists to register
concepts (for our purposes at the moment - a full scientific name e.g. Aus
bus L. with the reference to the publication the concept was described in).
It is possible that we will need to develop tools to help the ecologist mark
up their data with concepts. So part of the XML Schema will be incorporated
into EML.
Also, I believe that at the last Taxon group meeting we agreed to ignore
common names for the time being. So an ecologist would only give the
scientific name and publication (as a mechanism to identify the organism
they recorded in the field) - they wouldn't give the full hierarchy from
Kingdom down to species (which is really classificatory information or a
means to access names/concepts which would be held in the SEEK DB).

If you have any specific questions, maybe I can help while Matt is busy....

Jessie

> -----Original Message-----
> From: Matt Jones [mailto:jones at nceas.ucsb.edu]
> Sent: 06 March 2004 03:18
> To: Shawn Bowers
> Cc: seek-taxon at ecoinformatics.org; seek-kr-sms at ecoinformatics.org
> Subject: Re: [SEEK-Taxon] Question about EML
> 
> 
> Hi Shawn,
> 
> I am not a taxonomist, and I have only a superficial view of these
> issues, but I think I can help clarify with a highly simplistic
> exposition of the issues.  So here's a simplified version of the issue
> and how concept info helps to resolve the problems. I'm going to mostly
> ignore the idea of type specimens, even though it is actually central to
> the discussion. It'll still be a tome, even though simplistic.  Others
> on seek-taxon can point out my mistakes :) and hopefully clarify the
> utility of the approach.
> 
> --- Taxonomy ---
> Taxonomists collect specimens from the field and use them to classify
> clusters of organisms into groups at various levels (aka Ranks) in a
> hierarchy (e.g., at the Species rank).  These groups are generally
> defined by a description of the characters from the specimens that can
> be used to distinguish the groups from each other.  The suite of
> characters used to distinguish one group from another need not match
> (i.e., overall_length might distinguish species A from species B but
> number_of_Hairs might distinguish species B from species C).  So, the
> taxonomists have a "concept" of the group in mind when they write the
> description of the species in a manuscript -- the description is the
> manifestation of the taxonomic concept the person had in mind.  The
> taxonomist who first writes the description that defines a concept can
> be called the "concept author".  There are specific rules about how to
> create the scientific name when a taxonomist wants to create a new
> grouping (aka concept).  Upon first creation of a concept A with name a,
> the person creating the name a is usually called the name author and is
> credited with the 'discovery'. Taxonomists generally try to preserve
> this precedence, but don't worry too much about the concept author.
> You might see a species written as 'Acer rubrum L.'; the 'L.' stands for
> Linnaeus, who is the name author (there are many abbreviations used). For
> the very first definition of a concept and name, the name author and the
> concept author are the same.  Later on they are not the same.  So far so
> good.
> 
> Taxonomists usually disagree about how to classify (e.g., what
> characters are important), and so they want to change the concepts as
> time progresses and new information surfaces.  They generally try to
> distinguish the new concepts from some existing concepts by splitting or
> lumping various concepts using new descriptions based on the earlier
> descriptions.  The nomenclature rules for those groups say how the new
> concepts should be named.  Usually, if a concept (A) with name (a) is
> split into two new concepts (A' and A'') then one of those
> retains the original name (a) and the other gets a new name (a').  So
> now there are 3 concepts in existence (A, A', and A''), but only two
> names (a and a').  Thus, the name 'a' actually can be used to refer to
> two distinct concepts (A and A') with distinct definitions.  The name
> author for 'a' is still the same, but the concept author is different
> for A and A'.
> 
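As a rough illustration of the split described above, here is a minimal
Python sketch. The class, its field names and the author labels
("Author1", "Author2") are hypothetical and are not taken from the SEEK
concept schema; it only shows that three concepts end up sharing two
names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    label: str           # concept label as used in the text, e.g. "A'"
    name: str            # scientific name attached to the concept
    name_author: str     # who first published the name
    concept_author: str  # who wrote the description defining the concept


# Hypothetical authors: Author1 described A, Author2 later split it.
A = Concept("A", "a", "Author1", "Author1")
A1 = Concept("A'", "a", "Author1", "Author2")    # retains the original name
A2 = Concept("A''", "a'", "Author2", "Author2")  # gets a new name

concepts = [A, A1, A2]


def concepts_for_name(name, concepts):
    """Return the labels of every concept that the given name can refer to."""
    return [c.label for c in concepts if c.name == name]


print(concepts_for_name("a", concepts))   # the name 'a' is ambiguous
print(len({c.name for c in concepts}))    # number of distinct names
```

The name 'a' alone maps to both A and A', which is why the name has to be
paired with a reference before it identifies a concept.
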
> So the current situation is that one name can refer to many concepts,
> AND that one concept can have many names.  Quite ambiguous.  There are
> millions of species concepts in most views of things, and many of them
> have been revised multiple times over a several hundred year history of
> classification.  Also, the species are organized into higher level ranks
> (e.g., genera), and these have the same name/concept issues as the
> lowest level ranks.  Egad.
> 
> --- Biology using taxonomy ---
> Biologists use scientific names to identify organisms in the field and
> elsewhere.  When they collect data, they use a field guide or otherwise
> learn to identify species according to the descriptions of the species,
> usually provided in a field guide or other authority.  Thus, if you know
> the name that the biologist used to identify an organism and the
> reference that contains the description of the concept that that name
> refers to, you have a good idea of exactly what concept the biologist
> thought the organism was.  Unfortunately, most biologists do not write
> down the authoritative reference that they were using to identify
> species, instead providing the name only in their data sets (and
> sometimes they provide the name author, especially for plants).  Thus, a
> biologist who references name 'a' in a dataset in 1950 might be
> referring to a different taxonomic concept than another biologist who
> references name 'a' later, say in 2000.  Thus, if you were to do a
> retrospective analysis of the properties of 'a' over time without
> resolving the concepts that 'a' refers to, you'd be comparing apples and
> oranges (or, more likely, apples and McIntosh apples).  In studies like
> biodiversity studies, this could result in inflation or deflation of
> changes in species abundance simply based on changes in the predominant
> view of species classifications.
> 
> --- Taxon databases ---
> Typical taxonomic databases today use the taxon name as a surrogate for
> the concept, and so are actually just lists of names.  The seek-taxon
> group (along with others in the taxonomic community) is proposing that a
> better approach is to explicitly model the distinction between taxonomic
> concepts and taxonomic names by specifying a unique identifier for
> every concept ever created based on the reference in which it was
> described (yep, BIG task), and then associate the many names that have
> been used to refer to that concept.  They also propose that
> relationships among concepts can be mapped (e.g., that concept A defines
> a superset of concept A').  The relationships between concepts
> recognized now are: congruent, includes, included in, overlaps, excludes
> (see
> http://www.bgbm.org/BioDivInf/Projects/MoreTax/standard_liste_en.htm
> for details).
> 
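One way to picture the five relationships listed above is to treat each
concept as the set of specimens it covers. That is an assumption made
purely for illustration (in practice the relationships are asserted by
taxonomists, not computed), and the specimen IDs below are made up. A
minimal Python sketch:

```python
def relationship(x: frozenset, y: frozenset) -> str:
    """Classify how concept x relates to concept y, given specimen sets."""
    if x == y:
        return "congruent"
    if x > y:            # x is a proper superset of y
        return "includes"
    if x < y:            # x is a proper subset of y
        return "included in"
    if x & y:            # some, but not all, specimens are shared
        return "overlaps"
    return "excludes"    # no specimens in common


# Hypothetical specimen sets: A was split into A' and A''.
A = frozenset({1, 2, 3, 4})
A1 = frozenset({1, 2})
A2 = frozenset({3, 4})

print(relationship(A, A1))    # A against A'
print(relationship(A1, A))    # A' against A
print(relationship(A1, A2))   # A' against A''
```

Under this toy model A includes A', A' is included in A, and the two
halves of the split exclude each other.
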
> --- Using taxon concepts ---
> OK, so to your example data set.  Your dataset has a series of taxonomic
> names at various ranks.  If you take the 'sci_name' column as
> representing the species rank, you have a dataset that identifies only a
> name, not a concept.  This alone does not unambiguously tell you what
> the biologist was referring to.  The metadata provides references to the
> authorities that they used for species identification, and so
> (theoretically at least :), you could find each name in those references
> and get an exact concept definition that the biologist meant.  Of
> course, this requires matching the name and reference in the metadata to
> a corresponding name/concept in the seek-taxon concept database.
> 
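The matching step could be sketched roughly as below. This assumes the
concept database can be queried by a (name, reference) pair; the
dictionary, the concept identifiers and the function names are
hypothetical stand-ins, not the real seek-taxon interface. The names and
the reference string come from Shawn's example.

```python
def normalize(name: str) -> str:
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


# Hypothetical concept registry keyed by (scientific name, reference).
# The identifiers on the right are invented for this example.
REFERENCE = "USFWS Checklist of Vertebrates, 1991"
CONCEPTS = {
    ("aix sponsa", REFERENCE): "concept-001",
    ("chaetura vauxi", REFERENCE): "concept-002",
    ("ardea alba", REFERENCE): "concept-003",
}


def resolve(sci_name: str, reference: str, concepts=CONCEPTS):
    """Return the concept id for a name as used in a reference, or None."""
    return concepts.get((normalize(sci_name), reference))


print(resolve("Aix  sponsa", REFERENCE))          # known name + reference
print(resolve("aix sponsa", "some other guide"))  # reference not registered
```

A name without a registered reference resolves to nothing here, which
mirrors the point that the name alone does not identify a concept.
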
> Once you've done this, you have more ability to reason about the
> relationships among data items in different data sets.  For example, if
> I want to search for data about name "a", a taxon concept resolution
> service might tell me that the name has been used to represent concepts
> A and A' at different times, and that searching for *all* names for both
> A and A' might be what the user wants (a type of query expansion).  Thus
> a better query can be defined semi-automatically.  Also, if a user wants
> to combine two datasets from different times that use taxonomic names,
> the concept database might be used to reason about the relationships
> between concepts used in the different data sources (e.g., that the name
> 'a' used in dataset 1 refers to concept A but the name 'a' used in
> dataset 2 refers to A'), and so the data MAY not be comparable.  I say
> MAY because this is an extremely subjective and subtle decision -- it
> really depends on what kinds of measurements the scientist is comparing
> and why they are comparing them.  So scientific judgement will be
> critical here.  Sometimes we'll be able to tell a nice tidy relationship
> among concepts (for congruence, superset, subset) but other times it
> might be ambiguous (intersection, disjunction).  Finally, when we know
> only a taxon name and have no info about the concept, some contextual
> information about the data may allow us to assign a probabilistic
> estimate of the name representing a series of concepts.  For example, if
> the data contained name 'a' and was collected in 1900, and the only
> concept that used name 'a' in 1900 was A, then we might assign a high
> probability that 'a' represented A in that data.  Later on, we might see
> that 'a' is used in data collected in 2004, and we might know that
> everyone has thought that A' and A'' are the right concepts to use so no
> data has referred to A since 1930, so there is a high probability that
> 'a' references A'.  Part of the seek-taxon work is to work on
> probabilistic approaches to making these judgements based on various
> corpora.
> 
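Both ideas above (query expansion and a date-based guess at the concept)
can be sketched in a few lines of Python. Everything in the table is an
assumption for illustration: the start year 1850, the synonym 'b', and
the uniform weighting over candidate concepts. Real probabilities would
have to come from the corpora mentioned above.

```python
# (concept, name, first_year, last_year); last_year None = still in use.
USAGE = [
    ("A", "a", 1850, 1929),    # not referred to since 1930
    ("A'", "a", 1930, None),   # retained the original name after the split
    ("A'", "b", 1930, None),   # hypothetical synonym for A'
    ("A''", "a'", 1930, None),
]


def concepts_for(name, year=None):
    """Concepts a name has referred to, optionally restricted to one year."""
    found = set()
    for concept, used_name, first, last in USAGE:
        if used_name != name:
            continue
        if year is not None and (year < first or (last is not None and year > last)):
            continue
        found.add(concept)
    return found


def expand_query(name):
    """All names ever used for any concept the given name has referred to."""
    concepts = concepts_for(name)
    return sorted({n for c, n, _, _ in USAGE if c in concepts})


def concept_probabilities(name, year):
    """Spread probability evenly over the concepts in use for that year."""
    candidates = sorted(concepts_for(name, year))
    return {c: 1 / len(candidates) for c in candidates}


print(expand_query("a"))                 # names to search for
print(concept_probabilities("a", 1900))  # data collected in 1900
print(concept_probabilities("a", 2004))  # data collected in 2004
```

With this toy table the 1900 record resolves to A and the 2004 record to
A', matching the example in the text; an unknown name yields an empty
result rather than a guess.
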
> To address your last question (what needs to be in a concept schema),
> you might want to review the current concept schema that the seek-taxon
> group (and J. Kennedy in particular) has been developing (it's in CVS).
> 
> Hope this has helped.  It certainly took a while to write, even though I
> know it's pretty sloppy in some places.  Unfortunately, I'm in an NSF
> site review all next week so won't be able to respond to any questions,
> but I hope the rest of the seek-taxon group can follow up and correct my
> mistakes, and then I will try to during the week after next.
> 
> Cheers,
> Matt
> 
> Shawn Bowers wrote:
> > 
> > Hi,
> > 
> > I recently found this statement in the methods section of an EML
> > document:
> > 
> >       "Nomenclature for common names follow the 1987 edition of the
> >     National Geographic Society's field guide, 'Bird's of North
> >     America'. Species codes used are those of the American
> >     Ornithologist's Union. The USFWS Checklist OF Vertebrates, 1991,
> >     was used to quantify scientific names from the common names."
> > 
> > Can anyone help me interpret what these two sentences mean, and how I
> > might use the Taxon-group work to "understand/resolve" the actual
> > species references in a dataset based on the above sentence? Here is a
> > snippet from the corresponding dataset with only those columns that
> > refer to something "taxonomic". (Note that there are actually 23
> > columns in the dataset and about 165 rows.)
> > 
> > 
> > class  tax_order      family    sci_name        aoucode  commonname
> > -----  ---------      ------    --------        -------  ----------
> > aves   anseriformes   anatidae  aix sponsa      wodu     wood duck
> > aves   apodiformes    apodidae  chaetura vauxi  vasw     vauxs wift
> > aves   ciconiiformes  ardeidae  ardea alba      greg     great egret
> > ...
> > 
> > I am particularly interested in understanding the relationship between
> > the concept XML schema and its use for "registering" or "mapping" this
> > data set to information captured in the concept work.  For example, if
> > I want to search for datasets based on taxonomic concepts.
> > 
> > (You have to be patient with me because I am clueless about these
> > issues), but it seems like the common name and aoucode represent
> > redundant information: the aoucode is some kind of convention for
> > representing the common name, and the class/tax_order/family/sci_name
> > uniquely identifies the common name? How would one align this dataset
> > with an instantiated taxon concept schema -- in particular, what
> > information would need to be available in the instantiated concept
> > schema?
> > 
> > Any help is greatly appreciated,
> > 
> > Shawn
> > 
> > 
> > 
> > 
> > 
> > _______________________________________________
> > seek-taxon mailing list
> > seek-taxon at ecoinformatics.org
> > http://www.ecoinformatics.org/mailman/listinfo/seek-taxon
> 