[seek-kr-sms] Re: [SEEK-Taxon] Question about EML
Shawn Bowers
bowers at sdsc.edu
Tue Mar 9 09:08:52 PST 2004
Matt, Robert, Jessie, and Nico,
Thanks for all the feedback on this question. I really appreciate it
and have a better understanding of the taxon concept approach.
I don't have questions at this time, but may in the future :)
Thanks a lot!
Shawn
Kennedy, Jessie wrote:
> Hi Matt/Shawn
>
> I think Matt gave a very good summary of the situation - enough detail to
> get the point over without missing anything vital to the understanding of
> the problem. I didn't read anything that needs "correcting" - only as Matt
> says we could expand more into related issues like type specimens and
> exceptions to the rules but I don't think that is useful just now.
>
> From Shawn's point of view - I guess the thing to note is that when we've
> agreed on the model of concepts to use then the relevant bit of EML will be
> changed. Eventually the EML section will allow ecologists to register
> concepts (for our purposes at the moment - a full scientific name e.g. Aus
> bus L. with the reference to the publication the concept was described in).
> It is possible that we will need to develop tools to help the ecologist mark
> up their data with concepts. So part of the XML Schema will be incorporated
> into EML.
> Also, I believe that at the last Taxon group meeting we agreed to ignore
> common names for the time being. So an ecologist would only give the
> scientific name and publication (as a mechanism to identify the organism
> they recorded in the field) - they wouldn't give the full hierarchy from
> Kingdom down to species (which is really classificatory information or a
> means to access names/concepts which would be held in the SEEK DB).
>
> If you have any specific questions, maybe I can help while Matt is busy....
>
> Jessie
>
>
>
>>-----Original Message-----
>>From: Matt Jones [mailto:jones at nceas.ucsb.edu]
>>Sent: 06 March 2004 03:18
>>To: Shawn Bowers
>>Cc: seek-taxon at ecoinformatics.org; seek-kr-sms at ecoinformatics.org
>>Subject: Re: [SEEK-Taxon] Question about EML
>>
>>
>>Hi Shawn,
>>
>>I am not a taxonomist, and I have only a superficial view of these
>>issues, but I think I can help clarify with a highly simplistic
>>exposition of the issues. So here's a simplified version of the issue
>>and how concept info helps to resolve the problems. I'm going to mostly
>>ignore the idea of type specimens, even though it is actually central to
>>the discussion. It'll still be a tome, even though simplistic. Others
>>on seek-taxon can point out my mistakes :) and hopefully clarify the
>>utility of the approach.
>>
>>--- Taxonomy ---
>>Taxonomists collect specimens from the field and use them to classify
>>clusters of organisms into groups at various levels (aka Ranks) in a
>>hierarchy (e.g., at the Species rank). These groups are generally
>>defined by a description of the characters from the specimens that can
>>be used to distinguish the groups from each other. The suite of
>>characters used to distinguish one group from another need not match
>>(e.g., overall_length might distinguish species A from species B but
>>number_of_Hairs might distinguish species B from species C). So, the
>>taxonomists have a "concept" of the group in mind when they write the
>>description of the species in a manuscript -- the description is the
>>manifestation of the taxonomic concept the person had in mind. The
>>taxonomist who first writes the description that defines a concept can
>>be called the "concept author". There are specific rules about how to
>>create the scientific name when a taxonomist wants to create a new
>>grouping (aka concept). Upon first creation of a concept A with name a,
>>the person creating the name a is usually called the name author and is
>>credited with the 'discovery'. Taxonomists generally try to preserve
>>this precedence, but don't worry too much about the concept author.
>>You might see a species written as 'Acer rubrum L.'; the 'L.' stands for
>>Linnaeus, who is the name author (there are many abbreviations used). For
>>the very first definition of a concept and name, the name author and the
>>concept author are the same. Later on they are not the same. So far so
>>good.
>>
>>Taxonomists usually disagree about how to classify (e.g., what
>>characters are important), and so they want to change the concepts as
>>time progresses and new information surfaces. They generally try to
>>distinguish the new concepts from some existing concepts by splitting or
>>lumping various concepts using new descriptions based on the earlier
>>descriptions. The nomenclature rules for those groups say how the new
>>concepts should be named. Usually, if a concept (A) with name (a) is
>>split into two new concepts (A' and A''), then one of those retains the
>>original name (a) and the other gets a new name (a'). So now there are
>>3 concepts in existence (A, A', and A''), but only two names (a and a').
>>Thus, the name 'a' actually can be used to refer to two distinct
>>concepts (A and A') with distinct definitions. The name author for 'a'
>>is still the same, but the concept author is different for A and A'.
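
The split described above can be sketched as a small data structure. This is only an illustration, not the seek-taxon schema: the Concept class, its field names, and the author labels "X" and "Y" are all hypothetical, and crediting the new name a' to the revising taxonomist is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str      # hypothetical identifier, e.g. "A'"
    name: str            # scientific name attached to this concept
    name_author: str     # person credited with the name
    concept_author: str  # person who wrote the defining description


# Concept A (name a) is split into A' (keeps name a) and A'' (new name a').
# "X" described the original concept; "Y" is the hypothetical reviser.
concepts = [
    Concept("A", "a", name_author="X", concept_author="X"),
    Concept("A'", "a", name_author="X", concept_author="Y"),
    Concept("A''", "a'", name_author="Y", concept_author="Y"),
]

# Index concepts by name to expose the ambiguity of a bare name.
concepts_by_name = defaultdict(list)
for concept in concepts:
    concepts_by_name[concept.name].append(concept.concept_id)

for name, ids in concepts_by_name.items():
    print(name, "->", ids)
```

Three concepts end up under two names, and the name 'a' alone cannot tell A from A'.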
>>
>>So the current situation is that one name can refer to many concepts,
>>AND that one concept can have many names. Quite ambiguous. There are
>>millions of species concepts in most views of things, and many of them
>>have been revised multiple times over a several-hundred-year history of
>>classification. Also, the species are organized into higher-level ranks
>>(e.g., genera), and these have the same name/concept issues as the
>>lowest-level ranks. Egad.
>>
>>--- Biology using taxonomy ---
>>Biologists use scientific names to identify organisms in the field and
>>elsewhere. When they collect data, they use a field guide or otherwise
>>learn to identify species according to the descriptions of the species,
>>usually provided in a field guide or other authority. Thus, if you know
>>the name that the biologist used to identify an organism and the
>>reference that contains the description of the concept that that name
>>refers to, you have a good idea of exactly what concept the biologist
>>thought the organism was. Unfortunately, most biologists do not write
>>down the authoritative reference that they were using to identify
>>species, instead providing the name only in their data sets (and
>>sometimes they provide the name author, especially for plants). Thus, a
>>biologist who references name 'a' in a dataset in 1950 might be
>>referring to a different taxonomic concept than another biologist who
>>references name 'a' later, say in 2000. Thus, if you were to do a
>>retrospective analysis of the properties of 'a' over time without
>>resolving the concepts that 'a' refers to, you'd be comparing apples and
>>oranges (or, more likely, apples and McIntosh apples). In studies like
>>biodiversity studies, this could result in inflation or deflation of
>>changes in species abundance simply based on changes in the predominant
>>view of species classifications.
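
A small worked example may make the inflation/deflation point concrete. All counts below are invented for illustration, and the mapping tables are hypothetical. The sketch assumes that concept A was split into A' (still named a) and A'' (named a') between the two surveys.

```python
# Hypothetical abundance counts, keyed by the name written in each data set.
counts_1950 = {"a": 100}            # in 1950, name 'a' meant concept A
counts_2000 = {"a": 60, "a'": 40}   # in 2000, 'a' meant A' and a' meant A''

# Naive comparison by name alone: 'a' appears to decline by 40.
naive_change = counts_2000["a"] - counts_1950["a"]

# Concept-aware comparison: translate names to concepts per data set,
# then use the knowledge that A was split into A' and A''.
name_to_concept_1950 = {"a": "A"}
name_to_concept_2000 = {"a": "A'", "a'": "A''"}
split_into = {"A": ["A'", "A''"]}


def by_concept(counts, name_to_concept):
    """Re-key counts from names to concept identifiers."""
    return {name_to_concept[name]: n for name, n in counts.items()}


def total_for(concept, counts_by_concept):
    """Sum counts for a concept plus any concepts it was split into."""
    total = counts_by_concept.get(concept, 0)
    for part in split_into.get(concept, []):
        total += total_for(part, counts_by_concept)
    return total


concepts_1950 = by_concept(counts_1950, name_to_concept_1950)
concepts_2000 = by_concept(counts_2000, name_to_concept_2000)
resolved_change = total_for("A", concepts_2000) - total_for("A", concepts_1950)

print("change by name:   ", naive_change)
print("change by concept:", resolved_change)
```

With these made-up numbers the apparent decline comes entirely from the change in classification.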
>>
>>--- Taxon databases ---
>>Typical taxonomic databases today use the taxon name as a surrogate for
>>the concept, and so are actually just lists of names. The seek-taxon
>>group (along with others in the taxonomic community) is proposing that a
>>better approach is to explicitly model the distinction between taxonomic
>>concepts and taxonomic names by specifying a unique identifier for
>>every concept ever created based on the reference in which it was
>>described (yep, BIG task), and then associating with it the many names
>>that have been used to refer to that concept. They also propose that
>>relationships among concepts can be mapped (e.g., that concept A defines
>>a superset of concept A'). The relationships between concepts
>>recognized now are: congruent, includes, included in, overlaps, excludes
>>(see
>>http://www.bgbm.org/BioDivInf/Projects/MoreTax/standard_liste_en.htm
>>for details).
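
One way to picture the five relationships is with set operations. This is an illustrative assumption and not how the seek-taxon group or the MoreTax list defines them. Each concept is modelled here as the set of hypothetical specimen identifiers it circumscribes, and the relationship function is a made-up helper.

```python
def relationship(a: frozenset, b: frozenset) -> str:
    """Name the relationship of concept a to concept b, treating each
    concept as the set of specimens it covers (an assumption)."""
    if a == b:
        return "congruent"
    if a > b:            # a is a proper superset of b
        return "includes"
    if a < b:            # a is a proper subset of b
        return "included in"
    if a & b:            # some, but not all, specimens are shared
        return "overlaps"
    return "excludes"    # nothing shared


# Hypothetical circumscriptions: A was split into A' and A''.
A = frozenset({"s1", "s2", "s3", "s4"})
A_prime = frozenset({"s1", "s2"})
A_double_prime = frozenset({"s3", "s4"})
B = frozenset({"s2", "s3"})   # an unrelated hypothetical concept

print(relationship(A, A_prime))               # includes
print(relationship(A_prime, A))               # included in
print(relationship(A_prime, A_double_prime))  # excludes
print(relationship(A_prime, B))               # overlaps
print(relationship(A, A))                     # congruent
```

In practice the relationships would probably be asserted by taxonomists, not computed, since complete specimen lists are rarely available.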
>>
>>--- Using taxon concepts ---
>>OK, so to your example data set. Your dataset has a series of taxonomic
>>names at various ranks. If you take the 'sci_name' column as
>>representing the species rank, you have a dataset that identifies only a
>>name, not a concept. This alone does not unambiguously tell you what
>>the biologist was referring to. The metadata provides references to the
>>authorities that they used for species identification, and so
>>(theoretically at least :), you could find each name in those references
>>and get an exact concept definition that the biologist meant. Of
>>course, this requires matching the name and reference in the metadata to
>>a corresponding name/concept in the seek-taxon concept database.
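
The matching step could look roughly like the lookup below. It is a sketch under stated assumptions: the registry, its concept identifiers, and the resolve and normalize helpers are hypothetical and stand in for whatever the seek-taxon concept database eventually provides. The names and the reference string come from the dataset and metadata quoted in this thread.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences match."""
    return " ".join(text.lower().split())


REFERENCE = "USFWS Checklist of Vertebrates, 1991"

# Hypothetical registry: (scientific name, reference) -> concept identifier.
CONCEPT_REGISTRY = {
    (normalize("Aix sponsa"), normalize(REFERENCE)): "concept-1",
    (normalize("Chaetura vauxi"), normalize(REFERENCE)): "concept-2",
    (normalize("Ardea alba"), normalize(REFERENCE)): "concept-3",
}


def resolve(sci_name: str, reference: str) -> Optional[str]:
    """Return the concept id for a name as used in a reference, or None."""
    return CONCEPT_REGISTRY.get((normalize(sci_name), normalize(reference)))


# Names as they appear in the data set, reference as cited in the metadata.
for name in ["aix sponsa", "chaetura vauxi", "ardea alba"]:
    print(name, "->", resolve(name, "USFWS Checklist OF Vertebrates, 1991"))

# A name without a registered reference stays unresolved.
print(resolve("aix sponsa", "an unregistered field guide"))
```

Exact string matching is the simplest possible choice here. Real metadata would likely need fuzzier matching of both names and citations.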
>>
>>Once you've done this, you have more ability to reason about the
>>relationships among data items in different data sets. For example, if
>>I want to search for data about name "a", a taxon concept resolution
>>service might tell me that the name has been used to represent concepts
>>A and A' at different times, and that searching for *all* names for both
>>A and A' might be what the user wants (a type of query expansion). Thus
>>a better query can be defined semi-automatically. Also, if a user wants
>>to combine two datasets from different times that use taxonomic names,
>>the concept database might be used to reason about the relationships
>>between concepts used in the different data sources (e.g., that the name
>>'a' used in dataset 1 refers to concept A but the name 'a' used in
>>dataset 2 refers to A', and so the data MAY not be comparable). I say
>>MAY because this is an extremely subjective and subtle decision -- it
>>really depends on what kinds of measurements the scientist is comparing
>>and why they are comparing them. So scientific judgement will be
>>critical here. Sometimes we'll be able to tell a nice tidy relationship
>>among concepts (for congruence, superset, subset) but other times it
>>might be ambiguous (intersection, disjunction). Finally, when we know
>>only a taxon name and have no info about the concept, some contextual
>>information about the data may allow us to assign a probabilistic
>>estimate of the name representing a series of concepts. For example, if
>>the data contained name 'a' and was collected in 1900, and the only
>>concept that used name 'a' in 1900 was A, then we might assign a high
>>probability that 'a' represented A in that data. Later on, we might see
>>that 'a' is used in data collected in 2004, and we might know that
>>everyone has thought that A' and A'' are the right concepts to use so no
>>data has referred to A since 1930, so there is a high probability that
>>'a' references A'. Part of the seek-taxon work is to work on
>>probabilistic approaches to making these judgements based on various
>>corpora.
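
The query expansion and the date-based estimate described above might be sketched as follows. Everything here is a hypothetical illustration: the Usage records, the synonym "b", the starting year 1850, and the uniform weighting over candidate concepts are assumptions. Only the 1900, 1930, and 2004 dates come from the example in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Usage:
    concept_id: str
    names: tuple              # every name used for this concept
    first_year: int           # first year the concept was in use
    last_year: Optional[int]  # last year in use; None means still in use


USAGES = [
    Usage("A", ("a", "b"), 1850, 1930),   # "b" is a made-up synonym
    Usage("A'", ("a",), 1930, None),
    Usage("A''", ("a'",), 1930, None),
]


def expand_query(name: str) -> set:
    """Collect all names of every concept the given name has referred to."""
    expanded = set()
    for usage in USAGES:
        if name in usage.names:
            expanded.update(usage.names)
    return expanded


def estimate(name: str, year: int) -> dict:
    """Spread probability evenly over concepts that used the name that year.

    Uniform weighting is a placeholder; a real estimate would draw on
    corpora of dated usage.
    """
    hits = [
        u.concept_id
        for u in USAGES
        if name in u.names
        and u.first_year <= year
        and (u.last_year is None or year <= u.last_year)
    ]
    return {concept_id: 1 / len(hits) for concept_id in hits}


print(sorted(expand_query("a")))
print(estimate("a", 1900))
print(estimate("a", 2004))
```

A bare name with a collection date narrows the candidates, but an ambiguous year such as 1930 still leaves more than one concept, which is where scientific judgement comes back in.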
>>
>>To address your last question (what needs to be in a concept schema),
>>you might want to review the current concept schema that the seek-taxon
>>group (and J. Kennedy in particular) has been developing (it's in CVS).
>>
>>Hope this has helped. It certainly took a while to write, even though I
>>know it's pretty sloppy in some places. Unfortunately, I'm in an NSF
>>site review all next week so won't be able to respond to any questions,
>>but I hope the rest of the seek-taxon group can follow up and correct my
>>mistakes, and then I will try to during the week after next.
>>
>>Cheers,
>>Matt
>>
>>Shawn Bowers wrote:
>>
>>>Hi,
>>>
>>>I recently found this statement in the methods section of an EML
>>>document:
>>>
>>> "Nomenclature for common names follow the 1987 edition of the
>>> National Geographic Society's field guide, 'Bird's of North
>>> America'. Species codes used are those of the American
>>> Ornithologist's Union. The USFWS Checklist OF Vertebrates, 1991,
>>> was used to quantify scientific names from the common names."
>>>
>>>Can anyone help me interpret what these two sentences mean, and how I
>>>might use the Taxon-group work to "understand/resolve" the actual
>>>species references in a dataset based on the above sentence? Here is a
>>>snippet from the corresponding dataset with only those columns that
>>>refer to something "taxonomic". (Note that there are actually 23 columns
>>>in the dataset and about 165 rows.)
>>>
>>>
>>>class  tax_order      family    sci_name        aoucode  commonname
>>>-----  ---------      ------    --------        -------  ----------
>>>aves   anseriformes   anatidae  aix sponsa      wodu     wood duck
>>>aves   apodiformes    apodidae  chaetura vauxi  vasw     vauxs wift
>>>aves   ciconiiformes  ardeidae  ardea alba      greg     great egret
>>>...
>>>
>>>I am particularly interested in understanding the relationship between
>>>the concept XML schema and its use for "registering" or "mapping" this
>>>data set to information captured in the concept work. For example, I
>>>may want to search for datasets based on taxonomic concepts.
>>>
>>>You have to be patient with me because I am clueless about these
>>>issues, but it seems like the common name and aoucode represent
>>>redundant information: the aoucode is some kind of convention for
>>>representing the common name, and the class/tax_order/family/sci_name
>>>uniquely identifies the common name? How would one align this dataset
>>>with an instantiated taxon concept schema -- in particular, what
>>>information would need to be available in the instantiated concept
>>>schema?
>>
>>>Any help is greatly appreciated,
>>>
>>>Shawn
>>>
>>>
>>>
>>>
>>>
>>>_______________________________________________
>>>seek-taxon mailing list
>>>seek-taxon at ecoinformatics.org
>>>http://www.ecoinformatics.org/mailman/listinfo/seek-taxon