[tcs-lc] Modularisation of standards

Tue Mar 8 01:51:19 PST 2005

Gregor Hagedorn wrote:

> Closely related, my feeling about much of the discussion whether LC should
> be in TCS is that it misses the point. I think that rather than embedding
> LC in TCS, both name and concept issues, together with specimen and 
> publication, and description and ecology, and... issues all belong 
> together. That is what UBIF aims at. SDD does not work without much 
> peripheral infrastructure of names, taxonomic hierarchy, publications,
> geography, agents, etc. Rather than considering it as part of descriptive 
> data, we tried to push it out into UBIF. 
>
> Now UBIF is far from resolved, it is a discussion platform that needs 
> input.  And it may be aiming too high, trying to solve too much. My
> suggestion would be, however, to distinguish between a TCS-core that truly
> deals with taxon concepts, and a shell that ties LC, TCS-core,
> publications, specimens together. 

I am happy for the discussion around the separate use of names outside
concepts to be resolved either way (provided we end up with something that
can handle nomenclatural resources as well as taxonomic resources).  However,
I would like very much to support modularisation of the kind that Gregor
outlined here (and which Rich mentioned in one of his earlier posts).

Modularisation will be really important for the long-term success of the
TDWG standards.  It may be a long way from where we are today, but the TDWG
standards could (and probably should) ultimately become a library of
reusable data types.  Better still, there should also be a set of defined
inter-type relationships.  This would not be to restrict the relationships
that providers could define, but it would help to provide structure to some
of the core connections within our information domain.

For example a TaxonConceptType would include a relationship element such as
<voucheredBy>, and the content of the <voucheredBy> element would be a TDWG
TaxonOccurrenceType or SpecimenType or a reference to an element of such a
type.  

In fact it would be really good to see a model in which each type could be
specialised into a variety of forms, for example offering the following
ways of supplying a TaxonConceptType:

<TaxonConceptRefType>
  - just a reference to a TaxonConceptType elsewhere in the document

<TaxonConceptIdType>
  - a globally unique reference to a taxon concept

<TaxonConceptBasicType>
  - a "core" representation of a taxon concept (Darwin Core level of detail)

<TaxonConceptDetailType>
  - a "detail" representation of a taxon concept (ABCD level of detail)

Each of these types could substitute for a <TaxonConceptType> at any point
in any document.  This could in fact mean that at least the BasicType and
the DetailType became exactly the same thing with different choices for the
level of detail in included elements.  Some of you may see that I have been
heavily influenced by GML in this vision...
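
As a rough illustration of this substitution idea, here is a minimal sketch
in Python (rather than XML Schema, purely for brevity).  The four class names
follow the forms listed above; every field name, the Identification container
and the describe() helper are hypothetical, and the identifier value is
invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TaxonConceptRef:
    """Reference to a concept defined elsewhere in the same document."""
    local_id: str


@dataclass
class TaxonConceptId:
    """Globally unique reference to a concept."""
    guid: str


@dataclass
class TaxonConceptBasic:
    """Core representation; the two fields are illustrative only."""
    name: str
    according_to: str


@dataclass
class TaxonConceptDetail(TaxonConceptBasic):
    """Same structure as the basic form, with more optional detail."""
    relationships: List[str] = field(default_factory=list)


# Any of the four forms may appear wherever a taxon concept is expected.
TaxonConcept = Union[
    TaxonConceptRef, TaxonConceptId, TaxonConceptBasic, TaxonConceptDetail
]


@dataclass
class Identification:
    """Hypothetical container type with an identifiedAs relationship."""
    specimen: str
    identified_as: TaxonConcept


def describe(concept: TaxonConcept) -> str:
    """Return a one-line summary, whichever form was supplied."""
    if isinstance(concept, TaxonConceptRef):
        return f"local reference to {concept.local_id}"
    if isinstance(concept, TaxonConceptId):
        return f"global reference to {concept.guid}"
    # TaxonConceptDetail subclasses TaxonConceptBasic, so this covers both.
    return f"{concept.name} according to {concept.according_to}"


records = [
    Identification("specimen-1", TaxonConceptRef("tc1")),
    Identification("specimen-2", TaxonConceptId("urn:example:concept:1")),
    Identification("specimen-3", TaxonConceptBasic("Aus bus", "Smith")),
    Identification("specimen-4", TaxonConceptDetail("Aus bus", "Jones")),
]
for record in records:
    print(record.specimen, "->", describe(record.identified_as))
```

A consumer written against the general TaxonConcept slot handles all four
forms; making the detail form a subclass of the basic form is one way to
express that they may be the same thing at different levels of detail.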

The existing TDWG standards would then become standard top-level document
structures that combine such data types in different ways.  If different
user groups add new document structures based on these data types, it would
still be possible for any tool to infer a great deal of information for
other purposes based on the standard inter-type relationships.

I think such an approach could ultimately be the most powerful way for us to
manage our data standards.  We could then concentrate on defining the data
types to represent our objects and the standard relationships between them
(identifiedAs, typifiedBy, hasAuthor, etc.).  These would then be building
blocks for any kind of biodiversity data document.  The TDWG standards would
effectively define the core ontology for our data.  All data types would be
extensible to support communities transferring additional information (again
see GML for some possible models).

I guess (returning to the modelling of names and concepts) that this means
that we should create a separate TaxonNameType if we believe that there
would be places in such a framework where we would wish to model:

<SomeType><isRelatedTo><Name/></isRelatedTo></SomeType>

If the only such place is <Concept><hasName><Name/></hasName></Concept>, we
can just carry on with the Linnean Core as an obligate child of a
TaxonConcept.  Otherwise we should consider separate data types.

(Now that I think of it) I guess there is one other possible reason why we
may wish to be able to separate names out.  This is a data processing issue
and Bob Morris may just tell me that my problem comes from assuming a
particular implementation, but here goes...

Assume a document in which two concepts refer to the same published name
(using an abbreviated representation of TCS data):

<TaxonConcepts>
  <TaxonConcept id="tc1">
    <Name>
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
    <AccordingTo>Smith</AccordingTo>
  </TaxonConcept>
  <TaxonConcept id="tc2">
    <Name>
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
    <AccordingTo>Jones</AccordingTo>
  </TaxonConcept>
</TaxonConcepts>

The nomenclatural data here under name is rather simple and there may be
little problem with denormalising it.  A DiGIR-style search could allow a
user to find all TaxonConcepts based on "Aus bus Black, 1965".

However, will it ever matter to an application processing such a document
that the two <Name> elements are the same?  Do we need a better way to
indicate this than simply relying on the byte-identity of the XML content?
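
To make the question concrete, here is a minimal Python sketch (standard
library only) that compares the two <Name> elements first by their serialised
bytes and then by a normalised (Label, CanonicalAuthorship) key.  Note the
assumptions: the second <Name> is deliberately formatted differently from the
first, and the name_key() helper and its whitespace rule are hypothetical,
not part of TCS.

```python
import xml.etree.ElementTree as ET

# Same two concepts as in the example, but the second <Name> is written
# on one line with a doubled space, as a different provider tool might do.
DOC = """
<TaxonConcepts>
  <TaxonConcept id="tc1">
    <Name>
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </Name>
    <AccordingTo>Smith</AccordingTo>
  </TaxonConcept>
  <TaxonConcept id="tc2">
    <Name><Label>Aus  bus</Label><CanonicalAuthorship>Black, 1965</CanonicalAuthorship></Name>
    <AccordingTo>Jones</AccordingTo>
  </TaxonConcept>
</TaxonConcepts>
"""


def name_key(name_elem):
    """Build a comparison key from a <Name>, collapsing runs of whitespace."""
    def norm(tag):
        text = name_elem.findtext(tag, default="")
        return " ".join(text.split())
    return (norm("Label"), norm("CanonicalAuthorship"))


root = ET.fromstring(DOC.strip())
names = [tc.find("Name") for tc in root.findall("TaxonConcept")]

# Comparison 1: serialised content of the two elements.
byte_identical = ET.tostring(names[0]) == ET.tostring(names[1])

# Comparison 2: normalised key built from the child elements.
same_key = name_key(names[0]) == name_key(names[1])

print("byte-identical:", byte_identical)
print("same normalised key:", same_key)
```

Under these assumptions the byte comparison fails while the normalised key
matches, so any tool relying on embedded names has to choose (and document)
its own equivalence rule.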

An alternative representation (if this mattered) would be:

<TaxonConcepts>
  <TaxonConcept id="tc1">
    <NameRef id="tn1"/>
    <AccordingTo>Smith</AccordingTo>
  </TaxonConcept>
  <TaxonConcept id="tc2">
    <NameRef id="tn1"/>
    <AccordingTo>Jones</AccordingTo>
  </TaxonConcept>
</TaxonConcepts>
<TaxonNames>
  <TaxonName id="tn1">
    <Label>Aus bus</Label>
    <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
  </TaxonName>
</TaxonNames>

[Note that it is much harder reliably to assign and police meaningful
identifiers for name elements if they are fully embedded.  There would
certainly be no way to enforce a single consistent representation in all
occurrences for the same <Name> across different <TaxonConcept> elements.]
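
For comparison, here is a minimal Python sketch (standard library only) of
how a consumer might process the normalised form: index the <TaxonName>
elements by id, then group concepts by the name they reference.  The
<DataSet> wrapper is an assumption added only so the fragment has a single
root element; the attribute used on <NameRef> follows the abbreviated example
above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """
<DataSet>
  <TaxonConcepts>
    <TaxonConcept id="tc1">
      <NameRef id="tn1"/>
      <AccordingTo>Smith</AccordingTo>
    </TaxonConcept>
    <TaxonConcept id="tc2">
      <NameRef id="tn1"/>
      <AccordingTo>Jones</AccordingTo>
    </TaxonConcept>
  </TaxonConcepts>
  <TaxonNames>
    <TaxonName id="tn1">
      <Label>Aus bus</Label>
      <CanonicalAuthorship>Black, 1965</CanonicalAuthorship>
    </TaxonName>
  </TaxonNames>
</DataSet>
"""

root = ET.fromstring(DOC.strip())

# Index every TaxonName by its id so each reference is a single lookup.
names_by_id = {name.get("id"): name for name in root.iter("TaxonName")}

# Group concept ids by the name they point at.
concepts_by_name = defaultdict(list)
for concept in root.iter("TaxonConcept"):
    ref = concept.find("NameRef").get("id")
    if ref not in names_by_id:
        raise ValueError(f"{concept.get('id')} refers to unknown name {ref}")
    concepts_by_name[ref].append(concept.get("id"))

for name_id, concept_ids in concepts_by_name.items():
    name = names_by_id[name_id]
    label = name.findtext("Label")
    authorship = name.findtext("CanonicalAuthorship")
    print(f"{label} {authorship}: {concept_ids}")
```

Here the fact that tc1 and tc2 share a name is carried by the identifier
itself, so no comparison of element content is needed.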

Will there ever be any reasons why tools that generate TCS data would need
to present it in a more normalised form like this?  I am fairly confident
that the answer here is "No".

Will there ever be any reasons why tools that process TCS data would be
better served by the more normalised form?  Here I am less sure.  Do we have
any use cases that would drive us that way?

Donald

---------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Programme Officer for Data Access and Database Interoperability 
Global Biodiversity Information Facility Secretariat 
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------