[tcs-lc] Names as Objects

Sat Mar 5 16:58:44 PST 2005

Thanks, Roger --

> Rich basically proposes that ScientificName (or whatever the element is
> called) should be a top level structure along with taxon concepts,
> vouchers and publications. I'll call this the 'modular' approach.

Just to be clear, I tried to emphasize that I'm *not* (necessarily)
proposing this as an actual alternative for TCS (yet).  I just wanted to use
it as a discussion tool for teasing apart what the advantages &
disadvantages would be of treating names as separate, stand-alone objects.
The top-level "TaxonName" element essentially forces it to be treated as a
separate object, and allows us to examine the "what if" consequences of
doing so.  I do like the term "Modular" (as inspired by James) to label this
approach.

> Most of the arguments have been put forward and all parties seem to
> agree that either method would 'work'. There is no right and wrong here
> we are just trying to pick the better of two options.

That's my contention, but I'm not sure I'm right about this, and I'm not
sure everyone agrees yet.  What I think we *do* all agree on is that the
final TCS should attempt to meet the largest/broadest set of needs, while
also maintaining some optimal level of structural "elegance". (I use that
word "elegance" as a catch-all descriptor that implies low processing
overhead requirements, inherently enforced data integrity, general
simplicity of design, convenience of modularity, and a number of other
qualities that database programmers generally aspire to.)

> One way to look at this kind of situation is to do a 'regret analysis'.
> If we were all chatting in 10 years time what would we regret about
> choosing the modular over the embedded approach or vice versa.  We are
> trying to guess which of the options will cause us fewer headaches in
> the future.

AGREED!!!!  That's a great way to look at it.  However, it does need to be
tempered a bit by the fact that we need a working draft ASAP.

> Currently my money is on the embedded approach causing fewer problems. I
> imagine some one who hasn't been involved in the discussions here (and
> probably isn't even a taxonomist) implementing a system to publish
> checklists from surveys and they look at the schema and think
> "ScientificNames that is what I've got!
> I'll just map the names in the database to those elements in the schema".

Well...maybe, but this is balanced by the same fellow who might encounter
the embedded approach and say "TaxonConcepts -- what's that?  This schema is
of no use to me."

Also, as I tried to hint at (but haven't yet thought through), there might
be design "elegance" in defining a strict 1:1 correspondence between a
"ScientificName" object, and a "Nominal" Concept object, in which case we
could consider identical GUID values for both.  In that circumstance, your
hypothetical naive user would be doing no harm by plugging in directly to
names, because that would simultaneously plug into the corresponding Nominal
Concept (which is exactly what we want to do if they have name-only data).
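
To make the shared-GUID idea a bit more concrete, here is a rough Python
sketch.  It is only an illustration of the 1:1 pairing, not a proposal for
TCS itself: the class names, the field names, and the example GUID and name
string are all made up for the purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScientificName:
    guid: str          # hypothetical identifier field
    name_string: str   # hypothetical field holding the name itself


@dataclass(frozen=True)
class NominalConcept:
    guid: str              # assumption: same value as the name's GUID
    name: ScientificName   # the name-relevant data, kept as its own module


def nominal_concept_for(name: ScientificName) -> NominalConcept:
    """Build the one Nominal concept that corresponds to a name object."""
    return NominalConcept(guid=name.guid, name=name)


name = ScientificName(guid="urn:example:name:1", name_string="Aus bus")
concept = nominal_concept_for(name)
```

With this pairing, anything that points at the name's GUID is already
pointing at the GUID of its Nominal concept.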

Now, having just proposed that idea (matching GUIDs between Nominal concepts
and name objects), the database dude in me prefers to embed Names within
Concepts in the schema -- but in a way that compartmentalizes (i.e., keeps
modular) the name-relevant data as separate from the concept data.  After
this "TaxonNames as top-level elements" thought experiment runs its course,
I'll shift gears into advocating an "all nomenclatural information embedded
within a Nominal Concept" argument.  For now, though, I don't want to
clutter the discussion any more than I have already cluttered it.

I do have one question for the XML-gurus:  How do you represent a "Subtype"
in XML?  By "Subtype", I mean an unambiguously defined specific subset of a
larger set of more generalized records.  I.e., "Person" and "Organization"
are each subtypes of "Agent".

Stated another way, if TaxonConcepts can be one of, say, six different
types -- how do you represent a set of elements in XML that says "these
elements only apply to TaxonConcept instances of Type 1, but not to
instances of Types 2-6"?
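
To pin down what I'm asking for, here is the constraint written as a plain
Python check rather than as schema.  This is only a sketch of the rule I
have in mind, not an answer to the XML question: the "type" attribute, the
"Revision" type, and the rule table are all hypothetical, and I'm borrowing
"NameDetailed" purely as an example element.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element names reserved for one concept type.
TYPE_SPECIFIC_ELEMENTS = {
    "Nominal": {"NameDetailed"},
}


def misplaced_elements(concept: ET.Element) -> list[str]:
    """Return child element names that are reserved for another concept type."""
    concept_type = concept.get("type")
    misplaced = []
    for child in concept:
        for owner_type, names in TYPE_SPECIFIC_ELEMENTS.items():
            if child.tag in names and owner_type != concept_type:
                misplaced.append(child.tag)
    return misplaced


ok = ET.fromstring('<TaxonConcept type="Nominal"><NameDetailed/></TaxonConcept>')
bad = ET.fromstring('<TaxonConcept type="Revision"><NameDetailed/></TaxonConcept>')
```

What I'd like to know is whether the schema itself can enforce this kind of
rule, so that a check like misplaced_elements() is never needed downstream.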

> Months, maybe years, later
> some one else realizes that the data being published by this
> organization is useless because it is a list of names not taxa and has
> to go and work out how to get it corrected and correct any decisions that
> have been taken on the basis of the data. I imagine this happening quite
> a lot.

BUT!!! If there is an unambiguous connection between each name object and
its corresponding Nominal concept (as there absolutely must be -- and
shared GUIDs would be only one way of achieving this), then the task of
converting data linked directly to names over to links to Nominal concepts
would be extremely trivial.
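
As a rough illustration of how small that conversion job would be, here is
a Python sketch.  The record layout, the "name_guid" and "concept_guid"
keys, and the lookup table are assumptions I've made up for the example;
the only thing it relies on is that a 1:1 name-to-Nominal-concept mapping
exists.

```python
def relink_to_nominal_concepts(records, concept_guid_for_name):
    """Return copies of records with 'name_guid' replaced by 'concept_guid'."""
    relinked = []
    for record in records:
        updated = dict(record)  # leave the caller's record untouched
        name_guid = updated.pop("name_guid")
        updated["concept_guid"] = concept_guid_for_name[name_guid]
        relinked.append(updated)
    return relinked


# Hypothetical 1:1 lookup; with shared GUIDs it is simply the identity.
concept_guid_for_name = {"urn:example:name:1": "urn:example:name:1"}
survey = [{"locality": "Plot A", "name_guid": "urn:example:name:1"}]
relinked = relink_to_nominal_concepts(survey, concept_guid_for_name)
```

If the GUIDs are not shared, only the lookup table changes; the conversion
itself stays a single pass over the records.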

> The down side of the embedded approach is that it is slightly less
> convenient for taxonomists.

I'm not sure this is a downside, because the vast, VAST majority of
taxonomists will be accessing the data via some UI that hides the structural
complexity of the data.

The downsides I see have to do with structural elegance -- specifically,
mixing what I see as apples and oranges (relationships between names, vs.
relationships between concepts) in one place as though they meant more or
less the same thing.  Clearly, the designers of TCS as it currently exists
appreciate the value of structurally separating "similar" sorts of data into
different structures, as indicated by the separation of "Relationships"
(within TaxonConcepts) from "RelationshipAssertions".  Both structures do the
same thing (establish relationships between a pair of concepts in the
context of an "AccordingTo") -- but they exist in different parts of the
schema because there is a structural elegance in unambiguously separating
those relationships that form part of the *definition* of a concept, from
those elements that represent secondary *interpretations* of concept
relationships.

My basic point is that name-object data elements (and intra-name
relationships) are sufficiently different from concept-object data elements
(and intra-concept relationships) that they warrant compartmentalization
(modularization) in the data schema.  Exactly how that modularization is
optimally achieved is another topic of discussion.

> If you are publishing 15 different concepts
> that use the same name there will be a lot of redundancy but this
> redundancy is only in the instance of a document that may appear briefly
>   - not in a database. In 10 years time I would hope this will go away
> as ALL published names will be cataloged and have GUIDs - anyway one
> would hope so.

I would certainly hope so, but I'm not sure I get what you mean in the
paragraph above.

> Rich proposes that there could be a brief summary of a name in the
> TaxonConcept element as well as a pointer to the full scientific name.

...as currently exists with "NameSimple" and "NameDetailed".  The only other
element to consider is "NameVerbatim", which is necessary if you are going
to decouple "literal string of characters as appears in the concept
definition" from "name".  I get the sense from Jessie's recent posts that
TCS assumes that "unique string of characters" *defines* "new name".  If
that's true, and if the TCS schema is designed around that premise, then it
is of limited use to nomenclators, and thus encourages the nomenclators to
abandon TCS as a mechanism for exchanging name (sensu nomenclaturalist)
data.  I can't imagine a scenario where anyone benefits from such a
separation.

> This leads us into the several-ways-to-achieve-the-same-thing situation
> which is hell for a programmer. Which do we display to the user? Which
> do we use to make judgments about whether two concepts from different
> data sources are talking about the same Name (my hobby horse)?

I certainly understand and agree with the first sentence, but I don't quite
understand why the two questions relate to the discussion at hand.  I mean,
of course they "relate" -- but no matter which approach (modular vs.
embedded) we end up with, those questions will still need to be answered.  I
can't see any intrinsic reason why one approach or the other necessarily
makes those questions easier to answer.

> Basically my point is that taxonomists can handle the concept of a NULL
> or nominal concept that just contains name data much more easily than
> non-taxonomists can grasp the subtle difference between taxon concepts
> and the names we use for them. It is, after all, our job to think about
> these things but we need to produce a schema that is used by people who
> aren't us.

I would agree that a schema design that attaches/embeds name data into
Nominal concepts such that there is an unambiguous 1:1 match between a
"name" (sensu nomenclaturalist) and "Nominal" concept (sensu TCS) would
probably be acceptable to the nomenclatural users.  However, I am not
convinced that a schema that embeds the name info in a pseudo-concept
instance is necessarily more comprehensible to a non-taxonomist than one
that modularizes name data as distinct from concept data.  BOTH approaches
(at the XML schema level) would be difficult to grasp by a non-taxonomist
(hell, I'm a taxonomist who specializes in electronic data management and
I'm *still* not sure I understand as much about TCS as I need to).

The point here is that the data will have to be rendered from an XML schema
into a screen-load of information via some sort of UI; and as long as the UI
programmers understand the schema, the difference between the two structural
approaches is really irrelevant (provided they both contain the same
complement of information).

So...I don't accept that the naive user is relevant in this discussion about
the schema structure.  What is relevant are questions about package size,
informational flexibility, and processing performance.  These are the things
that affect how broad the user base is that finds the exchange schema
"useful" to their particular needs.

> So currently I am in the embedded camp. I could defect at any moment but
> I am looking for a good reason to. The arguments are closely related to
> another thread that I am just about to start. "Are we passing the
> product of taxonomic research or raw taxonomic data?"

> Can anyone give a scenario of regretting going with embedded approach. I
> am sure some one can!

At one level, there is the hypothetical regret that the people who manage
taxonomic names data did not find the embedded approach workable (in a
practical sense -- not in a technical sense) to serve their data needs, and
therefore developed their own separate name-based schema.  I know that *I*
would regret this.

There is also the regret of adopting a "system of convenience" in a world
that preceded universal taxon name registration, which the post
taxon-name-registration world got stuck with as a legacy mechanism of data
exchange.

I would also deeply regret the adoption of an international standard that was
generated without a full mutual understanding of the issues.  I know that I,
for one, do not fully understand all the issues yet; and if I had reason to
believe that I was the only one in this situation, I would certainly not be
spending so much time in articulating the stuff that I do understand (or at
least *think* I understand).

As I wrote this morning in an off-list email:

There are several VERY complex issues that all have to be considered
simultaneously: Nomenclatural rules & practice (separate for Botany &
Zoology), Taxon Concept Circumscriptions, general information structure and
management theory, and specific computer technologies (like XML).  Any one
of these has a very steep learning curve; I seriously doubt that anyone on
the CC list of this conversation has a mastery of ALL of these things (e.g.,
I'm very weak on botanical nomenclature rules & practice and on XML, and
have varying degrees of comprehension of the others).

So...I would regret it if a standard was adopted that did not satisfy the
respective experts in all of these complex disciplines.

Aloha,
Rich