Sunday, September 6, 2009

Nomenclature is Dead! Long Live Pragmatic Taxa!

Roger Hyam's blog post is great. A few somewhat random thoughts on it follow, with apologies in advance to Roger for any (likely) misinterpretations. I don't read taxacom, so there might be some redundancy here. I'm not sure that what follows makes any sense, nor is it my personal position; that is, I'm trolling.

What Roger seems to point out, and what I've argued to others in the past, is that something like this will ultimately exist:
  1. In the future we will have a lot of data on "the web".
  2. These data will be very accessible to machines OR humans (hereafter "mahus").
  3. The way mahus will access data of interest will be by using other data/observations/measurements (hereafter lumped as "daomes").
  4. Since taxonomic names are not daomes they are pointless.
How does this work? Let's imagine some sci-fi:

Situation A: The mahu plumber laying fibre optic cable is having a bad day. In the depths of its subterranean workplace something is fouling its locomotion. The mahu applies a vacuum to the space, sucking some of the offending stuff into a daome chamber. Some daomes are taken, including a daome which identifies some hydrocarbon present only in life (at least according to the mahu's reference ontology). Life has DNA, therefore the mahu's built-in DNA reader makes another daome, which indicates that the offending stuff is a red, stinging thing that lives in colonies in the soil, fouls up electrical equipment, and can be killed by a dose of the contents of the red canister attached to the side of its locomotory appendage. The pesticide is applied, the problem goes away, and the mahu continues plumbing.

Situation B: Back on the surface two mahus are playing. Something hops onto the field in front of them and they want to know whether this thing might be dangerous and interfere with their game. They can see that this thing has a wing, beak, feather, is red and is singing a song (I'm looking at one out my window now). A few interactions with the cloud later they realize that they are looking at something that is mostly harmless.

Both cases illustrate that we don't need to work with names at all in many cases. Science would benefit immensely if we depended on daomes rather than names, since all our hypotheses and inferences would necessarily be tightly integrated with our primary daomes. Among other things this would result in an excellent audit trail, as is required by good science.
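To make the two situations a little more concrete, here is a minimal sketch of a lookup that never touches a name. Everything in it is invented for illustration: the `RECORDS` list, the daome strings, and the `infer` function are my assumptions, not anything Roger describes.

```python
# Hypothetical knowledge base: each record is a bundle of daomes plus
# whatever has already been inferred from them. No record carries a name.
RECORDS = [
    {
        "daomes": {"red", "stings", "colonial", "lives in soil"},
        "inferences": {"fouls electrical equipment", "killed by red canister"},
    },
    {
        "daomes": {"red", "wing", "beak", "feather", "sings"},
        "inferences": {"mostly harmless"},
    },
]


def infer(observed):
    """Collect the inferences of every record consistent with the observed daomes.

    A record is consistent when it contains all of the observed daomes.
    """
    observed = set(observed)
    results = set()
    for record in RECORDS:
        if observed <= record["daomes"]:
            results |= record["inferences"]
    return results


# The mahus on the playing field only have what they can see and hear.
field_answer = infer({"wing", "beak", "feather", "red", "sings"})
```

A vague observation such as `{"red"}` matches both records and returns everything, which is the point: the answer sharpens as daomes are added, and at no stage does anyone need to know what the thing is called.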

Note that this system only works if #2 above is (mostly) true. Names are useful now because data are not accessible. Names act as temporal place holders that allow us to jump from one reference to another without losing track of where we are. If data are unified, linked, and very easily (=instantaneously) accessible we don't need these place holders to do work.

Roger writes "If the purpose of taxonomy is to produce a system that people can actually use to hang data on - so that both people and machines can then infer more knowledge from the linked data - then this is really the only game in town." I think that Roger doesn't want to do taxonomy, he wants to do phenomics. He wants to produce a system of data that we can hang names on. This is a good idea.

Uncertainty

I have minor quibbles with Roger suggesting that DNA is the way to implement the necessary nomenclatory system, though this might be a misunderstanding on my part. If his system works then I think it can work with genomic or phenomic daomes. What I'm doing here is disagreeing, for the sake of argument, with Roger's statement "What the machine can’t do is take into account additional properties, that weren’t considered in the first place, like the human user can.", and taking the discussion in a different direction.

Roger's method works by instantiating a taxon and pointing to it with a hard link. A hard link is something with two features: it is both an address and a property of the class itself. It's like using an id in a relational database as a meaningful property of the record it identifies, for instance using your SSN as an id for your medical records. Creating hard link identifiers in an RDBMS is typically taboo. His hard link is a string of letters, which just happens to be a DNA barcode, which is a property of an instance of his taxon. Note that a hard link could be anything. It need not be a string of letters referring to a DNA barcode; it could be "blue". "Blue" the string of letters is unique, it's not "BlueRed" or "RedBlue", and it's also a property of the (polyphyletic) taxon that is blue.
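For anyone who hasn't run into the RDBMS taboo, here is a rough sketch of why it exists. It is only an illustration under my own assumptions: the `Taxon` class, the two stores, and the toy barcode "ACGTG" are made up, and this is not Roger's implementation.

```python
from dataclasses import dataclass, replace
from itertools import count


@dataclass(frozen=True)
class Taxon:
    barcode: str  # a daome taken from an instance of the taxon
    colour: str


# Hard link: the address of the record IS one of its properties.
hard_store = {}


def register_hard(taxon):
    hard_store[taxon.barcode] = taxon
    return taxon.barcode


# Surrogate key: the address is an arbitrary counter with no meaning.
soft_store = {}
_next_id = count(1)


def register_soft(taxon):
    key = next(_next_id)
    soft_store[key] = taxon
    return key


def revise_barcode(store, key, new_barcode):
    """Correct the barcode of the record stored under `key`."""
    store[key] = replace(store[key], barcode=new_barcode)


def keys_match_properties(store):
    """True if every address still equals the barcode it points to."""
    return all(key == taxon.barcode for key, taxon in store.items())


taxon = Taxon(barcode="ACGTG", colour="blue")
hard_key = register_hard(taxon)
soft_key = register_soft(taxon)

# Suppose the barcode is later re-read and one base changes.
revise_barcode(hard_store, hard_key, "ACGTA")
revise_barcode(soft_store, soft_key, "ACGTA")

# The hard link now points at a record whose property disagrees with it.
hard_still_consistent = keys_match_properties(hard_store)
```

The surrogate key survives the correction because it never claimed to mean anything. The hard link does not, which is both its weakness (it breaks when the property is revised) and, for Roger's purposes, its attraction (the identifier carries content).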

There is another subtle aspect to these identifiers. "Blue" is obviously not unique as a property, and "ACGTG" (of suitable length) is. How about 475.0nm, the wavelength of blue light? That's as unique as some long string of "ACGTG". Thou taxon shall be identified by 475.0, not 475.1, nor 475.2; 475.3 is right out (or something like that, my MP is not so good). My taxon is Rhododendron luteum [475.0], Roger's is Rhododendron luteum [ACGTG...].

Going from "Blue" to "475.0" is really not that hard; it's just a matter of using a reference ontology to link the two (well, ok, that's a little tricky). I mean "blue" as in this ontology. This brings up a slightly more abstract possibility. What if within this ontology we include logically consistent classes that are compositions which reference complex phenotypes? As required by Roger's system our hard link both identifies a taxon, and is itself unique. Our identifier might look like this: "headsbluespinesverylongeyesredandsomeothertextthatjusthappenstodescribemytaxonandmakeslongstring". In the same way that we might see issues with using "blue", we can instead point to something more specific in our reference ontology, so our human label just becomes some identifier, which references the class in the ontology, which logically defines our very long string.

The problem with using blue or phenomic characters is that the phenome ontology doesn't exist, and will be difficult to make. This shouldn't prevent us from trying.

The real issue with Roger's system is pragmatic. It requires that [A_label] points to a method (or methods) which returns a daome which is a hard link of the class [A_label]. In other words, the labels "475.0", "ACGTG" and "headsbluespinesverylongeyesredandsomeothertextthatjusthappenstodescribemytaxonandmakeslongstring" are meaningless without context, and, unfortunately, context brings with it uncertainty. How exactly do I generate the daomes that also act as hard links? You likely require YAO (yet another ontology) which documents an experiment which, when performed, returns a daome which has some reasonable chance of being a hard link of the type you want. Since uncertainty is necessarily invoked when someone generates daomes of any type, Roger's nomenclature using genomic daomes is no more valid from an ontological standpoint than is nomenclature using phenomic daomes. You can never be completely sure you are returning the string of letters that the label [atpH] is meant to return. This is true for any number of reasons, many of them well documented, many just due to the nature of making observations. The leap from label to daome is huge. Note that what Roger wants, a completely logically consistent ontology and all the goodness that comes with it, can't just mostly work (as he notes); it has to completely work, or it completely collapses. If we are OK with a system that "mostly works", then we should be OK with either phenomic or genomic approaches. One last way of saying this: it's all well and good if the ontology is completely internally consistent, but if I have to guess at whether my external data match the ontology, then guessing about morphology is as valid as guessing about DNA.
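A back-of-the-envelope sketch of that leap, under assumptions that are entirely mine: treat a daome as a sequence of independently observed characters (bases, or scored morphological traits), each misread with some fixed error rate. The function name and the numbers below are hypothetical and picked only to show the shape of the problem, not taken from any real sequencing or scoring study.

```python
def exact_match_probability(n_characters, error_rate):
    """Chance that every character is read correctly, so that the daome
    reproduces the hard link exactly.

    Assumes characters are observed independently with the same error
    rate, which real observations will not strictly satisfy.
    """
    if n_characters < 0:
        raise ValueError("n_characters must be non-negative")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return (1.0 - error_rate) ** n_characters


# Hypothetical genomic daome: many characters, each read very reliably.
genomic = exact_match_probability(n_characters=650, error_rate=0.001)

# Hypothetical phenomic daome: few characters, each scored less reliably.
phenomic = exact_match_probability(n_characters=30, error_rate=0.02)
```

With made-up inputs like these, neither probability comes out as 1, and neither is obviously in a different league from the other. Change the numbers and one approach will win, but that is an argument about method, not about which kind of daome is entitled to be a hard link.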

What typically happens at this point of the argument is that the issue becomes a pissing-match about how accurate we want to be (e.g. 80% correct identification), and whether or not genome outperforms phenome in this regard. This is of course an argument of method, not nomenclature, so it's changing the topic.

Pragmatic Taxa

There are obviously lots of other uses for names, and I don't see them going away anytime soon. One way to perhaps encourage (force?) people to think about the inverted process Roger is promoting would be to have journal editors require a statement of purpose with each new taxon circumscription. If you want to provide a new name, you have to provide a statement indicating what your taxon circumscription could be used for, i.e. make it pragmatic. The only rule is that the statement cannot be "If you use my description then you can identify more of my taxon." That's a given. Statements could be something along the lines of providing the species concept definition you're using, but more specific. They should come in the if-then form: if variation in the wing length of my taxon is studied, then we can say something about flight dynamics. One fallout of this requirement is that there would quickly be several accepted "stock" answers, one of which might be "If you use the characters I discuss, which are all referenced in an ontology, then a mahu can reason over your results and mine." A nice way of getting data integrated into ontologies.

p.s. I purposely used the nonsense "daomes" and "mahus" not to confuse but rather to see if I can trace their reference in future discussion. Note also how well Google does with a few completely general terms in its image search.