Saturday, November 14, 2009

Longevity, agility, and standards in biodiversity informatics

This post is a mixed bag of equal parts #tdwg09 tweets response, clarification of the philosophy behind mx, and end-of-the-week-flu-induced rant. I really don't mean it to be an argument for using mx vs. say, Scratchpads (that's sort of an apples-oranges argument). I've always been very upfront in cautioning potential users as to the pitfalls of using mx. I'm very aware of the arguments for standardization, stability, etc.; however, what is also painfully clear is that there is far too little diversity in biodiversity applications, i.e. tools which enable scientists to do new science. The world's biodiversity is 10% described, and there are only so many ways we can mark up, search, map, and re-index this minuscule fraction of potential knowledge. Where are the new applications that are going to capture the remaining 90% of the data while making it inherently more useful?

Agility
A couple of premises. New science happens fast, really fast, and to accommodate it we need "agile" (yes, it's cliche) solutions. The funded life of many projects happens within 1-3 years, and the notice of funding is abrupt. mx exists, in part, because I need an environment that I can quickly modify if I want to test new things (visualizations), promote new ways of thinking about taxonomy (morphological ontology-based descriptions), mock up proofs of concept for grants, or provide new functionality for myself or my research colleagues. I need to be able to provide features and functionality for active research immediately, as in today. The consequences of this type of approach are not always positive: among other things, more features are likely to fail, either because they met too specific a need or because they just generally sucked. The benefit of providing many mutations is that the best (one-click matrix coding) are rapidly selected for.

One practical example, poking at Scratchpads (because I think it's a great project) to make a point. This year Scratchpads is reporting its Nexus importing functionality. Last year mx needed to import Nexus formatted matrices, and I justified the time to develop a general purpose parser and importer because we had people who would immediately be using the functionality. The functionality has been available for over a year. This year we've added a more robust framework for handling sequence data (FASTA uploads automatically tying to specimen data) because my colleagues and I will benefit greatly from it (we've already added hundreds of extract, pcr, and specimen records). I have no doubt that "next year" Scratchpads will get FASTA support. There is also no doubt that at some point Scratchpads, or some other uber project (e.g. Lifedesks, Lucid or Mesquite), will have all the functionality that anyone could ever dream of right now. At that point somebody will want to do something else, and it will take years for them to catch up and do that something else. You get my point. This definitely doesn't imply that mx will have everything a biodiversity app needs, nor that it is better, more quickly developed, more comprehensive, more reasonable or any such claim; it's just an illustration of a very pragmatic issue that biodiversity apps in general need to address- we need tools and we need them now.
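To make "FASTA uploads automatically tying to specimen data" a bit more concrete, here is a minimal Python sketch. It is not mx's actual importer (mx is a Rails application and its code is not shown here). It assumes, purely for illustration, that the first token of each FASTA header is a specimen identifier; the function names, the `specimens` dictionary and the `SPEC-…` ids are all hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def link_to_specimens(text, specimens):
    """Attach each sequence to a specimen record.

    Assumption (not from mx): the first header token is the specimen id.
    Returns (linked, unmatched): linked maps specimen id -> list of
    sequences, unmatched lists headers with no known specimen.
    """
    linked, unmatched = {}, []
    for header, seq in parse_fasta(text):
        tokens = header.split()
        specimen_id = tokens[0] if tokens else ""
        if specimen_id in specimens:
            linked.setdefault(specimen_id, []).append(seq)
        else:
            unmatched.append(header)
    return linked, unmatched


# Hypothetical specimen records and a hypothetical upload.
specimens = {"SPEC-001": {"taxon": "Diapriidae sp."}}
fasta = ">SPEC-001 COI\nACGT\nTTGA\n>SPEC-999 COI\nGGCC\n"
linked, unmatched = link_to_specimens(fasta, specimens)
```

Keeping the unmatched headers rather than silently dropping them is a design choice in the sketch: the user can then see which sequences still need a specimen record.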

Standards
Scientists will capture whatever data it is that is interesting to them, and waiting to do science until some standard is in place is a surefire way to fail. One example among many- As various folks have tweeted (see #tdwg09, and if you dare read taxacom) there is some "discord" about the role of LSIDs in biodiversity informatics. Had my first priority been to ensure that everything that mx did was LSID compliant, I would have spent an inordinate amount of time on functionality that to date I'm still not sure is going to be useful. The same goes for SDD: does anybody actually use this to do real work, and how might I find out if they did? It might turn out that LSIDs are critical, in which case I believe that mx is perfectly situated to take advantage of them, and perhaps even provide them.

I have very little fear that the data model used by mx is somehow incompatible with core biodiversity data. We already provide or capture data in a large number of formats including DarwinCore, Nexml, OBO, FASTA, Nexus, TNT, and ITIS import tables, and have many mechanisms for exporting data for users. Since mx is open source it's open to anyone (particularly those with ample experience- nudge nudge) to hack up new ways to expose or manage standards-compliant data. Getting data "out" will always be relatively easy; making data digital (getting it "in") is hard work that 1) doesn't get enough credit and 2) doesn't get enough funding.

Longevity
mx originally evolved largely out of my desire to have a unified environment for my research as a systematist. It exists because at the time no solution existed that integrated the types of data I wanted to manage. I maintain that no such solution (including mx) yet exists today. It is meant to be my "life-long-companion". If I get hit by a bus (better not go outside today on the 13th) will mx die? Probably. Will the data be lost? If someone finds it useful then probably not. Is it in some cryptic silo that only I know of? No- source is on SF, and numerous labs have a vested interest in keeping their data pertinent. Longevity should not get in the way of getting things done in the first place.

mx
mx is evolving, and funding or not I see no reason why it can't continue to do so. Because it does a lot of different things its role is somewhat foggy (harder to pin down and bash, I like it that way), but some central objectives are emerging:
  1. Focus on functionality that allows researchers to do work. When in doubt about standards capture verbatim data. Worry about who to make friends with (which standards to adopt) when the apocalypse ends.
  2. mx is not a final repository for data (is there such a thing?). It is an environment for gathering and working with data. When you publish your data your sequences should go to Genbank, your taxon names to Zoobank, your descriptions to EOL and plazi, your trees to Treebase, your specimen data to... oh hell just send it all to Rod Page.
  3. Provide a tool that helps to generate new data that others want (the best way to ensure longevity is to have useful data), then give others a way to access that data however they may want (a flexible web-based framework like Rails)- they just have to hack on their preferred API.
  4. mx is not (just) about making taxon pages, taxon pages are somewhat of a nebulous concept that is best compiled by larger entities (EOL, ALA) with budgets for flash. We've provided some of the basics to make pages, but we hope that these are stopgaps for an integrated approach with larger projects. The multipurpose utility of an environment like mx is perhaps best exhibited by the HAO which was rapidly prototyped in mx and then NSF funded. The end product of the HAO is an OBO file which alone justifies the work on the application which may or may not exist in 5 years.
  5. mx is an expert system and it makes no apologies for being so. Taxonomists, phylogeneticists and ontologists are experts that work with complex data (that can't be handled by wikis...yet). Just as we don't expect mathematicians to use the Microsoft Equation Editor, but rather to understand the complexities of software like Mathematica, and expect engineers to use CAD, we should expect people who study biodiversity to be expert informatics tool users. Another way of saying this- you see nobody convincing math profs to do their work with wikis instead of Mathematica/R etc.; wikis simply don't have the specialization needed to enable the data exploration and management that needs to be done (again, this is a big "for the foreseeable future" ... during which many grants will be written and much research will be done). While mx wants things to be as easy and straightforward as possible (it's likely failing miserably now) it is not aspiring to be the Microsoft Word of Biodiversity Informatics.
mx is open source largely for selfish reasons- I wanted to see if there are other systematists who hack (whether it is possible to do both at the same time is unclear and will be the topic of an upcoming post), and hoped for mutual benefits if such existed. It's an experiment which I hope will slowly emerge as a successful one- big matrices take a long time to compile, as do taxonomic catalogs, specimen inventories and complex ontologies. These data in instances of mx are already significant (e.g. 85k new matrix cells from over 1800 characters and 5400 character states, 7700+ OTUs, 40k references, 12k taxon names, 60k tags, an ontology of 2500 morphological terms etc.) and the commitment to using and funding mx suggests an exciting future.

Sunday, September 6, 2009

Nomenclature is Dead! Long Live Pragmatic Taxa!

Roger Hyam's blog post is great. A few somewhat random thoughts on it follow, with apologies in advance to Roger for any (likely) misinterpretations. I don't read taxacom, so there might be some redundancy here. I'm not sure that what follows makes any sense, nor is it my personal position, that is, I'm trolling.

What Roger seems to point out, and what I've argued to others in the past is that something like this will ultimately exist in the future:
  1. In the future we will have a lot of data on "the web".
  2. These data will be very accessible to machines OR humans (hereafter "mahus").
  3. The way mahus will access data of interest will be by using other data/observations/measurements (hereafter lumped as "daomes").
  4. Since taxonomic names are not daomes they are pointless.
How does this work? Let's imagine some sci-fi:

Situation A: The mahu plumber laying fibre optic cable is having a bad day. In the depths of its subterranean workplace something is fouling its locomotion. The mahu applies a vacuum to the space itself, sucking some of the offending stuff into a daome chamber. Some daomes are taken, including a daome which identifies some hydrocarbon present only in life (at least according to the mahu's reference ontology). Life has DNA, therefore the mahu's built-in DNA reader makes another daome, which indicates that the offending stuff is a red, stinging thing that lives in colonies in the soil, fouls up electrical equipment, and can be killed by a dose of the contents of the red canister attached to the side of its locomotory appendage. The pesticide is applied, the problem goes away, and the mahu continues plumbing.

Situation B: Back on the surface two mahus are playing. Something hops onto the field in front of them and they want to know whether this thing might be dangerous and interfere with their game. They can see that this thing has a wing, beak, feather, is red and is singing a song (I'm looking at one out my window now). A few interactions with the cloud later they realize that they are looking at something that is mostly harmless.

Both cases illustrate that we don't need to work with names at all in many cases. Science would benefit immensely if we depended on daomes rather than names, since all our hypotheses and inferences would necessarily be tightly integrated with our primary daomes. Among other things this would result in an excellent audit trail, as is required by good science.
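The two situations can be caricatured in a few lines of Python. This is only a toy sketch of "look things up by daomes, not by names": the records, the property strings and the overlap-based matching rule are all invented for illustration, and a real system would need something far less naive.

```python
# Hypothetical linked data: each record is a set of daomes plus an
# inference that can be drawn from them. No taxon names are stored.
RECORDS = [
    {"daomes": {"has wing", "has beak", "has feather", "is red", "sings"},
     "inference": "mostly harmless"},
    {"daomes": {"is red", "stings", "lives in soil colonies"},
     "inference": "fouls electrical equipment"},
]


def infer(observed):
    """Return the inference of the record sharing the most daomes with
    the observations, or None if nothing overlaps at all."""
    best = max(RECORDS, key=lambda r: len(r["daomes"] & observed))
    if not best["daomes"] & observed:
        return None
    return best["inference"]


# Situation B: what the two mahus can see on the field.
on_the_field = infer({"has wing", "has beak", "is red", "sings"})
# Situation A: what the plumber's daome chamber reports.
in_the_cable_duct = infer({"is red", "stings"})
```

Note that "is red" appears in both records; it is the combination of daomes, not any single one, that does the work a name would otherwise do.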

Note that this system only works if #2 above is (mostly) true. Names are useful now because data are not accessible. Names act as temporal place holders that allow us to jump from one reference to another without losing track of where we are. If data are unified, linked, and very easily (=instantaneously) accessible we don't need these place holders to do work.

Roger writes "If the purpose of taxonomy is to produce a system that people can actually use to hang data on - so that both people and machines can then infer more knowledge from the linked data - then this is really the only game in town." I think that Roger doesn't want to do taxonomy, he wants to do phenomics. He wants to produce a system of data that we can hang names on. This is a good idea.

Uncertainty

I have minor quibbles with Roger suggesting that DNA is the way to implement the necessary nomenclatory system, though this might be a misunderstanding on my part. If his system works then I think it can work with genomic or phenomic daomes. What I'm doing here is disagreeing, for the point of argument, with Roger's statement "What the machine can’t do is take into account additional properties, that weren’t considered in the first place, like the human user can.", and taking the discussion in a different direction.

Roger's method works by instantiating a taxon and pointing to it with a hard link. A hard link is something with two features: it is both an address and a property of the class itself. It's like using an id in a relational database as a meaningful property of the record it identifies, for instance using your SSN as an id for your medical records. Creating hard link identifiers in an RDBMS is typically taboo. His hard link is a string of letters, which just happens to be a DNA barcode, which is a property of an instance of his taxon. Note that hard links could be anything; it need not be a string of letters referring to a DNA barcode, it could be "blue". "Blue" the string of letters is unique, it's not "BlueRed" or "RedBlue", and it's also a property of the (polyphyletic) taxon that is blue.
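For readers who don't live in databases, here is a small sketch of why the hard link is usually taboo, using Python's built-in sqlite3 module. The table names, column names and barcode strings are made up for the example; nothing here comes from Roger's post or from mx.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key: id is only an address and carries no meaning.
    CREATE TABLE taxa_surrogate (
        id INTEGER PRIMARY KEY,
        barcode TEXT NOT NULL
    );
    -- Hard link: the barcode is both the address and a property.
    CREATE TABLE taxa_hard_link (
        barcode TEXT PRIMARY KEY
    );
    -- Other data point at the hard-linked taxon by its barcode.
    CREATE TABLE observations (
        taxon_barcode TEXT,
        note TEXT
    );
""")

conn.execute("INSERT INTO taxa_surrogate (id, barcode) VALUES (1, 'ACGTG')")
conn.execute("INSERT INTO taxa_hard_link (barcode) VALUES ('ACGTG')")
conn.execute("INSERT INTO observations VALUES ('ACGTG', 'collected in soil')")

# Suppose the barcode turns out to have been misread and is corrected.
conn.execute("UPDATE taxa_surrogate SET barcode = 'ACGTT' WHERE id = 1")
conn.execute(
    "UPDATE taxa_hard_link SET barcode = 'ACGTT' WHERE barcode = 'ACGTG'"
)

# With the surrogate key the address (id = 1) is unchanged.
surrogate_row = conn.execute(
    "SELECT id, barcode FROM taxa_surrogate WHERE id = 1"
).fetchone()

# With the hard link the address itself moved, so anything that pointed
# at the old value no longer resolves.
dangling = conn.execute("""
    SELECT COUNT(*) FROM observations o
    LEFT JOIN taxa_hard_link t ON t.barcode = o.taxon_barcode
    WHERE t.barcode IS NULL
""").fetchone()[0]
```

The point of the sketch is only that correcting a property also moves the address when the two are the same thing; whether that is a bug or the whole idea depends on which side of this argument you are on.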

There is another subtle aspect to these identifiers. "Blue" is obviously not unique as a property, and "ACGTG" (of suitable length) is. How about 475.0nm, the wavelength of blue light? That's as unique as some long string of "ACGTG". Thou taxon shall be identified by 475.0, not 475.1, nor 475.2, 475.3 is right out (or something like that, my MP is not so good). My taxon is Rhododendron luteum [475.0], Roger's is Rhododendron luteum [ACGTG...].

Going from "Blue" to "475.0" is really not that hard, it's just a matter of using a reference ontology to link the two (well, ok, that's a little tricky). I mean "blue" as in this ontology. This brings up a slightly more abstract possibility. What if within this ontology we include logically consistent classes that are compositions which reference complex phenotypes? As required by Roger's system our hard link both identifies a taxon, and is itself unique. Our identifier might look like this: "headsbluespinesverylongeyesredandsomeothertextthatjusthappenstodescribemytaxonandmakeslongstring". In the same way we might see issues with using "blue", we can instead point to something more specific in our reference ontology, so our human label just becomes some identifier, which references the class in the ontology, which logically defines our very long string.
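A rough Python sketch of that indirection follows: short label, to class in a reference ontology, to the long composed string. Everything in it is hypothetical (the `PHENO:0001` identifier, the entity/quality pairs, the way the string is glued together); the phenome ontology it pretends to query does not exist.

```python
from dataclasses import dataclass

# Hypothetical reference ontology: every identifier and value below is
# illustrative only.
COLOR_WAVELENGTH_NM = {"blue": 475.0}


@dataclass(frozen=True)
class PhenotypeClass:
    identifier: str  # short, opaque label that humans pass around
    parts: tuple     # ordered (entity, quality) pairs defining the class

    def long_string(self):
        """Compose the verbose identifier that the short label stands for."""
        return "".join(
            f"{entity}s{quality.replace(' ', '')}"
            for entity, quality in self.parts
        )


ONTOLOGY = {
    "PHENO:0001": PhenotypeClass(
        "PHENO:0001",
        (("head", "blue"), ("spine", "very long"), ("eye", "red")),
    ),
}


def resolve(identifier):
    """Follow a short label to the class that logically defines it."""
    return ONTOLOGY[identifier]


taxon_class = resolve("PHENO:0001")
composed = taxon_class.long_string()   # "headsbluespinesverylongeyesred"
blue_nm = COLOR_WAVELENGTH_NM["blue"]  # the "blue" -> 475.0 link
```

The short label carries no meaning on its own; it only becomes a hard link once it is resolved against the ontology, which is exactly where the context (and the uncertainty) comes in.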

The problem with using blue or phenomic characters is that the phenome ontology doesn't exist, and will be difficult to make. This shouldn't prevent us from trying.

The real issue with Roger's system is pragmatic. It requires that [A_label] points to one or more methods which return a daome which is a hard link of the class [A_Label]. In other words, the labels "475.0", "ACGTG" and "headsbluespinesverylongeyesredandsomeothertextthatjusthappenstodescribemytaxonandmakeslongstring" are meaningless without context, and, unfortunately, context brings with it uncertainty. How exactly do I generate the daomes that also act as hard links? You likely require YAO (yet another ontology) which documents an experiment which, when performed, returns a daome which has some reasonable chance of being a hard link of the type you want. Since uncertainty is necessarily invoked when someone generates daomes of any type, Roger's nomenclature using genomic daomes is no more valid from an ontological standpoint than is nomenclature using phenomic daomes. You can never be completely sure you are returning the string of letters that the label [atpH] is meant to return. This is true for any number of reasons, many of them well documented, many just due to the nature of making observations. The leap from label to daome is huge. Note that what Roger wants, a completely logically consistent ontology and all the goodness that comes with it, can't just mostly work (as he notes); it has to completely work, or it completely collapses. If we are OK with a system that "mostly works", then we should be OK with either phenomic or genomic approaches. One last way of saying this- it's all well and good if the ontology is completely internally consistent, but if I have to guess at whether my external data match the ontology, then guessing about morphology is as valid as guessing about DNA.

What typically happens at this point of the argument is that the issue becomes a pissing-match about how accurate we want to be (e.g. 80% correct identification), and whether or not genome outperforms phenome in this regard. This is of course an argument of method, not nomenclature, so it's changing the topic.

Pragmatic Taxa

There are obviously lots of other uses for names, and I don't see them going away anytime soon. One way to perhaps encourage (force?) people to think about the inverted process Roger is promoting would be to have journal editors require a statement of purpose with each new taxon circumscription. If you want to provide a new name, you have to provide a statement indicating what your taxon circumscription could be used for, i.e. make it pragmatic. The only rule is that the statement cannot be "If you use my description then you can identify more of my taxon." That's a given. Statements could be something along the lines of providing the species concept definition you're using, but more specific. It should come in the if-then form. If variation in the wing length of my taxon is studied then we can say something about flight dynamics. One fallout of this requirement is that there would quickly be several accepted "stock" answers, one of which might be "If you use the characters I discuss, which are all referenced in an ontology, then a mahu can reason over your results and mine." A nice way of getting data integrated into ontologies.

p.s. I purposely used the nonsense "daomes" and "mahus" not to confuse but rather to see if I can trace their reference in future discussion. Note also how well Google does with a few completely general terms in its image search.

Friday, May 15, 2009

wolfram alpha + taxonomy = ?

So whenever a new search engine is introduced I like to test it with the family of wasps I specialize on, the Diapriidae. Wolfram|alpha is getting a lot of hype lately, and tonight it went live. While it largely choked and acted oddly (if it worked at all) on my search terms, some searches did appear to complete. I tested the term Diapriidae; needless to say it didn't do too well, as parasitic wasps are not dinosaurs. I suppose it gets some points for recognizing I was requesting something related to a taxonomic classification...maybe. "Hymenoptera" was more successful, but with (very) minimal results. "total species of hymenoptera" could not be interpreted. Searching for taxonomy returns some interesting results, including "taxonomic networks" (e.g.). You can click to see the sources for a given result, and those sources are (nicely) linked; Species 2000/ITIS is listed as one. It doesn't appear that you can drill down much past orders of insects. In general I'm very underwhelmed thus far. The primary source is listed as Wolfram|Alpha curated data, so it will be interesting to see if they expand their database. Watch for job postings, cybertaxonomists!

Sunday, January 11, 2009

augmented reality



This technology is pretty cool, and seems to have been around for a while (see also this more recent application). I'm not sure how or what it has to do with cybertaxonomy, but it seems like it should have some application. Perhaps the technology could be used by including the "fiducial marker" along with barcodes (paper) that are attached to mounted specimens. By waving a device over a drawer of specimens one would get images (magnifications), meta-data, or some other cool or useful information. Maybe you could build a physical tree, with something like meccano or lego pieces "enhanced" with these markers; the relationships among the physical branches could be interpreted in the augmented reality, perhaps mapping character state transitions onto the physical tree, overlaying geographic distributions, or some such. This could also be a great way to get kids into museums. First, at sites (schools?) away from the museum, hand out "game" cards, each with some information on an organism and an aforementioned marker. These cards could be brought to a museum that housed an augmentation system. There, once "invisible" information would be revealed, for example movies, 3d representations, or pointers to where real live versions in the museum could be found.