Saturday, November 14, 2009

Longevity, agility, and standards in biodiversity informatics

This post is a mixed bag of equal parts #tdwg09 tweet response, clarification of the philosophy behind mx, and end-of-the-week-flu-induced rant. I really don't mean it to be an argument for using mx vs., say, Scratchpads (that's sort of an apples-oranges argument). I've always been very upfront in cautioning potential users about the pitfalls of using mx. I'm very aware of the arguments for standardization, stability, etc. However, what is also painfully clear is that there is far too little diversity in biodiversity applications, i.e. tools which enable scientists to do new science. The world's biodiversity is 10% described, and there are only so many ways we can mark up, search, map, and re-index this minuscule fraction of potential knowledge. Where are the new applications that are going to capture the remaining 90% of the data while making it inherently more useful?

A couple of premises. New science happens fast, really fast, and to accommodate it we need "agile" (yes, it's cliche) solutions. The funded life of many projects is 1-3 years, and the notice of funding is abrupt. mx exists, in part, because I need an environment that I can quickly modify if I want to test new things (visualizations), promote new ways of thinking about taxonomy (morphological ontology-based descriptions), mock up proofs of concept for grants, or provide new functionality for myself or my research colleagues. I need to be able to provide features and functionality for active research immediately, as in today. The consequences of this type of approach are not always positive: among other things, more features are likely to fail, because they met too specific a need or just generally sucked. The benefit of providing many mutations is that the best (one-click matrix coding) are rapidly selected for.

One practical example, poking at Scratchpads (because I think it's a great project) to make a point. This year Scratchpads is reporting its Nexus importing functionality. Last year mx needed to import Nexus-formatted matrices, and I justified the time to develop a general-purpose parser and importer because we had people who would immediately be using the functionality. That functionality has been available for over a year. This year we've added a more robust framework for handling sequence data (FASTA uploads automatically tying to specimen data) because my colleagues and I will benefit greatly from it (we've already added hundreds of extract, PCR, and specimen records). I have no doubt that "next year" Scratchpads will get FASTA support. There is also no doubt that at some point Scratchpads, or some other uber project (e.g. Lifedesks, Lucid or Mesquite), will have all the functionality that anyone could ever dream of right now. At that point somebody will want to do something else, and it will take years for them to catch up and do that something else. You get my point. This definitely doesn't imply that mx will have everything a biodiversity app needs, nor that it is better, more quickly developed, more comprehensive, more reasonable or any such claim. It's just an illustration of a very pragmatic issue that biodiversity apps in general need to address: we need tools and we need them now.
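To make the FASTA example a bit more concrete, here is a minimal sketch of what "tying uploads to specimen data" could look like. This is not mx's actual code (mx is a Rails application); it is Python, and it assumes a hypothetical convention in which the first token of each FASTA header is a specimen identifier. The function names and the shape of the `specimens` lookup are illustrative assumptions only.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def link_to_specimens(
    fasta_text: str, specimens: Dict[str, dict]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Attach sequences to known specimens.

    ASSUMPTION (hypothetical convention): the first whitespace-delimited
    token of each header is the specimen identifier.
    Returns (sequences grouped by specimen id, headers with no match).
    """
    linked: Dict[str, List[str]] = {}
    unmatched: List[str] = []
    for header, sequence in parse_fasta(fasta_text):
        tokens = header.split()
        specimen_id = tokens[0] if tokens else ""
        if specimen_id in specimens:
            linked.setdefault(specimen_id, []).append(sequence)
        else:
            unmatched.append(header)  # keep for manual review
    return linked, unmatched
```

Unmatched headers are returned rather than silently dropped, so whoever uploaded the file can decide what to do with them.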

Scientists will capture whatever data is interesting to them, and waiting to do science until some standard is in place is a surefire way to fail. One example among many: as various folks have tweeted (see #tdwg09, and, if you dare, read Taxacom), there is some "discord" about the role of LSIDs in biodiversity informatics. Had my first priority been to ensure that everything that mx did was LSID compliant, I would have spent an inordinate amount of time on functionality that to date I'm still not sure is going to be useful. The same goes for SDD: does anybody actually use it to do real work, and how might I find out if they did? It might turn out that LSIDs are critical, in which case I believe that mx is perfectly situated to take advantage of them, and perhaps even provide them.

I have very little fear that the data model used by mx is somehow incompatible with core biodiversity data. We already provide or capture data in a large number of formats, including DarwinCore, NeXML, OBO, FASTA, Nexus, TNT, and ITIS import tables, and have many mechanisms for exporting data for users. Since mx is open source, it's open to anyone (particularly those with ample experience, nudge nudge) to hack up new ways to expose or manage standards-compliant data. Getting data "out" will always be relatively easy; making data digital (getting it "in") is hard work that 1) doesn't get enough credit and 2) doesn't get enough funding.

mx originally evolved largely out of my desire to have a unified environment for my research as a systematist. It exists because at the time no solution existed that integrated the types of data I wanted to manage. I maintain that no such solution (including mx) exists today. It is meant to be my "life-long-companion". If I get hit by a bus (better not go outside today on the 13th) will mx die? Probably. Will the data be lost? If someone finds it useful then probably not. Is it in some cryptic silo that only I know of? No: the source is on SF, and numerous labs have a vested interest in keeping their data pertinent. Longevity should not get in the way of getting things done in the first place.

mx is evolving, and, funding or not, I see no reason why it can't continue to do so. Because it does a lot of different things, its role is somewhat foggy (harder to pin down and bash, I like it that way), but some central objectives are emerging:
  1. Focus on functionality that allows researchers to do work. When in doubt about standards, capture verbatim data. Worry about who to make friends with (which standards to adopt) when the apocalypse ends.
  2. mx is not a final repository for data (is there such a thing?). It is an environment for gathering and working with data. When you publish your data, your sequences should go to GenBank, your taxon names to ZooBank, your descriptions to EOL and Plazi, your trees to TreeBASE, your specimen data to... oh hell, just send it all to Rod Page.
  3. Provide a tool that helps to generate new data that others want (the best way to ensure longevity is to have useful data), then give others a way to access that data however they may want (a flexible web-based framework like Rails); they just have to hack on their preferred API.
  4. mx is not (just) about making taxon pages. Taxon pages are a somewhat nebulous concept, best compiled by larger entities (EOL, ALA) with budgets for flash. We've provided some of the basics to make pages, but we hope that these are stopgaps for an integrated approach with larger projects. The multipurpose utility of an environment like mx is perhaps best exhibited by the HAO, which was rapidly prototyped in mx and then NSF funded. The end product of the HAO is an OBO file, which alone justifies the work on the application, whether or not the application itself exists in 5 years.
  5. mx is an expert system and it makes no apologies for being so. Taxonomists, phylogeneticists and ontologists are experts that work with complex data (that can't be handled by wikis...yet). Just as we don't expect mathematicians to use the Microsoft Equation Editor, but rather to understand the complexities of software like Mathematica, and expect engineers to use CAD, we should expect people who study biodiversity to be expert informatics tool users. Another way of saying this: you see nobody convincing math profs to do their work with wikis instead of Mathematica/R etc. Wikis simply don't have the specialization to enable the data exploration and management that needs to be done (again, this is a big "for the foreseeable future" ... during which many grants will be written and much research will be done). While mx wants things to be as easy and straightforward as possible (it's likely failing miserably now), it is not aspiring to be the Microsoft Word of Biodiversity Informatics.

mx is open source largely for selfish reasons: I wanted to see if there are other systematists who hack (whether it is possible to do both at the same time is unclear and will be the topic of an upcoming post), and hoped for mutual benefits if such existed. It's an experiment which I hope will slowly emerge as a successful one. Big matrices take a long time to compile, as do taxonomic catalogs, specimen inventories and complex ontologies. These data in instances of mx are already significant (e.g. 85k new matrix cells from over 1800 characters and 5400 character states, 7700+ OTUs, 40k references, 12k taxon names, 60k tags, an ontology of 2500 morphological terms etc.), and the commitment to using and funding mx suggests an exciting future.