Sunday, March 28, 2010

Ridiculous Claims

An item last week in GenomeWeb covered a new analysis by Robert Cook-Deegan and colleagues of the Myriad BRCA patents. One bit in particular in the article has stuck in my craw & I need to spit it out.

This finding, involving an expressed sequence tag application filed by the NIH, was published in 1992 and the NIH abandoned its application two years later. Based on USPTO examiner James Martinell's estimation at the time, a full examination of all the oligonucleotide claims in the EST patent would have taken until 2035 "because of the computational time required to search for matches in over 700,000 15-mers claimed."

According to Kepler et al., this comprises "roughly half the number of molecules covered by claim 5 of Myriad's '282 patent."

While improvements in bioinformatics and computer hardware have made sequence comparisons much easier than they were in the early 1990s, the study authors arrive at no conclusions about why the USPTO granted Myriad claim 5 in patent '282 and not NIH's EST patent.

The claim by the USPTO examiner is bizarre, to say the least. Running from the early 1990s to 2035, it amounts to more than 40 years to analyze 700,000 15mers for occurrence in other sequences. This works out to testing about 45 oligos per day. What algorithm were they using??
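The arithmetic is quick to check. Below is a minimal sketch; the function name and the particular spans tried are my own choices (43 years is simply 1992 to 2035). Whichever span between 40 and 55 years one assumes, the implied rate is only a few dozen oligos per day.

```python
# Back-of-the-envelope check of the examiner's implied throughput.
OLIGOS = 700_000  # 15-mers claimed, per the quoted estimate


def oligos_per_day(years: float, oligos: int = OLIGOS) -> float:
    """Average number of oligos that must be searched per day to finish in `years`."""
    return oligos / (years * 365)


# The spans below are assumptions for illustration, not figures from the examiner.
for years in (40, 43, 55):
    print(f"{years} years -> {oligos_per_day(years):.1f} oligos per day")
```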

To look at it another way, if you use 2 bit encoding for each base, then every 15mer is a 30-bit string and the set of all 15mers comprises 2^30 different bitstrings -- a presence/absence table over all of them is potentially storable in memory of the 32 bit machines available at the time (which can, of course, address 2^32 bytes of memory). Furthermore, this is a trivially splittable algorithm -- you can break the job into 2^N different jobs by having each run look only at sequences with a given bit prefix of length N. When I started as a grad student in fall 1991, one of my first projects involved a similar trivial partitioning of a large run -- each slice was its own shell script which was forked onto a machine.
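To make the encoding and the prefix split concrete, here is a rough sketch. Everything in it is illustrative: the function names, the toy sequences and the use of a Python set of encoded integers (rather than a raw bit table) are my assumptions, not anything the examiner or the study authors described.

```python
# Illustrative sketch only: all names and data here are hypothetical.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base
K = 15                                   # oligo length; 2 * 15 = 30 bits per 15-mer


def encode(kmer: str) -> int:
    """Pack a DNA k-mer into an integer, 2 bits per base, first base in the high bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def partition(kmer: str, prefix_bits: int) -> int:
    """Job index for a k-mer: the top `prefix_bits` bits of its encoding."""
    return encode(kmer) >> (2 * len(kmer) - prefix_bits)


def kmers(seq: str, k: int = K):
    """Yield every k-long window of seq that contains only A, C, G, T."""
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in CODE for base in window):
            yield window


def search_slice(claimed, database_seq, job, prefix_bits):
    """One independent job: claimed k-mers in this prefix slice that occur in database_seq."""
    mine = {encode(o) for o in claimed if partition(o, prefix_bits) == job}
    return {w for w in kmers(database_seq) if encode(w) in mine}


# Toy run: split the search into 2^4 = 16 independent jobs and merge the results.
claimed = ["ACGTACGTACGTACG", "TTTTTTTTTTTTTTT"]
db = "GGACGTACGTACGTACGGG"
hits = set()
for job in range(2 ** 4):
    hits |= search_slice(claimed, db, job, 4)
print(sorted(hits))
```

Each call to search_slice touches only the oligos whose encoding starts with its own bit prefix, so the 16 calls could just as easily be 16 separate scripts on 16 machines.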

Furthermore, anyone claiming that a job will take decades really needs to make some reasonable assumptions about growth in compute power -- particularly since 64-bit machines were becoming available around that time (e.g. DEC Alpha). Sure, it's dangerous to extrapolate out 50 years (after all, progress in Moore's law from shrinking transistors will hit a wall at one atom per transistor), but this was a ridiculous bit of thinking.

Tuesday, March 23, 2010

What should freshman biology cover?

I've spent some time the last few weeks trying to remember what I learned in freshman biology. Partly this has been triggered by planning for my summer intern (now that I have a specific person lined up for that slot) -- not because of any perceived deficiencies but simply being reminded of the enormous breadth of the biological sciences. It's also no knock on my coursework -- I had a great freshman biology professor (who team-taught with an equally skilled instructor). It was a bit bittersweet to see his retirement announcement last year in the alumni newsletter; he certainly has earned a break but future Blue Hens will have to hope for a very able replacement.

It is my general contention that biology is very different from the other major sciences. My freshman-level physics class (which I couldn't schedule until my senior year) had a syllabus which essentially ended at the beginning of the 1900s. Again, this is no knock on the course or its wonderful professor; it's just that kinetics and electromagnetics on a macro scale were pretty much worked out by then. We had lots of supplementary material on more modern topics such as gravity assists from planets, fixing Hubble's mirror problem and quantum topics, but that was all gravy.

Similarly, my freshman chemistry course (again, quite good) covered science up to about World War II. My sophomore organic chemistry course pushed a little further in the century. It's not that these are backwards fields but quite the opposite -- enough had been learned by those time points to fill two semesters of introductory material.

But, I can't say the same about biology. In the two decades and change since my freshman biology coursework, I can certainly think of major discoveries either made or cemented in that time which deserve the attention of the earliest students. Like the course I assisted with at Harvard, my course had one semester of cells and smaller biology and one of organisms and bigger; I'll mostly focus on the cells and smaller because that's where I spend most of my time. But, there are certainly some strong candidates for inclusion in that other course. I'll also recognize the fact that perhaps for space reasons some of these topics would necessarily be pushed into the second tier courses which specialize in an area such as genetics or cell biology.

One significant problem I'll punt on: what to trim down. I don't remember much fat in my course (indeed, beyond the membrane we didn't speak much at all on it that year!). Perhaps that's what I have forgotten, but I think it more likely it was already pretty packed. I can think of some problem set items that can be jettisoned (Maxam-Gilbert sequencing is a historical curiosity at this point; I can think of much more relevant procedures to do on paper).

One topic I've convinced myself belongs in the early treatment is the proteasome, and not because I once spent a lot of time thinking about it (and also saw some financial gain -- though I no longer have such an interest in it). This is definitely a field which didn't exist on solid ground when I went through school, so its absence from my early education is no surprise. First, it fits neatly into one of the key themes of introductory biology: homeostasis. Cells and organisms have mechanisms for returning to a central tendency, and the proteasome plays a role for proteins. Proteasomes also form a nice bookend with ribosomes -- we learned that proteins are born but not how they die. Furthermore, not only do proteins have a lifespan, but not every protein has the same lifespan -- and lifespans are not fixed at birth. Finally, another great learning in freshman bio is around enzyme inhibitor types -- and the proteasome is the ultimate enzyme inhibitor. Plus, I'd try to mention the case of "the enemy of my enemy protein is my friend" -- proteasomes can activate one protein by destroying its inhibitor.

That's also a nice segue into another major theme worth developing: regulation. I think the main message here is that any time a cell needs to process an mRNA or protein, it's an opportunity for regulation. Post-translational modifications of proteins play a key role here.

Furthermore, it's worth noting that regulation often uses chains of proteins ("pathways"). These chains offer both new opportunities for regulation and signal amplification. We spent a lot of time looking at the chains of enzymes that turn sugars into energy. Of nearly equal importance is the idea that chains of proteins (and not all of them enzymes) can control a cell. In addition, it is important to recognize that these pathways are organized into functional modules, reflecting both opportunities for control and their evolution.

Clearly in this spot the fact that we can now sequence entire genomes deserves mention. Beyond that, I think the most important fact to impress on young minds is how bewildered we still are by even the simplest genomes.

Stem cells are an important concept, and not only because they are a hot topic in the popular press and political arena. This is a key idea -- cell divisions which proceed in an asymmetric pattern.

One final clear concept for inclusion at this level is epigenetics. It is key to underline that there are means to transmit information in a heritable way which are not specifically encoded in the DNA sequence -- as important as that sequence is.

I'm sure I've missed a bunch of topics. There are a lot of ideas in the grey zone -- I haven't quite convinced myself they belong in freshman bio but certainly belong a course up. For example, the fact that organisms can borrow from other genomes (horizontal transfer) or even permanently capture entire organisms (endosymbionts) certainly belongs in cell bio or genetics, but I'm not sure it quite fits freshman year (but nor am I certain it doesn't). Lipid rafts and primary cilia and all sorts of other newly discovered (or re-discovered) subcellular structures definitely would fit in my curriculum there. Gaseous signalling molecules would definitely warrant mention, though perhaps along with the hormones in the organisms and bigger semester.

With luck, many will read this and be kind enough (and kind while doing it) to point out the big advances of the last score of years which deserve inclusion as well -- and I also have little doubt that many freshmen this year are being exposed to many topics I wasn't because they didn't exist yet.

Thursday, March 18, 2010

Second Generation Sequencing Sample Prep Ecosystems

A characteristic of each of the existing second generation sequencing instruments is that each manufacturer provides its own collection of sample preparation reagents and kits.

Some of this variation is inherent to a particular platform. For example, the use of terminal transferase tailing is part of the supposed charm of the Helicos sample prep. Polonator needs to use a series of tag generation steps to overcome its extremely short read length. Illumina's flowcells need to be doped with specific oligos. So, some of this is natural.

On the other hand, it does complicate matters -- especially for various third parties which are producing sample preparation options. For example, targeted resequencing using hybridization really needs to have the sequencing adapters blocked with competing oligos -- and those will depend on which platform the sample has been prepared for. Epicentre has a clever technology using engineered transposases to hop amplification tags into template molecules -- but this must be adapted for each platform. Various academic protocols are developed with one platform in mind, even when there is really no striking functional reason for platform-specificity -- but a protocol developed on one really needs to be recalibrated for any other. And in any case, it would be great for benchmarking instruments if precisely the same library could be fed into multiple machines -- and it would be great for researchers looking to buy sequencing capacity on the open market to be able to defer committing to a platform to the last minute.

In light of all this, it is interesting to contemplate whether this trend will continue. One semi-counter trend has been for all three major players, 454, Illumina and SOLiD, to announce smaller versions of their top instruments. Not only will these require less up-front investment, but they will apparently use all the same consumables as their big siblings -- but not as efficiently. So if you are looking at cost per base in reagents, they won't look good.

However, an even more interesting trend that might emerge is for new players to piggy-back atop the old. Ion Torrent has dropped a hint that they might pursue this direction -- while the precise sample preparation process has yet to be announced (and the ultimate stages prior to loading on the sequencer are likely to be platform-specific), Jonathan Rothberg suggested in his Marco Island talk (according to reports) that the instrument could sequence any library currently in existence. This suggests that they may be willing to encourage & support preparing libraries with other platforms' kits.

Of course, for the green eyeshade folks at the companies this is a big trade-off. On the one hand, it means a new entrant can leverage all the existing preparation products and experience. Furthermore, it means a new instrument could easily enter an existing workflow. The Ion Torrent machine is particularly intriguing here as a potential QC check on a library prior to running on a big machine -- at $500 a run (proposed) it would be worth it (particularly if playing with method development) and with a very short runtime it wouldn't add much to the overall time for sequencing. PacBio may play in this space also, if libraries can be easily popped in. This also acts as a "camel's nose in the tent" strategy for gaining adoption -- first come in as a QC & backstop, later munch up the whole process.

Of course the other side of the equation for the money counters is that selling kits is potentially lucrative. Indeed, it could be so lucrative that an overt attempt to leverage other folks' kits might meet with nasty (and silly) legal strategies -- such as a kit being licensed only for use with a particular platform. That would be silly -- if you are making money off every kit, why not market to all comers?

Thursday, March 11, 2010

Playing Director

Last weekend was the Academy Award presentations. Fittingly, just before that I had my first theatrical success -- with Actors.

Actors are a Scala abstraction for multiprocessing. I've only really played with multiprocessing once, back in my waning days at Harvard. I tried writing some multithreaded Java code, and the results were pretty ugly. The code soon became cluttered with locks and unlocks and synchronized keywords, but my programs still locked up consistently. Multiple processes can be a real headache.

But, there's also the benefit -- especially since I have a brand new smoking fast oligoprocessor box (I keep some mystery in the precise number). Tools such as bowtie and BWA are multithreaded, but it would be useful to have some of the downstream data crunching tools enabled as well.

Actors are a high level abstraction which relies on message passing. Each Actor (or non-Actor process) communicates with other Actors by sending an object. The Actor figures out what sort of object has been thrown its way and acts on it. A given Actor will execute its tasks in the order given, but across the cast there are no guarantees; everything is asynchronous. Each Actor behaves as if it has its own thread, though in reality a pool of worker threads manages the execution of the Actors -- threads tend to be heavyweight to start up, so this scheme minimizes that overhead and thereby encourages casts of millions -- but I won't emulate de Mille for some time.
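For readers who have not met the pattern, here is a toy version of the idea. It is written in Python rather than Scala, it is not the Scala Actors library, and it gives each actor a real thread instead of sharing a worker pool; the class and method names are invented for illustration.

```python
import queue
import threading


class Actor:
    """Toy actor: a private mailbox drained by one worker thread, in arrival order."""

    _STOP = object()  # private sentinel telling the worker loop to exit

    def __init__(self):
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self.mailbox.put(message)

    def stop(self):
        """Ask the actor to finish its queued messages, then wait for its thread to exit."""
        self.mailbox.put(Actor._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is Actor._STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Dispatches on the type of each message, roughly like a Scala `case` match."""

    def __init__(self):
        self.total = 0
        self.words = []
        super().__init__()  # start the thread only after the state exists

    def receive(self, message):
        if isinstance(message, int):
            self.total += message
        elif isinstance(message, str):
            self.words.append(message)


counter = Counter()
for item in (1, "hello", 2, "world", 3):
    counter.send(item)   # returns at once; the actor works through the mailbox on its own
counter.stop()
print(counter.total, counter.words)
```

Only the actor's own thread ever touches total and words, which is why no locks or synchronized blocks appear anywhere.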

My first round of experiments left me with new bruises -- but I did come out on top. Some lessons learned are below.

First, get the screenplay nailed down as much as possible before involving the Actors. Debugging multithreaded code brings on its own headaches; don't bring down that mess of trouble before you need to. For example, in the IDE I am using (Eclipse with the Scala plug-in, which some day I will rant about) if you hit a debugging breakpoint in one thread the others keep going. In my case, that meant a println statement from my master process saying "I'm waiting for an Actor to finish" -- which kept printing and thereby prevented me from examining variables (because in Eclipse, if Console is being written to it automatically pops to center stage).

A corollary to this is after several iterations I had improved the algorithm so much it probably didn't need Actors any more! I really should time it with 0, 1 and 2 Actors (and both permutations of 1 actor -- the code runs in 3 stages and a single actor can do either the last one or the last two) -- the code is about 2/3 of the way to enabling that. Actually, one reason I went through a final bit of algorithmic rethink was the fact that the Actor enabled code was still a time pig -- the rethought version ran like a greased pig.

Second, remember the abstraction -- everything is passed as messages and these messages may be processed asynchronously. More importantly, always pretend that the messages are being passed by some degree of copying. An early version of my code ignored this and had code trying to change an object which had been thrown to an Actor. This is a serious no-no and leads to unpredictable results. You give stage commands to your Actors and then let them work!
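One cheap way to enforce the "pretend it was copied" rule is to actually copy. A small sketch in Python, with a plain list standing in for an Actor's mailbox and a made-up send_copy helper; sending immutable objects would achieve the same end without the copying cost.

```python
import copy


def send_copy(mailbox: list, message):
    """Enqueue a deep copy so later changes by the sender cannot reach the receiver."""
    mailbox.append(copy.deepcopy(message))


mailbox = []                                   # stand-in for an actor's mailbox
reads = {"sample": "A", "counts": [1, 2, 3]}   # hypothetical message payload
send_copy(mailbox, reads)
reads["counts"].append(99)                     # the sender keeps mutating its own object
print(mailbox[0]["counts"])                    # the receiver's copy is unaffected
```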

Third, make sure you let them finish reciting their lines! My first master thread didn't bother to check if all the Actors were done -- which led to all sorts of interesting run-to-run variation in output which was mystifying (until I realized it was the synchrony problem). Checking for being done isn't trivial either. One way is to have a flag variable in your Actor which is set when it runs out of things to do. That's good -- as long as you can easily figure out how to set it. You can also look to see if an Actor is done processing messages -- except checking for an empty mailbox doesn't guarantee it is done processing that last message, only that it has picked it up. One approach that worked for my problem, since it is a simple pipelining exercise, is to have Actors throw "NOP" messages at themselves prior to doing any long process -- especially when the master thread sends them a "FLUSH" command to mark the end of the input stream. Such No OPeration messages keep the mailbox full until it gets done with the real work.
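The flag-variable idea combines naturally with a FLUSH marker in a simple pipeline. The sketch below is not my Scala code and does not use the NOP trick; it is a Python illustration, with invented names, of a stage that raises its "done" flag only once FLUSH has come through behind all the real work.

```python
import queue
import threading

FLUSH = object()  # end-of-stream marker, named after the FLUSH command described above


def stage(inbox, outbox, work, done):
    """Pipeline stage: apply `work` to each message; on FLUSH, pass it along and signal done."""
    while True:
        message = inbox.get()
        if message is FLUSH:
            if outbox is not None:
                outbox.put(FLUSH)   # propagate end-of-stream downstream
            done.set()              # raised only after every earlier message was handled
            return
        result = work(message)
        if outbox is not None:
            outbox.put(result)


results = []
q1, q2 = queue.Queue(), queue.Queue()
done1, done2 = threading.Event(), threading.Event()
threading.Thread(target=stage, args=(q1, q2, lambda x: x * x, done1)).start()
threading.Thread(target=stage, args=(q2, None, results.append, done2)).start()

for n in range(5):
    q1.put(n)
q1.put(FLUSH)

done2.wait()   # block until the last stage has seen FLUSH, rather than polling a mailbox
print(results)
```

Because each mailbox is first-in, first-out and each stage is a single thread, FLUSH cannot overtake real work, so the flag on the final stage is a safe signal that the whole run is finished.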

So, I have a working production. I'll be judicious in how I use this, as I have discovered the challenges (in addition to the problems I solved above, there is a way to send synchronous messages to Actors -- which I could not get to behave). But, I am already thinking of the next Actor-based addition to some of my code -- and my current treatment is pretty complicated. But, a plot snip here and a script change there and I should be ready for tryouts!