Wednesday, July 21, 2010

Distractions -- there's an app for that

Today I finally gave in to temptation & developed a Hello World application for my Droid. Okay, developed is a gross overstatement -- I successfully followed a recipe. But it did take a while to install the SDK & its Eclipse plugin, plus the necessary device driver so I can debug stuff on my phone.

Since I purchased my Droid in November, the idea of writing something for it has periodically tempted me. Indeed, one attraction of Scala (which I've done little with for weeks) was that it can be used to write Android apps, though it definitely means a new layer of complexity. This week's caving in had two drivers.

First, Google last week announced a novice "you can write an app even if you can't program" tool called AppInventor. I rushed to try it out, only to find that they hadn't actually made the tool available, just a registration form. Supposedly they'll get back to you, but they haven't yet. Perhaps it's because I'm not an educator -- the form has lots of fields aimed at educators.

The second trigger is that an Android book I had requested came in at the library. Now, it covers an OS version a few releases back -- but certainly okay for a start (trying to keep public library collections current on technical stuff is a quixotic task in my opinion, though I do enjoy the fruits of the effort). So that was my train reading this morning & it got me stoked. The book is certainly not much more than a starting springboard -- I'm debating buying one called "Advanced Android Programming" (or something close to that) or whether just to sponge off on-line resources.

The big question is what to do next. The general challenge is choosing between apps that don't do anything particularly sophisticated but are clearly doable vs. more interesting apps that might be a bit much to take on -- especially given the challenge of operating a simulator for a device very unlike my laptop (accelerometers! GPS!). I have a bunch of ideas for silly games or demos, most of which shouldn't be too hard -- and then one concept that could be somewhat cool but would really push the envelope on difficulty.

It would be nice to come up with something practical for my work, but right now I haven't many ideas in that area. Given that most of the datasets I work with now are enormous, it's hard to see any point to trying to access them via phone. A tiny browser for the UCSC genome database has some appeal, but that's sounding a bit ambitious.

If I were still back at Codon Devices, I could definitely see some app opportunities, either to demo "tech cred" or to be really useful. For example, at one point we were developing (through an outsourcing vendor) a drag-and-drop gene design interface. The full version probably wouldn't be very app appropriate, but something along those lines could be envisioned -- call up any protein out of Entrez & have it codon optimized with appropriate constraints & sent to the quoting system. In our terminal phase, it would have been very handy to have a phone app to browse metabolic databases such as KEGG or BioCyc.

That thought has suggested what I would develop if I were back in school. There is a certain amount of simple rote memorization that is either demanded or turns out to expedite later studies. For example, I really do feel you need to memorize the single letter IUPAC codes for nucleotides and amino acids. I remember having to memorize amino acid structures and the Krebs cycle and glycolysis and all sorts of organic synthesis reactions and so forth. I often devised either decks of flash cards or study sheets, which I would look at while standing in line for the cafeteria or other bits of solitary time. Some of those decks were a bit sophisticated -- for the pathways I remember making both compound-centric and reaction-centric cards for the same pathways. That sort of flashcard app could be quite valuable -- and perhaps even profitable if you could get students to try it out. I can't quite see myself committing to such a business, even as a side-line, so I'm okay with suggesting it here.

Tuesday, July 13, 2010

There are 2 styles of Excel reports: Mine & Wrong

A key discovery which is made by many programmers, both inside and outside bioinformatics, is that Microsoft Excel is very useful as a general framework for reporting to users. Unfortunately, many developers don't get beyond that discovery to think about how to use this to best advantage. I've developed some pretty strong opinions on this, which have been repeatedly tested recently by various files I've been sent. I've also used this mechanism repeatedly, with some Codon reports for which I am guilty of excessive pride.

An overriding principle for me is that I am probably going to use any report in Excel as a starting point for further analysis, not an endpoint. I'm going to do further work in Excel or import it into Spotfire (my preference) or JMP or R or another fine tool. Unfortunately, there are a lot of practices which frustrate this.

First, as much data as possible should be packed into as few tabs as practical. Unless you have a very good reason, don't put data formatted the same way into multiple files or multiple tabs. I recently got some sequencing results from a vendor and there was one file per amplicon per sample. I want one file per total project!

Second, the column headers need to be ready for import. That means a single row of column headers, with every column having a specific and unique header. Yes, for viewing it sometimes looks better to have multiple header rows and use cell fusing and other tricks to minimize repetition -- but for import this is a disaster, either guaranteed or waiting to happen.

Third, every row needs to tell as complete a story as possible. Again, don't go fusing cells! It looks good, but nobody downstream can tell that the second row really repeats the first N cells of the row above (because they are fused).

Fourth, don't worry about extra rows. One tool I use for analysis of Sanger data spits out a single row per sample with N columns, one column for each mutation. This is not a good format! Similarly, think very carefully before packing a lot into a single cell -- Excel is terrible for parsing that back out. Don't be afraid to create lots of columns & rows -- Excel is much better at hiding, filtering or consolidating than it is at parsing or expanding.
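To make the fourth rule concrete, here's a small sketch of the reshaping I'd rather see -- turning "one row per sample, one column per mutation" into one row per sample-mutation pair, which imports cleanly into Spotfire, JMP or R. Hypothetical data, not the actual output of that Sanger tool, and Scala only because it's the language I've been tinkering with lately.

// Melt a wide layout (sample -> list of mutations) into one row per sample-mutation pair.
// Hypothetical data structures; the real tool's output would need parsing first.
def meltMutations(rows: Seq[(String, Seq[String])]): Seq[(String, String)] =
  for ((sample, mutations) <- rows; mutation <- mutations) yield (sample, mutation)

// e.g. meltMutations(Seq(("S1", Seq("A123T", "G456C")), ("S2", Seq("T789A"))))
//   -> Seq(("S1","A123T"), ("S1","G456C"), ("S2","T789A"))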

Finally, color or font coding can be useful -- but use it carefully and generally redundantly. Ignoring the careful part means generating confusing "angry fruit salad" displays (and never EVER make text blink in a report or slide!!!).

Follow these simple rules and you can make reports which are springboards for further exploration. It's also a good start to thinking about using Excel as a simple front end to SQL databases.

So what was so great about my Codon reports? Well, I had figured out how to generate the XML to handle a lot of nice features of the sort I've discussed above. The report had multiple tabs, each giving a different view or summary of the data. The top tab did break my rules -- it was purely a summary table & was not formatted for input into other tools (though now I'm feeling guilty about that; perhaps I should have had another tab with it properly formatted). But each additional tab stuck to the rules. All of them had AutoFilter already turned on and had carefully chosen highlighting when useful -- using a combination of cell color and text highlighting to emphasize key cells. Furthermore, it hewed to my absolute dictum "Sequences must always be in a fixed width font!". I didn't have it automatically generate Pivot Tables; perhaps eventually I would have gotten there.
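For anyone curious how much XML that actually takes, below is a bare-bones sketch of the sort of thing (in Scala, as in the other snippets on this blog). It is emphatically not the Codon code, and the element and namespace details of the old XML Spreadsheet 2003 dialect are from memory -- check them against the SpreadsheetML documentation before trusting them. The point is simply that styles (a fixed-width font for sequence cells, a fill color for highlights) and AutoFilter are each just a few extra elements.

// Skeleton of an XML Spreadsheet 2003 worksheet with styles and AutoFilter.
// From memory and untested against Excel -- treat the details as assumptions.
val sheetSkeleton =
  """<?xml version="1.0"?>
    |<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    |          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
    |          xmlns:x="urn:schemas-microsoft-com:office:excel">
    | <Styles>
    |  <Style ss:ID="seq"><Font ss:FontName="Courier New"/></Style>                 <!-- fixed width for sequences -->
    |  <Style ss:ID="hot"><Interior ss:Color="#FFFF00" ss:Pattern="Solid"/></Style> <!-- highlight key cells -->
    | </Styles>
    | <Worksheet ss:Name="Data">
    |  <Table>
    |   <Row><Cell><Data ss:Type="String">Sample</Data></Cell>
    |        <Cell><Data ss:Type="String">Sequence</Data></Cell></Row>
    |   <Row><Cell><Data ss:Type="String">S1</Data></Cell>
    |        <Cell ss:StyleID="seq"><Data ss:Type="String">GATTACA</Data></Cell></Row>
    |  </Table>
    |  <AutoFilter x:Range="R1C1:R2C2" xmlns="urn:schemas-microsoft-com:office:excel"/>
    | </Worksheet>
    |</Workbook>""".stripMargin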

Monday, June 28, 2010

Knome Cofactors Ozzy Osbourne's Genome

I got an email today pointing out that Cofactor Genomics will be responsible for generating Ozzy Osbourne's genome sequence for Knome. Said email, unsurprisingly, originated at Cofactor.

Now, before anyone accuses me of being a shill, let me point out that (a) I get no compensation from any of these companies and (b) I've asked for multiple quotes from Cofactor in good faith but have yet to actually send them any business. Having once been in the role of quote-generator and knowing how frustrating it is, I have a certain sympathy & see spotlighting them as a reasonable bit of compensation.

Knome's publicity around the Osbourne project has highlighted his body's inexplicable ability to remain functioning despite the tremendous chemical abuse Mr. Osbourne has inflicted on it. The claim is that his genome will shed light on this question. Given that there are no controls in the experiment and he is an N of one, I doubt anything particularly valuable scientifically will come of this. I'm sure there will be a bunch of interesting polymorphisms which can be cherry-picked -- we all carry them. Plus, the idea that this particular individual will change his life in response to some genomics finding is downright comical -- clearly this is not someone who thinks before he leaps! It's about as likely as finding the secret to his survival in bat head extract.

Still, it is a brilliant bit of publicity for Knome. Knome could use the press given that the FDA is breathing down their necks along with the rest of the Direct to Consumer crowd. Getting some celebrities to spring for genomes will set them up for other business as their price keeps dropping. The masses buy the clothes, cars and jewelry worn by the glitterati, so why not the genome analysis? Illumina ran Glenn Close's genome, but Ozzy probably (sad to say) has a broader appeal across age groups.

Sunday, June 27, 2010

Filling a gap in the previous post

After thinking about my previous entry on PacBio's sample prep and variant sensitivity paper, I realized there was a significant gap -- neither the paper nor my post dealt with gaps.

Gaps, or indels, are very important. Not only are indels a significant form of polymorphism in genomes, but small indels are one route in which tumor suppressor genes can be knocked out in cancer.

Small indels have also tended to be troublesome in short read projects -- one cancer genome project missed a known indel until the data was reviewed manually. Unfortunately, no more details were given of this issue, such as the depth of coverage which was obtained and whether the problem lay in the alignment phase or in recognizing the indel. The Helicos platform (perhaps currently looking like it should have been named Icarus) had a significant false indel rate due to missed nucleotide incorporations ("dark bases"). Even SOLiD, whose two-base encoding should in theory enable highly accurate base calling, had only a 25% rate of indel verification in one study (all of these are covered in my review article).

Since PacBio, like Helicos, is a single molecule system, it is expected that missed incorporations will be a problem. Some of these will perhaps be due to limits in their optical scheme, but probably many will be due to contamination of their nucleotide mixes with unlabeled nucleotides. This isn't some knock on PacBio -- it's just that any unlabeled nucleotide contamination (either due to failure to label or loss of the label later) will be trouble -- even if they achieve 1 part in a million purity, a given run will still have tens or hundreds of incorporations of unlabeled nucleotides.

There's another interesting twist to indels -- how do you assign a quality score to them? After all, a typical phred score encodes the probability that the called base is wrong. But what if no base should have been called? Or another base called prior to this one? For large numbers of reads, you can calculate some sort of probability of seeing the same event by chance. But for small numbers, can you somehow sort the true from the false? One stab I (and I think others) have made at this is to give an indel a pseudo-phred score derived from the phred scores of the flanking bases -- the logic being that if those were strong calls then there probably isn't a missing base, but if those calls were weak then your confidence in not skipping a base is poor. The function to use on those adjacent bases is a matter of taste & guesswork (at least for me) -- I've tried averaging the two or taking the minimum (or even computing over longer windows), but have never benchmarked things properly.
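To make that heuristic concrete, here's a minimal sketch of the pseudo-phred idea -- my own unbenchmarked guesswork, as noted above, with the choice of minimum vs. mean (a wider window would be a straightforward extension) left as a parameter.

// Pseudo-phred for an indel call, derived from the qualities of the flanking bases.
// Unbenchmarked heuristic: strong neighbors -> probably no skipped base; weak neighbors -> low confidence.
def indelPseudoPhred(flankingQuals: Seq[Int], useMin: Boolean = true): Int =
  if (flankingQuals.isEmpty) 0
  else if (useMin) flankingQuals.min
  else flankingQuals.sum / flankingQuals.length

// e.g. indelPseudoPhred(Seq(35, 12)) -> 12 (one weak neighbor drags down confidence)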

Some variant detection algorithms for short read data deal with indels (e.g. VarScan) and some don't (e.g. SNVMix). Ideally they all would. There's also a nice paper on the subject that unfortunately didn't leave lazy folks like me their actual code.

So, in conclusion, I'd love to see PacBio (and anyone else introducing a new platform) perform a similar analysis of their circular consensus sequencing on a small amount of a 1 nucleotide indel variant spiked into varying (but large) backgrounds of wildtype sequence. Indeed, ideally there would be a standard set of challenges which the scientific community would insist that every new sequencing platform either publish results on or confess inadequacy for the task. As I suggested in the previous post, extremes of %GC, hairpins, and mono/di/tri/tetranucleotide repeats should be included, along with the single nucleotide substitution and single nucleotide deletion mixtures. To me these would be far more valuable in the near term than the sorts of criteria in the sequencing X-prize for exquisite accuracy. Those are some amazing specs (which I am on record as being skeptical will be met anytime soon). What those of us trying to match platforms with experiments (and budgets) really need are the nitty-gritty details of what works and what doesn't, at what cost (or coverage).

Wednesday, June 23, 2010

PacBio oiBcaP PacBio oiBcaP

PacBio has a paper in Nucleic Acids Research giving a few more details on sample prep and some limited sequencing quality data on their platform.

The gist (which was largely in the previous Science paper) is that the templates are prepared by ligating a hairpin structure onto each end. In this paper they focused on a PCR product, with the primers designed with type IIS (BsaI) restriction sites flanking the insert. Digestion yields sticky ends (and with BsaI, these can be designed to be non-symmetric to discourage concatamerization) which enable ligating the adapters. Of course, for applying this on a large scale you do run into the problem of avoiding the restriction sites. There are other approaches, such as uracil-laden primer segments which can be specifically destroyed by the DUT and UNG enzymes -- the only catch being that many of the most accurate polymerases can be poisoned by uracil. A couple of other schemes have either been proven or can be sketched quickly.

In any case, once these molecules are formed they have the nice property of being topologically circular and having the insert sequence represented twice (once in reverse complement) within that circle, with the two pieces separated by the known linker sequence. So if the polymerase lasts to go round-and-around, then each pass gives another crack at the sequence. For short templates, this gives lots of rounds and even on a longer (1Kb) PCR product they successfully had three bites at the apple. Feed these into a probabilistic aligner and each pass improves the quality values. Running the same template many times enabled calibrating the quality values (which are Phred-scaled), showing them to be a reasonable estimator of the likelihood of miscalling the consensus.
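As a toy illustration of why the passes help -- my own back-of-envelope, not the paper's actual consensus model -- if each pass were an independent observation with error probability p and all the passes agreed, the consensus error would shrink roughly as p to the number of passes, i.e. phred scores would add.

// Naive, over-optimistic model: independent agreeing passes multiply error probabilities.
def phredToErr(q: Double): Double = math.pow(10.0, -q / 10.0)
def errToPhred(p: Double): Double = -10.0 * math.log10(p)

def naiveConsensusQuality(perPassQ: Double, passes: Int): Double =
  errToPhred(math.pow(phredToErr(perPassQ), passes))

// e.g. naiveConsensusQuality(10.0, 3) -> 30.0 (three Q10 passes ~ one Q30 call, under these assumptions)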

Given that eventually their polymerase dies on a template, they point out that the length of a fragment is highly predictive of the number of passes which can be observed, which in turn implies the sequencing quality which will be obtained for a template. This makes for an interesting variation on the usual "what size library" question in sequencing. For sequencing a bacterial genome, it may make sense to go very deep with short fragments to get high quality pieces which can then be scaffolded using much longer individual reads without the circular consensus effect, as well as the spaced "strobe" reads to provide very long range scaffolding. For a bacterial genome & the claims made of capacity per SMRT cell, it would seem that making 2 libraries (one short for consensus sequencing, one long for the others) and burning perhaps 4-5 $100 SMRT cells, one could get a very good bacterial genome draft. For targeted sequencing by PCR, the approach is quite attractive. Amplicon length becomes an important variable for final quality. Appropriate matching of the upstream multi-PCR technology (such as RainDance or Fluidigm or just lots of conventional PCRs) with PacBio for capacity will be necessary -- and hard numbers on throughput per SMRT cell are desperately needed!
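A back-of-envelope way to see the length trade-off (with entirely made-up numbers, not PacBio specs): each read of the insert consumes roughly one insert-plus-adapter length of the polymerase's total read, so shorter templates get proportionally more looks.

// Rough pass count for circular consensus: total polymerase read length divided by
// the insert-plus-adapter length. Illustrative only; real read lengths vary widely.
def insertReads(insertLen: Int, adapterLen: Int, polymeraseReadLen: Int): Int =
  polymeraseReadLen / (insertLen + adapterLen)

// e.g. with a hypothetical 3000-base polymerase read: insertReads(1000, 50, 3000) -> 2,
//      but insertReads(250, 50, 3000) -> 10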

One side thought: is there a risk with strobe sequencing of accidentally going off the end and reading back along the same insert -- but not realizing it since the flip turn in the hairpin was done in the dark? It would suggest that libraries for strobe sequencing need to be very stringently sized.

One curiosity of their plot of this phenomenon is that the values appear to be asymptotically approaching phred 40 -- an error rate of 1 in 10,000. Is this really where things top out? That's a good quality -- but for some applications possibly not good enough.

They go on to apply this to measuring the allele frequency of a SNP in defined mixtures. Given several thousand reads and the circular consensus, they succeeded at this -- even when the minor allele was present at a frequency of 2.5%. This isn't as far as some groups have pushed 2nd generation sequencing for rare allele detection (such as finding a cancer mutation in a background of normal DNA), but is certainly in the range of many schemes used in this context (such as real-time PCR). If my inference of a maximum phred score is correct, it would tend to limit the sensitivity for rare mutation detection to somewhere in the parts-per-thousand range; the record is around this level (0.1%, reported by Yauch et al).
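A quick back-of-envelope on that sensitivity ceiling (my arithmetic, with purely illustrative numbers): compare the expected number of true minor-allele reads to the expected number of erroneous calls if consensus accuracy tops out near Q40.

// Expected counts in a pile of n reads -- illustrative only.
def expectedMinorReads(n: Int, minorFreq: Double): Double = n * minorFreq
def expectedErrorCalls(n: Int, errRate: Double): Double = n * errRate

// e.g. with 10,000 reads at Q40 (error 1e-4): a 0.1% allele gives ~10 true reads vs. ~1 expected miscall,
//      while a 0.01% allele gives ~1 true read vs. ~1 miscall -- hence parts-per-thousand as a rough floor.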

My major concern with this paper is the lack of a breadth of substrates. Are there template properties which give trouble to the system? The obvious candidates are extremes of composition, secondary structures and simple repeats. Will their polymerase power through 80+% GC? Complex hairpins? Will stuttering be observed on long simple repeats? Can long mononucleotide runs be measured accurately?

In the end, the key uses for the first release PacBio system will depend on where these problems are and those throughput numbers. This paper is a useful bit of information, but much more is needed to determine when PacBio is either cost-effective for an application (vs. other sequencing or non-sequencing competitors) or when the performance advantages in that application (such as turn-around time) push them to the fore. Personally, I think it is a mistake on PacBio's part not to be literally flooding the world with data. Or at least the bioinformatics world -- only when they start releasing data to the broad developer community will there be the critical tweaking of existing tools and development of new tools to really take advantage of this platform and also cope with whatever weaknesses it possesses.

Travers, K., Chin, C., Rank, D., Eid, J., & Turner, S. (2010). A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research. DOI: 10.1093/nar/gkq543

Monday, June 07, 2010

What's Gnu in Sequencing?

The latest company to make a splashy debut is GnuBio, a startup out of Harvard which gave a presentation at the recent Personal Genomics conference here in Boston. Today's Globe Business section had a piece, and Bio-IT World and Technology Review covered it as well. Last week Mass High Tech & In Sequence (subscription required) each had a bit too.

GnuBio has some grand plans, which are really in two areas. For me the more interesting one is the instrument. The claim which they are making, with an attestation of plausibility from George Church (who is on their SAB, as is the case with about half of the sequencing instrument companies), is that a 30X human genome will be $30 in reagents on a $50K machine (library construction costs omitted, as is unfortunately routine in this business). The key technology from what I've heard is the microfluidic manipulation of sequencing reactions in picoliter droplets. This is similar to RainDance, which has commercialized technology out of the same group. The description I heard from someone who attended the conference is that GnuBio is planning to perform cyclic sequencing by synthesis within the droplets; this will allow miniscule reagent consumption and therefore low costs.

It's audacious, & if they really can change out reactions within the picoliter droplets it is technically quite a feat. From my imagination springs a vision of droplets running a racetrack, alternately getting reagents and being optically scanned both for which base came next & for an optical barcode on each droplet. I haven't seen this description, but I think it fits within what I have heard.

Atop those claims comes another one: despite having not yet read a base with the system, by year end two partners will have beta systems. It will be amazing to get proof-of-concept sequencing, let alone have an instrument shippable to a beta customer (this also assumes serious funding, which apparently they haven't yet found). Furthermore, it would be stunning to get reads long enough to do any useful human genome sequencing even after the machine starts making reads, let alone enough for 30X coverage.

The article in Technology Review, a journal I once read regularly and had significant respect for, is depressingly full of sloppy journalism & failure to understand the topic. One paragraph has two doozies:
Because the droplets are so small, they require much smaller volumes of the chemicals used in the sequencing reaction than do current technologies. These reagents comprise the major cost of sequencing, and most estimates of the cost to sequence a human genome with a particular technology are calculated using the cost of the chemicals. Based solely on reagents, Weitz estimates that they will be able to sequence a human genome 30 times for $30. (Because sequencing is prone to errors, scientist must sequence a number of times to generate an accurate read.)

The first problem here is that yes, the reagents are currently the dominant cost. But if library construction costs are somewhere in the $200-500 range, then once you drop reagents greatly below that cost it's a bit dishonest to tout (and poor journalism to repeat) a $30/human genome figure. Now, perhaps they have a library prep trick up their sleeve or perhaps they can somehow go with a Helicos-style "look Ma no library construction" scheme. Since they have apparently not settled on a chemistry (which will also almost certainly impose technology licensing costs -- or developing a brand new chemistry -- or getting the Polonator chemistry, which is touted as license-free), anything is possible -- but I'd generally bet this will be a clonal sequencing scheme requiring in-droplet PCR. The second whopper there is the claim that the 30X coverage is needed for error detection. It certainly doesn't hurt, but even with perfect reads you still need to oversample just to have good odds of seeing both alleles in a diploid genome.
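To put a number on that last point (my arithmetic, not the article's): at a heterozygous site each read samples one of the two alleles at random, so even with perfect reads you need reasonable depth before both alleles are reliably observed.

// Probability that a heterozygous site is missing one of its two alleles at a given depth,
// assuming unbiased 50/50 sampling and error-free reads (depth >= 1).
def probMissAnAllele(depth: Int): Double = 2.0 * math.pow(0.5, depth)

// e.g. probMissAnAllele(5) ~ 0.0625 (about 6% of het sites), probMissAnAllele(10) ~ 0.002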

Just a little later in the story is the claim "The current cost to sequence a human genome is just a few thousand dollars, though companies that perform the service charge $20,000 to $48,000", which confuses what one company (Complete Genomics) may have achieved with what all companies can achieve.

The other half of the business plan I find even less appealing. They are planning to offer anyone a deal: pay your own way or let us do it, but if we do it we get full use of the data after some time period. The thought is that by building a huge database of annotated sequence samples, a business around biomarker discovery can be built. This business plan has of course been tried multiple times (Incyte, GeneLogic, etc.) and has worked in the past.

Personally, I think whoever is buying into this plan is deluding themselves in a huge way. First, while some of the articles seem to be confident this scheme won't violate the consent agreements on samples, it's a huge step from letting one institution work with a sample to letting a huge consortium get full access to potentially de-identifying data. Second, without good annotation the sequence is utterly worthless for biomarker discovery; even with great annotation, randomly collected data is going to be challenging to convert into something useful. Plus, any large scale distribution of such data will butt up against the widely accepted provision that subjects (or their heirs) can withdraw consent at any time.

The dream gets (in my opinion) just daffier beyond that -- subjects will be able to be in a social network which will notify them when their samples are used for studies. Yes, that might be something that will appeal to a few donors, but will it really push someone from not donating to donating? It's going to be expensive to set up & potentially a privacy leakage mechanism. In any case, it's very hard to see how that is going to bring in more cash.

My personal advice to the company is many-fold. First, ditch all those crazy plans around forming a biomarker discovery effort; focus on building a good tech (and probably selling it to an established player). Second, focus on RNA-Seq as your initial application -- this is far less demanding in terms of read length & will allow you to start selling instruments (or at least generating data) much sooner, giving you credibility. Of course, without some huge drops, the cost of library construction will dwarf that $30 in reagents, perhaps by a factor of 10X. A clever solution there using the same picodroplet technology will be needed to really get the cost of a genome to low levels -- and could be cross-sold to the other platforms (and again, perhaps be a source of a revenue stream while you work out the bugs in the sequencing scheme).

Finally, if you really can do an RNA-Seq run for $30 a run in total operating costs, could you drop an instrument by my shop?

Saturday, June 05, 2010

Trying to Kick the Bullet

I know the title looks like a malapropism, but it isn't. What I'm trying to do is wean myself away from bulleted lists in PowerPoint.

I have a complex relationship with PowerPoint, as with the other tools in Microsoft Office. Each has real value but has also been seriously junked up by the wizards of Redmond. Far too much time is spent investing the tool with features that are rarely or never of value.

Edward Tufte, whom I admire greatly, takes a far more negative view of PowerPoint. I'm a fan of Tufte's; not an acolyte. PowerPoint can be very useful, if you use it carefully. But, I'm always open to considering how I might improve how I use it. After looking through some of Tufte's specific criticisms of bulleted lists, I realize here is an opportunity to make a change.

Now, I am a heavy user of bulleted lists. I often think in hierarchies & outlines, which fits them well. I also find it challenging to draw complex diagrams in a manner which is both presentable & useful, at least in reasonable time. So I often write many slides of bulleted lists & then try to go back and decorate them with relevant & informative diagrams I can lift from various other sources (or from previous slides). I do spend some time designing a few careful diagrams using the tools in PowerPoint.

So what is wrong with bulleted lists? As Tufte points out, the standard Microsoft scheme uses four or more different attributes to display levels of hierarchy. First, inferior levels are more indented than their parents. Second, the type size is changed. Third, the type face is changed to a different family or italicized (or both). Fourth, the character used for the bullet is changed. On top of Tufte's sharp criticism, there is the pointed satire as found in The Gettysburg PowerPoint Address.

Now, I find some of this useful, but it is a good wakeup that some of it I have just been accepting. After all, do I really need the actual bullets? Rarely are they actually showing anything useful -- the one exception being when I change them around in one list to show a useful attribute (such as checks vs. X's ). Another variant is using a numbered list to emphasize a critical order of points, such as in a series of steps which must be executed in order. But, most bullets are just consuming valuable slide real estate without adding value.

I also find the indents useful to show hierarchy. And having a smaller typeface is useful since the upper levels are more on the order of headlines and the lower levels often details -- so a smaller face allows me to pack more in.

But the change in font family or italicization? Those aren't very useful either. I'd much rather save italics for emphasis.

The challenge is actually putting this into practice. I have started going through active slide decks and converting them to the reduced list scheme. I don't see myself giving up hierarchical lists, but I'll try to do better within that structure.

Monday, May 31, 2010

Sepsis: A Severe Shock

At the beginning of the weekend I received a real shock: one of the other fathers in TNG's Cub Scout Pack had died of sepsis after a failed endoscopic procedure. His son is a year younger than mine, but we interacted a bunch on various outings & in my mind's eye I can still see his smiling face illuminated by a lantern at a recent campout. He was a mathematician & I have a bit of a natural draw to anyone in a technical field. Plus, it is always unsettling to have a near contemporary pass so suddenly and from such an unexpected source.

That someone relatively young, being treated in a hospital widely acclaimed as one of the world's best, could die this way illustrates the grim terror of sepsis. I have never worked directly on it, though when I was an intern at Centocor the lead agent in their therapeutic pipeline was directed against gram-negative sepsis, that is sepsis resulting from an infection by gram negative bacteria. At the time I was there, Centocor and their rival Xoma were cross-suing each other over patent issues for their sepsis drugs (both monoclonal antibodies) and the Department of Defense was accepting the unapproved drug for possible use in the First Gulf War.

Both Xoma & Centocor unfortunately ended up following the path which so far has characterized sepsis: both drugs failed in the clinic, nearly pulling both organizations down with them. Numerous drugs have failed in the clinic for sepsis. A significant challenge is that in sepsis one must somehow prevent the immune system from causing collateral damage to the body while not preventing it from combating the grave infection which is triggering the reaction.

Clearly, this is a tough nut. The Centocor & Xoma drugs both tried to target a toxin (endotoxin A) which is released by dying gram negative bacteria. One thought I had at the time is that a diagnostic would be valuable which would enable distinguishing those patients with gram negative infections who could potentially benefit from those with gram positive infections who could not. In retrospect, even such a diagnostic is a tough challenge -- to be of any clinical value it would need to return results in a matter of minutes or a few hours. That's a hard problem. Other therapies which have been tried in the clinic have tried to modulate the immune system and proven no more effective.

Even running a sepsis trial is clearly an even greater challenge than your average serious disease trial. Obtaining proper informed consent from patients who are at risk of dying in a very short timespan cannot be easy. Challenges in running other trials in emergency medicine situations have bedeviled another biotech horror land: blood substitutes.

Quite likely a key part of the problem is that we just don't understand this area of biology well enough. Perhaps intensive proteomic and metabolomic analysis on collected samples will yield new markers which will guide better management. Perhaps better animal models can be developed and exploited to understand the complex series of events which occur in sepsis.

I wish I had some answers; on this I'll declare complete defeat. That, and a haunting image in my mind of a cheerful face which now exists only in memories and photographs.

Tuesday, May 25, 2010

Guesting over at MolBio Research Highlights

I have an invited piece on the various sequencing instruments over at MolBio Research Highlights. The fact that much of it is an extended riff on the Winter Olympics is suggestive of how long ago the invite came; I was not diligent about turning around revisions quickly. Alejandro Montenegro-Montero there was nice enough to liven up my text with some images and to put up with my inattention to schedule.

I'll make a public pledge here to do better the next time -- if anyone is daring enough to give me a next time.

Bike in the Commuting Fold


Okay, a little bragging: I biked in to work last week for Bike to Work week. I actually bike the last leg of my commute a lot of days now on a folding bike (pictured), but this was the whole enchilada on my new 24-speed road bike. By Google Maps it's 23 miles each way, but with my accidental deviations the morning was definitely more like 24. I did meet my family for dinner part way home, so the last few miles were sans backpack.

I've always enjoyed a bicycle but am very sporadic about using one. My previous distance record for one day was 42 miles but that was for a charity fundraiser, was nearly dead flat (South Jersey; though we did cross the Ben Franklin Bridge first which is a climb) and my mitochondrial DNA donor insisted on regular practice runs for several weeks beforehand. This ride lacked that level of preparation, so the next day I was a bit saddlesore -- though thankfully none of my joints were complaining.

My more typical commute now is a 3-speed folding bike for the 4+ miles from North Station to Infinity. Infinity's location finally pushed me last year to contemplate this option, as it really is awkward from North Station. The choices by transit are: the EZRide bus and then a 10+ minute walk (through a pleasant neighborhood), walking or Orange/Green Line to the Red Line to Central and then walking or catching a shuttle provided by the landlord. No matter how you slice it, it is a bunch of connections and timing. Plus, the Red Line grows more unreliable and slow every year.

So last year I picked up a folding bike on Craigslist for $120. I had contemplated a bunch in different price ranges but ended up with this bike. I asked a lot of folks with bikes about theirs (there's two more in my railroad car tonight). Most folder owners are quite willing to answer intelligent questions about their gear, a practice which I try to uphold. This one was a good trial, though I am now monitoring Craigslist again looking for an upgrade -- more gears & bigger wheels please!

The T is at best lukewarm to the biking community. Most buses now have bike racks, but I haven't used them. A few stations now have bike cages which require special activation of your T-pass (or something similar), but unfortunately North Station has only an unprotected set of bike racks. The idea of leaving my machine exposed to the elements, vandals & thieves isn't pleasant -- especially since my college bike first rusted out a chain and later disappeared, though it's possible I just forgot where I parked it. Some conductors are quite nice, but one barks at me every time I yank the bike on unfolded -- one minor issue with mine is a balky knurled nut that locks the frame. One colleague was refused entry to the subway with hers folded, which indicates not everyone at the T understands the long-time policy.

The folding bikes do have some downsides. Mine has very small (12" I think) wheels, which makes potholes and root-lifted sidewalks quite scary. For my commute I really could use a couple of more gears, for the occasional hill and to get speed on flat ground. But overall it works.

For North Station to Infinity there are two obvious routes. The fastest is along the Cambridge side of the river, but it is also narrower and rougher. The Boston side is a touch slower, but is shadier and has a higher density of dog walkers. Both offer great views of sculls and sailboats on the Charles. Both are heavily used, which is generally manageable but you are sometimes left pondering how a single runner can obstruct the path far more efficiently than a pack of four leashed dogs. Some stretches are badly lifted by roots -- and one stretch I don't routinely use upstream of Harvard Square is downright scary, with the path seriously undermined by erosion. In any case, I end up with a very predictable travel time (it's the variance which kills you when you need to catch a train!). Plus, very little interaction with Boston traffic! However, due to 2 years of a paper route I am averse to darkness and bad weather, so I all too often find excuses to skip the bike and retire it altogether during Standard Time. Plus, I do end up standing on the train more (as there are few places to store a bike -- and the seats nearby are usually taken) and so can't read or work on the commute sometimes.


The other great advantage of a bike is greater range off the commuter rail and T. I've used it for seminars and errands and meeting folks for lunch, as well as to meet family for dinner or run errands at home.

A major thought that hits me: why didn't I think about this sooner? During any of my Millennium or Codon days it would have made sense, especially in our old house where I had an annoyingly short drive to the station (where sometimes parking was not to be had). Especially when I was at 640, the folding bike would have been great. But, I never seriously considered the idea. What an opportunity missed!

Tonight I'm going to look at an upgrade on my folding bike. A major investment in a better commute.

Saturday, May 22, 2010

Just say no to programming primitivism

A consistently reappearing thread in any bioinformatics discussion space is "What programming language should I learn/teach?". As one might expect, just about every language under the sun has some proponents (still waiting -- hopefully forever -- for a BioCobol fan), but the responses tend to cluster into a few camps & someone could probably carefully classify the arguments for each language into a small number of bins. I'm not going to do that, but each time I see these I do try to evaluate my own muddled opinions in this space. I've been debating writing one long post on some of the recent traffic, but the areas I think worth commenting on are distinct enough that they can't really fit well into a single body. So another of my erratic & indeterminate series of thematic posts.

One viewpoint I strongly disagree with was stated in one thread on SEQAnswers.
Learn C and bash and the most basic stuff first. LEARN vi as your IDE and your word processor and your only way of knowing how to enter text. Understand how to log into a machine with the most basic of linux available and to actually do something functional to bring it back to life. There will be times when there is no python, no jvm, no eclipse. If you cannot function in such an environment then you are shooting yourself in the foot.


Yes, there is something to be admired about being able to be dropped in the wilderness with nothing but a pocketknife and emerging alive. But the reality is that this is a very rare occurrence. Similarly, it is a neat trick to be able to work in a completely bare bones computing environment -- but few will ever face this. Nearly twenty years in the business, and I have yet to encounter such a situation.

The cost of such an attitude is what worries me. First, the demands of such a primitivist approach to programming will drive a lot of people out very early. That may appeal to some people, but not me. I would like to see as many people as possible get a taste of programming. In order to do that, you need to focus on stripping away the impediments and roadblocks which will trip up a newcomer. So from this viewpoint, a good IDE is not only desirable but near essential. Having to fire up a debugger and learn some terse syntax for exploring your code's behavior is far more daunting than working in a good graphical IDE. Similarly, the sort of down-to-the-compute-guts programming that C enables is very undesirable; you want a newcomer to be able to focus on program design and not tracking down memory leaks. Also, I believe Object Oriented Programming should be learned early, perhaps from the very beginning. That's easily the subject of an entire post. Finally, I strongly believe the first language learned should have powerful inherent support for advanced collection types such as associative arrays (aka hashtables or dictionaries).
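As a tiny example of the built-in collection support I mean (Scala here, since that's what I've been playing with lately, though Python or even Perl reads much the same): counting k-mers in a sequence with a dictionary, no memory management or boilerplate required.

// Count k-mers with a built-in immutable Map -- the kind of one-liner a newcomer can focus on.
def kmerCounts(seq: String, k: Int): Map[String, Int] =
  seq.sliding(k).toSeq.groupBy(identity).map { case (kmer, hits) => (kmer, hits.size) }

// e.g. kmerCounts("GATTACA", 2) -> Map(GA -> 1, AT -> 1, TT -> 1, TA -> 1, AC -> 1, CA -> 1)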

Once you have passed those tests, then I get much less passionate. I increasingly believe Perl should only be taught as a handy text mangler and not a language in which to develop large systems -- but I still break those rules daily (and will probably use Perl as a core piece of my teaching this summer). Python is generally what I recommend to others -- I'm simply not comfortable enough in it to teach it. I'm liking Scala, but should it be a first language? I'm not quite ready to make that leap. Java or C#? Not bad choices either. R? Another one I don't really feel comfortable teaching (though there are some textbooks to help me get past that discomfort).

Thursday, May 20, 2010

The New Genome on the Block

The world is abuzz with the announcement by Craig Venter and colleagues that they have successfully booted up a synthetic bacterial genome.

I need to really read the paper but I have skimmed it and spotted a few things. For example, this is a really impressive feat of gene synthesis but even so a mutation slipped in which went unnoticed until one version was tested. Even bugs need debuggers!

It is also a small but important step. Describing it as a man-made organism is in some ways true and in some ways not. In particular, any die-hard vitalists (which nobody will admit to being, though there are clearly a huge number of health food products sold using vitalist claims) will point out that there was never a time when there wasn't a living cell -- the new genome was started up within an old one.

It is fun to speculate about possible next directions. For example, they booted a new Mycoplasma genome within another Mycoplasma cell -- different species, but very similar to the host. Clearly one research direction will be to try to create increasingly different genomes. A related one is to try to bolt on entire new subsystems. A Japanese group tried fusing B.subtilis (a heavily studied soil bug) with a cyanobacterium to see if they could build a hybrid which retained the photosynthetic capabilities of the cyano; alas they got only sickly hybrids that didn't do much of interest. Could you add in photosynthesis to the new bug? Or a bacterial flagellum? Or some other really complex more-than-just-coupled-enzymes subsystem?

But as someone with a computer background -- and someone who has thought off-and-on about this topic since graduate school (mostly off, to be honest) -- to me a really interesting demonstration would be a dual-boot genome. Again, in this case the two bacterial species were very similar, so their major operational signals are the same. Consider two of the most important systems which do vary widely from bacterial clade to clade (the genetic code is, of course, near universal -- though Mycoplasma do have an idiosyncratic variation on the code): promoters and ribosome binding sites. Could you build the second genome to use a completely incompatible set of one of these (later both) and successfully boot it? Clearly what you would need is for the host genome -- or an auxiliary plasmid -- to supply the necessary factors. Probably the easier one would be to have the synthetic genome use the ribosomal signals of the host but a different promoter scheme. In theory just expressing the sigma factor for those promoters would be sufficient -- but would it be? To me this would be a fascinating exercise!

Now, I did claim dual-boot. A true dual-boot system could use both. That is much trickier, but particularly on the transcriptional side it is somewhat plausible -- just arrange the two promoters in tandem. Ribosome binding sites would need to be hybrids, which isn't as striking a change.

There are even more outlandish proposals floating out there -- synthetic bugs with very different genetic codes (perhaps even non-triplet codes) or the ultimate synthetic beast -- one with the reverse handedness to all its chiral molecules. Those are clearly a long ways off, but today's announcement is another step in these directions.

Tuesday, May 18, 2010

Journey to Atlantis


I've only seen it a few times, but the sight of the iconic cavernous building always makes my heart race. But this time even more so, as it meant the end of a race against the clock. We had reached our position for the big event with just minutes to go.

Attempting to be speedy but efficient, I assembled the fancy digital SLR rig atop my tripod. Except it wouldn't work. Removing the tele-extender restored autofocus (in retrospect, probably applying another newton of force would have too) and then I got the camera in the wrong shutter mode -- timer instead of multi-fire. A cheer rises from the crowd and the dark smudge to the left of the building emits a shape trailing a brilliant blaze of red-orange, a color which no photograph seems to capture remotely well. Below that is a growing, intricately braided cloud of smoke. I don't get my camera remotely under control until it is tilted about 45 degrees, and only then do I realize that in my fumbling I had it at minimum zoom! A photographic opportunity dreamed about for nearly a quarter century almost utterly botched! The crowd's sound builds again as the rumble of the engines finally reaches us.

But, I was there. We all were -- TNG will remember it for his entire lifetime. Atlantis punched right through a cloud (don't believe the reports of a cloudless sky!) and soared. All too quickly it was out of sight, leaving for many minutes the detailed smoke tail.



Our plans had been too optimistic, trying to squeeze the trip in with minimal disruption of other schedules, plus a final hesitancy to pull the trigger on plane tickets. What seemed like a plan with a little room for delay was undone by a rental car company that apparently stocks the break room with Protoslo and a traffic jam stretching from Orlando International Airport to the Cape.

I grew up with Apollo. I remember the last moon launches and moon walks. I do not remember the early manned Apollo missions, though I was technically around for all of them. Indeed, it is a great disappointment to me that none of those who could remember can remember if I was toddling in front of the TV when Neil Armstrong made his first steps. I devoured all the books in the school library and then the public library on space and watched many an early shuttle launch and landing (we had a school assembly for the 1st landing!). I remember precisely what I was doing when the news of Challenger's loss came & again with Columbia. I sometimes dreamed of being an astronaut, though never enough to force my academic path in that direction -- but I certainly spent more than a few times in bed before going to sleep as a kid on my back with my knees bent, imagining what liftoff must feel like (I still sometimes close my eyes on airplane takeoffs to try to return to those youthful fantasies). But I had never seen a launch. There are the near-mythical VIP tickets my family once had for a payload my father worked on, but that would have launched in May 1986. After the Challenger-imposed hiatus, somehow we didn't get the tickets again.

When I announced to some of my co-workers that I might try to go for this launch, I got a lot of support. That camera was a very generous loan from one colleague. But the most interesting reaction was the number of individuals who were shocked that the shuttle program was coming to an end. "What do you mean the third to last flight?". And it hit me -- for many of these folks, the shuttle IS the manned space program simply because it is older than they are.

I have a complex love for the shuttle program. It is one of the most amazing devices ever realized from human imagination. It is capable of so much and has contributed so many wonderful images. But it is also a mishmash of design requirements, resulting in a tool not optimal for any task and a design which has proven deadly twice and nearly so on other occasions. The shuttle also sucked so much post-Apollo, post-Vietnam funding that could have gone into some spectacular unmanned missions.

But now I have finally seen a launch. It is spectacular, and I am hungry for more. Alas, I wasted my youth in not making plans and now have that laundry list of responsibilities which come with adulthood. We were lucky that the launch occurred precisely on schedule; too few have stuck to their assigned time. I probably won't be able to do better than a giant screen TV for the last two -- you do get a better view, but it just isn't the same. But you can bet I'll be cross-referencing future vacations against unmanned launch schedules.

Of course, if anyone has some VIP tickets they aren't using, I won't claim I would resist temptation...

Thursday, May 06, 2010

Sales

Two weeks ago I participated in a roundtable sponsored by the Massachusetts Technology Leadership Council (MassTLC) titled "R&D IT Best Practices for Growing Small/Mid-Sized Biopharmas". It was a nice intimate gathering -- about a half dozen panelists, a few dozen audience members and NO SLIDES! A chance for some real discussion -- moderator Joseph Cerro would throw topics out or take them from the audience and the panel would address them as they saw fit. Nice and free-flowing.

I expected this event to be attended by a lot of biotech executives, and while there were more than a few, a large fraction of the audience was actually in software sales. One of them expressed their interest in the topic quite succinctly: "Why aren't you guys buying from us?" In his view, his company offered excellent products that met his potential customers' needs, yet too rarely did they buy.

One aspect of course -- or perhaps THE aspect -- is that we don't have infinite budgets. In my current role, I can spend money on a variety of things -- I can buy software, order consulting or have a CRO generate data for me. I'll confess: my tastes tend to run towards data generation, so I lean towards the latter.

One reality which anyone trying to sell me software or databases must face is that it is guaranteed that their software (a) solves some of my problems (b) fails to solve some others and (c) overlaps with other solutions I have already or am strongly considering. When I brought this up one sales guy accused us of not having an overall software vision. That's a tricky subject -- part of me agreed and part wanted to yell "them's fighting words!". I have often had software visions; I have also often given up on them in despair. The truth is that any grand vision would require far too much custom work to be practical or to ever get done. Grand visions don't go well with compromises, and any off-the-shelf solution will involve compromises.

But, one does try to have an overall plan for how things will fit together. Again, one challenge is figuring out what constellation of imperfect yet overlapping pieces to assemble. At a more detailed level, it is deciding which desired features are critical and which are dispensable. Plus, generally you aren't starting with a tabula rasa -- there is already a set of tools in place, or tools too near-and-dear to someone important to be ignored.

I'm sure trying to sell to me is exasperating. I want detailed technical information on a moment's notice. I'm routinely throwing out projects or configurations to be priced, with few if any actually going forward. At Codon I played exactly that part of the sales game; it was a lot of work and very frustrating to see so little ever come of it. I'm also a pain on software products and databases in insisting on hands-on trials. One database vendor never understood this, which is why I won't bother ever talking to that company again. Perhaps my only virtue is that I attempt to be unfailingly polite through the whole process. I suppose that counts for something.

Saturday, May 01, 2010

Asymptotically Approaching A Grok of Scala

When learning a new language, it is tempting to fall back on the patterns of a previous language. This isn't always a bad thing but is worth being aware of. For example, when I did a little bit of Python at Codon I realized that compared to someone else who had just learned Python, I tended to use dictionaries in my code quite frequently. That's a pattern coming from Perl. This was also reflected in my C# code, except there (to my glee!) I could use typesafe dictionaries. My code at Codon, in comparison with some other programmers, tended to be very Dictionary-rich (and they were always typesafe!) That's not saying my style was better, just distinctive and influenced by prior experience.

Now in some language transitions, there's very little of this -- because the new language is too different. SQL is an obvious example -- it's just not a procedural language and so I can't easily identify any of my SQL programming patterns which are influenced by prior languages.

But, if a programming language not only supports but encourages a different style of programming, it is useful to recognize this bias and try to go outside it, and when you have a breakthrough it is wonderful. For me, to intuitively understand a subject is to "grok" it; Heinlein's invention is too rarely used.

I had that moment tonight with Scala. The assignment was to read genotype data out of a bunch of Affymetrix 6.0 CHP files from a vendor. Now, Affy makes available an SDK for this -- but it is a frustrating one. The C++ example code is all but a printf statement away from converting CHP to tab-delimited.

But I decided to make this a Scala moment. There's a Java SDK, but it is very spartanly documented -- there's really no documentation beyond what individual methods and classes do -- no attempt to help you grok the overall scheme of things.

Worse, the class design is inconsistent. One case: the example Java code parses an expression file and one key piece of information to get out is the number of probesets in the file, which is via the getHeader() method. Unfortunately, it turns out getHeader is defined in the specific class and not the base class, so code working on genotyping information needs to use a different approach. Personally, I'm already annoyed because I'd rather have an enumerator to step over the probesets rather than getting a count and asking for each one in turn -- but that is a point of style.

Okay, problem solved. The main part of the code reads the data into a big HashMap (the dictionary-type generic class in Scala) -- that pattern again! Now I want to write the data out -- listing each genotype in a separate column, with the 0th column containing the probeset name. So, I need to create a row of output values and then write it as a line to my file.
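For context, since the snippets below index into it, here is roughly what that nested map looks like -- my own declaration, not anything dictated by the Affy SDK, and I'm assuming the raw genotype call codes are bytes (my reading of the SDK; treat the value type as an assumption).

// probeset name -> (sample name -> raw genotype call code from the CHP file)
import scala.collection.mutable.HashMap
val genotypes = new HashMap[String, HashMap[String, Byte]]()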

Version 1 is the straight old style, what I used in Perl/C# and pretty much everything before them -- I initialize a Queue to hold the values I want to write on one line. Here, out is a Java BufferedWriter which is writing to a file. The one significant Scala-ism is the code to write the line -- the reduceLeft call is the equivalent here of a Perl join command to create the tab-delimited line.

// Version 1: accumulate the values in a mutable Queue, then join them with tabs
val q = new Queue[String]()
q.enqueue(probesetName)
for (sample <- sampleNames)
  q.enqueue(ProbeSetMultiDataGenotypeData.genotypeCallToString(genotypes(probesetName)(sample)))
out.write(String.format("%s\n", q.reduceLeft(_ + "\t" + _)))


Now, on looking at this I had working code, which should have been the time to stop. But could I take it to a more Scala-ish form? That's a challenge, which I'm happy to find I succeeded at.

out.write(String.format("%s\t%s\n", probesetName,
  (for (sample <- sampleNames)
    yield ProbeSetMultiDataGenotypeData.genotypeCallToString(genotypes(probesetName)(sample)))
    .reduceLeft(_ + "\t" + _)))

This version eliminates the queue -- the for/yield comprehension simply generates a sequence of values which the reduceLeft trick consolidates. I had cheated before by loading the probeset name onto the queue, so here I need to tweak the String.format arguments to get it in.

Now, the question is -- is this better? One metric might be readability, and I'm not sure which I find more readable. The first is a style I'm used to reading, and I tend to recognize the pattern -- or do I? If I revisit that code six months from now, will I ask "What is this queue for?" The second is terser -- but is it a good terser? Perhaps if I start using the pattern repeatedly it will become second nature to read.

Another metric would be performance -- tedious to measure, but my guess is that since I am following the form the language suggests, the compiler is likely to optimize it better.
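If I ever get around to measuring, even a crude harness would settle it. A minimal sketch -- the two write* wrappers are hypothetical stand-ins for the versions above, not code I've actually written:

// crude wall-clock timing of a block; run each version several times and keep the best
def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime
  val result = block
  printf("%s: %.1f ms%n", label, (System.nanoTime - start) / 1e6)
  result
}

// time("queue version") { writeQueueVersion() }
// time("map version")   { writeMapVersion() }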

Ah, but after writing this entry I saw I could do better -- definitely cleaner. Instead of the explicit loop, I'll use the map function, which takes a sequence of values and applies a transformation to each. So I still have a long way to go before I can claim to grok Scala! I could blame this on being diverted away from Scala for a month-plus (I'd actually written some code like the below before, now that I think about it).

out.write(String.format("%s\t%s\n",probesetName,
sampleNames.map(sample=>
ProbeSetMultiDataGenotypeData.genotypeCallToString(genotypes(probesetName)(sample)))

.reduceLeft(_ + "\t" + _)))

It is worth noting that this final style is actually largely available in Perl, which has a map function and some other stuff to support this. I never really tried to work that way and personally I foresee all sorts of bugaboos from a lack of type safety. But I could have worked this way in the past.

One final note: I'm getting to like the way Scala does a lot of compile-time type checking without my needing to clutter the code with type annotations. C# is particularly bad about most type annotations being written twice, but even after cleaning that up, Scala goes one further and infers many types. "sample" in both examples is strictly a String, but I don't have to declare that -- so the code is stripped to nearly the bare essentials, yet I still get a bit of proofreading.
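To make that concrete, here is a tiny contrived sketch (illustrative names only, nothing from the CHP code) of the inference at work:

// the compiler infers every type here: names is List[String], lengths is List[Int]
val names = List("BRCA1", "TP53", "KRAS")
val lengths = names.map(name => name.length)

// ...yet it still proofreads: uncommenting the next line is a compile error,
// because the compiler knows names.head is a String, not an Int
// val bad: Int = names.head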

Thursday, April 29, 2010

Application of Second Generation Sequencing to Cancer Genomics

A review by me titled "Application of Second Generation Sequencing to Cancer Genomics" is now available on the Advance Access section of Briefings in Bioinformatics. You'll need a subscription to read it.

I got a little obsessive about making the paper comprehensive. While it does focus on using second generation sequencing for mutation, rearrangement and copy number aberration detection (explicitly ruling RNA-Seq and epigenomics out of scope), it attempts to touch on every paper in the field up to March 1st. To my chagrin, I discovered just after submitting the final revision that I had omitted one paper. I was able to slide it into the final proof, but not without making a small error. There's one other paper I might have mentioned, which actually used whole genome amplification upstream of second generation sequencing on a human sample -- though it's not a very good paper, the sequencing coverage is horrid, and it wasn't about cancer. In any case, it won't shock me completely -- though it will shock me quite a bit -- if someone finds a paper in that timeframe that I missed. So don't gloat too much if you find one -- but please do post here if you do!

Of course, any constructive criticism is welcome. There are bits I would be tempted to rewrite if I went through the exercise again and the part on predicting the functional implications of mutations could easily be blown out into a review of its own. I don't have time to commit to that, but if anyone wants to draft one I'd help shepherd it at Briefings. I'm actually on the Editorial Board there and this review erases my long-term guilt over being on the masthead for a number of years without actually contributing anything.

As I state in the intro, in a field such as this a printed review is doomed to become incomplete very quickly. I'm actually a bit surprised that there has been only one major cancer genomics paper between my cutoff and the preprint emerging -- the breast cancer quartet paper from Wash U. I fully expect many more papers to appear before the physical issue shows up (probably in the fall), and certainly a year from now much will have happened. But it is useful to mark off the state of a field at a certain time. In some fields it is common to publish annual or semi-annual reviews which cover all the major events since the last review; perhaps I should start logging papers with that sort of concept in mind.

One last note: now I can read "the competition". Seriously, another review on the subject by Elaine Mardis and Rick Wilson came out around the time I had my first crude set of paragraphs (it would be a stretch to grant it the title of draft). At that time, I had two small targeted projects in process and they had already published two leukemia genome sequences. It was tempting to read it, but I feared I would be overly influenced by it, or worse, would be paranoid about plagiarizing bits, so I decided not to read it until my review was published.

Wednesday, April 14, 2010

The value of cancer genomics

I recently got around to reading the "Human Genome at 10" issue of Nature. One feature, on facing pages, is a pair of opinion pieces on cancer genomics by Robert Weinberg and Todd Golub, with Weinberg giving a very negative review and Golub a positive outlook.

Weinberg is no ordinary critic of cancer genomics; to say he wrote the book on cancer biology is not to engage in hyperbole but rather to acknowledge a truth -- at work we're actually reviewing the field using his textbook. He made -- and continues to make -- key conceptual advances in cancer biology. So his comments should be considered carefully.

One of Weinberg's concerns is that the ongoing pouring of funds into cancer genomics is starving other areas of cancer research and driving talented researchers from the field. Furthermore, he argues that the yields from cancer genomics to date have been paltry.

I can't agree with him on this score. He cites a few examples, but is being very stingy. I'm pretty sure the concept of lineage addiction, in which a cancer is dependent on overexpression of a wild-type transcription factor governing the normal tissue from which the cancer is derived, arose from several genomics studies. Another great example is the molecular subdivision of diffuse large B-cell lymphomas; each of the subsets (at least 3 peeled off so far) appears to have very different molecular characteristics.

On a broader scale, the key contribution of cancer genomics is, and will continue to be, to provide a concrete test of the theories generated by Weinberg and others using their experimental systems. For example, Weinberg worked extensively on the EGFR-Ras-MAP kinase pathway. If we look in many cancers, this pathway is activated. For example, in non-small cell lung cancer (NSCLC), about half of all tumors are activated by KRAS mutations; in pancreatic cancer this may be near 90%. Other members of the pathway can be activated by mutation as well, but not nearly as frequently. In NSCLC, EGFR accounts for another 20% or so, but BRAF and MAP kinase mutations are rare. Why? Well, that's a new conceptual puzzle. Furthermore, EGFR-Ras-MAPK pathway mutations don't seem to explain all cancers. Indeed, some potent oncogenes in experimental systems are rarely if ever seen driving patient cancers.

One example Weinberg mentions as part of the small haul is IDH1. This is a great story uncovered twice by cancer genomics and is still unfolding. IDH1 is part of the Krebs cycle, a key biochemical pathway unleashed on any biology or biochem freshman. Genomics studies in glioblastoma and AML (a leukemia) have uncovered mutations in IDH1; extensive searches to check in other tumors have come up negative (except a report in thyroid cancer). Why the specificity? An unresolved mystery. The really interesting part of the story is that it appears the IDH1 mutations alter the balance of metabolites generated by the enzyme. Unusual metabolites favoring cancer development -- this is a fascinating story, uncovered by genomics.

Another great cancer genomics story was the identification last summer of the causative mutation for granulosa cell tumor (GCT), a rare type of ovarian cancer. This was found by an mRNA sequencing approach.

As I mentioned before, DLBCL had been previously subdivided by expression profiling into distinct groups, which have different outcomes with standard chemotherapy and different underlying molecular mechanisms. The root of one of those mechanisms was recently identified by sequencing, showing a mutation in a chromatin structure regulation protein, a class of oncogenic mutation only recently found by non-genomic means.

Another recent example: using copy number microarrays (which provide much less information but more cheaply), microdeletions targeting cell polarity genes were identified.

Indeed, I would generally argue that cancer genomics is rapidly recapitulating most of what we have learned in the previous three decades of study on what genes can activate tumors by gain or loss of function. This doesn't replace many other things which the classical approaches have discovered, but does underscore the power of genomics in this setting. And, of course, not simply recapitulating but going beyond to identify new oncogenic players and enumerate the roles of all the current suspects.

My own belief is that Weinberg (and others with similar views) are trying to strangle the genomics effort before it can really spread its wings -- I don't mean anything sinister by that, just that they are attempting to terminate it prematurely. Some cancer genome efforts indeed have little to show -- but very few have been done on a really large scale. With costs plummeting for data acquisition (though perhaps not for data analysis), it will be possible to sequence many, many cancer genomes, and I am confident important discoveries will come in regularly.

What sort of discoveries and studies? There are hundreds of recognized cancers, some very rare. Even the rare ones will have important stories to tell us about human cell biology; they should definitely be extensively sequenced. We also shouldn't be strict speciesists; a number of cancers are hereditary in certain dog breeds and will also have valuable stories to tell. In common tumors, it is pretty clear that many of these definitions are really syndromes; there is not one lung cancer or even one NSCLC, but many. Each is defined by a different set of key genomic alterations. Enumerating all of those will put the various cancer theories to an acid test; the samples we cannot explain will be new challenges. Current projects targeting major cancers are aiming to discover all mutations with 10% or greater frequency. I would argue that is only a good start; 5% of a major cancer such as lung cancer is still tens of thousands of cases worldwide.

Cancer is also not a static disease; as in the recent WashU paper it will be critical to compare tumors with metastases to identify the changes which drive this process. Metastatic lesions tend to be what kills patients, so this is of high importance. Lesions also change with therapy, with a pressing need to understand those changes so we can devise therapeutics to address them.

All in all, I can easily envision the value of sequencing tens of thousands of samples or even more. Of course, this is what those skeptical of cancer genomics dread; even with the dropping cost of sequencing this will still require a lot of money and resources. Furthermore, really proving which mutations are cancer drivers and which are bystanders -- and what exactly those driver mutations are doing (particularly in genes which we can intuit little about from their sequence) -- will be an enormous endeavour. Cancer genomics will be framing many key problems for the next decade or two of cancer biology.

Of course, mutational and epigenomic information will not tell the entire story of cancer; there are many genes playing important roles in cancer-relevant pathways that never seem to be hit by mutations. Why not is an excellent unanswered question, as is why certain tissue types are more sensitive to the inhibition of specific universal proteins. For example, germline mutations in BRCA1 lead to higher risk of breast, ovarian and pancreatic cancer (with much stronger breast and ovarian risk increases), yet BRCA1 is part of a central DNA repair complex and not some female-specific system. Really fleshing out cancer pathways will take large-scale interaction and functional screens -- and Weinberg specifically notes his dread of such a "cancer cell wiring" project. Ironically, such a project is published in the same issue: the results of a genome-wide RNAi-plus-imaging screen for genes relevant to cell division.

Which gets back to the root problem: if we view cancer funding as more-or-less a zero-sum game, how much should we spend on cancer genomics and how much on investigator-focused functional efforts? That's not an easy question and I have no easy answer. It doesn't help that I don't even know the sums involved, since I am not subject to the whims of grants (I have different capricious forces shaping my career!). But clearly I would favor a sizable fraction (easily a double-digit percentage) of cancer funding going to genomics projects.

One of the professors in my graduate department, who was actually no fan of genomics, said that a well-designed genetics experiment enables the cell to tell you what is important. Reading cancer genomes is precisely that, enabling us to discover what is truly important to cancer biology.

Thursday, April 08, 2010

Version Control Failure

Well, it was inevitable. A huge (and expensive) case of confused human genome versions.

While the human genome is 10 years old, that was hardly the final word. Lots of mopping up remained on the original project, and periodically new versions would be released. The 2006 version, known as hg18, became very popular and is the default used by many sites. A later version (hg19) came out in early 2009 and is favored by other services, such as NCBI. UCSC supports all of them, but appears to have a richer set of annotation tracks for hg18. It isn't over yet: not only are there still gaps in the assembly (not to mention the centromeric badlands that are hardly represented), but with further investigation of structural variation across human populations it is likely that the reference will continue to evolve.

This is fraught with danger! Pairing an annotation track from one genome version with coordinates from another produces very confusing results. Curiously, while the header line for the popular BED/BEDGRAPH formats has a number of optional fields, a genome version tag is not one of them. Software problems are one thing; doing experiments based on mismatched versions is another.

What came out in GenomeWeb's In Sequence (subscription required) is that the ABRF (a professional league of core facilities) had decided to study sequence capture methods and had chosen to test Nimblegen's & febit's array capture methods along with Agilent's in solution capture; various other technologies either weren't available or weren't quite up to their specs. I do wish ABRF had tested Agilent in both in solution and on array formats, as this would have been an interesting comparison.

What went south is that the design specification uploaded to Agilent used hg19 coordinates, but Agilent's design system (until a few days ago) used hg18. So the wrong array was built and used to make the wrong in-solution probes. So, when ABRF aligned the data, it was off. How much off depends on where you are on the chromosome: the farther down the chromosome, the more likely a region is to be off by a lot. ABRF apparently got a good amount of overlap, but there was an induced error.

I haven't yet found the actual study; I'm guessing it hasn't been released. If the GenomeWeb article is accurate, then it is in my opinion not kosher to grade the Agilent data according to the original experimental plan, since this wasn't followed. Either the Agilent data should be evaluated consistent to the actual design in its entirety OR the comparison of the platforms should be restricted to the regions that actually overlap between the actual Agilent design and the intended design.

In any case, I would like to put in a plug here that ABRF deposit the data in the Short Read Archive. Too few second generation sequencing datasets are ending up there, and targeted sequencing datasets would be particularly valuable from my viewpoint. Granted, a major issue is around confidentiality and donors' consent for the use of their DNA, which must be strictly observed. Personally, I believe that if you don't deposit the dataset your paper should state this -- in other words, either way you must explicitly consider depositing, and if you don't, explain why it wasn't possible. The ideal of all data going into public archives was never quite perfect in the Sanger sequencing days, but we've slid far from that in the second generation world.

I had an extreme panic attack one day due to similar circumstances -- taking a quick first look at a gene of extreme interest in our first capture dataset, it looked like the heavily captured region was shifted relative to my target gene -- and that my design had the same shift. Luckily, in my case it turned out to be all software -- I had mixed up versions, but not in the actual design, so that part of the capture experiment was fine (I wish I could say the same about the results, but I can't really talk about them). I now make sure the genome version is part of the filename of all my BED/BEDGRAPH files to reduce the confusion, and I manually BLAT some key sequences to try to catch any mismatch. While those are useful practices, I strongly believe that there should be a header tag which (if present) is checked on upload.
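As a small illustration of the filename habit (purely my own convention, not any standard -- the pattern and file names below are made up), here is the kind of guard I have in mind: refuse to pair two files whose names don't carry the same build token.

// extract an hg-style build token from a filename and refuse to combine
// files that disagree; crude, but it catches the panic-attack scenario
val buildPattern = "(hg\\d+)".r

def buildOf(fileName: String): Option[String] =
  buildPattern.findFirstIn(fileName.toLowerCase)

def checkSameBuild(fileA: String, fileB: String) {
  (buildOf(fileA), buildOf(fileB)) match {
    case (Some(a), Some(b)) if a == b => ()  // builds agree, proceed
    case (a, b) =>
      throw new IllegalArgumentException("Genome build mismatch or missing tag: " + a + " vs " + b)
  }
}

checkSameBuild("targets_hg18.bed", "coverage_hg18.bedgraph")    // passes quietly
// checkSameBuild("targets_hg19.bed", "coverage_hg18.bedgraph") // would throw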

Sunday, March 28, 2010

Ridiculous Claims

An item last week in GenomeWeb covered a new analysis by Robin Cook-Deegan and colleagues of the Myriad BRCA patents. One bit in particular in the article has stuck in my craw & I need to spit it out.

This finding, involving an expressed sequence tag application filed by the NIH, was published in 1992 and the NIH abandoned its application two years later. Based on USPTO examiner James Martinell's estimation at the time, a full examination of all the oligonucleotide claims in the EST patent would have taken until 2035 "because of the computational time required to search for matches in over 700,000 15-mers claimed."

According to Kepler et al., this comprises "roughly half the number of molecules covered by claim 5 of Myriad's '282 patent."

While improvements in bioinformatics and computer hardware have made sequence comparisons much easier than they were in the early 1990s, the study authors arrive at no conclusions about why the USPTO granted Myriad claim 5 in patent '282 and not NIH's EST patent.


The claim by the USPTO examiner is bizarre, to say the least: 55 years to check 700,000 15mers for occurrence in other sequences. This works out to testing about 40 oligos per day. What algorithm were they using??

To look at it another way, if you use 2-bit encoding for each base, then the set of all 15mers can be described by 2^30 different bitstrings -- potentially storable in memory on the 32-bit machines available at the time (which can, of course, address 2^32 bytes of memory). Furthermore, this is a trivially splittable algorithm -- you can break the job into 2^N different jobs by having each run look only at sequences with a given bit prefix of length N. When I started as a grad student in fall 1991, one of my first projects involved a similar trivial partitioning of a large run -- each slice was its own shell script which was forked onto a machine.
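As a rough sketch of the brute force I have in mind (my own toy code written today, not anything the examiner or NIH would have run): encode each 15mer in 30 bits, mark every 15mer present in the database sequences in one big bit array, and then each of the 700,000 claimed oligos becomes a single lookup. The prefix split is just a filter on the encoded value.

// 2-bit encoding: A=0, C=1, G=2, T=3; a 15mer fits in 30 bits, i.e. an Int
def encode(kmer: String): Int =
  kmer.foldLeft(0)((acc, base) => (acc << 2) | "ACGT".indexOf(base))

// mark every 15mer occurring in a database sequence
def markKmers(seq: String, seen: java.util.BitSet) {
  for (i <- 0 to seq.length - 15) {
    val window = seq.substring(i, i + 15)
    if (window.forall(base => "ACGT".indexOf(base) >= 0))   // skip Ns and such
      seen.set(encode(window))
  }
}

val seen = new java.util.BitSet(1 << 30)         // 2^30 bits = 128 MB
markKmers("ACGTACGTACGTACGTACGTTTTT", seen)      // in reality, loop over every database entry

// checking one claimed oligo is now a constant-time lookup
val present = seen.get(encode("ACGTACGTACGTACG"))

// and the job splits trivially: worker k (of 2^4) handles only those 15mers
// whose top 4 bits of the encoding equal k
def belongsToWorker(kmer: String, k: Int): Boolean = (encode(kmer) >>> 26) == k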

Furthermore, anyone claiming that a job will take 50+ years really needs to make some reasonable assumptions about growth in compute power -- particularly since 64-bit machines were becoming available around that time (e.g. DEC Alpha). Sure, it's dangerous to extrapolate out 50 years (after all, progress in Moore's law from shrinking transistors will hit a wall at one atom per transistor), but this was a ridiculous bit of thinking.

Tuesday, March 23, 2010

What should freshman biology cover?

I've spent some time the last few weeks trying to remember what I learned in freshman biology. Partly this has been triggered by planning for my summer intern (now that I have a specific person lined up for that slot) -- not because of any perceived deficiencies but simply being reminded of the enormous breadth of the biological sciences. It's also no knock on my coursework -- I had a great freshman biology professor (who team-taught with an equally skilled instructor). It was a bit bittersweet to see his retirement announcement last year in the alumni newsletter; he certainly has earned a break but future Blue Hens will have to hope for a very able replacement.

It is my general contention that biology is very different from the other major sciences. My freshman-level physics class (which I couldn't schedule until my senior year) had a syllabus which essentially ended at the beginning of the 1900s. Again, this is no knock on the course or its wonderful professor; it's just that kinetics and electromagnetics on a macro scale were pretty much worked out by then. We had lots of supplementary material on more modern topics such as gravity assists from planets, fixing Hubble's mirror problem and quantum topics, but that was all gravy.

Similarly, my freshman chemistry course (again, quite good) covered science up to about World War II. My sophomore organic chemistry course pushed a little further in the century. It's not that these are backwards fields but quite the opposite -- enough had been learned by those time points to fill two semesters of introductory material.

But, I can't say the same about biology. In the two decades and change since my freshman biology coursework, I can certainly think of major discoveries either made or cemented in that time which deserve the attention of the earliest students. Like the course I assisted with at Harvard, my course had one semester of cells and smaller biology and one of organisms and bigger; I'll mostly focus on the cells and smaller because that's where I spend most of my time. But, there are certainly some strong candidates for inclusion in that other course. I'll also recognize the fact that perhaps for space reasons some of these topics would necessarily be pushed into the second tier courses which specialize in an area such as genetics or cell biology.

One significant problem I'll punt on: what to trim down. I don't remember much fat in my course (indeed, beyond the membrane we didn't speak much at all on it that year!). Perhaps that's what I have forgotten, but I think it more likely it was already pretty packed. I can think of some problem set items that can be jettisoned (Maxam-Gilbert sequencing is a historical curiosity at this point; I can think of much more relevant procedures to do on paper).

One topic I've convinced myself belongs in the early treatment is the proteasome, and not because I once spent a lot of time thinking about it (and also saw some financial gain -- though I no longer have such an interest in it). This is definitely a field which didn't exist on solid ground when I went through school, so its absence from my early education is understandable. First, it fits neatly into one of the key themes of introductory biology: homeostasis. Cells and organisms have mechanisms for returning to a central tendency, and the proteasome plays that role for proteins. Proteasomes also form a nice bookend with ribosomes -- we learned that proteins are born but not how they die. Furthermore, not only do proteins have a lifespan, but not every protein has the same lifespan -- and lifespans are not fixed at birth. Finally, another great lesson in freshman bio is around enzyme inhibitor types -- and the proteasome is the ultimate enzyme inhibitor. Plus, I'd try to mention the case of "the enemy of my enemy protein is my friend" -- proteasomes can activate one protein by destroying its inhibitor.

That's also a nice segue into another major theme worth developing: regulation. I think the main message here is that any time a cell needs to process an mRNA or protein, it's an opportunity for regulation. Post-translational modifications of proteins play a key role here.

Furthermore, it's worth noting that regulation often uses chains of proteins ("pathways"). These chains offer both new opportunities for regulation and signal amplification. We spent a lot of time looking at the chains of enzymes that turn sugars into energy. Of nearly equal importance is the idea that chains of proteins (and not all of them enzymes) can control a cell. In addition, it is important to recognize that these pathways are organized into functional modules, reflecting both opportunities for control and their evolution.

Clearly in this spot the fact that we can now sequence entire genomes deserves mention. Beyond that, I think the most important fact to impress on young minds is how bewildered we still are by even the simplest genomes.

Stem cells are an important concept, and not only because they are a hot topic in the popular press and political arena. This is a key idea -- cell divisions which proceed in an asymmetric pattern.

One final clear concept for inclusion at this level is epigenetics. It is key to underline that there are means to transmit information in a heritable way which are not specifically encoded in the DNA sequence -- as important as that sequence is.

I'm sure I've missed a bunch of topics. There are a lot of ideas in the grey zone -- I haven't quite convinced myself they belong in freshman bio but certainly belong a course up. For example, the fact that organisms can borrow from other genomes (horizontal transfer) or even permanently capture entire organisms (endosymbionts) certainly belongs in cell bio or genetics, but I'm not sure it quite fits freshman year (but nor am I certain it doesn't). Lipid rafts and primary cilia and all sorts of other newly discovered (or re-discovered) subcellular structures definitely would fit in my curriculum there. Gaseous signalling molecules would definitely warrant mention, though perhaps along with the hormones in the organisms and bigger semester.

With luck, many will read this and be kind enough (and gentle while doing it) to point out the big advances of the last score of years which deserve inclusion as well -- and I also have little doubt that many freshmen this year are being exposed to topics I wasn't, because back then they didn't exist.

Thursday, March 18, 2010

Second Generation Sequencing Sample Prep Ecosystems

A characteristic of each of the existing second generation sequencing instruments is that each manufacturer provides its own collection of sample preparation reagents and kits.

Some of this variation is inherent to a particular platform. For example, the use of terminal transferase tailing is part of the supposed charm of the Helicos sample prep. Polonator needs to use a series of tag generation steps to overcome its extremely short read length. Illumina's flowcells need to be doped with specific oligos. So, some of this is natural.

On the other hand, it does complicate matters -- especially for the various third parties producing sample preparation options. For example, targeted resequencing using hybridization really needs to have the sequencing adapters blocked with competing oligos -- and those will depend on which platform the sample has been prepared for. Epicentre has a clever technology using engineered transposases to hop amplification tags into template molecules -- but this must be adapted for each platform. Various academic protocols are developed with one platform in mind, even when there is really no striking functional reason for platform specificity -- but a protocol developed on one really needs to be recalibrated for any other. In any case, it would be great for benchmarking instruments if precisely the same library could be fed into multiple machines -- and it would be great for researchers looking to buy sequencing capacity on the open market to be able to defer committing to a platform until the last minute.

In light of all this, it is interesting to contemplate whether this trend will continue. One semi-counter trend has been for all three major players, 454, Illumina and SOLiD, to announce smaller versions of their top instruments. Not only will these require less up-front investment, but they will apparently use all the same consumables as their big siblings -- but not as efficiently. So if you are looking at cost per base in reagents, they won't look good.

However, an even more interesting trend that might emerge is for new players to piggyback atop the old. Ion Torrent has dropped a hint that they might pursue this direction -- while the precise sample preparation process has yet to be announced (and the ultimate stages prior to loading on the sequencer are likely to be platform-specific), Jonathan Rothberg suggested in his Marco Island talk (according to reports) that the instrument could sequence any library currently in existence. This suggests that they may be willing to encourage & support preparing libraries with other platforms' kits.

Of course, for the green eyeshade folks at the companies this is a big trade-off. On the one hand, it means a new entrant can leverage all the existing preparation products and experience. Furthermore, it means a new instrument could easily enter an existing workflow. The Ion Torrent machine is particularly intriguing here as a potential QC check on a library prior to running on a big machine -- at $500 a run (proposed) it would be worth it (particularly if playing with method development) and with a very short runtime it wouldn't add much to the overall time for sequencing. PacBio may play in this space also, if libraries can be easily popped in. This also acts as a "camel's nose in the tent" strategy for gaining adoption -- first come in as a QC & backstop, later munch up the whole process.

Of course, the other side of the equation for the money counters is that selling kits is potentially lucrative. Indeed, it could be so lucrative that an overt attempt to leverage other folks' kits might meet with nasty legal strategies -- such as a kit being licensed only for use with a particular platform. That would be silly -- if you are making money off every kit, why not market to all comers?

Thursday, March 11, 2010

Playing Director

Last weekend was the Academy Award presentations. Fittingly, just beforehand I had my first theatrical success -- with Actors.

Actors are a Scala abstraction for multiprocessing. I've only really played with multiprocessing once, back in my waning days at Harvard. I tried writing some multithreaded Java code, and the results were pretty ugly. The code soon became cluttered with locks and unlocks and synchronized keywords, but my programs still locked up consistently. Multiple processes can be a real headache.

But, there's also the benefit -- especially since I have a brand new smoking fast oligoprocessor box (I keep some mystery in the precise number). Tools such as bowtie and BWA are multithreaded, but it would be useful to have some of the downstream data crunching tools enabled as well.

Actors are a high-level abstraction which relies on message passing. Each Actor (or non-Actor process) communicates with other Actors by sending an object. The Actor figures out what sort of object has been thrown its way and acts on it. A given Actor will execute its tasks in the order given, but across the cast there are no guarantees; everything is asynchronous. Each Actor behaves as if it has its own thread, though in reality a pool of worker threads manages the execution of the Actors -- threads tend to be heavyweight to start up, so this scheme minimizes that overhead and thereby encourages casts of millions -- but I won't emulate de Mille for some time.
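For anyone who hasn't seen the library, here is a minimal sketch of the style (toy message types of my own invention, using the scala.actors library as it stands in 2.8):

import scala.actors.Actor._

// messages are plain immutable objects thrown at the Actor
case class Line(text: String)
case object Flush

// an Actor that buffers lines and "writes" them when told to flush
val writer = actor {
  var buffered = List[String]()
  loop {
    react {
      case Line(text) => buffered = text :: buffered
      case Flush      => buffered.reverse.foreach(println); buffered = Nil
    }
  }
}

writer ! Line("probeset_1\tAA\tAB")     // fire and forget -- delivery is asynchronous
writer ! Line("probeset_2\tBB\tNoCall")
writer ! Flush

The ! send returns immediately; the master never blocks here, which is exactly what makes the completion problem described below interesting.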

My first round of experiments left me with new bruises -- but I did come out on top. Some lessons learned are below.

First, get the screenplay nailed down as much as possible before involving the Actors. Debugging multithreaded code brings on its own headaches; don't bring down that mess of trouble before you need to. For example, in the IDE I am using (Eclipse with the Scala plug-in, which some day I will rant about) if you hit a debugging breakpoint in one thread the others keep going. In my case, that meant a println statement from my master process saying "I'm waiting for an Actor to finish" -- which kept printing and thereby prevented me from examining variables (because in Eclipse, if Console is being written to it automatically pops to center stage).

A corollary to this is that after several iterations I had improved the algorithm so much it probably didn't need Actors any more! I really should time it with 0, 1 and 2 Actors (and both configurations of 1 Actor -- the code runs in 3 stages, and a single Actor can do either the last one or the last two); the code is about 2/3 of the way to enabling that. Actually, one reason I went through a final bit of algorithmic rethink was that the Actor-enabled code was still a time pig -- the rethought version ran like a greased pig.

Second, remember the abstraction -- everything is passed as messages and these messages may be processed asynchronously. More importantly, always pretend that the messages are being passed by some degree of copying. An early version of my code ignored this and had code trying to change an object which had been thrown to an Actor. This is a serious no-no and leads to unpredictable results. You give stage commands to your Actors and then let them work!

Third, make sure you let them finish reciting their lines! My first master thread didn't bother to check whether all the Actors were done -- which led to all sorts of mystifying run-to-run variation in the output (until I realized it was an asynchrony problem). Checking for being done isn't trivial either. One way is to have a flag variable in your Actor which is set when it runs out of things to do. That's good -- as long as you can easily figure out how to set it. You can also look to see if an Actor is done processing messages -- except checking for an empty mailbox doesn't guarantee it is done processing that last message, only that it has picked it up. One approach that worked for my problem, since it is a simple pipelining exercise, is to have Actors throw "NOP" messages at themselves prior to doing any long process -- especially when the master thread sends them a "FLUSH" command to mark the end of the input stream. Such No OPeration messages keep the mailbox full until the Actor is done with the real work.
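A rough sketch of the flag idea (a toy reconstruction, not my actual pipeline code -- the message types and the polling loop are made up for illustration): the worker flips a flag once it has handled the FLUSH, and the master spins until it sees it.

import scala.actors.Actor
import scala.actors.Actor._

case class Work(chunk: String)
case object Flush

class Writer extends Actor {
  @volatile var finished = false        // the "flag variable" the master can poll
  def act() {
    loop {
      react {
        case Work(chunk) =>
          ()                            // ... the long per-chunk computation would go here ...
        case Flush =>
          // ... write out anything still buffered, then announce we are done ...
          finished = true
      }
    }
  }
}

val writer = new Writer
writer.start()
writer ! Work("chunk 1")
writer ! Flush

// crude completion check from the master thread: spin until the flag flips
while (!writer.finished) Thread.sleep(50)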

So, I have a working production. I'll be judicious in how I use this, as I have discovered the challenges (in addition to the problems I solved above, there is a way to send synchronous messages to Actors -- which I could not get to behave). But, I am already thinking of the next Actor-based addition to some of my code -- and my current treatment is pretty complicated. But, a plot snip here and a script change there and I should be ready for tryouts!