Monday, June 28, 2010

Knome Cofactors Ozzy Osbourne's Genome

I got an email today pointing out that Cofactor Genomics will be responsible for generating Ozzy Osbourne's genome sequence for Knome. Said email, unsurprisingly, originated at Cofactor.

Now, before anyone accuses me of being a shill, let me point out that (a) I get no compensation from any of these companies and (b) I've asked for multiple quotes from Cofactor in good faith but have yet to actually send them any business. Having once been in the role of quote-generator and knowing how frustrating it is, I have a certain sympathy & see spotlighting them as a certain degree of reasonable compensation.

Knome's publicity around the Osbourne project has highlighted his body's inexplicable ability to remain functioning despite the tremendous chemical abuse Mr. Osbourne has inflicted on it. The claim is that his genome will shed light on this question. Given that there are no controls in the experiment and he is an N of one, I doubt anything particularly valuable scientifically will come of this. I'm sure there will be a bunch of interesting polymorphisms which can be cherry-picked -- we all carry them. Plus, the idea that this particular individual will change his life in response to some genomics finding is downright comical -- clearly this is not someone who thinks before he leaps! It's about as likely as finding the secret to his survival in bat head extract.

Still, it is a brilliant bit of publicity for Knome. Knome could use the press given that the FDA is breathing down their necks along with the rest of the Direct to Consumer crowd. Getting some celebrities to spring for genomes will set them up for other business as their price keeps dropping. The masses buy the clothes, cars and jewelry worn by the glitterati, so why not the genome analysis? Illumina ran Glenn Close's genome, but Ozzy probably (sad to say) has a broader appeal across age groups.

Sunday, June 27, 2010

Filling a gap in the previous post

After thinking about my previous entry on PacBio's sample prep and variant sensitivity paper I realized there was a significant gap -- neither dealt with gaps.

Gaps, or indels, are very important. Not only are indels a significant form of polymorphism in genomes, but small indels are one route by which tumor suppressor genes can be knocked out in cancer.

Small indels have also tended to be troublesome in short read projects -- one cancer genome project missed a known indel until the data was reviewed manually. Unfortunately, no more details were given of this issue, such as the depth of coverage which was obtained and whether the problem lay in the alignment phase or in recognizing the indel. The Helicos platform (perhaps currently looking like it should have been named Icarus) had a significant false indel rate due to missed nucleotide incorporations ("dark bases"). Even SOLiD, whose two-base encoding should in theory enable highly accurate base calling, had only a 25% rate of indel verification in one study (all of these are covered in my review article).

Since PacBio, like Helicos, is a single molecule system it is expected that missed incorporations will be a problem. Some of these will perhaps be due to limits in their optical scheme but probably many will be due to contamination of their nucleotide mixes with unlabeled nucleotides. This isn't some knock on PacBio -- it's just that any unlabeled nucleotide contamination (either due to failure to label or loss of the label later) will be trouble -- even if they achieve 1 part in a million purity that still means a given run will have tens or hundreds of incorporations of unlabeled nucleotides.
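The arithmetic behind that "tens or hundreds" estimate can be sketched in a few lines. This is only a back-of-envelope illustration: the run sizes used below (10 million and 100 million incorporations) are assumptions chosen to match the estimate, not published PacBio figures, and the function name is made up for the example.

```python
def expected_unlabeled_incorporations(total_incorporations, unlabeled_ppm):
    """Expected number of unlabeled ("dark") incorporations in one run.

    total_incorporations: nucleotides incorporated across the whole run
                          (hypothetical value, not a vendor specification)
    unlabeled_ppm:        unlabeled nucleotides per million in the mix
    """
    return total_incorporations * unlabeled_ppm / 1_000_000


# Hypothetical run sizes, chosen only for illustration.
for total in (10_000_000, 100_000_000):
    dark = expected_unlabeled_incorporations(total, unlabeled_ppm=1)
    print(f"{total:,} incorporations at 1 ppm unlabeled: about {dark:.0f} dark bases")
```

Each of those dark bases would show up as an apparent single-base deletion in a raw read, which is why even very pure nucleotide mixes do not make the problem vanish.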

There's another interesting twist to indels -- how do you assign a quality score to them? After all, a typical phred score encodes (on a log scale) the probability that the called base is wrong. But what if no base should have been called? Or another base called prior to this one? For large numbers of reads, you can calculate some sort of probability of seeing the same event by chance. But for small numbers, can you somehow sort the true from the false? One stab I (and I think others) have made at this is to give an indel a pseudo-phred score derived from the phred scores of the flanking bases -- the logic being that if those were strong calls then there probably isn't a missing base but if those calls were weak then your confidence in not skipping a base is poor. The function to use on those adjacent bases is a matter of taste & guesswork (at least for me) -- I've tried averaging the two or taking the minimum (or even computing over longer windows), but have never benchmarked things properly.
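A minimal sketch of that flanking-base heuristic might look like the following. The function name, the `gap_index` convention and the default window are all illustrative choices, and (as noted above) none of this has been benchmarked; it simply makes the "minimum or average over a window" idea concrete.

```python
def indel_pseudo_phred(quals, gap_index, window=1, combine="min"):
    """Pseudo-phred score for a gap, derived from the flanking base qualities.

    quals:     per-base phred scores for the read
    gap_index: the gap sits between quals[gap_index - 1] and quals[gap_index]
    window:    how many bases to use on each side of the gap
    combine:   "min" (most pessimistic flank) or "mean" (average of flanks)
    """
    left = quals[max(0, gap_index - window):gap_index]
    right = quals[gap_index:gap_index + window]
    flank = list(left) + list(right)
    if not flank:
        raise ValueError("no flanking bases available for this gap")
    if combine == "min":
        return min(flank)
    if combine == "mean":
        return sum(flank) / len(flank)
    raise ValueError(f"unknown combine method: {combine!r}")


# Made-up qualities: the gap falls between the weak call (12) and a strong one (40).
read_quals = [30, 35, 12, 40, 38]
print(indel_pseudo_phred(read_quals, gap_index=3, combine="min"))
print(indel_pseudo_phred(read_quals, gap_index=3, combine="mean"))
print(indel_pseudo_phred(read_quals, gap_index=3, window=2, combine="mean"))
```

Note that averaging phred scores averages in log space, so a single weak flank is diluted far more than it would be if the underlying error probabilities were averaged; the minimum is the more conservative choice.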

Some variant detection algorithms for short read data deal with indels (e.g. VarScan) and some don't (e.g. SNVMix). Ideally they all would. There's also a nice paper on the subject that unfortunately didn't leave lazy folks like me their actual code.

So, in conclusion I'd love to see PacBio (and anyone else introducing a new platform) perform a similar analysis of their circular consensus sequencing on a small amount of a 1 nucleotide indel variant spiked into varying (but large) backgrounds of wildtype sequence. Indeed, ideally there would be a standard set of challenges which the scientific community would insist that every new sequencing platform either publish results on or confess inadequacy for the task. As I suggested in the previous post, dealing with extremes of %GC, hairpins, mono/di/tri/tetranucleotide repeats should be included, along with the single nucleotide substitution and single nucleotide deletion mixtures. To me these would be far more valuable in the near term than the sorts of criteria in the sequencing X-prize for exquisite accuracy. Those are some amazing specs (which I am on record as being skeptical they will be met anytime soon). What those of us trying to match platforms with experiments (and budgets) really need is the nitty gritty details of what works and what doesn't at what cost (or coverage).

Wednesday, June 23, 2010

PacBio oiBcaP PacBio oiBcaP

PacBio has a paper in Nucleic Acids Research giving a few more details on sample prep and some limited sequencing quality data on their platform.

The gist (which was largely in the previous Science paper) is that the templates are prepared by ligating a hairpin structure on to each end. In this paper they focused on a PCR product, with the primers designed with type IIS (BsaI) restriction sites flanking the insert. Digestion yields sticky ends (and with BsaI, these can be designed to be non-symmetric to discourage concatamerization) which enable ligating the adapters. Of course, for applying this on a large scale you do run into the problem of avoiding the restriction sites. There are other approaches, such as uracil-laden primer segments which can be specifically destroyed by the DUT and UNG enzymes -- the only catch being that many of the most accurate polymerases can be poisoned by uracil. A couple of other schemes have either been proven or can be sketched quickly.

In any case, once these molecules are formed they have the nice property of being topologically circular and having the insert sequence represented twice (once in reverse complement) within that circle, with the two pieces separated by the known linker sequence. So if the polymerase lasts to go round-and-around, then each pass gives another crack at the sequence. For short templates, this gives lots of rounds and even on a longer (1Kb) PCR product they successfully had three bites at the apple. Feed these into a probabilistic aligner and each pass improves the quality values. Running the same template many times enabled calibrating the quality values (which are Phred-scaled), showing them to be a reasonable estimator of the likelihood of miscalling the consensus.
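To get a feel for why extra passes help, here is a deliberately crude toy model. It is not the probabilistic aligner described in the paper: it simply assumes each pass miscalls a position independently with the same probability and takes a majority vote, treating the worst case in which all wrong passes agree with each other. The 10% per-pass error rate is an illustrative assumption, not a figure from the paper.

```python
from math import comb


def majority_vote_error(per_pass_error, n_passes):
    """Probability that more than half of n independent passes are wrong.

    This is a pessimistic bound on a simple majority-vote consensus,
    since it assumes every wrong pass makes the same wrong call.
    Use an odd n_passes to avoid ties.
    """
    p = per_pass_error
    return sum(
        comb(n_passes, k) * p**k * (1 - p) ** (n_passes - k)
        for k in range(n_passes // 2 + 1, n_passes + 1)
    )


# Hypothetical 10% single-pass error rate.
for passes in (1, 3, 5, 9):
    print(passes, majority_vote_error(0.10, passes))
```

Even this naive scheme shows the consensus error falling quickly as passes accumulate; a real probabilistic aligner that uses per-base quality values should do better than a bare vote.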

Given that eventually their polymerase dies on a template, they point out that the length of a fragment is highly predictive of the number of passes which can be observed which in turn implies the sequencing quality which will be obtained for a template. This makes for an interesting variation on the usual "what size library" question in sequencing. For sequencing a bacterial genome, it may make sense to go very deep with short fragments to get high quality pieces which can then be scaffolded using much longer individual reads without the circular consensus effect as well as the spaced "strobe" reads to provide very long range scaffolding. For a bacterial genome & the claims made of capacity per SMRT cell, it would seem that making 2 libraries (one short for consensus sequencing, one long for the others) and burning perhaps 4-5 $100 SMRT cells, one could get a very good bacterial genome draft. For targeted sequencing by PCR, the approach is quite attractive. Amplicon length becomes an important variable for final quality. Appropriate matching of the upstream multi-PCR technology (such as RainDance or Fluidigm or just lots of conventional PCRs) with PacBio for capacity will be necessary -- and hard numbers on throughput/SMRT cell are desperately needed!
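The relationship between fragment length and number of passes can be roughed out as below. Everything numeric here is hypothetical: the polymerase read length (3,200 bases) and the hairpin adapter length (50 bases) are placeholders picked so that a 1 kb insert gets three passes, in line with the result mentioned above. The only structural assumption is that one pass means traversing the insert once plus one hairpin.

```python
def complete_passes(polymerase_read_length, insert_length, adapter_length):
    """Number of complete passes over the insert before the polymerase dies.

    One lap of the circular template covers the insert twice (forward and
    reverse complement) with a hairpin adapter after each, so a single pass
    costs insert_length + adapter_length bases.
    """
    bases_per_pass = insert_length + adapter_length
    return polymerase_read_length // bases_per_pass


# Hypothetical polymerase read length and adapter size.
READ_LENGTH = 3200
ADAPTER = 50
for insert in (100, 250, 1000):
    print(insert, complete_passes(READ_LENGTH, insert, ADAPTER))
```

Under these made-up numbers a 100 bp amplicon gets roughly seven times as many passes as a 1 kb one, which is the trade-off between read length and consensus quality that the library-size decision has to balance.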

One side thought: is there a risk with strobe sequencing of accidentally going off the end and reading back along the same insert -- but not realizing it since the flip turn in the hairpin was done in the dark? It would suggest that libraries for strobe sequencing need to be very stringently sized.

One curiosity of their plot of this phenomenon is that the values appear to be asymptotically approaching phred 40 -- an error rate of 1 in 10,000. Is this really where things top out? That's a good quality -- but for some applications possibly not good enough.
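For anyone who wants to check the conversion, the standard phred relationship is Q = -10 * log10(P), where P is the error probability. A small sketch (the function names are just for this example):

```python
from math import log10


def phred_to_error(q):
    """Error probability implied by a phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred-scaled quality value for a given error probability."""
    return -10 * log10(p)


for q in (20, 30, 40, 50):
    p = phred_to_error(q)
    print(f"phred {q}: about 1 error in {round(1 / p):,}")
```

So a ceiling at phred 40 means roughly one wrong consensus call per 10,000 bases, and each additional 10 points would buy another factor of ten.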

They go on to apply this to measuring the allele frequency of a SNP in defined mixtures. Given several thousand reads and the circular consensus, they succeeded at this -- even when the minor allele was present at a frequency of 2.5%. This isn't as far as some groups have pushed 2nd generation sequencing for rare allele detection (such as finding a cancer mutation in a background of normal DNA), but is certainly in the range of many schemes used in this context (such as real-time PCR). If my inference of a maximum phred score is correct, it would tend to limit the sensitivity for rare mutation detection somewhere in the parts-per-thousand range; the record is around this (0.1%, or 1 in 1,000, reported by Yauch et al.).

My major concern with this paper is the lack of a breadth of substrates. Are there template properties which give trouble to the system? The obvious candidates are extremes of composition, secondary structures and simple repeats. Will their polymerase power through 80+% GC? Complex hairpins? Will stuttering be observed on long simple repeats? Can long mononucleotide runs be measured accurately?

In the end, the key uses for the first release PacBio system will depend on where these problems are and those throughput numbers. This paper is a useful bit of information, but much more is needed to determine when PacBio is either cost-effective for an application (vs. other sequencing or non-sequencing competitors) or when the performance advantages in that application (such as turn-around time) push them to the fore. Personally, I think it is a mistake on PacBio's part not to be literally flooding the world with data. Or at least the bioinformatics world -- only when they start releasing data to the broad developer community will there be the critical tweaking of existing tools and development of new tools to really take advantage of this platform and also cope with whatever weaknesses it possesses.
Travers, K., Chin, C., Rank, D., Eid, J., & Turner, S. (2010). A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research. DOI: 10.1093/nar/gkq543

Monday, June 07, 2010

What's Gnu in Sequencing?

The latest company to make a splashy debut is GnuBio, a startup out of Harvard which gave a presentation at the recent Personal Genomics conference here in Boston. Today's Globe Business section had a piece, Bio-IT World covered it & also Technology Review. Last week Mass High Tech & In Sequence (subscription required) each had a bit too.

GnuBio has some grand plans, which are really in two areas. For me the more interesting one is the instrument. The claim which they are making, with an attestation of plausibility from George Church (who is on their SAB, as is the case with about half of the sequencing instrument companies), is that a 30X human genome will be $30 in reagents on a $50K machine (library construction costs omitted, as is unfortunately routine in this business). The key technology from what I've heard is the microfluidic manipulation of sequencing reactions in picoliter droplets. This is similar to RainDance, which has commercialized technology out of the same group. The description I heard from someone who attended the conference is that GnuBio is planning to perform cyclic sequencing by synthesis within the droplets; this will allow minuscule reagent consumption and therefore low costs.

It's audacious & if they really can change out reactions within the picoliter droplets, technically it is quite a feat. From my imagination springs a vision of droplets running a racetrack, alternately getting reagents and being optically scanned for which base came next & an optical barcode on each droplet. I haven't seen this description, but I think it fits within what I have heard.

Atop those claims comes another one: despite having not yet read a base with the system, by year end two partners will have beta systems. It will be amazing to get proof-of-concept sequencing, let alone have an instrument shippable to a beta customer (this also assumes serious funding, which apparently they haven't yet found). Furthermore, it would be stunning to get reads long enough to do any useful human genome sequencing even after the machine starts making reads, let alone enough for 30X coverage.

The article in Technology Review, a journal I once read regularly and had significant respect for, is depressingly full of sloppy journalism & failure to understand the topic. One paragraph has two doozies:
Because the droplets are so small, they require much smaller volumes of the chemicals used in the sequencing reaction than do current technologies. These reagents comprise the major cost of sequencing, and most estimates of the cost to sequence a human genome with a particular technology are calculated using the cost of the chemicals. Based solely on reagents, Weitz estimates that they will be able to sequence a human genome 30 times for $30. (Because sequencing is prone to errors, scientist must sequence a number of times to generate an accurate read.)

The first problem here is that yes, the reagents are currently the dominant cost. But if library construction costs are somewhere in the $200-500 range, then once you drop reagents far below that cost it's a bit dishonest to tout (and poor journalism to repeat) a $30/human genome figure. Now, perhaps they have a library prep trick up their sleeve or perhaps they can somehow go with a Helicos-style "look Ma no library construction" scheme. Since they have apparently not settled on a chemistry (which will also almost certainly impose technology licensing costs -- or developing a brand new chemistry -- or getting the Polonator chemistry, which is touted as license-free), anything is possible -- but I'd generally bet this will be a clonal sequencing scheme requiring in-droplet PCR. The second whopper there is the claim that the 30X coverage is needed for error detection. It certainly doesn't hurt, but even with perfect reads you still need to oversample just to have good odds of seeing both alleles in a diploid genome.
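The oversampling point can be made concrete with a simple sampling model. The sketch below assumes error-free reads, a heterozygous site where each read has a 50/50 chance of coming from either allele, and Poisson-distributed depth around the mean coverage. The `min_reads` threshold (how many reads of each allele you want before believing the call) is an illustrative parameter, not anything from the article.

```python
from math import comb, exp


def p_both_alleles(depth, min_reads=1):
    """P(each allele is seen at least min_reads times) at a fixed depth."""
    favorable = sum(comb(depth, k) for k in range(min_reads, depth - min_reads + 1))
    return favorable / 2**depth


def p_both_alleles_at_coverage(mean_coverage, min_reads=1, max_depth=200):
    """Same probability, averaged over Poisson-distributed depth."""
    total = 0.0
    p_depth = exp(-mean_coverage)  # Poisson probability of depth 0
    for depth in range(max_depth + 1):
        total += p_depth * p_both_alleles(depth, min_reads)
        p_depth *= mean_coverage / (depth + 1)  # step to depth + 1
    return total


for coverage in (1, 5, 10, 30):
    print(coverage, p_both_alleles_at_coverage(coverage, min_reads=3))
```

Even with perfect reads, low coverage leaves a large fraction of heterozygous sites with one allele unseen or barely seen; deep coverage is what pushes that fraction toward zero across three billion positions.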

Just a little later in the story is the claim "The current cost to sequence a human genome is just a few thousand dollars, though companies that perform the service charge $20,000 to $48,000", which confuses what one company (Complete Genomics) may have achieved with what all companies can achieve.

The other half of the business plan I find even less appealing. They are planning to offer anyone a deal: pay your own way or let us do it, but if we do it we get full use of the data after some time period. The thought is that by building a huge database of annotated sequence samples, a business around biomarker discovery can be built. This business plan has of course been tried multiple times (Incyte, GeneLogic, etc.) and has not really worked in the past.

Personally, I think whoever is buying into this plan is deluding themselves in a huge way. First, while some of the articles seem to be confident this scheme won't violate the consent agreements on samples, it's a huge step from letting one institution work with a sample to letting a huge consortium get full access to potentially identifying data. Second, without good annotation the sequence is utterly worthless for biomarker discovery; even with great annotation randomly collected data is going to be challenging to convert into something useful. Plus, any large scale distribution of such data will butt up against the widely accepted provision that subjects (or their heirs) can withdraw consent at any time.

The dream gets (in my opinion) just daffier beyond that -- subjects will be able to be in a social network which will notify them when their samples are used for studies. Yes, that might be something that will appeal to a few donors, but will it really push someone from not donating to donating? It's going to be expensive to set up & potentially a privacy leakage mechanism. In any case, it's very hard to see how that is going to bring in more cash.

My personal advice to the company is many-fold. First, ditch all those crazy plans around forming a biomarker discovery effort; focus on building a good tech (and probably selling it to an established player). Second, focus on RNA-Seq as your initial application -- this is far less demanding in terms of read length & will allow you to start selling instruments (or at least generating data) much sooner, giving you credibility. Of course, without some huge drops, the cost of library construction will dwarf that $30 in reagents, perhaps by a factor of 10. A clever solution there using the same picodroplet technology will be needed to really get the cost of a genome to low levels -- and could be cross-sold to the other platforms (and again, perhaps be a source of a revenue stream while you work out the bugs in the sequencing scheme).

Finally, if you really can do an RNA-Seq run for $30 a run in total operating costs, could you drop an instrument by my shop?

Saturday, June 05, 2010

Trying to Kick the Bullet

I know the title looks like a malapropism, but it isn't. What I'm trying to do is wean myself away from bulleted lists in PowerPoint.

I have a complex relationship with PowerPoint, as with the other tools in Microsoft Office. Each has real value but has also been seriously junked up by the wizards of Redmond. Far too much time is spent investing the tool with features that are rarely or never of value.

Edward Tufte, whom I admire greatly, takes a far more negative view of PowerPoint. I'm a fan of Tufte's; not an acolyte. PowerPoint can be very useful, if you use it carefully. But, I'm always open to considering how I might improve how I use it. After looking through some of Tufte's specific criticisms of bulleted lists, I realize here is an opportunity to make a change.

Now, I am a heavy user of bulleted lists. I often think in hierarchies & outlines, which fits them well. I also find it challenging to draw complex diagrams in a manner which is both presentable & useful, at least in reasonable time. So I often write many slides of bulleted lists & then try to go back and decorate them with relevant & informative diagrams I can lift from various other sources (or from previous slides). I do spend some time designing a few careful diagrams using the tools in PowerPoint.

So what is wrong with bulleted lists? As Tufte points out, the standard Microsoft scheme uses four or more different attributes to display levels of hierarchy. First, inferior levels are more indented than their parents. Second, the type size is changed. Third, the type face is changed to a different family or italicized (or both). Fourth, the character used for the bullet is changed. On top of Tufte's sharp criticism, there is the pointed satire as found in The Gettysburg PowerPoint Address.

Now, I find some of this useful, but it is a good wakeup that some of it I have just been accepting. After all, do I really need the actual bullets? Rarely are they actually showing anything useful -- the one exception being when I change them around in one list to show a useful attribute (such as checks vs. X's). Another variant is using a numbered list to emphasize a critical order of points, such as in a series of steps which must be executed in order. But most bullets are just consuming valuable slide real estate without adding value.

I also find the indents useful to show hierarchy. And having a smaller typeface is useful since the upper levels are more on the order of headlines and the lower levels often details -- so a smaller face allows me to pack more in.

But the change in font family or italicization? Those aren't very useful either. I'd much rather save italics for emphasis.

The challenge is actually putting this into practice. I have started going through active slide decks and converting them to the reduced list scheme. I don't see myself giving up hierarchical lists, but I'll try to do better within that structure.