Sunday, February 28, 2010

Programming Fossil Stumble, but Scala School Progresses

If you watch this space by RSS, make sure you catch the additions and corrections by BioIT World's Kevin Davies and my fellow Codon Devices alum Jack Leonard on the Ion Torrent technology (as well as Daniel MacArthur's piece on the last day of AGBT). All were on site & actually saw the machine -- it's a bit scary to see my piece on the PARE technology tweeted (and retweeted) as a substitute for a missed session.

I'll take a break for at least a few days from AGBT & try to regain some calm -- a sequencing instrument that you can buy with a home equity loan is a dangerous temptation.

I've been having trouble carving out time -- and enthusiasm -- for my Scala retraining exercise. A week or so ago I did make one try and hit an annoying roadblock. I had previously worked through online examples to automagically convert Java iterators into a clean Scala idiom. So, I decided to try this with BioJava sequence iterators -- and had the rude surprise that these don't implement the Iterator interface! Aaarggh! The documentation is suggestive of the reason -- when BioJava sequence iterators were created, Java didn't support typesafe iterators (due to a lack of generic types). That's since been grafted onto Java, but BioJava hasn't updated to embrace this. Most likely this is to guarantee backwards compatibility -- in a sense it is a fossil record of what Java once was.
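For a generic java.util.Iterator the conversion really is nearly automatic; the pain with BioJava is precisely that its sequence iterators predate this interface. A minimal sketch of the wrapper idiom (my own toy code, with hypothetical names -- not BioJava's API):

```scala
import java.util.{Iterator => JIterator}
import scala.language.implicitConversions

// Sketch: wrap any java.util.Iterator so Scala's for-comprehensions
// and collection methods work on it directly.
object IteratorBridge {
  implicit def javaToScala[A](it: JIterator[A]): Iterator[A] = new Iterator[A] {
    def hasNext = it.hasNext
    def next() = it.next()
  }

  def demo(): List[String] = {
    val jlist = new java.util.ArrayList[String]()
    jlist.add("chr21"); jlist.add("chr22")
    // with the conversion in scope, the Java iterator behaves like a Scala one
    (for (name <- javaToScala(jlist.iterator())) yield name.toUpperCase).toList
  }

  def main(args: Array[String]): Unit = println(demo().mkString(","))
}
```

Since BioJava's SequenceIterator doesn't implement java.util.Iterator, an equivalent wrapper there would have to delegate to its own hasNext/nextSequence-style methods instead.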

On Friday I had a big block of meeting-free time and resolved to attack things again. It's a big jump really switching over to a functional programming style (and taking the leap of faith that the Scala compiler and JVM JIT will make some efficient code out of it). Scala also is very forgiving about syntax marks -- but in a very context-dependent manner. Personally, I'd prefer a stricter taskmaster, as being rapped on the knuckles for every infraction tends to reinforce the lesson faster.

For my problem, I chose to write a tool which would stream through a SAM format second gen sequencing alignment file and compute the number of reads covering each position. The program assumes that the SAM file is sorted by chromosome and then by position. Also, this first cut can't work on a very large file -- memory conservation was not a design constraint (though I did try to work some in).

Now, in some sense this is not a good problem to tackle -- not only will I avoid using the Picard library for processing SAM but there are already tools out there to perform the calculation. So I'm guilty of the sin of reinventing the wheel. But, it is a simple problem to formulate and has a nice trajectory forward for exploring multiprocessing and other fun topics. Plus, I'll try to couple things loosely enough that dropping in Picard should be possible without much acrobatics.

The topmost code is a bit boring: it opens a file of SAM data with a class that parses it (SamStreamer) and feeds the reads to one that counts coverage (chrCoverageCollector).

object samParser extends Application
{
  val filename = "chr21.20k.sam"
  val samFile = new SamStreamer(filename)
  val ccc = new chrCoverageCollector()
  for (read <- samFile.typicalFragmentFronts)
    ccc.addCoverage(read)
  ccc.dump
}

SamStreamer has four parts. The first one (fromSamLine) converts a line of SAM into an object of class AlignedPairedShortRead -- pretty dull. The second one (reads) demonstrates some key concepts. First, it looks like the definition of a val (an immutable value), but it has a function body instead of a direct assignment. Second, it goes through the lines of the file with a "for comprehension" -- which also filters out any lines starting with "@" (header lines). Finally, it ends with a yield -- meaning this acts like an iterator over the set. The other two parts follow the same pattern -- iterate over a collection or stream, with a yield delivering each result. Iterators naturally -- without any special declarations!

class SamStreamer(samFile: String)
{
  // Split a SAM line on tabs and pull out the fields we care about
  def fromSamLine(line: String): AlignedPairedShortRead =
  {
    val fields: Array[String] = line.split("\t")
    return new AlignedPairedShortRead(
      fields(0), fields(1), fields(2),
      Integer.parseInt(fields(3)),
      fields(9), fields(5), fields(6), Integer.parseInt(fields(7)))
  }
  // getLines walks the file lazily; the comprehension filters out "@" header lines
  val reads = for { line <- Source.fromFile(samFile).getLines
                    if !line.startsWith("@") } yield fromSamLine(line)
  val typicalFragmentFronts = for { read <- reads
                                    if read.frontOfTypicalFragment } yield read
  val mapped = for { read <- reads
                     if read.mapped } yield read
}
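One caveat worth flagging (my reading of the Scala collections semantics, not something the post tests): getLines hands back a one-shot Iterator, so reads, typicalFragmentFronts and mapped all draw from the same underlying stream -- consuming one exhausts the others. A toy demonstration:

```scala
// Toy data, not the SAM code: a Scala Iterator is consumed once;
// a second draw from the same comprehension yields nothing.
object IteratorOnce {
  def headerFiltered(lines: Iterator[String]): Iterator[String] =
    for (l <- lines if !l.startsWith("@")) yield l

  def main(args: Array[String]): Unit = {
    val body = headerFiltered(Iterator("@HD\tVN:1.0", "read1", "read2"))
    println(body.mkString(","))  // read1,read2
    println(body.hasNext)        // false -- the iterator is spent
  }
}
```

In the program above only typicalFragmentFronts is ever consumed, so the sharing does no harm -- but it would bite anyone who tried to use two of the three members.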

AlignedPairedShortRead is mostly a collection of fields and accessors for them. I could have coded this much more compactly -- I think. But that will be another lesson, as I just stumbled on it. The other methods are mostly tests for various states. For this example, "frontOfTypicalFragment" returns true if a read is the lower coordinate member of a read pair which maps within a short distance of its mate on the same chromosome. Actually, for this example calling into this is a bug -- I should have a method in SamStreamer to screen for all mapped reads. A more clever way still would be to pass the filtering function in to a more generic scanner function -- another exercise for a future session.

class AlignedPairedShortRead(fragName: String, flag: String, chr: String, pos: Int,
                             read: String, cigar: String, mateChr: String, matePos: Int)
{
  def mapped: Boolean = { return chr.startsWith("chr") && !cigar.equals("*") }
  def mateMapped: Boolean = { return (mateChr.equals("=") && pos != matePos) || mateChr.startsWith("chr") }
  def bothMapped: Boolean = { mapped && mateMapped }
  def frontOfTypicalFragment: Boolean = { return bothMapped && mateChr.equals("=") && pos < matePos + read.length }
  def id: String = { return fragName }
  def bounds: (Int, Int) = { return (pos, pos + read.length) }
  def Chr: String = { return chr }
}
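For what a compact version might look like, here's a sketch (my guess at the future lesson, not the post's code -- class and method names are hypothetical): a case class gets the constructor fields, accessors, equality and toString for free, and a higher-order "scanner" replaces one method per filter.

```scala
// Hypothetical compact form: a case class carries the fields,
// and select() takes the filtering predicate as an argument.
case class Aligned(chr: String, pos: Int, read: String, cigar: String) {
  def mapped = chr.startsWith("chr") && cigar != "*"
  def bounds = (pos, pos + read.length)
}

object AlignedDemo {
  // any predicate can be passed in, instead of hard-coding each filter
  def select(rs: List[Aligned], pred: Aligned => Boolean): List[Aligned] =
    for (r <- rs if pred(r)) yield r

  def main(args: Array[String]): Unit = {
    val rs = List(Aligned("chr21", 100, "ACGT", "4M"), Aligned("*", 0, "ACGT", "*"))
    println(select(rs, _.mapped).length)  // 1
  }
}
```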

chrCoverageCollector (gad! no consistency in my naming convention!) takes reads and assigns them to a CoverageCollector according to which chromosome they are on (addCoverage). Another method (dump) writes the results out in bedGraph format.

class chrCoverageCollector()
{
  val Coverage = new HashMap[String, CoverageCollector]
  def addCoverage(read: AlignedPairedShortRead) =
  {
    if (!Coverage.contains(read.Chr)) Coverage(read.Chr) = new CoverageCollector
    Coverage(read.Chr).increment(read)
  }
  def dump()
  {
    for (chr: String <- Coverage.keys)
      for (pc: RangeCoverage <- Coverage(chr).MergedSet)
        println(pc.BedGraph(chr))
  }
}
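As an aside, the contains-then-insert dance in addCoverage can be collapsed with the mutable HashMap's getOrElseUpdate, which inserts a default only when the key is absent. A sketch on toy per-chromosome counters (my example, not the post's code):

```scala
import scala.collection.mutable.HashMap

// getOrElseUpdate replaces the explicit contains() check:
// it inserts the default 0 only when the key is missing.
object CoverageMap {
  def tally(chrs: Seq[String]): HashMap[String, Int] = {
    val cov = new HashMap[String, Int]
    for (chr <- chrs) {
      cov.getOrElseUpdate(chr, 0)
      cov(chr) += 1
    }
    cov
  }

  def main(args: Array[String]): Unit =
    println(tally(List("chr21", "chr21", "chr22")))
}
```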

CoverageCollector does most of the real work -- except for one last class it introduces (I'm starting to wish I had written this from bottom up rather than top down!), RangeCoverage. CoverageCollector keeps a stash of RangeCoverage to store the coverage at individual positions. The one, mostly impotent attempt at memory conservation is a method (thin) to consolidate runs of the same coverage level -- but only those safely outside the last region incremented (remember, the reads come in sorted order!). allElemSorted can deliver all the RangeCoverage objects in positional order and MergedSet delivers the same, but with consolidation of merged elements.

class CoverageCollector()
{
  val Coverage = new HashMap[Int, RangeCoverage]
  def getCoverage(st: Int, en: Int) =
  {
    for (pos <- st to en)
      yield { if (!Coverage.contains(pos)) Coverage(pos) = new RangeCoverage(pos)
              Coverage(pos) }
  }
  var lastThin: Int = 0
  def increment(read: AlignedPairedShortRead) =
  {
    val (st: Int, en: Int) = read.bounds
    for (coverCnt: RangeCoverage <- getCoverage(st, en)) coverCnt.increment
    lastThin += 1
    if (lastThin > 100000) thin(st - 500)
  }
  def thin(thinMax: Int)
  {
    // the candidates must be walked in positional order for the adjacency
    // test to work -- HashMap.values comes back unordered
    var prev = new RangeCoverage(-10)
    for (rc: RangeCoverage <- Sorting.stableSort(Coverage.values.toList.filter(p => p.St < thinMax)))
    {
      if (prev.mergable(rc))
      {
        prev.engulf(rc)
        Coverage -= rc.St
      }
      else prev = rc   // only advance prev when rc survives the merge
    }
    lastThin = 0
  }
  def allElemSorted: Array[RangeCoverage] =
    Sorting.stableSort(Coverage.values.toList)
  def MergedSet: List[RangeCoverage] = {
    val rcf = new RangeCoverageMerger()
    return allElemSorted.toList.filter(rcf.mergeFilter) }
}
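The consolidation in thin and MergedSet boils down to one idea: fold adjacent runs of equal coverage into a single range. Stripped of the classes above, it can be sketched on plain (start, end, coverage) tuples (my own distillation, not the post's code):

```scala
// Merge adjacent (start, end, coverage) triples when the next run starts
// right after the previous one ends and carries the same coverage.
object MergeRuns {
  def merge(rs: List[(Int, Int, Int)]): List[(Int, Int, Int)] =
    rs.foldLeft(List.empty[(Int, Int, Int)]) {
      // contiguous and equal coverage: extend the previous run
      case ((ps, pe, pc) :: tail, (s, e, c)) if pe + 1 == s && pc == c =>
        (ps, e, pc) :: tail
      case (acc, r) => r :: acc
    }.reverse

  def main(args: Array[String]): Unit =
    println(merge(List((1, 1, 2), (2, 2, 2), (3, 3, 5))))
}
```

Here (1,1,2) and (2,2,2) collapse to (1,2,2), while (3,3,5) stays separate because its coverage differs. The input must already be in positional order, just as in thin.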

Finally, the two last classes (okay, I claimed only one more before -- forgot about one). RangeCoverage extends the Ordered trait so it can be sorted. Traits are Scala's approach to multiple inheritance (MI). I played with MI in C++ and got my fingers singed by it; traits will require some more study, as it seems to be quite controversial whether they really are a good solution. I'll need to play some more before I can give anything resembling an informed opinion. RangeCoverageMerger is a little helper class to consolidate RangeCoverage objects which are adjacent and have the same coverage. I probably could have buried this in RangeCoverage with a little more cleverness, but I ran out of time & cleverness. One final language note: the "override" keyword is required whenever you override an underlying method -- though Scala lacks C++'s requirement to declare the parent method "virtual" to enable overriding (I think I have that all right).

class RangeCoverage(st: Int) extends Ordered[RangeCoverage] {
  var coverage: Int = 0
  var en: Int = st
  // sort by start position
  override def compare(that: RangeCoverage): Int = this.st compare that.st
  def increment = { coverage += 1 }
  def Coverage: Int = { return coverage }
  def St: Int = { return st }
  def En: Int = { return en }
  def BedGraph(Chr: String): String =
    { return Chr + "\t" + st.toString + "\t" + en.toString + "\t" + coverage.toString }
  def mergable(next: RangeCoverage): Boolean = (en + 1 == next.St && coverage == next.coverage)
  def engulf(next: RangeCoverage) = { en = next.en }
}
class RangeCoverageMerger
{
  var prev = new RangeCoverage(-1)
  def mergeFilter(rc: RangeCoverage): Boolean = {
    if (prev.St == -1 || !prev.mergable(rc)) { prev = rc; return true }
    prev.engulf(rc); return false
  }
}

So, if I've copied this all correctly and with the bit below, one should be able to run the whole code (if not, my profuse apologies). On my laptop, it can get through 20,000 lines of aligned SAM data (which I can't post, both due to space and because it's company data) and not explode, though 50K blows out the Java heap. A next step is to deal with this problem -- and to set the stage for multiprocessing.

Okay, one final really dull, but important bit. This actually goes at the top, as these are the import declarations. Dull, but critical -- and I always gripe about folks leaving them out.

import scala.io.Source
import scala.collection.mutable.HashMap
import scala.util.Sorting

Saturday, February 27, 2010

Last Day of Eavesdropping on Marco Island

Today was the last day of the Marco Island conference, so I won't be hammering Twitter again for quite a while. The afternoon session focused on emerging technologies.

Complete Genomics appears to have dispelled the skepticism they had been met with last year. It certainly helped that two customers presented data (Anthony Fejes' notes on CG workshop). Apparently they hinted at some additional technological improvements coming down the pike to get even more data out.

Life Technologies presented on their single molecule system, which they hope to get to early access customers by the end of the year. It's a single molecule system with many similarities to Pacific Biosciences. One interesting twist is that they can add new polymerase when the old ones die, so in theory they can keep sequencing to extremely long lengths. This could be a huge plus for the system in de novo and metagenomic settings.

One other neat PacBio tidbit, thanks to Dan Koboldt, is that the polymerase reaction rates are so uniform that fragments can be sized (and therefore structural variants detected) by the time required to go from end-to-end.

Ion Torrent presented and apparently was received well, though the amount of detail available remotely is still frustratingly thin. A lot of key questions I have don't seem to have been answered, which I'm guessing is due to limited information in their presentation (though one can't rule out blogging fatigue hitting my sources). It also isn't helping that Twitter seems to be experiencing difficulty, perhaps because of the traffic due to the natural catastrophe in Chile & curiosity about tsunamis in the Pacific.

Ion Torrent's general scheme is to trap DNA (single molecules or clusters?) in wells in a micromachined plate (much like 454, though apparently no beads) and detect the release of a proton each time a nucleotide is incorporated. Detection is via a proprietary semiconductor detector built into the bottom of each well.

It isn't clear, for example, whether each of the micromachined wells in the system is watching a single DNA molecule or some sort of cluster of molecules. If the latter, what is the amplification scheme? The run times described seem incompatible with amplification.

How much sample goes in? What preparation is needed upstream? What sort of tagging is needed? Can, for example, the Ion Torrent machine be used to resequence (or QC) libraries from the other systems? Does the sample need to be linear, or can you sequence plasmids directly (I doubt it, due to supercoiling, but it's worth asking).

Ion Torrent is making several bold assertions. One is "The Chip is the Machine", which decodes to the fact that the chips (now seen on the website) determine the key performance attributes of the system; the box (reputedly $50K) is simply interface, data collection and reagent fluidics. Another bold claim is that the chips can be fabricated in any CMOS fab in the world. Of course, that presumably leaves out the specialized microfluidic setup on top. Still, that is an impressive supplier base.

Somewhere I saw a throughput of 160Mb per 1 hr experiment for $500 in consumables. The Ion Torrent website's video hints that part of their business model will be selling chips of different densities for different applications. One nice feature of the consumables is that they should be just standard polymerases and unlabeled nucleotides. Of course, there could easily be some magic buffer components, but one part of the cost of many of the other systems is the need for either labeled nucleotides (everybody but 454) or complicated enzyme cocktails (454). Furthermore, it is the presence of unlabeled nucleotides in the reagents that is a major contributor to loss-of-phase in clonal systems and probably to "dark bases" in single molecule systems. Simple reagents should translate to low costs, and perhaps to high reliability and long reads.

How long? That's another key attribute I haven't seen. Again, knowing whether this is a single molecule system (in which case what would kill reads?) or clonal (with the dephasing problem) would be informative. How many reads per run? For some applications, getting lots of reads is more important than long reads -- and of course for others length is really important.

Error rates or modes? I haven't seen anything beyond an apparent bulletpoint that Ion Torrent sequenced E.coli (in a single run?) to 13X coverage, 99+% of genome in assembly and 99.9+% accuracy. Supposedly homopolymeric runs can be read out, but how accurately? Is there a length beyond which things get confusing?

One more neat aspect of the Ion Torrent system: no images. Sure, the traces from each pH run (the world's smallest pH meters, according to the website) should be much more compact, but not nothing -- though it is implied that the signal is sharp enough that there is no need to store them. Hence, unlike all the other systems there's no need for beefy on-board computers and no headache of storing enormous numbers of high resolution images.

A final thought: $500 per 1 hour run is attractive, but if you really kept one instrument going, quite a tab would run up. Suppose one got in 10 runs in a day (does it have any autoloading capability?) -- that's $5K/day. Keeping that up over the roughly 200 business days in a year comes to $1M in chips -- something Ion Torrent and their backers are licking their lips over, but which will have to be faced by those who get the machines. Of course, you don't have to run the system constantly (and 10 runs a day is hardly constant!) -- but if I had one, I'd certainly want to!

Friday, February 26, 2010

PacBio's big splash


The Pacific Biosciences instrument is officially unveiled now, with those lucky/smart (or SMRT?) enough to go to Marco Island filling in all of us not in that position. Sounds like a great lot of hoopla, though they didn't drag the Hornet for the splashdown.

First of all, it's a beast. "In this corner, weighing in at nearly an imperial ton...". Too bad their marketing picture has nothing good for judging the scale -- it's apparently 6.5 feet wide.

Kevin Davies at Bio-IT World has a wonderfully detailed article and there are a lot of nuggets in the Twitter feed. Anthony Fejes has two different sets of notes out -- one from a workshop and one from another speaker; Dan Koboldt has some good notes too (and if I haven't shouted out your notes, it's probably because I'm oblivious -- leave me a comment pointing to them). There was also a little bit of PacBio science in Elaine Mardis' talk (she's on their SAB) -- see Anthony's notes & the twitter feed.

Okay, besides worrying about the capacity of floors & freight elevators, what's new? Well, not much on error rates from PacBio (apparently in the Q&A their presenter executed a jig, tango, waltz & rumba when asked) -- though the Mardis talk described resequencing, on PacBio, samples that had previously been done by Illumina -- and the results are quite good. Another important note is that their system doesn't seem to have much compositional bias -- bias against high/low %GC has been noted in all of the amplification-based systems and can be a serious problem.

There's also a lot of talk about being able to distinguish various modified bases by their effects on polymerase kinetics. PacBio has also demonstrated direct RNA sequencing (substituting a reverse transcriptase for DNA polymerase) and is talking about watching proteins being made. I haven't quite figured out why you'd want to do that last one, but presumably it's for more than a cool Nature cover.

Read lengths decay exponentially -- but with lots around 1Kb and quite a few around 5K. The big problem is apparently oxidative damage to the polymerase triggered by the laser -- so they are working on both getting the oxygen out of the system and engineering hardier polymerases (the sort of biz I used to be in). Their strobe sequencing mode -- in which the laser is turned off to enable elongation in safe darkness -- enables multiple reads separated by long gaps.

The instrument definitely raises the bar on sample prep -- it's apparently entirely automated within the monster. YEAH! A machine I can delude myself into thinking I could run! One drawer takes the SMRT cells and another the DNA samples -- 500 ng each. That doesn't sound like much (it's at least better than the 5-10ug most library prep protocols call for -- except the ones looking for 20-30ug), but it seems you don't get a lot from each sample.

The number of reads per cell isn't huge -- but you're still getting about 2 E.coli genome equivalents by my calculation. This is a bit undersized for a lot of applications -- but grand from many others. Mardis' talk discussed using PacBio for sequencing PCR amplified resequencing samples -- this would appear to be right in the PacBio sweet spot. Perhaps a few hundred long PCR products could be packed into one SMRT run and still get many hundreds of reads per sample -- well, maybe pack fewer amplicons.

What might be other good uses? Clearly metagenomics and similar. I just saw a posting on a professional board of someone pondering multiplexing hundreds of samples for an Illumina run (the current barcode schemes are for a few orders of magnitude fewer samples). Blitzing each sample through the PacBio instrument would seem to be obvious -- if the error rates are acceptable. Folks doing whole genome sequencing of small genomes will love having PacBio to generate scaffolds. For bigger genomes, it may just still be too expensive to get much coverage ($100 a SMRT cell sounds cheap, until you start multiplying that out for the numbers you need) -- but perhaps not (much too fried to do that calculation at the moment).

RNA-Seq might be a bit trickier. If you need 500ng of input material, that's an awful lot of ribosome-depleted or poly-A RNA. Plus, you'd get only tens of thousands of reads, making it hard to see lowly-expressed messages -- though very long reads might be priceless. But, if you can get tons of RNA, then 100 SMRT cells would be about $10K and offer similar depth to what you can get today with Illumina, but with those super long reads.

Now, who is this going to crimp the most? The instrument is clearly a ways from really threatening Illumina & SOLiD for the large genome market. 454 is a likely candidate to see growth pressured -- though between 454's new lower-priced "junior", PacBio's $700K price tag, and PacBio's inability to flood the market with instruments, this will be ameliorated.

PacBio might have almost as much effect on the surrounding sequencing ecosystem. Making library prep reagents for this system is not going to make you lots of money! But, there will be a serious niche for targeted sequencing -- though with the scale it will probably require some rethought. Stuffing the whole exome into this doesn't really make sense -- if there are ~250K segments of the genome to read & you want 40X coverage of each, that's a lot of SMRT cells. But, intelligently chosen gene sets totaling about 500 regions (or around 20-50 genes) with pre-validated reagents -- now that might be a market (though one which might have 1-2 years of life -- better get cracking!). Simpler library prep will also go nicely with some of the enrichment systems -- a bugaboo of hybridization systems can be "daisy-chaining" of fragments via the amplification adapters -- but, on the other hand you don't get 500ng off an array or in-solution system without amplification. As with many disruptive technologies, it won't fit a lot of bills but will nibble off various parts of the business that are individually small but significant in aggregate. As noted above, RNA-Seq might be an initial success story for PacBio -- when RNA is abundant.

IMHO, PacBio does need to get some papers out on applications (Mardis' group apparently is close to having one) and make sure that the next tranche of installations not only includes the Sanger & BGI, but that there are also some core labs or commercial providers. Also, they need to start pumping data into the public domain -- while they signed a bunch of commercial software providers up, it is definitely out of academia that you find the most radical advances. Plus, there are a lot of now well-entrenched open source tools that need to be tested with the new kid. Even simple things like the semi-standard SAM/BAM format are going to need tweaking -- SAM/BAM stores all sorts of information on read pairs, and the strobe sequencing can generate many more than 2 tags per DNA fragment.

Of course, we have to wait another half day plus to find out what Ion Torrent is really delivering. That could really shake up the landscape -- at least the mental one.

A huge thanks to all the bloggers & twitterers for pouring out so much information. I'm still getting used to scanning past the retweets (is there a way to condense them?) and there is the occasional shock-to-the-system (how could anyone in the field not have heard of Rodger Staden?!?), but that's a tiny price to pay for such fascinating stuff.

Thursday, February 25, 2010

Personalized Annoyance of Research Enthusiast (PARE)

Last night I finally got my paws on a paper which started out on a frustrating tack. Last week, a flurry of news items heralded a new approach from Vogelstein's group at Johns Hopkins that involved second generation sequencing of patient tumor samples. But, the early reports claimed it had been published in Science Translational Medicine, whereas it most certainly wasn't there except a suggestive teaser about the next week's issue. I thought perhaps someone had really blown it and ignored an embargo, but then it turned out the AAAS meeting is going on and the work was presented there. Few things more irritating than a paper being bandied about that I can't get my eyes on! Plus, I have a manuscript due next week that this might be relevant to, so the desire to get a copy was intense!

Yesterday, it really did come out. You'll need a subscription to read it -- though that is only $50 for online access if you already have a Science personal subscription. The gist of the paper showed up in the reports. Using SOLiD, they sequenced cancer genomes to around 1X coverage with 1.5Kb mate-paired libraries and 25 bp reads. For copy number analysis they also used single end reads. The key point is to identify rearrangements using the mate-paired fragments.

Now, many papers have looked at rearrangements in cancer using mate paired or paired end strategies. What sets this paper apart is doing something with it: turning these into patient specific tumor markers (an approach they call PARE for personalized analysis of rearranged ends). Because rearrangements are specific to the tumor and not at all like what is in the patient's normal DNA, they make great PCR amplicons for finding the tumor. Indeed, they were able to detect tumor DNA in blood with their assays.

This is an example of second generation sequencing getting very close to the clinic. But what will it take to get it there? Many of the news items claimed the cost might soon be down around $3K. Now, to do this properly you really need to either do the sequencing on both normal and tumor DNA or make a bunch of assays and expect some to be duds. Why? Because some of these structural changes will either be alignment noise or private germline structural variants. They do use copy-number analysis to filter the list -- many tumor rearrangements will be associated with local copy number amplification. But more importantly, the cost numbers sound suspiciously like reagent-only cost, not fully-loaded. Fully loaded costs include the ~$1.5M sequencing center (SOLiD + prep gear + compute farm), real estate & salaries. These could easily double or triple that cost, though someone who actually owns a green eyeshade should figure that out for sure.

The paper talks a little bit about the risk that as a tumor evolves one of these markers might be lost. This is particularly the case here because, unlike many papers, they really aren't worried if the rearrangement is driving the tumor. It's a handy landmark, though you would find driving rearrangements with it too. But, one particular worry is that a given rearrangement might not be in the dominant clone or a clone which treatment selects for survival. So having multiple markers will be a useful protection -- though that will up costs.

But back to irritating: a key value left out of this paper (and unfortunately most such papers) is the amount of input DNA for sequencing. Many of these sorts of protocols start with 5-10 micrograms of DNA, though some mate-pair schemes call for 5 to 10 times that. For some tumor types, that's a king's ransom -- particularly for recurrent tumors or inoperable ones. Even beyond that, large scale application of this approach will require automating the library construction process end-to-end.

It's also worth noting that this is an application where absolute speed isn't critical. For generating a marker to be used for long-term following of the tumor, needing two weeks for SOLiD library prep & assembly and another few weeks to develop the PCR assays won't be a major roadblock. But, any sequencing-based approach used to determine treatment strategy needs to turn around results in not much more than 1-2 days. That's a high hurdle, and a wide open spot for fast sequencing technologies such as 454, PacBio, nanopores & Ion Torrent.

This is also an approach where someone with a long but noisy sequencing technology should take a hard look. Calling rearrangements with very long reads shouldn't require nearly the level of accuracy as calling point mutations.

Leary, R., Kinde, I., Diehl, F., Schmidt, K., Clouser, C., Duncan, C., Antipova, A., Lee, C., McKernan, K., De La Vega, F., Kinzler, K., Vogelstein, B., Diaz, L., & Velculescu, V. (2010). Development of Personalized Tumor Biomarkers Using Massively Parallel Sequencing. Science Translational Medicine, 2(20). DOI: 10.1126/scitranslmed.3000702

Wednesday, February 24, 2010

Marco Island is HOT!

The Marco Island Advances in Genome Biology and Technology, or AGBT (or just Marco Island) conference started up today. Whatever weather they're having is better than the cold rain that soaked my commute.

A sure sign a conference is hot is that there are lots of announcements prior to the conference that could be at the conference. So, we've been treated to lots of announcements from established players (such as Illumina and ABI) and new entrants -- Pacific Biosciences has announced that they will launch their system there and has already been lining up sample prep & informatics partners and announcing their early access sites. PacBio has also started making noise about a follow-on instrument that will be for clinical apps -- launched in 2014!! Puh-leeze, that is the inconceivable future!

ABI had a new announcement today -- their own baby SOLiD (officially the P1) to come out later this year, joining the previously announced 454 junior and Illumina IIe. The claim of "cost per sample as low as $200" is an eyebrow raiser -- I'm guessing that is for a highly multiplexed sample mix. List price at $230K and 50Gbases per run is the claim.

Ion Torrent has been in a very noisy stealth mode -- founder Jonathan Rothberg gave a huge tease of a talk at the Providence meeting that ended just before giving anything specific. Of course, given that he launched 454, he gets a little slack in the hype department, as he has delivered. BioIT World has a very nice writeup (which editor Kevin Davies was kind enough to point out to me about 2 weeks ago -- a sign of sloth on my part that I haven't mentioned it earlier). They didn't exactly succeed in peeling off the layers of secrecy, but it is far more detail than I've seen anywhere else (and in line with the few rumors I did hear). Rothberg will be giving the final talk at AGBT, and was expected to actually reveal some details. The general buzz is 454-style chemistry but with electronic -- not optical -- detection.

So it dropped my jaw through the floor today when Ion Torrent announced that they will be launching in April, starting with the gifting of two systems through a grant competition. I'd figured from what I heard & from the general pattern with AGBT that this year would bring the wraps off but nothing would be operating for another year or two (some previous AGBT announcees have seemingly faded to oblivion).

Of course, the devil is in the execution. Will they actually be able to deliver working systems? What will the reagent costs run? How reliable will the instruments be? And what will the performance profile look like -- run lengths, error rates & error modes and input DNA amounts & preparation.

Hold onto your seats -- and watch the #AGBT Twitter feed! Things will continue to get interesting.

Friday, February 19, 2010

To Stockholm via Ph.D. Thesis

The Scientist has a profile of Aaron Ciechanover, who shared the Nobel Prize for work on the proteasome. His Nobel-cited work began in his Ph.D. thesis.

In one of the physics books I was recently reading (I forget which one now, might have been How to Teach Physics to Your Dog, but I think it was Six Easy Pieces) it was mentioned that Louis de Broglie's committee wasn't sure what to do with his crazy proposal that everything has both particle and wave natures, but after consulting with Einstein awarded him his degree. Of course, this proposal withstood experimental test and led to a Nobel.

Anyone know other examples of Nobels which cite the laureate's thesis work?

Thursday, February 18, 2010

Non-benign genetic carrier status

Earlier this week the Wall Street Journal carried an article addressing the growing interest in finding health issues related to being a carrier of a recessive genetic disease.

Three diseases were discussed in some detail. Sickle-cell anemia is generally thought of as being very harmful when homozygous but essentially benign when heterozygous. But it has been known for a while that heterozygotes (called sickle trait) can experience red blood cell sickling (and the accompanying pain and tissue damage) under low oxygen tension. The WSJ article points out that such sickling can also occur during strenuous physical exercise; the NCAA even has specific guidelines for extra rest for sickle cell heterozygotes.

An emerging story mentioned in the article is the risks of being a carrier for fragile X, an X-linked disorder which can severely impede mental development. Fragile X is a nucleotide triplet repeat expansion disease, meaning that some males who have a disease allele will have mild or no symptoms but can transmit a more severe form of the disease. Male carriers of these alleles can develop severe neurodegeneration late in life, a condition called FXTAS. Female carriers appear to be at greater risk for anxiety and depression as well as premature ovarian failure.

Other examples mentioned are a greater risk of Parkinson's in Gaucher's disease carriers (at about 5-fold greater risk than the general population) and increased risks of chronic sinus disease and asthma in cystic fibrosis carriers.

Touched on in the article is the fact that many carriers are completely unaware of their status. Most testing is done if someone is (a) aware of the disease in the family and (b) considering having children. Even then, not everyone is tested. Many of these diseases are rare enough that many carriers could be unaware of the disease being present in the family -- if it never happened to manifest or be correctly diagnosed. Others may simply not know their true parentage.

I don't know for certain, but I doubt many of these disease-causing mutations are in the tests used by most personal genetic profiling companies, other than the emerging ones focused on reproductive counseling. The availability of cheap whole genome sequencing in the near future could lead to huge numbers of people discovering these genetic issues. But for most recessive diseases we simply do not know whether carrier status has any negative effects. Much more research will be needed to tease out additional issues.

Wednesday, February 17, 2010

Anybody know some good bioinformatic programming problems?

I recently found out that I've received a summer undergraduate intern slot. I have a soft spot for summer internships -- my own was a great experience -- and the company runs a very nice program, with specific social and learning experiences for the cadre. Anyone interested in applying should do so through the company website (and not here!). I do promise not to fill this space with "can you believe what the intern did today?!?!?", though executing "sudo rm -r /" might earn a slot!

I'm still trying to sketch out a grand scheme for the internship. But it will certainly combine a certain amount of data analysis with a certain amount of programming. One person I've phone-screened has already asked for suggestions of programming problems to practice on -- a great show of initiative, which I like but for which I discovered I wasn't really prepared.

The challenge for me is to rewind my brain back to an early stage and remember what makes a good -- but doable -- problem. In my head, everything either seems too trivial or potentially discouragingly difficult. So, I'd be very interested in examples of programming challenges given to early programmers with a significant bioinformatics angle -- no bubble sorts or games of Wumpus!

I did find a couple of links with some examples: one from MIT and another from Duke (these links are really a level above). I'd love to find other examples -- and mostly don't care about the language used in the examples. I'm probably going to nudge my intern towards Java/Scala (leveraging BioJava as much as possible), perhaps if only to encourage me to put some more time in on my own retraining project.
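To give a flavor of the scale I have in mind, a warm-up exercise might be something like computing the GC content of a DNA sequence -- small enough to finish in an afternoon, but it touches string handling, counting, and a biologically meaningful statistic. A minimal sketch in Scala (my own hypothetical starter problem, not drawn from either of the linked problem sets):

```scala
object GcContent {
  // GC content: the fraction of G and C bases in a DNA sequence.
  // A classic first bioinformatics exercise -- tiny, but meaningful.
  def gcContent(seq: String): Double = {
    val valid = Set('A', 'C', 'G', 'T')
    val bases = seq.toUpperCase.filter(valid)   // ignore Ns, gaps, etc.
    if (bases.isEmpty) 0.0
    else bases.count(c => c == 'G' || c == 'C').toDouble / bases.length
  }

  def main(args: Array[String]): Unit = {
    // ATGCGCAT: 4 of 8 bases are G or C
    println(f"${gcContent("ATGCGCAT") * 100}%.1f%% GC")  // prints 50.0% GC
  }
}
```

From there the same skeleton extends naturally: read sequences from a FASTA file, compute GC in sliding windows, plot the results -- each step a small, satisfying increment.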

So, any suggestions?

Tuesday, February 16, 2010

More Than One Way to Skin a Kumquat

My recent piece on citrus seems to have struck a chord, based on the multiple comments and the fact that GenomeWeb's blog picked up on it as well. That's all very gratifying, but it also stirred me to notice what I had missed on the subject. No, not the obvious point that getting some genome sequences is just a tiny first step to my grand bioengineering dream. And not what the TIGS review pointed out, that American markets in particular have tended to favor uniformity over quality or novelty (though perhaps that is changing, at least in high-end markets). Nope, what bugs me now is missing the obvious about kumquats.

Now, as I mentioned, they're hard to find. I checked some mail order places and the seasons apparently vary depending on where they are grown. But at Christmas time I could find only one market -- and I checked a half dozen -- which was selling them. Even the same high end chain that sold them next to work didn't offer them at the outlet nearest my home. So, don't be embarrassed if you've never tried one.

The first beauty of a kumquat is you eat the whole thing -- skin and all (it is advisable to spit the pip). But the second beauty is within that -- the two very different tastes. The skin is thin but very sweet, whereas the flesh is tart.

Natural kumquats therefore are a binary package, and an unusual one. I'm trying to think of other fruits eaten skin-on for which the skin has a distinctive and pleasant taste. I eat lots of fruit skins, but most are just texture & roughage as far as I can tell. Concord grapes are an obvious exception -- I'll confess to swiping them from my neighbor's trellis growing up. The pulp is green and much like a seedless green grape in taste (but decidedly NOT seedless!) whereas the skin has the delicious Concord-ness to it. Many other grapes are probably similar. Certainly the winemakers use skin-in or no skin as a point of control over taste.

Now, with all citrus the skin and pulp can have very different aromas. Orange zest adds a distinctive flavor which is different than adding orange juice to a recipe. With a bit of genetic sleuthing (GFP limes?), the promoters responsible for specific production in skin and flesh can be worked out. And then the engineering can get another dimension -- different tastes in kumquat skin and pulp.

Clearly what I have in mind is a lot of genetically engineered fruit, which I will be happy to taste. GMO foods have not met much acceptance, but as some have pointed out before a significant issue is that most engineered traits have been to benefit producers (pest / pesticide resistance) with no benefit to the consumer beyond price. Early attempts at longer shelf life tomatoes and carrots flopped, but that's still more of a benefit for the producer than the consumer. Nutritionally-augmented foods (e.g. "golden rice") address nutritional needs which Western activists don't face.

Present something really novel and exciting in terms of flavor experience, and then you'll see a real separation of those who are truly committed to a no-GMO purity and those who can be tempted away. Furthermore, simply rewiring existing citrus biosynthetic pathways would dodge some of the other arguments raised against GMOs, in terms of introducing allergens or such.

It is a bit optimistic to think I'll ever see a line of flavor-augmented mix-and-match kumquats. But if anyone starts making some, I'll be happy to volunteer for the taste testing squad.

Friday, February 12, 2010

Celebrating Citrus


I've been on a citrus kick at work lately, trying out different varieties I picked up at one of the adjacent grocery stores (curiously, we're sandwiched between 2). When I was growing up I think I knew only seeded oranges, navel oranges, tangerines, tangelos, grapefruit, lemons and limes. Through some combination of better awareness and better availability, there's a lot more I can find. I gained some notoriety this week by bringing a pummelo to a breakfast meeting; if you haven't seen one, they make grapefruit look small. Tastewise, it's a bit milder and a bit sweeter than a grapefruit.

A lot of this is seasonal, as I've been finding. Kumquats are a nearly perfect desk snack -- completely neat except for the need to spit the pips -- but seem to be available only around the New Year, and then only in a few stores. Amongst the treats currently available are clementines (almost as good a desk snack as kumquats, though you do need to peel them) and some wonderful non-orange oranges. Cara cara oranges turn out to be delightfully pink on the inside, whereas blood oranges are precisely named -- after one knife slip I found myself searching my skin in vain for the source of the red spots on the table.

What I can find in the store is still just a tiny sample of all the citrus known to exist. My brother sent me a New Yorker article on professional flavorists which mentioned many more, including pummelos with quite foul-smelling rinds but yet another delicious flavor of pulp.

Just after the turn of the century, a very good review of citrus genetics was published in Trends in Genetics. The molecular story appears to point to all this wonderful diversity originating from three wild species. Amazing! It gets stranger when you delve into the reproductive biology of citrus -- not only can they be propagated sexually or by cuttings, but they are quite adept at apomixis, the development of a new individual from an unfertilized ovum.

There is, of course, a citrus genome project, hosted at the JGI. And, perhaps predictably, I'm a bit impatient for its completion. As the rationale page explains (and from which I've stolen the wonderful image of citrus diversity above), the sweet orange genome is only 382 Mb. When the new HiSeq and SOLiD instruments come on-line, they could sequence several individuals at 40X coverage in one run.
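To make "several individuals" concrete, the arithmetic is simple: a 382 Mb genome at 40X coverage needs roughly 15 Gb of raw sequence per individual. A quick Scala sketch of that back-of-envelope math (the 200 Gb per-run figure below is an illustrative assumption, not a quoted instrument spec):

```scala
object CoverageMath {
  // Back-of-envelope throughput math: how many genomes at a given
  // coverage depth fit in a single sequencing run?
  def genomesPerRun(genomeMb: Double, coverage: Double, runGb: Double): Int = {
    val basesNeeded = genomeMb * 1e6 * coverage   // bases required per individual
    (runGb * 1e9 / basesNeeded).toInt             // whole genomes per run
  }

  def main(args: Array[String]): Unit = {
    // Sweet orange: 382 Mb at 40X, assuming an illustrative ~200 Gb run
    println(CoverageMath.genomesPerRun(382, 40, 200))  // prints 13
  }
}
```

Even allowing for duplicates, poor-quality reads and the usual overheads, the conclusion holds: resequencing a panel of citrus varieties is no longer a heroic undertaking.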

What I'd really like is to have the genetic blueprint for all those wonderful flavors and colors in order to repackage them. Keeping in mind, of course, that some useful characteristics (such as seedlessness) aren't simple traits but products of karyotype (I would love a fully seedless kumquat!). Imagine if you could have a whole series of clementine-like fruits, with the size & easy-peeling characteristics but with the whole range of other citrus flavors and colors genetically grafted in -- cara cara clementines and blood clementines and ruby red clementines and perhaps even sweet lemontines and key clemenlimes. What wonderfully healthy snacking that would be!

Thursday, February 04, 2010

Disagreeing to Disagree

A year ago (almost exactly) I wrote an entry taking to task a paper analyzing protein kinases in the draft chimpanzee genome. After writing that entry, I felt it proper to leave a comment at the journal (BMC Genomics). Instead of publishing the comment, the editor invited me to formalize my criticisms and perhaps give positive suggestions of how to do such an analysis. Between Codon dissolving, my interlude of consulting, and starting at Infinity, this got pushed to late May; it then went out for review and a round of revision, after which the original authors were invited to write a rebuttal. By the time their rebuttal came back (late fall), I decided I was getting a bit worn on the whole thing and just tweaked my submission to underline a few things rather than go hammer-and-tongs for a counter-rebuttal.

Anyhow, my criticism and the authors' response are now up on the BMC Genomics website & indexed in Medline (hooray!). I won't go hammer-and-tongs here either. But whereas sometimes two parties in an argument agree to disagree, having established consensus on what they are arguing about, I would characterize this exchange with this post's title: the authors pretty much argue that all of my points are based on misunderstandings and misinterpretations of what they wrote. I, of course, don't agree on that point.

One key point is that they argue they did the best with the dataset they chose to use as a source, while I argue that they should have been more skeptical of that source. In the end, I believe many of their unusual results will go away if the chimp genome is finished or if the chimp mRNAs under dispute are re-examined.

So, this ends up as another project for my "Proposal to sequence genomes KR thinks deserve sequencing" grant. Ha! It would be fun to have a slush fund to pursue these sorts of things. $10K and access to chimp poly-A RNA is all this problem would need. But, I'm not independently wealthy so it will remain a pipe dream. Of course, there is a bonobo sequencing project and that would be somewhat useful. But if you go looking for such, make sure you google "Bonobo ensembl" not "Bonobo ensemble" as my smartphone helped me do -- you get some bizarre links but nothing to do with great ape sequencing.