Friday, January 29, 2010

Whither science museums?

Last week at this time I was surviving a terrible electrical storm -- tremendous cracks and crackles all around me. Luckily, it was just the lightning show at the Boston Museum of Science, featuring the world's largest Van de Graaff generator, a pair of huge Tesla coils and other sparking whatnots. TNG was part of a huge overnight group.

I love the MoS and always enjoy a trip there. But it isn't hard to go there and wonder what the future of such museums is and where their current management is taking them.

Exhibit #1: The big event at the MoS right now is the traveling Harry Potter show. We of course got tickets and added on the audio tour. It's great fun, especially if you enjoyed the movies (it is mostly movie props & costumes), but it makes absolutely no pretensions of having even a veneer of science. I've seen movie-centric exhibits at science museums before, and they usually try to at least portray how movie technology works. None of that here, other than an occasional remark on the audio tour. Clearly, this is a money maker first and a big draw. But do such exhibits represent a dangerous distraction from the mission of a science museum? Would such a show be more appropriate in an art museum (it apparently was in the art museum in Chicago)?

Another exhibit could also be described as more art than science -- but is that necessarily a bad thing? It was an exhibit of paintings by an artist who attempts to portray large numbers in a way that makes them comprehensible. The exhibit was sparse -- a small number of paintings in a large space -- and they really didn't do much for me. Many had a pointillist style, the dots summing to whatever the number was. One even aped Sunday in the Park. But were they really effective?

Art and science are overlapping domains, so please don't dismiss me as a philistine trying to keep art out. But I still think there are better ways to host art in a science museum. An in-depth exhibit on detecting forgeries or verifying provenance would be an obvious example. The exhibit on optical illusions features a number of artworks which can be viewed in multiple ways.

We actually slept in an exhibit called "Science in the Park", which has a number of interactive stations illustrating basic physics concepts. It is certainly popular with the kids, but sometimes you wonder if they are actually extracting anything from it. Could such an exhibit be too fun? One example is a pair of swings of different lengths, meant to illustrate the properties of pendulums. I saw a lot of swinging, but rarely would a child try both. The exhibit also illustrated a serious challenge with interactive exhibits: durability. The section illustrating angular momentum was fatally crippled by worn-out bearings -- the turntable on which to spin oneself could barely manage two rotations. Two exhibits using trolleys looked pretty robust -- but then again, some kids were slamming them along the track with all their might.

The lightning show is impressive. I think most kids got a good idea of the skin effect (allowing the operator to sit in a cage while being tremendously zapped by the monster Van de Graaff). But how much did they take away? How much can we expect?

Some of the exhibits at the museum predate me by a decade or more. There are some truly ancient animal dioramas that don't seem to get much attention. Another exhibit that is quite old -- but has worn well -- is the Mathematica exhibit. There were only two obvious updates (a computer-generated fractal mountain and an addendum to the wall of famous mathematicians). It has a few interactive items, and most were working (alas, the Moebius strip traverser was stuck).

One of the treats on Saturday morning was a movie on the Omnimax screen, a documentary on healthy and sick coral reefs in the South Pacific. It's an amazing film, and it illustrates one way to really pack a punch with images. While the shots of dying and dead reefs are sobering, to me the most stunning image was of a river junction in Fiji. One river's deep brown (loaded with silt eroded from upstream logging) flowed into another's deep blue. I grew up on National Geographic Cousteau specials, and could also appreciate the drama of one of the filmmakers' grim brush with the bends.

One last thought: late last fall I stumbled on a very intriguing exhibit, though it is unlikely to be the destination of any school field trip. It's the lobby of the Broad Institute, and they have a set of displays aimed at the street outside which both explain some of the high-throughput science methods being used and show data coming off the instruments in real time (some cell phone snapshots below). The Broad is just over a mile from the MoS and there's probably a really fat pipe between them -- it would be great to see these exhibits replicated where large crowds might see them and perhaps be inspired.






(01 Feb 2010 -- fixed stupid typo in title & added question mark)

Thursday, January 28, 2010

A little more Scala

I can't believe how thrilled I was to get a run-time error today! Because that was the first sign I had gotten past the Scala roadblock I mentioned in my previous post. It would have been nicer for the case to just work, but apparently my SAM file was incomplete or corrupt. But, moments later it ran correctly on a BAM file. For better or worse, I deserve nearly no credit for this step forward -- Mr. Google found me a key code example.

The problem I faced is that I have a Java class (from the Picard library for reading alignment data in SAM/BAM format). To get each record, an iterator is provided. But my first few attempts to guess the syntax just didn't work, so it was off to Google.

My first working version is

package hello

// BioJava imports left over from the ABI trace example; only File and
// SAMFileReader are actually used here.
import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools
import net.sf.samtools.SAMFileReader

object HelloWorld extends Application {

  val samFile = new File("C:/workspace/short-reads/aln.se.2.sorted.bam")
  val inputSam = new SAMFileReader(samFile)
  var counter = 0

  // Plain Java-style iteration over the SAM/BAM records
  val recs = inputSam.iterator
  while (recs.hasNext) {
    val samRec = recs.next
    counter = counter + 1
  }

  println("records: " + counter)
}

Ah, sweet success. But while that's a step forward, it doesn't really play with anything novel that Scala lends me. The example I found this in was actually implementing something richer, which I then borrowed (same imports as before).

First, I define a class which wraps an iterator and defines a foreach method:

class IteratorWrapper[A](iter: java.util.Iterator[A]) {
  // Gives a Java iterator the foreach method that Scala's for-comprehensions expect
  def foreach(f: A => Unit): Unit = {
    while (iter.hasNext) {
      f(iter.next)
    }
  }
}

Second is the definition, within the body of my object, of a rule which allows iterators to be automatically converted to my wrapper object -- an implicit conversion. Now, this sounds powerfully dangerous (and vice versa). A key constraint is that Scala won't do this if there is any ambiguity -- if there are multiple legal choices of what to promote to, it won't work. Finally, I rewrite the loop using the foreach construct.

object HelloWorld extends Application {
  // The implicit conversion: wherever a java.util.Iterator appears where an
  // IteratorWrapper would fit (e.g. in a for-comprehension), wrap it automatically
  implicit def iteratorToWrapper[T](iter: java.util.Iterator[T]): IteratorWrapper[T] =
    new IteratorWrapper[T](iter)

  val samFile = new File("C:/workspace/short-reads/aln.se.2.sorted.bam")
  val inputSam = new SAMFileReader(samFile)
  var counter = 0

  // The for-comprehension now works directly off the Picard iterator
  val recs = inputSam.iterator
  for (samRec <- recs) {
    counter = counter + 1
  }

  println("records: " + counter)
}

Is this really better? Well, I think so -- for me. The code is terse but still clear. It also saves a lot of looking up of standard but wordy idioms -- for some reason I never quite locked in the standard read-lines-one-at-a-time loop in C# and always had to copy an example.
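
In Scala, that particular idiom is short enough that there is nothing to memorize. A minimal sketch (the file name is just a placeholder):

import scala.io.Source

object LineCount extends Application {
  var n = 0
  // Source.getLines hands back an iterator of lines, ready for a for-comprehension
  for (line <- Source.fromFile("C:/somedir/somefile.txt").getLines) n += 1
  println("lines: " + n)
}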

You can take some of this a bit far in Scala -- the syntax allows a lot of flexibility and some of the examples in the O'Reilly book are almost scary. I probably once would have been inspired to write my own domain specific language within Scala, but for now I'll pass.

Am I taking a performance hit with this? Good question -- I'm sort of trusting that the Scala compiler is smart enough to treat this all as syntactic sugar, but for most of what I do performance is well behind readability and ease of coding & maintenance. Well, until the code becomes painfully slow.
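
If I ever want more than trust, a crude timing comparison on my own data would settle it. A rough sketch (it reuses the IteratorWrapper class and implicit conversion defined above; the list size is arbitrary and there is no proper JIT warm-up, so treat any numbers with suspicion):

import java.util.ArrayList

object LoopTiming extends Application {
  implicit def iteratorToWrapper[T](iter: java.util.Iterator[T]): IteratorWrapper[T] =
    new IteratorWrapper[T](iter)

  // A throwaway Java collection to iterate over
  val list = new ArrayList[String]
  for (i <- 1 to 1000000) list.add(i.toString)

  var counter = 0

  // Plain while loop over the Java iterator
  val t0 = System.currentTimeMillis
  val it = list.iterator
  while (it.hasNext) { it.next; counter += 1 }
  val t1 = System.currentTimeMillis

  // The same traversal via the implicit conversion and foreach
  for (s <- list.iterator) counter += 1
  val t2 = System.currentTimeMillis

  println("while: " + (t1 - t0) + " ms, foreach via implicit: " + (t2 - t1) + " ms")
}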

I don't have them in front of me, but I can think of examples from back at Codon where I wanted to treat something as an iterator -- especially a strongly typed one. C# does let you write foreach loops over anything which implements the IEnumerable interface, but it can get tedious to wrap everything up when using a library class which I think should implement IEnumerable but which the designer didn't.

I still have some playing to do, but maybe soon I'll put something together that I didn't have code to do previously. That would be a serious milestone.

Wednesday, January 27, 2010

The Scala Experiment

Well, I've taken the plunge -- yet another programming language.

I've written before about this. It's also a common question on various professional bioinformatics discussion boards: which programming language should one use?

It is a decent time to ponder some sort of shift. I've written a bit of code, but not a lot -- partly because I've been more disciplined about using libraries as much as possible (versus rolling my own) but mostly because coding is a small -- but critical -- slice of my regular workflow.

At Codon I had become quite enamored with C#. Especially with the Visual Studio Integrated Development Environment (IDE), I found it very productive and a good fit for my brain & tastes. But, as a bioinformatics language it hasn't found much favor. That means no good libraries out there, so I must build everything myself. I've knocked out basic bioinformatics libraries a number of times (read FASTA, reverse complement a sequence, translate to protein, etc), but I don't enjoy it -- and there are plenty of silly mistakes that can be easy to make but subtle enough to resist detection for an extended period. Plus, there are other things I really don't feel like writing -- like my own SAM/BAM parser. I did have one workaround for this at Codon -- I could tap into Python libraries via a package called Python.NET, but it imposed a severe performance penalty & I would have to write small (but annoying) Python glue code. The final straw is that I'm finding it essential to have a Linux (Ubuntu) installation for serious second-generation sequencing analysis (most packages do not compile cleanly -- if at all -- in my hands on a Windows box using MinGW or Cygwin).

The obvious fallback is Perl -- which is exactly how I've fallen so far. I'm very fluent with it & the appropriate libraries are out there. I've just gotten less and less fond of the language & its many design kludges (I haven't quite gotten to my brother's opinion: Perl is just plain bad taste). I lose a lot of time with stupid errors that could have been caught at compile time with more static typing. It doesn't help that I have (until recently) been using the Perl mode in Emacs as my IDE -- once you've used a really polished tool like Visual Studio you realize how primitive that is.

Other options? There's R, which I must use for certain projects (microarrays) due to the phenomenal set of libraries out there. But R has just never been an easy fit for me -- somehow I just don't grok it. I did write a little serious Python (i.e. not just glue code) at Codon & I could see myself getting into it if I had peers also working in it -- but I don't. Infinity, like many company bioinformatics groups, is pretty much a C# shop, though with ecumenical attitudes towards any other language. I've also realized I need a basic comprehension of Ruby, as I'm starting to encounter useful code in it. But, as with Python, I can't seem to quite push myself to switch over -- it doesn't appeal to me enough to kick the Perl habit.

While playing around with various second generation sequencing analysis tools, I stumbled across a bit of weird code in the Broad's Genome Analysis ToolKit (GATK) -- a directory labeled "scala". Turns out, that's yet another language -- and one that has me intrigued enough to try it out.

My first bit of useful code (derived from a Hello World program that I customized to output in canine) is below and gives away some of the intriguing features. This program goes through a set of ABI trace files that fit a specific naming convention and writes out FASTA of their sequences to STDOUT:

package hello

import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools

object HelloWorld extends Application {

  // Trace files are named readprefix-NN-PRIMER.ab1; wells 1-16 used the forward
  // primer, 17-32 the reverse
  for (i <- 1 to 32) {
    val lz = new java.text.DecimalFormat("00")   // zero-pads the well number
    var primerSuffix = "M13F(-21)"
    val fnPrefix = "C:/somedir/readprefix-"
    if (i > 16) primerSuffix = "M13R"
    val fn = fnPrefix + lz.format(i) + "-" + primerSuffix + ".ab1"

    // Read the ABI trace, wrap its base calls as a BioJava Sequence and emit FASTA
    val traceFile = new File(fn)
    val name = traceFile.getName()
    val trace = new ABITrace(traceFile)
    val symbols = trace.getSequence()
    val seq = new SimpleSequence(symbols, name, name, Annotation.EMPTY_ANNOTATION)
    SeqIOTools.writeFasta(System.out, seq)
  }
}

A reader might ask, "Wait a minute -- what's all this java.this and biojava.that in there?" This is one of the appeals of Scala -- it compiles to Java Virtual Machine bytecode and can pretty much freely use Java libraries. Now, I mentioned this to a colleague and he pointed out there is Jython (a Python-to-JVM compiler), which reminded me of references to JRuby (a Ruby-to-JVM compiler). So perhaps I should revisit my skipping over those two languages. But in any case, in theory Scala can cleanly drive any Java library.

The example also illustrates something that I found a tad confusing at first. The book keeps stressing how Scala is statically typed -- but I didn't write a type for any of my variables above! The resolution is type inference: the compiler still assigns every val a static type, I just don't have to spell it out. So I can get the type safety I find very useful when I want it (or hold myself to it -- it will take some discipline) but can also leave the annotations off in many cases.
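
A small illustration of what I mean (the file name is made up); both vals end up statically typed, I just only spell one of the types out:

import java.io.File

object TypingDemo extends Application {
  val counter = 0                            // type inferred as Int
  val samFile: File = new File("aln.bam")    // type spelled out explicitly
  // val oops: Int = "thirty-seven"          // would be rejected at compile time
  println(counter + " " + samFile.getName)
}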

Scala has a lot in it, most of which I've only read about in the O'Reilly book & haven't tried. It borrows from both the Object Oriented Programming (OOP) lore and Functional Programming (FP). OOP is pretty much old hat, as most modern languages are OO and if not (e.g. Perl) the language supports it. Some FP constructs will be very familiar to Perl programmers -- I've written a few million anonymous functions to customize sorting. Others, perhaps not so much. And, like most modern languages all sorts of things not strictly in the language are supplied by libraries -- such as a concurrency model (Actors) that shouldn't be as much of a swamp as trying to work with threads (at least when I tried to do it way back yonder under Java). Scala also has some syntactic flexibility that is both intriguing and scary -- the opportunities for obfuscating code would seem endless. Plus, you can embed XML right in your file. Clearly I'm still at the "look at all these neat gadgets" phase of learning the language.
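
To pick one of those gadgets: the Perl habit of handing an anonymous comparison function to sort carries straight over. A tiny sketch with made-up data (in Scala 2.8 the method is called sortWith rather than sort):

object SortDemo extends Application {
  val reads = List("ACGT", "AC", "ACGTACGT", "A")
  // The anonymous function supplies the comparison, much like a Perl sort block
  val byLength = reads.sort((a, b) => a.length < b.length)
  println(byLength.mkString(","))
}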

Is it a picnic? No, clearly not. My second attempt at a useful Scala program is a bit stalled -- I haven't quite figured out how to rewrite a Java example from the Picard (Java implementation of SAMTools) library into Scala -- my tries so far have raised errors. Partly that's because the particular Java idiom being used was unfamiliar -- if I thought Scala was a way to avoid learning modern Java, I've quite deluded myself. And I did note that tonight, when I had something critical to get done on my commute, I reached for Perl. There are still a lot of idioms I need to relearn -- constructing & using regular expressions, parsing delimited text files, etc. Plus, it doesn't help that I'm learning a whole new development environment (Eclipse) virtually simultaneously -- though there is Eclipse support for all of the languages it looks like I might be using (Java, Scala, Perl, Python, Ruby), so that's a good general tool to have under my belt.

If I do really take this on, then the last decision is how much of my code to convert to Scala. I haven't written a lot of code -- but I haven't written none either. Some just won't be relevant anymore (one offs or cases where I backslid and wrote code that is redundant with free libraries) but some may matter. It probably won't be hard to just do a simple transformation into Scala -- but I'll probably want to go whole-hog and show off (to myself) my comprehension of some of the novel (to me) aspects of the language. That would really up the ante.

Thursday, January 21, 2010

A plethora of MRSA sequences

The Sanger Institute's paper in Science describing the sequencing of multiple MRSA (methicillin-resistant Staphylococcus aureus) genomes is very nifty and demonstrates a whole new potential market for next-generation sequencing: the tracking of infections in support of better control measures.

MRSA is a serious health issue; a relative of a friend of mine is battling it right now. MRSA is commonly acquired in health care facilities. Further spread can be combated by rigorous attention to disinfection and sanitation measures. A key question is: when MRSA shows up, where did it come from? How does it spread across a hospital, a city, a country or the world?

The gist of the methodology is to grow isolates overnight in the appropriate medium and extract the DNA. Each isolate is then converted into an Illumina library, with multiplex tags to identify it. The reference MRSA strain was also thrown in as a control. Using the GAII as they did, they packed 12 libraries onto one run -- over 60 isolates were sequenced for the whole study. With increasing cluster density and the new HiSeq instrument, one could imagine 2-5 fold (or perhaps greater) packing being practical; i.e. the entire study might fit on one run.
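
A quick back-of-envelope check of that last claim, using only the round numbers quoted above (the exact isolate count is assumed):

object RunEstimate extends Application {
  val isolates = 60            // "over 60 isolates" in the study; round number assumed
  val librariesPerRunNow = 12  // multiplexed libraries per GAII run in the paper
  for (packing <- List(1, 2, 5)) {
    val perRun = librariesPerRunNow * packing
    val runs = (isolates + perRun - 1) / perRun   // round up
    println(packing + "x packing: " + perRun + " libraries/run -> " + runs + " run(s)")
  }
}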

The library prep method sounds potentially automatable -- shearing on the Covaris instrument, cleanup using a 96-well plate system, end repair, removal of small (<150nt) fragments with size exclusion beads, A-tailing, another 150nt filtering by beads, adapter ligation, another 150nt filtering, PCR to introduce the multiplexing tags, another filtering for <150nt, quantitation and then pooling. Sequencing was "only" 36nt single end, with an average of 80Mb per isolate. Alignment to the reference genome was by ssaha (a somewhat curious choice, but perhaps now they'd use BWA) and SNP calling was with ssaha_pileup; non-mapping reads were assembled with Velvet, which did identify novel mobile element insertions. According to GenomeWeb, the estimated cost was about $320 per sample. That's probably just a reagents cost, but it gives a ballpark figure.

Existing typing methods either look at SNPs or specific sequence repeats, and while these often work they sometimes give conflicting information and other times lack the power to resolve closely related isolates. Having high resolution is important for teasing apart the history of an outbreak -- correlating patient isolates with samples obtained from the environment (such as hospital floors & such).

Phylogenetic analysis using SNPs in the "core genome" showed a strong pattern of geographical clustering -- but with some key exceptions, suggesting intercontinental leaps of the bug.

Could such an approach become routine for infection monitoring? A fully-loaded cost might be closer to $20K per experiment or higher. With appropriate budgeting, this can be balanced against the cost of treating an expanding number of patients and providing expensive support (not to mention the human misery involved). Full genome sequencing might also not always be necessary; targeted sequencing could potentially allow packing even more samples onto each run. Targeted sequencing by PCR might also enable eliding the culturing step. Alternatively, cheaper (and faster; this is still a multi-day Illumina run) sequencers might be used. And, of course, this can easily be expanded to other infectious diseases with important public health implications. For those that are expensive or slow to grow, PCR would be particularly appropriate.

It is also worth noting that we're only about 15 years since the first bacterial genome was sequenced. Now, the thought of doing hundreds a week is not at all daunting. Resequencing a known bug is clearly bioinformatically less of a challenge, but still how far we've come!

ResearchBlogging.org

Simon R. Harris, Edward J. Feil, Matthew T. G. Holden, Michael A. Quail, Emma K. Nickerson, Narisara Chantratita, Susana Gardete, Ana Tavares, Nick Day, Jodi A. Lindsay, Jonathan D. Edgeworth, Hermínia de Lencastre, Julian Parkhill, Sharon J. Peacock, & Stephen D. Bentley (2010). Evolution of MRSA During Hospital Transmission and Intercontinental Spread. Science, 327 (5964), 469-474. DOI: 10.1126/science.1182395

Wednesday, January 20, 2010

I'm definitely not volunteering for sample collection on this project



My post yesterday was successful at narrowing down the identity of the mystery creature to a spider of the genus Araneus. Another victory for crowd sourcing! It also points out the value of questioning your assumptions & conclusions. My initial thought on looking at the photos was that the body plan looked spider-like and details of the head and abdomen sometimes pointed me that way -- but then the lack of eight legs convinced me it must be an insect (I thought I could count six in some photos).

One of the commenters addressed my mystery vine with a suggestion that underscored a key detail I inadvertently left out. The vine raised welts on my legs when it grabbed me -- not in some exaggerated sense, but it definitely had some sort of fine projections (trichomes?) which were not what I'd call thorns, but which stuck to me (and my clothes) like Velcro & left behind small red welts.

The commenter asked if it could have been poison ivy, and I must confess a chuckle. I know that stuff! Boy do I! I've gotten that rash on most of both legs at one point, and all over my neck and arms another time. Numerous cases on my hands over time. It's definitely a very different rash -- much slower to come on, much more itchy with huge welts and very slow to disappear. My clinging plant's rash was short lived.

Poison ivy is easy to identify too. It's truly simple. First, you can start with the old saw "leaves of three, let it be". But a lot of plants have leaves divided into three parts. Some are yummy: raspberries and strawberries. Others are pretty: columbines. And far more. Also, sometimes it's hard to count -- what else could explain the common confusion of Virginia creeper (5-7 divisions) with poison ivy?

Some other key points which make identification simple:
Habitat: Woods, fields, lawns, gardens, roadsides, parking lot edges. Haven't seen it grow in standing water, but I wouldn't rule it out
Size: Tiny plantlets; vines climbing trees for 10+ meters
Habit: Individual plantlets, low growing weed, low growing shrub, climbing vine
Color: Generally dark green, except when not. Red in fall, except when not
Sun: Deep shade to complete sun

Now, this is the pattern where I grew up; my parents joke that it is the county flower. Here in Massachusetts, I don't see quite as much of it and it is very patchy. Wet areas are favorites, but there are also huge stands in non-wet areas. Landward faces of beach dunes are a spot to really watch out for it; Cape Cod is seriously infested.

While it causes humans great angst, poison ivy berries are an important wildlife food source. Indeed, at our previous house I had to be vigilant for sprouts along the flyways into my bird feeder. I've never seen Bambi or Thumper covered with ugly red welts; I'm not sure of the actual taxonomic range of the reaction to the poison (urushiol). Could it really be restricted to humans?

So, here you have a plant which thrives in many ecological niches, is ecologically important, is a modest-to-major pest (inhalation of urushiol is quite dangerous & a hazard for wildfire fighters), makes interesting secondary metabolites, and is related to at least two economically important plants (mangoes and cashews, which should be eaten with care by persons with high sensitivity to urushiol). Sounds like a good target for genome sequencing!

Tuesday, January 19, 2010

Green Krittah





As a youth, I was fortunate enough to spend four summers working as a camp counselor. Three of those were spent in the Nature Department, performing all sorts of environmental education functions (one year I taught archery).

One of my enjoyable duties was to wander the wilds of the camp looking for interesting living organisms to put in our terraria and aquaria, a job I relished -- though the campers could often top me for interesting (I never could find a stick insect, but we rarely lacked for one).

During one of those forays, most likely in my favorite sport of hand-catching of frogs, I stumbled onto a patch of a plant I did not recognize. I still remember it rather well -- perhaps because of the small itchy welts it raised on my unprotected legs. No, not nettles (we had plenty of those, and I often stumbled into them when focused on froggy prey), but rather a vining plant with light green triangular leaves about 5 cm on a side. Indeed, the leaves were nearly equilateral.

So, I took a sample back and pored through our various guides. Now, in an ideal world you would have a guide to every living plant expected in that corner of southeastern Pennsylvania -- which might be a tad tricky, as we were on the edge of an unusual geologic formation (the Serpentine Barrens) which has unusual flora. But in general, such comprehensive plant guides are hard (or impossible) to come by. What I did have was a great book of trees -- but this wasn't a tree. I had several good wildflower books -- but I saw no blooms. I couldn't find it in the edible wild plant books -- so it was neither edible nor likely to be mistaken for something edible (rule #1 of wild foraging: never EVER collect "wild parsnips" -- if you're wrong, they're probably one of several species which will kill you; if you are right and handle them incorrectly, the books say you'll get a vicious rash).

But, being a bit stubborn, I didn't give up -- I kicked the problem upstairs. Partly, this was curiosity -- and partly dreams of being the discoverer of some exotic invader. I mailed a carefully selected sample of the plant along with a description to the county agricultural agent. At the end of the summer, I contacted him (I think by dropping in on his office). He was polite -- but politely stumped as well. So much for the experts.

I'm remembering this & relating it because I've come into a similar situation. My father, who is no slouch in the natural world department, spotted this "green krittah" (as he has named it) on his car last summer. Not recognizing it, and wanting to preserve it (plus, he is an inveterate shutterbug), he shot many closeups of it (the red bar, if I remember correctly, is about 5 mm) -- indeed, being ever the one to document his work he shot the below picture of his photo setup (the krittah is that tiny spot on the car). He even ran it past my cousin the retired entomologist, but he too protested overspecialization -- if it wasn't a pest for the U.S. Navy, he wouldn't know it. At our holiday celebration he asked if I recognized it. I didn't, but given this day of the Internet & my routine use of search tools, surely I could get an answer?


Surely not (so far). I've tried various image-based searches (which were uniformly awful). I've tried searching various descriptions. No luck. Most maddening was a very poetic description on a question-and-answer site, seemingly my krittah but far more imaginative than I would have ever cooked up: "What kind of spider has a lady's face marking it's back? Name spider lady's face markings and red or yellow almond shaped eyes?". Unfortunately, the answer given is a mixture of non sequitur and incoherence -- plus I'm pretty much certain this krittah has 6 legs, not 8. Attempts to wade through image searches were hindered by too few images per gallery and far too many completely useless photos -- of entomologists, of VWs, of spy gear & rock bands & other stuff. I've even tried one online "submit your bug" sort of site, but heard nothing.

I've also tried various insect guides online. The ones I have found are based on the time-tested scheme of dichotomous keys. Each step in the key is a simple binary question, and based on the answer you go to one of two other steps in the key. A great system -- except when it isn't. For one thing, I discovered at least one bit of entomological terminology I didn't know -- so I checked both branches. That isn't too bad -- but suppose I hit more? Or suppose I answer incorrectly -- or am not sure. It took a lot of looking at Dad's fine photos to absolutely convince myself that the subject has only 6 legs (insect) and not 8 (mite or spider). It also doesn't help that some features (such as the spots on the back) appear differently in different photos. More seriously, the keys I found almost immediately assume you are staring at a mature insect -- if you are looking at some sort of larva they will be completely useless. So perhaps I'm looking at an immature form -- and the key will not help at all. In any case, the terminal leaves of the keys I found were woefully underpopulated.



What I would wish for now is a modern automated sketch artist slash photo array. It would ask me questions and I could answer each one yes, no or maybe -- and even if I answered yes it wouldn't rule anything out. With each question the photo array would update -- and I could also say "more like that photo" or "nothing like that photo".

Of course, Dad could have sacrificed the sample for what might seem the obvious approach -- DNA knows all, DNA tells all. That would have nailed it (much as some high school students recently used DNA barcoding to find a new cockroach in their midst), but I think neither of us would want to sacrifice something so beautiful out of pure curiosity (if confirmed to be something awful on the other hand, neither of us would hesitate). Alas, in the mid 1980's I didn't think that way, so I don't have a sample of my vine for further analysis.


If anyone recognizes the bug -- or my vine -- please leave a comment. I'm still curious about both.


(01 Feb 2010 -- corrected size marker to 5 mm, per correspondence with photographer)

Tuesday, January 12, 2010

The Array Killers?

Illumina announced their new HiSeq 2000 instrument today. There are some great summaries at Genetic Future and PolitGenomics; read both for the whole scoop. Perhaps just as jaw dropping as some of the operating statistics on the new beast is the fact that the Beijing Genomics Institute has already ordered 128 of them. Yow! That's (back-of-envelope) around $100M in instruments which will consume >$30M/year in reagents. I wish I had that budget!

Illumina's own website touts not only the cost (reagents only) of $10K per human genome, but also that this works out to 200 gene expression profiles per run at $200/profile. That implies multiplexing, as there are 32 lanes on the machine (16 lanes x 2 flow cells -- or is it 32 lanes per flowcell? I'm still trying to figure this out based on the note that it images the flowcell both from the top and bottom). That also implies being able to generate high resolution copy number profiles -- which need about 0.1X coverage given published reports for similar cost.

But it's not just Illumina. If a Helicos run is $10K and it has >50 channels, then that would also suggest around $200/sample to do copy number analysis. I've heard some wild rumors about what some goosed Polonators can do now.

The one devil in trying to do lots of profiles is that it means making that many libraries, which is the step that everyone still groans about (particularly my vendors!). Beckman Coulter just announced an automated instrument, but it sounds like it's not a huge step forward. Of course, on the Helicos there really isn't much to do -- it's the amplification-based systems that need size selection, which is one major bottleneck.

But, once the library throughput question is solved it would seem that arrays are going to be in big trouble. Of course, all of the numbers above ignore the instrument acquisition costs, which are substantial. Array costs may still be under these numbers, which for really big studies will add up. On the other hand, from what I've seen in the literature the sequencer-based info is always superior to array based -- better dynamic range, higher resolution for copy number breakpoints. Will 2010 be the year that the high density array market goes into a tailspin?

Illumina, of course, has both bases covered. Agilent has a big toe in the sequencing field, though by supplying tools around the space. But there's one obvious big array player so far MIA from the next generation sequencing space. That would seem to be a risky trajectory to continue, by anyone's metrix...

Sunday, January 10, 2010

There's Plenty of Room at the Bottom

Friday's Wall Street Journal had a piece in the back opinion section (which has items about culture & religion and similar stuff) discussing Richard Feynman's famous 1959 talk "There's Plenty of Room at the Bottom". This talk is frequently cited as a seminal moment -- perhaps the first proposition -- of nanotechnology. But, it turns out that when surveyed many practitioners in the field claim not to have been influenced by it and often to have never read it. The article pretty much concludes that Feynman's role in the field is mostly promoted by those who promote the field and extreme visions of it.

Now, by coincidence I'm in the middle of a Feynman kick. I first encountered him in the summer of 1985 when his "as told to" book "Surely You're Joking, Mr. Feynman!" was my hammock reading. The next year he would become a truly national figure with his carefully planned science demonstration as part of the Challenger disaster commission. I hadn't thought much about him since, other than recently watching Infinity, which focuses on his doomed marriage (his wife would die of TB) & the Manhattan Project. Somehow, that pushed me to finally read James Gleick's biography "Genius", and now I'm crunching through "Six Easy Pieces" (a book based largely on Feynman's famous physics lecture set for undergraduates), with the actual lectures checked out as well for stuffing on my audio player. I'll burn out soon (this is a common pattern), but will gain much from it.

I had never actually read the talk before, just summaries in the various books, but luckily it is available on-line -- and makes great reading. Feynman gave the talk at the American Physical Society meeting, and apparently nobody knew what he would say -- some thought the talk would be about the physics job market! Instead, he sketched out a lot of crazy ideas that nobody had proposed before -- how small a machine could one build? How tiny could you write? Could you make small machines which could make even smaller machines and so on and so forth? He even put up two $1000 prizes:
It is my intention to offer a prize of $1,000 to the first guy who can take the information on the page of a book and put it on an area 1/25,000 smaller in linear scale in such manner that it can be read by an electron microscope.

And I want to offer another prize---if I can figure out how to phrase it so that I don't get into a mess of arguments about definitions---of another $1,000 to the first guy who makes an operating electric motor---a rotating electric motor which can be controlled from the outside and, not counting the lead-in wires, is only 1/64 inch cube.


The first prize wasn't claimed until the 1980's, but a string of cranks streamed in to claim the second one -- bringing in various toy motors. Gleick describes Feynman's eyes as "glazed over" when yet another person came in to claim the motor prize -- and an "uh oh" when the guy pulled out a microscope. It turned out that by very patient work it was possible to use very conventional technology to wind a motor that small -- and Feynman hadn't actually set aside money for the prize!

Feynman's relationship to nanotechnology is reminiscent of Mendel's to genetics. Mendel did amazing work, decades ahead of his time. He documented things carefully, but his publication strategy (a combination of obscure regional journals and sending his works to various libraries & famous scientists) failed in his lifetime. Only after three different groups rediscovered his work -- after finding much the same results -- was Mendel started on the road to scientific iconhood. Clearly, Mendel did not influence those who rediscovered him and if his work were still buried in rare book rooms, we would have a similar understanding of genetics to what we have today. Yet, we refer to genetics as "Mendelian" (and "non-Mendelian").

I hope nanotechnologists give Feynman a similar respect. Perhaps some of the terms describing his role are hyperbole ("spiritual founder"), but he clearly articulated both some of the challenges that would be encountered (for example, that issues of lubrication & friction at these scales would be quite different) and why we needed to address them. For example, he pointed out that the computer technology of the day (vacuum tubes) would place inherent performance limits on computers -- simply because the speed of light would limit the speed of information transfer across a macroscopic computer complex. He also pointed out that the then-current transistor technology looked like a dead end, as the entire world's supply of germanium would be insufficient. But, unlike naysayers he pointed out that these were problems to solve, and that he didn't know if they really would be problems.

One last thought -- many of the proponents of synthetic biology point out that biology has come up with wonderfully compact machines that we should either copy or harness. And who first articulated this concept? I don't know for sure, but I now propose that 1959 is the year to beat:
The biological example of writing information on a small scale has inspired me to think of something that should be possible. Biology is not simply writing information; it is doing something about it. A biological system can be exceedingly small. Many of the cells are very tiny, but they are very active; they manufacture various substances; they walk around; they wiggle; and they do all kinds of marvelous things---all on a very small scale. Also, they store information. Consider the possibility that we too can make a thing very small which does what we want---that we can manufacture an object that maneuvers at that level!


So if the nanotechnologists don't want to call their field Feynmanian, I propose that synthetic biology be renamed such!

Wednesday, January 06, 2010

On Being a Scientific Zebra



I got a phone call today from someone asking permission to suggest me as a reviewer of a manuscript this person was about to submit. For future reference, if it's similar to anything I've blogged about, I'd be happy to be a referee. I generally get my reviews in on time (though near deadline), a practice I have gotten much better about since getting a "your review is late" note from a Nobelist -- not the sort of person you want to get on the wrong side of.

I end up reviewing a half dozen or so papers a year, from a handful of journals. NAR has used me a few times & I have some former colleagues who are editors at a few of the PLoS journals. There's also one journal of which I'm on the Editorial Board, Briefings in Bioinformatics (anyone who wishes to write a review is welcome to leave contact info in a comment here, which I won't pass through). I'll confess that until recently I hadn't done much for that journal, but now I'm actually trying to put together a special issue on second generation sequencing (and if anyone wants to submit a review on the subject by the end of next month, contact me).

I generally like reviewing. Good writing was always valued in my family, and my parents were always happy to proof my writings when I was at home. This space doesn't see that level of attention -- it is deliberately a bit of fire-and-forget. A review I'm currently writing is now undergoing nearly daily revision; some parts are quite stable but others undergo major revision each time I look at them. Eventually it will stabilize or I'll just hit my deadline.

There's two times when I'm not satisfied with my reviews. The worst is when I realize near the deadline that I've agreed to review a paper where I'm uncomfortable with my expertise for a lot of the material. Of course, if I'd actually read the whole thing on first receipt I'd save myself from this. You generally agree to review these after seeing only the abstract, so I suppose I could put in my report "The abstract is poorly written, as on reading it I thought I'd understand the material but on reading the material I find I don't", but I'm not quite that crazy.

The other unsatisfying case is when I'm uneasy with the paper but can't put my finger on why. Typically, I end up writing a bunch of comments which nibble around the edges of the paper, but that isn't really helpful to anyone.

I also tend to be a little unsatisfied when I get to review very good papers, because there isn't much to say. I generally end up wishing for some further extension (and commenting that it is unfair to ask for it), but beyond that what can you say? If the paper is truly good, you really don't have much to do. A good paper once inflicted a most cruel case of writer's block on me -- it was an early paper reporting a large (in those days) human DNA sequence, and I was invited to write a News & Views on it -- and I couldn't come up with anything satisfying and missed the opportunity.

That leaves the most satisfying reviews -- when a paper is quite bad. This isn't meant to be cruel, but these are the papers you can really dig into. In most cases, there is a core of something interesting, but often either the paper is horridly organized and/or there are gaping holes in it. It can be fun to take someone's manuscript, figure out how you would rearrange & re-plan it, and then write out a description of that. I try to avoid going into copy editor mode, but some manuscripts are so error-ridden it's impossible to resist. Would it really be helpful to the author if I didn't? One subject I do try to be sensitive to is the issue of authors being stuck writing in English when it is not their first language -- given that I can hardly read any other language (a fact plain and simple; I'm not proud of it) it would be unfair of me to demand Strunk & White prose. But, it is critical that the paper actually be understandable. One recent paper used a word repeatedly in a manner that made no sense to me -- presumably this was a regionalism of the authors'.

I once reviewed a complete mess of a paper and ended up writing a manuscript-length review of it. In my mind, I constructed a scenario of a very junior student, perhaps even an undergraduate, who had eagerly done a lot of work with very little (or very poor) supervision from a faculty member. The paper was poorly organized as it was, and many of the key analyses had either been badly done or not done at all. Still, I didn't want to squash that enthusiasm and so I wrote that long report. I don't know if they ever rewrote it.

I can get very focused on the details. Visualization is important to me, so I will hammer on graphs that don't fit my Tufte-ean tastes or on poorly written figure legends. Missing supplemental material (or non-functioning websites, for the NAR website issue) sends my blood pressure skyrocketing.

I wouldn't want to edit manuscripts full time, but I wouldn't mind a slightly heavier load. So if you are an author or an editor, I reiterate that I'm willing to review papers on computational biology, synthetic biology, genomics and similar. I'd love to review more papers on the sorts of topics I work on now -- such as cancer genomics -- than the overhang from my distant past -- a lot of review requests are based on my Ph.D. thesis work!

Monday, December 28, 2009

Length matters!

I was looking through part of my collection of papers using Illumina sequencing and discovered an unpleasant surprise: more than one does not seem to state the read length used in the experiment. While to some this may seem trivial, I had a couple of interests. First, it's useful for estimating what can be done with the technology, and second, since read lengths have been increasing, it is an interesting guesstimate of when an experiment was done. Of course, there are lots of reasons to carefully pick read length -- the shorter the length, the sooner the instrument can be turned over to another experiment. Indeed, a recent paper estimates that for RNA-Seq, IF you know all the transcript isoforms and are interested in measuring transcript levels, then 20-25 nucleotides is quite sufficient (they didn't, for example, discuss the ideal length for mutation/SNP discovery). Of course, that's a whopping "IF", particularly for the sorts of things I'm interested in.

Now in some cases you can back-estimate the read length using the given statistics on numbers of mapped reads and total mapped nucleotides (just divide the latter by the former), though I'm not even sure these numbers are reliably showing up in papers. I'm sure to some authors & reviewers they are tedious numbers of little use, but I disagree. Actually, I'd love to see each paper (in the supplementary materials) show its error statistics by read position, because this is something I think would be interesting to see the evolution of. Plus, any lab not routinely monitoring this plot is foolish -- not only would a change show important quality control information, but it also serves as an important reminder to consider the quality in how you are using the data. It's particularly surprising that the manufacturers do not have such plots prominently displayed on their websites, though of course those would be suspected of being cherry-picked. One I did see from a platform supplier had a horribly chosen (or perhaps deviously chosen) scale for the Y-axis, so that the interesting information was so compressed as to be nearly useless.

I should have a chance in the very near future to take a dose of my own prescription. On writing this, it occurs to me that I am unaware of widely-available software to generate the position-specific mismatch data for such plots. I guess I just gave myself an action item!
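
To make that action item a little more concrete, here is a rough sketch of the per-cycle mismatch tally I have in mind, leaning on the Picard SAM/BAM classes. It is only a sketch: the BAM path is a placeholder, reads are assumed to be no longer than 200nt and to carry MD tags, and anything other than a simple end-to-end match (indels, clipping) is simply skipped rather than handled properly.

import java.io.File
import net.sf.samtools.SAMFileReader

object MismatchByCycle extends Application {
  val reader = new SAMFileReader(new File("C:/workspace/short-reads/aln.se.2.sorted.bam"))
  val mdToken = """(\d+)|(\^[A-Z]+)|([A-Z])""".r   // the pieces of an MD tag
  val mismatches = new Array[Long](200)            // mismatch count per cycle
  val totals = new Array[Long](200)                // aligned-base count per cycle

  val recs = reader.iterator
  while (recs.hasNext) {
    val rec = recs.next
    val md = rec.getStringAttribute("MD")
    // Keep only mapped reads whose CIGAR is a single match block (no indels or clipping)
    if (!rec.getReadUnmappedFlag && md != null && rec.getCigarString.matches("""\d+M""")) {
      val len = rec.getReadLength
      for (i <- 0 until len) totals(i) = totals(i) + 1
      var pos = 0   // position along the read, in reference orientation
      for (m <- mdToken.findAllIn(md)) {
        if (m.startsWith("^")) {
          // deleted reference bases consume no read positions (excluded by the CIGAR filter anyway)
        } else if (Character.isDigit(m.charAt(0))) {
          pos += m.toInt   // a run of matching bases
        } else {
          // a single mismatched base; convert to sequencing cycle for reverse-strand reads
          val cycle = if (rec.getReadNegativeStrandFlag) len - pos - 1 else pos
          mismatches(cycle) = mismatches(cycle) + 1
          pos += 1
        }
      }
    }
  }

  for (i <- 0 until totals.length; if totals(i) > 0)
    println(i + "\t" + mismatches(i).toDouble / totals(i))
}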

Friday, December 18, 2009

Nano Anglerfish or Feejee Mermaids?

A few months ago I blogged enthusiastically about a paper in Science describing an approach to deorphan enzymes in parallel. Two anonymous commenters were quite derisive, claiming the chemistry for generating labeled metabolites in the paper impossible. Now Science's editor Bruce Alberts has published an expression of concern, which cites worries over the chemistry as well as the failure of the authors to post promised supporting data to their website and changing stories as to how the work was done.

The missing supporting data hits a raw nerve. I've been frustrated on more than one occasion whilst reviewing a paper when I couldn't access the supplementary data, and have certainly encountered this as a reader as well. I've sometimes meekly protested as a reviewer; in the future I resolve to consider this automatic grounds for "needs major revision". Even if the mistake is honest, it means data considered important is unavailable for consideration. Given modern publications with data which is either too large to print or simply incompatible with paper, "supplementary" data is frequently either central to the paper or certainly just off center.

This controversy also underscores a challenge for many papers which I have faced as a reviewer. To be quite honest, I'm utterly unqualified to judge the chemistry in this paper -- but feel quite qualified to judge many of the biological aspects. I have received for review papers with this same dilemma; parts I can critique and parts I can't. The real danger is if the editor inadvertently picks reviewers who all share the same blind spot. Of course, in an ideal world a paper would always go to reviewers capable of vetting all parts of it, but with many multidisciplinary papers that is unlikely to happen. However, it also suggests a rethink of the standard practice of assigning three reviewers per paper -- perhaps each topic area should be covered by three qualified reviewers (of course, the reviewers would need to honestly declare this -- and not at review deadline time when it is too late to find supplementary reviewers!).

But, it is a mistake to think that peer review can ever be a perfect filter on the literature. It just isn't practical to go over every bit of data with a fine toothed comb. A current example illustrates this: a researcher has been accused of faking multiple protein structures. While some suspicion was raised when other structures of the same molecule didn't agree, the smoking gun is that the structures have systematic errors in how the atoms are packed. Is any reviewer of a structure paper really going to check all the atomic packing details? At some point, the best defense against scientific error & misconduct is to allow the entire world to scrutinize the work.

One of my professors in grad school had us first-year students go through a memorable exercise. The papers assigned one week were in utter conflict with each other. We spent the entire discussion time trying to finesse how they could both be right -- what was different about the experimental procedures and how issues of experiment timing might explain the discrepancies. At the end, we asked what the resolution was, and were told "It's simple -- the one paper is a fraud". Once we knew this, we went back and couldn't believe we had believed anything -- nothing in the paper really supported its key conclusion. How had we been so blind before? A final coda to this is that the fraudulent paper is the notorious uniparental mouse paper -- and of course cloning of mice turns out to actually be possible. Not, of course, by the methods originally published, and indeed at that time (mid 1970s) it would be well nigh impossible to actually prove that a mouse was cloned.

With that in mind, I will continue to blog here about papers I don't fully understand. That is one bit of personal benefit for me -- by exposing my thoughts to the world I invite criticism and will sometimes be shown the errors in my thinking. It never hurts to be reminded that skepticism is always useful, but I'll still take the risk of occasionally being suckered by P.T. Barnum, Ph.D. This is, after all, a blog and not a scientific journal. It's meant to be a bit noisy and occasionally wrong -- I'll just try to keep the mean on the side of being correct.

Thursday, December 17, 2009

A Doublet of Solid Tumor Genomes

Nature this week published two papers describing the complete sequencing of a cancer cell line (small cell lung cancer (SCLC) NCI-H209 and melanoma COLO-829) each along with a "normal" cell line from the same individual. I'll confess a certain degree of disappointment at first as these papers are not rich in the information of greatest interest to me, but they have grown on me. Plus, it's rather churlish to complain when I have nothing comparable to offer myself.

Both papers have a good deal of similar structure, perhaps because their author lists share a lot of overlap, including the same first author. However, technically they are quite different. The melanoma sequencing used the Illumina GAII, generating 2x75 paired end reads supplemented with 2x50 paired end reads from 3-4Kb inserts, whereas the SCLC paper used 2x25 mate pair SOLiD libraries with inserts between 400 and 3000 bp.

The papers have estimates of the false positive and false negative rates for the detection of various mutations, in comparison to Sanger data. For single base pair substitutions on the Illumina platform in the melanoma sample, 88% of previously known variants were found and 97% of a sample of 470 newly found variants were confirmed by Sanger. However, on small insertions/deletions (indels) there was both less data and much less success. Only one small deletion was previously known, a 2 base deletion which is key to the biology. This was not found by the automated alignment and analysis, though reads containing this indel could be found in the data. A sample of 182 small indels were checked by Sanger and only 36% were confirmed. On large rearrangements, 75% of those tested were confirmed by PCR.

The statistics for the SOLiD data in SCLC were comparable. 76% of previously known single nucleotide variants were found and 97% of newly found variants confirmed by Sanger. Two small indels were previously known and neither was found and conversely only 25% of predicted indels confirmed by Sanger. 100% of large rearrangements tested by PCR validated. So overall, both platforms do well for detecting rearrangements and substitutions and are very weak for small indels.

The overall mutation hauls were large, after filtering out variants found in the normal cell line: 22,910 substitutions for the SCLC line and 33,345 in the melanoma line. Both of these samples reflect serious environmental abuse; melanomas often arise from sun exposure and the particular cancer morphology the SCLC line is derived from is characteristic of smokers (the smoking history of the patient was unknown). Both lines showed mutation spectra in agreement with what is previously known about these environmental insults. 92% of C>T single substitutions occurred at the second base of a pyrimidine dimer (CC or CT sequences). CC>TT double substitutions were also skewed in this manner. CpG dinucleotides are also known to be hotspots and showed elevated mutation frequencies. Transcription-coupled repair repairs the transcribed strand more efficiently than the non-transcribed strand, and in concordance with this, in transcribed regions there was nearly a 2:1 bias of C>T changes on the non-transcribed strand. However, the authors state (but I still haven't quite figured out the logic) that transcription-coupled repair can account for only 1/3 of the bias and suggest that another mechanism, previously suspected but not characterized, is at work. One final consequence of transcription-coupled repair is that the more expressed a gene is in COLO-829, the lower its mutational burden. A bias of mutations towards the 3' end of transcribed regions was also observed, perhaps because 5' ends are transcribed at higher levels (due to abortive transcription). A transcribed-strand bias was also seen in G>T mutations, which may be oxidative damage.

An additional angle on mutations in the COLO-829 melanoma line is offered by the observation of copy-neutral loss of heterozygosity (LOH) in some regions. In other words, one copy of a chromosome was lost but then replaced by a duplicate of the remaining copy. This analysis is enabled by having the sequence of the normal DNA to identify germline heterozygosity. Interestingly, in these regions heterozygous mutations outnumber homozygous ones, marking that these substitutions occurred after the reduplication event. 82% of C>T mutations in these regions show the hallmarks of being early mutations, suggesting that the later mutations arose after the melanoma metastasized and was therefore removed from ultraviolet exposure.

In a similar manner, there is a rich amount of information in the SCLC mutational data. I'll skip over a bunch to hit the evidence for a novel transcription-coupled repair pathway that operates on both strands. The key point is that highly expressed genes had lower mutation rates on both strands than less expressed genes. A>G mutations showed a bias for the transcribed strand whereas G>A mutations occurred equally on each strand.

Now, I'll confess I don't generally get excited about looking at mutation spectra. A lot of this has been published before, though these papers offer a particularly rich and low-bias look. What I'm most interested in are recurrent mutations and rearrangements that may be driving the cancer, particularly if they suggest therapeutic interventions. The melanoma line contained two missense mutations in the gene SPDEF, which has been associated with multiple solid tumors. A truncating stop mutation was found by sequencing SPDEF in 48 additional tumors. A missense change was found in a metalloprotease (MMP28) which has previously been observed to be mutated in melanoma. Another missense mutation was found in a gene which may play a role in ultraviolet repair (though it has been implicated in other processes), suggesting a tumor suppressor role. The sequencing results confirmed two out of three known driver mutations in COLO-829: the V600E activating mutation in the kinase BRAF and deletion of the tumor suppressor PTEN. As noted above, the known 2 bp deletion in CDKN2A was not found through the automated process.

The SCLC sample has a few candidates for interestingly mutated genes. A fusion gene in which one partner (CREBBP) has been seen in leukemia gene fusions was found. An intragenic tandem duplication within the chromatin remodelling gene CHD7 was found which should generate an in-frame duplication of exons. Another SCLC cell line (NCI-H2171) was previously known to have a fusion gene involving CHD7. Screening of 63 other SCLC cell lines identified another (LU-135) with internal exon copy number alterations. LU-135 was further explored by mate pair sequencing with a 3-4 Kb library, which identified a breakpoint involving CHD7. Expression analysis showed high expression of CHD7 in both LU-135 and NCI-H2171 and generally higher expression of CHD7 in SCLC lines than in non-small cell lung cancer lines and other tumor cell lines. An interesting twist is that the fusion partner in NCI-H2171 and LU-135 is a non-coding RNA gene called PVT1 -- which is thought to be a transcriptional target of the oncogene MYC. MYC is amplified in both these cell lines, suggesting multiple biological mechanisms resulting in high expression of CHD7. It would seem reasonable to expect some high profile functional studies of CHD7 in the not too distant future.
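The breakpoint hunting itself follows a simple principle: mate pairs whose two ends map to different chromosomes, or much farther apart than the library insert size, get clustered, and well-supported clusters become candidate rearrangements -- which is how a 3-4 Kb library can flag a CHD7 breakpoint. The sketch below is a generic toy version of that idea; the field names and cutoffs are my assumptions, not the actual pipeline used in the paper.

from collections import defaultdict

def candidate_breakpoints(pairs, expected_insert=3500, slop=1500, bin_size=10000, min_support=5):
    """pairs: list of ((chrom1, pos1), (chrom2, pos2)) mapped mate-pair ends."""
    clusters = defaultdict(int)
    for (c1, p1), (c2, p2) in pairs:
        discordant = (c1 != c2) or abs(p2 - p1) > expected_insert + slop
        if discordant:
            # Bin coordinates so nearby discordant pairs land in the same cluster.
            clusters[(c1, p1 // bin_size, c2, p2 // bin_size)] += 1
    return {key: n for key, n in clusters.items() if n >= min_support}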

For functional point mutations, the natural place to look is at coding regions and splice junctions, as here we have the strongest models for ranking the likelihood that a mutation will have a biological effect. In the SCLC paper an effort was made to push this a bit further and look for mutations that might affect transcription factor binding sites. One candidate was found but not further explored.
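The usual way to ask whether a mutation might matter to a transcription factor is to score the reference and mutant sequence against a position weight matrix and look at the change in score. Here is a bare-bones version with a made-up 4-bp motif; a real analysis would use curated motifs and some significance threshold.

PWM = [  # log-odds scores for a hypothetical 4-bp motif
    {'A': 1.2, 'C': -0.8, 'G': -1.0, 'T': 0.1},
    {'A': -1.5, 'C': 1.4, 'G': -0.2, 'T': -1.1},
    {'A': 0.3, 'C': -0.9, 'G': 1.1, 'T': -1.3},
    {'A': -0.7, 'C': 0.2, 'G': -1.2, 'T': 1.0},
]

def pwm_score(seq):
    return sum(column[base] for column, base in zip(PWM, seq))

def score_mutation(ref_site, offset, alt_base):
    """ref_site: reference bases under the motif; offset: mutated position within it."""
    mut_site = ref_site[:offset] + alt_base + ref_site[offset + 1:]
    return pwm_score(ref_site), pwm_score(mut_site)

ref, alt = score_mutation("ACGT", 2, "A")
print("ref=%.2f alt=%.2f delta=%.2f" % (ref, alt, alt - ref))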

In general, this last point underlines what I believe will be different about subsequent papers. Looking mostly at a single cancer sample, one is limited in what can be inferred. The mutational spectrum work is something which a single tumor can illustrate in detail, and such in-depth analyses will probably be significant parts of the first tumor sequencing paper for each tumor type, particularly other types with strong environmental or genetic mutational components. But in terms of learning what makes cancers tick and how we can interfere with that, the real need is to find recurrent targets of mutation. Various cancer genome centers have been promising a few hundred tumors sequenced over the next year. Already at the recent ASH meeting (which I did not attend), there were over a half dozen presentations or posters on whole genome or exome sequencing of leukemias, lymphomas and myelomas -- the first ripples of the tsunami to come. But the raw cost of targeted sequencing remains at most a tenth of the cost of an entire genome. The complete set of mutations found in either one of these papers could have been packed onto a single oligo-based capture scheme, and certainly a high-priority subset could be amplified by PCR without breaking the bank on oligos. I would expect that in the near future tumor sequencing papers will check their mutations and rearrangements on validation panels of at least 50 and preferably hundreds of samples (though assembling such sample collections is definitely not trivial). This will allow estimation of the population frequency of those mutations which recur at the level of 5-10% or more. With luck, some of those will suggest pharmacologic interventions which can be tested for their ability to improve patients' lives.
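How big those validation panels need to be is just binomial arithmetic. A quick back-of-envelope, assuming a mutation truly recurs in 5-10% of tumors and ignoring sequencing error, shows why 50 samples is a bare minimum and hundreds are much more comfortable:

from math import comb  # Python 3.8+

def prob_at_least(k, n, p):
    """Probability of seeing >= k carriers among n samples at population frequency p."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

for n in (50, 200):
    for p in (0.05, 0.10):
        print("n=%3d freq=%2d%%: P(>=2 more carriers) = %.2f"
              % (n, int(p * 100), prob_at_least(2, n, p)))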

ResearchBlogging.org
Pleasance, E., Stephens, P., O’Meara, S., McBride, D., Meynert, A., Jones, D., Lin, M., Beare, D., Lau, K., Greenman, C., Varela, I., Nik-Zainal, S., Davies, H., Ordoñez, G., Mudie, L., Latimer, C., Edkins, S., Stebbings, L., Chen, L., Jia, M., Leroy, C., Marshall, J., Menzies, A., Butler, A., Teague, J., Mangion, J., Sun, Y., McLaughlin, S., Peckham, H., Tsung, E., Costa, G., Lee, C., Minna, J., Gazdar, A., Birney, E., Rhodes, M., McKernan, K., Stratton, M., Futreal, P., & Campbell, P. (2009). A small-cell lung cancer genome with complex signatures of tobacco exposure Nature DOI: 10.1038/nature08629

Pleasance, E., Cheetham, R., Stephens, P., McBride, D., Humphray, S., Greenman, C., Varela, I., Lin, M., Ordóñez, G., Bignell, G., Ye, K., Alipaz, J., Bauer, M., Beare, D., Butler, A., Carter, R., Chen, L., Cox, A., Edkins, S., Kokko-Gonzales, P., Gormley, N., Grocock, R., Haudenschild, C., Hims, M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L., Ning, Z., Royce, T., Schulz-Trieglaff, O., Spiridou, A., Stebbings, L., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M., Campbell, P., Bentley, D., Futreal, P., & Stratton, M. (2009). A comprehensive catalogue of somatic mutations from a human cancer genome Nature DOI: 10.1038/nature08658

Monday, December 14, 2009

Panda Genome Published!

 
[Photo: a giant panda at the San Diego Zoo -- see the note at the end of the post]


Today's big genomics news is the advance publication in Nature of the giant panda (aka panda bear) genome sequence. Now I'll be fighting someone (TNG) for my copy of Nature!

Pandas are the first bear (and alas, there is already someone making the mistaken claim otherwise in the Nature online comments) and only the second member of Carnivora (after the dog) with a draft sequence. Little in the genome sequence suggests that they have abandoned meat for a nearly all-plant diet, other than an apparent knockout of the taste receptor for glutamate, a key component of the taste of meat. So if you prepare bamboo for the pandas, don't bother with any MSG! But pandas do not appear to have acquired enzymes for attacking their bamboo, suggesting that their gut microflora do a lot of the work. So a panda microbiome metagenome project is clearly on the horizon. The sequence also greatly advances panda genetics: only 13 panda genes had previously been sequenced.

The assembly is notable for being composed entirely of Solexa data using a mixture of library insert lengths. One issue touched on here (and which I've seen commented on elsewhere) is that the longer mate pair libraries have serious chimaera issues and were not trusted to simply be fed into the assembly program, but were carefully added in a stepwise fashion (stepping up in library length) during later stages of assembly. It will be interesting to see what the Pacific Biosciences instrument can do in this regard -- instead of trying to edit out the middle of large inserts by enzymatic and/or physical means, PacBio apparently has a "dark fill" procedure of pulsing unlabeled nucleotides. This leads to islands of sequence separated by signal gaps of known time, which can be used to estimate distance. Presumably such an approach will not have chimaeras, though the raw base error rate may be higher.

I'm quite confused by their Table 1, which shows the progress of their assembly as different data was added in. The confusing part is that it shows progressive improvement in the N50 and N90 numbers with each step -- and then much worse numbers for the final assembly. The final N50 is 40Kb, which is substantially shorter than dog (close to 100Kb) but longer than platypus (13Kb). It strikes me that a useful additional statistic (or actually set of statistics) for a mammalian genome would be to calculate what fraction of core mammalian genes (which would have to be defined) are contained on a single contig (or for what fraction you can find at least 50% of the coding region in one contig).
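For what it's worth, both the standard N50 and the gene-containment statistic I'm proposing are easy to compute once you have contig lengths and a mapping of core genes onto contigs. A minimal sketch (the gene-to-contig mapping is assumed to come from some separate annotation or alignment step):

def n50(contig_lengths):
    """Length of the contig at which the running sum crosses half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

def fraction_genes_on_single_contig(gene_to_contigs):
    """gene_to_contigs: dict gene_id -> set of contig ids its coding exons align to."""
    single = sum(1 for contigs in gene_to_contigs.values() if len(contigs) == 1)
    return single / len(gene_to_contigs)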

While the greatest threat to the panda's continued existence in the wild is habitat destruction, it is heartening to find that pandas have a high degree of genetic variability -- almost twice the heterozygosity of people. So there is apparently a lot of genetic diversity packed into the small panda population (around 1,600 individuals, based on DNA sampling of scat).

BTW, no, that is not the subject panda (Jingjing, who was the mascot for the Beijing Olympics) but rather my own shot from our pilgrimage last summer to the San Diego Zoo. I think that is Gao Gao, but I'm not good about noting such things.


(update: forgot to put the Research Blogging bit in the post)

ResearchBlogging.org
Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O., Leung, F., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C., Lam, T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T., Yiu, S., Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L., Kristiansen, K., Wong, G., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., & Wang, J. (2009). The sequence and de novo assembly of the giant panda genome Nature DOI: 10.1038/nature08696

Sunday, November 22, 2009

Targeted Sequencing Bags a Diagnosis

A nice complement to the one paper (Ng et al) I detailed last week is a paper that actually came out just beforehand (Choi et al). Whereas the Ng paper used whole exome targeted sequencing to find the mutation for a previously unexplained rare genetic disease, the Choi et al paper used a similar scheme (though with a different choice of targeting platform) to find a known mutation in a patient, thereby diagnosing the patient.

The patient in question has a tightly interlocked pedigree (Figure 2), with two different consanguineous marriages shown. Put another way, this person could trace three paths back to one set of great-great-grandparents. Hence, they had quite a bit of DNA which was identical-by-descent, which meant that in these regions -- which should be homozygous -- any low-frequency variant call could be safely ignored as noise. A separate scan with a SNP chip was used to identify such regions independently of the sequencing.
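In code, the filtering logic is about as simple as it sounds: inside regions flagged as identical-by-descent (here taken from the SNP-chip homozygosity scan), any apparently heterozygous call gets discarded as noise. The variant and interval layouts below are my own assumptions for illustration, not the authors' implementation.

import bisect

def filter_het_noise(variants, ibd_regions):
    """variants: list of dicts with 'chrom', 'pos', 'het' (bool).
    ibd_regions: dict chrom -> position-sorted list of (start, end) intervals."""
    def in_ibd(chrom, pos):
        intervals = ibd_regions.get(chrom, [])
        i = bisect.bisect_right(intervals, (pos, float('inf'))) - 1
        return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]
    kept = []
    for v in variants:
        if v['het'] and in_ibd(v['chrom'], v['pos']):
            continue  # heterozygous call inside an IBD region: treat as noise
        kept.append(v)
    return kept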

The patient was a 5-month-old male, born prematurely at 30 weeks and with "failure to thrive and dehydration". Two spontaneous abortions and the death of another premature sibling at day 4 also characterized this family; a litany of miserable suffering. Due to imbalances in the standard blood chemistry (which I wish the reviewers had insisted be explained further for those of us who don't frequent that world), a kidney defect was suspected, but other causes (such as infection) were not excluded.

The exome capture was this time on the Nimblegen platform, followed by Illumina sequencing. This is not radically different from the Ng paper, which used Agilent capture and Illumina sequencing. At the moment Nimblegen & Agilent appear to be the only practical options for whole exome-scale capture, though there are many capture schemes published and quite a few available commercially. Lots of variants were found. One that immediately grabbed attention was a novel missense mutation which was homozygous and in a known chloride transporter, SLC26A3. This missense mutation (D652N) targets a position which is almost utterly conserved across the protein family, and it makes a significant change in side chain (acidic to polar uncharged). Most importantly, SLC26A3 has already been shown to cause "congenital chloride-losing diarrhea" (CLD) when mutated at other positions. Clinical follow-up confirmed that fluid loss was through the intestines and not the kidneys.

One of the genetic diseases of the kidney that had been considered was Bartter syndrome, which the more precise blood chemistry did not match. Given that one patient who had been suspected of Bartter syndrome instead had CLD, the group screened 39 more patients with Bartter syndrome but lacking mutations in 4 different genes linked to the syndrome. 5 of these patients had homozygous mutations in SLC26A3, 2 of which were novel. 190 control chromosomes were also sequenced; none had mutations. 3 of these patients had further follow-up & confirmation of water loss through the gastrointestinal tract.

This study again illustrates the utility of targeted sequencing for clinical diagnosis of difficult cases. While a whole exome scan is currently in the neighborhood of $20K, more focused searches could be run far cheaper. The challenge will be in designing economical panels which allow scanning the most important genes at low cost -- and designing such panels well. Presumably one could go through OMIM and find all diseases & syndromes which alter electrolyte levels and their known causative gene(s). Such panels might be doable for perhaps as low as $1-5K per sample; too expensive for routine newborn screening but far better than an endless stream of tests. Of course, such panels would miss novel genes or really odd presentations, so follow-up of negative results with whole exome sequencing might be required. With newer sequencing platforms available, the costs for this may plummet to a few hundred dollars per test, which is probably on par with what the current screening of newborns for inborn errors runs. One impediment to commercial development in this field may well be the rapid evolution of platforms; companies may be wary of betting on a technology that will not last.

Of course, to some degree the distinction between the two papers is artificial. The Ng et al paper, as I noted, actually did diagnose some of its patients with known genetic diseases. Similarly, the patients in this study who are now negative for known Bartter syndrome genes and for CLD would be candidates for whole exome sequencing. In the end, what matters is to make the right diagnosis for each patient so that the best treatment or supportive care can be selected.


ResearchBlogging.org

Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, & Lifton RP (2009). Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences of the United States of America, 106 (45), 19096-101 PMID: 19861545

Thursday, November 19, 2009

Three Blows Against the Tyranny of Expensive Experiments

Second generation sequencing is great, but one of its major issues so far is that the cost of a single experiment is quite steep. Just looking at reagents, going from a ready-to-run library to sequence data is somewhere in the neighborhood of $10K-25K on 454, Illumina, Helicos or SOLiD (I'm willing to take corrections on these values, though they are based on reasonable intelligence). While in theory you can split this cost over multiple experiments by barcoding, that can be very tricky to arrange. Perhaps if core labs would start offering '1 lane of Illumina - Buy It Now!' on eBay the problem could be solved, but finding a spare lane isn't easy.

This issue manifests itself in other ways. If you are developing new protocols anywhere along the pipeline, your final assay is pretty expensive, making it challenging to iterate cheaply. I've heard rumors that even some of the instrument makers feel inhibited in process development. It can also make folks a bit gun shy; Amanda heard first hand tonight from someone lamenting a project stymied under such circumstances. Even for routine operations, the methods of QC are pretty inexact, insofar as they don't really test whether the library is any good, just whether some bulk property (size, PCRability, quantity) is within a spec. This huge atomic cost is also a huge barrier to utilization in a clinical setting; does the clinician really want to wait some indefinite amount of time until enough patient samples are queued to make the cost per sample reasonable?

Recently, I've become aware of three hopeful developments on this front. The first is the Polonator, which according to Kevin McCarthy has a consumable cost of only about $500 per run (post library construction). $500 isn't nothing to risk on a crazy idea, but it sure beats $10K. There aren't many Polonators around, but for method development in areas such as targeted capture it would seem like a great choice.

Today, another shoe fell. Roche has announced a smaller version of the 454 system, the GS Junior. While the instrument cost wasn't announced, it will supposedly generate 1/10th as much data (35+ Mb from 100K reads with 400 Q20 bases) for the same cost per basepair, suggesting that the reagent cost for a run will be in the neighborhood of $2.5K. Worse than what I described above, but rather intriguing. This is a system that may have a good chance of making clinical inroads; $2.5K is a bit steep for a diagnostic but not ridiculous -- or you simply need to multiplex fewer samples to get the cost per sample decent. The machine is going to boast 400+ bp reads, playing to the current comparative strength of the 454 chemistry. While I doubt anyone would buy such a machine solely as an upfront QC for SOLiD or Illumina, with some clever custom primer design one probably could make libraries usable on 454 plus one other platform.

It's an especially auspicious time for Roche to launch their baby 454, as Pacific Biosciences released some specs through GenomeWeb's In Sequence, and from what I've been able to scrounge (I can't quite talk myself into asking for a subscription) this is going to put some real pressure across the market, particularly on 454. The key specs I can find are a per-run cost of $100, which will get you approximately 25K-30K reads of 1.5Kb each -- or around 45Mb of data. It may also be possible to generate 2X the data for nearly the same cost; apparently the reagents packed with one cell are really good for two runs in series. Each cell takes 10-15 minutes to run (at least in some workflows) and the instrument can be loaded up with 96 of them to be handled serially. This is a similar ballpark to what the GS Junior is being announced with, though with fewer reads but longer read lengths. I haven't been able to find any error rate estimates or the instrument cost. I'll assume, just because it is new and single molecule, that the error rate will give Roche some breathing room.
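Taking the announced numbers at face value (and they could easily shift before launch), the per-run yield and cost per megabase work out roughly as follows:

def run_stats(reads, read_len_bp, cost_per_run):
    yield_mb = reads * read_len_bp / 1e6
    return yield_mb, cost_per_run / yield_mb

for name, reads, length, cost in [
    ("GS Junior", 100000, 400, 2500),   # 100K reads x 400 Q20 bases, ~$2.5K reagents
    ("PacBio cell", 27500, 1500, 100),  # midpoint of 25K-30K reads x 1.5Kb, $100/run
]:
    mb, per_mb = run_stats(reads, length, cost)
    print("%s: ~%.0f Mb per run, ~$%.0f per Mb" % (name, mb, per_mb))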

But in general, PacBio looks set to really grab the market where long reads, even noisy ones, are valuable. One obvious use case is transcriptome sequencing to find alternative splice forms. Another would be to provide 1.5Kb scaffolds for genome assembly; what I've found also suggests PacBio will offer a 'strobe sequencing' mode akin to Helicos' dark filling technology, a means to get widely spaced sequence islands. This might provide scaffolding information over much larger fragments. 10Kb? 20Kb? And again, though you probably wouldn't buy the machine just for this, at $100/run it looks like a great way to QC samples going into other systems. Imagine checking a library after initial construction, then after performing hybridization selection and then after another round of selection! After all, the initial PacBio instrument won't be great for really deep sequencing. It appears it would be $5K-10K to get approximately 1X coverage of a mammalian genome -- but likely with a high error rate.

The ability to easily sequence 96 samples at a time (though it isn't clear what sample prep will entail) does suggest some interesting possibilities. For example, one could do long survey sequencing of many bacterial species, with each well yielding 10X coverage of an E.coli-sized genome (a lot of bugs are this size or smaller). The data might be really noisy, but for getting a general lay-of-the-land it could be quite useful -- perhaps the data would be too noisy to tell which genes were actually functional vs. decaying pseudogenes, but you would be able to ask "what is the upper bound on the number of genes of protein family X in genome Y". If you really need high quality sequence, then a full run (or targeted sequencing) could follow.

At $100 per experiment, the sagging Sanger market might take another hit. If a quick sample prep to convert plasmids to usable form is released, then ridiculous oversampling (imagine 100K reads on a typical 1.5Kb insert in pUC scenario!) might overcome a high error rate.

One interesting impediment which PacBio has acknowledged is that they won't be able to ramp up instrument production as quickly as they might like and will be trying to place (ration) instruments strategically. I'm hoping at least one goes to a commercial service provider or a core lab willing to solicit outside business, but I'm not going to count on it.

Will Illumina & Life Technologies (SOLiD) try to create baby sequencers? Illumina does have a scheme to convert their array readers to sequencers, but from what I've seen these aren't expected to save much on reagents. Life does own the VisiGen technology, which is apparently similar to PacBio's but hasn't yet yielded a real proof-of-concept paper -- at least none that I could find; their key patent has issued, though -- reading material for another night.