Omics! Omics!: January 2010

Friday, January 29, 2010

Whither science museums?

Last week at this time I was surviving a terrible electrical storm -- tremendous cracks and crackles all around me. Luckily, it was just the lightning show at the Boston Museum of Science, featuring the world's largest Van de Graaff generator and a pair of huge Tesla coils and other sparking whatnots. TNG was part of a huge overnight group.

I love the MoS and always enjoy a trip there. But it isn't hard to go there and wonder what the future of such museums is and where the current management is taking them.

Exhibit #1: The big event at the MoS right now is the traveling Harry Potter show. We of course got tickets and lumped on the audio tour. It's great fun, especially if you enjoyed the movies (it is mostly movie props & costumes), but makes absolutely no pretensions of having even a veneer of science. I've seen movie-centric exhibits at science museums before and they usually try to at least portray how movie technology works. None of that here, other than an occasional remark on the audio tour. Clearly, this is a money maker first and a big draw. But do such exhibits represent a dangerous distraction from the mission of a science museum? Would such a show be more appropriate in an art museum (it apparently was in the art museum in Chicago)?

Another exhibit could also be described as more art than science -- but is that necessarily a bad thing? It was an exhibit of paintings by an artist who attempts to portray large numbers to make them comprehensible. The exhibit was sparse -- a small number of paintings in a large space, and they really didn't do much for me. Many had a pointillist style -- the dots summing to whatever the number was. One even aped Sunday in the Park. But were they really effective?

Art and science are overlapping domains, so please don't lump me as a philistine trying to keep art out. But I still think there are better ways to host art in a science museum. An in-depth exhibit on detecting forgeries or verifying provenance would be an obvious example. The exhibit on optical illusions features a number of artworks which can be viewed in multiple ways.

We actually slept in an exhibit called "Science in the Park", which has a number of interactive exhibits illustrating basic physics concepts. It is certainly popular with the kids, but sometimes you wonder if they are actually extracting anything from it. Could such an exhibit be too fun? One example is a pair of swings of different lengths, to illustrate the properties of pendulums. I saw a lot of swinging, but rarely would a child try both. The exhibit also illustrated a serious challenge with interactive exhibits: durability. The section to illustrate angular momentum was fatally crippled by worn out bearings -- the turntable on which to spin oneself could barely allow 2 rotations. Two exhibits using trolleys looked pretty robust -- but then again some kids were slamming them along the track with all their might.

The lightning show is impressive. I think most kids got a good idea of the skin effect (allowing the operator to sit in a cage being tremendously zapped by the monster Van de Graaff). But how much did they take away? How much can we expect?

Some of the exhibits at the museum predate me by a decade or more. There are some truly ancient animal dioramas that don't seem to get much attention. Another exhibit that is quite old -- but has worn well -- is the Mathematica exhibit. There were only two obvious updates (a computer-generated fractal mountain and an addendum to the wall of famous mathematicians). It has a few interactive items, and most were working (alas, the Moebius strip traverser was stuck).

One of the treats on Saturday morning was a movie in the Omnimax screen, a documentary on healthy and sick coral reefs in the South Pacific. It's an amazing film, and does illustrate one way to really pack a punch with photos. While the photos of dying and dead reefs are sobering, to me the most stunning photo was of a river junction in Fiji. One river's deep brown (loaded with silt eroded from upstream logging) flowed into another's deep blue. I grew up on National Geographic Cousteau specials, and could also appreciate the drama of one of the filmmakers' grim brush with the bends.

One last thought: late last fall I stumbled on a very intriguing exhibit, though it is unlikely to be the destination of any school field trip. It's the lobby of the Broad Institute, and they have a set of displays aimed at the street outside which both explain some of the high-throughput science methods being used and show data coming off the instruments in real time (some cell phone snapshots below). The Broad is just over a mile from the MoS and there's probably a really fat pipe between them -- it would be great to see these exhibits replicated where large crowds might see them and perhaps be inspired.

(01 Feb 2010 -- fixed stupid typo in title & added question mark)

Thursday, January 28, 2010

A little more Scala

I can't believe how thrilled I was to get a run-time error today! Because that was the first sign I had gotten past the Scala roadblock I mentioned in my previous post. It would have been nicer for the case to just work, but apparently my SAM file was incomplete or corrupt. But, moments later it ran correctly on a BAM file. For better or worse, I deserve nearly no credit for this step forward -- Mr. Google found me a key code example.

The problem I faced is that I have a Java class (from the Picard library for reading alignment data in SAM/BAM format). To get each record, an iterator is provided. But my first few attempts to guess the syntax just didn't work, so it was off to Google.

My first working version is


package hello
import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools
import net.sf.samtools.SAMFileReader

object HelloWorld extends Application {

val samFile=new File("C:/workspace/short-reads/aln.se.2.sorted.bam")
val inputSam=new SAMFileReader(samFile)
var counter=0

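// walk the Java iterator by hand, counting records
val recs=inputSam.iterator
while (recs.hasNext)
{
  val samRec=recs.next
  counter=counter+1
}

// concatenate rather than pass two arguments, which would print a tuple
println("records: "+counter)
}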

Ah, sweet success. But, while that's a step forward it doesn't really play with anything novel that Scala lends me. The example I found this in was actually implementing something richer, which I then borrowed (same imports as before).

First, I define a class which wraps an iterator and defines a foreach method:


class IteratorWrapper[A](iter:java.util.Iterator[A])
{
    def foreach(f: A => Unit): Unit = {
        while(iter.hasNext){
          f(iter.next)
        }
    }
}

Second, within the body of my object I define a rule which allows iterators to be automatically converted to my wrapper object. Now, this sounds powerfully dangerous (and vice versa). A key constraint is that Scala won't do this if there is any ambiguity -- if more than one conversion could legally apply, it won't work. Finally, I rewrite the loop using the foreach construct.


object HelloWorld extends Application {
 implicit def iteratorToWrapper[T](iter:java.util.Iterator[T]):IteratorWrapper[T] = new IteratorWrapper[T](iter)

val samFile=new File("C:/workspace/short-reads/aln.se.2.sorted.bam")
val inputSam=new SAMFileReader(samFile)
var counter=0

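// the implicit conversion above supplies foreach, so the Java iterator can drive a for loop
val recs=inputSam.iterator
for (samRec<-recs)
  {
    counter=counter+1
  }
println("records: "+counter)
}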

Is this really better? Well, I think so -- for me. The code is terse but still clear. This also saves a lot of looking up some standard wordy idioms -- for some reason I never quite locked in the standard read-lines-one-at-a-time loop in C# -- always had to copy an example.
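
For comparison, here is a minimal sketch of the read-lines-one-at-a-time idiom in Scala, using the standard library's scala.io.Source. The object and method names (LineCounter, countLines) are made up for illustration, and an in-memory string stands in for a real file so the sketch runs anywhere.

```scala
import scala.io.Source

object LineCounter {
  // Count the non-empty lines delivered by any Source (file, string, URL).
  def countLines(src: Source): Int = {
    var counter = 0
    for (line <- src.getLines()) {
      if (line.nonEmpty) counter += 1
    }
    counter
  }

  def main(args: Array[String]): Unit = {
    // Source.fromString stands in for Source.fromFile on a real path.
    val src = Source.fromString("first\nsecond\n\nthird\n")
    println("lines: " + countLines(src))
    src.close()
  }
}
```

The loop has the same shape as the SAM record loop above: the for expression only needs something that supplies foreach.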

You can take some of this a bit far in Scala -- the syntax allows a lot of flexibility and some of the examples in the O'Reilly book are almost scary. I probably once would have been inspired to write my own domain specific language within Scala, but for now I'll pass.

Am I taking a performance hit with this? Good question -- I'm sort of trusting that the Scala compiler is smart enough to treat this all as syntactic sugar, but for most of what I do performance is well behind readability and ease of coding & maintenance. Well, until the code becomes painfully slow.

I don't have them in front of me, but I can think of examples from back at Codon where I wanted to treat something like an iterator -- especially a strongly typed one. C# does let you use for loops using anything which implements the IEnumerable interface, but it can get tedious to wrap everything up when using a library class which I think should implement IEnumerable but the designer didn't.
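
On a current Scala (2.13 or later, which is an assumption since it postdates this post), the standard library ships a ready-made version of this wrapping trick in scala.jdk.CollectionConverters. The sketch below uses a java.util.ArrayList as a stand-in for a library class that only hands out a Java iterator; JavaIteratorDemo and countAll are hypothetical names.

```scala
import scala.jdk.CollectionConverters._

object JavaIteratorDemo {
  // asScala views a java.util.Iterator as a Scala Iterator, so it can drive a for loop.
  def countAll[A](iter: java.util.Iterator[A]): Int = {
    var counter = 0
    for (_ <- iter.asScala) counter += 1
    counter
  }

  def main(args: Array[String]): Unit = {
    // Stand-in data; a real program would get its iterator from a library class.
    val reads = new java.util.ArrayList[String]()
    reads.add("read1")
    reads.add("read2")
    reads.add("read3")
    println("records: " + countAll(reads.iterator()))
  }
}
```

Unlike the implicit def, the conversion here is requested explicitly with asScala, which makes it easier to see where it happens.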

I still have some playing to do, but maybe soon I'll put something together that I didn't have code to do previously. That would be a serious milestone.

Wednesday, January 27, 2010

The Scala Experiment

Well, I've taken the plunge -- yet another programming language.

I've written before about this. It's also a common question on various professional bioinformatics discussion boards: what programming language?

It is a decent time to ponder some sort of shift. I've written a bit of code, but not a lot -- partly because I've been more disciplined about using libraries as much as possible (versus rolling my own) but mostly because coding is a small -- but critical -- slice of my regular workflow.

At Codon I had become quite enamored with C#. Especially with the Visual Studio Integrated Development Environment (IDE), I found it very productive and a good fit for my brain & tastes. But, as a bioinformatics language it hasn't found much favor. That means no good libraries out there, so I must build everything myself. I've knocked out basic bioinformatics libraries a number of times (read FASTA, reverse complement a sequence, translate to protein, etc), but I don't enjoy it -- and there are plenty of silly mistakes that can be easy to make but subtle enough to resist detection for an extended period. Plus, there are other things I really don't feel like writing -- like my own SAM/BAM parser. I did have one workaround for this at Codon -- I could tap into Python libraries via a package called Python.NET, but it imposed a severe performance penalty & I would have to write small (but annoying) Python glue code. The final straw is that I'm finding it essential to have a Linux (Ubuntu) installation for serious second-generation sequencing analysis (most packages do not compile cleanly -- if at all -- in my hands on a Windows box using MinGW or Cygwin).
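
To give a feel for the kind of routine that gets rewritten over and over, here is a rough sketch of reverse complement in Scala. It is not taken from any library; SeqUtils, complementOf and reverseComplement are illustrative names, and the choice to reject anything other than A, C, G, T and N is an assumption made to keep it short.

```scala
object SeqUtils {
  // Complement table for unambiguous bases plus N, in both cases.
  private val complement: Map[Char, Char] = Map(
    'A' -> 'T', 'C' -> 'G', 'G' -> 'C', 'T' -> 'A', 'N' -> 'N',
    'a' -> 't', 'c' -> 'g', 'g' -> 'c', 't' -> 'a', 'n' -> 'n'
  )

  // Look up one base; an unexpected symbol fails loudly instead of passing through.
  private def complementOf(base: Char): Char =
    complement.getOrElse(base,
      throw new IllegalArgumentException("unexpected base: " + base))

  // Reverse the sequence, then complement each base.
  def reverseComplement(seq: String): String =
    seq.reverse.map(base => complementOf(base))

  def main(args: Array[String]): Unit = {
    println(reverseComplement("ATGC"))
  }
}
```

Failing on unknown symbols is one guard against the subtle kind of mistake mentioned above, where a bad character silently survives into the output.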

The obvious fallback is Perl -- which is exactly how I've fallen so far. I'm very fluent with it & the appropriate libraries are out there. I've just gotten less and less fond of the language & its many design kludges (I haven't quite gotten to my brother's opinion: Perl is just plain bad taste). I lose a lot of time with stupid errors that could have been caught at compile time with more static typing. It doesn't help I have (until recently) been using the Perl mode in Emacs as my IDE -- once you've used a really polished tool like Visual Studio you realize how primitive that is.

Other options? There's R, which I must use for certain projects (microarrays) due to the phenomenal set of libraries out there. But R just has never been an easy fit for me -- somehow I just don't grok it. I did write a little serious Python (i.e. not just glue code) at Codon & I could see myself getting into it if I had peers also working in it -- but I don't. Infinity, like many company bioinformatics groups, is pretty much a C# shop though with ecumenical attitudes towards any other language. I've also realized I need a basic comprehension of Ruby, as I'm starting to encounter useful code in that. But, as with Python I can't seem to quite push myself to switch over -- it doesn't appeal to me enough to kick the Perl habit.

While playing around with various second generation sequencing analysis tools, I stumbled across a bit of weird code in the Broad's Genome Analysis ToolKit (GATK) -- a directory labeled "scala". Turns out, that's yet another language -- and one that has me intrigued enough to try it out.

My first bit of useful code (derived from a Hello World program that I customized by having it output in canine) is below and gives away some of the intriguing features. This program goes through a set of ABI trace files that fit a specific naming convention and writes out FASTA of their sequences to STDOUT:


package hello
import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools
object HelloWorld extends Application {

  for (i <- 1 to 32)
  {
    val lz = new java.text.DecimalFormat("00")
    var primerSuffix="M13F(-21)"
    val fnPrefix="C:/somedir/readprefix-"
    if (i>16) primerSuffix="M13R"
    val fn=fnPrefix+lz.format(i)+"-"+primerSuffix+".ab1"
    val traceFile=new File(fn)
    val name = traceFile.getName()
    val trace = new ABITrace(traceFile)
    val symbols = trace.getSequence()
    val seq=new SimpleSequence(symbols,name,name,Annotation.EMPTY_ANNOTATION)
    SeqIOTools.writeFasta(System.out, seq);
  }
}

A reader might ask "Wait a minute? What's all this java.this and biojava.that in there?". This is one of the appeals of Scala -- it compiles to Java Virtual Machine bytecode and can pretty much freely use Java libraries. Now, I mentioned this to a colleague and he pointed out there is Jython (Python to JVM compiler) which reminded me of a reference to JRuby (Ruby to JVM compiler). So, perhaps I should revisit my skipping over those two languages. But in any case, in theory Scala can cleanly drive any Java library.

The example also illustrates something that I find a tad confusing. The book keeps stressing how Scala is statically typed -- but I didn't type any of my variables above! However, I could have -- so I can get the type safety I find very useful when I want it (or hold myself to it -- it will take some discipline) but can also ignore it in many cases.
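
A small sketch of what is going on: the compiler infers a static type for every val even when none is written, and an explicit annotation is optional. The names here (TypeInferenceDemo and its fields) are purely illustrative.

```scala
object TypeInferenceDemo {
  val inferred = 42              // no annotation: the compiler infers Int
  val explicit: Int = 42         // same type, spelled out
  val name = "read" + inferred   // inferred as String

  // Uncommenting the next line gives a compile-time type error, not a run-time one:
  // val broken: Int = "forty-two"

  def main(args: Array[String]): Unit = {
    println(name)
  }
}
```

So the untyped variables in the trace-file program are still statically typed; the types were just never typed in.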

Scala has a lot in it, most of which I've only read about in the O'Reilly book & haven't tried. It borrows from both the Object Oriented Programming (OOP) lore and Functional Programming (FP). OOP is pretty much old hat, as most modern languages are OO and if not (e.g. Perl) the language supports it. Some FP constructs will be very familiar to Perl programmers -- I've written a few million anonymous functions to customize sorting. Others, perhaps not so much. And, like most modern languages all sorts of things not strictly in the language are supplied by libraries -- such as a concurrency model (Actors) that shouldn't be as much of a swamp as trying to work with threads (at least when I tried to do it way back yonder under Java). Scala also has some syntactic flexibility that is both intriguing and scary -- the opportunities for obfuscating code would seem endless. Plus, you can embed XML right in your file. Clearly I'm still at the "look at all these neat gadgets" phase of learning the language.
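
As a rough illustration of the sorting case, here is how the anonymous-comparator habit from Perl might look in Scala, using the standard sortWith and sortBy methods on List. The read strings and the name SortDemo are made up for the example.

```scala
object SortDemo {
  val reads = List("ACGTACGT", "AC", "ACGTA")

  // Anonymous function as comparator: longest sequence first.
  val byLengthDesc = reads.sortWith((a, b) => a.length > b.length)

  // Key-extraction form: shortest sequence first.
  val byLength = reads.sortBy(_.length)

  def main(args: Array[String]): Unit = {
    println(byLengthDesc.mkString(", "))
    println(byLength.mkString(", "))
  }
}
```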

Is it a picnic? No, clearly not. My second attempt at a useful Scala program is a bit stalled -- I haven't figured out quite how to rewrite a Java example from the Picard (Java implementation of SAMTools) library into Scala -- my tries so far have raised errors. Partly because the particular Java idiom being used was unfamiliar -- if I thought Scala was a way to avoid learning modern Java, I've quite deluded myself. And, I did note that tonight when I had something critical to get done on my commute I reached for Perl. There are still a lot of idioms I need to relearn -- constructing & using regular expressions, parsing delimited text files, etc. Plus, it doesn't help that I'm learning a whole new development environment (Eclipse) virtually simultaneously -- though there is Eclipse support for all of the languages it looks like I might be using (Java, Scala, Perl, Python, Ruby), so that's a good general tool to have under my belt.

If I do really take this on, then the last decision is how much of my code to convert to Scala. I haven't written a lot of code -- but I haven't written none either. Some just won't be relevant anymore (one offs or cases where I backslid and wrote code that is redundant with free libraries) but some may matter. It probably won't be hard to just do a simple transformation into Scala -- but I'll probably want to go whole-hog and show off (to myself) my comprehension of some of the novel (to me) aspects of the language. That would really up the ante.

Thursday, January 21, 2010

A plethora of MRSA sequences

The Sanger Institute's paper in Science describing the sequencing of multiple MRSA (methicillin-resistant Staphylococcus aureus) genomes is very nifty and demonstrates a whole new potential market for next-generation sequencing: the tracking of infections in support of better control measures.

MRSA is a serious health issue; a friend of mine's relative is battling it right now. MRSA is commonly acquired in health care facilities. Further spread can be combated by rigorous attention to disinfection and sanitation measures. A key question is when MRSA shows up, where did it come from? How does it spread across a hospital, a city, a country or the world?

The gist of the methodology is to grow isolates overnight in the appropriate medium and extract the DNA. Each isolate is then converted into an Illumina library, with multiplex tags to identify it. The reference MRSA strain was also thrown in as a control. Using the GAII as they did, they packed 12 libraries onto one run -- over 60 isolates were sequenced for the whole study. With increasing cluster density and the new HiSeq instrument, one could imagine 2-5 fold (or perhaps greater) packing being practical; i.e. the entire study might fit on one run.

The library prep method sounds potentially automatable -- shearing on the Covaris instrument, cleanup using a 96 well plate system, end repair, removal of small (<150nt) fragments with size exclusion beads, A-tailing, another 150nt filtering by beads, adapter ligation, another 150nt filtering, PCR to introduce the multiplexing tags, another filtering for <150nt, quantitation and then pooling. Sequencing was "only" 36nt single end, with an average of 80Mb. Alignment to the reference genome was by ssaha (a somewhat curious choice, but perhaps now they'd use BWA) and SNP calling with ssaha_pileup; non-mapping reads were assembled with velvet and did identify novel mobile element insertions. According to GenomeWeb, the estimated cost was about $320 per sample. That's probably just a reagents cost, but gives a ballpark figure.
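
A back-of-envelope sketch of what those figures imply per isolate. Two inputs are assumptions rather than numbers from the paper or this post: a genome size of roughly 2.8 Mb for S. aureus, and exactly 60 isolates (the post only says "over 60"). CoverageEstimate is an illustrative name.

```scala
object CoverageEstimate {
  val basesPerSample = 80e6   // "an average of 80Mb" per isolate
  val readLength = 36         // single-end read length
  val genomeSize = 2.8e6      // ASSUMPTION: approximate S. aureus genome size
  val costPerSample = 320.0   // GenomeWeb estimate quoted above
  val isolates = 60           // ASSUMPTION: the post says "over 60"

  val foldCoverage = basesPerSample / genomeSize     // average depth per isolate
  val readsPerSample = basesPerSample / readLength   // reads behind that depth
  val totalCost = costPerSample * isolates           // reagent-level cost for the study

  def main(args: Array[String]): Unit = {
    println(f"coverage: $foldCoverage%.1fx, reads: $readsPerSample%.0f, cost: $$$totalCost%.0f")
  }
}
```

Under these assumptions each isolate gets a bit under 30-fold coverage from a little over two million reads, and the reagent bill for the whole study lands just under $20K.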

Existing typing methods either look at SNPs or specific sequence repeats, and while these often work they sometimes give conflicting information and other times lack the power to resolve closely related isolates. Having high resolution is important for teasing apart the history of an outbreak -- correlating patient isolates with samples obtained from the environment (such as hospital floors & such).

Phylogenetic analysis using SNPs in the "core genome" showed a strong pattern of geographical clustering -- but with some key exceptions, suggesting intercontinental leaps of the bug.

Could such an approach become routine for infection monitoring? A fully-loaded cost might be closer to $20K per experiment or higher. With appropriate budgeting, this can be balanced against the cost of treating an expanding number of patients and providing expensive support (not to mention the human misery involved). Full genome sequencing might also not always be necessary; targeted sequencing could potentially allow packing even more samples onto each run. Targeted sequencing by PCR might also enable eliding the culturing step. Alternatively, cheaper (and faster; this is still a multi-day Illumina run) sequencers might be used. And, of course, this can easily be expanded to other infectious diseases with important public health implications. For those that are expensive or slow to grow, PCR would be particularly appropriate.

It is also worth noting that we're only about 15 years since the first bacterial genome was sequenced. Now, the thought of doing hundreds a week is not at all daunting. Resequencing a known bug is clearly bioinformatically less of a challenge, but still how far we've come!

Simon R. Harris, Edward J. Feil, Matthew T. G. Holden, Michael A. Quail, Emma K. Nickerson, Narisara Chantratita, Susana Gardete, Ana Tavares, Nick Day, Jodi A. Lindsay, Jonathan D. Edgeworth, Hermínia de Lencastre, Julian Parkhill, Sharon J. Peacock, & Stephen D. Bentley (2010). Evolution of MRSA During Hospital Transmission and Intercontinental Spread. Science, 327 (5964), 469-474. DOI: 10.1126/science.1182395

Wednesday, January 20, 2010

I'm definitely not volunteering for sample collection on this project

My post yesterday was successful at narrowing down the identity of the mystery creature to a spider of the genus Araneus. Another victory for crowd sourcing! It also points out the value of questioning your assumptions & conclusions. My initial thought on looking at the photos was that the body plan looked spider-like and details of the head and abdomen sometimes pointed me that way -- but then the lack of eight legs convinced me it must be an insect (I thought I could count six in some photos).

One of the commenters addressed my mystery vine with a suggestion that underscored a key detail I inadvertently left out. The vine raised welts on my legs when it grabbed me -- not in some exaggerated sense, but it definitely had some sort of fine projections (trichomes?) which were not what I'd call thorns, but which stuck to me (and my clothes) like velcro & left behind the small red welts.

The commenter asked if it could have been poison ivy, and I must confess a chuckle. I know that stuff! Boy do I! I've gotten that rash on most of both legs at one point, and all over my neck and arms another time. Numerous cases on my hands over time. It's definitely a very different rash -- much slower to come on, much more itchy with huge welts and very slow to disappear. My clinging plant's rash was short lived.

Poison ivy is easy to identify too. It's truly simple. First, you can start with the old saw "leaves of three, let it be". But, a lot of plants have leaves divided into three parts. Some are yummy: raspberries and strawberries. Others are pretty: columbines. And far more. Also, sometimes it's hard to count -- what else could explain the common confusion of Virginia creeper (5-7 divisions) with poison ivy?

Some other key points which make identification simple:
Habitat: Woods, fields, lawns, gardens, roadsides, parking lot edges. Haven't seen it grow in standing water, but I wouldn't rule it out
Size: Tiny plantlets; vines climbing trees for 10+ meters
Habit: Individual plantlets, low growing weed, low growing shrub, climbing vine
Color: Generally dark green, except when not. Red in fall, except when not
Sun: Deep shade to complete sun

Now, this is the pattern where I grew up; my parents joke that it is the county flower. Here in Massachusetts, I don't see quite as much of it and it is very patchy. Wet areas are favorites, but there are also huge stands in non-wet areas. Landward faces of beach dunes are a spot to really watch out for it; Cape Cod is seriously infested.

While it causes humans great angst, poison ivy berries are an important wildlife food source. Indeed, at our previous house I had to be vigilant for sprouts along the flyways into my bird feeder. I've never seen Bambi or Thumper covered with ugly red welts; I'm not sure of the actual taxonomic range of the reaction to the poison (urushiol). Could it really be restricted to humans?

So, here you have a plant which thrives in many ecological niches, is important ecologically, is a modest to major pest (inhalation of urushiol is quite dangerous & a hazard for wildfire fighters), makes an interesting secondary metabolite, and is related to at least two economically important plants (mangoes and cashews, which should be eaten with care by persons with high sensitivity to urushiol). Sounds like a good target for genome sequencing!

Tuesday, January 19, 2010

Green Krittah

As a youth, I was fortunate enough to spend four summers working as a camp counselor. Three of those were spent in the Nature Department, performing all sorts of environmental education functions (one year I taught archery).

One of my enjoyable duties was to wander the wilds of the camp looking for interesting living organisms to put in our terraria and aquaria, a job I relished -- though the campers could often top me for interesting (I never could find a stick insect, but we rarely lacked for one).

During one of those forays, most likely in my favorite sport of hand-catching of frogs, I stumbled onto a patch of a plant I did not recognize. I still remember it rather well -- perhaps because of the small itchy welts it raised on my unprotected legs. No, not nettles (we had plenty of those, and I often stumbled into them when focused on froggy prey), but rather a vining plant with light green triangular leaves about 5 cm on a side. Indeed, the leaves were nearly equilateral.

So, I took a sample back and pored through our various guides. Now, in an ideal world you would have a guide to every living plant expected in that corner of southeastern Pennsylvania -- which might be a tad tricky as we were on the edge of an unusual geologic formation (the Serpentine Barrens) which has unusual flora. But in general, such general plant guides are hard (or impossible) to come by. What I did have was a great book of trees -- but this wasn't a tree. I had several good wildflower books -- but I saw no blooms. I couldn't find it in the edible wild plant books -- so it was neither edible nor likely to be mistaken for one (rule #1 of wild foraging: never EVER collect "wild parsnips" -- if you're wrong they're probably one of several species which will kill you; if you are right and handle them incorrectly the books say you'll get a vicious rash).

But, being a bit stubborn, I didn't give up -- I kicked the problem upstairs. Partly, this was curiosity -- and partly dreams of being the discoverer of some exotic invader. I mailed a carefully selected sample of the plant along with a description to the county agricultural agent. At the end of the summer, I contacted him (I think by dropping in on his office). He was polite -- but politely stumped as well. So much for the experts.

I'm remembering this & relating it because I've come into a similar situation. My father, who is no slouch in the natural world department, spotted this "green krittah" (as he has named it) on his car last summer. Not recognizing it, and wanting to preserve it (plus, he is an inveterate shutterbug), he shot many closeups of it (the red bar, if I remember correctly, is about 5 mm) -- indeed, being ever the one to document his work he shot the below picture of his photo setup (the krittah is that tiny spot on the car). He even ran it past my cousin the retired entomologist, but he too protested overspecialization -- if it wasn't a pest for the U.S. Navy, he wouldn't know it. At our holiday celebration he asked if I recognized it. I didn't, but given this day of the Internet & my routine use of search tools, surely I could get an answer?

Surely not (so far). I've tried various image-based searches (which were uniformly awful). I've tried searching various descriptions. No luck. Most maddening was a very poetic description on a question-and-answer site, seemingly my krittah but far more imaginative than I would have ever cooked up: "What kind of spider has a lady's face marking it's back? Name spider lady's face markings and red or yellow almond shaped eyes?". Unfortunately, the answer given is a mixture of non sequitur and incoherence -- plus I'm pretty much certain this krittah has 6 legs, not 8. Attempts to wade through image searches were hindered by too few images per gallery and far too many completely useless photos -- of entomologists, of VWs, of spy gear & rock bands & other stuff. I've even tried one online "submit your bug" sort of site, but heard nothing.

I've also tried various insect guides online. The ones I have found are based on the time-tested scheme of dichotomous keys. Each step in the key is a simple binary question, and based on the answer you go to one of two other steps in the key. A great system -- except when it isn't. For one thing, I discovered at least one bit of entomological terminology I didn't know -- so I checked both branches. That isn't too bad -- but suppose I hit more? Or, suppose I answer incorrectly -- or am not sure. It took a lot of looking at Dad's fine photos to absolutely convince myself that the subject has only 6 legs (insect) and not 8 (mite or spider). It also doesn't help that some features (such as the spots on the back) appear differently in different photos. More seriously, the keys I found almost immediately are clearly assuming you are staring at a mature insect -- if you are looking at some sort of larva they will be completely useless. So perhaps I'm looking at an immature form -- and the key will not help any. In any case, the terminal leaves of the keys I found were woefully underpopulated.
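
For the programmers in the audience, a dichotomous key is just a binary tree, and "check both branches when unsure" is easy to sketch. Everything in the toy key below (the questions, the taxa, the names DichotomousKey and identify) is invented for illustration and is not real entomology.

```scala
object DichotomousKey {
  sealed trait Node
  case class Question(text: String, yes: Node, no: Node) extends Node
  case class Taxon(name: String) extends Node

  // Walk the key. A question missing from `answers` means "not sure",
  // so both branches are followed and every reachable taxon is returned.
  def identify(node: Node, answers: Map[String, Boolean]): List[String] = node match {
    case Taxon(name) => List(name)
    case Question(text, yes, no) =>
      answers.get(text) match {
        case Some(true)  => identify(yes, answers)
        case Some(false) => identify(no, answers)
        case None        => identify(yes, answers) ++ identify(no, answers)
      }
  }

  // Toy key, purely hypothetical.
  val key: Node = Question("eight legs?",
    Question("spins a web?", Taxon("spider"), Taxon("mite")),
    Question("wings?", Taxon("winged insect"), Taxon("wingless insect or larva")))

  def main(args: Array[String]): Unit = {
    // Unsure about the leg count, sure there are no wings.
    println(identify(key, Map("wings?" -> false)))
  }
}
```

The weakness described above shows up directly: every unanswered question can double the number of branches to check, and a wrong answer near the root discards the correct leaf entirely.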

What I would wish for now is a modern automated sketch artist slash photo array. It would ask me questions and I could answer each one yes, no or maybe -- and even if I answered yes it wouldn't rule anything out. With each question the photo array would update -- and I could also say "more like that photo" or "nothing like that photo".
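
One way such a tool might work, sketched very loosely: score every candidate against the answers and re-rank, rather than pruning. The scoring rule (plus one for agreement, minus one for disagreement, nothing for maybe) and all the names and traits below are assumptions for illustration only.

```scala
object FuzzyGuide {
  sealed trait Answer
  case object Yes extends Answer
  case object No extends Answer
  case object Maybe extends Answer

  // Hypothetical candidate: a name plus the features it is known to show.
  case class Candidate(name: String, traits: Set[String])

  // Agreement adds a point, disagreement subtracts one, Maybe is neutral.
  def score(c: Candidate, answers: Map[String, Answer]): Int =
    answers.foldLeft(0) { case (total, (feature, answer)) =>
      answer match {
        case Maybe => total
        case Yes   => if (c.traits(feature)) total + 1 else total - 1
        case No    => if (c.traits(feature)) total - 1 else total + 1
      }
    }

  // Best match first; no candidate is ever dropped from the list.
  def rank(cs: List[Candidate], answers: Map[String, Answer]): List[String] =
    cs.sortBy(c => -score(c, answers)).map(_.name)

  def main(args: Array[String]): Unit = {
    val candidates = List(
      Candidate("beetle", Set("six legs", "wings")),
      Candidate("spider", Set("eight legs", "spotted back")))
    val answers = Map[String, Answer](
      "eight legs" -> Maybe, "spotted back" -> Yes, "wings" -> No)
    println(rank(candidates, answers))
  }
}
```

Because a wrong or uncertain answer only shifts the ranking, the right candidate can still surface later, which is the property the dichotomous keys lack.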

Of course, Dad could have sacrificed the sample for what might seem the obvious approach -- DNA knows all, DNA tells all. That would have nailed it (much as some high school students recently used DNA barcoding to find a new cockroach in their midst), but I think neither of us would want to sacrifice something so beautiful out of pure curiosity (if confirmed to be something awful on the other hand, neither of us would hesitate). Alas, in the mid 1980's I didn't think that way, so I don't have a sample of my vine for further analysis.

If anyone recognizes the bug -- or my vine -- please leave a comment. I'm still curious about both.

(01 Feb 2010 -- corrected size marker to 5 mm, per correspondence with photographer)

Tuesday, January 12, 2010

The Array Killers?

Illumina announced their new HiSeq 2000 instrument today. There are some great summaries at Genetic Future and PolITiGenomics; read both for the whole scoop. Perhaps just as jaw dropping as some of the operating statistics on the new beast is the fact that Beijing Genome Institute has already ordered 128 of them. Yow! That's (back-of-envelope) around $100M in instruments which will consume >$30M/year in reagents. I wish I had that budget!

Illumina's own website touts not only the cost (reagents only) of $10K per human genome, but also that this works out to 200 gene expression profiles per run at $200/profile. That implies multiplexing, as there are 32 lanes on the machine (16 lanes x 2 flow cells -- or is it 32 lanes per flowcell? I'm still trying to figure this out based on the note that it images the flowcell both from the top and bottom). That also implies being able to generate high resolution copy number profiles -- which need about 0.1X coverage given published reports for similar cost.

But it's not just Illumina. If a Helicos run is $10K and it has >50 channels, then that would also suggest around $200/sample to do copy number analysis. I've heard some wild rumors about what some goosed Polonators can do now.
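
The arithmetic behind these per-sample figures, as a quick sketch. The inputs are the numbers quoted in this post, with two simplifications flagged in the comments: the 32-lane reading of the HiSeq (which the post itself is unsure of) and exactly 50 Helicos channels (the post says more than 50). PerSampleCost is an illustrative name.

```scala
object PerSampleCost {
  val profilesPerRun = 200   // Illumina's quoted expression profiles per run
  val lanesPerRun = 32       // ASSUMPTION: 16 lanes x 2 flow cells
  val profilesPerLane = profilesPerRun.toDouble / lanesPerRun

  val helicosRunCost = 10000.0   // "If a Helicos run is $10K"
  val helicosChannels = 50       // ASSUMPTION: the post says ">50"
  val helicosCostPerSample = helicosRunCost / helicosChannels

  def main(args: Array[String]): Unit = {
    println("profiles per lane: " + profilesPerLane)
    println("Helicos cost per sample: $" + helicosCostPerSample)
  }
}
```

More than six profiles per lane is what implies multiplexing on the Illumina side, while the Helicos figure comes from simply dividing the run cost across channels, one sample per channel.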

The one devil in trying to do lots of profiles is that it means making that many libraries, which is the step that everyone still groans about (particularly my vendors!). Beckman Coulter just announced an automated instrument, but it sounds like it's not a huge step forward. Of course, on the Helicos there really isn't much to do -- it's the amplification based systems that need size selection, which is one major bottleneck.

But, once the library throughput question is solved it would seem that arrays are going to be in big trouble. Of course, all of the numbers above ignore the instrument acquisition costs, which are substantial. Array costs may still be under these numbers, which for really big studies will add up. On the other hand, from what I've seen in the literature the sequencer-based info is always superior to array based -- better dynamic range, higher resolution for copy number breakpoints. Will 2010 be the year that the high density array market goes into a tailspin?

Illumina, of course, has both bases covered. Agilent has a big toe in the sequencing field, though by supplying tools around the space. But there's one obvious big array player so far MIA from the next generation sequencing space. That would seem to be a risky trajectory to continue, by anyone's metrix...

Sunday, January 10, 2010

There's Plenty of Room at the Bottom

Friday's Wall Street Journal had a piece in the back opinion section (which has items about culture & religion and similar stuff) discussing Richard Feynman's famous 1959 talk "There's Plenty of Room at the Bottom". This talk is frequently cited as a seminal moment -- perhaps the first proposition -- of nanotechnology. But, it turns out that when surveyed many practitioners in the field claim not to have been influenced by it and often to have never read it. The article pretty much concludes that Feynman's role in the field is mostly promoted by those who promote the field and extreme visions of it.

Now, by coincidence I'm in the middle of a Feynman kick. I first encountered him in the summer of 1985 when his "as told to" book "Surely You're Joking, Mr. Feynman!" was my hammock reading. The next year he would become a truly national figure with his carefully planned science demonstration as part of the Challenger disaster commission. I hadn't gone much beyond that until recently watching Infinity, which focuses on his doomed marriage (his wife would die of TB) & the Manhattan Project. Somehow, that pushed me to finally read James Gleick's biography "Genius" and now I'm crunching through "Six Easy Pieces" (a book based largely on Feynman's famous physics lecture set for undergraduates), with the actual lectures checked out as well for stuffing on my audio player. I'll burn out soon (this is a common pattern), but will gain much from it.

I had never actually read the talk before, just summaries in the various books, but luckily it is available on-line -- and makes great reading. Feynman gave the talk at the American Physical Society meeting, and apparently nobody knew what he would say -- some thought the talk would be about the physics job market! Instead, he sketched out a lot of crazy ideas that nobody had proposed before -- how small a machine could one build? How tiny could you write? Could you make small machines which could make even smaller machines and so on and so forth? He even put up two $1000 prizes:

It is my intention to offer a prize of $1,000 to the first guy who can take the information on the page of a book and put it on an area 1/25,000 smaller in linear scale in such manner that it can be read by an electron microscope.

And I want to offer another prize---if I can figure out how to phrase it so that I don't get into a mess of arguments about definitions---of another $1,000 to the first guy who makes an operating electric motor---a rotating electric motor which can be controlled from the outside and, not counting the lead-in wires, is only 1/64 inch cube.

The first prize wasn't claimed until the 1980's, but a string of cranks streamed in to claim the second one -- bringing in various toy motors. Gleick describes Feynman's eyes as "glazed over" when yet another person came in to claim the motor prize -- and an "uh oh" when the guy pulled out a microscope. It turned out that by very patient work it was possible to use very conventional technology to wind a motor that small -- and Feynman hadn't actually set aside money for the prize!

Feynman's relationship to nanotechnology is reminiscent of Mendel's to genetics. Mendel did amazing work, decades ahead of his time. He documented things carefully, but his publication strategy (a combination of obscure regional journals and sending his works to various libraries & famous scientists) failed in his lifetime. Only after three different groups rediscovered his work -- after finding much the same results -- was Mendel started on the road to scientific iconhood. Clearly, Mendel did not influence those who rediscovered him and if his work were still buried in rare book rooms, we would have a similar understanding of genetics to what we have today. Yet, we refer to genetics as "Mendelian" (and "non-Mendelian").

I hope nanotechnologists give Feynman a similar respect. Perhaps some of the terms describing his role are hyperbole ("spiritual founder"), but he clearly articulated both some of the challenges that would be encountered (for example, that issues of lubrication & friction at these scales would be quite different) and why we needed to address them. For example, he pointed out that the computer technology of the day (vacuum tubes) would place inherent performance limits on computers -- simply because the speed of light would limit the speed of information transfer across a macroscopic computer complex. He also pointed out that the then-current transistor technology looked like a dead end, as the entire world's supply of germanium would be insufficient. But, unlike naysayers he pointed out that these were problems to solve, and that he didn't know if they really would be problems.

One last thought -- many of the proponents of synthetic biology point out that biology has come up with wonderfully compact machines that we should either copy or harness. And who first articulated this concept? I don't know for sure, but I now propose that 1959 is the year to beat:

The biological example of writing information on a small scale has inspired me to think of something that should be possible. Biology is not simply writing information; it is doing something about it. A biological system can be exceedingly small. Many of the cells are very tiny, but they are very active; they manufacture various substances; they walk around; they wiggle; and they do all kinds of marvelous things---all on a very small scale. Also, they store information. Consider the possibility that we too can make a thing very small which does what we want---that we can manufacture an object that maneuvers at that level!

So if the nanotechnologists don't want to call their field Feynmanian, I propose that synthetic biology be renamed such!

Wednesday, January 06, 2010

On Being a Scientific Zebra

I got a phone call today from someone asking permission to suggest me as a reviewer of a manuscript this person was about to submit. For future reference, if it's similar to anything I've blogged about, I'd be happy to be a referee. I generally get my reviews in on time (though near deadline), a practice I have gotten much better about since getting a "your review is late" note from a Nobelist -- not the sort of person you want to get on the wrong side of.

I end up reviewing a half dozen or so papers a year, from a handful of journals. NAR has used me a few times & I have some former colleagues who are editors at a few of the PLoS journals. There's also one journal of which I'm on the Editorial Board, Briefings in Bioinformatics (anyone who wishes to write a review is welcome to leave contact info in a comment here which I won't pass through). I'll confess that until recently I hadn't done much for that journal, but now I'm actually trying to put together a special issue on second generation sequencing (and if anyone wants to submit a review on the subject by the end of next month, contact me).

I generally like reviewing. Good writing was always valued in my family, and my parents were always happy to proof my writings when I was at home. This space doesn't see that level of attention -- it is deliberately a bit of fire-and-forget. A review I'm currently writing is now undergoing nearly daily revision; some parts are quite stable but others undergo major revision each time I look at them. Eventually it will stabilize or I'll just hit my deadline.

There are two times when I'm not satisfied with my reviews. The worst is when I realize near the deadline that I've agreed to review a paper where I'm uncomfortable with my expertise for a lot of the material. Of course, if I'd actually read the whole thing on first receipt I'd save myself from this. You generally agree to review these after seeing only the abstract, so I suppose I could put in my report "The abstract is poorly written, as on reading it I thought I'd understand the material but on reading the material I find I don't", but I'm not quite that crazy.

The other unsatisfying case is when I'm uneasy with the paper but can't put my finger on why. Typically, I end up writing a bunch of comments which nibble around the edges of the paper, but that isn't really helpful to anyone.

I also tend to be a little unsatisfied when I get to review very good papers, because there isn't much to say. I generally end up wishing for some further extension (and commenting that it is unfair to ask for it), but beyond that what can you say? If the paper is truly good, you really don't have much to do. A good paper once inflicted a most cruel case of writer's block on me -- it was an early paper reporting a large (in those days) human DNA sequence, and we were invited to write a News & Views on it -- and I couldn't come up with anything satisfying and missed the opportunity.

That leaves the most satisfying reviews -- when a paper is quite bad. This isn't meant to be cruel, but these are the papers you can really dig into. In most cases, there is a core of something interesting, but often either the paper is horridly organized and/or there are gaping holes in it. It can be fun to take someone's manuscript, figure out how you would rearrange & re-plan it, and then write out a description of that. I try to avoid going into copy editor mode, but some manuscripts are so error-ridden it's impossible to resist. Would it really be helpful to the author if I didn't? One subject I do try to be sensitive to is the issue of authors being stuck writing in English when it is not their first language -- given that I can hardly read any other language (a fact plain and simple; I'm not proud of it) it would be unfair of me to demand Strunk&White prose. But, it is critical that the paper actually be understandable. One recent paper used a word repeatedly in a manner that made no sense to me -- presumably this was a regionalism of the authors'.

I once reviewed a complete mess of a paper and ended up writing a manuscript-length review of it. In my mind, I constructed a scenario of a very junior student, perhaps even an undergraduate, who had eagerly done a lot of work with very little (or very poor) supervision from a faculty member. The paper was poorly organized as it was, and many of the key analyses had either been badly done or not done at all. Still, I didn't want to squash that enthusiasm and so I wrote that long report. I don't know if they ever rewrote it.

I can get very focused on the details. Visualization is important to me, so I will hammer on graphs that don't fit my Tufte-ean tastes or poorly written figure legends. Missing supplemental material (or non-functioning websites, for the NAR website issue) sends my blood pressure skyrocketing.

I wouldn't want to edit manuscripts full time, but I wouldn't mind a slightly heavier load. So if you are an author or an editor, I reiterate that I'm willing to review papers on computational biology, synthetic biology, genomics and similar. I'd love to review more papers on the sorts of topics I work on now -- such as cancer genomics -- than the overhang from my distant past -- a lot of review requests are based on my Ph.D. thesis work!
