Thursday, July 30, 2009

Couldn't help but laugh at this metagenome project

I would have thought this to be an April Fools' prank, except I came across the dataset while browsing the NCBI Short Read Archive. It did have me laughing out loud.

Metagenomic Analysis with Galaxy: Windshield Genomics and Beyond
And in case one thought the title was jargon or a brand name:

...we asked the following
questions: “When I drive through Pennsylvania in June my windshield
gets quite dirty with all these bugs. Yet do I know what they are?
How many beetles versus butterflies? Is there a difference between
day and night? Is there a difference between Pennsylvania and
Connecticut?” So we scraped the windshield, isolated genomic DNA,
and subjected it to 454 FLX sequencing. We then uploaded the data
into Galaxy and attempted answering these questions. In the end
Pennsylvania turned out to be different from Connecticut.

Gotta admit some jealousy -- I wish I had this much access to a second-gen sequencer that I could do such a whimsical project!

Tuesday, July 28, 2009

Just say no to perspective pie charts!


Several papers in a row tripped over one of my pet peeves. Why does anyone who cares about their data use a perspective pie chart?

Pie charts in general get little respect in the visualization community (Tufte hates them), but while I don't love them I don't hate them either. The scheme is intuitive and widely understood -- the area (and pie angle) of each slice is proportional to its share of the total. There are two classes of objection. One is that areas are harder to compare visually than lengths. The other is that pie charts waste dimensions and pixels -- why use two dimensions and lots of pixels to display information which can be shown in one dimension with many fewer pixels? While pixels are not gold, why not maximize their value?

But there's no excuse for perspective pie charts. These abominations are made easy by programs such as Excel, whose defaults tend to range between poor graphical taste and appalling graphical design. The perspective view completely ruins the correlation between shape and value, killing the one virtue of a pie chart.
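To put a rough number on the distortion, here is a minimal sketch. It assumes the simplest possible model of a tilted pie: the disk is just squashed vertically by a factor `squash`, with no true perspective and no 3D rim (real charting programs add both, which can only make things worse). The function name and the 0.5 squash factor are illustrative choices of mine, not taken from any particular program.

```python
import math

def apparent_angle(start_deg, end_deg, squash):
    """On-screen angle (degrees) of a pie slice spanning start_deg..end_deg
    (counterclockwise, 0 = pointing right) after the pie is squashed
    vertically by `squash`. Assumed model: a plain vertical squash."""
    def screen_direction(theta_deg):
        t = math.radians(theta_deg)
        # Squashing only scales the y coordinate of each slice edge.
        return math.atan2(squash * math.sin(t), math.cos(t))

    span = math.degrees(screen_direction(end_deg) - screen_direction(start_deg))
    return span % 360.0

# Four equal slices, each 25% of the total (90 degrees in a flat pie).
for start in (-45, 45, 135, 225):
    print(start, round(apparent_angle(start, start + 90, squash=0.5), 1))
# prints 53.1, 126.9, 53.1, 126.9 for the four slices
```

Under this assumed model, four identical 25% slices show up on screen as wedges of about 53 and 127 degrees, depending only on where they sit around the pie.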

So, the next time you review a paper with one of these horrors, object! Journal editors of the world: ban them from your pages. Data deserves respect!

Monday, July 27, 2009

Just what does a genome cost?

Scientists are rarely trained in finance, and even more rarely comfortable with it. In an ideal world, experiments just happen and somehow it all gets covered. But the reality is that experiments cost money.

Perhaps nowhere in biology has this been so at the front of attention as with genome sequencing, particularly since the cost has been marching down. But cost has also turned out to be a murky area. Numbers are thrown around without always having clear evidence behind them.

Today, George Church was quoted as saying the cost is around $5K and would soon be $1K. This is very exciting -- but how real is it? Not only am I nervous that this is an exon resequencing cost, but even if it's for a complete human genome shotgun I wonder how obtainable it really is. Is this cost "fully loaded" or just a raw material cost that omits facility, equipment and labor costs?

The conservative approach is to believe only the price of a sequencing service I can buy on the open market. Illumina has put the most prominent stake in the ground here, offering sequencing for $48K (but requiring a prescription). They claim they will get it down to $10K by the end of the year, but until I can buy it I won't believe it.

Of course, someone could offer a $10K genome as a stunt or loss leader. If I can actually buy it, as a consumer I don't really care. But ultimately, you can't lose money on every sale and make it up in volume (alas, proven yet again in my last professional posting).

The rapid change in cost really does give headaches for people trying to relate it to other problems. For example, I feel a bit sorry for the author of a recent NAR review on SNP array technology. It's a good review & it would have a huge hole in it if it didn't address the issue of next-gen sequencing crowding out arrays, but that also leads to a problem. Next-gen clearly beats arrays on almost every measure, but a key driver of the switch is cost: the narrower the gap, the less attractive it is to settle for arrays. However, the comparisons in Table 2 were obsolete about the time the paper hit the Advance Access section. Clearly, any of the costs above $48K (4 of the 6) are suspect.

It would be cool to have a daily changing price for genome sequencing -- perhaps a ticker symbol. Less flashy would be routine bidding on sequencing services on eBay. Buy it now on a genome for $5K -- now that would be real!

Wednesday, July 01, 2009

Gene Expression from A-Z

I was playing with the data from an early RNA-Seq paper just to have a general idea of what such data looks like and to check out some favorite genes. It was also an exercise in learning the latest Spotfire -- I had Spotfire back at MLNM but it's been over 2 years and a completely new interface was rolled out.

An easy way to find favorite genes and compare them across the three tissues (brain, liver, muscle) is to set up a trellis plot with expression as the y-axis and the gene name as the x-axis, and then use the filtering tools to find my genes. Of course, it's hard to avoid looking at the overall plot -- and picking out some fortuitous patterns.

What immediately jumps out are the three semi-blank vertical zones (in the original you can convincingly spot a fourth, very thin one; it's vaguely there in the PNG shown here). What are these? Take a guess before reading below.

The big one is all genes starting with "Olf" -- the olfactory receptors. This is a large subfamily of type I G-protein coupled receptors (GPCRs) whose discovery netted a Nobel Prize. In general, these are expressed solely in the olfactory epithelium, but a little more on that later.

The thin line to the left of it has genes starting with Mirn -- microRNAs, which this particular sequencing effort wasn't very tuned for. The next one to the left has genes starting with Ig -- immunoglobulin genes. Since B-cells are not one of the samples, low expression there is no shocker. The very thin line to the right of the Olf cluster, which you might not see, has genes that all start with Vr1 -- the vomeronasal receptors, another set of specialized GPCRs involved in pheromone recognition.

Of course, especially with an interactive display, you can find other patterns. A block of genes starting with Mrp has very similar, high expression in all three tissues -- the mitochondrial ribosomal proteins. A clump enriched for names starting with Psm shows a similar pattern -- the proteasome subunits.
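These prefix patterns can also be checked without eyeballing a plot. The following is a minimal sketch that groups genes by name prefix and reports the median expression per tissue. The expression values in `toy_expression` are made-up placeholders, not numbers from the RNA-Seq paper, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

TISSUES = ("brain", "liver", "muscle")

# Placeholder data: gene -> (brain, liver, muscle). NOT real measurements.
toy_expression = {
    "Olfr1": (0.0, 0.1, 0.0),
    "Olfr2": (0.2, 0.0, 0.0),
    "Mrpl1": (20.0, 22.0, 25.0),
    "Mrps2": (18.0, 24.0, 21.0),
    "Psma1": (40.0, 35.0, 30.0),
}

def median_by_prefix(expression, prefixes):
    """Group genes by the first matching name prefix, then take the
    per-tissue median expression within each group."""
    grouped = defaultdict(list)
    for gene, values in expression.items():
        for prefix in prefixes:
            if gene.startswith(prefix):
                grouped[prefix].append(values)
                break  # a gene is counted under one prefix only
    return {
        prefix: {tissue: median(column)
                 for tissue, column in zip(TISSUES, zip(*rows))}
        for prefix, rows in grouped.items()
    }

summary = median_by_prefix(toy_expression, ("Olf", "Mrp", "Psm"))
for prefix, per_tissue in summary.items():
    print(prefix, per_tissue)
```

The same caveat applies as to the plot: a shared prefix is only a crude proxy for a gene family, so anything this turns up is a prompt for a closer look, not a result.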

I don't recommend spending a lot of time doing this analysis -- the visual cortex is too good at picking up patterns & clearly gene names were not picked to make this a great way to find biology. But it is mildly fascinating.

One further note. While the Olf cluster has a lot of low expression, it isn't devoid of expression (below; ignore the sides as I'm still learning how to get the boundaries set precisely in SF). Furthermore, some of the same genes are seen in all three samples. Now, this could be erroneous due to improper fragment mapping or some other transcriptionally active gene that overlaps these, but I think we should also be open to the idea that some of the olfactory receptors may have been co-opted for other purposes. After all, if there is a battery of diverse proteins with a spectacular range and sensitivity for different compounds, why wouldn't some be used for something other than exploring the environment?
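For anyone who wants to pull out the genes seen in all three samples, a minimal sketch follows. The values in `toy_olf` are invented placeholders, not data from the paper, and the function name and `threshold` parameter are my own illustrative choices.

```python
def detected_everywhere(expression, threshold=0.0):
    """Return the sorted names of genes whose expression exceeds
    `threshold` in every tissue."""
    return sorted(
        gene for gene, values in expression.items()
        if all(value > threshold for value in values)
    )

# Placeholder data: gene -> (brain, liver, muscle). NOT real measurements.
toy_olf = {
    "Olfr1": (0.3, 0.2, 0.4),
    "Olfr2": (0.0, 0.0, 0.0),
    "Olfr3": (0.5, 0.0, 0.1),
}

print(detected_everywhere(toy_olf))  # prints ['Olfr1']
```

Raising `threshold` is only a crude guard against stray mis-mapped reads. It does nothing about an overlapping transcript from another gene, so hits from a filter like this would still need checking by hand.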