Thursday, August 27, 2009

What happened to the eighth dog hair style?

For the second time this summer, Science has published a step forward in understanding the genetics of dog breeds. Previously it was the identification of a post-wolf event which led to short-legged dogs (which includes my faithful assistant); this time a large (600+ dogs) genetic study has shown that the vast majority of dog coat types can be explained by just three genes (you'll need a Science subscription to access these).

Figure 3 of the paper makes the point quite graphically. The three genes found in the study are FGF5, RSPO2 and KRT71. FGF5 is a secreted growth factor previously implicated in hair development, RSPO2 is a regulator of the Wnt pathway known to be important in hair follicles and KRT71 is a keratin which causes a curly phenotype when mutated in mice. So even though these were found by a genome-wide genetic study, they are all excellent candidate genes. Below is my version of Figure 3 (which has illustrations of the dog breeds). Furnishings are extra hair around the eyebrows. Wolf means the ancestral genotype and novel a genotype that post-dates domestication.

Coat type                 Example breed         FGF5    RSPO2   KRT71
Short                     Basset Hound          wolf    wolf    wolf
Wire                      Australian Terrier    wolf    novel   wolf
Wire and Curly            Airedale Terrier      wolf    novel   novel
Long                      Golden Retriever      novel   wolf    wolf
Long with Furnishings     Bearded Collie        novel   novel   wolf
Curly                     Irish Water Spaniel   novel   wolf    novel
Curly with Furnishings    Bichon Frise          novel   novel   novel

Now, this covers a lot of furry ground. The paper claims it describes coat configuration in 95% of the 108 breeds examined. There are some strange coats probably not covered by this work (for example, the Komondor and Puli, which grow dreadlocks -- I haven't seen one personally yet). They do note that a few very long haired breeds (Afghan hound) lack the FGF5 mutation found here, suggesting that some breeds use a different genetic strategy.

The variants themselves are a mix (mutts?). RSPO2 has a mutation in the 3' non-coding region which the paper shows increases expression by about 3-fold. The FGF5 mutation changes a conserved amino acid from Cys to Phe; that Cys may well be involved in a covalent Cys-Cys bond in the structure (common in secreted proteins). The KRT71 mutation is also a coding region mutation.

But the more obvious question to me is this: they describe 3 essentially binary genetic determinants of coat style, but only 7 combinations rather than the 8 which could be expected. The missing genotype in the table is wolf-like at FGF5 and RSPO2 but with the novel (post-domestication) genotype at KRT71. Presumably this would yield a short, curly phenotype -- perhaps too short for curling to be observed and the trait pair to be selected by breeders.
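The counting argument is easy to check mechanically. Here is a minimal sketch that enumerates all 2^3 wolf/novel combinations and reports whichever ones are absent from the seven rows of my table; the dictionary is just my transcription of that table, and the function name is my own invention, not anything from the paper.

```python
from itertools import product

GENES = ("FGF5", "RSPO2", "KRT71")

# Coat types transcribed from the table in this post,
# keyed by (FGF5, RSPO2, KRT71) genotype.
OBSERVED = {
    ("wolf", "wolf", "wolf"): "Short",
    ("wolf", "novel", "wolf"): "Wire",
    ("wolf", "novel", "novel"): "Wire and Curly",
    ("novel", "wolf", "wolf"): "Long",
    ("novel", "novel", "wolf"): "Long with Furnishings",
    ("novel", "wolf", "novel"): "Curly",
    ("novel", "novel", "novel"): "Curly with Furnishings",
}


def missing_genotypes(observed):
    """Return every wolf/novel combination across GENES not present in observed."""
    all_combos = product(("wolf", "novel"), repeat=len(GENES))
    return [combo for combo in all_combos if combo not in observed]


for combo in missing_genotypes(OBSERVED):
    # Prints: {'FGF5': 'wolf', 'RSPO2': 'wolf', 'KRT71': 'novel'}
    print(dict(zip(GENES, combo)))
```

Only one combination falls out, the wolf/wolf/novel genotype discussed above.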

Cadieu, E., Neff, M., Quignon, P., Walsh, K., Chase, K., Parker, H., VonHoldt, B., Rhue, A., Boyko, A., Byers, A., Wong, A., Mosher, D., Elkahloun, A., Spady, T., Andre, C., Lark, K., Cargill, M., Bustamante, C., Wayne, R., & Ostrander, E. (2009). Coat Variation in the Domestic Dog Is Governed by Variants in Three Genes. Science. DOI: 10.1126/science.1177808

Sunday, August 23, 2009

Genomes that begin with P: A follow-up

I'm really grateful for the comments on my bit about genome assembly. Two commenters (one of whom is a leading author in genome assembly) pointed out what I (the armchair genomicist) failed to cover and, even better, I got some first-hand information from the scientist who assembled the published platypus genome. Getting a better education in practical genomics, with the tuition being a bit of public embarrassment: that's a sweet deal.

The impact of diploidy was something I'll confess hadn't crossed my mind; I think the topic of different assembly algorithms did briefly flit through. As a bioinformatician, I'm a bit red-faced to have ignored the impact of better algorithms. It is a bit surprising to learn that there is still a bit of art to assembly.

I also wanted to throw in something else which crossed my mind, but somehow dropped out of the final cut. Now, part of the trouble is that all I have to work with here is a small blurb in some promotional material from Illumina, which makes the following claim about the assembly of the panda genome:

Using paired reads averaging 75 bp, BGI researchers generated 50X coverage of the three-gigabase genome with an N50 contig size of ~300 kb.

If this were done purely from the next-gen data it is remarkable; this is a contig N50 similar to the supercontig N50 for platypus. Is panda just an easier genome? Does the 50X oversampling (as opposed to 6X for platypus) make the difference? Or is it mostly due to the very clever current breed of algorithms? How much worse would the assembly be if the same amount of data came from unpaired reads?

With luck, once the panda genome is published all the underlying read data will be as well, which will mean many of these questions can be addressed computationally; the last three questions above are all ripe for testing.

Wednesday, August 19, 2009

Why do genome assemblies go bad?

Question to ponder: given a genome assembly, how much can we ascertain as to why it didn't fully assemble?

I've gotten to thinking about this after reviewing the platypus genome paper. Yeah, it's a year plus old so I'm a bit behind the times, but it's relevant to a grand series of entries that I hope to launch in the very near future. An opinion piece by Stephen J. O'Brien and colleagues in Genome Research (one catalyst for the grand series) argues that the platypus sequence assembly is far less than what is needed to understand the evolution of this curious creature and that the effort was severely hobbled by the lack of a radiation hybrid (or equivalent) map.

First some basic statistics. The sequencing was nearly entirely Sanger sequencing (a tiny smidgeon of 454 reads were incorporated) yielding about 6X sequence coverage and a final assembly of 1.84Gb. Estimates of the platypus genome size can be found in the supplementary notes and are in the neighborhood of 2.35Gb. Presumably losing 0.5Gb to really hideous DNA sequences isn't too bad. An estimate based on flow cytometry put the size closer to 1.9Gb, so perhaps little if anything is missing. The 6X estimate is based on one of the larger (2.4Gb) estimates.
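Since the quoted coverage depends on which genome size you believe, it's worth seeing how much the number moves. This is just back-of-the-envelope arithmetic using the figures above (6X at an assumed 2.4Gb, versus the 1.9Gb flow cytometry estimate, and the 1.84Gb assembly against the 2.35Gb and 1.9Gb size estimates); the function names are mine, for illustration only.

```python
def rescale_coverage(coverage, assumed_size_gb, actual_size_gb):
    """Re-express a coverage figure quoted for one genome size against another.

    The total amount of read data is fixed (coverage * assumed size),
    so a smaller true genome means higher effective coverage.
    """
    total_read_gb = coverage * assumed_size_gb
    return total_read_gb / actual_size_gb


def fraction_assembled(assembly_gb, genome_gb):
    """Fraction of the estimated genome represented in the assembly."""
    return assembly_gb / genome_gb


# 6X at 2.4Gb works out to roughly 7.6X if the genome is really 1.9Gb.
print(round(rescale_coverage(6, 2.4, 1.9), 2))

# The 1.84Gb assembly is about 78% of a 2.35Gb genome,
# but about 97% of a 1.9Gb genome.
print(round(fraction_assembled(1.84, 2.35), 2))
print(round(fraction_assembled(1.84, 1.9), 2))
```

Either way the coverage is in the 6-8X range, so the choice of size estimate matters more for the "how much is missing" question than for the coverage one.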

The first issue with the platypus assembly is the relatively low connectedness: the contig N50 value is only 13Kb; in other words, half of the final assembly is in contigs shorter than 13Kb. In contrast, similar coverage assemblies of mouse, chimp and chicken yielded N50 values in the range of 24-38Kb.
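For anyone who hasn't computed one, N50 is simple: sort the contigs from longest to shortest and walk down the list until you've accumulated at least half of the total assembled bases; the length of the contig you stop on is the N50. A minimal sketch, with made-up toy lengths rather than anything from the platypus paper:

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig")


# Toy example: total is 300, and the two longest contigs (80 + 70 = 150)
# already reach half of that, so the N50 is 70.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70
```

The same function works for supercontigs; you just feed it supercontig lengths instead.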

The supercontig N50 value is also low for platypus; 365Kb vs 10-13Mb for the other three genomes. One possibility explored in the paper was a relatively low amount of fosmid (~40Kb insert) data for platypus. Removing fosmids from chimp had little effect on contigs (as expected; it's not a huge contributor to the total amount of read data) but a significant effect on supercontigs, knocking the N50 from 13.7Mb to 3.3Mb -- which is still about 10X better than the platypus assembly.

Looking at the biggest examples, on contigs platypus did okay (245Kb vs. 226-442Kb for the other species) but again on supercontigs it is small: 14Mb vs. 45-51Mb. Again, removing fosmid data from chimp hurt the assembly but the biggest chimp supercontig was still twice the size of the largest platypus supercontig.

A little bit of the trouble may have been various contaminating DNA. A number of sequences were attributable to a protozoan parasite of platypuses and some other random pools are presumably sample mistrackings at the genome center (which is no knock; no matter how good your staff & LIMS, there's a lot of plates shuffling about with Sanger genome projects).

So what went wrong? What generally goes wrong with assemblies?

The obvious explanation is repetitive elements, and platypus (despite having a smaller genome than human) appears to be no slouch in this department. But, this has an implication. If repeats are killing further assembly, then it should be true that most contigs should end in repetitive elements. Indeed, they should have a terminal stretch of repetitive stuff around the same length as the read length. I'm unaware of anyone trying to do this sort of accounting on an assembly, but I haven't really looked hard for it.
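The accounting I have in mind could be sketched something like this. It assumes the contigs have already been soft-masked (repeats in lowercase, a common convention for repeat-masked sequence), and it simply asks what fraction of contig ends are mostly repeat over the last read-length's worth of bases. The function names and the 80% threshold are my own arbitrary choices, not anything from the platypus paper.

```python
def end_flags(masked_seq, read_length, min_fraction=0.8):
    """Return (left_is_repeat, right_is_repeat) for one soft-masked contig.

    An end counts as repetitive when at least min_fraction of its
    terminal read_length bases are lowercase (i.e. masked as repeat).
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")

    def fraction_masked(segment):
        if not segment:
            return 0.0
        return sum(ch.islower() for ch in segment) / len(segment)

    left = masked_seq[:read_length]
    right = masked_seq[-read_length:]
    return (fraction_masked(left) >= min_fraction,
            fraction_masked(right) >= min_fraction)


def fraction_repeat_ends(contigs, read_length, min_fraction=0.8):
    """Fraction of all contig ends (two per contig) that look repetitive."""
    flags = []
    for seq in contigs:
        flags.extend(end_flags(seq, read_length, min_fraction))
    return sum(flags) / len(flags) if flags else 0.0


# Toy contigs: three of the four ends are fully masked.
toy = ["acgtACGTACGTacgt", "ACGTACGTACGTacgt"]
print(fraction_repeat_ends(toy, read_length=4))  # 0.75
```

If repeats really are what stops the assembler, this fraction should come out far higher than the repeat content of the genome as a whole; if it doesn't, undersampling or something else deserves a harder look.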

A second possibility is undersampling, either random or non-random. Random undersampling would simply mean more sequencing on the same platform would help. Non-random undersampling would be due to sequences which clone or propagate poorly in E.coli (for Sanger) or PCR amplify poorly (for 454, Illumina & SOLiD). If platypus somehow was a minefield of hard-to-clone sequences, then acquiring a lot of paired-end / mate pair data on a next gen platform (or simply lots of long 454 reads) might help matters. In the paper, only 0.04X 454 data was generated (and it isn't clear what the 454 read length was). Would piling on a lot more help?

A related third possibility is assembly rot. Imagine there are regions which can be propagated in E.coli but with a high frequency of mutations. Different clones might have different errors, resulting in a failure to assemble.

In any case, it would be great for someone out there with some spare next-gen lanes to do a run on platypus. Even running one lane of paired-end Illumina would generate around 4-5X coverage of the genome for around $4K. For a little more, one could try using array-based capture to pull down fragments homologous to the ends of contigs, potentially bridging gaps. Even better would be to go whole hog with ~30X coverage from a single run for $50K (obviously not pocket change) and see how that assembly goes. Ideally a new run at the platypus would also include the other mammalian egg-layer, the echidna (well, pick one of the 4 species of them). Would it assemble any better, or are they both terrors for genome assemblers? Only the data will tell us.

Monday, August 17, 2009

A Standard Set of Test Genomes

I was away on vacation, enjoying an Internet-free existence on a tropical isle, when the brouhaha over Stephen Quake's genome sequencing paper blew up. It does appear that the paper used some very sloppy accounting & comparisons for its cost analysis, but some of the reaction to this has been a bit over-the-top (in particular, the needless throwing around of inflammatory personal descriptions).

But, it did get me thinking. The real value of this paper is demonstrating the strengths & weaknesses (and specialized clever algorithmics) of Helicos' sequencing instrument. Helicos has struggled to claw their way into the marketplace, but has gotten a few instruments placed. The machine has some interesting properties (more on this later this week) and I'd love to get to try one out, but alas there are no service providers offering access at this time.

But to really understand the strengths and weaknesses of any platform, you really need one or more common benchmarks on which to compare. Sequencers in different stages of development may aspire to different benchmarks. For example, very early stage sequencers could use a common small DNA to target with intermediate stage sequencers shooting higher and then on to really big problems like the human genome. So, I humbly propose the following sketch of a set of standard targets which should be used for such proof-of-concept papers, going from small to big.

  1. M13. For very early stage sequencers. If that's still too much, just sequence out from the universal priming site. At the other end of things, it would be cool to have a set of M13 (or other clones) with various pathological sequence examples in them to build a set (or cocktail) of test samples.

  2. Lambda phage: 50Kb

  3. E.coli MG1655. ~4.5Mb

  4. Micrococcus luteus. Another bacterium, but with a G+C content around 80%.

  5. Saccharomyces cerevisiae S288C. A good stepping stone, plus well studied.

  6. Plasmodium. Not only getting bigger (30Mb-ish), but only about 30% G+C. Being A+T rich can cause trouble just as being at the other end of the spectrum can.

  7. Fugu rubripes. Again, a stepping stone (~0.5Gb) genome.

  8. Homo sapiens, NA18507. This is the HapMap sample which has been sequenced on both the SOLiD and Illumina platforms, enabling comparison of their performance. It's too bad this isn't the sample used in the Quake study, since it would allow completely direct comparison.

  9. Corn. Even bigger than human, and representative of the biologically interesting and economically important plant genomes of huge size (10Gb on up) and hideous complexity (repetitive elements, high ploidies). Someone with experience in the field should probably specify the particular strain, though given that it is August & I used to love to grow it in the backyard, I'd suggest Silver Queen.

There are, of course, some even more mongo genomes. I suspect most developers will top out with human or perhaps corn, but if you really want to go crazy there are lungfish & Fritillaria.

The above is just a suggestion. Perhaps it has too many stages -- most of the genome sequencers seem to go tiny-bacterium-human; at least that's my impression.

In a similar vein, there should be some standard target regions for targeted sequencing methods to enable comparison, and again NA18507 would be an obvious choice for the genome to target those in.

Wednesday, August 05, 2009

Young Men & Fire

60 years ago today, 15 young men floated out of a Montana sky onto rugged ground below. They were there to join another already on the scene of a small forest fire. Within a few short hours, only 3 of those men would not be fatally burned by that same fire. As deserved as the festivities are over the spectacular events of July 1969, it's a pity that so little attention has been paid to this tragedy 20 years earlier.

I was in college when I first read the definitive account of the Mann Gulch fire, Young Men and Fire by Norman Maclean. Given that I was about at the same age at the time as those smokejumpers, it had a lot of resonance for me. It remains one of my favorite books (along with his other masterpiece, A River Runs Through It, and Other Stories). Someday, I hope to hike the gulch, both to enjoy the beauty and to contemplate the sacrifices there.

It might seem like this doesn't have much to do with science (it certainly has nothing to do with genomics!), but there's more than a little science in his writing. YM&F covers a lot of what was known then and was found later about the science of wildfires and one of its first students, Harry Gisborne. Maclean himself was never trained as a scientist, but had a keen eye for nature from spending so much time in it. River contains more than a little bit of the science of fish & fishing streams. Maclean himself led the last two survivors of the fire on a visit to the site decades later that turned up critical artifacts from that night. The Mann Gulch tragedy has also become a case study in how organizations respond to extreme stress.

There is another great connection between Maclean & science, only barely touched on in Young Men but treated in expanded form in another article (reprinted in The Norman Maclean Reader). As a young graduate student at the University of Chicago, Maclean had become an acquaintance of the great physicist Albert Michelson, and Maclean in the longer piece writes lyrically about Michelson's lifelong quest to precisely measure the speed of light. It's a gem of scientific journalism.

I'll leave with a quote from Michelson via Maclean, which I love. Michelson was brushing off a compliment from Maclean on the elder man's billiards playing:

Billiards, though, is a good game, but billiards is not as good a game as chess. Chess, though, is not as good a game as painting. But painting is not as good a game as physics.

Tuesday, August 04, 2009

Another Lab Cameo

I spent a chunk of a day early last month dusting off my old lab skills. The rationale was that we were losing the senior lab tech who was supporting a lot of the projects I'm involved in. She's actually an old friend from Millennium (we started & left in near synchrony there) who has been lured back to that fold and will be greatly missed (until we steal her back!).

Anyway, it looked like we'd have a gap in lab support for some PCR projects. A back-fill position was open, but hiring is never instantaneous. Perhaps some resources could be shifted, perhaps not. So I imposed a bit on my friendship to get a quick refresher in PCR.

One interesting factoid is that I have now run PCR in every decade it has existed. In the 80's, when it was still new, some of my undergraduate classes used it. I didn't really appreciate then how cutting edge we were being. Another class had us running PCR in the 90's. I never quite got my hands wet at Millennium (despite several invitations or schemes to, the last coming just before I was shown the door), but I did once supervise PCR runs in a hotel ballroom as part of an Invitrogen-sponsored high school program. At Codon I did spend a morning in the sequencing lab and set up cycle sequencing, which isn't quite PCR but is very similar.

Of course this also means my time in molecular biology labs spans 20 years. Some things haven't changed at all. Pipetmen remain a masterpiece of industrial design, completely functional yet sleekly styled. Lots of other paraphernalia have hardly changed -- microfuge tubes, desktop centrifuges, etc.

However, there are some new items -- leading to new puzzles. I've heard of strip tubes, but this was my first time using them. Nice and simple & organizes the samples. But, after pipetting my aliquots onto the side as I was taught, how do you spin the things down? The e-gels sure beat pouring your own, though it does take away the possibility for excitement. I only ever coated the microwave ceiling with overboiled agar, but one junior faculty member was forever remembered for blowing the door off a microwave.

My guide was also trying to train someone else & so was dashing from lab to lab. Some problems could be solved by asking (where do I get more tips?) but some required waiting. The answer to the strip tubes was a different 'fuge, but one that was being very heavily used that day by multiple scientists.

There is also a lot of lore that I had either forgotten or needed some more practice with. You should always use the smallest pipettor which will accommodate your load, but with 5 pipettors (I'd never had more than 3 before) to choose from I sometimes used a size larger than desirable. I had forgotten that you should always dial down to the correct amount, never up. Both of these improve precision.

In the end, my experiment wasn't a stunning success. Out of my two sets of samples, only 1 had a working positive control and none of the experiments came out positive (some should have based on prior experiments). However, while that is disappointing it is great how fast the experiment could go from start to finish, yielding immediate feedback. Also, I now know where all the samples and reagents and tools are, so I could dive in.

Reaction to this ranged from amusement to mild alarm (most memorably on one colleague's face simultaneously!). Actually, one informatics person expressed jealousy, stating that he had been thwarted in similar attempts. But, as one other scientist put it, perhaps this isn't the best use of my time. If I really wanted to get good, I'd need to spend a lot of time practicing. Then maybe I'd be okay at PCR, but at the cost that we'd have to train someone else to do the bioinformatics stuff! So I didn't put "Become PCR whiz" in my career development plan. And, despite (or perhaps in reaction to) my efforts, resources were shifted about and we stole a skilled experimentalist from another group within the company.

However, I would regard this as more than a stunt. It is useful to find out what lab work is really like. Partly it helps with one's humility (I really expected both controls to work!) but also with considering the actual work involved in an experiment. That in turn helps with proposing experiment designs; more than once I've suggested designs that are scientifically sound but operationally unrealistic.

One final note: I did feel some pangs of nostalgia for Codon. The business didn't work, but boy did we have some slick automation. For an extended period I could get complex PCR experiments run by simply injecting some rows in an Oracle database -- and by run, I mean oligos ordered, PCRs run, clones picked and sequence verified (not all of these steps were roboticized, but those that weren't would be executed by the staff without any intervention by me -- a veritable assembly line). For a computer jockey, that's a really slick setup -- and one I wish I still had.

I doubt I'll be making more trips into the lab soon, though I would like to keep PCRing once a decade for the rest of my days. But who knows? I might again go from riding the bench to working on it!