Tuesday, February 28, 2017

Earth BioGenome Project: Ill-Conceived Megaproject Du Jour

There's been a bit of buzz recently about an unfunded proposal to ultimately sequence every living species on Earth, warming up by sequencing every eukaryotic species, with a targeted cost of $4.8B.  It pains me a bit to write this, but I'm with those who think this is not a wise way to spend money and certainly not likely to work for anywhere near that budget.
I'm working mostly off an article in Science on the proposal, but there's also been quite a bit of Twitter traffic.  The drivers of the idea appear to be largely molecular systematists and conservation biologists, and analogies are made to the Human Genome Project (HGP), which they point out seemed technically impractical when originally proposed.  But as touched on in the article, that comparison is specious.  More importantly, the real value and impact of this project appear to be ill-defined.  And the argument for doing this now isn't stated at all.

Again, this is painful.  I have a reputation at work for being "the mad sequencer", earned by leading a bacterial sequencing effort larger than most public efforts and even more so for constantly advocating doing more sequencing.  But that sequencing has a well-defined purpose to advance our drug discovery goals and we have the track record to prove the strategy works.  

I have no doubt that having sequences for all eukaryotes would reveal fascinating biological insights.  But one must weigh that against reasonable alternatives.  There's a strong "build it and they will come" aspect to this proposal, an idea that if all these genomes are available they will attract analysis that yields those insights.  But perhaps it would be better to sequence less and invest more in interpretation?  Or sequence a smaller number of species but more individuals from those species to sample geographic variation? 

There's also a very real question of the quality of sequence.  The number $4.8B is thrown out for the entire project, an inflation-adjusted price tag from the HGP.  The article also throws out a number of $100 per eukaryotic genome, which a Complete Genomics representative claims they can soon accomplish.  I suspect nobody in the genomics community believes that number; Complete has some serious 'splaining to do.  More importantly, I'd say it's time to really put a stake in the ground and not fund any more short read de novo genomes.

Short read genomes have provided valuable insights, but it is abundantly clear that a lot of information is missed.  Human short read sequencing has been successful only because it had a high-quality reference, and even then it still missed a lot of information.  The sorts of variation missed by short read de novo assembly (repeats, paralogs and details of genome arrangement) are probably often involved in speciation, which is one topic such a grand survey might be expected to shed light on.

Back to that cost; anyone in the industry must be gaping at it.  Let's suppose Complete or Illumina (which made similar claims for some future iteration of NovaSeq) can really deliver appropriate coverage of a typical eukaryotic genome.  We'll ignore for a moment the astounding range of sizes of eukaryotic genomes, which vary about 10,000 fold.  The typical short read metric for resequencing is 30X coverage of a diploid genome, not the 100X typically needed for de novo assembly.
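To make the coverage gap concrete, here is a minimal sketch of the raw data each depth implies.  The 1 Gb genome size is a hypothetical round number picked for illustration, not a figure from the proposal, and the function name is my own.

```python
# Raw sequence needed scales linearly with genome size and coverage depth.
RESEQUENCING_COVERAGE = 30   # typical short read resequencing depth
DE_NOVO_COVERAGE = 100       # depth typically needed for de novo assembly

def raw_bases_needed(genome_size_bp: int, coverage: int) -> int:
    """Return the number of raw bases to sequence for a given depth."""
    return genome_size_bp * coverage

# Hypothetical 1 Gb genome, used only as an illustration.
example_genome_bp = 1_000_000_000

reseq_bases = raw_bases_needed(example_genome_bp, RESEQUENCING_COVERAGE)
de_novo_bases = raw_bases_needed(example_genome_bp, DE_NOVO_COVERAGE)

print(f"resequencing: {reseq_bases / 1e9:.0f} Gb of raw data")
print(f"de novo:      {de_novo_bases / 1e9:.0f} Gb of raw data")
print(f"ratio:        {de_novo_bases / reseq_bases:.1f}x")
```

Whatever the price quoted at resequencing depth, de novo depth needs a bit over three times the data, and the 10,000-fold spread in genome sizes multiplies on top of that.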

But those cost estimates?  Nearly always they are only data generation.  Maybe library preparation is thrown in, but often not.  Manufacturers can get away with this less-than-honest pricing, but a major project needs to consider fully loaded costs.  Every cost needs to go in: DNA extraction, quality control, analysis, data storage.  Costs of failures.  The Science article cites 1.5 million eukaryotic species, which seems really low, especially when I can find an estimate of over 5 million fungal species and one paper that suggests there may be 1.6 million protists (most of which have not yet been cataloged).  But let's work off that estimate of 1.5 million; that averages out to about $3K per species for a $4.8B budget.  So they're not relying on that bargain basement $100 per genome price.  But again, that needs to cover everything.
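For anyone who wants to check the division, here is a back-of-envelope sketch using only the figures quoted above.  The variable and function names are mine, and the "share left over" line simply assumes the $100 vendor price really did cover all data generation.

```python
# Back-of-envelope: budget available per species under the quoted figures.
TOTAL_BUDGET_USD = 4_800_000_000   # proposed project budget
SPECIES_COUNT = 1_500_000          # eukaryotic species cited in the Science article
VENDOR_PRICE_USD = 100             # claimed per-genome data generation price

def budget_per_species(total_budget: int, n_species: int) -> float:
    """Average dollars available per species."""
    return total_budget / n_species

per_species = budget_per_species(TOTAL_BUDGET_USD, SPECIES_COUNT)

# Assumption: if the vendor price held, this fraction of each species'
# budget would be left for everything that is not data generation.
non_sequencing_share = 1 - VENDOR_PRICE_USD / per_species

print(f"budget per species: ${per_species:,.0f}")
print(f"share left for everything else: {non_sequencing_share:.1%}")
```

The point of the second number is that even on the proponents' own figures, nearly all of the per-species budget has to pay for collection, extraction, QC, analysis, storage and failures.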

The real budget-buster may well be getting all the required DNA.  Stored specimens may not be suitable; new collections may need to be executed (and funded).  This is the most gigantic difference between this proposal and the HGP: the HGP needed to start with a small number of DNA samples from one species, easily obtainable.  In contrast, the BioGenome proposal requires extracting and tracking millions of samples from a wide array of different sorts of organisms.

Then there's the issue of quality control; without care this effort could spawn hundreds or even thousands of tardigates.  If you don't recognize that jargon, the first paper to report a genome sequence for a tardigrade (water bear) claimed to have found widespread horizontal transfer into the genome.  The second tardigrade genome paper failed to find any such thing, and the initial assembly was later shown to be seriously incorrect and marred by contaminants.  If you aren't funding detailed analysis, which is expensive, this sort of garbage-in, garbage-out fiasco awaits.

The BioGenome proponents try to channel the spirit of the HGP and say that it too seemed infeasible when launched.  But there's a chasm of difference: HGP and later development focused on first solving a technical challenge which wasn't clearly solvable and then focused on reducing costs.  The huge cost reductions have come after the HGP with modern sequencing instruments, which have enabled eliminating gigantic rooms full of incubators, colony pickers, prep robots and the like.  The first human genome was generated by an army of individuals working in multiple huge facilities on multiple continents.  In contrast, one could probably set up a facility to generate platinum genomes in a two-car garage.

Unfortunately, what BioGenome needs to improve on are mostly hard, logistical problems that don't lend themselves to automation or miniaturization. This is the problem all of genomics is starting to run into: the sequence generation may soon be just a sliver of the total cost and effort of any sequencing project.  That sliver will continue to shrink as technology improves, but all the remaining cost and effort will be very hard to improve upon.

What's the alternative?  Let's start by requiring platinum-quality, not pewter-quality, genomes, with full blown analysis. I don't have a great handle on what that would cost, but let's throw out a range of $20K to $50K.  Let's also stipulate that this is going to be hard to chisel down; much of that cost is going to be the analysis.  By platinum genome we mean one which resolves, with a low level of exception, each chromosome arm as a set of haplotype-resolved contigs, represents the genetic variation in the source sample, and has an error rate below one error per megabase (I'm guessing on that error rate).  That's going to mean high quality input DNA, which will probably rule out museum specimens. It's going to mean a minimum of long read sequencing, optical mapping and some sort of Hi-C mapping.  Perhaps some of the smallest genomes can be solved purely with long read sequencing.  

That would ensure high quality genomes that should stand the test of time, but with the cost I am suggesting a $4.8B megaproject would be sequencing only about 100K-200K genomes.  That's still an awful lot of genomes, but really good ones.  These genomes will also save researchers from making silly mistakes or not being able to make solid conclusions due to uncertainty in the data.  And to reiterate, the fully loaded costs of these genomes are going to be dominated by sample collection, quality control, analysis and such, so it's worth amortizing those costs over higher quality information that has permanent value.
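Here is a quick sketch of where that genome count comes from, using the $20K to $50K range I threw out above.  That range is my guess rather than a measured cost, and the function name is just for illustration.

```python
# How many platinum genomes does the full budget buy at each end of the range?
TOTAL_BUDGET_USD = 4_800_000_000
COST_LOW_USD = 20_000    # optimistic fully loaded cost per platinum genome (a guess)
COST_HIGH_USD = 50_000   # pessimistic fully loaded cost per platinum genome (a guess)

def genomes_affordable(total_budget: int, cost_per_genome: int) -> int:
    """Whole genomes that fit in the budget at a given per-genome cost."""
    return total_budget // cost_per_genome

fewest = genomes_affordable(TOTAL_BUDGET_USD, COST_HIGH_USD)
most = genomes_affordable(TOTAL_BUDGET_USD, COST_LOW_USD)

print(f"at ${COST_HIGH_USD:,} each: {fewest:,} genomes")
print(f"at ${COST_LOW_USD:,} each: {most:,} genomes")
```

The exact endpoints come out a little wider than my rounded 100K-200K, but the order of magnitude is the point: an order of magnitude fewer genomes than the 1.5 million species being discussed.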

But that still raises the question of whether this is a good use of research funds.  For example, the Science article marks the number of eukaryotic families at about 9000.  So $450M (9000 families at the $50K top of my range) might give platinum membership to each eukaryotic family.  That's already a large chunk of change, but perhaps justifiable to place benchmarks around the eukaryotic tree.  But that's only 10% of what is proposed; rationalizing the remaining bit to generate sequences to fill out the tree is going to be difficult.  
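The one-genome-per-family figure can be sketched the same way.  The family count is the Science article's approximate number and the per-genome costs are my guessed range, so treat the output as rough.

```python
# Cost of one platinum genome per eukaryotic family, as a share of the budget.
TOTAL_BUDGET_USD = 4_800_000_000
FAMILY_COUNT = 9_000     # approximate number of eukaryotic families (Science article)

def family_survey_cost(n_families: int, cost_per_genome: int) -> int:
    """Total cost of sequencing one representative genome per family."""
    return n_families * cost_per_genome

for cost_per_genome in (20_000, 50_000):   # guessed low and high per-genome costs
    total = family_survey_cost(FAMILY_COUNT, cost_per_genome)
    share = total / TOTAL_BUDGET_USD
    print(f"${cost_per_genome:,} per genome: ${total / 1e6:,.0f}M total, "
          f"{share:.1%} of the proposed budget")
```

Even at the expensive end of my range, benchmarking every family uses under a tenth of the proposed money.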

Now, the BioGenome project doesn't plan to replicate other projects.  There's the 10,000 Vertebrate Genomes and various other "N of X" genome projects.  Which I glumly suspect are also tending to generate short read assemblies.  Which is a reminder of a serious flaw in these catchy names: if you later realize it made more scientific sense to sequence fewer genomes to higher quality (or fewer species and more individuals per species), you're a bit hemmed in by your name.  So, according to this logic, rather than yet another "N of X" short read scheme I'd much prefer to hear a proposal for "An Inordinate Number of Fondly Sequenced Beetle Genomes".


Mike D'Angelo said...

Really loved reading this critique. As someone who spent large portions of my PhD mining genome sequences for my GOI, its relatives and its genomic locus, I used to be excited whenever I heard about a new genome announcement that could plug a gap in the evolutionary history of my gene. That was in the days of Sanger sequencing, where I usually found full length coverage of my gene and tens to thousands of kb either side to define the locus. Then short read genomes started appearing and I searched those with the same excitement, but that didn't last long as I rarely found more than a 100 - 1000 bp stretch of my GOI. This was enough to say it was there but not enough to tell how many paralogues there were or to look at protein sequences of key domains, and forget looking for linked genes! A few times I got full length gene coverage but the sequences looked inaccurate (littered with stop codons and indels). So I basically stopped looking for anything other than 'presence/absence', which meant I was only interested in a few genera that seemed to have lost my GOI, or in species near the point where my GOI first originated (probably during the immunological big bang, so early vertebrates).

The only thing that I will say is that at least one of these 'N of X' type sequencing projects has realised the shortcomings of short-read genomes and has pledged to use long reads in its future efforts. The B10K (10,000 bird genome project that aims to sequence every living bird species) has decided to incorporate PacBio Sequel sequencing to improve its assemblies for the next phases of its project (see http://www.pacb.com/press_releases/g10k-and-b10k-initiatives-select-pacbio-smrt-sequencing-for-next-phase-of-genome-projects/ for details). I have no involvement in this project but I really hope it (and the 10,000 vertebrate project) is a success because I'd love to have 10,000 nice assemblies to look at the evolution of my GOI in excruciating detail!

Mike (@mdangelo32)

Anonymous said...

Nice critique. I generally agree with your assessment that the money could be better used elsewhere, but I do think there is a practical application that you have overlooked: enzyme discovery for synthetic biology.

Keith Robison said...

Anonymous: Yes, enzyme discovery is an important area that benefits from highly diverse sequencing. Look for more on this topic in the next few weeks here (I have two ideas queued in my brain; fingers are lagging!)

Travc said...

It amazes me how the issues involved in sample collection and actually making usable libraries from samples are glossed over. Yeah, it is relatively easy for most of the things people sequence today... model species, large-ish animals, and things which can be cultured. But that is a minuscule fraction of "all species". Having to develop custom protocols will be the norm, not the exception, and that (like analysis) takes people. People are expensive.

Kevin McCluskey said...


Let's remember the ancillary costs. If they sequence these organisms, who is going to maintain access to living cultures, colonies, or plantings? Living collections require FUNDS for capacity building and with estimates of $1 million per year per 10,000 isolates for a simple microbial collection, this could easily consume tens of millions of dollars per year when complicated systems are included. While this is a laudable goal, for whatever reason (NIMBYism, etc.) there does not seem to be a consensus that funding living resources is worthwhile.

And how does the system deal with consortia? Many "organisms" are really consortia. Lichen, for example, can be comprised of diverse partners. Many fungi have uncharacterized viruses. Plants have endophytes. What about the microbiome?

Finally, pilot studies like the 1000 Fungal Genomes program have had major delays in generating the coverage that they require, perhaps partly because of the requirement for RNA sequencing to define Open Reading Frames. Orthology and synteny to already-characterized members of the same family or genus do not predict gene structure or even whether an ORF is a pseudogene or an active gene.

Unknown said...

Response. (part 1 because of 4,096 character limit)

Dear Dr. Robison, thanks for noticing and commenting on the Earth BioGenome project announcement. I am a member of the working group, and whilst I do not presume to speak for everyone or anyone involved, I would like to suggest a more optimistic outlook is not only warranted, but can be easily supported. A full response merits a detailed white paper, which is being drafted, but for now I would like to briefly respond to these comments.

First, I must thank you and note our complete agreement with the comment: “I have no doubt that having sequences for all eukaryotes would reveal fascinating biological insights.” I would add that these insights will fundamentally invigorate the biological sciences, greatly enhancing our understanding of life on earth, most of which remains to be discovered, and will provide inspiration and impetus for future generations. There is no doubt that genome reference sequences will be the foundation for the study of life on planet earth in the 21st century. And it is almost impossible to overstate the value of cataloguing all of life on earth and how it evolved, especially in the context of intense ongoing efforts to find possible microbial life on Mars and various moons in the solar system. Life on Mars will never compare with the majesty of, for example, the Amazon rain forest, the sequoias of Redwood National Park, and the marine life of the barrier reefs around the world. The latest discovery of potentially habitable planets in the Trappist-1 system 39 light years (~230 trillion miles) from earth only underscores the rarity and importance of the life around us.

I will focus on a few of your comments that address details of the proposed initiative that were not included in Elizabeth Pennisi’s excellent meeting review.

Infrastructure for ALL Biologists. Perhaps most fundamentally, I and others on the working group disagree with the implicit premise that only medically defined goals such as drug discovery are important. The current scientific grant system is very good at funding the most obviously useful endeavors, but it is hubris to say we know what the most important species on the planet are today and will be in the future. For example, discovering and describing all of the life on planet earth will allow comprehensive searches for molecules of pharmacological importance and will provide understanding of their evolutionary and biological context. From there, materials scientists will search for structural biomolecules, plant biochemists will describe secondary metabolic pathways, molecular geneticists will use new and as yet undiscovered tools similar to CRISPR and RNAi for theoretical and applied research, quantitative geneticists and agronomists will immediately use the knowledge to improve our agricultural ecosystems, biomedical specialists will rapidly incorporate insights on pathogens and symbionts related to individual health, and evolutionary biologists will look back in time to understand the evolution of life on earth.

Perhaps most elegantly, we would have the ability to identify and study the relationships among all species on earth, changing our understanding of our planet's ecosystems. Finally, the time to generate this infrastructure is now, as the ongoing human-driven extinction event continues. This project cannot save the species on the planet, but we hope it will bring awareness to the grandeur of life on earth, and the need to set aside ecosystems for conservation of this inheritance.

Unknown said...

Response part 2.
Cost: For Planet Earth it is Small Potatoes.
This would be, first and foremost, an international initiative because biodiversity on earth is international. For the sake of argument, suppose that the US were to provide (Make America Great Again!) 25% of the cost, that the EU and China also contributed 25% each, and that the rest were shared by other countries around the world. Consider also that for many countries their major contribution would be access to and assistance with the collection of samples. As currently envisioned, a 10 year project might only require a direct US contribution of $100M a year. Whilst for any individual this is a lot of money, for governments it is a small investment in biology and in the acceleration of biological research.

How Realistic is The Proposed Cost.
There are somewhere between 8 and 15 million species on the planet, although some estimates suggest that sequencing of environmental samples would greatly increase this number. However, only ~1.5 million eukaryotic species have been described, and it is reasonable to estimate costs based on the sequencing of 2 million species. Crucially, we are assuming that only about half of our costs will be actual genome sequencing (at around $500 per species), and as you recognize, that may not be the hardest part. Most agree that collecting, processing, archiving, and describing the samples will be the most difficult logistical and legal challenges.

Proposed Reference Quality. We know, from hard experience, the technical challenges of achieving reference quality genomes. But we are confident that technology advances from companies such as Oxford Nanopore and Pacific Biosciences will continue to decrease costs while increasing sequence quality and genome contiguity. The genomics community is establishing efficient combined approaches that permit new standards of genome quality, including 1Mb contig N50s and 10Mb scaffold N50s.

A defense of Complete Genomics against our ingrained scientific skepticism. This announcement was the result of information provided during a small sequencing technology session of the BioGenomics conference unrelated to the announcement of the Earth BioGenome project, so I do not presume to speak for them. But I would like to stick up for them against the ingrained pessimism we often develop as the people who have to work with the technology today, often forgetting that our annoyances of today were the miracles of only yesterday. In my opinion, their rolling circle nano-ball DNA template technology has always had theoretical and practical advantages in template density and signal-to-noise over the more diffuse bridge PCR technique. Together with the sequencing by synthesis that replaced their previous ligation sequencing, and the support of international partners such as BGI as both partner and large scale user, there is no reason to doubt that their goals are achievable.

From presentations at the meeting it is clear that sequencing and analytical steps are becoming more efficient and inexpensive, driven in part by a very competitive landscape with several companies and technologies racing to drop prices by an order of magnitude. For humans, the $100-$200 genome may be here far sooner than we think.

Unknown said...

Response - final part.

Scientific Collaboration and Friendship. Although my response has been long, I want to finish by echoing the words of BGI president Huanming Yang. As many know, the BGI has helped lead and fund most of the Genome 10K and other large taxon-based sequencing projects with a broader vision than was possible within the constrained missions of US funding agencies. Professor Yang expressed in an impassioned talk at the meeting that the collaboration and friendship generated by these international projects are potentially more important than the actual science. As he put it (I paraphrase): In a hundred years, when we will all be long gone, I hope that we will have made every effort to save the species we can on this planet. That vision might involve “frozen zoos” of cells from as many species as possible, and genome sequences for everything in databases that future researchers simply take for granted as part of the furniture.

As your comments make clear, the best path can be hard to find, but we hope that such a resource would play a small part in bringing the world closer together, and we suggest that now is not too soon to start.