Monday, June 08, 2015

BGI Unveils a Sequencing Factory to Go

When I was in George Church's lab, he submitted a grant proposal (which, alas, was not funded) for a sequencing factory to generate one megabase of data per day.  In those days that was an ambitious goal, and the plan would have truly been on a factory scale, with a large workforce and an assembly line of stages to yield the final product of data.

When I was at Millennium and the company was going great guns on EST and BAC sequencing, we had a huge room full of sequencers and the associated colony pickers, DNA preparation robots and the like which could generate a megabase or two (or three) a day.

Next I was at Codon Devices, and one small room held colony pickers and a hallway held three ABI 3730 capillary sequencers, and that was a megabase per day.

Neither of my next stops has had its own sequencers, but the world quickly shifted.  Massively parallel sequencing approaches from 454, ABI and Illumina meant an end to colony pickers and huge DNA preparation robots.  The original Solexa was rated at 1 gigabase per run, which meant megabases per hour, not per day.  The numbers have kept going up, but even smaller instruments appeared -- a modern MiSeq generates more data in less time and at lower cost than that original Solexa, while occupying far less space.

However, for really big projects big iron has still been popular.  Complete Genomics did go down the factory route, as I wrote about long ago.  Last year Illumina rolled out the X10 system, a set of at least 10 souped-up HiSeq instruments (now available in packs of 5) at $1M per instrument.  Not only does this cost at least $10M upfront, but the cost advantages of the beast are only attained by keeping it fed with 10K-17K genomes per year, which at $1K/genome is at least as much as the purchase price.
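To make that comparison concrete, here is a rough back-of-the-envelope sketch in Python.  It uses only the figures quoted above (10 instruments, $1M each, 10K-17K genomes per year, $1K per genome); treating the $1K/genome figure as the full yearly running cost is my simplifying assumption, and the function and variable names are purely illustrative.

```python
# Back-of-the-envelope X10 economics using the figures quoted in the post.
# Assumption: the $1K/genome figure is treated as the full running cost.
INSTRUMENTS = 10
PRICE_PER_INSTRUMENT_USD = 1_000_000
COST_PER_GENOME_USD = 1_000
GENOMES_PER_YEAR = (10_000, 17_000)  # low and high ends of the quoted range


def capital_cost(instruments, unit_price):
    """Upfront purchase price for the whole set of instruments."""
    return instruments * unit_price


def annual_running_cost(genomes, cost_per_genome):
    """Yearly spend needed to keep the instruments fed."""
    return genomes * cost_per_genome


capital = capital_cost(INSTRUMENTS, PRICE_PER_INSTRUMENT_USD)
for genomes in GENOMES_PER_YEAR:
    running = annual_running_cost(genomes, COST_PER_GENOME_USD)
    print(
        f"{genomes:,} genomes/yr: ${running / 1e6:.0f}M per year, "
        f"{running / capital:.1f}x the ${capital / 1e6:.0f}M purchase price"
    )
```

Under these assumptions the yearly spend runs from about 1x to 1.7x the purchase price, which is the sense in which the running cost is "at least as much" as the capital outlay.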

Earlier this year Complete Genomics (now acquired by BGI) announced they would launch two sequencer systems, one for small jobs and one for "nation-scale" genome sequencing.  At the European Human Genetics Conference this weekend, the big unit was announced.  Unfortunately, BGI hasn't seen fit to reach out to the blogging community, nor is there a wealth of details.  But some general characteristics of the machine can be grasped.

First, it is an end-to-end beast.  Included in the $12M pricetag are robots for sample preparation and library prep, which are fully automated.  The instrument has a rated throughput of 10K genomes per year, with planned expansion to 30K/year.  Sequencing is based on Complete's ligation (cPAL) technology, which gives very short reads (28 bp paired end) using 48B spots on a patterned flowcell. Complete had a somewhat involved mate-pair system to get more information out of those short reads (though I'm blanking on the details); it is unclear if that is supported in the new system.  As far as I can tell, Long Fragment Read technology for haplotyping is not in the package.  Turnaround time is unclear.  The system can start different sequencing runs at different times, though it isn't clear how many independent units are available. Both human whole genome and human exome sequencing will be supported by the integrated software, which performs the full range of analyses.  Included in this is local reassembly of reads which don't perfectly match the reference. The system requires 1500 square feet (140 square meters) of space, which was about a third of the listed space in the original starbase; this is not a startup-friendly instrument!
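For a sense of scale, the rated numbers can be turned into daily rates with a little arithmetic.  The Python sketch below uses only the figures quoted above (10K and 30K genomes per year, 48B spots, 28 bp paired end).  The assumptions are mine: the machine runs all 365 days of the year, and every spot yields a usable read pair, so the raw-base figure is an upper bound rather than a specification.  The names are illustrative.

```python
# Rough throughput arithmetic from the quoted Revolocity figures.
# Assumptions: year-round operation and one usable read pair per spot.
DAYS_PER_YEAR = 365
SPOTS_PER_FLOWCELL = 48_000_000_000  # "48B spots"
READ_LENGTH_BP = 28                  # "28 bp paired end"
READS_PER_SPOT = 2                   # paired end: two reads per spot


def genomes_per_day(genomes_per_year):
    """Average daily output implied by an annual rating."""
    return genomes_per_year / DAYS_PER_YEAR


def raw_bases_per_flowcell(spots, read_length, reads_per_spot):
    """Upper bound on raw bases if every spot gives a full read pair."""
    return spots * read_length * reads_per_spot


for rating in (10_000, 30_000):
    print(f"{rating:,} genomes/yr is about {genomes_per_day(rating):.1f} genomes/day")

raw = raw_bases_per_flowcell(SPOTS_PER_FLOWCELL, READ_LENGTH_BP, READS_PER_SPOT)
print(f"Raw upper bound: {raw / 1e12:.3f} terabases per flowcell")
```

So the launch rating works out to roughly 27 genomes a day, and the planned expansion to roughly 82.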

My reaction?  Well, I certainly wouldn't have picked the name Revolocity.  Clearly an attempt to combine Revolution and Velocity, it's a bit of a mouthful.  The end-to-end solution aspect will appeal to groups that aren't experienced in genomics, and indeed the two outfits which have signed up already are not well-known in the genomics space.  The multiple independent units may appeal to some, but if you are really going to run full throttle, planning for that will take over your scheduling.

A looming question is how data from the Revolocity will stack up against Illumina's X10.  Both are claiming human genomes at around $1K, but are these equivalent genomes?  Given that X10 offers 150 bp paired end, or roughly 5 times the read length, what sort of variation can X10 detect better?  For example, how well can you call STRs (or what is the upper limit on allele size) on each instrument?  Complete had in the past released a number of datasets, and I would urge them to release some standard human genomes as raw data (such as the Genome in a Bottle reference).

It will be interesting to see how well Revolocity's promise of end-to-end automation and low staffing resonates with the marketplace.  I realized from this that I don't have any idea what library preparation methods are popular with X10/X5 sites.  For example, a small number (3-4) of NeoPrep devices could in theory keep a HiSeq X10 farm happy, though these require DNA to be sheared upstream of the NeoPrep.  Hopefully the end-to-end, any-biological-sample aspect of Revolocity will not just be a lure for those with purchasing authority but far removed from actual operation.

Which gets to the big question: how many more institutions are really looking to plunk down $10+M in capital for a lab that burns another $10M-$20M a year in reagents (plus the rental cost of the floor space!)?  At the outset, BGI's machine appears to be slightly inferior on throughput to Illumina's -- but will prospective buyers care about that difference?  BGI/Complete is also coming in well above the X5 system from Illumina, ceding the market for folks who want to do only 5K-7K genomes per year.

Off in the mist is a very different vision of population-scale human genome sequencing.  Oxford Nanopore proposes that their PromethION device, running with an array of 48 flowcells that each sport 4000 pores and running at the future higher speed, will meet or exceed the throughput of X10 or Revolocity at potentially far lower cost.  Coupled to the VolTRAX sample/library preparation device, end-to-end might be possible.  More importantly, this would offer long (but noisy) reads, capable of resolving far more complex structural variants (if the input material is also long).  All of this with the footprint of an iPad!  But the important catch is that while Oxford has now demonstrated success with the original MinION, the PromethION, denser chips, higher pore speeds and VolTRAX are all future releases.

Which vision of population-scale genomics will dominate the next few years?  An integrated factory of uber-short read sequencers?  A factory of somewhat longer (but still short) read sequencers with user-defined sample and library prep?  Or long, noisy reads run on something that will fit in an overhead bin -- but not available (if at all) for many months?  In the span of two decades we've come a long way from dreaming of a 1Mb/day factory -- but do we still need a factory?


  1. Hi Keith, many thanks for this write-up. It's clear that population-level sequencing is a secular 'mega-trend', and the high end of this market needs competition.

    Yesterday and today the NHGRI is having its 8th 'Genomic Medicine Meeting' (GM8), live-streamed now. Precision medicine is front-and-center, along with deep phenotyping, all connected with the enabling technology of HTP sequencing.

    One item I remember is that CGI had 4x 10-mers, with the spacing in between the 10-mers of variable length. (That is, the first two 10-mers had a 2-5 base-pair overlap, and the second two 10-mers had something like 4 to 10 bases of spacer.) It was one of the aspects of CGI's business model that likely influenced their decision to be a service provider - they could control the analysis, it being a complex effort, and control the application, only WGS.

    This is a very interesting development, the first sequencer from China. I myself may be visiting China (Shanghai and Nanjing) this Fall, perhaps I could take a side-trip to Shenzhen... (Or maybe not, too difficult to combine work with family...)

    Thanks again for the update! Alas, I missed ESHG, although I've heard the weather was pretty poor.

  2. Hi, Keith, I think it would be useful to help the scientific community understand what part of the statement is reality and what part is just speculation. We all like to practice good science. Even commercial PR has some fine print saying that some statements are only "forward-looking statements." I feel that, as an important blogger in this field, you have some responsibility to help the community do a better job of understanding that distinction.

    Jason Chin (opinion my own)

  3. What's the dominant sequencing technology going to be for the next 3-5 years? Come on Keith, it's Illumina. I am very impressed with the progress Oxford has made but they aren't going to be major players in the short term. Yes, they and PacBio exist, but when you're talking about population scale, clinical sequencing, the only game in town will be Illumina for the foreseeable future. As always, more than happy to eat my words there and am really hoping that Complete can scare Illumina into releasing the denser patterned flowcells we all know exist.

  4. I keep thinking of the informatics required. Granted, this is mostly for human, but still, at 30 mammalian genomes a day that requires a fair amount of storage space and quick mapping algorithms.

    As for who could use such a machine, I suspect that the Chinese government itself would be interested. Anyone up for the 1,000,000,000 genome project? :-) The current "factory sequencers" wouldn't even make a dent.

  5. Rick, quick mapping algorithms exist. Take a look at Edico Genome. Their box is pretty amazing and, though I hate to admit it, it has the same throughput as our 2-rack, 2000-core Intel cluster, all in a 4U server box.

  6. @Brian K, I know Illumina has serious traction in the space, but if they don't have a road map to exponentially improve their process they will literally be crushed. As Keith elaborated in his post, we are seeing exponential growth in the area of sequencing to the tune of 5x Moore's law. Furthermore, I have seen a lot of desire for those longer read lengths which Oxford seems to be on the path to providing. I predict at the very least it will be an interesting few years.

  7. @Al, I think the assumption that Illumina doesn't have a plan mapped out is a poor one. They have a lot of power and if you speak to any of their executives they are keenly aware of the developments in the space. Illumina is always an acquisition away from the next greatest thing. Because they are a public company they can't be nearly as open about what's going on in R&D. From what I've seen in the long read sequencing space they're a long way off from dollars-per-gigabase runs and that's probably not going to change in the short term. It's also not clear to me that the long reads are going to be widely clinically valuable in the short term so the premium you have to pay to get them isn't justifiable. We need more research to better understand how to apply that type of data. I'm certainly excited about long reads and diploid genome sequencing, but if we're talking about what's going to change in the short term, my prediction is very little. Even if someone comes out with cheap, high quality, long reads tomorrow, it will easily be a year before they are validated and implemented in any meaningful way in the clinic.