Monday, January 29, 2024

On Illumina's Moats Past & Present

Studying how Illumina came to dominate sequencing markets is certainly worthy of at least a Harvard Business School case study, and perhaps an entire graduate thesis.  But I wanted to give a quick review of some of my thoughts on the matter, spurred by Nava Whiteford's repeated savaging of a piece in another space, but also because many of these themes will show up in a flurry of pieces I'm planning (one's even nearly done!) in the next few weeks due to AGBT and some non-AGBT news.

I've met Nava face-to-face a few times and he's a great guy.  He was an early employee at Oxford Nanopore, though they don't have a friendly relationship -- Oxford pulled another one of its "let's generate as much negative publicity as possible" moves after Nava posted micrographs of a flowcell -- a flowcell he had purchased on eBay.  Nava wrote the 41J blog, but recently moved his writing over to Substack (ASeq Newsletter).  Most of the content is free, but I suggest you seriously consider joining me in buying a subscription, as he uses the proceeds for purposes such as obtaining old sequencer hardware to tear down and write about.  He has a great general grasp of the industry, but far exceeds my skills when it comes to hardware and physics, and he's much more willing to dive into patents than I find myself doing.  He's also set up a genomics Discord, which is a great place to catch scuttlebutt.

The piece that Nava has slammed -- and if he hadn't stepped up I probably would have jumped in with my own panning of it -- is on Illumina, in Elliot Hershberg's Century of Biology.  Perhaps we're both jealous that CofB has gazillions of subscribers yet errs so badly when trampling on our turf.  But one point Nava & Elliot do agree on is that Illumina's chemistry gave them an edge over their early competitors.  So what formed Illumina's moat (barrier to competitors) -- and what monsters lived in the moat?  [my CEO loves to talk about moats, and we all like to joke about engineering dragons]

Who Were the Competitors?

As a refresher, 454 was the first to market with a post-Sanger sequencing instrument.  ABI brought out SOLiD at a similar time to Illumina, and so did Helicos (I've forgotten the exact order).  Back then there was also the Polonator build-it-yourself instrument.  Complete Genomics had their "sequencer factory" approach, but they would sequence anything you wanted as long as it was a complete human genome -- so not a general purpose solution.  And later to the game came Ion Torrent.

Polony Generation

All of the early instruments, save Helicos, relied on clonal chemistry, so they needed a method to build clusters, aka polonies.  Most used emulsion PCR, which gained an unpleasant reputation: the emulsions themselves were messy, and the solvents used to break them were nasty to work with.  Some of the emPCR reactions had large volumes -- I've heard of giant baggies of SOLiD reactions.

Illumina in contrast used bridge amplification.  In some embodiments this is isothermal, in others it is PCR.  Confessing here: I think Illumina used the PCR version, but I'm not certain.  In any case, no solvents, no mess.  Bridge amplification was a clear win.

The seeds of bridge amplification were actually sown during my time in the Church lab -- though not for that purpose.  A post-doc's project was to develop an aptamer library for the entire E.coli proteome, and the idea George had was to run 2D protein gels and then run in situ PCR reactions to generate aptamers on the acrylamide gel.  George's buddy Chris Adams had developed chemistry to link oligos to acrylamide gels (if my 30+ year old memory has blurred this, please step in and correct me).  None of the sequencing ideas would hatch until shortly after I left (dammit!), but the seeds were there.

Bridge amplification passed from Mosaic Technologies to Manteia and finally to Solexa and then Illumina.  But what if it hadn't?  What if one of the other developers had locked it down?  (I believe bridge amplification will work with beads -- 454 & Ion Torrent are inherently bead-based.)  That's a fun what-if -- Illumina stuck with the less desirable mess.  Or what if Solexa had stuck to their original plan to develop a single molecule sequencer, and probably encountered much of the trouble that Helicos did?

It's interesting to note that subsequent platforms have seen more emPCR (Genapsys and now Ultima) and now "rolony" rolling circle approaches (BGI and Element -- and QIAGEN?).  Not sure if Onso has disclosed their amplification chemistry; given the accuracy claims I'd bet it is rolling circle (which lacks error jackpotting, since RCA doesn't make copies of copies).  Bridge amplification patents are expired now -- I think it's what Singular is using, but they don't talk much about it.

Nextera

Let's talk about a late addition to the Illumina platform -- well after the initial battles with 454 and SOLiD -- that I believe was sheer genius.  A company called Epicentre had developed (after initial academic papers) a way to use transposons to build sequencing libraries, and was selling kits for both Illumina and 454.  Illumina snatched up Epicentre -- and immediately announced they were discontinuing the 454-compatible kits.  Ouch!

Nextera has so many positive attributes.  

First, it is a relatively simple protocol -- in that dangerous range where I think I could execute it (a proposition never put to the test).  There aren't lots of additions or bead cleanups.  Not only does that make it good for small labs, it also makes it very friendly for automating on liquid handling robots.

Second, Nextera works with relatively small input amounts of DNA.  The ligation procedures have largely caught up, but in the beginning Nextera had an edge -- certainly when I was talking to sequencing service providers.

Early Nextera did have issues with sensitivity to input DNA concentrations -- we got back one library that made no sense until, by mapping a known sample, I discovered that the enzyme had generated truly minimal insert lengths -- the consequence of low DNA input concentration.  Later iterations of Nextera kits dealt with this.

But again, imagine if Life Technologies or Roche had scooped up Epicentre and locked Illumina out.  Illumina had a charmed existence: the easiest option for amplification (bridge) and library preparation (Nextera) and, as we'll see, an edge on informatics too -- what if the competition had owned the edge on at least one of those dimensions?

Read Lengths & Paired Ends

On read lengths, the space was a bit more complicated.  Illumina always had an advantage over SOLiD here -- and I'll gladly buy drinks for someone who can draw out how SOLiD ever hit the read lengths they did hit.  I doubt anyone in 2005 thought Illumina would ever get to 150 basepair reads, let alone 250 or 300, but they did!  Helicos, despite having no phasing issue since it was single molecule, struggled to ever compete on read length.  Complete (again, only for human genome sequencing) was short too.  454 could actually beat Illumina by quite a bit, though the 1 kilobase reads were apparently super rare.

The value of longer reads isn't uniform -- even with 25 basepair reads a lot of variant calling can be done, and even with 50 or 75 small genomes assemble pretty decently -- and going to 150 doesn't boost assembly as much as you'd like, because the big repeats (such as ribosomal RNA operons and transposons) are much, much longer.  But in other spaces even modest length improvements could improve the specificity of RNA-Seq mapping or identifying splice sites or finding breakpoints from genomic rearrangements.

Illumina also had paired ends first; SOLiD and 454 were much later to offer this.  Paired ends effectively increase the amount of sequence obtained from longer fragments and led to all sorts of interesting algorithms for splice detection, breakpoint detection and improved genome assembly.  On the other hand, the longer reads from 454 could easily cover more territory, and having a single long read can be better than paired ends.

As a side note, while at Infinity I commissioned a project using SOLiD when their paired end reads had just emerged.  And after digging into the data, I discovered a systematic error with the reverse reads.  Ugh.

With Ultima, we're again seeing long single reads.  Illumina's switch to exclusion amplification has also blunted any value from going to even longer reads -- ExAmp has issues with fragments over about 500 bases and so something like 2x400 (long rumored for MiSeq) wouldn't really gain much.  

Informatics

In one of his earlier pieces Nava commented on the failed concept of colorspace that came with SOLiD.  By failed, I mean it nearly completely failed to gain traction in the informatics community -- there are at best a handful of tools that ever interacted with it.  In theory colorspace enabled the detection of wrong basecalls, but it also meant the first output wasn't a simple sequence that any user would inherently understand.  Instead, a series of color transitions was given -- if you knew the first base you could convert these into base calls based on a set of rules -- and if adjacent colors violated the rules then one of them was clearly an error.  Elegant, but it never caught on -- just about everyone just converted these to FASTQ.
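
To make the scheme concrete, here's a minimal Python sketch of the decoding rules -- my own illustration, not anything from SOLiD's software -- assuming the standard di-base table, which amounts to XORing the 2-bit codes of adjacent bases:

```python
# A minimal sketch of SOLiD-style two-base decoding (my reconstruction,
# not vendor code).  With bases mapped to 2-bit values, each color is
# the XOR of adjacent bases, so decoding chains XORs forward from the
# known primer base.
BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS2BASE = "ACGT"

def decode_colorspace(primer_base: str, colors: str) -> str:
    """Convert a colorspace read (primer base plus color digits) to bases."""
    prev = BASE2BITS[primer_base]
    bases = []
    for color in colors:
        prev ^= int(color)        # color XOR previous base -> next base
        bases.append(BITS2BASE[prev])
    return "".join(bases)

# Note the catch: after conversion, one miscalled color corrupts every
# base downstream of it -- the error-detection elegance only survived
# if you stayed in colorspace, and almost nobody wanted to.
print(decode_colorspace("T", "320010"))  # -> AGGGTT
```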

Similarly, the flowgrams from 454 and Ion Torrent have additional information that could be leveraged for error detection and error correction -- but nearly everybody just converted to FASTQ.  There wasn't a wealth of tools built to use this additional information.

And that was a general theme.  If you saw software coming out and it mentioned a platform, it was almost always Illumina.  Nearly never colorspace and rarely 454 -- though there were a lot of tools for amplicon error analysis and denoising for 454 as that was a popular application (particularly 16S sequencing).  Illumina data had an error pattern more similar to high quality Sanger data than 454 or Ion Torrent (and perhaps converted SOLiD data did too), so developers who had cut their teeth on Sanger easily slid over.  Colorspace and dealing with homopolymer errors was tougher and attracted fewer computational folks.  And there were small bits like the much loved Newbler assembler for 454 being licensed by Roche under wonky terms (I don't remember the details, only that it made using Newbler at a company an exercise in hoop-jumping). 

And even more so as k-mer approaches started becoming popular.  Illumina's low-indel error pattern particularly lends itself to k-mer analysis (a k-mer is a subsequence of length k).  454 and Ion Torrent have far more indel errors, and those complicate k-mer approaches.  We've seen the same with Oxford Nanopore and PacBio -- only with recent accuracy gains for Oxford and CCS/HiFi becoming dominant for PacBio have k-mer tools become popular on these platforms.
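
A toy example (mine, not from any particular tool) of why: every error corrupts the k-mers spanning it, so a platform that constantly makes homopolymer indels floods the k-mer spectrum with spurious low-count entries.

```python
# Toy demonstration: a single homopolymer insertion turns the k-mers
# spanning it into k-mers that never occur in the true sequence.
def kmers(seq: str, k: int) -> set[str]:
    """All distinct overlapping subsequences of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

truth = "ACGTACGGTTCAGGAT"
read  = "ACGTACGGGTTCAGGAT"   # one inserted base: GG miscalled as GGG

k = 5
spurious = kmers(read, k) - kmers(truth, k)
print(sorted(spurious))   # ['ACGGG', 'CGGGT', 'GGGTT'] -- up to k spurious k-mers per error
```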

Element, Onso and Singular have data that looks just like Illumina data -- except perhaps in some cases even more accurate (Element and Onso have both announced very high accuracy chemistries).  But Ultima has different error modes, different error models and new file formats.  They've made sure that standard tools -- such as variant calling -- have been built and tuned for their data, but it is different.  And Jonathan Rothberg's 454.bio just made an announcement (I'll write it up soon!) and their error patterns look very different from anyone else's.

The Rich Get Richer

Later in the competition, as Illumina gained market share, that gain had a compounding effect.  Want to try hot new method X?  Well, it was developed on Illumina.  The software is expecting Illumina data, perhaps keying in on things like the encoded location information in the read name, or is intolerant of homopolymer errors.  If a company launches a kit, it's compatible with Illumina immediately, with maybe an Ion Torrent kit launching a few quarters later -- if ever.
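
As a concrete illustration of that kind of baked-in assumption, here's a hypothetical sketch of pulling tile and X/Y coordinates out of a CASAVA 1.8+ style read name, the way optical-duplicate detectors do -- parsing that breaks on reads from, or renamed by, another platform (the instrument name and values below are made up):

```python
# Hypothetical sketch of an Illumina-specific assumption baked into many
# tools: parsing tile and X/Y coordinates from a CASAVA 1.8+ read name
# of the form @INSTRUMENT:RUN:FLOWCELL:LANE:TILE:X:Y ...
from typing import NamedTuple

class ReadLocation(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int

def parse_illumina_name(name: str) -> ReadLocation:
    """Split the colon-delimited read name into its location fields."""
    instrument, run, flowcell, lane, tile, x, y = \
        name.lstrip("@").split()[0].split(":")[:7]
    return ReadLocation(instrument, run, flowcell,
                        int(lane), int(tile), int(x), int(y))

loc = parse_illumina_name("@M00123:42:000000000-A1B2C:1:1101:15589:1333 1:N:0:ACGT")
print(loc.tile, loc.x, loc.y)   # coordinates used for optical-duplicate calls
```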

This part of Illumina's moat is being neutralized by all sorts of conversion kits -- every new short read player announces some way to convert Illumina libraries.  There may be small penalties, such as requiring more DNA input, or perhaps fewer of the reads are fully usable (this was an issue with the original Ultima chemistries; I'm told the newest chemistries pretty much eliminate the issue).

What Do People Want?

Which brings us to what has perhaps been Illumina's deepest moat or fiercest moat monster -- the fact that they have been on top for so long means that expectations for what sequencers will deliver are shaped by what Illumina sequencers do deliver.

Oxford Nanopore made some public complaints about this with regard to Genomics England -- the technical specifications for the project were seen (by ONT) as really skewed to Illumina.  For example, if you strongly prioritize SNP calling over structural variant calling, then Illumina will look much better (and much, much better five or so years ago when Oxford was complaining).

Which brings us to another what if: what if Illumina/Solexa had been delayed even further and some other technology had gained a bigger foothold before they showed up?  Or what if a nanopore technology had shown up sooner -- after all, when I got to George's lab in 1992 he was already planning nanopore sequencing experiments -- and, unknown to anyone in our group, he was a later entrant to the field than David Deamer and Dan Branton across the Charles River.  PacBio's intellectual origins aren't much later.  It was a long slog to get single molecule long read sequencing working, but who knows what sort of luck could have sped that up?  Or maybe just more grant money would have done the trick.  In any case, imagine if single molecule long reads had shown up in force early on -- would it have changed the scientific goals?  On the one hand, projects such as HapMap already had strong SNP blinders on, but then again there was a lot of interest in array CGH back then for mapping out amplifications and deletions in cancer samples.  How much does available technology end up leading science labs by the nose?  That's an open question.

One still sees biases -- blithe comments about Nanopore accuracy without any reference to the problem at hand.  In the unfolding Ultima story, how much will labs care about the small differences between Ultima data and Illumina data -- or will they be willing to ignore them when Ultima gives them torrents more data for the same cost?  And how many applications are over-served by current Illumina accuracy and could be addressed with really noisy data like what 454.bio might deliver?

Conclusion

My thesis would be that Illumina's advantage on chemistry was multi-fold, with bridge amplification, Nextera and the inherent error rate and pattern playing key roles -- but there was also the knock-on effect of Illumina grabbing the majority of the computational developer mindshare, without even explicitly trying to do so.  There was also the fact that Illumina data's resemblance to Sanger data not only drove developer adoption, but made it an easier fit for people focused on SNPs.

I'm generally optimistic that Illumina's moat in this area has filled in -- many new entrant platforms offer data that is a drop-in replacement for Illumina and can therefore immediately take advantage of the rich Illumina-compatible informatics ecosystem.   Whatever IP protections Illumina might have had on key sequencing library elements -- P5/P7 and i5/i7 -- have evaporated so all the new entrants are making conversion kits.  It's also worth noting that most library prep kits seem to be bought from non-Illumina vendors, and these companies are willing to support a new player if they believe there is a market there.   Nextera is no longer the only transposase prep in town -- seqWell has their own (though they differ in key details).  

But make no mistake -- Illumina is still the 800 pound gorilla in the field.  Nobody is fired for buying Illumina.  Their moat may be mostly non-existent, but community inertia is a moat amplifier.  The sequencing market is large and growing, so nibbling away at Illumina's stake might be a very successful business strategy, but it will be very difficult to knock Illumina off its perch -- even with Illumina's own contributions to doing that (as in, foolishly rushing the Grail purchase).

Coda

By the way, if you're thinking of opening your wallet to support Nava's blog, also consider the same for Albert Vilella (Rhymes with Haystack)  and Brian Krueger (Omic.ly) -- they both enrich the infosphere with news and facts about genomics and sequencer instrumentation. Charging for content is a route I'm not considering now, but I'm happy to support those who do so.

8 comments:

Toumy said...

Great post. I think you omitted the fierce battle between Complete Genomics and Illumina around WGS starting around 2010. They were really the only ones seriously competing with Illumina in the earlier days, even though they did not really have a distributable platform. I also wonder what would have happened if 454 had focused more on workflow and throughput rather than long read fetishism - the pursuit of a 100bp longer read...
Finally, part of why Illumina is losing some of its advantages is that they never addressed their Achilles heels, for example bridgePCR and exAMP (error, dephasing, insert size), and not having decent workflow reagents (library prep, TE etc). Of course also showing no love to instruments with lower throughput.

Anonymous said...

Thanks for the trip down memory lane. In 2010, read lengths (1 kb may have been rare, but 400 bp was common) and Newbler made the 454/Roche GS-FLX system my first choice. I was a microbiologist with no bioinformatics skills. Solexa/Illumina didn't provide a data solution (other than a recommendation to "hire a bioinformatician") whereas Roche provided a 'server' to store data and included training to run Newbler. Plus, I wanted closed de novo bacterial genomes, which wasn't feasible with 75 bp reads. Yes, emPCR was messy, but those early paired end protocols were also labor intensive. In retrospect, the Illumina option (a HiSeq 2000?) would have been the better choice. Within a couple of years, the library protocols became easier (probably due to the Epicentre acquisition), read lengths got longer, and output far surpassed that of the GS-FLX. Plus, pyrosequencing had some inherent issues (esp. the homopolymer errors). Then again, as Toumy comments, I wonder what would have happened had workflow and throughput for the 454 been improved.

new said...

Wonderful article!

I always forget the Mosaic Technologies part of the story. It's interesting to think about the platforms as a kind of confluence of technologies/IP flowing from company to company. At some point I guess that reaches critical mass and pulls in more and more IP.

In this case it feels like it reached critical mass at Solexa. But it is interesting to speculate what would have happened if Solexa hadn't been acquired and never added things like patterned flowcells.

In that case though, it's difficult for me to see what else would have been competitive. If you had a viable amplification approach (maybe RCA or even emPCR) and did unterminated SBS with optical readout, perhaps this would have been good enough for many applications? But nothing like this seemed to appear... perhaps for IP reasons.

Anonymous said...

No, Solexa used isothermal clusters, and the Manteia technology pre-dated polonies, hence the independent IP. At the time a lot of Solexa's board was frit by a company called US Genomics, that actually came to nothing.

Anonymous said...

In my view Solexa had two moats. One was the cleanness of the terminated chemistry; the IP stood up for 22 years despite many challenges. Secondly, as part of that, they engineered a custom polymerase to deal with the bulky nucleotides.

Anonymous said...

You don't mention Lynx MPSS. Often overlooked. But it was an early player with amplification on beads, again with its own IP. This was sequencing by ligation. Lynx was one of many victims of the .com bubble. In a way Solexa also was, as its investors were desperate to cover losses and cash it in. Lynx also couldn't raise money, like Manteia. The .com thing was the asteroid to these dinosaurs. Anyway, Lynx was struggling: after raising tons of money historically, its tech was very clunky and they were not able to export a working instrument. The rationale for merging with Solexa, apart from the $$, was that Lynx machines could be easily adapted to run Solexa SBS. This turned out to be entirely false. So much for technical due diligence, as Nava says. People from Lynx, who clearly elected in MPSS, moved to form the team at ABI for SOLiD. And of course that tech came from Agencourt - another overlooked innovator.

Anonymous said...

The investors were ‘frit’ by pretty much anything that popped up (especially US investor Oxford Biosciences). As executive management we spent a lot of time trying to put nervous minds to rest.

Dale Yuzuki said...

As one who was in the trenches (selling Illumina microarrays to the NIH starting in 2005, and then the first Solexa 1G's in 2007) one interesting point was the Complete Genomics business model was not only WGS all the time, but a centralized service model. And now here we are more than 15 years later and Complete Genomics is unrecognizable, with all the BGI/MGI instruments, chemistries, government intervention (and on the other side government backing).

Nice call-out to all the activity Albert and Nava and Brian are doing with their subscription models; I'm another subscriber to all three, to help them out financially (and encourage others to do so). I see another trend, towards private communities, but the blog post via social media / e-newsletters is a great way for distribution.