Sunday, April 25, 2021

GISAID Broken Down by Sequencing Hardware

The GISAID database has been the workhorse for storing and distributing SARS-CoV-2 sequences during the COVID-19 pandemic and recently passed one million entries.  There was some Twitter chatter wondering about the hardware breakdown for this, as it isn't really easy to get out of GISAID.  I had done a somewhat arduous partial take at this for my VIB talk last month, but in the meantime GISAID had granted me some additional access to metadata which I've been too busy to tackle.  But knowing some others were curious, time to dive back in. 

Metadata is usually a mess and GISAID is no different.  The sequences are the most critical information and I'm glad they're focused on that, but I will note a few interesting patterns.  You have both understandable (NextSeq vs. NextSeq) and careless (NexSeq, MiinSeq) variation in names.  There's also regional variation -- I've lumped the ElectroSeq with Ion Torrent as apparently that is a branding used in China.  There are far more embarrassing issues: one large lab deposited a significant amount of data with the sequencing instrument set to "SEQUENCING_INSTRUMENT" and the assembly method set to "ASSEMBLY_METHOD" -- presumably some string interpolation was missing in their submission code.  Another major lab used their name as the sequencing instrument; a lot of their other submissions used Nanopore but I'm not making that leap.   Still another put their assembly specification in the instrument field but alas did not put the instrument in the assembly method field.  Others seem to have put actual individual instrument IDs in.  Still, it mostly worked out: only 0.69% of the 1.2+ million entries I looked at couldn't be assigned a sequencing platform after I went through rounds of cleaning.
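To make the cleanup concrete, here is a minimal sketch of the kind of normalization described above. It is not the code used for this post: the pattern table, the placeholder list and the function names are all assumptions made for illustration, and a real pass would need many more rounds of rules.

```python
import re
from typing import List

# Hypothetical placeholder values that carry no instrument information,
# compared after compacting (see compact() below).
PLACEHOLDERS = {"sequencinginstrument", "assemblymethod", "unknown", ""}

# Hypothetical keyword table, illustrative only. Known misspellings such as
# "NexSeq" and "MiinSeq" are listed explicitly; ElectroSeq is lumped with
# Ion Torrent, as in the text.
PLATFORM_PATTERNS = [
    ("Illumina", re.compile(
        r"illumina|novaseq|nextseq|nexseq|miseq|miniseq|miinseq|hiseq|iseq")),
    ("Oxford Nanopore", re.compile(r"nanopore|minion|gridion|promethion")),
    ("PacBio", re.compile(r"pacbio|pacificbiosciences|sequel")),
    ("Ion Torrent", re.compile(r"iontorrent|electroseq")),
    ("MGI", re.compile(r"mgi|dnbseq|bgiseq")),
    ("Sanger", re.compile(r"sanger")),
]


def compact(raw: str) -> str:
    """Lower-case and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def assign_platforms(raw: str) -> List[str]:
    """Return every platform mentioned in a free-text instrument field.

    An empty list means the entry could not be assigned a platform.
    """
    key = compact(raw)
    if key in PLACEHOLDERS:
        return []
    return [name for name, pattern in PLATFORM_PATTERNS if pattern.search(key)]
```

Returning a list rather than a single label keeps entries that mention more than one platform visible instead of silently picking one. Substring matching on the compacted string is crude (short keys such as "mgi" could match by accident across word boundaries), so anything like this would still need spot checks against the raw values.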

Breakdown Overall By Instrument

There are also genuine scientific complications -- some labs, particularly early on, used a mix of methods.  Some appear to have a mix of instruments of the same tech and don't differentiate.  These are minor and I've decided to just ignore the issue -- as a result, if you were to see my raw numbers and go out enough decimals, the percentages don't sum to 100%.  But, as an example, about 0.071% of sequences mentioning Illumina also mention another platform.

The top level headline is that 79% of GISAID sequences were generated on an Illumina sequencer, 17% on Oxford Nanopore, 2.8% on Pacific Biosciences, 1.2% on Ion Torrent, 1.1% on an MGI box and 0.4% used Sanger.  

If we break Illumina down further, there is a problem that 19% of the entries say they used Illumina but don't specify an instrument type.  Of those that do specify, 47% came off NovaSeq, 26% off MiSeq, 20% off NextSeq, 3% HiSeq, 1% iSeq and 0.8% MiniSeq.  The internet sentiment was that this would skew more towards NextSeq -- which would probably have been my guess as well.
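The arithmetic behind "of those that do specify" is worth spelling out, since the unspecified entries are reported as a share of all Illumina entries while the per-model shares use only the specified ones as the denominator. A small sketch, with a hypothetical function name and a hypothetical "unspecified" label (the post gives only percentages, so any counts fed to it would be made up):

```python
from typing import Dict, Tuple


def split_unspecified(
    counts: Dict[str, int], unspecified: str = "unspecified"
) -> Tuple[float, Dict[str, float]]:
    """Return (unspecified fraction of all entries, shares among specified).

    `counts` maps an instrument model to its number of entries; the key named
    by `unspecified` (a hypothetical label) holds entries with no model given.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no entries to summarize")
    n_unspecified = counts.get(unspecified, 0)
    n_specified = total - n_unspecified
    if n_specified == 0:
        return 1.0, {}
    shares = {
        model: n / n_specified
        for model, n in counts.items()
        if model != unspecified
    }
    return n_unspecified / total, shares
```

The per-model shares returned here sum to 1 among themselves, which is why they cannot be added to the unspecified fraction.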

Oxford Nanopore has an even worse problem with entries that aren't broken down further: 56% of them.  A reasonable guess is that these are dominated, perhaps entirely, by MinION, but I've dropped them.  Of those that are left, it's 81% MinION, 18% GridION and 0.1% PromethION.

PacBio has unspecified leading Sequel II by a small margin, with very few Sequel I.

A very notable absence from the database is Genapsys. The company has pitched their instrument as very good for this work, with no index hopping and high accuracy.  They've shown posters of this at meetings.  But if anybody has actually sequenced the virus at scale with Genapsys, they either are among those who can't get their metadata right or they are foolishly sitting on data that could help the world.

Breakdowns by Platform and Geography

A question of interest is how much platform choices differ between geographical groups.  Note that the geography here is actually that of the sample, but since few samples are collected in one nation and sequenced in another I'll just conflate the two.

For example, the UK remains the top source of GISAID entries with 382K in my sample.  The UK submitters are tidy: only 19 entries can't be assigned a platform.  There's one lone Ion Torrent entry, none for PacBio and 31 Sanger.  Interestingly, the UK's Illumina share is actually higher than in the global sample: 86%, with 13% Oxford Nanopore.

The U.S. for a long time was grossly undersequencing the virus, but that has reversed and we are approaching the UK now with 339K entries (I am proud to have had a hand in several thousand of those, though wish I could claim more).   The US also has 1% unassignable tech - those large submitters with bad code didn't help here.  79% is Illumina and 10% PacBio, with Oxford in third with 8.5%.  Most of those PacBio sequences are very recent; if you took this picture back in the fall ONT would have a healthier share (and yes, I could take that look but haven't gotten to it).  Ion Torrent is less than a percent and Sanger is rounding error.

For some of the minor platforms, who is using them?  PacBio is mostly a US affair: 34,487 sequences.  France has 257 and interestingly there are 4 from Vietnam.  

MGI has an interesting mix -- 7.7K from Italy, 2.6K from Sweden, 1.7K from Latvia, 1.1K from UAE, only 130 from China (!), 109 from Canada, 61 from Brazil, 29 from Thailand and 9 from Russia.

That low count for China and MGI is striking. Overall China has not deposited many sequences in GISAID, only 1,678 total.  Interestingly, 53% of those were on ONT and only 24%  on Illumina with 7.3% unassignable to a platform.

So who is using Ion Torrent?  The top user is the USA at 3.1K but Sweden is close behind with 2.9K and then India at 2.2K.  But the list is long - 40 countries -- and represents all the populated continents. National percentages can be large:  All seven of Mongolia's sequences, 88% of the 230 sequences from Oman, 66% of Morocco's 164, 63% of the 448 sequences from Egypt, 28% of Brazil's 5,889 and 21% of India's 10,536.  Overall the picture looks like this (dashed line is the world average).
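For anyone wanting to reproduce this kind of comparison, national shares against the world average can be computed along these lines. This is a sketch under assumptions, not the analysis behind the plot: it takes one (country, platform) pair per entry, ignores entries that mention several platforms, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def platform_share_by_country(
    records: Iterable[Tuple[str, str]], platform: str
) -> Tuple[Dict[str, float], float]:
    """Return (per-country share of `platform`, world-wide share).

    `records` yields one (country, platform) pair per database entry.
    """
    totals: Counter = Counter()
    hits: Counter = Counter()
    for country, plat in records:
        totals[country] += 1
        if plat == platform:
            hits[country] += 1
    shares = {country: hits[country] / totals[country] for country in totals}
    n_all = sum(totals.values())
    world = sum(hits.values()) / n_all if n_all else 0.0
    return shares, world


def above_world_average(shares: Dict[str, float], world: float) -> List[str]:
    """Countries whose share exceeds the world average, highest share first."""
    return sorted(
        (country for country, share in shares.items() if share > world),
        key=lambda country: -shares[country],
    )
```

Note that the world average here is weighted by entry count, so a handful of high-volume countries dominate it; a country with only a few sequences can sit far above or below the line on very little data.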

For Oxford I've plotted just the countries with usage above the global average.  

There's a bunch of countries piled up at the top because nearly 100% of their data is from Oxford devices -- most actually are at 100%, but then there are quirks such as 1 of Curacao's and 2 of Zambia's sequences not being ONT.  Denmark is striking for the huge number of sequences run entirely on ONT.

Country                             Sequences   ONT
Cote d'Ivoire                             145   145
Sint Maarten                              122   122
Burkina Faso                               96    96
Trinidad and Tobago                        55    55
Saint Lucia                                10    10
British Virgin Islands                      5     5
Sint Eustatius                              4     4
Saint Kitts and Nevis                       3     3
Cayman Islands                              3     3
Saint Vincent and the Grenadines            1     1
Antigua and Barbuda                         1     1

The list has a lot of Caribbean nations or territories in it as well as multiple African nations.  Indeed, looking at all the island nations of North and South America shows Illumina dominating only in Bahamas (28/28), Bermuda (30/30), Cuba (2/2), Dominican Republic (19/20), Guadeloupe (102/102), St. Barts (2/2) and St. Martin (10/10), and just barely in Jamaica (8/15 - 5 ONT and 2 Sanger).  Another tidbit: French Saint Martin has sequenced very few samples (10), all on Illumina, while the Dutch side, Sint Maarten, has 122, all on Nanopore.

The two non-ONT sequences from Zambia were Sanger and the one Curacao non-ONT was Ion Torrent.  Not earth shattering, but a check on my data cleaning.

Well, I'm out of good ideas for tonight -- as you can see in the last bits above! Looking at sample prep would be interesting, but that data is much more sparse and messy so I haven't decided whether it is worth cleaning up.  Tracking the change in sequencer mix over time for high output countries could be interesting.


Anonymous said...

What do you make of Oxford Nanopore IPO?

Anonymous said...

Interesting breakdown. Any chance the data will be made available to the public in the future?

Jack Leonard said...

Keith, Congrats for the acknowledgement that you received from Francis deSouza for this post in the ILMN Q1 2021 earnings call. -Jack-