Thursday, May 26, 2011

Paying a Painful 75% Secrecy Tax

In a post a while back, I mentioned that my Ion Torrent sequencing project was stalled because my service provider couldn't get some of the key kits, despite an Ion representative posting that no such shortages existed.  I've been remiss in updating that; last Tuesday the kits showed up and Monday I got my data -- and a bit of a shock.

Now, to give some further background, the sample contained a bunch of PCR amplicons from a human cell line DNA pool.  I had worked carefully on the designs based on the available information from Ion -- that is, on nearly nothing.  When I started designing in early March there really was no information, but later they did release an Application Note on amplicon sequencing (indeed, I couldn't get a copy of this until after I had ordered my first primers).  Once you slog through (or skip over) two full pages of sales pitch for the sequencing chemistry, you get a little bit of useful information.  In particular, it gives the priming sequences for the emPCR, which you need to include in your primers if you don't want to ligate adapters on (fusion designs).  As I said, I didn't have this when I designed, and based on read length issues I decided that at least for now adapter ligation would have advantages.  There follows some useful discussion and graphs to explore what read depth you need and whether to plan on unidirectional (fusion) or bidirectional (ligation) sequencing.  Following that is a pitch for their partner's amplicon sequencing analysis software.  I had already designed my primers when I got the Application Note in my paws, but nothing in it suggested I was headed for trouble.
When the provider gave me back my data, they warned me that we had not acquired as much data as desired.  Run on the 314 chip (since none other is broadly available), the expected yield would be around 100K reads of up to 100bp.  Instead, we had about 28K reads after filtering, which dropped to just under 25K mappable reads.  The painful attrition process, along with the attractive chip loading plot, are shown below.
This also suggests where Ion thinks they can squeeze more performance from their chips; 1.2M wells but only 43% had beads and only 23% were active beads.  I like having spike-in controls, but it would be useful to know how much the software utilizes them if I'm burning 20K reads on them.  Once we get to just what is allegedly my library, the big hit is 51% poor signal.  So why so poor?
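Before getting to the why, the headline arithmetic can be spelled out in a few lines. This is only a sketch using the rounded read counts quoted above ("just under 25K" is taken as 25,000); the function names are mine, and reading the 75% in the title as the shortfall against the expected 100K reads is my interpretation of those numbers.

```python
# Read counts quoted in the post; all are rounded figures.
EXPECTED_READS = 100_000  # expected yield from a 314 chip
FILTERED_READS = 28_000   # reads left after filtering
MAPPED_READS = 25_000     # "just under 25K" mappable reads, rounded up


def retention(after: int, before: int) -> float:
    """Fraction of reads kept going from one stage to the next."""
    return after / before


def shortfall(observed: int, expected: int) -> float:
    """Fraction of the expected yield that never materialized."""
    return 1.0 - observed / expected


if __name__ == "__main__":
    print(f"filtered / expected: {retention(FILTERED_READS, EXPECTED_READS):.0%}")
    print(f"mapped / filtered:   {retention(MAPPED_READS, FILTERED_READS):.0%}")
    print(f"shortfall vs plan:   {shortfall(MAPPED_READS, EXPECTED_READS):.0%}")
```

With these rounded inputs the mapped reads come to a quarter of the expected yield, so three quarters of the planned data is missing.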
Well, I've griped about Ion's penchant for secrecy before; package inserts are available on-line only to owners, not the broader community of scientists who want to access the platform.   I still haven't had a chance to get Ion's side of things, but their demanding that various documents be pulled from SEQAnswers claiming them proprietary is completely the wrong attitude.  Libel and slander could arguably be suppressed, but beyond that post your corrections.  In any case, on this dataset their practices (and my tendency to think "Hurry up, hurry up, the fine print doesn't mean a thing" hardly helps) burned me good.
The service provider conferred with Ion and came back with the verdict that my library was too big; the protocols are designed for inserts smaller than 150 bp, and my amplicons were carefully designed to be 150-205 bp in size.
Through means not quite at the level of George Smiley, I've obtained the documentation for the Ion Fragment Library Preparation kit.  It even has a very helpful appendix on Amplicon Sequencing.  The only real hint about the size limitation is the statement that "Target regions from 75 to 150 nucleotides in length must be sequenced bidirectionally".  Clearly this is insufficiently emphatic!  In the fragment library preparation information, the more serious warning does appear: "Libraries with a mean size >~220 bp yield results of reduced sequencing quality" (and that size is after adding adapters).
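Had the limit been stated plainly, checking a design against it would have taken seconds. A minimal sketch, assuming the limit is a 150 bp insert as relayed by the provider and treating exactly 150 bp as acceptable (the documents are not precise on that boundary); the amplicon names and lengths are hypothetical stand-ins, not my actual designs.

```python
MAX_INSERT_BP = 150  # limit relayed by the service provider; inclusive boundary is an assumption


def oversized(amplicons: dict[str, int], max_insert: int = MAX_INSERT_BP) -> dict[str, int]:
    """Return the amplicons longer than max_insert, mapped to their excess in bp."""
    return {
        name: length - max_insert
        for name, length in amplicons.items()
        if length > max_insert
    }


if __name__ == "__main__":
    # Hypothetical design spanning the 150-205 bp range described in the post.
    design = {"amp_A": 150, "amp_B": 180, "amp_C": 205}
    for name, excess in oversized(design).items():
        print(f"{name}: {excess} bp over the {MAX_INSERT_BP} bp limit")
```

The ~220 bp mean library size from the kit documentation is measured after adapters are added, so checking against it would also need the adapter lengths, which I have not filled in here.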
Given that this is such a crucial point in amplicon design, why isn't it in the Application Note?  I can make two guesses.  One is a pure failure to think through what a customer might want to find in the note, as opposed to what marketing would like them to see.  The other is worse: that the information was left out because it would need to change with each advance in the Ion chemistry.
Ideally, Ion would have a Wiki-type protocols section which presented the current latest-and-greatest protocols.  These could be marked as fully supported or half-baked, to distinguish those for the masses and those for the reckless explorers.
What's particularly silly about all this is that if there is an application Ion should be trying to push hard, it is amplicon sequencing, as this is what is going to sell a lot of machines to a lot of small labs.  Given the current data generation limitations of the PGM, it isn't going to be very useful for much else in a mammalian (or plant or probably most invertebrate) context -- just not enough reads or data generation for genome, exome, transcriptome, ChIP or about any other kind of sequencing.  The library verification product is a smart one, but once MiSeq appears (and by the way, the poster I mentioned is now available online, along with some slightly more detailed PowerPoint slides) it will be a rare HiSeq lab that uses the PGM for this purpose; why go through extra prep steps when you can pop exactly the same sample on the MiSeq for a quick look?  There will be labs that find the PGM useful for bacterial studies, though I'm guessing that the low read counts will make it a hard sell in metagenomics.  But more importantly, amplicon sequencing is likely to be a major mode of medical sequencing and environmental surveillance and lots of other applications that might get a sequencer into every hospital and pathology lab in the country.  That's the market Ion is salivating over, yet they are failing to fertilize the fields that will yield those assays. 
I'm still digging through the data & it is a bit more complicated than the previous E. coli DH10B datasets I've seen (which, to correct the record, number 5, but I promised not to publicize details from 4 of them as a condition of obtaining them).  This is human DNA, and to score errors I need to figure out which differences from the reference are real (we have already pegged several known variants in our sample, as well as others that are annotated SNPs).  I will leave you with one final plot on this point: the read length distribution (shown as a running sum coming in from the right) after 130 cycles (where a cycle is 4 nucleotide flows, though not always the same order of nucleotides).  The upper curve shows that quite a few reads went over 100 nt (indeed, we had 5 reads of 117 nt which were perfect matches to the genome).  The lower curve shows the number of nucleotides aligned (by TMAP) in each read.  The difference between these curves is one way to look at what is unusable in the original call.
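For anyone wanting to draw the same kind of curve from their own run, the "running sum coming in from the right" is just a count of reads at least as long as each length. A minimal sketch, assuming you already have a plain list of per-read lengths (raw, or aligned lengths for the lower curve); the function name and the toy lengths are illustrative only, not my data.

```python
from collections import Counter


def reads_at_least(lengths: list[int]) -> dict[int, int]:
    """Map each length L (from the longest read down to 1) to the number of reads with length >= L."""
    counts = Counter(lengths)
    if not counts:
        return {}
    running = 0
    result = {}
    # Walk from the longest read downward, accumulating as we go.
    for length in range(max(counts), 0, -1):
        running += counts.get(length, 0)
        result[length] = running
    return result


if __name__ == "__main__":
    toy_lengths = [60, 95, 100, 117, 117]  # illustrative values only
    curve = reads_at_least(toy_lengths)
    for threshold in (117, 100, 60):
        print(f"reads >= {threshold} nt: {curve[threshold]}")
```

Running it once on the raw read lengths and once on the aligned lengths, then subtracting the two results at each length, gives the gap between the curves.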




16 comments:

Douglas Yu said...

The amplicon length limit was disclosed to us by LifeTech salespeople back in January. I'm surprised that they don't make this very very obvious, because it is THE major reason why not to buy a PGM yet.

Anonymous said...

Some knew, some didn't care.

Fact: Reps at Life Tech are constantly being told they might get fired. Reps at Life Tech get a Rolex for a certain number of PGM sold.

So you end up with a spectrum of actions: honest reps and other reps, who are stuck between feeding their kids or lusting for that ugly power watch of the 90s.

I smell a good MiSeq advert right there.

Don't blame the rep, blame the culture.

NB Pretty sure it's the same at other companies...

Wraithnot said...

Thanks for posting this info- we're trying to decide whether or not to buy a PGM for amplicon sequencing and there's very little independent information out there.

I also had a few quick questions:
1. If the amplicon size was indeed the issue, your shorter amplicons should have been over-represented in the usable data. Did you observe this?
2. The sales rep described a streptavidin bead enrichment step to remove "ion spheres" that didn't get a template molecule (and therefore didn't incorporate the biotinylated A adapter) during the emulsion PCR. But the ratio of "live ISPs" to total ISPs and the ratio of multiclonal ISPs to live ISPs are consistent with Poisson statistics (at least if I did the math properly and if I am interpreting live ISPs correctly). Did your sequencing provider run the enrichment step?

gasstationwithoutpumps said...

Your read length estimates are falling prey to the same error Ion Torrent makes in their ads: ignoring the quality of the reads. Now that you have your own data, plot the read length distributions after quality trimming.

I think that you'll find that the quality drops rapidly after about 60 bases, so you don't have as much data as you think.

Ion Torrent suppressed a paper that had that info in it, with the threat of a lawsuit.

Anonymous said...

Let me know when N>1. What's the p-value of this blog?

Keith Robison said...

Wraithnot: I checked with the provider and an enrichment step was performed.

Yes, it does seem that shorter amplicons show up more frequently in the data, though the results are not clean. We didn't normalize the input amounts, and poor amplifiers in the original PCR definitely fared worse.

gasstationwithoutpumps said...

I posted some data (not mine, so I can't give more details) showing how read length and error rate varied with quality trimming for the Ion Torrent:
http://gasstationwithoutpumps.wordpress.com/2011/05/31/a-use-for-an-ion-torrent/

Wraithnot said...

Keith,

Thanks for the additional info. Hopefully a recent post on SEQanswers indicating that the One Touch system will help with this size limitation is correct. It also sounds like Ion needs to tweak their enrichment protocol since the data you showed included this step.

Wraithnot said...

GasStation- Ion Torrent posted a similar accuracy-vs-length analysis of their latest data in Figure 4 of the "performance overview" pdf on their website. This analysis indicates they are now getting 99% accuracy at 100 bp read length. Does this mean that they've improved their accuracy to be equivalent to Illumina, or are they doing something sneaky that isn't obvious to a non expert?

Andy said...

Keith, thanks for posting the data. really interesting.

You can estimate the representation of each amplicon in the sequence library experimentally by qPCR with the original primers used for the amplification. We've found this to be a useful way of estimating representation before sequencing, and relative copy numbers calculated from delta Cq correlate pretty well with the observed relative depth of coverage in sequence data (for 454 at least). This approach should help assess the relative contributions of library generation and (emPCR+sequencing) to representation in these data.

Anonymous said...

Your service provider should have run the samples out on a bioanalyzer to verify the sizing before ever doing the run. This is generally included in the charge. Who did you use? Some are better than others. I would stick to the CSP providers if at all possible.

Anonymous said...

Should have bought a GS Junior for Amplicon Sequencing...

Sorry but it is your own fault for not looking into the technology properly and for believing Life Tech reps instead of doing what scientists should do....RESEARCH

Anonymous said...

He didn't have a machine and wanted to see the reads; haven't you read his blog post? Now, on the other hand I do have to agree that scientists are now into saving every penny possible and Ion Torrent's usage price is attractive. Why you know, Life will be very quick to tell you that Ion Torrent was used by BGI for that super killer cucumber bug sequencing. If you read BGI carefully they basically mentioned that they used what they had, which was a previously well known genome template to work from. Ion Torrent really didn't offer a huge difference.

Does all this really matter? Everyone here knows that sequencing is going the oligo way, there will be tons of houses who will offer NGS sequencing at pennies, and soon we will buy pennies and companies will sell pennies and the bubble will crumble again and we will be making Walmart science with Walmart quality.

Anonymous said...

Ion's primers have biotin to make the enrichment faster. If you designed your primers on the older protocol, then you will get no enrichment. 20% enrichment with 16% polyclonal pretty much shows that.

They sequenced your sample with no enrichment pretty much.

Porih said...

The one touch won't help, droplet size is plenty big enough to drive copies to the bead. People really don't understand how it works. It is all about how many copies on the bead you can get with longer pieces of DNA. This property is well documented for solid surface DNA amplification.

When sequencing amplicons, do you need to also trim the PCR primer read? I always assumed since the PCR primers are in molar excess and extension occurs off the 3' end, the PCR primer sequence is related to the synthetic oligo and not the sample.

Subtracting the PCR priming sites from the amplicons would really hit their usable read length.

monique_haakensen said...

here's a possible solution??
1) make the 5' primer to contain the adaptor. the 3' primer should not
2) do PCR of any size
3) shear the DNA
4) select the size (up to 150bp)of sheared DNA you want
5) ligate only the 3' adaptor. Only those fragments that contain a 5' adaptor will bind the bead (so you will likely need a higher starting DNA conc than you normally would use), but all your reads should contain the first 150bp sequenced from the 5' end, so you could align that portion and compare these for the snps etc.

haven't tried it... just theorizing this could work? let me know if it does! :)