To give some further background, the sample contained a bunch of PCR amplicons from a human cell line DNA pool. I had worked carefully on the designs based on the available information from Ion -- that is, on nearly nothing. When I started designing in early March there really was no information, but later they did release an Application Note on amplicon sequencing (indeed, I couldn't get a copy of this until after I had ordered my first primers). Once you slog through (or skip over) two full pages of sales pitch for the sequencing chemistry, you get a little bit of useful information: in particular, the priming sequences for the emPCR, which you need to include in your primers if you don't want to ligate adapters on (fusion designs). As I said, I didn't have this when I designed, and based on read length issues I decided that, at least for now, adapter ligation would have advantages. There follows some useful discussion and graphs exploring what read depth you need and whether to plan on unidirectional (fusion) or bidirectional (ligation) sequencing. Following that is a pitch for their partner's amplicon sequencing analysis software. I had already designed my primers when I got the Application Note in my paws, but nothing in it suggested I was headed for trouble.
When the provider gave me back my data, they warned me that we had not acquired as much data as desired. On the 314 chip (used since none other is broadly available), the expected yield would be around 100K reads of up to 100 bp. Instead, we had about 28K reads after filtering, which dropped to just under 25K mappable reads. The painful attrition process, along with the attractive chip loading plot, is shown below.
This also suggests where Ion thinks they can squeeze more performance from their chips; 1.2M wells but only 43% had beads and only 23% were active beads. I like having spike-in controls, but it would be useful to know how much the software utilizes them if I'm burning 20K reads on them. Once we get to just what is allegedly my library, the big hit is 51% poor signal. So why so poor?
Well, I've griped about Ion's penchant for secrecy before; package inserts are available on-line only to owners, not the broader community of scientists who want to access the platform. I still haven't had a chance to get Ion's side of things, but their demanding that various documents be pulled from SEQAnswers, claiming them proprietary, is completely the wrong attitude. Libel and slander could arguably be suppressed, but beyond that, post your corrections. In any case, on this dataset their practices burned me good (and my tendency to think "Hurry up, hurry up, the fine print doesn't mean a thing" hardly helped).
The service provider conferred with Ion and came back with the news that my library was too big: the protocols are designed for inserts smaller than 150 bp, and my amplicons were carefully designed to be 150-205 bp in size.
Through means not quite at the level of George Smiley, I've obtained the documentation for the Ion Fragment Library Preparation kit. It even has a very helpful appendix on Amplicon Sequencing. The only real hint about the size limitation is the statement that "Target regions from 75 to 150 nucleotides in length must be sequenced bidirectionally". Clearly this is insufficiently emphatic! In the fragment library preparation information, the more serious warning does appear: "Libraries with a mean size >~220 bp yield results of reduced sequencing quality" (and that size is after adding adapters).
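Had I known about these limits, a design-time check would have been trivial. Here is a minimal sketch of one, assuming the 150 bp insert limit and the ~220 bp mean library size quoted above. The function name, the amplicon names and the `adapter_bp` parameter are all hypothetical; the documentation I quoted does not give the combined adapter length, so it has to be supplied by whoever runs the check.

```python
def flag_oversized(amplicons, max_insert_bp=150, adapter_bp=None, max_library_bp=220):
    """Return (name, length, reasons) for each amplicon that looks too long.

    amplicons      -- dict mapping amplicon name to insert length in bp
    max_insert_bp  -- insert limit (150 bp, per the provider's feedback)
    adapter_bp     -- ASSUMPTION: combined length of both adapters; not stated
                      in the quoted documentation, so None skips this check
    max_library_bp -- approximate limit on library size after adapters; the
                      documentation states it for the MEAN size, so applying
                      it per amplicon is a conservative simplification
    """
    flagged = []
    for name, length in amplicons.items():
        reasons = []
        if length > max_insert_bp:
            reasons.append(f"insert {length} bp > {max_insert_bp} bp")
        if adapter_bp is not None and length + adapter_bp > max_library_bp:
            reasons.append(
                f"library {length + adapter_bp} bp > ~{max_library_bp} bp"
            )
        if reasons:
            flagged.append((name, length, reasons))
    return flagged


# Made-up design spanning the size range discussed above.
design = {"amp_a": 120, "amp_b": 150, "amp_c": 205}
for name, length, reasons in flag_oversized(design):
    print(name, length, "; ".join(reasons))
```

Whether an insert of exactly 150 bp should pass is not clear from the wording I have, so the sketch lets it through; tighten the comparison if you want to be cautious.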
Given that this is such a crucial point in amplicon design, why isn't it in the Application Note? I can make two guesses. One is a pure failure to think through what a customer might want to find in the note, as opposed to what marketing would like them to see. The other is worse: that the information was left out because it would need to change with each advance in the Ion chemistry.
Ideally, Ion would have a Wiki-type protocols section which presented the current latest-and-greatest protocols. These could be marked as fully supported or half-baked, to distinguish those for the masses and those for the reckless explorers.
What's particularly silly about all this is that if there is an application Ion should be trying to push hard, it is amplicon sequencing, as this is what is going to sell a lot of machines to a lot of small labs. Given the current data generation limitations of the PGM, it isn't going to be very useful for much else in a mammalian (or plant or probably most invertebrate) context -- just not enough reads or data generation for genome, exome, transcriptome, ChIP or about any other kind of sequencing. The library verification product is a smart one, but once MiSeq appears (and by the way, the poster I mentioned is now available online, along with some slightly more detailed PowerPoint slides) it will be a rare HiSeq lab that uses the PGM for this purpose; why go through extra prep steps when you can pop exactly the same sample on the MiSeq for a quick look? There will be labs that find the PGM useful for bacterial studies, though I'm guessing that the low read counts will make it a hard sell in metagenomics. But more importantly, amplicon sequencing is likely to be a major mode of medical sequencing and environmental surveillance and lots of other applications that might get a sequencer into every hospital and pathology lab in the country. That's the market Ion is salivating over, yet they are failing to fertilize the fields that will yield those assays.
I'm still digging through the data, and it is a bit more complicated than the previous E. coli DH10B datasets I've seen (which, to correct the record, number 5, but I promised not to publicize details from 4 of them as a condition of obtaining them). This is human DNA, and to score errors I need to figure out which differences from the reference are real (we have already pegged several known variants in our sample, as well as others that are annotated SNPs). I will leave with one final plot on this point: the read length distribution, shown as a running sum coming in from the right, after 130 cycles (where a cycle is 4 nucleotide flows, though not always the same order of nucleotides). The upper curve shows that quite a few reads went over 100 nt (indeed, we had 5 reads of 117 nt which were perfect matches to the genome). The lower curve shows the number of nucleotides aligned by TMAP in each read. The difference between these curves is one way to look at what is unusable in the original call.
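For anyone who wants to build the same kind of plot from their own run, here is a minimal sketch of the "running sum coming in from the right": for each length n, count how many reads have at least n nucleotides. It assumes you have already extracted the per-read lengths and the per-read aligned lengths as plain lists of integers. The function names and the toy numbers are mine for illustration, and nothing here calls TMAP or reads its output.

```python
from collections import Counter


def running_sum_from_right(lengths):
    """Map each length n (1..max) to the number of values that are >= n."""
    if not lengths:
        return {}
    counts = Counter(lengths)
    curve = {}
    running = 0
    # Walk from the longest length down, accumulating as we go.
    for n in range(max(counts), 0, -1):
        running += counts[n]  # a Counter returns 0 for lengths never seen
        curve[n] = running
    return curve


def curve_gap(read_lengths, aligned_lengths):
    """Difference between the read-length curve and the aligned-length curve."""
    reads = running_sum_from_right(read_lengths)
    aligned = running_sum_from_right(aligned_lengths)
    return {n: reads[n] - aligned.get(n, 0) for n in reads}


# Toy data, not from my run: three reads, one of which aligns only partially.
read_lengths = [100, 117, 90]
aligned_lengths = [100, 117, 50]
gap = curve_gap(read_lengths, aligned_lengths)
print(gap[90])
```

Reads that fail to align at all can be given an aligned length of 0; they then count toward the upper curve but never toward the lower one.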