The size of library inserts is an important topic in modern sequencing, on both long and short read platforms. I have often talked about long read insert sizes in this space, but today I'm sticking to short reads. Library insert sizes are important here for many reasons. If you are doing paired-end sequencing, inserts below a certain size will yield overlapping reads, which can be useful (e.g. for read pair merging) but also means redundant sequencing. Libraries dominated by very long inserts may require special handling, such as a lower loading concentration on Element's AVITI system, as they generate larger polonies that would be too crowded if loaded at standard densities.
So examining library insert sizes is a common task for QCing or troubleshooting short read libraries. Depending on the library preparation method, multiple factors can influence the size distribution. For example, SPRI bead cleanups can be tuned to select for a specific size range, as can various electrophoretic methods. Input DNA concentrations at various steps can affect the size distribution, and any shearing of the DNA upstream of library prep will affect it too. Some inputs, such as FFPE or cell-free DNA, have inherent size distributions imposed by the input material. And biases in the amplification (if not PCR-free) or cluster formation may skew things further.
Here, for example, is the size distribution of a library prepared by ligation. I assembled an SRA accession with SPAdes, mapped the reads back to the assembly with bwa-mem2, and pulled the insert lengths using pysam. I plotted only from 100 to 600 bases, which is where most of the data lies - anything particularly small or large is both rare and a bit suspicious. Given the shape of the curve, it would seem little if any size selection was imposed. It's not quite a normal distribution.
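For anyone wanting to try the same tally, here is a minimal sketch of what the pysam step could look like. It is an illustration, not the script behind these plots: the function names, the 100 to 600 window as defaults, and the filtering choices (primary alignments, proper pairs, read 1 only) are all assumptions.

```python
from collections import Counter


def tally_insert_sizes(reads, min_len=100, max_len=600):
    """Count template lengths from an iterable of pysam-style aligned reads."""
    counts = Counter()
    for read in reads:
        # Assumed filter: skip secondary/supplementary hits and improper pairs.
        if read.is_secondary or read.is_supplementary or not read.is_proper_pair:
            continue
        # Count each pair once by looking only at read 1.
        if not read.is_read1:
            continue
        size = abs(read.template_length)  # TLEN is signed, so take the magnitude
        if min_len <= size <= max_len:
            counts[size] += 1
    return counts


def tally_bam(path, **kwargs):
    """Open a BAM with pysam (imported lazily) and tally its insert sizes."""
    import pysam  # third-party; only needed when reading a real BAM

    with pysam.AlignmentFile(path, "rb") as bam:
        # until_eof=True walks every record and does not require an index.
        return tally_insert_sizes(bam.fetch(until_eof=True), **kwargs)
```

Calling `tally_bam("mapped.bam")` (a hypothetical file name) would return a `Counter` keyed by insert length, ready for plotting.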
Okay, here's another one, but this time a library prepared with seqWell's plexWell library prep approach.
Now the interesting case. Why is this plot so noisy? Well, for one thing I don't believe in histograms for these plots - I'm not binning anything since insert length is inherently an integer value and the numbers for each size are huge - why bother smoothing anything? Plus I know the basic matplotlib scatter plot function options far better than those for the histogram plotting function.
I forget whether, the first time I stumbled on this, I used scatterplots at first or connected the dots from the beginning, but it certainly is more intriguing if you plot this one as a curve.
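The no-binning approach, with either dots or a connected curve, can be sketched as below. This assumes a mapping from insert length to count (a dict or Counter); the function names, axis labels and output path are hypothetical, and matplotlib is imported lazily so the data-shaping helper works without it.

```python
def dense_series(counts, min_len=100, max_len=600):
    """One (x, y) point per integer length, with zeros where nothing was seen."""
    xs = list(range(min_len, max_len + 1))
    ys = [counts.get(x, 0) for x in xs]
    return xs, ys


def plot_insert_sizes(counts, out_png, connect=False, min_len=100, max_len=600):
    """Scatter plot by default; connect=True draws the same points as a curve."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    xs, ys = dense_series(counts, min_len, max_len)
    fig, ax = plt.subplots()
    if connect:
        ax.plot(xs, ys, linewidth=0.8)
    else:
        ax.scatter(xs, ys, s=4)
    ax.set_xlabel("Insert length (bases)")
    ax.set_ylabel("Read pairs")
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```

Filling in zeros for unseen lengths matters mostly for the connected version, since otherwise the line would silently bridge any gaps.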
Huh! The noise appears to be periodic, with a progressive decrease in the amplitude as the inserts are longer. What is going on? Have you guessed?
If you'd like to see my guess, here I've used vertical red lines to show where, given the first peak I've marked, I'd expect to see the 3rd peak after it.
Why third? Well, that's inherent in what my guess was. Because the guess I had in mind wasn't an integer!! And once I had that magic value in my head I leaped to a mechanistic explanation, given the short read library technology that had been used. Indeed, the whole reason I was doing this is we were evaluating switching to this particular library technology. And everything fell into place - the period and why the amplitude decreases with longer library lengths. Have you guessed it?
Before the big reveal, a penultimate hint - the peak before the first one I've marked is at 105 bases.
Okay, the little reveal - which is the final hint. The library prep for library X is Illumina's Bead-Linked Transposome or BLT technology. This has adapter-loaded transposases attached to tiny beads. Input DNA binds the beads and is attacked by the transposases. The key point here is we don't see the effect with the plexWell library but do with BLT - so it must have something to do with the beads.
If it has to do with the beads, then Universal Sequencing Technology's TELL-Seq should show periodicity as well, since it also features transposases on beads. TELL-Seq has the extra twist of barcoding the beads to generate linked read clouds, but we're ignoring this here. All I'm interested in is whether the insert size distribution shows a period. Does it? Yup!!!
Is the period the same? Well, the TELL-Seq data is much messier - but the same sizes I marked before do seem to be a bit peak-y. And 105 is again a peak.
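Eyeballing peaks in messy data only goes so far, so one way to put a number on "is the period the same" is a crude periodogram. This is a sketch of a technique I'm swapping in for illustration, not something used for the plots above: divide out a moving average to remove the overall hump, then score a grid of trial periods (which need not be integers) by how strongly the residuals project onto a sine and cosine of that period. The function names, the window size and the 5 to 20 base search range are assumptions.

```python
import math


def detrend(ys, window=31):
    """Residuals of each count relative to a centered moving mean (window odd)."""
    half = window // 2
    residuals = []
    for i in range(half, len(ys) - half):
        local_mean = sum(ys[i - half : i + half + 1]) / window
        residuals.append(ys[i] / local_mean - 1.0 if local_mean > 0 else 0.0)
    return residuals


def estimate_period(ys, min_period=5.0, max_period=20.0, step=0.05, window=31):
    """Return the trial period (in bases) with the most spectral power.

    ys must hold counts for consecutive integer insert lengths.
    """
    res = detrend(ys, window)
    best_period, best_power = None, -1.0
    n_steps = int(round((max_period - min_period) / step))
    for k in range(n_steps + 1):
        period = min_period + k * step
        w = 2.0 * math.pi / period
        c = sum(r * math.cos(w * i) for i, r in enumerate(res))
        s = sum(r * math.sin(w * i) for i, r in enumerate(res))
        power = c * c + s * s  # phase-independent strength at this period
        if power > best_power:
            best_period, best_power = period, power
    return best_period
```

Running this on the counts from each library would give one estimated period per library to compare, though on very noisy data the top-scoring period should be treated with some suspicion.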
Hmm. DNA on surfaces. Transposases inserting. A size peak at 105 and the periodicity of the peaks is not an integer. What important real number does 105 suggest?????
Did you have a eureka moment? Have you figured out what the period is?
The transposase's attack preference period is the same as the helical pitch of B-DNA, about 10.5 bases per turn!!!!!! Which makes good sense - if the DNA is frequently lying completely flat on the bead, then the phosphodiester bonds most accessible to the transposase will tend to be spaced at multiples of that pitch. Those peaks I marked - they are 21 bases apart, or two turns of the helix!
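The arithmetic behind the hints can be checked in a few lines. This sketch assumes the commonly quoted figure of about 10.5 base pairs per helical turn for B-DNA; the names are illustrative only.

```python
from fractions import Fraction

# Assumption: ~10.5 bp per turn of B-DNA, held as an exact fraction.
PITCH = Fraction(21, 2)


def turns(length):
    """Number of helical turns spanned by `length` bases."""
    return Fraction(length) / PITCH


def integer_multiples(first_turn, last_turn):
    """Lengths that are exact multiples of the pitch AND whole numbers of bases."""
    return [
        int(n * PITCH)
        for n in range(first_turn, last_turn + 1)
        if (n * PITCH).denominator == 1
    ]


print(turns(105))                 # 10
print(integer_multiples(10, 16))  # [105, 126, 147, 168]
```

Because the pitch is a half-integer, only every second turn lands on a whole number of bases, so the whole-number positions come out 21 bases apart.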
So what Franklin & Gosling's photograph 51 first revealed experimentally has fallen out of what was a routine sequencing technology comparison. A textbook number found again by a completely new method. No, it's not on par with finding a new proof of the Pythagorean Theorem, but I'll take it.
Curiously, whenever I've mentioned this observation to anyone at Illumina I haven't found anyone familiar with it.
May your New Year be filled with the joy of discovery and rediscovery!