A number of large MinION sequencing projects have started showing up on preprint servers. Even better, many of these were run entirely with the current R9.4 chemistry and the authors conveniently provided tables of how much data each flowcell yielded. I've put these together into a common plot. The first five groups are different institutions (Birmingham, Norwich, Nottingham, British Columbia and Santa Cruz) from the effort to sequence the human cell line NA12878; X-shifts within each group represent different DNA preparation methods. Then comes the sequencing of a wild tomato (the image is not meant to represent the variety actually sequenced!). Finally, we have the Cliveome, the sequencing of Clive Brown by Clive Brown.
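For anyone who wants to assemble a similar plot, here is a minimal sketch; the input file and its column names are entirely hypothetical stand-ins for a hand-compiled table with one row per flowcell.

```python
# Minimal sketch of a grouped yield plot, assuming a hypothetical CSV
# ("r94_yields.csv") with columns: site, prep, yield_gb.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("r94_yields.csv")
# dodge=True produces the X-shifts for prep methods within each site
sns.stripplot(data=df, x="site", y="yield_gb", hue="prep", dodge=True)
plt.ylabel("Yield (Gb)")
plt.show()
```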
What I'm interested in here is the yield per flowcell. Clive has posted internal results getting 10Gb from the R9.4 chemistry, but the best achieved here was just over 7Gb. Note particularly that the yields are spread quite widely, even when runs were presumably made by the same personnel with the same starting DNA material. Some of the variability between the subcolumns is presumably library-construction related. For example, the three series (in order) for Clive are G-tube sheared DNA in the ligation prep, the rapid 1D prep, and G-tube DNA which has been size-selected on a BluePippin. The G-tube ligation prep appears to be less variable than the rapid 1D set. Size selection with the BluePippin may have lowered yield, though the N is small. However, the benefits of those runs in terms of longer reads may well have outweighed the loss; straight yield is not everything!
Clive just tweeted out an image showing a run getting to 16Gbp of data, apparently by tuning the running scripts within MinKNOW to eject excess leader sequences before they can plug pores.
Our latest MinION runs. 16G now, just a MinKNOW upgrade away. pic.twitter.com/bSe4lZs0jr — Clive G. Brown (@Clive_G_Brown) February 2, 2017
@pathogenomenick mostly kicking out bad leader complexes in the first second before they jam in. — Clive G. Brown (@Clive_G_Brown) February 2, 2017
Getting back to the variability, what causes it? One can imagine a number of contributors. For example, flowcells may have differing numbers of good pores, though my personal observation is that this variation is very modest. Contaminants carried in with the input DNA may have detrimental effects on the library preps. Air bubbles may be introduced during pipetting.
An interesting (well, maddening to have happen, but interesting in the abstract) artifact is the one illustrated below. The heatmap shows the yield from each pore in the array across a run, organized in the correct geometry (which took a few tries), with the sample port on the left side. Note that the left side is colder (blue, less data) than the right side (red, more data). When I posted a not-very-good image of the flowcell (I was actually asking about the bubble pattern, which was apparently typical for end-of-run), an eagle-eyed observer spotted that our loading beads were very heavily deposited near the loading port, burying that part of the flowcell. Properly loading the beads through the "SpotON" port is a delicate balance: dripping them in slowly enough not to introduce bubbles or generate shearing forces on the pores, but quickly enough that the beads distribute evenly across the flowcell. This run didn't thread that needle.
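For the curious, a heatmap along these lines can be sketched from a run's per-read summary. This assumes a sequencing_summary.txt-style table with channel and sequence_length_template columns, and a naive 16x32 arrangement of the 512 channels; the true channel-to-position mapping is more convoluted (hence the few tries mentioned above).

```python
# Sketch of a per-channel yield heatmap. The simple 16x32 layout below
# is an assumption, NOT the real MinION channel-to-position mapping.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

summary = pd.read_csv("sequencing_summary.txt", sep="\t")
per_channel = summary.groupby("channel")["sequence_length_template"].sum()

grid = np.zeros((16, 32))                 # 512 channels
for channel, bases in per_channel.items():
    idx = int(channel) - 1                # channels number from 1
    grid[idx % 16, idx // 16] = bases

plt.imshow(grid, cmap="coolwarm")         # blue = cold, red = hot
plt.colorbar(label="bases called")
plt.title("Per-channel yield (assuming sample port at left)")
plt.show()
```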
I would propose that someone clever develop a tool to apply different heuristics to nanopore results to inform troubleshooting. One heuristic would be to look for spatial patterns such as the one above. Another would be to identify regions of failure, particularly by comparing the platform QC (before sample is loaded) with early performance once sample is added; this could detect introduced bubbles.
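As a toy version of the first heuristic, reusing the grid array from the sketch above, one could simply compare the half of the array nearest the sample port against the far half; the 0.6 threshold is an arbitrary guess, not a calibrated value.

```python
# Toy port-proximity heuristic: flag runs where the port side of the
# flowcell badly underperforms the far side (threshold is a guess).
near_port = grid[:, :16].mean()
far_side = grid[:, 16:].mean()
ratio = near_port / far_side if far_side else float("nan")
if ratio < 0.6:
    print(f"Port-side yield only {ratio:.0%} of far side; check bead loading")
```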
Non-spatial tests would include the ratio of active pores to total predicted pores (a measure of library quality), as well as the decay of performance during a run. Presumably enough data could be assembled across many flowcells to model typical and abnormal decay behavior. The fraction of reads attributable to adapters alone might be another useful statistic for flagging library construction issues.
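Again purely as a sketch of what such statistics might look like, using the same hypothetical summary table as above; the 80bp adapter cutoff and the 480-channel platform QC count are made-up placeholder values.

```python
# Hypothetical non-spatial QC statistics from the summary table above;
# the adapter cutoff and channel count are invented for illustration.
ADAPTER_LIKE = 80                          # reads of roughly adapter size
predicted_channels = 480                   # good channels at platform QC

active = summary["channel"].nunique()
print(f"active/predicted channels: {active}/{predicted_channels} "
      f"({active / predicted_channels:.0%})")

adapter_frac = (summary["sequence_length_template"] <= ADAPTER_LIKE).mean()
print(f"adapter-sized reads: {adapter_frac:.1%}")

# crude performance-decay curve: bases called per hour of run time
hours = (summary["start_time"] // 3600).astype(int)
print(summary.groupby(hours)["sequence_length_template"].sum())
```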
Ideally, such a tool would be embedded in MinKNOW. If I had the time to write it, I'd name it Hermione, after the Harry Potter character with the sharpest mind and a keen knack for cracking mysteries. Alas, I don't have the time to really go after this. But I do think it would be valuable, particularly as the active MinION user base sometimes appears to consist primarily of a select group of veterans; too many others give up after encountering a steep and mysterious learning curve. The yield variability also wreaks havoc on long-term planning; nobody wants to play craps with how many flowcells they will need for a project.
Performance variability isn't unique to MinION; difficulties with under- or over-clustered Illumina flowcells are well known, as are varying levels of optical duplicates between runs. Particularly when trying to go after customers who are not sequencing professionals, having smart automated QC tools (but not anything resembling Clippy!) would help retain and attract customers. Tools that pinged summary statistics back to Oxford could also help the company identify the degree to which different types of issues are encountered by end users, enabling it to prioritize training, software or chemistry improvements.
After all, we can't just wave our pipettor and say Accio Yield or Evanesco Bubbles, at least not with any hope of success!
Just a note on the plot. Not all my Cliveome runs were run for the same time, some for 24 hrs or less, some for longer. Has the yield been normalised for that? Remember, nanopore runs don't have fixed run times; you need a true throughput measure as yield per unit time, not per flowcell/run.
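In other words, something like this back-of-envelope normalisation; the numbers here are invented purely to illustrate the point.

```python
# Compare runs on Gb per hour rather than raw Gb per flowcell;
# the yields and durations below are made-up illustration values.
runs = {"run_A": (7.1, 48.0), "run_B": (4.2, 20.0)}   # (Gb, hours)
for name, (gb, hours) in runs.items():
    print(f"{name}: {gb:.1f} Gb in {hours:.0f} h = {gb / hours:.2f} Gb/h")
```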
A new technology will have some variability in the product to start with; that is the first thing to be reduced. What we are left with then is variability in the sample, or in library preparation. The devices are runnable by anybody, on anything (contrast with centralised human genome sequencing), so we expect some variability. This can be further reduced by simplification and by automation (yes, also training and practice). Devices to simplify and automate are a priority for this year.
A favourite approach of mine is also to make the throughput so large that even a bad library or difficult sample still yields several gigabases - and for many of the applications of MinION that is sufficient, i.e. bring up the tail.
Since the information on which pore generated each read is stored in the fast5 format, a spatial analysis doesn't have to be performed "live". A script could look at the number of reads per pore, average quality per pore, and pore performance over time. I guess I can write that. Could you share the spatial organisation of the channels?
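Something like this might be a starting point, assuming the single-read fast5 layout of this era, where a UniqueGlobalKey/channel_id group carries a channel_number attribute.

```python
# Rough sketch of counting reads per channel from single-read fast5
# files; assumes the UniqueGlobalKey/channel_id group with a
# channel_number attribute, as written by MinKNOW at this time.
import glob
from collections import Counter

import h5py

reads_per_channel = Counter()
for path in glob.glob("reads/*.fast5"):
    with h5py.File(path, "r") as f5:
        raw = f5["UniqueGlobalKey/channel_id"].attrs["channel_number"]
        channel = int(raw.decode() if isinstance(raw, bytes) else raw)
        reads_per_channel[channel] += 1

for channel, n in sorted(reads_per_channel.items()):
    print(channel, n)
```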
(Oh, and if I'm not mistaken, your heatmap displays the yield from each channel, rather than per pore.)