Scientists in many fields think carefully about how their instruments are calibrated, a problem which is particularly acute if it will be impossible for a human to ever visit the instrument during the expected lifetime of a project. Below is a photo of a calibration target for a future NASA Mars mission, which tests a number of aspects of both the imaging gear and an X-ray spectrometer -- and includes an actual Lincoln penny as a size calibration for the viewing public! Many different aspects of imaging were considered in laying out the target, such as color, resolution and even shadows and lighting effects.
Most sequencing test materials have been selected on the basis of convenience: materials we have lying around. These often have deficiencies. For example, when I was trying to run radioactive Sanger as an undergraduate during the summer of 1989, the positive control was M13. That's nearly guaranteed to sequence, but since I was trying to sequence something of elevated G+C content (though today I would laugh at calling high 50s such), it didn't help me any when my control worked and my sample failed.
For high throughput sequencing, there are many steps in the process that an ideal test material would address, and additionally massively parallel approaches offer the ability to use a control in a spike-in manner. The E.coli phage phiX was famously used by Illumina; the MinION control is a restriction fragment of lambda phage, MinION burn-in uses a full lambda phage, and the MARC group used E.coli. Again, these are convenient and inexpensive, like M13, but are they technically ideal?
I'd argue they are far from it, particularly given the characteristics of the MinION's base calling scheme. The original base caller, as described in the MARC paper, works on mapping 5-mers to the signals using a Hidden Markov Model (HMM). Newer versions of the base caller are using an HMM with 6-mers. So, I would state that a proper calibration target should contain every 6-mer.
How do some of the common samples stack up? PhiX (5,386bp) lacks only 12 5-mers, but would be an infernal standard now given it lacks 666 6-mers. The 3,560bp MinION control fragment is missing only 5-mers, but 998 6-mers. Lambda itself (48,502bp) has all 5-mers, but lacks 14 6-mers. The E.coli sequence used by the MARC consortium has all possible 5-mers and 6-mers. E.coli is even future-proofed against a 7-mer based caller, but is missing 52 of the possible 8-mers. Of course, for any of these the kmer frequencies are far from uniform. For example, the rarest 6-mer in E.coli (CCTAGG) appears 32 times, the most abundant (CTGGCG) 10,664 times.
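Tallies like these are easy to reproduce. The following is a minimal Python sketch, not the script used for the numbers above; the function name `missing_kmers` and the `both_strands` switch are my own illustrative choices. Whether the reverse strand is counted, and whether a circular genome is wrapped around, will change the totals.

```python
from itertools import product

def missing_kmers(seq, k, both_strands=False):
    """Return the sorted list of k-mers over ACGT that never occur in seq.

    Assumption: seq is treated as linear; k-mers spanning the junction of a
    circular genome are not counted.
    """
    seq = seq.upper()
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if both_strands:
        # Also collect k-mers from the reverse complement strand
        comp = str.maketrans("ACGT", "TGCA")
        rc = seq.translate(comp)[::-1]
        seen |= {rc[i:i + k] for i in range(len(rc) - k + 1)}
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in seen)
```

Calling `len(missing_kmers(genome, 6))` on a sequence loaded from a FASTA file gives the number of absent 6-mers for that sequence.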
However, one must match a control to the throughput of an instrument. A spike-in of a small amount of a short fragment will still be sampled many times; given the current throughput of an Oxford flowcell, a spike-in of E.coli isn't going to be. So small is better for spike-in.
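The arithmetic behind that trade-off can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: the function name and parameters are illustrative, and it ignores read-length effects and any bias in how the spike-in is sampled.

```python
def expected_coverage(run_yield_bases, spike_fraction, target_length):
    """Approximate mean fold-coverage of a spike-in control.

    run_yield_bases: total bases produced by the run
    spike_fraction:  fraction of those bases that come from the control
    target_length:   length of the control sequence in bases
    """
    return run_yield_bases * spike_fraction / target_length
```

With purely hypothetical numbers, a 100 Mb run with a 1% spike-in would cover a 3,560 bp fragment roughly 280-fold, while a 4.6 Mb genome spiked in at the same level would get only about 0.2-fold coverage.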
Now, the ideal standard would contain every kmer exactly once, a concept known as a de Bruijn sequence. While Dr. de Bruijn's lifetime overlapped mine, the concept is actually very ancient: the Sanskrit word yamātārājabhānasalagam is a mnemonic for drummers in which every possible triplet of long and short is present. Due to the double-stranded, reverse-complementary nature of DNA, computing de Bruijn sequences for DNA has an added twist, but there is a nice program called shortcake which will compute them. A de Bruijn sequence of order 5 (5-mers) is only 512 bases long, only 2140 for order 6 and 8192 for order 7. Order 8 would be nice, but that's a bit over 33Kb, which might be a bit expensive to synthesize. Order 9 is 131,072, which would probably be best to break into multiple pieces, should someone invest in making it, as that is getting unwieldy to harvest without breaking.
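The rough sizes follow from counting kmers when a kmer and its reverse complement are treated as the same thing. Below is a small Python sketch of that count, plus a checker for whether a candidate sequence really covers every kmer on one strand or the other. It is not how shortcake works; the function names are mine, and the count is only a lower bound on what any given program will actually produce.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMP)[::-1]

def canonical_kmer_count(k):
    """Number of distinct k-mers when a k-mer and its reverse complement count as one."""
    total = 4 ** k
    # Only even-length k-mers can be their own reverse complement;
    # the first half of such a k-mer determines the second half.
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total + palindromes) // 2

def covers_all_kmers(seq, k):
    """True if every k-mer over ACGT occurs in seq or in its reverse complement."""
    seq = seq.upper()
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            seen.add(kmer)
            seen.add(revcomp(kmer))
    return len(seen) == 4 ** k
```

For k = 5 and k = 7 the count is 512 and 8192. For even k the self-complementary kmers add a little on top of half of 4^k, and a linear (rather than circular) construct needs at least k - 1 extra bases to finish the last kmer.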
Now, even this standard wouldn't be perfect for just base calling, as there are special cases that would ideally be included. One could imagine a whole set of homopolymers of different lengths to complement the de Bruijn set. Some of those might be a bit fun to have made; synthesis companies are very averse to long runs of Gs, since they may form G4 tetrad structures. Curiously, one vendor balked at long runs of G but not of C, which never made sense to me (if it really isn't a problem, then reverse complement my design!). Similarly, series of di-, tri- and tetranucleotide repeats of known length would enable calibrating calling the lengths of these, which are still used frequently as genetic markers (and can be important functionally as well).
Designing a test article set for methylation would get even trickier. A complete set of kmers with methylation would probably be prohibitive, especially given the alphabet expansion represented by different types of methyl-C. Still, having sets for some common methylations, such as CpG, would be valuable. There is a strong hint that some of the calling errors in MinION data are really issues with methylated kmers not fitting the models well. For example, I did some preliminary work on some earlier public E.coli data (from the Loman lab) which suggested that errors are more frequent near GATC tetramers; this is the site targeted by the Dam methylase. Ideally one would have distinguishable DNAs with GATCs from both Dam+ and Dam- hosts -- or synthetic constructs from a Dam- host that could be treated with Dam.
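One simple way to look for this kind of effect is to compare the error rate at reference positions near a motif with the rate everywhere else. The Python sketch below illustrates that idea; it is not the analysis I actually ran. The function names, the window size, and the assumption that errors arrive as a list of 0-based reference positions are all hypothetical.

```python
import re

def near_motif_mask(ref, motif="GATC", window=5):
    """List of booleans: True where a position lies within `window` bases of a motif hit."""
    ref = ref.upper()
    mask = [False] * len(ref)
    # A lookahead pattern finds overlapping occurrences as well
    for m in re.finditer(f"(?={re.escape(motif)})", ref):
        start = max(0, m.start() - window)
        end = min(len(ref), m.start() + len(motif) + window)
        for i in range(start, end):
            mask[i] = True
    return mask

def error_enrichment(ref, error_positions, motif="GATC", window=5):
    """Ratio of per-base error rate near the motif to the rate elsewhere (None if undefined)."""
    mask = near_motif_mask(ref, motif, window)
    near_sites = sum(mask)
    far_sites = len(mask) - near_sites
    near_err = sum(1 for p in error_positions if mask[p])
    far_err = len(error_positions) - near_err
    if near_sites == 0 or far_sites == 0 or far_err == 0:
        return None
    return (near_err / near_sites) / (far_err / far_sites)
```

Because GATC is its own reverse complement, scanning one strand finds every site. A ratio well above 1 would be consistent with errors clustering around the motif, though it says nothing on its own about whether methylation is the cause.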
While many of the properties I have described above are perhaps a bit specific to MinION, others would translate to other platforms. For example, a good methylation calibration set would be valuable for PacBio, which has differing sensitivities to differing nucleotide methylation types. A homopolymer calibration set would be useful for every platform.
Still, on other platforms one would want test articles customized for the issues on that platform. For example, the MIRA manual, which is a great source on reproducible problems in sequencing technologies, lists some motifs that can be problems with Illumina's chemistry. So an Illumina test article should probably have extra sets of these motifs.
Some issues probably can't be practically addressed with designed test articles. For example, the various stages of PCR in some platforms can run into issues on DNA that is either very GC-rich or very GC-poor. Using natural DNAs with these properties is probably best for testing out sequencers (or also amplification kits), though they wouldn't do for spike-in controls. But a word to the wise: pick truly skewed genomes. I snicker every time I see Bordetella pertussis, with a mean around 65%, used as a "high G+C" control. If I see 65% G+C contigs in a Streptomyces assembly, I start BLASTing them to see what contaminant has crept in; Streptomyces have a mean around 73%.
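That contaminant check is simple to automate. Here is a minimal Python sketch, assuming contigs are already loaded into a dictionary of name to sequence; the function names and the default tolerance are illustrative choices, not a recommended threshold.

```python
def gc_fraction(seq):
    """Fraction of G+C among the unambiguous (A, C, G, T) bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def flag_odd_gc(contigs, expected, tolerance=0.05):
    """Names of contigs whose G+C fraction differs from `expected` by more than `tolerance`."""
    return [name for name, seq in contigs.items()
            if abs(gc_fraction(seq) - expected) > tolerance]
```

For a Streptomyces assembly one might call `flag_odd_gc(contigs, expected=0.73)` and then BLAST whatever comes back.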
Ideally, every stage of the process would be covered by the same test article, but that won't always be practical. For example, short test articles mean high sampling even if spiked-in at low levels, which is great for informatics -- but a short calibration standard won't be able to test DNA shearing if the desired shear size is close to (or likely much greater than) the size of the test construct. My proposed de Bruijn sequence construct would test every kmer, but not in every possible larger context. If secondary structure plays a role, then trying to mimic every secondary structure will be impractical.
In summary, better standards are both possible and highly valuable. Perfect calibration articles are not possible, but I believe that we can design DNA sequences which would put instruments through much more comprehensive tests than standards which have been used to date. With a little bit of investment, compact standard materials could be made widely available.