Jonas reminisced a bit about how his undergraduate career still had radioactive sequencing gels, as did mine. The first human genome draft would be completed while he was in grad school and I was at my post-Ph.D. job. That draft required a large number of workers over several years on multiple continents and was a huge step forward in human genetics, but since its release its incompleteness and other shortcomings have become apparent. The T2T preprint summarizes some of these -- even something as mundane as the choice of restriction enzyme used to create BAC libraries drove troublesome variation in BAC coverage. It is remarkable that this new reference yields 238 megabases -- 8% of the human genome -- never before assembled. That's twice a Caenorhabditis or Drosophila genome! Or another important statistic: the new information includes the entire p-arms of the five acrocentric chromosomes (13, 14, 15, 21 & 22) -- each of these is 12-15 megabases, so about a yeast genome. And the satellite arrays on chromosomes 1, 9 and 16 that were previously megabase-long stretches of "here there be dragons" (it turns out that phrase can be spelled using only the letter N) are now resolved.
Without being able to study these regions, we really don't know what the impact of assembling them will be. But there are hints. One hairy region of the genome with medical relevance is linked to facioscapulohumeral muscular dystrophy (FSHD) and contains the gene FSHD region gene 1 (FRG1). GRCh38 had only 9 different paralogs of FRG1, which would seem complex enough -- but the T2T-CHM13 assembly has 23 paralogs of FRG1.
In terms of sequencing technology this is definitely a "kitchen sink" paper: 30X PacBio CCS, 120X ONT ultra-long reads, 100X PCR-free Illumina, 70X Hi-C using Arima kits on Illumina, and a bit more besides. This allowed checking and cross-checking of the assemblies, and Jonas noted there remain regions which really required the ONT ultra-long data to assemble. Final consensus accuracy is estimated to be somewhere between phred 67 and phred 73, or around 1 error per 10 megabases. This of course varies with HiFi coverage; some regions will be lower but conversely some are substantially higher. Remaining HiFi issues include homopolymer runs as well as simple repeats consisting of only G and T -- more fodder for better basecallers and HiFi polishers.
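The phred-to-error-rate arithmetic behind those numbers is straightforward; a minimal sketch (the helper function name is my own, not from any of the papers):

```python
def phred_to_errors_per_mb(q):
    """Expected errors per megabase at phred quality q.

    Phred scale: Q = -10 * log10(p_error), so p_error = 10 ** (-Q / 10).
    Multiply the per-base error probability by 1e6 bases.
    """
    return 10 ** (-q / 10) * 1_000_000

# The quoted range: phred 67 is roughly 1 error per 5 Mb, phred 70 is
# exactly 1 per 10 Mb, and phred 73 is roughly 1 per 20 Mb.
for q in (67, 70, 73):
    print(f"Q{q}: {phred_to_errors_per_mb(q):.3f} errors/Mb")
```

So "around 1 error per 10 megabases" corresponds to the midpoint of the quoted range, phred 70.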
In the main paper the assembly is semi-automated: a string graph was generated automatically, but paths through it were resolved in part by manual intervention. That's where the second preprint and the Heng Li paper make advances -- these are nearly (preprint) and fully (paper) automated haplotype-resolved assemblies from diploid samples, nearly as good as T2T-CHM13 -- and they are male samples, so we get that pesky Y chromosome. In the Heng Li paper, only PacBio HiFi and Hi-C data are used. So we really are at a stage where a single laboratory, in a matter of weeks and for a few tens of thousands of dollars, can generate a nearly flawless human genome assembly far superior to the carefully curated references available until very recently. To go from multiyear factories to this sort of fast, inexpensive, high-quality reference in just over two decades is a stunning technological revolution -- and that's only about 50 years since any DNA sequencing was possible. For both ONT and HiFi data, there is notably less coverage near the ends of chromosomes, and the ONT reads tend to point outwards, which leads to a severe strand bias that fouls some variant caller logic. ONT duplex chemistry might have some interesting effects on that issue. There are also interesting region-specific biases in ONT read lengths in different satellite arrays, as well as a discrepancy between ONT (uniform) and HiFi (variable) coverage in some satellite arrays. The T2T-CHM13 assembly comes with a list of its least reliable regions -- only 0.3% of the assembly length, versus 8% of GRCh38.
Jonas quickly sketched for me one of the other papers in the Science barrage; I hope to read them this weekend. By mapping the 1000 Genomes Project data back to the new reference, the authors found over 100K new variants per sample.
Jonas compared T2T-CHM13 to having an entire movie, with prior references missing not just frames but entire scenes. He also suggested that this represents a breakout from only reporting the huge SNP tables enabled by short reads to routinely generating SNPs plus resolved structural variants.
When asked about the decisive battle of El Alamein, Churchill said "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." I feel similarly about the T2T consortium's work. This is a giant milestone in human genome research, as we finally have a complete and very-low-error assembly of a human genome. If we apply this technology to a range of human genomes from different geographies, what more will we find? How dynamic are these new, complex regions across populations, or even potentially over the lifespan of an individual? Jonas talked of the idea of regularly snapshotting our genomes over our lifetimes -- whether this would have medical value remains an area for investigation. Today's deluge of papers represents a whole new map which will support asking entirely new questions in human genetics -- and until we ask and answer them, we won't know which were the most interesting to ask.
The special issue papers:
- The complete sequence of a human genome
- Epigenetic patterns in a complete human genome
- From telomere to telomere: The transcriptional and epigenetic state of human repeat elements
- Complete genomic and epigenetic maps of human centromeres
- Segmental duplications and their variation in a complete human genome
- A complete reference genome improves analysis of human genetic variation
And a commentary by Deanna Church, one of the long-time curators of the human reference genome.