Thursday, March 31, 2022

The End of the Beginning of Human Genome Sequencing?

Today in Science a slew of papers have been published from the Telomere-to-Telomere (T2T) Consortium.  The flagship paper details the generation of a complete genome assembly from a complete hydatidiform mole (CHM) cell line which is telomere-to-telomere for all 22 autosomes plus X (assembly T2T-CHM13); the companion papers apply this groundbreaking assembly to a number of biological questions.  PacBio CSO Jonas Korlach and I chatted yesterday about the PacBio contribution to the flagship and two of the other papers, along with another T2T preprint on automated assembly and a related paper from Heng Li and colleagues that recently appeared in Nature Biotechnology.  I did not have advance access to the T2T paper but it had appeared in preprint form and Jonas assured me that no substantial information was added in the published version.

Jonas reminisced a bit about how his undergraduate career still had radioactive sequencing gels, as did mine.  The first human genome draft would be completed while he was in grad school and I at my post-Ph.D. job.  That draft required a large number of workers over several years on many continents and was a huge step forward in human genetics, but since its release its incompleteness and other shortcomings have become apparent.  The T2T preprint summarizes some of these -- even something as mundane as the choice of restriction enzyme used to create BAC libraries drove troublesome variation in BAC coverage.  It is remarkable that this new reference yields 238 megabases -- 8% of the human genome -- never before assembled.  That's 2 times a Caenorhabditis or Drosophila genome!  Or another important statistic: the new information includes the entire p-arms of five acrocentric chromosomes (13, 14, 15, 21 & 22) -- each of these is 12-15 megabases and so about a yeast genome.  Satellite arrays on chromosomes 1, 9 and 16 were previously megabase-long "here there be dragons" regions (it turns out that phrase can be spelled using only the letter N).

Until we can study these regions, we really don't know what the impact of assembling them will be.  But there are hints.  One hairy region of the genome with medical relevance is linked to facioscapulohumeral muscular dystrophy (FSHD) and contains the gene FSHD region gene 1 (FRG1).  GRCh38 had only 9 different paralogs of FRG1, which would seem complex enough -- but the T2T-CHM13 assembly has 23 paralogs of FRG1.

In terms of sequencing technology this is definitely a "kitchen sink" paper - 30X PacBio CCS, 120X ONT ultra-long read, 100X PCR-free Illumina, 70X Hi-C using Arima kits on Illumina and a bit more besides.  This allowed checking and cross-checking the assemblies, and Jonas noted there remain regions which really required the ONT ultra-long data to assemble.  Final consensus accuracy is estimated to be somewhere between phred 67 and phred 73, or around 1 error per 10 megabases.  This of course varies with HiFi coverage; some regions will be lower but conversely there are regions substantially higher.  Remaining HiFi issues include homopolymer runs as well as simple repeats consisting of only G and T -- more fodder for better basecallers and HiFi polishers.
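For anyone who wants to check the phred arithmetic, here is a minimal sketch in Python.  It assumes only the standard Phred definition (Q = -10 log10 of the per-base error probability); the function names are my own for illustration and don't come from the T2T papers or any particular toolkit.

```python
def phred_to_error_rate(q: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def bases_per_error(q: float) -> float:
    """Expected number of bases per single error at a given Phred quality."""
    return 10 ** (q / 10)


if __name__ == "__main__":
    # The consensus accuracy range quoted above, plus the midpoint.
    for q in (67, 70, 73):
        megabases = bases_per_error(q) / 1e6
        print(f"Q{q}: roughly 1 error per {megabases:.1f} Mb")
```

Under that definition the phred 67 to 73 range works out to roughly one error per 5 to 20 megabases, which brackets the "1 error per 10 megabases" figure at phred 70.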

In the main paper the assembly is semi-automated.  A string graph was generated automatically, but paths through it were then resolved in part by manual intervention.  That's where the second preprint and the Heng Li paper make advances -- these are nearly (preprint) and fully (paper) automated haplotype-resolved assemblies from diploid samples nearly as good as T2T-CHM13 - and they are male samples so we get that pesky Y chromosome.  In the Heng Li paper, only PacBio HiFi and Hi-C data are used.  So we really are at a stage where a single laboratory in a matter of weeks and for a few tens of thousands of dollars can generate a nearly flawless human genome assembly far superior to the carefully curated references available up to very recently.  To go from multiyear factories to this sort of fast, inexpensive high quality reference in just over two decades is a stunning technological revolution - and it has been only about 50 years since any DNA sequencing became possible.

For both ONT and HiFi data, there is notably less coverage near the ends of chromosomes, and the ONT reads there tend to just point outwards, which leads to a severe strand bias that fouls some variant caller logic.  ONT duplex chemistry might have some interesting effects on that issue.  There are also interesting region-specific biases in ONT read lengths in different satellite arrays, and a discrepancy between ONT (uniform) and HiFi (variable) in some satellite arrays.  The T2T-CHM13 assembly comes with a list of the least reliable regions -- which is only 0.3% of the assembly length versus 8% of GRCh38.

Jonas quickly sketched for me one of the other papers in the Science barrage; I hope to read them this weekend.  By mapping the 1000 Genomes Project data back to the new reference, the authors found over 100K new variants per sample.

Jonas compared T2T-CHM13 to having an entire movie, with prior references missing not just frames but entire scenes.  He also suggested that this represents a breakout from only reporting huge SNP tables enabled by short reads to routinely generating SNPs plus resolved structural variants.

When asked about the decisive battle of El Alamein, Churchill said "Now this is not the end.  It is not even the beginning of the end.  But it is, perhaps, the end of the beginning".  I feel similarly about the T2T consortium's work.  This is a giant milestone in human genome research, as we finally have a complete and very low error assembly of a human genome.  If we apply this technology to a range of human genomes from different geographies, what further will we find?  How dynamic are these new, complex regions across populations or even potentially over the lifespan of an individual?  Jonas talked of the idea of regularly snapshotting our genome over our lifetime -- whether this has medical value or not remains an area for investigation.  Today's deluge of papers represents a whole new map which will support asking entirely new questions in human genetics -- and until we ask and answer we won't know which were the most interesting to ask.

The special issue papers:

And a commentary by Deanna Church, one of the long-time curators of human reference genomes


Anonymous said...

Awesome progress, but let's not forget this is a homozygous cell line, so not sure I would claim we have finally sequenced "the" full T2T "human genome".

To be precise, this is what was actually sequenced:

"As with many prior reference genome improvement efforts (1, 8, 17–20), including the T2T assemblies of human chromosomes X (14) and 8 (21), we targeted a complete hydatidiform mole (CHM) for sequencing. Most CHM genomes arise from the loss of the maternal complement and duplication of the paternal complement postfertilization and are, therefore, homozygous with a 46,XX karyotype (22). Sequencing of CHM13 confirmed nearly uniform homozygosity, with the exception of a few thousand heterozygous variants and a megabase-scale heterozygous deletion within the rDNA array on chromosome 15 (23) (figs. S1 and S2)."


Unknown said...

I agree that the work from the T2T consortium represents a major milestone in human genome research. To me, one of the most interesting things you mentioned was that Jonas is interested in taking regular "snapshots" of the genome over the course of an individual's lifespan. I think this not only has significance for genomics research, but epigenetics research as well. I am particularly interested in understanding the patterns of histone modifications that are made over the course of an individual's life and how this relates to the aging process. This is a little unrelated to the research of the T2T consortium, but nonetheless I am still very excited about the progress that is being made and look forward to any future updates.