Monday, February 02, 2026

Non-coding DNA's Alpha Moment

I had been meaning to read the AlphaGenome paper on non-coding variant effect prediction from Google DeepMind, which recently showed up in Nature, but had found excuses not to dive in.  Then DeciBio’s Stephane Budel posted on LinkedIn with some incisive comments, and few things spur me to action better than my sense of competition! So here's a quick, incomplete & imperfect take on this giant paper.



A Quick Overview of AlphaGenome & Some of Its Limitations

AlphaGenome is a machine learning model trained on a huge corpus of genomic sequence and functional genomics data from human and mouse.  Given an input sequence, it predicts an enormous number of genomic tracks (5,930 for human; 1,128 for mouse), covering a wide variety of transcriptomic and chromatin accessibility assays as well as Hi-C style contact maps.  AlphaGenome uses a context window of an entire megabase and predicts many features at single base resolution - though ChIP-Seq tracks (e.g. H3K27ac) are “only” at 128 bp resolution and contact maps at 2048 bp resolution.
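
To make the shape of the model's input and output concrete, here's a minimal sketch in Python of ref-versus-alt variant scoring with an AlphaGenome-style model.  Everything here is illustrative rather than the actual DeepMind API: `model` is a stand-in callable, the track names are invented, and I've assumed the "megabase" context is 2**20 bp so the 128 bp and 2048 bp bins divide evenly.

```python
import numpy as np

CONTEXT = 2**20  # "a megabase" of context; assumed 2**20 bp (hypothetical)

def score_snv(model, ref_window: str, pos: int, alt: str) -> dict[str, float]:
    """Score a SNV as the summed absolute change in each predicted track.

    `model` stands in for an AlphaGenome-style predictor: it takes a
    CONTEXT-length sequence and returns {track_name: np.ndarray}, e.g.
    RNA-Seq at 1 bp resolution, ChIP-Seq in 128 bp bins (8192 bins),
    contact maps in 2048 bp bins (512 x 512).
    """
    assert len(ref_window) == CONTEXT
    alt_window = ref_window[:pos] + alt + ref_window[pos + 1:]
    ref_pred = model(ref_window)
    alt_pred = model(alt_window)
    return {
        track: float(np.abs(alt_pred[track] - ref_pred[track]).sum())
        for track in ref_pred
    }
```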


There are definitely interesting areas out-of-scope for AlphaGenome.  It does dismally on coding variant effect prediction.  It appears not to have been trained on methylation data - and I would guess there is nowhere near enough hydroxymethylation data available to perform training.  RNA-Seq alignments in the training set were required to be 50 bases or more, so training should have been blind to microRNAs.


What would be some interesting tough cases to throw at it?  Since it doesn’t know about methylation, perhaps X-inactivation - and in particular predicting which RNAs on the inactive X are still transcribed.


It’s also predicting individual exons, not patterns of coordinated splicing à la the one I helped discover in Drosophila retinas.


It’s also important not to oversell AlphaGenome’s abilities on splicing.  Basically it matches - plus or minus a small margin - the performance of several existing splicing-specific models, and those all have an ROC area-under-the-curve of just over 0.5 - so there’s still great need for further improvement.


Similarly, the tissue-specific predictions of expression and splicing deserve great scrutiny - especially since single cell RNA-Seq seems to keep discovering new cell types.


And as Stephane points out, epitranscriptomics is a rapidly expanding field but not something AlphaGenome touched.  A serious challenge here is the lack of large RNA modification datasets for cell atlases.  We’re seeing bigger and bigger cell atlases - Illumina just announced a 1 million cell effort - but these have been almost universally generated with short reads, and I’m unaware of any such effort using Oxford Nanopore’s Direct RNA sequencing.  Indeed, nobody has yet built (to my knowledge) a single cell Direct RNA scheme, so the full range of RNA modifications has no database to train on.  Alida Biosciences has a scheme for reading m6A and inosine from bulk RNA, but again single cell doesn’t yet seem to be an option.


The authors also note that the model isn’t given any kind of conservation scores, which might be useful - though another preprint I just saw noted that many human regulatory elements are undergoing rapid evolution.  But that sort of “sometimes it is, sometimes it isn’t” nuance seems to be exactly what these large models can parse with remarkable results.


Conservation points to another area where the authors acknowledge shortcomings - right now it’s just human and mouse, because that’s where the data is.  Wouldn’t it be interesting to generate the data to extend the model to other pharmaceutically important mammals such as dog, pig and rat?


How Might AlphaGenome Change Rare Disease Genomics?

Last year I sent myself to ESHG in beautiful Milan and picked my sessions to focus almost purely on rare disease genetics.  While it's hard to get a precise number, a significant fraction of attempts to solve rare diseases with even the highest quality long read genomes still fail to zero in on a causative variant - and the difficulty of assessing non-coding variants is a strong candidate for the root cause of most such disappointments.


Many talks mentioned cases in which existing splicing predictors, such as SpliceAI, predicted no effect from variants that actually had pathogenic effects on splicing.  I won’t be surprised if this year’s confab in Gothenburg has many mentions of success or failure of AlphaGenome on such mutations - likely contrasted with predictions from SpliceAI.
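
As an aside for anyone assembling such a comparison set: SpliceAI's VCF annotation format is well documented, so pulling out the variants it scores as benign - the interesting ones to re-test - is a few lines of Python.  The sketch below uses SpliceAI's published INFO format and its 0.2 high-recall cutoff; the gene name in the example is invented.

```python
def max_spliceai_delta(info: str) -> float | None:
    """Return the maximum SpliceAI delta score from a VCF INFO string.

    SpliceAI annotations take the documented form
      SpliceAI=ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL
    where the four DS_* fields are delta scores in [0, 1].
    """
    for field in info.split(";"):
        if field.startswith("SpliceAI="):
            scores = []
            # Multi-allelic records carry comma-separated annotations.
            for ann in field[len("SpliceAI="):].split(","):
                parts = ann.split("|")
                scores.extend(float(x) for x in parts[2:6])
            return max(scores)
    return None

# Example: a pathogenic splice variant that SpliceAI waves through.
info = "AC=1;SpliceAI=T|MYGENE|0.01|0.00|0.03|0.02|-2|35|-7|12"
delta = max_spliceai_delta(info)
if delta is not None and delta < 0.2:  # below even the high-recall cutoff
    print(f"SpliceAI-benign (max delta {delta}); candidate for AlphaGenome")
```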


It also wouldn’t surprise me if some of the clinical groups that have accumulated large numbers of solved clinical genomes will pass these back through AlphaGenome - it’s only about a second of computation per variant on an NVIDIA H100 - to see what happens.  There is a risk that AlphaGenome doesn’t so much solve the problem of noncoding variant prediction in a clinical context as shift it from “we don’t know anything about these variants” to “we now have a huge number of predictions to try to validate”.  Such validation might then run into the wall so many clinical projects do - that the appropriate tissue is not accessible.  On the other hand, if AlphaGenome can be leveraged to better interpret the weak expression signals coming from accessible tissues such as cheek epithelium or circulating lymphocytes, then it would make a huge impact.
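
To make that triage problem concrete, here's a hypothetical prioritization pass: given per-variant effect scores across thousands of tracks, restrict attention to tracks from tissues a clinic can actually sample, then rank.  The tissue prefixes and track names are invented for illustration; real track metadata would come from the model's output schema.

```python
ACCESSIBLE = ("whole_blood", "lymphocyte", "buccal")  # hypothetical prefixes

def prioritize(scores_by_variant: dict[str, dict[str, float]]):
    """Rank variants by their largest predicted effect in an accessible tissue.

    scores_by_variant maps variant IDs to {track_name: effect_score},
    e.g. the output of score_snv() above, one dict per variant.
    """
    ranked = []
    for variant, tracks in scores_by_variant.items():
        accessible = {t: s for t, s in tracks.items() if t.startswith(ACCESSIBLE)}
        if accessible:
            best = max(accessible, key=accessible.get)
            ranked.append((accessible[best], best, variant))
    return sorted(ranked, reverse=True)  # biggest predicted effects first
```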


Similarly, it would be interesting if groups that have both clinical genomes and transcriptomes from the same patients would systematically compare them against AlphaGenome predictions to detect recurring anomalies.  I’ve already mentioned the probable lack of understanding of X-inactivation, which would also extend to its inappropriate manifestations - at ESHG one talk described a translocation that spread X-inactivation into an autosomal region.  Come to think of it, nasty rearrangements would be an interesting specific subset to compare against AlphaGenome, as these (again, more ESHG talks) can sometimes break or fuse the boundaries that define chromatin state, causing inappropriate activation or silencing of genes.


Back to DeciBio's Take

In his LinkedIn piece, Stephane made the claim that the rapid pace of advancement in this field, exemplified by AlphaGenome, is devastating the whole idea of 5 year plans, and that 2 years out is as far as one can plan.  I’d agree - and any 2 year plan must build in as much flexibility and ability to pivot as feasible.  Stephane pointed particularly to companies that have built businesses around variant catalogs and variant effect prediction as ones that must adapt rapidly or face existential crises.  If a competitor can draw on public data and build a model equal or superior in effect prediction to your carefully crafted database, do you have a moat at all?


He also noted that one part of the variant effect ecosystem will still be glacial: the payers and regulators.


Stephane also touched on a favorite topic of mine (hey! Shills gotta shill!) - evidence generation.  A couple of quotes worth stealing:


Shift from “train on data” to “simulate biology, then test reality”


And

Build the evidence flywheel (a.k.a. Industrialize truth)


If we’re really going to do this - build all the evidence we need to expand these models and wring them out utterly so we can identify causative variants far better than we can now - then focused experimentation is needed.  That calls for high throughput experiments, but a wide variety of them - exactly the remit of autonomous laboratories.  Such labs might carpet-bomb specific loci with variants and read out the effects - I just saw a preprint on Myc.  Others might try something like replicating all variants in ClinVar but with consistent, high resolution readouts - imagine if we had RNA-Seq in an appropriate cell line for every non-coding ClinVar variant!  Or something else that minds cleverer than mine dream up - but ideally the dreaming is decoupled from the equipment & skill requirements for execution, so the best minds can drive the best experiments.
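
As a sense of scale for the "carpet bombing": exhaustive single-nucleotide mutagenesis of a regulatory element is small by MPRA standards.  A sketch of the enumeration:

```python
def saturate_snvs(ref_seq: str, start: int):
    """Yield (genomic_pos, ref, alt) for every possible SNV across ref_seq.

    start is the genomic coordinate of ref_seq[0].  A 500 bp enhancer
    yields 3 x 500 = 1,500 variants - one modest oligo synthesis order.
    """
    for offset, ref_base in enumerate(ref_seq.upper()):
        for alt in "ACGT":
            if alt != ref_base:
                yield (start + offset, ref_base, alt)

# Example: 1,500 constructs fully saturate a 500 bp window.
assert len(list(saturate_snvs("A" * 500, 1_000_000))) == 1500
```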


The AlphaGenome paper is huge and rich; the Supplementary Information is a monster.  I won’t pretend the above is more than scratching the surface.  There have been machine learning models of genomes before, but this one does feel like a true watershed in the field.

