Wednesday, November 01, 2017

AlphaGo & Biology

A reader left a comment on an earlier piece suggesting I discuss the recent AlphaGo paper and the possible applicability of this approach to the biomedical sciences.  I'm not sure I have anything terribly original to say, but who can refuse a request?
AlphaGo is a program developed by DeepMind, a group at Google, to play the classic board game Go.  Like chess or checkers, Go has no elements of luck.  Players take turns placing stones on a board to claim territory, with the rules making some territorial gains permanent while others can be erased by the other player.  That's probably not a very good description, as I played only once in college and we didn't finish the game.

AlphaGo has now succeeded in devastating the best human Go players, not only beating them but utterly confusing them with the program's moves.  There's a great story about Deep Blue flummoxing Garry Kasparov in their match by pulling a move at random because it had reached a deadlock, but that was a fluke, whereas AlphaGo apparently executes move sequences that look nothing like established play.

How'd they do this?  Simply by pitting AlphaGo against itself.  AlphaGo was given the rules and objectives of Go but not given any strategies and instead had to discover them itself.  That's clearly a tantalizing possibility to apply to other hard problem spaces.  
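
To make the self-play idea concrete, here is a minimal sketch on a far smaller game than Go.  Everything in it is my own assumption for illustration: the toy game (a pile of stones, each player removes one to three, whoever takes the last stone wins), the lookup table, and the names train_by_self_play and best_move.  It is a simple tabular learner, not AlphaGo's actual method, which pairs deep neural networks with tree search.  The only point is that the program receives the rules and a win condition, then plays both sides until a strategy emerges.

```python
import random

MAX_TAKE = 3  # toy rule (an assumption): remove 1-3 stones, taking the last stone wins


def legal_moves(stones):
    """Moves allowed by the rules for a pile of this size."""
    return range(1, min(MAX_TAKE, stones) + 1)


def best_move(q, stones):
    """Greedy move under the current value table (unseen entries count as 0)."""
    return max(legal_moves(stones), key=lambda m: q.get((stones, m), 0.0))


def train_by_self_play(episodes=20000, max_stones=12, alpha=0.5, epsilon=0.3, seed=0):
    """Learn move values purely from games the program plays against itself."""
    rng = random.Random(seed)
    q = {}  # (stones, move) -> estimated value for the player making the move
    for _ in range(episodes):
        stones = rng.randint(1, max_stones)  # random starting pile
        while stones > 0:
            # The same table chooses moves for both sides: that is the self-play.
            if rng.random() < epsilon:
                move = rng.choice(list(legal_moves(stones)))  # explore
            else:
                move = best_move(q, stones)  # exploit what has been learned
            remaining = stones - move
            if remaining == 0:
                target = 1.0  # the mover took the last stone and wins
            else:
                # Zero-sum game: the opponent's best outcome is the mover's worst.
                target = -max(q.get((remaining, m), 0.0) for m in legal_moves(remaining))
            old = q.get((stones, move), 0.0)
            q[(stones, move)] = old + alpha * (target - old)
            stones = remaining
    return q
```

Calling train_by_self_play() and then best_move(q, 5) lets you check whether the table has rediscovered the known trick for this toy game, which is to leave your opponent a multiple of four stones.  Nobody told it that; it only ever saw wins and losses.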

The catch of course is that Go has a definite criterion for victory which can be assessed.  There's no chance that a winning board isn't a winning board.  On the other hand, some press coverage suggested that protein folding might be a next target for the team.  That's an interesting possibility, but knowing if your structure prediction is correct isn't quite so easy.  Of course, that partly depends on what information you allow the computer to access.  Go is a simple rectilinear graph with three possible states for each node.  So the evidence set AlphaGo played with was relatively simple.  With folding a sequence, do you allow the program to access multiple alignments?  Homologous structures?  If you leave out multiple alignments, is it then kosher to check the computer's work using residue co-evolution?  

So perhaps AlphaGo should be first aimed at something more easily assessed than a correct three-dimensional structure.  For example, there is the problem of optimizing a strain for production of a given metabolite.  A common test system is producing lycopene in E. coli, since the product (it's the pigment that makes tomatoes red) is easy to assess.  Given all the genes in E. coli plus the lycopene construct, what is the desired level of expression of every gene (with a deletion being null expression)?  There are numerous attempts at this in the literature, but an AlphaStrain designer could be very interesting and would also be testable. A related problem is the medium optimization problem: to produce a given product in a given strain, what media conditions are best?  This includes which media components to use and at what concentrations. Or perhaps classifying variants of unknown significance for a gene such as BRCA1, which has high-throughput variant phenotyping to train and test against.
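
Here is a rough sketch of how the strain problem could be framed so that a computer can keep score.  All of it is assumed for illustration: the three gene names, the four allowed expression levels, and especially toy_titer, which is a made-up formula standing in for a real lycopene measurement.  The search is a plain hill climb, nothing like AlphaGo, and is only meant to show the shape of the problem: a design is one expression level per gene, and every candidate design gets a number back.

```python
import random

LEVELS = [0.0, 0.5, 1.0, 2.0]  # hypothetical relative expression; 0.0 stands for a deletion


def toy_titer(design):
    """Made-up stand-in for a lab assay; a real score would come from measuring product."""
    hidden_optimum = {"geneA": 2.0, "geneB": 0.0, "geneC": 1.0}  # invented for this sketch
    return -sum((design[g] - level) ** 2 for g, level in hidden_optimum.items())


def hill_climb(score, genes, steps=200, seed=0):
    """Change one gene's level at a time and keep the change only if the score improves."""
    rng = random.Random(seed)
    design = {g: 1.0 for g in genes}  # start with every gene at its baseline level
    best = score(design)
    for _ in range(steps):
        candidate = dict(design)
        candidate[rng.choice(genes)] = rng.choice(LEVELS)
        candidate_score = score(candidate)
        if candidate_score > best:
            design, best = candidate, candidate_score
    return design, best


design, best = hill_climb(toy_titer, ["geneA", "geneB", "geneC"])
print(design, best)
```

In the toy version each call to the scoring function is free.  In the real version each call is an experiment, and that cost is the whole difficulty.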

To me, a testable aspect is critical.  IBM's Watson has generated a lot of ads and headlines, but I believe it still has zero published results.  Most notoriously, M.D. Anderson Cancer Center bailed out of a collaboration with Watson for cancer diagnostics.  It's an open question whether we really know anything about Watson's abilities, other than winning a TV game show, without some sort of blinded comparison to human experts and a clear means to adjudicate whether computer or human did better.

There are of course many other problems in biomedicine which we wish would be solved.  I'd love to see an AlphaGo-like approach generate insights into Alzheimer's, but realistically how would you configure the problem?  What does "winning" look like in any way that the computer could play itself many times to learn a strategy?  My strain example can be assessed, but it sure isn't as fast and easy as staring at a Go board. 

That is one of the fundamental challenges of moving AI into the real world.  Go is a nice problem, but not terribly important.  Real problems are much messier.  Image recognition software gets better and better, but still has a ways to go.  The iPhone environment tries to classify images, but on my phone only a small fraction of dog photos are tagged with "dog", or panda photos with "panda".  Or there are those collections online of sheepdog vs. mop or chihuahua vs. muffin confusion sets.  Or a recent report that tweaking a single pixel could fool some image recognition systems.

Deep learning will likely have huge impacts on how biomedical research is performed.  I won't be surprised to see great things, but those expectations must be tempered with the knowledge that the real world is much messier and more ambiguous than black-and-white stones on a rectangular grid.


Jonathan Badger said...

Although it is important to realize that Deep Learning is really just a more sophisticated artificial neural net method made possible by modern amounts of computing power. Machine learning techniques like neural nets, Hidden Markov Models, and the like have had applications in biology for a long time (gene finding being the most obvious one). Pretty much any problem where you have a set of observations that are mapped to labels (coding vs non-coding, healthy vs diseased, etc) can be used in deep learning to produce labels from unlabeled data.

AKatawazi said...

We set up a BWA, Picard, GATK bioinformatics pipeline with good sensitivity but bad specificity. The hope was to use TensorFlow, which is the foundational software used in AlphaGo, to boost specificity by training our model on GIAB data. The problem we are encountering is that we just don't have enough super accurate sequence data. Normal LSTMs require huge training sets to produce good results. I definitely think protein folding is a possibility once there is a good physical model of how the folding process actually works. There is also another company called Deep Genomics that has a program called Spider, I think, that gives you the probability of a SNP occurring, which looks interesting as well. I think this is definitely the wave of the future for life sciences companies, especially if a company has a lot of good training data to teach an LSTM.

Unknown said...

Dr. Robison,
it is a great honor that one of the world's foremost molecular geneticists would respond to a comment that I posted.

Here's the URL for the story. AlphaGo used human play as a training set; AlphaGo Zero used self-play to master the game.

It is quite true that life is much more ambiguous than a board game.
Who can be declared the winner in life? Not easy to know. One challenge that could be reduced to a codeable objective function is GWAS research. People have or do not have a trait/illness. There can be questions of penetrance, among others, though opening up the UKBB or another DNA bank to a reinforcement learning AI might result in very significant discoveries. An abundance of genetic data now exists. What might AI be able to find for Alzheimer's?

Anonymous said...

To J Ir, there's some additional complexity you may not have considered in biology.

AlphaGo's objective was to learn how to win. It doesn't matter how; in fact, because of how complex its model is, we can only explain how after the fact. So it would be far easier for an AlphaGo-like AI to figure out how to give you Alzheimer's than to figure out the many subtle ways in which you can acquire it. The goal isn't to achieve an Alzheimer's state similar to a winning state; it's to describe all possible paths to Alzheimer's. AlphaGo was not made to describe paths to victory, only to find them given a board configuration.