182 — End-to-end differentiable learning of protein structure
Mohammed AlQuraishi (10.1101/265231)
Read on 18 February 2018
When I was an undergraduate at Johns Hopkins, a few organic chemistry courses offered credit for playing Foldit, a game that lets you fold proteins in a 3D environment in an attempt to find their ideal configuration. It was pretty common to see a few dozen students playing the game around campus, even, if I remember correctly, in the originating class itself.
The problem is that protein folding is very hard. Despite our ability to sequence the gene that codes for a protein, and to trivially determine the peptide chain that comprises it, it’s still extremely difficult to figure out how the primary structure (the peptide chain) folds and reconfigures into its three-dimensional structure. And that 3D structure is what makes the protein do what it’s supposed to do.
The state-of-the-art algorithms for predicting protein structure are generally brute-force systems that just wobble proteins around until they find a low-energy configuration. (Two magnets held an inch apart are a high-energy configuration; two magnets touching each other “at rest” are a low-energy one.) This neither guarantees that we find the correct configuration (the endogenous form might not be in its lowest possible energy configuration), nor does it follow any meaningful protocol for picking a starting shape.
In short, it’s much like seeing an origami crane and then trying to replicate it by randomly folding a piece of paper of unknown size and shape until you suddenly arrive at that exact same crane.
Mohammed AlQuraishi writes in his blog that this strategy seems fundamentally flawed.
From when I first learned about protein folding, and the approaches taken to predict protein structure, I thought it may be possible to predict proteins without conformational sampling and energy minimization, the two pillars of protein structure prediction. The reasons for this have come to underlie what I call the linguistic hypothesis.
His hypothesis states that there is a grammar to how proteins fold, and if we can learn that grammar, then we can predict protein structure without all of this mucking around with energy minimization.
Learning that grammar, then, is the next challenge. Machine learning methods work by gradient descent over a differentiable function, so the first step AlQuraishi took was to design a representation of a protein’s folding configuration that a neural network can output and that can be converted into 3D coordinates without breaking differentiability. The resulting models are dubbed Recurrent Geometric Networks (RGNs).
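To make the idea concrete, here is a minimal sketch of that kind of pipeline in PyTorch: a recurrent network predicts torsion angles per residue, a geometric layer turns those angles into 3D coordinates, and a distance-based loss compares predicted and true structures, so every step stays differentiable. The names, feature sizes, and the simplified chain-extension step are my own illustrative assumptions, not the paper’s actual implementation (which extends the full N–Cα–C backbone and trains on dRMSD).

```python
import torch
import torch.nn as nn

class TinyRGN(nn.Module):
    """Minimal sketch of a recurrent geometric network: an RNN maps
    per-residue features to torsion angles, and a differentiable
    geometric layer extends the chain in 3D. Sizes are illustrative."""

    def __init__(self, n_features=41, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        # Predict three torsion angles per residue via (sin, cos) pairs.
        self.to_angles = nn.Linear(2 * hidden, 6)

    def forward(self, x):                         # x: (batch, seq, n_features)
        h, _ = self.rnn(x)                        # (batch, seq, 2*hidden)
        raw = self.to_angles(h).view(*h.shape[:2], 3, 2)
        angles = torch.atan2(raw[..., 0], raw[..., 1])   # (batch, seq, 3)
        return extend_chain(angles)

def extend_chain(angles, step_length=3.8):
    """Toy geometric layer: places one point per residue, using the predicted
    angles as spherical direction updates. A stand-in for the full backbone
    extension used in the paper; the key point is that it is differentiable."""
    batch, seq, _ = angles.shape
    coords = [angles.new_zeros(batch, 3)]
    for i in range(seq):
        phi, psi, _ = angles[:, i].unbind(-1)
        step = step_length * torch.stack(
            [torch.cos(phi) * torch.sin(psi),
             torch.sin(phi) * torch.sin(psi),
             torch.cos(psi)], dim=-1)
        coords.append(coords[-1] + step)
    return torch.stack(coords[1:], dim=1)         # (batch, seq, 3)

def drmsd(pred, true):
    """Distance-based RMSD: compares pairwise distance matrices, so the loss
    is invariant to rotation and translation, and differentiable throughout."""
    return (torch.cdist(pred, pred) - torch.cdist(true, true)).pow(2).mean().sqrt()
```

A training step is then just coords = model(features), loss = drmsd(coords, true_coords), loss.backward(): nothing in the loop requires conformational sampling or energy minimization.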
The result is that RGNs output highly accurate 3D structures several orders of magnitude faster than the energy-minimization pipelines, and with less information about the protein’s function and evolutionary history.