DeepMind and several research partners have released a database containing the 3D structures of nearly every protein in the human body, as computationally determined by the breakthrough protein folding system demonstrated last year, AlphaFold. The freely available database represents an enormous advance and convenience for scientists across hundreds of disciplines and domains. It may very well form the foundation of a new phase in biology and medicine.
The AlphaFold Protein Structure Database is a collaboration between DeepMind, the European Bioinformatics Institute, and others. It consists of hundreds of thousands of protein sequences with their structures predicted by AlphaFold — and the plan is to add millions more to create a “protein almanac of the world.”
“We believe that this work represents the most significant contribution AI has made to advancing the state of scientific knowledge to date and is a great example of the kind of benefits AI can bring to society,” said DeepMind founder and CEO Demis Hassabis.
From genome to proteome
If you’re not familiar with proteomics in general (and it’s pretty natural if that’s the case), the best way to think about this is perhaps in terms of another significant effort: sequencing the human genome. As you may recall from the late ’90s and early ’00s, this was a colossal endeavor undertaken over many years by a large group of scientists and organizations across the globe. The finished genome has been instrumental to the diagnosis and understanding of countless conditions and to the development of drugs and treatments for them.
It was, however, just the beginning of the work in that field, like finishing all the edge pieces of a giant puzzle. One of the next major projects everyone turned their eyes toward in those years was understanding the human proteome, which is to say all the proteins used by the human body and encoded in the genome.
The problem with the proteome is that it’s much, much more complex. Proteins, like DNA, are sequences of known molecules; in DNA, these are the four familiar bases (adenine, guanine, cytosine and thymine), but in proteins, they are the 20 amino acids (each of which is coded by a run of three bases in genes). This in itself creates a great deal more complexity, but it’s only the start. The sequences aren’t simply “code” but actually twist and fold into tiny molecular origami machines that accomplish all kinds of tasks within our bodies. It’s like going from binary code to a complex language that manifests objects in the real world.
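For readers who think better in code, here is a rough sketch of that first step, from bases to amino acids. It assumes the standard genetic code, in which each run of three DNA bases (a codon) maps to one of the 20 amino acids or to a stop signal. The names CODON_TABLE and translate are made up for this illustration, and it says nothing about folding, which is the genuinely hard part.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# TCAG order (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Hypothetical helper table for this sketch: 4 * 4 * 4 = 64 codons.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Turn a coding DNA string into one-letter amino acid codes.

    Reads three bases at a time, ignores any trailing partial codon, and
    stops at the first stop codon. Raises KeyError on letters other than
    A, C, G or T.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGAAATAA"))  # prints MAWK
```

Fifteen bases become a four-letter protein string, and 64 possible codons collapse into just 20 amino acids plus stop. What the string does not tell you is the shape those amino acids fold into.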
Practically speaking, this means that the proteome is not just 20,000 sequences of hundreds of amino acids each; each of those sequences also has a physical structure and function. One of the hardest parts of understanding them is figuring out what shape a given sequence folds into. This is generally done experimentally using something like X-ray crystallography, a long, complex process that may take months or longer to figure out a single protein, and that’s if you happen to have the best labs and techniques at your disposal. The structure can also be predicted computationally, though the process was never good enough to actually rely on until AlphaFold came along.
Taking a discipline by surprise
Without going into the whole history of computational proteomics (as much as I’d like to), we essentially went from distributed brute-force tactics 15 years ago — remember Folding@home? — to more honed processes in the last decade. Then AI-based approaches came on the scene, making a splash in 2019 when DeepMind’s AlphaFold leapfrogged every other system in the world — then made another jump in 2020, achieving accuracy levels high enough and reliable enough that it prompted some experts to declare the problem of turning an arbitrary sequence into a 3D structure solved.
I’m only compressing this long history into one paragraph because it was extensively covered at the time, but it’s hard to overstate how sudden and complete this advance was. This was a problem that stumped the best minds in the world for decades, and it went from “we maybe have an approach that kind of works, but extremely slowly and at great cost” to “accurate, reliable, and can be done with off-the-shelf computers” in the space of a year.