
Monday Article #51 AI PROTEIN: The New Dimension of Science Predicting the Structure of Protein

AI Proteins is re-imagining the possibilities of protein therapeutics by rationally designing entirely new proteins to carry out specific therapeutic functions. Using AI-based design and a high-throughput drug discovery platform, AI Proteins creates synthetic proteins from scratch and optimizes each protein's activity for its therapeutic application. This engineering process enables the development of proteins that are inexpensive, durable, and highly specific, and that can be optimized for oral delivery. [1]

Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. Applying machine learning and AI to protein science opens up new knowledge about proteins, the workhorses of the cell, and about proteome homeostasis, both of which are essential for making life possible. [2]

Figure 1: Six protein shapes predicted by AlphaFold, an artificial intelligence technology under Google DeepMind

In 2020, an artificial intelligence lab called DeepMind unveiled technology that could predict the shape of proteins, the microscopic mechanisms that drive the behaviour of the human body and all other living things. A year later, the lab shared the tool, called AlphaFold, with scientists and released predicted shapes for more than 350,000 proteins, including all proteins expressed by the human genome. It immediately shifted the course of biological research. If scientists can identify the shapes of proteins, they can accelerate their ability to understand diseases, create new medicines and otherwise probe the mysteries of life on Earth. [2]

Protein science is undergoing the most exciting and demanding revolution in its history. The volume of information in its genetic, epigenetic, and molecular networks (inhibitors, activators, modulators, and metabolites) is astronomical. It is organized in an open “protein self-organize, adjustment and fitness space”. For example, a protein of 100 amino acids has 20^100 possible sequence variants, and a chain of that length can adopt ∼10^46 conformations but only one unique native state. Protein data already exceed many petabytes (1 petabyte is 1 million gigabytes). [3]
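To get a feel for the size of that sequence space, the count can be worked out directly. This is a minimal sketch that assumes the 20 standard amino acids and a free choice at every position; the function and constant names are illustrative only.

```python
import math

AMINO_ACID_TYPES = 20  # assumption: the 20 standard amino acids


def sequence_variants(length, alphabet_size=AMINO_ACID_TYPES):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


variants = sequence_variants(100)
# Python integers have arbitrary precision, so the count is exact.
print(f"20^100 has {len(str(variants))} digits, roughly 10^{math.log10(variants):.0f}")
```

The result is a 131-digit number, about 10^130, which is why no experiment or computer can enumerate every possible 100-residue protein.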

The Hidden Reason for Analysing Protein Structure through AI: The Protein Folding Problem

The 3-D structure of a protein is crucial to its biological function. However, understanding how the amino acid sequence determines the 3-D structure is highly challenging, and this is called the “protein folding problem”. [4] It involves understanding the thermodynamics of the interatomic forces that determine the stable folded structure, the mechanism and pathway through which a protein can reach its final folded state with extreme rapidity, and how the native structure of a protein can be predicted from its amino acid sequence. [5]

Figure 2: Amino acid chains fold to form a complete protein structure

Protein structures are currently determined experimentally by techniques such as X-ray crystallography, cryo-electron microscopy and nuclear magnetic resonance, which are both expensive and time-consuming. [4] Such efforts have identified the structures of about 170,000 proteins over the last 60 years, while there are over 200 million known proteins across all life forms. [6] Moreover, Levinthal’s paradox shows that while a protein can fold in milliseconds, the time it would take to randomly enumerate all the possible structures to find the true native structure is longer than the age of the known universe, which makes predicting protein structure a grand challenge. [5]
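The scale of Levinthal's paradox can be checked with back-of-the-envelope arithmetic. The sketch below takes the roughly 10^46 conformations quoted earlier in this article for a 100-residue protein; the sampling rate of 10^13 conformations per second and the age of the universe (about 13.8 billion years) are assumptions added for illustration, not figures from the cited sources.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumption: about 13.8 billion years


def exhaustive_search_years(conformations, samples_per_second):
    """Years needed to visit every conformation once at a fixed sampling rate."""
    return conformations / samples_per_second / SECONDS_PER_YEAR


# Assumed sampling rate: 10^13 conformations per second (hypothetical).
years = exhaustive_search_years(1e46, 1e13)
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"Exhaustive search: {years:.1e} years, {ratio:.1e} times the age of the universe")
```

Even with this generous sampling rate, a random search would take on the order of 10^25 years, so real proteins cannot be folding by trying every conformation.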

Over the years, researchers have applied numerous computational methods to the problem of protein structure prediction, but their accuracy has not come close to that of experimental techniques except for small, simple proteins, which limits their value. CASP (Critical Assessment of Structure Prediction), launched in 1994 to challenge the scientific community to produce its best protein structure predictions, found that by 2016 GDT (Global Distance Test) scores of only about 40 out of 100 could be achieved for the most difficult proteins. [6]
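To make the GDT score concrete, here is a simplified sketch of the GDT_TS variant: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose predicted position lies within the cutoff of the experimental position. It assumes the two structures are already superimposed and given as lists of C-alpha coordinates. The real CASP evaluation also searches over many superpositions, which this sketch does not do, and the coordinates below are made up for illustration.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for two pre-superimposed lists of (x, y, z) points."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("structures must be non-empty and of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Fraction of residues within each cutoff, averaged and scaled to 0-100.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical four-residue example: deviations of 0.5, 1.5, 3.0 and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
score = gdt_ts(predicted, reference)
print(f"GDT_TS = {score:.2f}")
```

A perfect prediction scores 100. In this toy example one residue is off by 9 Å and misses every cutoff, which pulls the score down to 56.25.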

The Chronology of AlphaFold

AlphaFold 1 (2018) looked at the large databanks of related DNA sequences now available from many different organisms, most of them without known 3-D structures, to find changes at different residues that appeared to be correlated even though the residues were not consecutive in the main chain. Such correlations suggest that the residues may be physically close to each other even though they are not close in the sequence, allowing a contact map to be estimated. AlphaFold 1 extended this approach by turning the contact map into a likely distance map, estimating a probability distribution for how close the residues are likely to be. It also used more advanced learning methods than earlier approaches to develop the inference. It then combined a statistical potential based on this probability distribution with the calculated local free energy of the configuration to guide the structure search. [7][8]
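The idea of spotting correlated changes between distant positions can be illustrated with mutual information, a classic covariation statistic. This is not AlphaFold 1's method, which used a deep neural network; it is a much simpler stand-in, and the toy alignment, function names and the `min_separation` parameter are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in count_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def rank_column_pairs(alignment, min_separation=3):
    """Score column pairs at least min_separation apart, highest MI first."""
    columns = list(zip(*alignment))
    scores = [
        ((i, j), mutual_information(columns[i], columns[j]))
        for i, j in combinations(range(len(columns)), 2)
        if j - i >= min_separation
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy alignment (made up): columns 1 and 5 always change together.
alignment = ["MKALVEDG", "MRALVDDG", "MKAIVEDG", "MRALVDEG", "MKALVEDG", "MRAIVDDG"]
top_pair, top_score = rank_column_pairs(alignment)[0]
print(top_pair, round(top_score, 3))
```

In the toy alignment, positions 1 and 5 (counting from 0) switch between K/E and R/D in lockstep, so they come out on top. A high-scoring pair like this is a candidate contact, and many such candidates together make up an estimated contact map.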

The AlphaFold 2 (2020) program is significantly different from the original version that won CASP 13 in 2018. The DeepMind team had identified that its previous approach, combining local physics with a guide potential derived from pattern recognition, had a tendency to over-account for interactions between residues that were nearby in the sequence compared to interactions between residues further apart along the chain. As a result, AlphaFold 1 had a tendency to prefer models with slightly more secondary structure (alpha helices and beta sheets) than was the case in reality (a form of overfitting). [9]

The software design used in AlphaFold 1 contained a number of modules, each trained separately, that were used to produce the guide potential that was then combined with the physics-based energy potential. AlphaFold 2 replaced this with a system of sub-networks coupled together into a single differentiable end-to-end model, based entirely on pattern recognition and trained as one integrated structure. [10]

A key part of the 2020 AlphaFold 2 system is a pair of modules, believed to be based on a transformer design, which progressively refine a vector of information for each relationship i) between one amino acid residue of the protein and another amino acid residue (represented by the array shown in green), and ii) between each amino acid position and each different sequence in the input alignment (represented by the array shown in red). [11]

Figure 3: AlphaFold 2 block design. The two attention-based transformation modules.
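For readers unfamiliar with attention, the sketch below shows generic scaled dot-product self-attention, the building block of transformer designs, on a handful of toy vectors. It is not AlphaFold 2's actual module: learned projection matrices are omitted (queries, keys and values are the input vectors themselves), and the input numbers are invented for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """One round of scaled dot-product self-attention without learned weights."""
    dim = len(vectors[0])
    updated = []
    for query in vectors:
        # Similarity of this vector to every vector, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        updated.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return updated


# Hypothetical per-residue feature vectors.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined = self_attention(features)
print(refined)
```

Each vector is updated with information from the vectors it most resembles. Repeating such updates over many layers is, loosely, what "progressively refine" means in the description above.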


[1] Harnessing the power of synthetically designed proteins to cure diseases. (n.d.). Retrieved from AI Proteins.

[2] Metz, C. (2022, July 28). A.I. Predicts the Shape of Nearly Every Protein Known to Science. Retrieved from The New York Times.

[3] Kauffman, S. A. (1992). Origins of Order in Evolution: Self-Organization and Selection. Retrieved from Springer Link.

[4] AlphaFold: Using AI for scientific discovery. (2020, January 15). Retrieved from DeepMind.

[5] Dill, K. A., Ozkan, S. B., Shell, M. S., & Weikl, T. R. (2008, June 9). The Protein Folding Problem. Retrieved from Annual Review of Biophysics.

[6] Service, R. F. (2020, November 30). ‘The game has changed.' AI triumphs at solving protein structures. Retrieved from Science.

[7] AlQuraishi, M. (2020, January 15). A watershed moment for protein structure prediction. Retrieved from Nature.

[8] AlphaFold: Machine learning for protein structure prediction. (2020, January 31). Retrieved from Foldit.

[9] John Jumper et al., conference abstract (December 2020)

[10] Kahn, J. (2020, December 1). Lessons from DeepMind's breakthrough in protein-folding A.I. Retrieved from Fortune.

[11] See block diagram. Also John Jumper et al. (1 December 2020), slide 10 Retrieved from

