Analysis of the spatial structure of SARS-CoV-2 protein using machine learning methods

István Csabai, Ákos Gellért, Balázs Pál (2022.01.01 - 2022.03.30)
ELTE Department of Physics of Complex Systems

Abstract: The COVID-19 epidemic created an extraordinary situation for all of humanity, claiming millions of lives and causing a significant economic setback. At the same time, the international research community has rapidly generated a data set an order of magnitude larger than any before it, which can contribute to understanding the evolution and dynamics of the epidemic, to its containment, and to the prevention of similar pandemics. The GISAID and COVID-19 Data Portal databases contain millions of complete SARS-CoV-2 genomes. The genetic sequences can be obtained relatively easily and quickly thanks to modern genome sequencers, but it is very difficult to tell how rapidly a given variant will spread or how serious a disease it will cause based solely on the genetic sequences and their mutations. The genetic information is translated into proteins, and the spatial structure and charge distribution of these proteins, together with their interactions with host proteins, determine the function of the virus, the so-called phenotype. In summary, the genotype-phenotype problem is the estimation of the behaviour of a virus based on its genetic information.

In the last year, the rapidly developing field of artificial intelligence has reached a milestone that can significantly help genotype-phenotype research. Using the AlphaFold2 method, the spatial structure of large proteins can be predicted with sufficient accuracy in a reasonable time. The machine learning-based AlphaFold2 method requires significant computational capacity, mainly on GPUs.

We are collaborating with EMBL-EBI on the development of a SARS-CoV-2 genetic archive in the framework of an H2020 project. We aim to complement this archive with 3D protein structures for as many variants as possible and to use these structures to make progress on the genotype-phenotype question.
