As researchers struggle to advance urgent biomedical research, two European teams have optimized their workflows to process more data more efficiently.
Words: Nicole DeSantis
The COVID-19 pandemic has accelerated the demand for innovative solutions in biomedical research. At the same time, researchers everywhere face unprecedented working conditions that have made in-person collaboration, travel, and planning for the future much more difficult. Everyday tasks like managing clinical trials have turned into complex challenges. In these circumstances, GÉANT’s members need a flexible infrastructure to support remote work and distributed computing, and access to powerful tools to scale up their data processing and get results faster.
Early in 2019, GÉANT and Google agreed to expand support for academic researchers in EMEA, enabling them to leverage the benefits of Google Cloud, our suite of cloud computing solutions for storage, compute, big data, and machine learning. “Google Cloud’s commitment to supporting educational and academic research is core to our DNA, and we’ll continue to find ways to help researchers and organizations apply cloud technologies for the benefit of all,” says Joe Corkery, M.D., Director of Product, Healthcare and Life Sciences at Google Cloud. “We’re so grateful for the work of these experts, and want to support them with tools and technologies that can help them combat this pandemic.”
Here are two examples of European labs that are re-imagining how they conduct their research:
Pre-training AI models to accelerate genomics research—for everyone
The problem is urgent: the search for a COVID-19 vaccine and treatments depends on first deciphering the virus’ molecular mechanisms. Manipulating 3D models of chemical compounds is a crucial step in developing drug therapies, where molecules must fit together like puzzle pieces. A team at Rostlab in the Technical University of Munich (TUM) developed ProtTrans, an innovative way to use machine learning to analyze protein sequences. By expanding access to critical resources, ProtTrans makes protein sequencing easier and faster despite the challenges of working during the pandemic.
Ahmed Elnaggar, an AI specialist and Ph.D. candidate in deep learning, and Michael Heinzinger, a Ph.D. candidate in computational biology and bioinformatics, pre-trained models to “read” up to 393 billion amino acids from over two billion protein sequences that make up the protein universe as we know it today, including proteins of viruses like the one that causes COVID-19. By using Google Cloud’s high-speed Tensor Processing Units (TPUs) for data-intensive processing, they were able to train several models, including one with three billion parameters on a Google TPU v3-1024, which took several days to converge. The team has already done the heaviest computational lifting; now they are starting to distribute their pre-trained models to other researchers, who can apply them to their own tasks from any location, with ordinary consumer hardware. Traditional methods required expensive hardware to collect similar proteins via time-consuming database searches; ProtTrans models, by contrast, can be accessed and run by any researcher on a single GPU. Recent results also show that the new ProtTrans models capture aspects of proteins that were not accessible to previous methods.
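As a rough illustration of how a downstream researcher might consume such a pre-trained model, the sketch below shows the space-separated input format that ProtTrans-style protein language models expect. The example sequence is made up, and the checkpoint name in the comments is an assumption, not something stated in this article:

```python
import re

def preprocess(seq: str) -> str:
    """Space-separate residues and map rare amino acids (U, Z, O, B)
    to the unknown token X -- the input format ProtTrans-style protein
    language models expect."""
    return " ".join(re.sub(r"[UZOB]", "X", seq.upper()))

# Hypothetical protein fragment, for illustration only.
print(preprocess("MKTAYIAKQRU"))  # M K T A Y I A K Q R X

# With the Hugging Face transformers library installed, a pre-trained
# checkpoint (e.g. the publicly released "Rostlab/prot_bert" -- an
# assumption here) could then embed the sequence on a single GPU:
#   tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert",
#                                             do_lower_case=False)
#   model = BertModel.from_pretrained("Rostlab/prot_bert")
```

The point of the sketch is the workflow, not the specifics: the heavy pre-training happens once on TPUs, and downstream users only run inference on commodity hardware.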
As more models are trained, the algorithms themselves become better at predicting complete protein sequences from partial strings of amino acids, which in turn generates faster and more accurate results. By the end of 2020 the Rostlab team hopes to launch a website where researchers can plug in a string of amino acids and quickly access a 3D model of the protein’s structure. According to Heinzinger, “the proposed approach has the potential to revolutionize the way we analyze protein sequences.” Elnaggar points out that “this work couldn’t have been done two years ago. Without the combination of today’s bioinformatics data, new AI algorithms, and the computing power from GPUs and TPUs, it couldn’t be done.”
Creating the largest-ever DNA search engine—at 4 petabytes
The Biomedical Informatics (BMI) Group run by Dr. Gunnar Rätsch at ETH Zurich (Swiss Federal Institute of Technology) draws on huge datasets of genomic information to answer key questions about molecular processes and diseases like cancer. Their research benefits from the massive increase in genomic information now available in datasets like the National Center for Biotechnology Information’s Sequence Read Archive (SRA), the largest public repository of high-throughput sequencing data. By storing vast amounts of raw sequencing data, the SRA helps life science researchers make new discoveries through data analysis.
This flood of data has disadvantages as well as advantages for researchers: each time they need to analyze a DNA sequence dataset, they first have to download the huge files and run their algorithms locally. That consumes time and resources, so with Google’s help the BMI Group is developing a cost-effective search engine for DNA sequences that brings the algorithms to the data, instead of the other way around. The team now uses Google Cloud Storage to manage sequencing data and Compute Engine virtual machine (VM) instances to process it. Called the MetaGraph project, this flexible solution is able to process four petabytes of genomic data, making it the largest DNA search engine ever built.
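The “search engine for sequences” idea can be sketched in miniature: index every k-mer (length-k substring) of each sample, then answer a query by intersecting the sets of samples that contain all of the query’s k-mers. The toy dictionary below stands in for the compressed graph structures a system like MetaGraph would need at petabyte scale, and all sample names and sequences are invented for illustration:

```python
def kmer_index(sequences, k=4):
    """Build a toy k-mer -> set-of-samples index from a dict
    mapping sample names to DNA strings."""
    index = {}
    for sample, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(sample)
    return index

def search(index, query, k=4):
    """Return the samples that contain every k-mer of the query."""
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    hits = [index.get(km, set()) for km in kmers]
    return set.intersection(*hits) if hits else set()

# Invented mini-dataset: two "samples" of sequencing reads.
idx = kmer_index({"sample1": "ACGTACGT", "sample2": "TTTTACGT"})
print(search(idx, "ACGTA"))  # only sample1 contains all k-mers of ACGTA
```

Because the index lives alongside the data, a query touches only the k-mer table, not the raw files, which is the efficiency the cloud-hosted design is after.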
Faced with a rapidly-changing research climate, both TU Munich and the BMI Group found creative ways to rethink their workflows with the flexible, powerful resources of cloud computing. “IT procurement in universities is often optimised for long research projects,” says André Kahles, Senior Postdoc in the BMI group. “You’re locked into infrastructure for four to five years, without much flexibility to adapt in fast-paced projects. Google Cloud lets us constantly readjust the setup to our needs, creating new opportunities and preventing us from spending money on infrastructure we can’t use optimally.”
Getting started with Google Cloud
TU Munich and the BMI Group in Zurich are just two of the many innovative biomedical projects in EMEA that build on Google tools and infrastructure to make research easier, faster, and more efficient in these difficult times. To start or ramp up your own project, we offer research credits to academics using GCP for qualifying projects in eligible countries. You can find our application form at www.cloud.google.com/edu.