Genomics in the Cloud

Student writer: Claire Dubiel
Student editors: Mahir Jethanandani

23andMe editor: Thao Do

When you hear the phrase “cloud computing,” what do you imagine? Do you think of your ability to sync your calendar, email, address book, and more across your smart devices? How about your ability to watch thousands of TV series or movies with the click of a button? Perhaps you are still not too sure what cloud computing really means and are wondering why everyone is going nuts over a mass of condensed water droplets in the sky. Have no fear! In this article we are going to go over what cloud computing is, why it is important for the future of genomics, and the types of challenges computer scientists face when developing cloud infrastructure and as-a-service models for genomics.

What is Cloud Computing?

Cloud computing can be thought of as a model for allocating and sharing computing resources over the internet, rather than on a local machine. These resources include servers, storage, databases, networking, software, analytics, and more. Third-party cloud providers let you manage your resources through a management console in a web browser or through a command line interface. Cloud computing uses a pay-as-you-go model that charges users only for the compute and storage they actually use to run a workflow. In a public cloud deployment model, you may share these compute resources with other tenants that are also renting services hosted by the cloud service provider. In contrast, private cloud deployment models are maintained over a private network and use dedicated resources within a single organization. A hybrid cloud is a mix of public and private cloud, combining local infrastructure with cloud computing resources tailored to an organization’s particular needs [1,2].

Elasticity, Reproducibility, and Data Privacy

The advantage of cloud computing is that large commercial cloud vendors like Amazon Web Services, Google Cloud Platform, and Microsoft Azure each own hundreds of thousands of servers in regions across the globe that can be rented on demand, often for a fraction of the cost of building the same capacity in-house. The ability to scale rented resources up and down as demand changes is referred to as elasticity and is a hallmark of cloud computing. Organizations no longer need to build an infrastructure capable of supporting maximum workloads (e.g. a large influx of online orders on Cyber Monday). They can simply scale up resources when needed and be charged only for the resources allocated [1].
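To make the idea of elasticity a little more concrete, here is a minimal sketch in Python that compares paying for peak capacity around the clock with paying only for the server-hours actually used. The hourly demand figures and the price of 10 cents per server-hour are made-up assumptions for illustration, not real vendor pricing, and the function names are hypothetical.

```python
# Hypothetical number of servers needed in each hour of one day.
# These values are invented for illustration; note the short midday spike.
HOURLY_DEMAND = [2, 2, 2, 2, 3, 4, 6, 8, 10, 40, 40, 12,
                 10, 8, 8, 6, 6, 4, 4, 3, 2, 2, 2, 2]

# Assumed price in cents per server-hour (not a real vendor price).
PRICE_CENTS_PER_SERVER_HOUR = 10


def fixed_capacity_cost(demand, price_cents):
    """Cost of running enough servers for the peak during every hour."""
    return max(demand) * len(demand) * price_cents


def elastic_cost(demand, price_cents):
    """Cost when you pay only for the server-hours actually used."""
    return sum(demand) * price_cents


if __name__ == "__main__":
    fixed = fixed_capacity_cost(HOURLY_DEMAND, PRICE_CENTS_PER_SERVER_HOUR)
    elastic = elastic_cost(HOURLY_DEMAND, PRICE_CENTS_PER_SERVER_HOUR)
    print(f"Sized for the peak: {fixed} cents")
    print(f"Pay-as-you-go:      {elastic} cents")
```

With these invented numbers, sizing for the peak costs 9600 cents for the day while the pay-as-you-go total is 1880 cents. The gap comes entirely from the two-hour spike, which is the kind of workload where scaling up only when needed pays off.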

Another desirable characteristic of cloud computing is reproducibility. Reproducibility is integral to the scientific process, as it allows other scientists to reproduce the results cited in a report using a standardized workflow [1]. Furthermore, newer methods may use Docker or another container platform. A container can be thought of as a running instance of an image that contains everything needed to run an application. We won’t go too deep into Docker in this article, but the important part is that Docker containers allow for reproducible deployments even if the host machines are configured differently or reside on opposite sides of the globe [3].

Lastly, cloud computing offers several lines of defense in terms of privacy and security. Data protection standards exist for different types of datasets. For example, certain genomic datasets, such as protected data from NCBI’s database of Genotypes and Phenotypes (dbGaP), must be encrypted at rest and while being transferred, and are subject to multi-layer security protocols and data storage guidelines [1].

Cloud Computing in the World of Genomics

There are many applications in genomics where the power of cloud computing can be harnessed to derive more insights, analyze data at scale, or store massive datasets. For example, international collaborations such as the International Cancer Genome Consortium (ICGC) securely host data in the Amazon Web Services cloud, where it can be accessed by authorized individuals across the globe [1].

Another major role for cloud computing is reanalyzing the massive amount of publicly archived datasets. One of the largest bottlenecks when working with genomic data is the rate at which data can be transferred. Whereas before it may have been very costly in terms of time and resources to download, store, and analyze data, there is now more incentive than ever to leverage cheap cloud computing technology to ask new questions using the same datasets [1].

Lastly, storage for genomic datasets remains a serious challenge. A statistic from Nature Reviews Genetics shows that from July 2012 to March 2017, the amount of data in the Sequence Read Archive (SRA) doubled four times. This exponential growth in storage usage challenges technologists and researchers alike to develop new methods to meet our growing data storage demands in the future [1].
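To get a feel for what “doubled four times” means, the short Python sketch below works through the arithmetic. It uses only the figures quoted above (July 2012, March 2017, four doublings) and assumes the doublings were evenly spaced, which is a simplification. The function names are hypothetical.

```python
def months_between(start_year, start_month, end_year, end_month):
    """Whole months from the start month to the end month."""
    return (end_year - start_year) * 12 + (end_month - start_month)


def growth_factor(doublings):
    """How many times larger the archive is after a number of doublings."""
    return 2 ** doublings


def average_doubling_time(months, doublings):
    """Average months per doubling, assuming evenly spaced doublings."""
    return months / doublings


if __name__ == "__main__":
    doublings = 4
    months = months_between(2012, 7, 2017, 3)
    print(f"Elapsed time:          {months} months")
    print(f"Overall growth:        {growth_factor(doublings)}x")
    print(f"Average doubling time: {average_doubling_time(months, doublings)} months")
```

Four doublings over those 56 months work out to a 16-fold increase, or roughly one doubling every 14 months on average.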

The Future of Cloud Computing

We have just scratched the surface of the types of solutions that cloud computing offers researchers in genomics. Now that you know the basics of what cloud computing is and the types of problems that exist in genomics, think about where the boundary between computer science and genetics lies. It’s hard to describe, isn’t it? This is because many of the big data problems found in genomics are interdisciplinary in nature and need many domain experts working together to discover a solution. As the boundary between what counts as a big data problem in biology and what counts as a distributed computing problem becomes more and more ambiguous, there will be a rise in demand for people entering the life sciences who are also technically trained to work with cloud platforms and tools.

Like actual clouds, cloud computing has become omnipresent in our modern life. It powers our apps, our businesses, and to an extent even our decisions. Watching a funny cat video on Vimeo may be the last thing you associate with cloud computing, but what if I told you that Vimeo’s infrastructure is supported by a cloud computing platform that is also used by many of the aforementioned international collaborations to advance the field of genomics? [4]

Whether you are more interested in system design, cybersecurity, operations, lab work, statistics, or other fields, the future will need passionate scientists and engineers from a diverse set of backgrounds to solve the world’s toughest problems in cloud computing.

  1. Langmead, Ben, and Abhinav Nellore. “Cloud Computing for Genomic Data Analysis and Collaboration.” Nature Reviews Genetics, 30 Jan. 2018, www.nature.com/articles/nrg.2017.113.
  2. “What Are Public, Private, and Hybrid Clouds?” A Beginner’s Guide | Microsoft Azure, www.azure.microsoft.com/en-us/overview/what-are-private-public-hybrid-clouds/.  
  3. “Docker Documentation.” Docker Documentation, 24 Aug. 2018, www.docs.docker.com/.  
  4. “Vimeo on EC2.” Amazon AWS, https://s3.amazonaws.com/aws001/trailhead/CustomerPresentations_VimeoEC2_NY.pdf.

 
