The CGC History 

 

The Cancer Genome Atlas (TCGA) is one of the richest and most complete genomics datasets, covering 33 tumor types or subtypes with data from thousands of patients. Funded by $375 million in taxpayer dollars over the past decade, the project collected and analyzed samples at institutions across the U.S. Multiple samples from each patient were analyzed using several approaches, including genome sequencing, RNA sequencing, microRNA sequencing, and more. TCGA data represents more than a petabyte of information and continues to grow as more samples are analyzed.

Learning from this data is challenging

Long download time

Under the current paradigm, before a researcher can begin performing meaningful analysis of TCGA data, they must first download the data from a central repository. With the data transfer rates available to most researchers, downloading data from a single individual can require hours or days. Extending this to the more than 11,000 individuals participating in TCGA makes simply accessing the raw data a substantial hurdle for most researchers.

Large local servers needed to store and compute on data

Once the data is downloaded, storing it on a server is costly. The cost is compounded because the same data is replicated and stored across multiple institutions. Even when storage is available, powerful computational resources are still required to explore, manipulate, and analyze the data.

Difficulties with collaborative research

Research is becoming increasingly multi-institutional, with large consortia often collaborating on the same dataset. Even with the best written descriptions, it is often hard to reproduce data analysis pipelines precisely. As a result, teams cannot guarantee that colleagues who run the same analysis in different computing environments will obtain the same results.

The Cancer Genomics Cloud Pilots directly address these difficulties

Dr. Harold Varmus, the former director of the National Cancer Institute (NCI), initiated the Cancer Genomics Cloud (CGC) pilot project in 2013 specifically to address these issues. In September 2014, the NCI selected three groups to develop pilot systems: the Broad Institute, the Institute for Systems Biology, and Seven Bridges Genomics. These groups are currently developing cloud computing infrastructure to support researchers working with TCGA data.

How will this help the cancer research community? 

Through the CGC pilot project, researchers can:

  • Immediately access petabytes of Open and Controlled TCGA data.
  • Analyze data from their private cohorts alongside TCGA data.
  • Use standard bioinformatics pipelines to perform analyses.
  • Bring their own analysis tools directly to the TCGA dataset.
  • Collaborate with researchers around the world.
  • Access storage and compute resources on the cloud on demand.
  • Access the CGC using the API as well as the visual interface.
  • Enjoy free Amazon Web Services credits for both computation and storage during the evaluation period.