The CGC History

The CGC grew out of a pressing need to analyze large cancer genomics datasets, primarily The Cancer Genome Atlas (TCGA). TCGA is one of the richest and most complete genomics datasets, covering 33 tumor types or subtypes with data from thousands of patients. Funded by $375 million in taxpayer dollars over the past decade, the project collected and analyzed samples at institutions across the U.S. Multiple samples from each patient were analyzed using several approaches, including genome sequencing, RNA sequencing, and microRNA sequencing, among others. TCGA data represents more than 2.5 petabytes of information and continues to grow as more samples are analyzed.

Learning from challenges with TCGA data

Long download time

Prior to the launch of the NCI Cloud Resources, a researcher had to download TCGA data from a central repository before beginning any meaningful analysis. With the data transfer rates available to most researchers, downloading the data for a single individual could require hours or days. Extending this to the more than 11,000 individuals participating in TCGA made simply accessing the raw data a substantial hurdle for most researchers.
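To make the scale concrete, the following is a rough back-of-the-envelope sketch of the transfer times involved. Only the 2.5 petabyte total and the 11,000 individuals come from the text above. The link speeds, the decimal definition of a petabyte, the assumption of a fully sustained transfer rate, and the even split of data across individuals are illustrative assumptions, not measured values.

```python
# Rough estimate of TCGA download times under assumed link speeds.
PETABYTE = 10**15  # bytes; decimal petabyte (assumption)


def transfer_seconds(size_bytes, link_bits_per_second):
    """Seconds needed to move size_bytes at a sustained link rate."""
    return size_bytes * 8 / link_bits_per_second


total_bytes = 2.5 * PETABYTE  # dataset size quoted in the text
individuals = 11_000          # participant count quoted in the text
# Assumes data is spread evenly across individuals, which is a simplification.
per_individual_bytes = total_bytes / individuals

# Hypothetical link speeds, chosen only for illustration.
for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]:
    hours_each = transfer_seconds(per_individual_bytes, rate) / 3600
    days_all = transfer_seconds(total_bytes, rate) / 86400
    print(f"{label}: about {hours_each:.1f} h per individual, "
          f"{days_all:.0f} days for the full dataset")
```

Under these assumptions, a 100 Mbit/s connection needs about five hours per individual and more than six years for the whole dataset. Real transfers would typically be slower because of protocol overhead and shared bandwidth.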

Large local servers needed to store and compute on data

Once downloaded, the data was costly to store on a server. The cost was compounded because the same data was replicated and stored across multiple institutions. Even when data storage was provided, powerful computational resources were still required to explore, manipulate, and analyze the data.

Difficulties with collaborative research

Research has become increasingly multi-institutional, with large consortia often collaborating on the same dataset. Even with the best text descriptions, it is often hard to precisely reproduce data analysis pipelines. This makes it difficult for teams to guarantee that colleagues who run the same analysis in different computational environments will achieve the same results.

The Cancer Genomics Cloud directly addresses these difficulties

Dr. Harold Varmus, the former director of the National Cancer Institute (NCI), initiated the Cancer Genomics Cloud pilot project in 2013 specifically to address these issues. In September 2014, the NCI selected three groups to develop pilot systems: the Broad Institute, the Institute for Systems Biology (ISB), and Seven Bridges Genomics. Each group launched a cloud computing infrastructure in 2016 to support researchers working with TCGA data.

The CGC has matured into a vibrant platform for cancer genomics and beyond. The CGC co-locates large datasets, elastic compute power, and tools for analysis, while also allowing researchers to bring in and develop their own data and tools. Along with the Broad and ISB, the CGC is now part of the NCI Cloud Resources. Building on the lessons from TCGA, the CGC has expanded its collection of datasets beyond TCGA and now includes data from the following sources:

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
Cancer Cell Line Encyclopedia (CCLE)
The Cancer Imaging Archive (TCIA)
Clinical Proteomic Tumor Analysis Consortium (CPTAC)
International Cancer Genome Consortium (ICGC)
Simons Genome Diversity Project (SGDP)
Personal Genome Project UK (PGP-UK) pilot

The features of the CGC have also expanded and now include advanced metadata queries, the ability to run interactive analyses, and more. The CGC will continue to expand the number and diversity of cancer datasets and capabilities over time.

How will this help the cancer research community?

Through the CGC, researchers can:

Immediately access petabytes of Open and Controlled TCGA, TARGET, CPTAC, CCLE, TCIA, ICGC, and SGDP data.
Analyze data from their private cohorts alongside public data.
Use standard bioinformatics pipelines to perform analyses.
Bring their own analysis tools directly to the platform.
Connect multiple tools using our interactive custom workflow builder.
Perform custom, interactive analysis and visualization on the platform using Python, R, and Julia.
Collaborate with researchers around the world.
Access high-throughput, cost-effective cloud computing resources and storage on demand and at cost.
Access the CGC using the API as well as the visual interface.
Enjoy free Amazon Web Services credits for both computation and storage during the evaluation period.
Access comprehensive online documentation and training resources, as well as technical support from a team of more than 200 expert scientists, bioinformaticians, and engineers.