The CGC History

The Cancer Genomics Cloud (CGC) grew out of a pressing need to analyze large cancer genomics datasets, primarily The Cancer Genome Atlas (TCGA). TCGA is one of the richest and most complete genomics datasets, comprising 33 tumor types or subtypes with data from thousands of patients. Funded by $375 million in taxpayer dollars over the past decade, the project collected and analyzed samples at institutions across the U.S. Multiple samples from each patient were analyzed using multiple approaches, including genome sequencing, RNA sequencing, microRNA sequencing, and more. TCGA data represents more than 2.5 petabytes of information and continues to grow as more samples are analyzed.

Dr. Harold Varmus, the former director of the National Cancer Institute (NCI), initiated the Cancer Genomics Cloud pilot project in 2013 specifically to address the need to analyze datasets of this scale. In September 2014, the NCI selected three groups to develop pilot systems: the Broad Institute, the Institute for Systems Biology (ISB), and Seven Bridges Genomics. Each group launched its own cloud computing infrastructure in 2016 to support researchers working with TCGA data.

The CGC has matured into a vibrant platform for cancer genomics and beyond. The CGC co-locates large datasets, elastic compute power, and tools for analysis, while also allowing researchers to bring in their own data and to bring in or develop their own tools. Along with the Broad and ISB systems, the CGC is part of the NCI Cloud Resources. The number of datasets has grown over the years and now includes data from the following sources:

The Cancer Genome Atlas Program (TCGA)
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
Cancer Cell Line Encyclopedia (CCLE)
The Cancer Imaging Archive (TCIA)
Proteomic Data Commons (PDC)
Clinical Proteomic Tumor Analysis Consortium (CPTAC)
Integrated Canine Data Commons (ICDC)
International Cancer Genome Consortium (ICGC)
Simons Genome Diversity Project (SGDP)
Personal Genome Project UK (PGP-UK) pilot
Detection of Colorectal Cancer Susceptibility Loci Using Genome-Wide Sequencing (GECCO)
Development of A Tumor Molecular Analyses Program and Its Use to Support Treatment Decisions (LCCC1108)
Pediatric Preclinical Testing Consortium (PPTC)
The Genetic Basis of Aggressive Prostate Cancer, The Role of Rare Variation (APC)
Discovery of Colorectal Cancer Susceptibility Genes in High-Risk Families (TCC)
Limited Use Pilot Test Data (PLCO)

The features of the CGC have also expanded and now include advanced metadata queries, the ability to run interactive analyses, and more. The CGC will continue to expand the number and diversity of cancer datasets and capabilities over time.

HOW WILL THIS HELP THE CANCER RESEARCH COMMUNITY?

Through the CGC, researchers can:

Immediately access petabytes of Open and Controlled TCGA, TARGET, CPTAC, CCLE, TCIA, ICGC, and SGDP data.
Analyze data from their private cohorts alongside public data.
Use standard bioinformatics pipelines to perform analyses.
Bring their own analysis tools directly to the platform.
Connect multiple tools using our interactive custom workflow builder.
Perform custom, interactive analysis and visualization on the platform using Python, R, and Julia.
Collaborate with researchers around the world.
Access high-throughput, cost-effective cloud computing resources and storage on demand and at cost.
Access the CGC using the API as well as the visual interface.
Enjoy free Amazon Web Services credits for both computation and storage during the evaluation period.
Access comprehensive online documentation and training resources, as well as technical support from a team of more than 200 expert scientists, bioinformaticians, and engineers.