Release notes


Datasets on the CGC are now categorized as "harmonized" and "legacy" in accordance with the GDC.

In 2016, the GDC started hosting and distributing previously generated data from The Cancer Genome Atlas (TCGA). Additionally, for all submitted sequence data (FASTQs and BAM alignment files), the GDC generated new alignments (BAM files) to the latest human reference genome, GRCh38, using standard workflows. Using these alignments, the GDC generated derived data, including normal and tumor variant and mutation calls, gene and miRNA expression, and splice junction quantification data. The GDC refers to this process of data generation through standard workflows as data harmonization.

Datasets on the CGC that are aligned to GRCh38 are categorized as "harmonized". Datasets on the CGC that are not aligned to GRCh38 are categorized as "legacy". However, "legacy" datasets remain fully supported. To stay as closely synchronized as possible with this data model, we have created a new ontology with more entities.

To this end, there are two versions of the TCGA dataset on the CGC:

  • TCGA: this is the "legacy" version of the dataset
  • TCGA GRCh38: this is the harmonized version of the dataset
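
As an illustration only, the categorization rule described above can be sketched in a few lines of Python. The function name `categorize_dataset` and the reference-genome labels in the example mapping are hypothetical and are not part of the CGC or GDC APIs; the sketch simply assumes that each dataset carries a label naming the reference genome it was aligned to.

```python
def categorize_dataset(reference_genome: str) -> str:
    """Return "harmonized" for GRCh38-aligned data and "legacy" otherwise.

    Hypothetical helper: it mirrors the rule stated in these notes and
    does not call any CGC or GDC API.
    """
    is_grch38 = reference_genome.strip().upper() == "GRCH38"
    return "harmonized" if is_grch38 else "legacy"


# Assumed, illustrative reference-genome labels for the two TCGA datasets.
example_datasets = {
    "TCGA": "hg19",
    "TCGA GRCh38": "GRCh38",
}

for name, reference in example_datasets.items():
    print(f"{name}: {categorize_dataset(reference)}")
```

Under these assumptions, any dataset whose label is not GRCh38 falls into the "legacy" category, which matches the two-way split described above.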


The Simons Genome Diversity Project (SGDP) Open Access dataset contains complete genome sequences from 130 diverse human populations. It is the largest dataset of diverse, high-quality human genome sequences ever reported and includes many deeply divergent human populations that are not well represented in other datasets. This makes the SGDP dataset ideal for interrogating the genomic landscape of different populations.

This dataset is available for analysis under the Public projects tab of the top navigation bar. You won’t pay for storage of the raw data files: copy the entire project or select files into your own project on the CGC to conduct further analysis. You can also analyze SGDP data alongside your own data.

Learn more about using the SGDP public project.