To prepare you for this data jamboree, we’ve assembled the following resources:
Data Access
The HTAN Portal leverages several repositories to provide access to data:
Synapse - Open Access processed Level 3 and Level 4 data
Imaging Data Commons (IDC) - Open Access Imaging data (CC BY 4.0) in DICOM-TIFF format
Seven Bridges Cancer Genomics Cloud - Level 1 & 2 Access-Controlled Sequencing data and Open Access Imaging data (CC BY 4.0). Access control for the sequencing data is managed through dbGaP (Study Accession: phs002371).
Instructions for accessing data from these repositories are available directly on the Explore Page: select the files you need and click the download button. More information about accessing data can be found here.
Instructions for accessing controlled data via dbGaP
Access to controlled-access (i.e., protected) data is granted on a per-project basis via the database of Genotypes and Phenotypes (dbGaP). This primarily includes raw sequencing data such as BAM or FASTQ files, as well as VCF files and protected MAF files. To gain access to these files, a user must apply via dbGaP for access to each individual project. Each project has a Data Access Committee (DAC) that approves or rejects data access requests. Before applying through dbGaP, users also need to obtain an eRA Commons ID for authentication purposes.
Learning to estimate and manage your cloud costs will prepare you to effectively budget for your research projects. These estimates can be included in grant proposals, or be used to request cloud credits offered by the National Institutes of Health.
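As a rough illustration of how such an estimate can be put together, the sketch below sums compute, storage, and data-transfer charges for a project. The rates shown are hypothetical placeholders, not actual pricing from the CGC or any cloud provider, and the names `CloudRates` and `estimate_cost` are invented for this example. Substitute current figures from your provider's pricing page before using the result in a budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudRates:
    """Unit prices in USD. All values used below are hypothetical placeholders."""
    compute_per_hour: float       # price of one instance-hour
    storage_per_gb_month: float   # price of storing 1 GB for one month
    egress_per_gb: float          # price of transferring 1 GB out of the cloud


def estimate_cost(rates, compute_hours, storage_gb, months, egress_gb=0.0):
    """Return a per-category cost breakdown and the total, in USD."""
    compute = rates.compute_per_hour * compute_hours
    storage = rates.storage_per_gb_month * storage_gb * months
    egress = rates.egress_per_gb * egress_gb
    return {
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "total": compute + storage + egress,
    }


# Hypothetical rates and usage, chosen only to show the arithmetic.
rates = CloudRates(compute_per_hour=0.50, storage_per_gb_month=0.02, egress_per_gb=0.10)
estimate = estimate_cost(rates, compute_hours=200, storage_gb=1000, months=6, egress_gb=50)

for category, usd in estimate.items():
    print(f"{category:>8}: ${usd:,.2f}")
```

Keeping the breakdown by category makes it easier to see which line item dominates, which is usually the first thing to adjust when an estimate exceeds the available budget or credits.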
On the CGC
HTAN is a National Cancer Institute (NCI)-funded Cancer Moonshot℠ initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease. (Cell April 2020)
MCMICRO is an end-to-end processing pipeline that transforms multi-channel whole-slide images into single-cell data. MCMICRO is open-source, community-supported software that uses Docker and workflow software to create pipelines for analyzing microscopy-based images of tissues.
This series of videos teaches users the basics of using the Seven Bridges Cancer Genomics Cloud (CGC), powered by Velsera. The CGC is part of NCI’s Cancer Research Data Commons, a cloud-based data science infrastructure that connects data sets with analytics tools to allow researchers to share, integrate, analyze, and visualize cancer research data to drive scientific discovery.