Divya Sain

Release notes

Recently published apps

We’ve just published four tools from the OncoGEMINI 1.0.0 toolkit:

  • OncoGEMINI Bottleneck, which identifies somatic variants whose allele frequency increases across longitudinal samples.

  • OncoGEMINI Loh, a command-line tool that performs loss of heterozygosity analysis.

  • OncoGEMINI Truncal, which recovers variants that appear in all tumor samples but are absent from the normal sample.

  • OncoGEMINI Unique, a tool for identifying somatic variants unique to a subset of samples.
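
The truncal and unique selections described above can be pictured as simple set operations. The following is a minimal Python sketch, not OncoGEMINI's own implementation: it assumes each sample has already been reduced to a set of variant identifiers, it reads "unique to a subset" as "present in every sample of the subset and in no other sample", and all function names and sample data are hypothetical.

```python
def truncal_variants(tumor_calls, normal_calls):
    """Variants present in every tumor sample and absent from the normal sample."""
    if not tumor_calls:
        return set()
    shared = set.intersection(*(set(calls) for calls in tumor_calls.values()))
    return shared - set(normal_calls)


def unique_variants(calls, subset):
    """Variants present in every sample of `subset` and in no other sample (assumed reading)."""
    if not subset:
        return set()
    in_subset = set.intersection(*(set(calls[name]) for name in subset))
    elsewhere = set().union(*(set(v) for name, v in calls.items() if name not in subset))
    return in_subset - elsewhere


# Hypothetical variant identifiers for two tumor samples and one normal sample.
tumors = {
    "T1": {"chr1:100A>T", "chr2:200G>C", "chr3:300C>G"},
    "T2": {"chr1:100A>T", "chr2:200G>C"},
}
normal = {"chr2:200G>C"}

truncal = truncal_variants(tumors, normal)
only_t1 = unique_variants({**tumors, "N": normal}, ["T1"])
print(truncal, only_t1)
```

Here the variant shared by both tumors but missing from the normal sample is reported as truncal, while the variant seen only in T1 is reported as unique to that sample.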


Release notes

Import via DRS now available on the CGC

The introduction of the Data Repository Service (DRS) client on the Cancer Genomics Cloud Powered by Seven Bridges enables import of DRS files from known or open external sources, similar to what has been available on CAVATICA Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Seven Bridges since April 2021. A known source is a DRS endpoint that is known to the platform, while open external sources are DRS endpoints that don’t require authorization. This release enables interoperability between the following platforms by making the corresponding DRS endpoints available as known sources:

  • NHLBI BioData Catalyst Powered by Seven Bridges

  • CAVATICA Powered by Seven Bridges

  • Cancer Genomics Cloud Powered by Seven Bridges
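
For readers unfamiliar with DRS identifiers, the sketch below shows how a hostname-based DRS URI maps to an HTTPS object URL under the GA4GH DRS v1 convention. It is illustrative only and says nothing about how the CGC client is implemented; the host name and object ID are made-up placeholders.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri):
    """Translate drs://<host>/<object_id> into the DRS v1 object URL."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError("DRS URI has no object ID")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# Hypothetical endpoint and object ID, used only for illustration.
example_url = drs_uri_to_url("drs://drs.example.org/abc123")
print(example_url)
```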


Release notes

Recently published apps

SBG Image Processing Toolkit

The SBG Image Processing Toolkit consists of apps that cover the various stages of machine learning image processing. The tools integrate with one another to provide an easy and logical analysis flow, support various data types and preprocessing steps, and use the computation capabilities of the CGC.

  1. SBG Deep Learning Image Classification Exploratory Workflow is an image classifier pipeline that relies on the transfer learning approach. This allows the use of pre-trained models as the starting point for building a model adjusted to given image datasets. Furthermore, the pipeline allows training of the model for a variety of hyperparameter combinations in parallel by utilizing multiple GPU instances, while detailed metrics and visualizations help determine the best configuration that can later be used to make predictions on new data instances.

  2. SBG Deep Learning Prediction is an image classifier tool that classifies unlabeled images based on labeled data. It is intended as a final step after the SBG Deep Learning Image Classification Exploratory Workflow: first use the exploratory workflow to test different configurations in parallel and find the best model configuration for the given dataset, then run SBG Deep Learning Prediction with that configuration and all available labeled images as the training data. This provides the best training conditions and, in turn, the best classification results.

  3. SBG Histology Whole Slide Image Preprocessing takes SVS histopathology images, removes various artifacts, and outputs the desired number of best-quality tiles in PNG format, each consisting of at least 90% tissue.

  4. SBG X-Ray Image Preprocessing Workflow applies the selected X-ray image enhancement algorithm: unsharp masking (UM), high-frequency emphasis filtering (HEF) or contrast limited adaptive histogram equalization (CLAHE).

  5. SBG Stain Normalization casts a set of images into the stain colors of a target image. Stain normalization is used as a histopathology image preprocessing step to reduce the color and intensity variations present in stained images obtained from different laboratories.

  6. SBG Medical Image Convert performs medical image format conversion. If the input data are medical images in a non-standard format (e.g. SVS, TIFF, DCM or DICOM), SBG Medical Image Convert converts them to PNG format.

  7. SBG Split Folders organizes an image directory into train and test subdirectories. These directories are necessary inputs for SBG Deep Learning Image Classification Exploratory Workflow and SBG Deep Learning Prediction.
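
Training a model "for a variety of hyperparameter combinations", as the exploratory workflow in item 1 does, amounts to enumerating a grid of settings with one training run per combination. A minimal sketch of that enumeration follows; the hyperparameter names and values are hypothetical and are not taken from the workflow's actual inputs.

```python
from itertools import product


def hyperparameter_grid(options):
    """Return one dict per combination of the given hyperparameter values."""
    names = sorted(options)
    value_lists = (options[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical search space: 1 x 2 x 2 = 4 configurations to train in parallel.
grid = hyperparameter_grid({
    "base_model": ["resnet50"],
    "batch_size": [16, 32],
    "learning_rate": [1e-3, 1e-4],
})

for config in grid:
    print(config)
```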

HistoQC

HistoQC is an open-source quality control tool for digital pathology slides. It performs fast quality control to not only identify and delineate artefacts but also discover cohort-level outliers (e.g., slides stained darker or lighter than others in the cohort). It outputs an interactive user interface for easy viewing and understanding of the results.

Minimac4

Minimac4 is a genetic imputation algorithm that can be used to impute genotypes in a genomic region starting from a reference panel in M3VCF format and pre-phased target GWAS haplotypes.

BOLT-LMM

BOLT-LMM is a tool that tests the association between genotypes and phenotypes using a linear mixed model.


Release notes

Manifest-based import improvements and data update on the CGC

With this release, we have improved and simplified the process of importing PDC, ICDC and CDS data on the CGC. The update brings a better user experience: the upload interface is more modern and streamlined, and manifest files are now automatically uploaded to the destination project, where they can be reused, shared, and organized like any other file in the project.
This release also adds new studies and updates existing data for the PDC, ICDC, and CDS datasets.

Recently published apps

GSEAPreranked Workflow performs Gene Set Enrichment Analysis (GSEA). It assumes that a differential expression analysis has already been performed with the DESeq2 tool, which is publicly available on the CGC. The workflow consists of two tools, GSEA Input Prepare and GSEAPreranked. The GSEAPreranked tool is a wrapper around the command-line tool developed by the Broad Institute. The GSEA Input Prepare tool is based on a Python script developed by the Seven Bridges team to prepare the input files in the formats required by the GSEAPreranked tool.


Release notes

Amazon EC2 GPU G4dn instances available on the CGC

With this update, you can use the newest Amazon EC2 GPU G4dn instances, billed as the industry’s most cost-effective and versatile GPU instances for deploying machine learning models, in task executions and Data Cruncher analyses.

G4dn instances feature NVIDIA T4 GPUs and custom Intel Cascade Lake CPUs, and are optimized for machine learning inference and small scale training.

NVIDIA drivers come preinstalled, are optimized according to Amazon best practices for the specific instance family, and are accessible from the Docker container.


Release notes

Recently published apps

ENCODE ChIP-Seq Pipeline 2 is used to study chromatin modifications and the binding patterns of transcription factors and other proteins. ChIP-seq combines chromatin immunoprecipitation (ChIP) assays with standard next-generation sequencing (NGS). The workflow is based on the ChIP-Seq 2 pipeline developed by the ENCODE Consortium.

ENCODE ATAC-seq Pipeline performs quality control and signal processing, producing alignments and measures of enrichment. The Assay for Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) experiment provides genome-wide profiles of chromatin accessibility. Briefly, the ATAC-seq method works as follows: loaded transposase inserts sequencing primers into open chromatin sites across the genome, and reads are then sequenced. The ends of the reads mark open chromatin sites. The workflow is based on the ENCODE ATAC-seq pipeline, developed by the ENCODE Consortium.


Release notes

PDC Embargo Date implementation and datasets update

When working with PDC data on the CGC, embargoed files are now clearly labelled in the visual interface. Their embargo date is stored as a metadata field and is inherited by outputs when such files are used as task inputs. An embargo means that some NCI data are under embargo for publication and/or citation until a specific date, known as the embargo date.

In addition to the introduction of the embargo date, PDC data available on the CGC has been updated to match the PDC Data Release V1.11 of March 10, 2021.
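
The inheritance rule can be sketched in a few lines of Python. This is a hedged illustration rather than the platform's logic: the metadata key `embargo_date` is a hypothetical name, and the sketch assumes that when inputs carry different embargo dates the output takes the latest one, and that the embargo ends on the stated date. None of these details is specified above.

```python
from datetime import date


def inherited_embargo(input_metadata):
    """Embargo date an output would inherit: the latest date among its inputs (assumption)."""
    dates = [m["embargo_date"] for m in input_metadata if m.get("embargo_date")]
    return max(dates) if dates else None


def is_embargoed(embargo_date, today):
    """True while the embargo date lies in the future (assumed to end on the date itself)."""
    return embargo_date is not None and today < embargo_date


# Hypothetical task inputs: two embargoed files and one file without an embargo.
task_inputs = [
    {"embargo_date": date(2021, 6, 1)},
    {"embargo_date": date(2021, 9, 1)},
    {},
]
output_embargo = inherited_embargo(task_inputs)
print(output_embargo, is_embargoed(output_embargo, date(2021, 7, 1)))
```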

GDC Datasets version update

As of May 27, GDC datasets available through the Data Browser and the API correspond to GDC Data Release 29.0.


Release notes

New Command-line Uploader released

The new Command-line (CLI) Uploader, just released as part of the existing Seven Bridges CLI tool, is now the primary recommended tool for performing large-scale uploads to the CGC. The Uploader is easy to install and use, and is a resilient, performant command-line application that gives users a secure and reliable way to upload data to the CGC.

The legacy Command-line Uploader will remain functional until August 2021, when it will be officially deprecated. The Desktop Uploader is also planned for deprecation in August 2021, as the Web Uploader has been available through the CGC’s visual interface since September 2020. Find out more about the new CLI Uploader in our documentation.

Recently published apps

GENESIS Update Null Model for Fast Score Test updates a null model file obtained with the GENESIS Null Model workflow so that it can be used in the GENESIS Single Variant Association Testing workflow in fast score mode.


Release notes

CWL v1.2 available on the CGC

The CGC now supports Common Workflow Language (CWL) v1.2. The new version brings one major new feature, conditional execution of workflow steps, as well as several minor features and improvements. For the detailed change log, please see the CWL CommandLineTool specification and the CWL Workflow specification.

CWL v1.2 is a backwards-compatible upgrade of v1.1, meaning all v1.0 and v1.1 features are still supported in v1.2. To upgrade a v1.0 or v1.1 app to v1.2, simply edit the app; the next version you save can automatically be upgraded to v1.2. Note that upgrading a workflow to v1.2 this way will not upgrade the CWL version of the tools in the workflow.

Apps using CWL v1.0 and v1.1 versions are still supported and can be used in workflows in combination with CWL v1.2 apps.
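
To give a feel for conditional execution, the sketch below builds a minimal CWL v1.2 workflow as a Python dictionary and prints it as JSON (CWL documents may be written in JSON as well as YAML). The step carries a `when` expression and is skipped when that expression evaluates to false, in which case its outputs are null, hence the optional `File?` output type. The tool file `qc_tool.cwl` and all input, step, and output names are hypothetical placeholders rather than an app available on the CGC.

```python
import json

# Minimal CWL v1.2 workflow with one conditional step (all names are hypothetical).
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {
        "reads": "File",
        "run_qc": "boolean",
    },
    "steps": {
        "qc": {
            "run": "qc_tool.cwl",        # placeholder tool description
            "when": "$(inputs.run_qc)",  # the step runs only if run_qc is true
            "in": {"reads": "reads", "run_qc": "run_qc"},
            "out": ["report"],
        }
    },
    "outputs": {
        "qc_report": {
            "type": "File?",             # null when the step is skipped
            "outputSource": "qc/report",
        }
    },
}

cwl_json = json.dumps(workflow, indent=2)
print(cwl_json)
```

Note that `run_qc` has to be listed among the step inputs for the `when` expression to see it, even if the underlying tool does not use it.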


Release notes

Network access control per Project available on the CGC

The CGC has added another layer of security protecting your data. Researchers can now choose from two options for controlling network access for each Project. This feature defines the network access permissions for both Tasks (tools and workflow executions) and Data Cruncher analyses (interactive analysis environments).

When setting up a project, users can choose to deny network access for all executions, thus ensuring even higher security and compliance standards in the execution environment provided by Seven Bridges. This restricted option will be the default selection for all new Projects. This additional security feature will enhance the safety of data during analysis in the cloud for all apps and notebooks. This change will not affect pulling of externally hosted Docker images or access to project files that point to externally hosted datasets, which means that access to common public datasets such as TCGA will not change. Access to the CGC API will also be available from the execution environment.


Release notes

Recently published apps

The following apps were published in CWL1.x:

  • SRA Toolkit 2.10.8 - NCBI’s collection of tools and libraries for accessing data in Sequence Read Archive (SRA) format.

  • SRA Download and Set Metadata - a workflow that downloads full SRA datasets and populates any metadata that accompanies the dataset.

  • AnnotSV 3.0.7 - structural variant annotation and ranking tool.

  • IsoformSwitchAnalyzeR 1.12.0 - a tool for differential splicing analysis that performs statistical identification of isoform switching when comparing two sample groups.

  • DRIMSeq 1.16.1 - performs differential transcript usage (DTU) analyses using Dirichlet-multinomial generalized linear models.

  • DEXSeq 1.36.0 - toolkit for testing differential exon usage in comparative RNA-Seq experiments.

  • Differential Exon Usage with DEXSeq 1.36.0 - a workflow constructed out of DEXSeq tools, meant for a comprehensive differential splicing analysis.


Release notes

Recently published apps

The following apps were published in CWL1.x:

  • Single Cell Multi Sample Pairwise Differential Expression Workflow - pipeline that performs differential expression analysis on single cell data between pairs of user-defined conditions.

  • Minimap2 v2.17 - a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database, tailored for use with long read sequencing technologies.

  • fastqValidator 0.1.1 - checks format correctness of paired-end and single-end FASTQ files.

  • FastP 0.20.1 - ultra-fast FASTQ preprocessor with useful quality control and data-filtering features, including adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of FASTQ data.

  • SBG convert SRA/BAM to FASTQ - an all-in-one tool that converts SRA/SAM/BAM/CRAM files into FASTQ format.

  • SBG Create Expression Matrix - creates aggregated matrices from various types of inputs, most typically from abundance estimates produced by tools like RSEM, Salmon, or Kallisto.

  • SHAPEIT 4.2.1 - phasing tool for sequencing and SNP array data.

  • Regenie 2.0.1 - tool for whole genome regression analysis.

  • UMI-tools 1.1.1 - tools for dealing with Unique Molecular Identifiers (UMIs)/Random Molecular Tags (RMTs) and single cell RNA-Seq cell barcodes.


Release notes

Recently published apps

MaxQuant is a tool for quantitative proteomics, designed for analysing large mass-spectrometric data sets. It takes files with high-resolution, quantitative MS data and produces information about the quantification of proteins and PTMs. It can be used for analysing data derived from any of the major relative quantification techniques: label-free quantification (LFQ), MS1-level labelling and isobaric MS2-level labelling. Furthermore, it provides quantification algorithms for all common forms of tandem mass tag (TMT) and isobaric tags for relative and absolute quantitation (iTRAQ) labelling, including higher-plex TMT and multinotch MS3 quantification.

GENESIS Association Results Plotting creates Manhattan and QQ plots from GENESIS association test results, with additional filtering and stratification options available. With its default options, this app is part of the GENESIS association testing workflows; however, after the association testing is completed, users can fine-tune the Manhattan and QQ plots by running this app separately.


Release notes

Recently published apps

The following apps were upgraded to CWL1 and had their versions updated as well:

  • GATK

  • Picard

  • VEP toolkit and workflow


Release notes

Foundation Medicine data available on the CGC

The Foundation Medicine dataset has been made available and is accessible through the Data Browser on the CGC. The dataset contains genomic profiling data from approximately 18,000 adult patients with a diverse array of cancers.


Release notes

GDC Datasets version update

As of March 17, GDC datasets available through the Data Browser and the API correspond to GDC Data Release 28.0.


Release notes

Recently published apps

  • GATK Somatic SNVs and INDELs (Mutect2) 4.1.9.0 can be used to detect SNVs and INDELs in one or more tumor samples from a single individual, with or without a matched normal sample. Mutect2 relies on local assembly of haplotypes, which means that whole haplotypes and read pairs, rather than single bases, are treated as the atomic units of biological variation and sequencing evidence, improving variant calling.

  • GATK Somatic Create Mutect2 Panel of Normals 4.1.9.0 workflow creates a panel of normals (germline and artifactual sites) for use in other GATK workflows. It takes multiple normal sample callsets produced by the GATK Somatic SNVs and INDELs (Mutect2) 4.1.9.0 workflow in tumor-only mode (although the mode is called tumor-only, normal samples are given as the input) and collates sites present in two or more samples into a sites-only VCF.

Both workflows are based on the official GATK WDLs.
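
The "two or more samples" rule used when building a panel of normals can be illustrated with a short counting sketch. This is not GATK's implementation; it assumes each normal callset has already been reduced to a collection of (chromosome, position, ref, alt) tuples, and the function name and data are hypothetical.

```python
from collections import Counter


def panel_of_normals_sites(callsets, min_samples=2):
    """Return sites seen in at least `min_samples` normal callsets, sorted."""
    counts = Counter()
    for sites in callsets:
        counts.update(set(sites))  # count each site at most once per sample
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Hypothetical callsets from three normal samples.
site_a = ("chr1", 100, "A", "T")
site_b = ("chr1", 250, "G", "C")
site_c = ("chr2", 500, "C", "G")
normals = [[site_a, site_b], [site_a, site_c], [site_b, site_a]]

pon_sites = panel_of_normals_sites(normals)
print(pon_sites)
```

Only the sites are kept, with no per-sample genotypes, which mirrors the idea of a sites-only output.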


Release notes

Improved project organization with project tags

In order to improve the organization and findability of projects, project tags have been introduced to the CGC.

Project Admins can now assign tags to projects via the API or through the visual interface. Such tags can be used for filtering when browsing all projects, for project categorization, and for general custom organization of projects.

The maximum number of tags for a single project is 15, while the maximum number of characters in a single tag is 36.
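
When preparing tags in a script before assigning them, the two limits above can be checked up front. The helper below is a hypothetical client-side check, not part of the CGC API, and it only encodes the limits stated in this note.

```python
MAX_TAGS = 15        # maximum number of tags per project
MAX_TAG_LENGTH = 36  # maximum number of characters per tag


def validate_project_tags(tags):
    """Return a list of human-readable problems; an empty list means the tags fit the limits."""
    problems = []
    if len(tags) > MAX_TAGS:
        problems.append(f"too many tags: {len(tags)} (limit is {MAX_TAGS})")
    for tag in tags:
        if len(tag) > MAX_TAG_LENGTH:
            problems.append(f"tag too long ({len(tag)} characters): {tag!r}")
    return problems


print(validate_project_tags(["lung-cancer", "wgs", "pilot-2021"]))
```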

PDC data update on the CGC

PDC data on the CGC has been updated with the following PDC Data Releases:

  • V1.0.24 (February 5, 2021)

  • V1.0.22 (January 5, 2021)

  • V1.0.21 (December 15, 2020)

See more information about the history and contents of each PDC data update on the CGC.

GDC Datasets version update

As of February 22, GDC datasets available through the Data Browser and the API correspond to GDC Data Release 27.0.


Release notes

Recently published apps

The following tools were updated to their latest versions and upgraded to CWL1.x:

  • HISAT2-StringTie workflow

  • StringTie

  • Hisat2

  • Trimmomatic

  • Tabix

  • SBG FASTQ Merge

The following new apps were published in CWL1.x:

  • Exomiser 12.1.0 - tool for prioritizing variants from WES and WGS data.

  • VEP Slivar Trios Rare Diseases Analysis workflow - analyzes WES and WGS family variants.

  • Clustering and Gene Marker Identification with Seurat 3.2.2 - clustering and gene marker identification analysis starting from gene-cell UMI or read counts.

  • xCell 1.3 - tool for cell type enrichment analysis, which takes gene expression data and performs analysis for 64 immune and stromal cell types.

  • MBASED 1.18.0 - tool for performing allele-specific expression analysis.

  • MBASED workflow - based on the MBASED tool, with added phasing and VEP annotation, which makes it easier to run allele-specific expression analysis.

  • elPrep 4.1.6 - high-performance tool for preparing SAM/BAM files for variant calling in sequencing pipelines, which can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, calculating and applying base quality score recalibration, etc.

  • Kraken2 2.0.9 - taxonomic sequence classifier that assigns taxonomic labels to DNA sequences.

  • Bracken 2.5 - uses the taxonomic assignments made by Kraken/Kraken2, along with information about the genomes themselves, to estimate abundance at the species/genus level, or above.


Release notes

RAS-CRDC Integration Phase 1 completed

The Researcher Auth Service (RAS), sponsored by the Office of Data Science Strategy, is a service provided by NIH’s Center for Information Technology (CIT) to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner.

The RAS initiative is advancing data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science. RAS has adopted the Global Alliance for Genomics and Health (GA4GH) standards for integration of researcher-focused applications and data repositories over the OIDC platform.

The goal for this effort is to coordinate all cloud stacks and use RAS identically across systems. The NCI CRDC (Cancer Research Data Commons) stack was chosen for the pilot phase to create a phased approach that should achieve the larger goals of federated data access using GA4GH Passports, with a focus on how this fits in with NIH data in general.

Phase 1 is now complete and introduces a change to the login flow when using eRA Commons:

  • When you choose to log in with eRA Commons on the CGC, you are now redirected to the NIH RAS login screen instead of iTrust.

  • Other than the login flow change, user experience on the CGC remains the same.

Recently published apps

  • GATK Broad Best Practice Variant Calling From uBAM - This workflow combines two Broad Best Practice workflows, BAM processing and variant calling, into one.

  • Functional Equivalence WGS - This workflow processes WGS data according to the functional equivalence standard.
