Reproducibility remains a major concern in biomedical research. Recently, it has been demonstrated that cancer informatics analyses performed within a single consortium may yield wildly variable results. As the collection of genomic data and analyses continues to accelerate, concerns about maintaining the accuracy of results continue to grow. Large-scale, accurate cancer analyses demand scalable informatics. Scalability in turn requires reproducibility and portability of tools, analyses, and data to ensure that researchers can collaborate easily and effectively.
Recent technological developments and organizational efforts have sought to address the reproducibility problem in biomedical data analysis and have been successfully applied to cancer informatics. For example, Docker containers enable researchers to package software with all of its required dependencies and nothing more. This feature allows software to be shared with anyone in such a way that the exact analysis can be reproduced. Docker containers can be easily shared through GitHub, third-party repositories, or user-to-user with plain-text files. Moreover, external tools can hook into Docker directly, using it as a component of complex pipelines or analyses. The Common Workflow Language (CWL) is one specification that enables researchers to describe analysis tools and workflows in a way that is powerful, easy to use, and portable. Dynamic computing environments, often referred to as ‘the cloud’, can support co-localization of cancer data, Docker+CWL workflows, and the computational resources required to perform large-scale analyses. These environments can be extended with collaboration and project management tools to enable researchers to work together in a transparent and reproducible fashion.
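As a rough illustration of how a CWL tool description can reference a Docker image, the following Python sketch builds a minimal CWL v1.0 CommandLineTool document and serializes it as JSON, which CWL accepts alongside YAML. The image name (ubuntu:16.04), the wrapped command (wc -l), and the helper name make_cwl_tool are assumptions chosen for illustration only; they are not taken from the workshop material.

```python
import json


def make_cwl_tool(image, base_command):
    """Build a minimal CWL v1.0 CommandLineTool description as a plain dict.

    `image` and `base_command` are illustrative placeholders, not values
    prescribed by any particular project.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        # Command executed inside the container.
        "baseCommand": base_command,
        # Ask the CWL runner to execute the tool in this Docker image.
        "hints": {"DockerRequirement": {"dockerPull": image}},
        # One input file, passed as the first positional argument.
        "inputs": {
            "input_file": {"type": "File", "inputBinding": {"position": 1}},
        },
        # Capture standard output to a file and expose it as the result.
        "stdout": "output.txt",
        "outputs": {"result": {"type": "stdout"}},
    }


# Hypothetical example: count the lines of a file inside an Ubuntu container.
tool = make_cwl_tool("ubuntu:16.04", ["wc", "-l"])
cwl_text = json.dumps(tool, indent=2)
print(cwl_text)
```

Saved to a file, a description like this could in principle be handed to a CWL runner such as cwltool together with an input file; the runner, not the tool author, is then responsible for pulling the image and invoking Docker.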
These methodologies have enabled global-scale cancer genomics initiatives such as the International Cancer Genome Consortium (ICGC) Pan-Cancer Analysis of Whole Genomes (PCAWG) project and the National Cancer Institute (NCI) Cancer Genomics Cloud (CGC) pilot program. In this workshop, we will instruct attendees in Docker and CWL, including their use and best practices, and discuss concretely how these technologies enable scalable, reproducible, and portable cancer informatics. We will also discuss the methodologies behind how these tools are developed and deployed, and pose the following questions: What are the next steps for improving reproducibility in bioinformatics and scaling informatics efforts? What have we learned from analyses of thousands of cancer genomes that can be applied to other diseases and other consortium efforts? In addition, we will encourage discussion about unmet needs and future solutions in cancer informatics.