This repository contains scripts to reproduce the analysis in the following workflow paper:
Differential transcript expression and differential transcript usage using Salmon and edgeR v4
Xueyi Dong, Lizhong Chen, Junli Nie, Gordon K. Smyth, Yunshun Chen
The example data used in this protocol is the Illumina RNA-seq data from Dong et al., which is available from the Gene Expression Omnibus (GEO) under accession number GSE172421.
An HTML report generated by knitting workflow.Rmd is available in the docs folder and can also be viewed online at https://chenlaboratory.github.io/DTE_DTU_workflow/. The report also includes the elapsed time of each R-based stage and the R session information from the test run.
- data: This folder is where any data required to run the workflow should be stored. RNA-seq reads should be saved into data/reads, while the reference genome and gene annotation files should be downloaded into data/reference. We provide the sample information and experimental design spreadsheet (data/targets.txt) for the example data. We also provide the R object (data/counts.RDS) containing the imported Salmon output of our test run, so that the exact results shown in the test run can be reproduced.
- docs: This folder contains an example HTML report we generated by running this workflow.
- setup: This folder contains scripts for the experimental setup, including downloading and preparing data files and installing required software packages.
- workflow: This folder contains scripts for all the steps in this analysis workflow.
- workflow.Rmd: This RMarkdown document organizes the workflow into sequential chunks, each sourcing the R script corresponding to a specific stage. Knitting this document runs all R stages (Stages 2-6) and generates an HTML report containing the results.
- test_workflow.sh: This script can be used to run the whole workflow, including pulling the Docker image, downloading and preparing the required data, running Salmon for quantification, and knitting the Rmd document. This script is written using SLURM syntax and requires the apptainer module.
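As a quick sanity check before running anything, the layout described above can be verified from the project root. The following is a minimal sketch: `check_layout` is a hypothetical helper name, not a script shipped with this repository, and it only looks for the paths named in this README.

```shell
#!/usr/bin/env bash
# check_layout [ROOT]: hypothetical helper, not part of the repository.
# Reports which of the paths named in this README are missing under ROOT and
# creates the two input folders the user is expected to fill.
check_layout() {
  root="${1:-.}"
  missing=0
  for path in data/targets.txt docs setup workflow workflow.Rmd test_workflow.sh; do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  # RNA-seq reads and reference files are downloaded into these folders later.
  mkdir -p "$root/data/reads" "$root/data/reference"
  return "$missing"
}
```

Calling `check_layout .` from the project root returns a non-zero status if any listed path is absent.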
This protocol is designed to be run on a Linux operating system. We recommend at least 20 GB of RAM and 8 CPU cores.
This protocol has been tested on the Ubuntu 24.04.4 LTS operating system by running test_workflow.sh.
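The recommended resources can be checked before starting a run. The sketch below is illustrative only: the function names are hypothetical, and it assumes a Linux system where `nproc` and /proc/meminfo are available. Memory is rounded up to whole GiB, because the kernel usually reports slightly less than the installed amount.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check against the recommended 8 CPU cores and 20 GB of RAM.

# meets_minimum CORES MEM_GB [MIN_CORES] [MIN_GB]
# Succeeds when both values reach the minimum (defaults: 8 cores, 20 GB).
meets_minimum() {
  [ "$1" -ge "${3:-8}" ] && [ "$2" -ge "${4:-20}" ]
}

# check_resources: reads the core count and total memory of the running system.
check_resources() {
  cores=$(nproc)
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  mem_gb=$(( (mem_kb + 1048575) / 1048576 ))  # kB -> GiB, rounded up
  echo "cores=$cores mem_gb=$mem_gb"
  meets_minimum "$cores" "$mem_gb"
}
```

On a SLURM cluster these values describe the node the command runs on, so the check is most informative when executed inside the job allocation.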
We have provided a pre-configured Docker container environment containing all the necessary software tools and R packages. The image is publicly hosted on Docker Hub.
For users who prefer to maintain a native environment or wish to install the required software manually, detailed installation instructions can be found in the manuscript of the protocol.
The protocol is dependent on the following software:
- SRA Toolkit software (version 3.1.0 or later)
- Pigz software (version 2.8 or later)
- GffRead software (version 0.12.7 or later)
- Salmon software (version 1.10.0 or later)
- R (version 4.5.2 or later)
- R packages:
  - edgeR version 4.8.2
  - limma version 3.66.0
  - rtracklayer version 1.70.1
  - RColorBrewer version 1.1-3
  - ggplot2 version 4.0.2
  - Gviz version 1.54.0
  - pheatmap version 1.0.13
  - readr version 2.2.0
  - jsonlite version 2.0.0
Typically, installing all the necessary software takes about 2 to 15 minutes, depending on your computer and network environment.
The protocol has been tested with SRA Toolkit version 3.1.0, GffRead version 0.12.7, Salmon version 1.10.0 and R version 4.5.2. The R session information from our test run can be found in the "Session information" section of the expected output.
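For a manual installation, the minimum versions listed above can be compared against what is on the PATH. This is a minimal sketch rather than part of the repository: `version_ge` and `check_tool` are hypothetical names, the comparison relies on `sort -V` from GNU coreutils, and it assumes that the first dotted number a tool prints is its version.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is at least version B.
# Uses `sort -V` (version sort) to order the two strings.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool LABEL MIN_VERSION COMMAND...
# Runs COMMAND, takes the first dotted number in its output as the installed
# version (an assumption about the tool's output), and compares it with MIN_VERSION.
check_tool() {
  label="$1"; min="$2"; shift 2
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$label: not found"
    return 1
  fi
  found=$("$@" 2>&1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
  if [ -n "$found" ] && version_ge "$found" "$min"; then
    echo "$label: $found (ok, need >= $min)"
  else
    echo "$label: ${found:-unknown} (need >= $min)"
    return 1
  fi
}

# Example calls, commented out so that sourcing this file has no side effects:
# check_tool Salmon  1.10.0 salmon --version
# check_tool pigz    2.8    pigz --version
# check_tool GffRead 0.12.7 gffread --version
# check_tool R       4.5.2  R --version
```

Version sort is used instead of plain string comparison so that, for example, 1.10.0 is correctly treated as newer than 1.9.0.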
All the scripts in this repository should be run from the project root directory.
To reproduce the workflow, follow these steps in order:
- Clone this repository and navigate to the directory of the repository.
- Install required software tools:
  Option A: Install the required software manually
  - Follow the instructions in the "Equipment setup" section in the paper to install SRA Toolkit, pigz, GffRead, Salmon and R.
  - Run setup/install_R_packages.R in R to install required R packages.
  Option B: Use our Docker container image
  - For Docker users, the image can be retrieved using the following command:
      docker pull xueyidong/dte_dtu_workflow:latest
  - For users operating on High-Performance Computing (HPC) clusters where Docker is unavailable, this image is also fully compatible with Apptainer. The image can be pulled from Docker Hub and converted into a Singularity Image Format (.sif) file using the following command:
      apptainer pull dte_dtu_workflow.sif docker://xueyidong/dte_dtu_workflow:latest
- Download and prepare data:
  - Download human T2T-CHM13v2.0 reference genome sequence and annotation file into the directory data/reference. The script setup/down_annotation.sh can be used for downloading the reference data.
  - Download and prepare the example RNA-seq reads data. Run download_and_prepare_sra.sh to download the data and merge_tech_batch.sh to merge the technical replicates.
- Run the scripts of each step of the workflow under the workflow folder in order. For steps 2-6, we recommend knitting the RMarkdown document workflow.Rmd to generate an HTML report.
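The order of the steps above can be sketched as a small driver script run from the project root. This is an illustration under stated assumptions, not the repository's own test_workflow.sh: `run_step` and `run_in_order` are hypothetical names, the two read-preparation scripts are assumed to live in setup/ alongside down_annotation.sh, and the Salmon quantification stage is omitted because its script name is not given in this README. By default the commands are only printed.

```shell
#!/usr/bin/env bash
# Hypothetical driver listing the data preparation and R stages in order.
# With DRY_RUN=1 (the default) each command is only printed; set DRY_RUN=0 to run it.
run_step() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

run_in_order() {
  # Reference genome and annotation into data/reference.
  run_step bash setup/down_annotation.sh || return 1
  # Assumption: these two scripts are located in setup/ like the one above.
  run_step bash setup/download_and_prepare_sra.sh || return 1
  run_step bash setup/merge_tech_batch.sh || return 1
  # Stage 1 (Salmon quantification) belongs here; its script under workflow/
  # is not named in this README, so it is left out of this sketch.
  # Stages 2-6: knit the RMarkdown document to produce the HTML report.
  run_step Rscript -e 'rmarkdown::render("workflow.Rmd")'
}
```

When working with the Apptainer image, the same commands could be prefixed with `apptainer exec dte_dtu_workflow.sif` so that they run inside the container.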
When you want to use our workflow on your own data, we recommend the following:
- Choose a suitable version of the reference genome and annotation for your data.
- Prepare a target file to save your experimental design and sample information. The format can be found in this file: data/targets.txt.
- Adjust the design matrix (stage 4, step 24) and contrasts (stage 5, step 29) according to your experimental design. We recommend the following article as a guide on how to set up your design and contrasts properly: Law et al., A guide to creating design matrices for gene expression experiments, F1000Research, 2020, DOI: 10.12688/f1000research.27893.1.
- Due to the stochastic nature of Salmon’s quasi-mapping algorithm and Gibbs resampling, the results of each Salmon quantification run can differ slightly. These differences affect downstream analysis: different transcripts may be filtered out, and a slightly different number of differentially expressed or differentially used transcripts may be detected.
- To achieve strict reproducibility so that results exactly match our example run, or to test the workflow while skipping the time-consuming and computationally intensive data downloading and Salmon quantification steps, you may use our R object data/RDS/counts.RDS containing the imported Salmon output. To use it, replace the command that imports the Salmon output in Stage 3 (line 11, workflow/3_count_preprocess.R):
    counts <- catchSalmon(file.path(salmon_dir, samples))
  with the R command that reads in the R object:
    counts <- readRDS("data/RDS/counts.RDS")
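The replacement described above can also be applied from the command line. The following is a minimal sketch under stated assumptions: `use_cached_counts` is a hypothetical helper, and it assumes the import line starts at the beginning of a line with `counts <- catchSalmon(`. It matches on that text rather than on the line number and keeps a backup of the original script.

```shell
#!/usr/bin/env bash
# use_cached_counts [SCRIPT] [RDS]: hypothetical helper, not part of the repository.
# Rewrites the catchSalmon() import line so that SCRIPT reads the saved R object
# instead. The original script is kept next to it with a .bak suffix.
use_cached_counts() {
  script="${1:-workflow/3_count_preprocess.R}"
  rds="${2:-data/RDS/counts.RDS}"
  # Stop early if the expected import line is not present (assumed form).
  if ! grep -q '^counts <- catchSalmon(' "$script"; then
    echo "no catchSalmon import line found in $script" >&2
    return 1
  fi
  cp "$script" "$script.bak"
  sed "s|^counts <- catchSalmon(.*|counts <- readRDS(\"$rds\")|" "$script.bak" > "$script"
}
```

To undo the change, copy the .bak file back over the script.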