Split-flow

Split-flow is a Nextflow-based pipeline designed for the downstream processing of Cell Ranger outputs from multiplexed droplet-based single-cell sequencing data. Split-flow integrates several essential steps in single-cell analysis, including cell calling, ambient molecule correction, demultiplexing, doublet detection, and cell type annotation, by bringing together popular tools from the literature.

Split-flow workflow

The main input for Split-flow consists of Cell Ranger outputs of multiplexed multimodal experiments. Depending on the experimental design, Split-flow can handle an individual pool as well as data from multiple Chromium chip channels simultaneously. The results are returned as pool-specific or Chromium chip channel-specific objects, saved both in .qs format (readable as Seurat objects) and as Python .h5mu objects (readable in scanpy), providing flexibility and interoperability for downstream investigations in both R and Python.

Designed to accommodate diverse experimental workflows including CITE-seq, multiome (snATAC+RNA), and hashed scRNA-seq experiments, Split-flow is particularly optimized for the study of hematological malignancies, enhancing the reproducibility and quality of data processing in blood cancer research. Built on the Nextflow framework, Split-flow is designed for scalability and for execution on LSF-based computing clusters. Note that Split-flow expects cellranger-multi outputs, but its modular structure makes it straightforward to customize for other output formats. Please contact the authors with any questions regarding customization.

A detailed tutorial is given below for executing the pipeline:

Setting up your environment

First, clone this repository to your project directory:

git clone https://github.com/RippeLab/split-flow

Then load the latest Nextflow version available on your system. If you use conda, activate the conda environment in which Nextflow is installed instead.

module load nextflow/23.10.1

If Nextflow is not available yet, install Nextflow.

After making sure that Nextflow is available on your system, pull the Apptainer images required for running the Split-flow workflow:

singularity pull --name quay.io-rippelab-citeflow-pyciteflow-latest.sif docker://quay.io/rippelab/citeflow-pyciteflow
singularity pull --name quay.io-rippelab-citeflow-rseurat.sif  docker://quay.io/rippelab/citeflow-rseurat
wget -O Demuxafy.sif 'https://www.dropbox.com/scl/fi/kykwi78vk4yifbbag5ajz/Demuxafy.sif?rlkey=5hcugu6ztpy0eik3xno63xiar' 
export SINGULARITY_TMPDIR=$(pwd)/tmp  # uses ./tmp in the current directory as the temporary directory

If the container images specified in the default splitflow.config cannot be pulled automatically during pipeline execution, pull them manually and place the resulting .sif files in your local Apptainer images directory (e.g. ./envs/singularity/). Then update the corresponding container entries in the config to point to the local files (see splitflow.config):

{container = 'file:///path/to/envs/singularity/your-image.sif'}
  • CellBender: docker://us.gcr.io/broad-dsde-methods/cellbender:latest
  • Azimuth: docker://hub.docker.com/r/satijalab/azimuth
  • CellTypist: docker://quay.io/teichlab/celltypist:latest

In our manuscript, we ran souporcell and the cellsnp-lite/vireo combination using the Demuxafy Singularity image (Neavin et al., 2024). However, a custom environment or .sif file can be specified in the .config file to run vireo, souporcell and cellsnp-lite in versions independent of Demuxafy. You can also use local micromamba/conda environments, as long as they are properly configured for each module in the .config file before running Split-flow.

After following the above steps, your system should be ready to run Split-flow!

Preparing Input Data

Split-flow uses a .csv table to import the input paths. Create an input table params_file.csv where you specify the paths of your cellranger output files. The structure of this .csv table is expected to be similar to the following example. Please make sure to use the same column names: sample, path_cr_raw, path_cr_filtered, sample_no:

| sample | path_cr_raw | path_cr_filtered | sample_no |
| --- | --- | --- | --- |
| pool1 | $PATH_pool1_CR_RAW | $PATH_pool1_CR_FILTERED | $NO_samples |

If you are processing a 10x experiment with the same sample loaded on several Chromium chip channels, you can process the channels with a single Split-flow run, specifying the input table as follows with multiple rows:

| sample | path_cr_raw | path_cr_filtered | sample_no |
| --- | --- | --- | --- |
| channel1 | $PATH_channel1_CR_RAW | $PATH_channel1_CR_FILTERED | $NO_samples |
| channel2 | $PATH_channel2_CR_RAW | $PATH_channel2_CR_FILTERED | $NO_samples |

$PATH_pool1_CR_RAW should point to the raw_feature_bc_matrix folder of the cellranger output (path_pool1/raw_feature_bc_matrix), while $PATH_pool1_CR_FILTERED should point to $PATH_pool1/sample_filtered_feature_bc_matrix.
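Before launching a run, it can help to sanity-check the input table. The following is a minimal sketch, assuming the table is comma-separated with a header row; the helper name `check_pool_table` and the example paths are hypothetical and not part of Split-flow.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "path_cr_raw", "path_cr_filtered", "sample_no"]


def check_pool_table(handle):
    """Parse the input table and raise ValueError if it looks malformed.

    Assumption: the file is comma-separated; change the delimiter if your
    table is written differently.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    rows = list(reader)
    for row in rows:
        # sample_no holds the number of multiplexed samples in the pool/channel
        value = (row["sample_no"] or "").strip()
        if not value.isdigit() or int(value) < 1:
            raise ValueError(f"sample_no must be a positive integer: {row['sample']}")
    return rows


# Example table for two Chromium chip channels (placeholder paths)
example = io.StringIO(
    "sample,path_cr_raw,path_cr_filtered,sample_no\n"
    "channel1,/path/to/channel1/raw_feature_bc_matrix,"
    "/path/to/channel1/sample_filtered_feature_bc_matrix,4\n"
    "channel2,/path/to/channel2/raw_feature_bc_matrix,"
    "/path/to/channel2/sample_filtered_feature_bc_matrix,4\n"
)
rows = check_pool_table(example)
print([row["sample"] for row in rows])
```

For a real table, pass an open file instead of the in-memory example, e.g. `check_pool_table(open("params_file.csv", newline=""))`.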

Updating configuration file (splitflow.config)

Before running the pipeline, update the splitflow.config file, which is provided as a template within the cloned Split-flow repository. The .config file specifies parameters for optimizing the individual tools in the pipeline.

First, and most importantly, specify the experimental workflow via the params.workflow option in the .config file: use "Multiome" for multiome (RNA+ATAC) data, "scRNAseq" for hashed scRNA-seq experiments, and "CITE-seq" for CITE-seq experiments.

params {
    workflow = "CITE-seq" 
}

params {
    workflow = "Multiome"
}

params {
    workflow = "scRNAseq"
}

In addition to specifying the experimental workflow, you should also specify whether the hashtag data is provided in the cellranger output (i.e. the HTO libraries are included within the ADT matrix) or as a separate .mtx file in which the rows represent the hashtags and the columns represent the cell barcodes.

If the HTO counts are given within the ADT matrix (cellranger-multi), keep params.hash_data as null. If the hashtag counts are provided externally, set params.hash_data to the path of the external .mtx file.

params {
    hash_data = null 
}
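When supplying an external hashtag matrix, a quick orientation check can catch a transposed file early. The sketch below reads only the size line of a Matrix Market file using the Python standard library; the function name `mtx_shape` and the toy matrix are illustrative assumptions, not part of Split-flow.

```python
import io


def mtx_shape(handle):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    first = handle.readline()
    if not first.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")
    for line in handle:
        # Comment lines start with '%'; the first other non-empty line holds the sizes
        if line.startswith("%") or not line.strip():
            continue
        n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
        return n_rows, n_cols, n_entries
    raise ValueError("size line not found")


# Toy matrix: 4 hashtags (rows) x 6 cell barcodes (columns), 3 non-zero counts
toy = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% toy hashtag counts\n"
    "4 6 3\n"
    "1 1 120\n"
    "2 3 87\n"
    "4 6 45\n"
)
n_hashtags, n_barcodes, _ = mtx_shape(toy)
# Hashtags are expected as rows, so rows should be far fewer than columns
print(n_hashtags, n_barcodes)
```

For a file on disk, open it in text mode first (for a gzip-compressed file, `gzip.open(path, "rt")` gives an equivalent handle).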

As mentioned above, Split-flow was developed to run on LSF-based computing clusters. It can be adapted to other cluster systems, such as those using PBS, by manually adjusting the query options within the pipeline's modules and changing the params.system parameter in the .config file. Keep in mind that these changes must be made by hand and that Split-flow has not yet been tested outside LSF.

params {
    system = "lsf"
}

After specifying the experimental workflow and the system settings in the .config file, you should provide the paths to the required files that are not specified in the .csv input.

The .fasta file is the FASTA file that was used when running the cellranger command.

The .vcf file is required for SNP-based demultiplexing. For a genotype-free workflow that uses a VCF file of common population-level SNPs, refer to the Demuxafy guidelines. The common SNP panels are provided by 1000 Genomes and can be downloaded from the Demuxafy documentation, under the "Data Preparation: SNP Genotype Data" section (Neavin et al., 2024), which offers the 1000 Genomes common SNPs at different levels of filtering. Users are encouraged to test population-based .vcf files with different filtering levels.

If you intend to demultiplex using SNPs derived from donor WGS or SNP array data, the optimal SNP filtering strategy will be dataset-dependent. Be aware that stricter filtering might reduce the number of loci available to distinguish donors from each other, while lenient filtering might introduce noise.
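To get a feeling for how filtering stringency changes the number of usable loci, one can count the variants that pass different minor allele frequency cutoffs. This is a rough sketch that assumes each record stores the alternate allele frequency as an `AF` entry in the INFO column (verify this for your own file); `count_sites_by_maf` and the toy records are hypothetical.

```python
import io


def count_sites_by_maf(handle, thresholds):
    """Count biallelic VCF records whose minor allele frequency meets each threshold."""
    counts = {t: 0 for t in thresholds}
    for line in handle:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or "," in fields[4]:
            continue  # skip malformed or multiallelic records
        info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
        try:
            af = float(info["AF"])  # assumption: frequency is stored as INFO/AF
        except (KeyError, ValueError):
            continue  # no usable allele frequency on this record
        maf = min(af, 1.0 - af)
        for t in thresholds:
            if maf >= t:
                counts[t] += 1
    return counts


# Toy VCF with four biallelic sites
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t.\tPASS\tAF=0.30\n"
    "chr1\t2000\t.\tC\tT\t.\tPASS\tAF=0.02\n"
    "chr2\t1500\t.\tG\tA\t.\tPASS\tAF=0.94\n"
    "chr2\t3000\t.\tT\tC\t.\tPASS\tAF=0.004\n"
)
counts = count_sites_by_maf(toy_vcf, thresholds=(0.01, 0.05, 0.1))
print(counts)  # fewer sites remain as the cutoff gets stricter
```

Comparing such counts across candidate cutoffs gives a first impression of the trade-off between the number of loci and noise before committing to a filtered .vcf file.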

To use donor-specific genotypes, set params.vcf_donor in the config file to the path of your donor.vcf file. Then add -R ${params.vcf_donor} to the DEMULTIPLEXING_VIREO_CELLSNP module so that reads are piled up at donor-specific loci instead of population-level SNPs, and change the vireo command in the DEMULTIPLEXING_VIREO module as follows:

vireo -c ${cellsnp_folder} \
    -N ${meta.sample_no} \
    -o "${meta.sample}_vireo_res" \
    -d ${params.vcf_donor} \
    -t GT

params {
  fasta = "$PATH/file/fasta/genome.fa"
  vcf = "$PATH/GRCh38_1000G_MAF0.01_GeneFiltered_ChrEncoding.vcf"
  translation = '$PATH/donor_translation.csv'
}

Additionally, Split-flow requires a translation table as input. The translation table is needed to match the hashtag-based demultiplexing results with the SNP-based demultiplexing results. It ensures that cells are assigned to the individual demultiplexed groups according to their donor of origin. The translation table should be provided in the following format, with the columns ideally separated by a semicolon (";").

| Hash | Donor |
| --- | --- |
| Hash-1 | Donor x |
| Hash-2 | Donor y |
| Hash-3 | Donor z |
| Hash-4 | Donor b |
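As an illustration, the translation table can be read and checked with a few lines of Python. This sketch assumes a semicolon-separated file with the header `Hash;Donor`, following the format described above; the function name `read_translation` is hypothetical and the donor labels are placeholders.

```python
import csv
import io


def read_translation(handle):
    """Map each hashtag to its donor from a semicolon-separated table."""
    reader = csv.DictReader(handle, delimiter=";")
    if reader.fieldnames != ["Hash", "Donor"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    mapping = {}
    for row in reader:
        if row["Hash"] in mapping:
            raise ValueError(f"hashtag listed twice: {row['Hash']}")
        mapping[row["Hash"]] = row["Donor"]
    return mapping


# Same content as the example table, written with ';' as the separator
table = io.StringIO(
    "Hash;Donor\n"
    "Hash-1;Donor x\n"
    "Hash-2;Donor y\n"
    "Hash-3;Donor z\n"
    "Hash-4;Donor b\n"
)
mapping = read_translation(table)
print(mapping["Hash-3"])
```

A header error here usually means the file was saved with a different separator, such as a comma or a tab.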

Lastly, update the cell type annotation references in ./splitflow.config according to your needs before running Split-flow:

params {
  celltype_ref_azimuth = "bonemarrowref"
  celltype_ref_celltypist = "Immune_All_Low.pkl"
}

celltype_ref_azimuth is passed directly as the reference string to RunAzimuth(). For more information on reference specifications, see the Azimuth documentation. Similarly, the string assigned to celltype_ref_celltypist in the .config file is passed directly to CellTypist as the reference model. For more information on customizing cell type annotation with CellTypist, see the CellTypist documentation.

# R: annotation with Azimuth
RunAzimuth(
  query = rna_object,
  reference = toString(params.celltype_ref_azimuth),
  query.modality = 'RNA'
)

# Python: annotation with CellTypist
from celltypist import models
models.download_models(force_update=True)
model_low = models.Model.load(model=params.celltype_ref_celltypist)

The remaining parameters in the splitflow.config file are the default values provided by the tools. If you need to optimize specific parameters for individual steps (e.g. minMAF filtering for cellsnp-lite), consult the documentation of the respective tool and adjust the corresponding parameters in the .config file.

Running the pipeline

Now you should be ready to submit the pipeline! Simply run:

nextflow run ./main.nf \
  -c ./splitflow.config  \
  --pool_file ./params_file.csv -with-dag