A SLURM-based metatranscriptomic pipeline for marine eukaryotic gene expression analysis. Features dual-assembler approach (MEGAHIT + rnaSPAdes), rRNA removal, and comprehensive taxonomic/functional annotation.
Raw Reads β QC/Trim β rRNA Removal β Assembly (MEGAHIT + rnaSPAdes) β Merge
β
6-frame Translation β Best ORF β Cluster (99%)
β
Merged Results β Quantification (Kallisto) β Extract NT β Taxonomy + Function Annotation
- Dual-assembler approach: Combines MEGAHIT and rnaSPAdes for improved transcript recovery (+150-250% more contigs)
- rRNA removal: SortMeRNA filtering before assembly
- Optimized for marine metatranscriptomics: Tuned parameters for complex ocean communities
- MarFERReT taxonomy: Marine-specific eukaryotic reference database
- KEGG functional annotation: KofamScan for metabolic pathway analysis
- Automated job chaining: SLURM dependencies handle workflow automatically
| Step | Script | Tool | Description |
|---|---|---|---|
| 1 | 01_qc_trim.slurm |
FastQC + Trimmomatic | Quality control and adapter trimming |
| 2 | 02_megahit_assembly.slurm |
MEGAHIT + SortMeRNA | rRNA removal + de novo assembly |
| 2b | 02b_rnaspades_assembly.slurm |
rnaSPAdes | Complementary assembly (optional) |
| 2c | 02c_merge_assemblies.slurm |
CD-HIT-EST | Merge and deduplicate assemblies |
| 3 | 03_translate_6frame.slurm |
EMBOSS transeq | Six-frame translation |
| 4 | 04_select_best_frame.slurm |
Custom Python | Select longest ORF per contig |
| 5 | 05_cluster_mmseqs2.slurm |
CD-HIT | Cluster at 99% amino acid identity |
| 6 | 06_taxonomic_annotation.slurm |
Diamond + MarFERReT | LCA taxonomic assignment |
| 7 | 07_functional_annotation.slurm |
KofamScan | KEGG Orthology annotation |
| 8 | 08_extract_nt_sequences.slurm |
seqkit | Extract NT sequences for quantification |
| 9 | 09_kallisto_quantify.slurm |
Kallisto | Pseudo-alignment quantification (TPM) |
| 10 | 10_merge_results.slurm |
Custom Python | Combine all results into final table |
git clone https://github.com/erocke/MMA_OMICS.git
cd MMA_OMICS/metaT_euk_pipeline# Create all required environments
conda env create -f environment.yml
# Or create individually:
conda create -n megahit -c bioconda megahit -y
conda create -n sortmerena -c bioconda sortmerna -y
conda create -n emboss -c bioconda emboss -y
conda create -n cd-hit -c bioconda cd-hit -y
conda create -n seqkit-env -c bioconda seqkit -y
conda create -n kofamscan -c bioconda kofamscan -y# MarFERReT v1.1.1 (marine eukaryotic reference transcriptomes)
wget https://zenodo.org/record/7055911/files/MarFERReT.v1.1.1.tar.gz
tar -xzf MarFERReT.v1.1.1.tar.gz
# Create Diamond database
diamond makedb --in MarFERReT.v1.1.1.fasta -d MarFERReT.v1.1.1
# KofamScan profiles
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
gunzip ko_list.gz
tar -xzf profiles.tar.gz
# SortMeRNA databases (usually included with conda install)
# Check: ls $CONDA_PREFIX/share/sortmerna/data/Edit 00_config.sh:
# Sample information
export SAMPLE_ID="YOUR_SAMPLE_NAME"
# Directory structure
export BASE_DIR="/scratch/${USER}/metaT_euk"
# Database paths (adjust for your system)
export MARFERRET_DMND="/path/to/MarFERReT.v1.1.1.dmnd"
export KOFAM_DIR="/path/to/kofam_scan"
export SORTMERNA_DB="/path/to/sortmerna/data"
# HPC settings
export DEFAULT_PARTITION="your_partition"
export DEFAULT_ACCOUNT="your_account"
# Email for job notifications
# Update email in each .slurm file or leave as-is# Source config and create directories
source 00_config.sh
create_dirs
# Copy your raw reads
cp your_sample_R1.fastq.gz ${RAW_DIR}/${SAMPLE_ID}_R1_001.fastq.gz
cp your_sample_R2.fastq.gz ${RAW_DIR}/${SAMPLE_ID}_R2_001.fastq.gz# Run entire pipeline (steps 1-10)
./run_pipeline.sh
# Run with dual assembler (recommended for best results)
./run_pipeline.sh --from 1 --to 2 # QC + MEGAHIT
sbatch 02b_rnaspades_assembly.slurm # rnaSPAdes (parallel)
sbatch 02c_merge_assemblies.slurm # Merge (after both complete)
./run_pipeline.sh --from 3 --to 10 # Continue pipeline
# Dry run (preview jobs)
./run_pipeline.sh --dry-run
# Run specific steps
./run_pipeline.sh --from 3 --to 7# Check job queue
squeue -u $USER
# Watch log file
tail -f 02_megahit_*.out
# Check for errors
cat *_*.err${BASE_DIR}/
βββ raw_reads/ # Input FASTQ files
βββ qc_data/
β βββ fastqc_pre/ # Pre-trim QC reports
β βββ fastqc_post/ # Post-trim QC reports
β βββ trimmed/ # Trimmed reads
β βββ non_rrna/ # rRNA-filtered reads
βββ assemblies/
β βββ megahit_${SAMPLE}/ # MEGAHIT working dir
β βββ rnaspades_${SAMPLE}/ # rnaSPAdes working dir
β βββ ${SAMPLE}.megahit.fasta.gz # Final assembly (or merged)
β βββ ${SAMPLE}.rnaspades.fasta.gz # rnaSPAdes assembly
β βββ ${SAMPLE}.merged.fasta.gz # Combined assembly
β βββ 6tr/ # Translated sequences
β βββ clustering/ # Clustered sequences
β βββ annotations/ # Taxonomy & function
β βββ kallisto/ # Quantification results
βββ results/ # Final merged results
β βββ ${SAMPLE}.merged_results.tsv.gz
βββ logs/ # SLURM job logs
| File | Description |
|---|---|
results/${SAMPLE}.merged_results.tsv.gz |
Final table with abundances, taxonomy, and function |
assemblies/${SAMPLE}.merged.fasta.gz |
Combined assembly (if using dual-assembler) |
assemblies/${SAMPLE}.assembly_stats.txt |
Assembly statistics |
The final merged results contain:
- Contig ID: Unique sequence identifier
- Length: Contig length (nt and aa)
- TPM: Transcripts Per Million (normalized abundance)
- Est_counts: Estimated read counts
- Tax_ID: NCBI taxonomy ID
- Lineage: Full taxonomic lineage
- KO: KEGG Orthology identifier
- KO_definition: Functional description
# ORF filtering
export MIN_ORF_LENGTH=100 # Minimum ORF length (amino acids)
# Clustering
export CLUSTER_IDENTITY=0.99 # CD-HIT clustering threshold
# Annotation
export DIAMOND_EVALUE="1e-5" # Diamond e-value cutoff
export KOFAM_EVALUE="0.00001" # KofamScan e-value
export KOFAM_SCORE_CUTOFF=30 # KofamScan score threshold# rRNA removal (recommended)
RRNA_REMOVAL=true # Set to false to skip
# MEGAHIT k-mer settings (optimized for metaT)
--k-min 21 --k-max 99 --k-step 10
--min-contig-len 200
--min-count 1 # Recover low-abundance transcriptsAdjust in each .slurm file based on your data size:
| Step | Recommended Resources | Notes |
|---|---|---|
| Assembly | 32 CPU, 200GB RAM, 50h | Scales with read count |
| rnaSPAdes | 32 CPU, 200GB RAM, 72h | Slower than MEGAHIT |
| Translation | 4 CPU, 16GB RAM, 4h | Fast |
| Clustering | 8 CPU, 64GB RAM, 8h | Memory scales with sequences |
| Annotation | 16 CPU, 64GB RAM, 48h | KofamScan is slow |
| Kallisto | 8 CPU, 32GB RAM, 4h | Fast |
"conda: command not found" in SLURM jobs
# Make sure conda is initialized in your script
eval "$(conda shell.bash hook)"
conda activate your_env"Disk quota exceeded"
# Check usage
quota -s
du -sh /scratch/$USER/*
# Clean intermediate files
rm -rf assemblies/megahit_*/intermediate_contigs
rm -rf assemblies/rnaspades_*/corrected
rm -rf qc_data/non_rrna/sortmerna_workdirAssembly seems small/fragmented
- Check rRNA content (should be <5% if rRNA-depleted library)
- Try dual-assembler approach (MEGAHIT + rnaSPAdes)
- Verify input read quality with FastQC
Job failed with no error message
# Check SLURM error file
cat *_JOBID.err
# Check if previous step completed
ls -la ${expected_input_file}# Quick stats with seqkit
conda activate seqkit-env
seqkit stats -a -T your_assembly.fasta.gz
# Key metrics:
# - N50 > 500bp is reasonable for metaT
# - Check total bases vs input read dataFor publications, include:
Raw reads were quality-trimmed using Trimmomatic (v0.40) and rRNA sequences removed with SortMeRNA (v4.3) using the SILVA database. Cleaned reads were assembled using both MEGAHIT (v1.2.9) and rnaSPAdes (v4.0.0), with assemblies merged using CD-HIT-EST at 99% identity. Contigs were translated in all six reading frames using EMBOSS transeq, and the longest ORF β₯100 amino acids was retained per contig. Protein sequences were clustered at 99% identity using CD-HIT. Taxonomic annotation was performed using Diamond (v2.1) against the MarFERReT v1.1.1 database with LCA classification. Functional annotation was performed using KofamScan against KEGG Orthology profiles. Transcript abundances were quantified using Kallisto (v0.46) as TPM.
If you use this pipeline, please cite:
- MEGAHIT: Li D, et al. (2015) MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674-1676
- SPAdes/rnaSPAdes: Bushmanova E, et al. (2019) rnaSPAdes: a de novo transcriptome assembler. GigaScience 8:giz100
- SortMeRNA: Kopylova E, et al. (2012) SortMeRNA: fast and accurate filtering of ribosomal RNAs. Bioinformatics 28:3211-3217
- MarFERReT: Coesel SN, et al. (2023) MarFERReT: Marine Functional EukaRyotic Reference Taxonomy. Zenodo
- Diamond: Buchfink B, et al. (2021) Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18:366-368
- KofamScan: Aramaki T, et al. (2020) KofamKOALA: KEGG ortholog assignment based on profile HMM. Bioinformatics 36:2251-2252
- Kallisto: Bray NL, et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525-527
- CD-HIT: Fu L, et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150-3152
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Emma Rocke (@erocke) - University of Cape Town
- AtlantECO / Mission Microbiome Atlantic consortium
- MarFERReT development team
- Armbrust Lab (University of Washington) for the original pipeline inspiration