MetaT Eukaryotic Metatranscriptomics Pipeline

A SLURM-based metatranscriptomic pipeline for marine eukaryotic gene expression analysis. Features dual-assembler approach (MEGAHIT + rnaSPAdes), rRNA removal, and comprehensive taxonomic/functional annotation.

🔬 Pipeline Overview

Raw Reads → QC/Trim → rRNA Removal → Assembly (MEGAHIT + rnaSPAdes) → Merge
                                                                        ↓
                                                    6-frame Translation → Best ORF → Cluster (99%)
                                                                                          ↓
Merged Results ← Quantification (Kallisto) ← Extract NT ← Taxonomy + Function Annotation

Key Features

Dual-assembler approach: Combines MEGAHIT and rnaSPAdes for improved transcript recovery (+150-250% more contigs)
rRNA removal: SortMeRNA filtering before assembly
Optimized for marine metatranscriptomics: Tuned parameters for complex ocean communities
MarFERReT taxonomy: Marine-specific eukaryotic reference database
KEGG functional annotation: KofamScan for metabolic pathway analysis
Automated job chaining: SLURM dependencies handle workflow automatically

📋 Pipeline Steps

Step	Script	Tool	Description
1	`01_qc_trim.slurm`	FastQC + Trimmomatic	Quality control and adapter trimming
2	`02_megahit_assembly.slurm`	MEGAHIT + SortMeRNA	rRNA removal + de novo assembly
2b	`02b_rnaspades_assembly.slurm`	rnaSPAdes	Complementary assembly (optional)
2c	`02c_merge_assemblies.slurm`	CD-HIT-EST	Merge and deduplicate assemblies
3	`03_translate_6frame.slurm`	EMBOSS transeq	Six-frame translation
4	`04_select_best_frame.slurm`	Custom Python	Select longest ORF per contig
5	`05_cluster_mmseqs2.slurm`	CD-HIT	Cluster at 99% amino acid identity
6	`06_taxonomic_annotation.slurm`	Diamond + MarFERReT	LCA taxonomic assignment
7	`07_functional_annotation.slurm`	KofamScan	KEGG Orthology annotation
8	`08_extract_nt_sequences.slurm`	seqkit	Extract NT sequences for quantification
9	`09_kallisto_quantify.slurm`	Kallisto	Pseudo-alignment quantification (TPM)
10	`10_merge_results.slurm`	Custom Python	Combine all results into final table

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/erocke/MMA_OMICS.git
cd MMA_OMICS/metaT_euk_pipeline

2. Set Up Conda Environments

# Create all required environments
conda env create -f environment.yml

# Or create individually:
conda create -n megahit -c bioconda megahit -y
conda create -n sortmerena -c bioconda sortmerna -y
conda create -n emboss -c bioconda emboss -y
conda create -n cd-hit -c bioconda cd-hit -y
conda create -n seqkit-env -c bioconda seqkit -y
conda create -n kofamscan -c bioconda kofamscan -y

3. Download Required Databases

# MarFERReT v1.1.1 (marine eukaryotic reference transcriptomes)
wget https://zenodo.org/record/7055911/files/MarFERReT.v1.1.1.tar.gz
tar -xzf MarFERReT.v1.1.1.tar.gz

# Create Diamond database
diamond makedb --in MarFERReT.v1.1.1.fasta -d MarFERReT.v1.1.1

# KofamScan profiles
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
gunzip ko_list.gz
tar -xzf profiles.tar.gz

# SortMeRNA databases (usually included with conda install)
# Check: ls $CONDA_PREFIX/share/sortmerna/data/

4. Configure the Pipeline

Edit 00_config.sh:

# Sample information
export SAMPLE_ID="YOUR_SAMPLE_NAME"

# Directory structure
export BASE_DIR="/scratch/${USER}/metaT_euk"

# Database paths (adjust for your system)
export MARFERRET_DMND="/path/to/MarFERReT.v1.1.1.dmnd"
export KOFAM_DIR="/path/to/kofam_scan"
export SORTMERNA_DB="/path/to/sortmerna/data"

# HPC settings
export DEFAULT_PARTITION="your_partition"
export DEFAULT_ACCOUNT="your_account"

# Email for job notifications
# Update email in each .slurm file or leave as-is

5. Prepare Input Data

# Source config and create directories
source 00_config.sh
create_dirs

# Copy your raw reads
cp your_sample_R1.fastq.gz ${RAW_DIR}/${SAMPLE_ID}_R1_001.fastq.gz
cp your_sample_R2.fastq.gz ${RAW_DIR}/${SAMPLE_ID}_R2_001.fastq.gz

6. Run the Pipeline

# Run entire pipeline (steps 1-10)
./run_pipeline.sh

# Run with dual assembler (recommended for best results)
./run_pipeline.sh --from 1 --to 2      # QC + MEGAHIT
sbatch 02b_rnaspades_assembly.slurm     # rnaSPAdes (parallel)
sbatch 02c_merge_assemblies.slurm       # Merge (after both complete)
./run_pipeline.sh --from 3 --to 10      # Continue pipeline

# Dry run (preview jobs)
./run_pipeline.sh --dry-run

# Run specific steps
./run_pipeline.sh --from 3 --to 7

7. Monitor Progress

# Check job queue
squeue -u $USER

# Watch log file
tail -f 02_megahit_*.out

# Check for errors
cat *_*.err

📁 Directory Structure

${BASE_DIR}/
├── raw_reads/                          # Input FASTQ files
├── qc_data/
│   ├── fastqc_pre/                     # Pre-trim QC reports
│   ├── fastqc_post/                    # Post-trim QC reports
│   ├── trimmed/                        # Trimmed reads
│   └── non_rrna/                       # rRNA-filtered reads
├── assemblies/
│   ├── megahit_${SAMPLE}/              # MEGAHIT working dir
│   ├── rnaspades_${SAMPLE}/            # rnaSPAdes working dir
│   ├── ${SAMPLE}.megahit.fasta.gz      # Final assembly (or merged)
│   ├── ${SAMPLE}.rnaspades.fasta.gz    # rnaSPAdes assembly
│   ├── ${SAMPLE}.merged.fasta.gz       # Combined assembly
│   ├── 6tr/                            # Translated sequences
│   ├── clustering/                     # Clustered sequences
│   ├── annotations/                    # Taxonomy & function
│   └── kallisto/                       # Quantification results
├── results/                            # Final merged results
│   └── ${SAMPLE}.merged_results.tsv.gz
└── logs/                               # SLURM job logs

📊 Output Files

Main Results

File	Description
`results/${SAMPLE}.merged_results.tsv.gz`	Final table with abundances, taxonomy, and function
`assemblies/${SAMPLE}.merged.fasta.gz`	Combined assembly (if using dual-assembler)
`assemblies/${SAMPLE}.assembly_stats.txt`	Assembly statistics

Results Table Columns

The final merged results contain:

Contig ID: Unique sequence identifier
Length: Contig length (nt and aa)
TPM: Transcripts Per Million (normalized abundance)
Est_counts: Estimated read counts
Tax_ID: NCBI taxonomy ID
Lineage: Full taxonomic lineage
KO: KEGG Orthology identifier
KO_definition: Functional description

⚙️ Configuration Options

Key Parameters (`00_config.sh`)

# ORF filtering
export MIN_ORF_LENGTH=100          # Minimum ORF length (amino acids)

# Clustering
export CLUSTER_IDENTITY=0.99       # CD-HIT clustering threshold

# Annotation
export DIAMOND_EVALUE="1e-5"       # Diamond e-value cutoff
export KOFAM_EVALUE="0.00001"      # KofamScan e-value
export KOFAM_SCORE_CUTOFF=30       # KofamScan score threshold

Assembly Parameters (`02_megahit_assembly.slurm`)

# rRNA removal (recommended)
RRNA_REMOVAL=true                  # Set to false to skip

# MEGAHIT k-mer settings (optimized for metaT)
--k-min 21 --k-max 99 --k-step 10
--min-contig-len 200
--min-count 1                      # Recover low-abundance transcripts

Resource Allocation

Adjust in each .slurm file based on your data size:

Step	Recommended Resources	Notes
Assembly	32 CPU, 200GB RAM, 50h	Scales with read count
rnaSPAdes	32 CPU, 200GB RAM, 72h	Slower than MEGAHIT
Translation	4 CPU, 16GB RAM, 4h	Fast
Clustering	8 CPU, 64GB RAM, 8h	Memory scales with sequences
Annotation	16 CPU, 64GB RAM, 48h	KofamScan is slow
Kallisto	8 CPU, 32GB RAM, 4h	Fast

🔧 Troubleshooting

Common Issues

"conda: command not found" in SLURM jobs

# Make sure conda is initialized in your script
eval "$(conda shell.bash hook)"
conda activate your_env

"Disk quota exceeded"

# Check usage
quota -s
du -sh /scratch/$USER/*

# Clean intermediate files
rm -rf assemblies/megahit_*/intermediate_contigs
rm -rf assemblies/rnaspades_*/corrected
rm -rf qc_data/non_rrna/sortmerna_workdir

Assembly seems small/fragmented

Check rRNA content (should be <5% if rRNA-depleted library)
Try dual-assembler approach (MEGAHIT + rnaSPAdes)
Verify input read quality with FastQC

Job failed with no error message

# Check SLURM error file
cat *_JOBID.err

# Check if previous step completed
ls -la ${expected_input_file}

Checking Assembly Quality

# Quick stats with seqkit
conda activate seqkit-env
seqkit stats -a -T your_assembly.fasta.gz

# Key metrics:
# - N50 > 500bp is reasonable for metaT
# - Check total bases vs input read data

📚 Methods Description

For publications, include:

Raw reads were quality-trimmed using Trimmomatic (v0.40) and rRNA sequences removed with SortMeRNA (v4.3) using the SILVA database. Cleaned reads were assembled using both MEGAHIT (v1.2.9) and rnaSPAdes (v4.0.0), with assemblies merged using CD-HIT-EST at 99% identity. Contigs were translated in all six reading frames using EMBOSS transeq, and the longest ORF ≥100 amino acids was retained per contig. Protein sequences were clustered at 99% identity using CD-HIT. Taxonomic annotation was performed using Diamond (v2.1) against the MarFERReT v1.1.1 database with LCA classification. Functional annotation was performed using KofamScan against KEGG Orthology profiles. Transcript abundances were quantified using Kallisto (v0.46) as TPM.

📖 Citation

If you use this pipeline, please cite:

MEGAHIT: Li D, et al. (2015) MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674-1676
SPAdes/rnaSPAdes: Bushmanova E, et al. (2019) rnaSPAdes: a de novo transcriptome assembler. GigaScience 8:giz100
SortMeRNA: Kopylova E, et al. (2012) SortMeRNA: fast and accurate filtering of ribosomal RNAs. Bioinformatics 28:3211-3217
MarFERReT: Coesel SN, et al. (2023) MarFERReT: Marine Functional EukaRyotic Reference Taxonomy. Zenodo
Diamond: Buchfink B, et al. (2021) Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18:366-368
KofamScan: Aramaki T, et al. (2020) KofamKOALA: KEGG ortholog assignment based on profile HMM. Bioinformatics 36:2251-2252
Kallisto: Bray NL, et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525-527
CD-HIT: Fu L, et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150-3152

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

Emma Rocke (@erocke) - University of Cape Town

🙏 Acknowledgments

AtlantECO / Mission Microbiome Atlantic consortium
MarFERReT development team
Armbrust Lab (University of Washington) for the original pipeline inspiration

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
00_config.template.sh		00_config.template.sh
01_qc_trim.slurm		01_qc_trim.slurm
02_megahit_assembly.slurm		02_megahit_assembly.slurm
02b_rnaspades_assembly.slurm		02b_rnaspades_assembly.slurm
02c_merge_assemblies.slurm		02c_merge_assemblies.slurm
03_translate_6frame.slurm		03_translate_6frame.slurm
04_select_best_frame.slurm		04_select_best_frame.slurm
05_cluster_mmseqs2.slurm		05_cluster_mmseqs2.slurm
06_taxonomic_annotation.slurm		06_taxonomic_annotation.slurm
07_functional_annotation.slurm		07_functional_annotation.slurm
08_extract_nt_sequences.slurm		08_extract_nt_sequences.slurm
09_kallisto_quantify.slurm		09_kallisto_quantify.slurm
10_merge_results.slurm		10_merge_results.slurm
LICENSE		LICENSE
README.md		README.md
compare_samples.R		compare_samples.R
environment.yml		environment.yml
evaluate_busco.slurm		evaluate_busco.slurm
run_pipeline.sh		run_pipeline.sh
visualize_results.R		visualize_results.R

Folders and files

Latest commit

History

Repository files navigation

MetaT Eukaryotic Metatranscriptomics Pipeline

🔬 Pipeline Overview

Key Features

📋 Pipeline Steps

🚀 Quick Start

1. Clone the Repository

2. Set Up Conda Environments

3. Download Required Databases

4. Configure the Pipeline

5. Prepare Input Data

6. Run the Pipeline

7. Monitor Progress

📁 Directory Structure

📊 Output Files

Main Results

Results Table Columns

⚙️ Configuration Options

Key Parameters (00_config.sh)

Assembly Parameters (02_megahit_assembly.slurm)

Resource Allocation

🔧 Troubleshooting

Common Issues

Checking Assembly Quality

📚 Methods Description

📖 Citation

🤝 Contributing

📄 License

👥 Authors

🙏 Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Key Parameters (`00_config.sh`)

Assembly Parameters (`02_megahit_assembly.slurm`)

Packages