Skip to content

erocke/Eukarion

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

MetaT Eukaryotic Metatranscriptomics Pipeline

License: MIT

A SLURM-based metatranscriptomic pipeline for marine eukaryotic gene expression analysis. Features dual-assembler approach (MEGAHIT + rnaSPAdes), rRNA removal, and comprehensive taxonomic/functional annotation.

πŸ”¬ Pipeline Overview

Raw Reads β†’ QC/Trim β†’ rRNA Removal β†’ Assembly (MEGAHIT + rnaSPAdes) β†’ Merge
                                                                        ↓
                                                    6-frame Translation β†’ Best ORF β†’ Cluster (99%)
                                                                                          ↓
Merged Results ← Quantification (Kallisto) ← Extract NT ← Taxonomy + Function Annotation

Key Features

  • Dual-assembler approach: Combines MEGAHIT and rnaSPAdes for improved transcript recovery (+150-250% more contigs)
  • rRNA removal: SortMeRNA filtering before assembly
  • Optimized for marine metatranscriptomics: Tuned parameters for complex ocean communities
  • MarFERReT taxonomy: Marine-specific eukaryotic reference database
  • KEGG functional annotation: KofamScan for metabolic pathway analysis
  • Automated job chaining: SLURM dependencies handle workflow automatically

πŸ“‹ Pipeline Steps

Step Script Tool Description
1 01_qc_trim.slurm FastQC + Trimmomatic Quality control and adapter trimming
2 02_megahit_assembly.slurm MEGAHIT + SortMeRNA rRNA removal + de novo assembly
2b 02b_rnaspades_assembly.slurm rnaSPAdes Complementary assembly (optional)
2c 02c_merge_assemblies.slurm CD-HIT-EST Merge and deduplicate assemblies
3 03_translate_6frame.slurm EMBOSS transeq Six-frame translation
4 04_select_best_frame.slurm Custom Python Select longest ORF per contig
5 05_cluster_mmseqs2.slurm CD-HIT Cluster at 99% amino acid identity
6 06_taxonomic_annotation.slurm Diamond + MarFERReT LCA taxonomic assignment
7 07_functional_annotation.slurm KofamScan KEGG Orthology annotation
8 08_extract_nt_sequences.slurm seqkit Extract NT sequences for quantification
9 09_kallisto_quantify.slurm Kallisto Pseudo-alignment quantification (TPM)
10 10_merge_results.slurm Custom Python Combine all results into final table

πŸš€ Quick Start

1. Clone the Repository

git clone https://github.com/erocke/MMA_OMICS.git
cd MMA_OMICS/metaT_euk_pipeline

2. Set Up Conda Environments

# Create all required environments
conda env create -f environment.yml

# Or create individually:
conda create -n megahit -c bioconda megahit -y
conda create -n sortmerena -c bioconda sortmerna -y
conda create -n emboss -c bioconda emboss -y
conda create -n cd-hit -c bioconda cd-hit -y
conda create -n seqkit-env -c bioconda seqkit -y
conda create -n kofamscan -c bioconda kofamscan -y

3. Download Required Databases

# MarFERReT v1.1.1 (marine eukaryotic reference transcriptomes)
wget https://zenodo.org/record/7055911/files/MarFERReT.v1.1.1.tar.gz
tar -xzf MarFERReT.v1.1.1.tar.gz

# Create Diamond database
diamond makedb --in MarFERReT.v1.1.1.fasta -d MarFERReT.v1.1.1

# KofamScan profiles
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
gunzip ko_list.gz
tar -xzf profiles.tar.gz

# SortMeRNA databases (usually included with conda install)
# Check: ls $CONDA_PREFIX/share/sortmerna/data/

4. Configure the Pipeline

Edit 00_config.sh:

# Sample information
export SAMPLE_ID="YOUR_SAMPLE_NAME"

# Directory structure
export BASE_DIR="/scratch/${USER}/metaT_euk"

# Database paths (adjust for your system)
export MARFERRET_DMND="/path/to/MarFERReT.v1.1.1.dmnd"
export KOFAM_DIR="/path/to/kofam_scan"
export SORTMERNA_DB="/path/to/sortmerna/data"

# HPC settings
export DEFAULT_PARTITION="your_partition"
export DEFAULT_ACCOUNT="your_account"

# Email for job notifications
# Update email in each .slurm file or leave as-is

5. Prepare Input Data

# Source config and create directories
source 00_config.sh
create_dirs

# Copy your raw reads
cp your_sample_R1.fastq.gz ${RAW_DIR}/${SAMPLE_ID}_R1_001.fastq.gz
cp your_sample_R2.fastq.gz ${RAW_DIR}/${SAMPLE_ID}_R2_001.fastq.gz

6. Run the Pipeline

# Run entire pipeline (steps 1-10)
./run_pipeline.sh

# Run with dual assembler (recommended for best results)
./run_pipeline.sh --from 1 --to 2      # QC + MEGAHIT
sbatch 02b_rnaspades_assembly.slurm     # rnaSPAdes (parallel)
sbatch 02c_merge_assemblies.slurm       # Merge (after both complete)
./run_pipeline.sh --from 3 --to 10      # Continue pipeline

# Dry run (preview jobs)
./run_pipeline.sh --dry-run

# Run specific steps
./run_pipeline.sh --from 3 --to 7

7. Monitor Progress

# Check job queue
squeue -u $USER

# Watch log file
tail -f 02_megahit_*.out

# Check for errors
cat *_*.err

πŸ“ Directory Structure

${BASE_DIR}/
β”œβ”€β”€ raw_reads/                          # Input FASTQ files
β”œβ”€β”€ qc_data/
β”‚   β”œβ”€β”€ fastqc_pre/                     # Pre-trim QC reports
β”‚   β”œβ”€β”€ fastqc_post/                    # Post-trim QC reports
β”‚   β”œβ”€β”€ trimmed/                        # Trimmed reads
β”‚   └── non_rrna/                       # rRNA-filtered reads
β”œβ”€β”€ assemblies/
β”‚   β”œβ”€β”€ megahit_${SAMPLE}/              # MEGAHIT working dir
β”‚   β”œβ”€β”€ rnaspades_${SAMPLE}/            # rnaSPAdes working dir
β”‚   β”œβ”€β”€ ${SAMPLE}.megahit.fasta.gz      # Final assembly (or merged)
β”‚   β”œβ”€β”€ ${SAMPLE}.rnaspades.fasta.gz    # rnaSPAdes assembly
β”‚   β”œβ”€β”€ ${SAMPLE}.merged.fasta.gz       # Combined assembly
β”‚   β”œβ”€β”€ 6tr/                            # Translated sequences
β”‚   β”œβ”€β”€ clustering/                     # Clustered sequences
β”‚   β”œβ”€β”€ annotations/                    # Taxonomy & function
β”‚   └── kallisto/                       # Quantification results
β”œβ”€β”€ results/                            # Final merged results
β”‚   └── ${SAMPLE}.merged_results.tsv.gz
└── logs/                               # SLURM job logs

πŸ“Š Output Files

Main Results

File Description
results/${SAMPLE}.merged_results.tsv.gz Final table with abundances, taxonomy, and function
assemblies/${SAMPLE}.merged.fasta.gz Combined assembly (if using dual-assembler)
assemblies/${SAMPLE}.assembly_stats.txt Assembly statistics

Results Table Columns

The final merged results contain:

  • Contig ID: Unique sequence identifier
  • Length: Contig length (nt and aa)
  • TPM: Transcripts Per Million (normalized abundance)
  • Est_counts: Estimated read counts
  • Tax_ID: NCBI taxonomy ID
  • Lineage: Full taxonomic lineage
  • KO: KEGG Orthology identifier
  • KO_definition: Functional description

βš™οΈ Configuration Options

Key Parameters (00_config.sh)

# ORF filtering
export MIN_ORF_LENGTH=100          # Minimum ORF length (amino acids)

# Clustering
export CLUSTER_IDENTITY=0.99       # CD-HIT clustering threshold

# Annotation
export DIAMOND_EVALUE="1e-5"       # Diamond e-value cutoff
export KOFAM_EVALUE="0.00001"      # KofamScan e-value
export KOFAM_SCORE_CUTOFF=30       # KofamScan score threshold

Assembly Parameters (02_megahit_assembly.slurm)

# rRNA removal (recommended)
RRNA_REMOVAL=true                  # Set to false to skip

# MEGAHIT k-mer settings (optimized for metaT)
--k-min 21 --k-max 99 --k-step 10
--min-contig-len 200
--min-count 1                      # Recover low-abundance transcripts

Resource Allocation

Adjust in each .slurm file based on your data size:

Step Recommended Resources Notes
Assembly 32 CPU, 200GB RAM, 50h Scales with read count
rnaSPAdes 32 CPU, 200GB RAM, 72h Slower than MEGAHIT
Translation 4 CPU, 16GB RAM, 4h Fast
Clustering 8 CPU, 64GB RAM, 8h Memory scales with sequences
Annotation 16 CPU, 64GB RAM, 48h KofamScan is slow
Kallisto 8 CPU, 32GB RAM, 4h Fast

πŸ”§ Troubleshooting

Common Issues

"conda: command not found" in SLURM jobs

# Make sure conda is initialized in your script
eval "$(conda shell.bash hook)"
conda activate your_env

"Disk quota exceeded"

# Check usage
quota -s
du -sh /scratch/$USER/*

# Clean intermediate files
rm -rf assemblies/megahit_*/intermediate_contigs
rm -rf assemblies/rnaspades_*/corrected
rm -rf qc_data/non_rrna/sortmerna_workdir

Assembly seems small/fragmented

  • Check rRNA content (should be <5% if rRNA-depleted library)
  • Try dual-assembler approach (MEGAHIT + rnaSPAdes)
  • Verify input read quality with FastQC

Job failed with no error message

# Check SLURM error file
cat *_JOBID.err

# Check if previous step completed
ls -la ${expected_input_file}

Checking Assembly Quality

# Quick stats with seqkit
conda activate seqkit-env
seqkit stats -a -T your_assembly.fasta.gz

# Key metrics:
# - N50 > 500bp is reasonable for metaT
# - Check total bases vs input read data

πŸ“š Methods Description

For publications, include:

Raw reads were quality-trimmed using Trimmomatic (v0.40) and rRNA sequences removed with SortMeRNA (v4.3) using the SILVA database. Cleaned reads were assembled using both MEGAHIT (v1.2.9) and rnaSPAdes (v4.0.0), with assemblies merged using CD-HIT-EST at 99% identity. Contigs were translated in all six reading frames using EMBOSS transeq, and the longest ORF β‰₯100 amino acids was retained per contig. Protein sequences were clustered at 99% identity using CD-HIT. Taxonomic annotation was performed using Diamond (v2.1) against the MarFERReT v1.1.1 database with LCA classification. Functional annotation was performed using KofamScan against KEGG Orthology profiles. Transcript abundances were quantified using Kallisto (v0.46) as TPM.

πŸ“– Citation

If you use this pipeline, please cite:

  • MEGAHIT: Li D, et al. (2015) MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674-1676
  • SPAdes/rnaSPAdes: Bushmanova E, et al. (2019) rnaSPAdes: a de novo transcriptome assembler. GigaScience 8:giz100
  • SortMeRNA: Kopylova E, et al. (2012) SortMeRNA: fast and accurate filtering of ribosomal RNAs. Bioinformatics 28:3211-3217
  • MarFERReT: Coesel SN, et al. (2023) MarFERReT: Marine Functional EukaRyotic Reference Taxonomy. Zenodo
  • Diamond: Buchfink B, et al. (2021) Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18:366-368
  • KofamScan: Aramaki T, et al. (2020) KofamKOALA: KEGG ortholog assignment based on profile HMM. Bioinformatics 36:2251-2252
  • Kallisto: Bray NL, et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525-527
  • CD-HIT: Fu L, et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150-3152

🀝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ‘₯ Authors

  • Emma Rocke (@erocke) - University of Cape Town

πŸ™ Acknowledgments

  • AtlantECO / Mission Microbiome Atlantic consortium
  • MarFERReT development team
  • Armbrust Lab (University of Washington) for the original pipeline inspiration

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors