This repository hold scripts and select results used to assemble and analyse the genomes of Abudefduf saxatilis and Abudefduf troschelii for the paper "Comparative Genomics of Trans-isthmian Damselfishes (Abudefduf saxatilis and A.troschelii)" by Tracy et al., 2026 (accepted for publication in GBE).
Overall this repository includes information on the de-novo assemblies for PacBio genomes, and conduct subsequent comparative analyses such as synteny, demographic history, and gene family expansion.
All scripts were run using Auburn University's Easley Supercomputer using slurm. As such, all scripts have a similar setup at the top with SBATCH parameters that will need to be modified for your own use. I have removed my own job names, partition names, and user names and indicated where you will need to enter your own. I have left the memory, ntasks, and time allotted for the run to help others who may be using the same supercomputer for these analyses. Most of these parameters are likely overkill, however these are what I used and they worked for me.
Below is an example of what the SBATCH component of each script looks like. THe memory, ntasks, and time parameters are different for each script in this repository.
#!/bin/bash
#SBATCH --job-name=ENTERJOBNAME # job name
#SBATCH --mem=180G
#SBATCH --ntasks=30 # number of tasks across all nodes
#SBATCH --partition=ENTERPARTITIONNAME # name of partition to submit job
#SBATCH --time=48:00:00 # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=ENTERUSERNAME@auburn.edu
if not already in fasta format run bamtofastx.sh script
source activate genome_env2
bam2fasta -o Outfilename Infilename.bam
conda deactivate
run hifiasm.sh script
module load hifiasm/0.15.1
hifiasm -o atro_assembly.asm -t 30 /home/cbt0022/A-troschelii-genome/files.rc.byu.edu/owen_fish/17G/ccs/m64140_211120_040313.hifi_reads.fasta.gz
download .gfa files to visualize in bandage on local computer
need to change path and filenames
scp cbt0022@easley.auburn.edu:~/A-sax-genome/analysis/asax_assembly.asm.bp.p_ctg.gfa .
scp cbt0022@easley.auburn.edu:~/A-sax-genome/analysis/asax_assembly.asm.bp.hap2.p_ctg.gfa .
scp cbt0022@easley.auburn.edu:~/A-sax-genome/analysis/asax_assembly.asm.bp.hap1.p_ctg.gfa .
get primary contigs in fasta from hifiasm output --- really not sure where this came from or what this does need to change file names
awk '/^S/{print ">"$2;print $3}' asax_assembly.asm.bp.p_ctg.gfa > asax_assembly.asm.p_ctg.fa # get primary contigs in FASTA
convert gfa to fasta (on desktop)
./gfatools gfa2fa ../asax_assembly.asm.bp.p_ctg.gfa > Asax_assembly.fa
pyfastx to get stats (on desktop)
pyfastx stat Asax_assembly.fa > stats.txt
or on Easley
pyfastx info atro_assembly.asm.bp.p_ctg.fa > Atro_stats.txt
#!/bin/bash
#SBATCH --job-name=samtools # job name
#SBATCH --ntasks=1 # number of tasks across all nodes
#SBATCH --mem=80G
#SBATCH --partition=investor_bg4 # name of partition to submit job
#SBATCH --time=30:00:00 # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err # Error file. %j is replaced with job ID
#SBATCH --mail-type=FAIL # will send email for begin,end,fail
#SBATCH --mail-user=cbt0022@auburn.edu
#module load python/anaconda/
#module load bwa
module load samtools
#source activate blobtools2
#module load mpich/3.2
#index genome assembly
bwa index -p ASAX ~/A-sax-genome/analysis/asax_assembly.asm.p_ctg.fa
##run bwa
#mpiexec -np 40
bwa mem -t 40 ASAX ~/A-sax-genome/analysis/Abudefduf_genome.fasta.gz > Asax_bwa_aln.sam
##convert to bam and sort with samtools
samtools view -bu Asax_bwa_aln.sam | samtools sort -T Asax -O bam -l 9 -m 30G -o Asax_bwa_aln_sorted.bam
#conda deactivate
Decontaminate genome using blobtools on Auburn University's Easley Cluster.
Input data for blobtools:
-i /home/cbt0022/A-sax-genome/analysis/asax_assembly.asm.p_ctg.fa #This is the assembled genome output from hifiasm
-t blast.out
-b Asax_bwa_aln_sorted.bam #Raw reads mapped to genome
-o A-sax_ref_blob #OUTPUT
See README.md in Decontamination folder for code used and results.
Reference genome to use: Nemo_v1.1 GenBank: GCA_003047355.2
- only chromosome level assembly in Pomacentridae
download reference genome to easley:
wget --recursive --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/047/355/GCA_003047355.2_Nemo_v1.1/ -P reference/
run this on command line:
module load python/anaconda
export CONDA_PKGS_DIRS=~/.conda/pkgs
#already created the genome_env3 environment
#conda create -n genome_env_minimap
source activate genome_env_minimap
conda install -c bioconda minimap2
conda deactivate
minimap.sh script:
#!/bin/bash
#SBATCH --job-name=minimap_abu # job name
#SBATCH --ntasks=30 # number of tasks across all nodes
#SBATCH --mem=120G
#SBATCH --partition=general # name of partition to submit job
#SBATCH --time=48:00:00 # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=cbt0022@auburn.edu
source activate genome_env_minimap
minimap2 -ax map-pb ~/reference/GCA_003047355.2_Nemo_v1.1/Nemo_v1.1_genomic.fa.gz ~/A-sax-genome/analysis/asax_assembly.asm.p_ctg.fa > aln.sam # for PacBio CLR reads
sbatch minimap.sh