Skip to content

C-Tracy/GenomeAlignmentAnnotation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

112 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Genome Assembly for Abudefduf

This repository hold scripts and select results used to assemble and analyse the genomes of Abudefduf saxatilis and Abudefduf troschelii for the paper "Comparative Genomics of Trans-isthmian Damselfishes (Abudefduf saxatilis and A.troschelii)" by Tracy et al., 2026 (accepted for publication in GBE).

Overall this repository includes information on the de-novo assemblies for PacBio genomes, and conduct subsequent comparative analyses such as synteny, demographic history, and gene family expansion.

Table of Contents

  1. File formatting
  2. De-novo assembly
  3. Decontamination
  4. Quality Assesment

Easley job submission

All scripts were run using Auburn University's Easley Supercomputer using slurm. As such, all scripts have a similar setup at the top with SBATCH parameters that will need to be modified for your own use. I have removed my own job names, partition names, and user names and indicated where you will need to enter your own. I have left the memory, ntasks, and time allotted for the run to help others who may be using the same supercomputer for these analyses. Most of these parameters are likely overkill, however these are what I used and they worked for me.

Below is an example of what the SBATCH component of each script looks like. THe memory, ntasks, and time parameters are different for each script in this repository.

#!/bin/bash
#SBATCH --job-name=ENTERJOBNAME             # job name
#SBATCH --mem=180G
#SBATCH --ntasks=30                  # number of tasks across all nodes
#SBATCH --partition=ENTERPARTITIONNAME          # name of partition to submit job
#SBATCH --time=48:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err           # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL              # will send email for begin,end,fail
#SBATCH --mail-user=ENTERUSERNAME@auburn.edu

Convert File Format

if not already in fasta format run bamtofastx.sh script

source activate genome_env2


bam2fasta -o Outfilename Infilename.bam 

conda deactivate

De-novo genome assembly

run hifiasm.sh script

module load hifiasm/0.15.1


hifiasm -o atro_assembly.asm -t 30 /home/cbt0022/A-troschelii-genome/files.rc.byu.edu/owen_fish/17G/ccs/m64140_211120_040313.hifi_reads.fasta.gz

download .gfa files to visualize in bandage on local computer

need to change path and filenames

scp cbt0022@easley.auburn.edu:~/A-sax-genome/analysis/asax_assembly.asm.bp.p_ctg.gfa . 
scp cbt0022@easley.auburn.edu:~/A-sax-genome/analysis/asax_assembly.asm.bp.hap2.p_ctg.gfa .
scp cbt0022@easley.auburn.edu:~/A-sax-genome/analysis/asax_assembly.asm.bp.hap1.p_ctg.gfa .

get primary contigs in fasta from hifiasm output --- really not sure where this came from or what this does need to change file names

awk '/^S/{print ">"$2;print $3}' asax_assembly.asm.bp.p_ctg.gfa > asax_assembly.asm.p_ctg.fa  # get primary contigs in FASTA

convert gfa to fasta (on desktop)

 ./gfatools gfa2fa ../asax_assembly.asm.bp.p_ctg.gfa > Asax_assembly.fa

pyfastx to get stats (on desktop)

pyfastx stat Asax_assembly.fa > stats.txt

or on Easley

pyfastx info atro_assembly.asm.bp.p_ctg.fa > Atro_stats.txt

Map Raw Reads back to Assembled Genome

#!/bin/bash
#SBATCH --job-name=samtools            # job name
#SBATCH --ntasks=1                # number of tasks across all nodes
#SBATCH --mem=80G
#SBATCH --partition=investor_bg4           # name of partition to submit job
#SBATCH --time=30:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err           # Error file. %j is replaced with job ID
#SBATCH --mail-type=FAIL              # will send email for begin,end,fail
#SBATCH --mail-user=cbt0022@auburn.edu

#module load python/anaconda/
#module load bwa 
module load samtools

#source activate blobtools2
#module load mpich/3.2

#index genome assembly
bwa index -p ASAX  ~/A-sax-genome/analysis/asax_assembly.asm.p_ctg.fa


##run bwa

#mpiexec -np 40 
bwa mem -t 40 ASAX ~/A-sax-genome/analysis/Abudefduf_genome.fasta.gz > Asax_bwa_aln.sam
##convert to bam and sort with samtools
samtools view -bu Asax_bwa_aln.sam | samtools sort -T Asax -O bam -l 9 -m 30G -o Asax_bwa_aln_sorted.bam


#conda deactivate

Decontamination

Decontaminate genome using blobtools on Auburn University's Easley Cluster.

Input data for blobtools:
-i /home/cbt0022/A-sax-genome/analysis/asax_assembly.asm.p_ctg.fa #This is the assembled genome output from hifiasm
-t blast.out
-b Asax_bwa_aln_sorted.bam #Raw reads mapped to genome -o A-sax_ref_blob #OUTPUT

See README.md in Decontamination folder for code used and results.

Assembly with reference genome

Reference genome to use: Nemo_v1.1 GenBank: GCA_003047355.2

  • only chromosome level assembly in Pomacentridae

download reference genome to easley:

wget --recursive --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/047/355/GCA_003047355.2_Nemo_v1.1/ -P reference/

run this on command line:

module load python/anaconda
export CONDA_PKGS_DIRS=~/.conda/pkgs
#already created the genome_env3 environment
#conda create -n genome_env_minimap
source activate genome_env_minimap
conda install -c bioconda minimap2
conda deactivate

minimap.sh script:

#!/bin/bash
#SBATCH --job-name=minimap_abu             # job name
#SBATCH --ntasks=30                  # number of tasks across all nodes
#SBATCH --mem=120G
#SBATCH --partition=general          # name of partition to submit job
#SBATCH --time=48:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err           # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL              # will send email for begin,end,fail
#SBATCH --mail-user=cbt0022@auburn.edu


source activate genome_env_minimap

minimap2 -ax map-pb  ~/reference/GCA_003047355.2_Nemo_v1.1/Nemo_v1.1_genomic.fa.gz ~/A-sax-genome/analysis/asax_assembly.asm.p_ctg.fa > aln.sam   # for PacBio CLR reads

sbatch minimap.sh

BUSCO

Genome Annotation

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors