Skip to content

vgl-hub/vgl-curation

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

128 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Process Curated Genome Assembly

ProcessCurated is a set of tools to process curated genome assemblies. In combination with gfastats and mashmap, tt reconciliate the AGP file manually curated in PretextView to rename, reorient, and sort the assemblies to get them ready for submission.

Install

Prerequisites:

    python>=3.9
    setuptools

Download the release archive or clone the github repository. And run pip install ./ from the main directory.

Quickstart with Conda

Create Conda environnement:

    cd vgl-curation
    conda create -n ProcessCuration python=3.9 setuptools
    conda activate ProcessCuration
    pip install ./

Process the curated assembly

Available tools

Available tools:

  • split_agp: Correcting AGP for sequence lengths, splitting the haplotypes from the corrected AGP, assigning unlocs before the agp is imposed on the fasta, and remove haplotig duplications from their origin haplotype.
  • chromosome_assignment: Substituting scaffold for chromosome assignments
  • sak_generation: Processing Mashmap output for reorientation and renaming of scaffolds in haplotype 2.

split_agp

Inputs: -Fasta file of the Assembly with the two haplotypes -Curated AGP generated with PretextView (See manual in docs for curation tips) -Output Directory (Optionnal, default ./)

Usage:

    split_agp -f <Assembly fasta file> -a <Curated AGP> -o <Output Directory>

Output:

  • Corrected AGP file
  • AGP file for each haplotype in repositories <Output Directory>/Hap_1/ and <Output Directory>/Hap_2/
  • An agp file with the unloc assigned and the duplicated haplotigs removed. (//hap.unlocs.no_hapdups.agp)
  • Removed haplotigs (>//haplotigs.agp)

chromosome_assignment

Inputs:

  • AGP file with the unloc assigned (-a)
  • Sorted fasta file for the haplotype (-f)
  • Output directory (-o)

Usage:

    chromosome_assignment -a Hap_1/hap.unlocs.no_hapdups.agp -f Hap_1/hap.sorted.fa -o Hap_1 

Output:

  • Table mapping the scaffolds to their chromosomal assignment (<output_dir>/inter_chr.tsv)
  • Fasta file with chromosomal level sequences (<output_dir>/hap.chr_level.fa)

sak_generation

Inputs:

  • Table mapping the scaffolds to their chromosomal assignment for haplotype 1 (-1) and haplotype 2 (-2)
  • Haplotype used as a query in Mashmap (-q). Possible values (Hap_1 or Hap_2)
  • Haplotype used as a reference in Mashmap (-q). Possible values (Hap_1 or Hap_2)
  • Corrected AGP output of AGPCorrect (-a)
  • Output directory (-o)

Usage:

     sak_generation  -1 Hap_1/inter_chr.tsv -2 Hap_2/inter_chr.tsv -r Hap_1 -q Hap_2 -a corrected.agp -m  mashmap.out -o <Output Directory>>

Output:

  • SAK file for gfastats reversing and renaming of Haplotype 2 sequences (<out_dir>/rvcp.sak)
  • Table with the orientation of Hap1 vs Hap2 scaffolds (The sign is the most represented in the mashmap alignments between two sequences) (<out_dir>/orientation.tsv)
  • The Mapping between hap2 and hap1 names (<out_dir>/hap2.vs.hap1.tsv)

Run the whole Pipeline

Extra dependencies:

  • gfastats>=1.3.10
  • MashMap

Inputs:

  • Fasta file with the 2 haplotypes ()
  • AGP file generated by PretextView after manual curation (see README.md for curation tips) ()

Split AGP and assign unlocs:

    split_agp -f <fasta> -a <agp>

Run gfastats to reconciliate the curated agp with the fasta file for each haplotype and sort them by size:

    gfastats <fasta> --agp-to-path Hap_1/hap.unlocs.no_hapdups.agp -o Hap_1/hap.unlocs.no_hapdups.fa
    gfastats Hap_1/hap.unlocs.no_hapdups.fa --sort largest -o Hap_1/hap.sorted.fa 

    gfastats <fasta> --agp-to-path Hap_2/hap.unlocs.no_hapdups.agp -o Hap_2/hap.unlocs.no_hapdups.fa 
    gfastats Hap_2/hap.unlocs.no_hapdups.fa --sort largest -o Hap_2/hap.sorted.fa 

Assign chromosome names to scaffolds:

    chromosome_assignment -a Hap_1/hap.unlocs.no_hapdups.agp -f Hap_1/hap.sorted.fa -o Hap_1 
    chromosome_assignment -a Hap_2/hap.unlocs.no_hapdups.agp -f Hap_2/hap.sorted.fa -o Hap_2 

Align haplotype 1 versus haplotype 2 with MashMap to detect mis-orientations:

    mashmap -r Hap_1/hap.chr_level.fa  -q Hap_2/hap.chr_level.fa -f one-to-one -t 16 -s 50000 --pi 90 -o mashmap_original_fastas.out

Rename and reorient the haplotype 2 sequences to match the haplotype 1:

    awk '/^>/ {$1 = $1 "_oldname"} {print}' Hap_2/hap.chr_level.fa > Hap_2/tmpnamed.fa  # Add temporary suffix to avoid confusion between old and new names during the renaming.
    sak_generation  -1 Hap_1/inter_chr.tsv -2 Hap_2/inter_chr.tsv -r Hap_1 -q Hap_2 -a corrected.agp -m  mashmap_original_fastas.out -s <Heterogametic chromosome> -o tagged_pairs/ 
    gfastats Hap_2/tmpnamed.fa  -k tagged_pairs/reversing_renaming.sak  -o Hap_2/hap2.partially_renamed.fasta 
    sed '/^>/ s/_oldname//' Hap_2/hap2.partially_renamed.fasta >  Hap_2/hap2.reoriented.renamed.fasta  # Remove temporary suffix.

Use the curation script to run everything

The script is designed to process both haplotypes at once and will separate the haplotype files into two folders, "Hap_1 " and "Hap_2".

sh curation_2.0_pipe.sh -f <haplotype combined fasta> -a <PretextView generated agp> -s <Heterogametic Chromosome> -d
-h help
-f combined haplotype fasta
-a haplotype agp generated from pretextview
-d Test dependencies and install if missing (need conda). (Optional) (Note: This option may fail if you are in a python venv)

Example:
sh curation_2.0_pipe.sh -f rCycPin1.HiC.haps_combined.fasta -a rCycPin1.HiC.haps_combined.pretext.agp -s W -d

Outputs

  • corrected.agp: Corrected agp
  • Hap_[12]/hap.agp: Split agp for each haplotype
  • Hap_[12]/hap.unlocs.no_hapdups.agp: Processed AGP file
  • Hap_[12]/haplotigs.agp: AGP file containing the haplotig duplications
  • Hap_[12]/hap.unlocs.no_hapdups.fa: Reconciliated fasta
  • Hap_[12]/hap.sorted.fa: Reconciliated fasta sorted by descending sizes
  • Hap_[12]/inter_chr.tsv: Table mapping the scaffolds to their chromosomal assignment.
  • Hap_[12]/hap.chr_level.fa: Fasta file with chromosomal level sequences.
  • mashmap_original_fastas.out: Mashmap of the fasta files with chromosomal level sequences of each haplotype.
  • tagged_pairs/orientation.tsv: Table containing the orientation of the sequences in Hap_2 compared to their homolog in Hap1
  • tagged_pairs/reversing_renaming.sak: Instruction file for reverting and renaming the sequences in Hap_2 to match their homolog in Hap_1
  • Hap_2/hap2.reoriented.renamed.fasta: Renamed and reoriented fasta file for Hap_2

Debugging

Most errors should appear in the standard output of the curation_2.0_pipe.sh script, except for the mashmap and the fasta reconciliation steps. If you are unsure of the origin of the error, check the logs:

  • std.{run}.out: Standard output of the run
  • logs/std.agp_to_path_hap[12].{run}.out: Standard output of the fasta reconciliations for the run
  • logs/std.mashmap.{run}.out: Standard output of mashmap for the run

Curation Tips

Before curating:

  1. Decontaminate the haplotypic assemblies.
  2. Modify the names in the decontaminated assemblies; H1.scaffold_1 for hap1 and H2.scaffold_2. The post-scritps are designed to accept this H1 and H2 notation.
  3. Concatenate the assemblies into a single fasta; plot a pretext map.

Curation:

  1. Curate both haplotypes simultaneously. The presence of both haplotypes can be especially useful for identifying sex and microchromosomes, as well as haplotig duplications (mis-phased sequences).
  2. Tags:
    • Create Hap_1 and Hap_2 tags in PretextView. These tags only need to be created once, PretextView will remember them in other curations. In the PretextView menu, click "Meta Data Tags" and type in the two tags as such:

      image image


    • Teasing the haplotypes apart gets a little messy, especially if there are sequences moved between haplotypes (i.e./ a scaffold from Hap_1 assigned to a Hap_2 scaffold or vice versa). The unassigned scaffolds can be sorted by the H1 and H2 notations we added prior to mapping. However, we need to use the Hap_1 and Hap_2 tags we just created to sort the chromosomes. For each chromosome, assign the appropriate haplotype tag to the left most scaffold, as such:

      image image


    • Tag the sex chromosomes as per usual. The current VGP standard is to have a representant of each chromosome in Hap_1. So if the genome is heterogametic (XY or ZW), make sure that any sex chromosomes is tagged with Hap_1. For homogametic genomes (XX or ZZ), leave one sexual chromosome in each haplotype (tags Hap_1 and Hap_2)
    • Tag any unlocalized sequences as "unloc". Place any unloc sequences at the end (right most side) of their chromosomal assignment. These unlocs need to be painted with the chromosome they belong too.
  3. Once done, paint all the scaffolds (from both haplotypes) into chromosomes. The homologs will need to alternate for proper identification. With everything painted, generate your AGP.

FAQ

  1. Why won't my PretextMap open in PretextView?

Hi-res PretextMaps likely require an HPC to generate the map, but will also require a discrete GPU to open the map in PretextView because it requires 16GB of RAM (i./e/ Macbooks with the M1 chip will have this capacity).

  1. Why aren't my unlocalized (unloc) sequences being named correctly?

a. I (at this time) configured the pipeline to process unlocs placed at the end of their respective chromosome assignments. Processing unlocs placed at the beginning of the painted chromosome is more complicated, but is possible - time permitting I will go back and modify this in the future. For now place all unlocs at the right end of their painted chromosome.
b. The unlocs also have to be painted. Double check to make sure they have been painted along with their assigned chromosome.

About

VGL curation pipeline

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Python 78.8%
  • Shell 21.2%