ProcessCurated is a set of tools to process curated genome assemblies. In combination with gfastats and mashmap, tt reconciliate the AGP file manually curated in PretextView to rename, reorient, and sort the assemblies to get them ready for submission.
Prerequisites:
python>=3.9
setuptools
Download the release archive or clone the github repository. And run pip install ./ from the main directory.
Create Conda environnement:
cd vgl-curation
conda create -n ProcessCuration python=3.9 setuptools
conda activate ProcessCuration
pip install ./
Available tools:
- split_agp: Correcting AGP for sequence lengths, splitting the haplotypes from the corrected AGP, assigning unlocs before the agp is imposed on the fasta, and remove haplotig duplications from their origin haplotype.
- chromosome_assignment: Substituting scaffold for chromosome assignments
- sak_generation: Processing Mashmap output for reorientation and renaming of scaffolds in haplotype 2.
Inputs:
-Fasta file of the Assembly with the two haplotypes
-Curated AGP generated with PretextView (See manual in docs for curation tips)
-Output Directory (Optionnal, default ./)
Usage:
split_agp -f <Assembly fasta file> -a <Curated AGP> -o <Output Directory>
Output:
- Corrected AGP file
- AGP file for each haplotype in repositories
<Output Directory>/Hap_1/and<Output Directory>/Hap_2/ - An agp file with the unloc assigned and the duplicated haplotigs removed. (//hap.unlocs.no_hapdups.agp)
- Removed haplotigs (>//haplotigs.agp)
Inputs:
- AGP file with the unloc assigned (-a)
- Sorted fasta file for the haplotype (-f)
- Output directory (-o)
Usage:
chromosome_assignment -a Hap_1/hap.unlocs.no_hapdups.agp -f Hap_1/hap.sorted.fa -o Hap_1
Output:
- Table mapping the scaffolds to their chromosomal assignment (<output_dir>/inter_chr.tsv)
- Fasta file with chromosomal level sequences (<output_dir>/hap.chr_level.fa)
Inputs:
- Table mapping the scaffolds to their chromosomal assignment for haplotype 1 (-1) and haplotype 2 (-2)
- Haplotype used as a query in Mashmap (-q). Possible values (Hap_1 or Hap_2)
- Haplotype used as a reference in Mashmap (-q). Possible values (Hap_1 or Hap_2)
- Corrected AGP output of AGPCorrect (-a)
- Output directory (-o)
Usage:
sak_generation -1 Hap_1/inter_chr.tsv -2 Hap_2/inter_chr.tsv -r Hap_1 -q Hap_2 -a corrected.agp -m mashmap.out -o <Output Directory>>
Output:
- SAK file for gfastats reversing and renaming of Haplotype 2 sequences (<out_dir>/rvcp.sak)
- Table with the orientation of Hap1 vs Hap2 scaffolds (The sign is the most represented in the mashmap alignments between two sequences) (<out_dir>/orientation.tsv)
- The Mapping between hap2 and hap1 names (<out_dir>/hap2.vs.hap1.tsv)
Extra dependencies:
- gfastats>=1.3.10
- MashMap
Inputs:
- Fasta file with the 2 haplotypes ()
- AGP file generated by PretextView after manual curation (see README.md for curation tips) ()
Split AGP and assign unlocs:
split_agp -f <fasta> -a <agp>
Run gfastats to reconciliate the curated agp with the fasta file for each haplotype and sort them by size:
gfastats <fasta> --agp-to-path Hap_1/hap.unlocs.no_hapdups.agp -o Hap_1/hap.unlocs.no_hapdups.fa
gfastats Hap_1/hap.unlocs.no_hapdups.fa --sort largest -o Hap_1/hap.sorted.fa
gfastats <fasta> --agp-to-path Hap_2/hap.unlocs.no_hapdups.agp -o Hap_2/hap.unlocs.no_hapdups.fa
gfastats Hap_2/hap.unlocs.no_hapdups.fa --sort largest -o Hap_2/hap.sorted.fa
Assign chromosome names to scaffolds:
chromosome_assignment -a Hap_1/hap.unlocs.no_hapdups.agp -f Hap_1/hap.sorted.fa -o Hap_1
chromosome_assignment -a Hap_2/hap.unlocs.no_hapdups.agp -f Hap_2/hap.sorted.fa -o Hap_2
Align haplotype 1 versus haplotype 2 with MashMap to detect mis-orientations:
mashmap -r Hap_1/hap.chr_level.fa -q Hap_2/hap.chr_level.fa -f one-to-one -t 16 -s 50000 --pi 90 -o mashmap_original_fastas.out
Rename and reorient the haplotype 2 sequences to match the haplotype 1:
awk '/^>/ {$1 = $1 "_oldname"} {print}' Hap_2/hap.chr_level.fa > Hap_2/tmpnamed.fa # Add temporary suffix to avoid confusion between old and new names during the renaming.
sak_generation -1 Hap_1/inter_chr.tsv -2 Hap_2/inter_chr.tsv -r Hap_1 -q Hap_2 -a corrected.agp -m mashmap_original_fastas.out -s <Heterogametic chromosome> -o tagged_pairs/
gfastats Hap_2/tmpnamed.fa -k tagged_pairs/reversing_renaming.sak -o Hap_2/hap2.partially_renamed.fasta
sed '/^>/ s/_oldname//' Hap_2/hap2.partially_renamed.fasta > Hap_2/hap2.reoriented.renamed.fasta # Remove temporary suffix.
The script is designed to process both haplotypes at once and will separate the haplotype files into two folders, "Hap_1 " and "Hap_2".
sh curation_2.0_pipe.sh -f <haplotype combined fasta> -a <PretextView generated agp> -s <Heterogametic Chromosome> -d
-h help
-f combined haplotype fasta
-a haplotype agp generated from pretextview
-d Test dependencies and install if missing (need conda). (Optional) (Note: This option may fail if you are in a python venv)
Example:
sh curation_2.0_pipe.sh -f rCycPin1.HiC.haps_combined.fasta -a rCycPin1.HiC.haps_combined.pretext.agp -s W -d
corrected.agp: Corrected agpHap_[12]/hap.agp: Split agp for each haplotypeHap_[12]/hap.unlocs.no_hapdups.agp: Processed AGP fileHap_[12]/haplotigs.agp: AGP file containing the haplotig duplicationsHap_[12]/hap.unlocs.no_hapdups.fa: Reconciliated fastaHap_[12]/hap.sorted.fa: Reconciliated fasta sorted by descending sizesHap_[12]/inter_chr.tsv: Table mapping the scaffolds to their chromosomal assignment.Hap_[12]/hap.chr_level.fa: Fasta file with chromosomal level sequences.mashmap_original_fastas.out: Mashmap of the fasta files with chromosomal level sequences of each haplotype.tagged_pairs/orientation.tsv: Table containing the orientation of the sequences in Hap_2 compared to their homolog in Hap1tagged_pairs/reversing_renaming.sak: Instruction file for reverting and renaming the sequences in Hap_2 to match their homolog in Hap_1Hap_2/hap2.reoriented.renamed.fasta: Renamed and reoriented fasta file for Hap_2
Most errors should appear in the standard output of the curation_2.0_pipe.sh script, except for the mashmap and the fasta reconciliation steps. If you are unsure of the origin of the error, check the logs:
std.{run}.out: Standard output of the runlogs/std.agp_to_path_hap[12].{run}.out: Standard output of the fasta reconciliations for the runlogs/std.mashmap.{run}.out: Standard output of mashmap for the run
Before curating:
- Decontaminate the haplotypic assemblies.
- Modify the names in the decontaminated assemblies; H1.scaffold_1 for hap1 and H2.scaffold_2. The post-scritps are designed to accept this H1 and H2 notation.
- Concatenate the assemblies into a single fasta; plot a pretext map.
Curation:
- Curate both haplotypes simultaneously. The presence of both haplotypes can be especially useful for identifying sex and microchromosomes, as well as haplotig duplications (mis-phased sequences).
- Tags:
- Create
Hap_1andHap_2tags in PretextView. These tags only need to be created once, PretextView will remember them in other curations. In the PretextView menu, click "Meta Data Tags" and type in the two tags as such:
- Teasing the haplotypes apart gets a little messy, especially if there are sequences moved between haplotypes (i.e./ a scaffold from Hap_1 assigned to a Hap_2 scaffold or vice versa). The unassigned scaffolds can be sorted by the H1 and H2 notations we added prior to mapping. However, we need to use the
Hap_1andHap_2tags we just created to sort the chromosomes. For each chromosome, assign the appropriate haplotype tag to the left most scaffold, as such:
- Tag the sex chromosomes as per usual. The current VGP standard is to have a representant of each chromosome in Hap_1. So if the genome is heterogametic (XY or ZW), make sure that any sex chromosomes is tagged with
Hap_1. For homogametic genomes (XX or ZZ), leave one sexual chromosome in each haplotype (tagsHap_1andHap_2) - Tag any unlocalized sequences as "unloc". Place any unloc sequences at the end (right most side) of their chromosomal assignment. These unlocs need to be painted with the chromosome they belong too.
- Create
- Once done, paint all the scaffolds (from both haplotypes) into chromosomes. The homologs will need to alternate for proper identification. With everything painted, generate your AGP.
- Why won't my PretextMap open in PretextView?
Hi-res PretextMaps likely require an HPC to generate the map, but will also require a discrete GPU to open the map in PretextView because it requires 16GB of RAM (i./e/ Macbooks with the M1 chip will have this capacity).
- Why aren't my unlocalized (unloc) sequences being named correctly?
a. I (at this time) configured the pipeline to process unlocs placed at the end of their respective chromosome assignments. Processing unlocs placed at the beginning of the painted chromosome is more complicated, but is possible - time permitting I will go back and modify this in the future. For now place all unlocs at the right end of their painted chromosome.
b. The unlocs also have to be painted. Double check to make sure they have been painted along with their assigned chromosome.



