Computer-Assisted Sequence Annotation (CASA)

A centralized tool manager for four scripts used to process and visualize protein sequences.

NOTE: Image from http://doi.org/10.1002/aps3.70009

[Image: figure from http://doi.org/10.1002/aps3.70009]

See below for example(s). If you want to generate publication-quality, annotated protein sequence alignments, CASA might be your tool!

If you find CASA useful, please consider citing our paper: http://doi.org/10.1002/aps3.70009 (And don't forget to cite BLAST and Clustal Omega!)

Example data and figures can be found at the OSF repository: https://osf.io/xnmha/

NOTE: These scripts make use of EMBL-EBI and NCBI resources. References for tools and databases used here include:

UniProt:
The UniProt Consortium.
“UniProt: The Universal Protein Knowledgebase in 2023.”
Nucleic Acids Research 51, no. D1 (January 6, 2023): D523–31. https://doi.org/10.1093/nar/gkac1052.

NCBI:
Sayers, Eric W, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, et al.
“Database Resources of the National Center for Biotechnology Information.”
Nucleic Acids Research 50, no. D1 (December 1, 2021): D20–26. https://doi.org/10.1093/nar/gkab1112.

Clustal Omega:
Sievers, Fabian, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, et al.
“Fast, Scalable Generation of High‐quality Protein Multiple Sequence Alignments Using Clustal Omega.”
Molecular Systems Biology 7, no. 1 (January 2011): 539. https://doi.org/10.1038/msb.2011.75.

Sievers, Fabian, and Desmond G. Higgins.
“Clustal Omega for Making Accurate Alignments of Many Protein Sequences.”
Protein Science: A Publication of the Protein Society 27, no. 1 (January 2018): 135–45. https://doi.org/10.1002/pro.3290.

BLAST+:
Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden.
“BLAST+: Architecture and Applications.”
BMC Bioinformatics 10, no. 1 (December 2009): 1–9. https://doi.org/10.1186/1471-2105-10-421.

https://blast.ncbi.nlm.nih.gov/doc/blast-help/references.html#references

CASA SVG Example

NOTE:

  • This PNG was cropped and saved with a white background.
  • Unedited CASA SVG outputs have transparent backgrounds and start with a 7.5in x 9in page size.
  • Original SVG taken from our CASA OSF repository: https://osf.io/xnmha/

[Image: example CASA alignment (cropped PNG of an SVG output)]

Prerequisites

General:

  • Internet connection (when running search_proteins.py and retrieve_annotations.py)
  • Python 3.7+
  • ~1 GB of storage for Swiss-Prot download and/or creation of BLAST database (if running search_proteins.py)

Command-line tools:

  • Clustal Omega (for alignment.py)
  • NCBI BLAST+ (for search_proteins.py)

Python Libraries:

  • pandas
  • requests
  • session_info

Installing

Before running, ensure that required command-line tools are on your PATH.

  • Clustal Omega is required for alignment.py
  • NCBI BLAST+ is required for search_proteins.py
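
A quick way to confirm the PATH requirement above is to look the executables up from Python. This is a minimal sketch, not part of CASA: the executable names (clustalo, blastp, makeblastdb) are taken from the help text further down in this README, and the REQUIRED_TOOLS mapping and missing_tools function are hypothetical names used only for illustration.

```python
import shutil

# Executable names as they appear in this README's help text
# ("clustalo -h", "blastp -h", makeblastdb). The mapping itself is
# an illustrative assumption, not something CASA defines.
REQUIRED_TOOLS = {
    "alignment.py": ["clustalo"],
    "search_proteins.py": ["blastp", "makeblastdb"],
}


def missing_tools(required=REQUIRED_TOOLS):
    """Return {script: [executables not found on PATH]} for scripts with gaps."""
    missing = {}
    for script, tools in required.items():
        absent = [tool for tool in tools if shutil.which(tool) is None]
        if absent:
            missing[script] = absent
    return missing


if __name__ == "__main__":
    for script, tools in missing_tools().items():
        print(f"{script}: not found on PATH: {', '.join(tools)}")
```

If the script prints nothing, every executable listed in the mapping was found on PATH.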

Download CASA, then add the CASA directory to both PATH and PYTHONPATH.

Other installation recommendations and protocols can be found in CASA/extra_protocols.pdf

SCRIPTS

CASA.py

Runs one or more of the below scripts in the order given.

INPUT: Depends on which tools are run. The input must be acceptable to the first tool in the run. Runs that start with annotate (retrieve_annotations.py) may be further limited by what downstream tools accept. Example: --order annotate svg requires a .clustal or .clustal_num file as input, because clustal_to_svg.py cannot read a .fasta file.

OUTPUT: Depends on which tools are run. Each tool in a run produces its own output. In a run that includes search_proteins.py, outputs are split into separate directories.

NOTE: Only the following tool orders are currently supported (inputs and outputs vary with the first and last tool):

  • blast annotate align svg
  • blast annotate align
  • blast annotate
  • blast align annotate svg
  • blast align annotate
  • blast align svg
  • blast align
  • blast
  • annotate align svg
  • annotate align
  • annotate svg
  • annotate
  • align annotate svg
  • align annotate
  • align svg
  • align
  • svg
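
The list above can be encoded as a lookup so that a planned --order value is checked before a long run starts. This is only a sketch built from the list in this README. VALID_ORDERS and is_supported_order are hypothetical names, and CASA.py may validate orders differently.

```python
# Every order listed in this README, one tuple per bullet.
VALID_ORDERS = {
    ("blast", "annotate", "align", "svg"),
    ("blast", "annotate", "align"),
    ("blast", "annotate"),
    ("blast", "align", "annotate", "svg"),
    ("blast", "align", "annotate"),
    ("blast", "align", "svg"),
    ("blast", "align"),
    ("blast",),
    ("annotate", "align", "svg"),
    ("annotate", "align"),
    ("annotate", "svg"),
    ("annotate",),
    ("align", "annotate", "svg"),
    ("align", "annotate"),
    ("align", "svg"),
    ("align",),
    ("svg",),
}


def is_supported_order(order):
    """Return True if the sequence of tool names matches a listed order."""
    return tuple(order) in VALID_ORDERS


if __name__ == "__main__":
    print(is_supported_order(["annotate", "svg"]))  # listed above
    print(is_supported_order(["svg", "align"]))     # not listed above
```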

Example: python -m CASA -i ./alignment.clustal -o ./svg_folder/ -ord annotate svg -u TRUE -c FALSE -n TRUE

Usage:

usage: CASA.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-ord ORDER [ORDER ...]] [-s STYPE] [-nr NUM_RES] [-t TITLE]
               [-c CODES] [-n NUMS] [-u UNIPROT_FORMAT] [-a ANNOTATIONS] [-f FEATURES] [-db DATABASE]
               [-bopts BLAST_OPTIONS] [-copts CLUSTAL_OPTIONS]

CASA Tool Manager

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Full path of input file
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Full path of output directory.
  -ord ORDER [ORDER ...], --order ORDER [ORDER ...]
                        Order of tools to run (blast, annotate, align, svg).Ex. --order align svg
  -s STYPE, --stype STYPE
                        Sequence type ("protein" is currently the only option).
  -nr NUM_RES, --num_res NUM_RES
                        Number of results.
  -t TITLE, --title TITLE
                        Alignment title ([TITLE].clustal, [TITLE].pim).
  -c CODES, --codes CODES
  -n NUMS, --nums NUMS  When set to TRUE, includes total residue numbers at the end of each line.
  -u UNIPROT_FORMAT, --uniprot_format UNIPROT_FORMAT
                        When set to TRUE, truncates all accessions as if they were UniProt entries. Ex.
                        sp|P00784|PAPA1_CARPA -> PAPA1_CARPA
  -a ANNOTATIONS, --annotations ANNOTATIONS
                        If an annotation file is provided, it will be used to annotate the resulting SVG files.
  -f FEATURES, --features FEATURES
                        A comma-separated list of feature:color pairs to include in SVGs.Case sensitive. If features
                        include spaces, the list must be enclosed in quotes.If no features should be included, use: -f
                        NoneThe following example is default behavior. Ex. -f "Active site:#0000ff,Disulfide
                        bond:#e27441,Propeptide:#9e00f2,Signal:#2b7441"
  -db DATABASE, --database DATABASE
                        (optional) Full file path to a protein FASTA file that can be used as a BLAST database.
                        makeblastdb will be run on this file if no BLAST database exists.
  -bopts BLAST_OPTIONS, --blast_options BLAST_OPTIONS
                        (optional) Bracketed, comma-separated list of valid blastp input parameters. Valid arguments
                        can be shown via CLI with "blastp -h". File locations (like -import_search_strategy) MUST be
                        full file paths (not relative). Example Usage: -bopts "[-threshold 0,-sorthits 4,-max_hsps 1]"
  -copts CLUSTAL_OPTIONS, --clustal_options CLUSTAL_OPTIONS
                        (optional) Bracketed, comma-separated list of valid clustalo input parameters. Valid arguments
                        can be shown via CLI with "clustalo -h".File locations (like --hmm-in) MUST be full file paths
                        (not relative). Example Usage: -copts "[--residuenumber,--iterations 3]"

search_proteins.py

Takes one or more protein sequences (FASTA format) as input and BLASTs them against a given database.
If no database is provided via the -db option, the current Swiss-Prot release (uniprotkb_refprotswissprot) will be downloaded and used.

Requires ~1 GB of storage for Swiss-Prot download and/or creation of BLAST database.

Swiss-Prot download information: Current Swiss-Prot release is verified/downloaded from:

  • ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/

Swiss-Prot database and associated files are stored in the same directory as this script.

INPUT: FASTA-formatted file with at least one sequence.

OUTPUT: A set of directories - one for each sequence in the original input file - that contain the following:

  • the BLAST results for that sequence (the query) against the current Swiss-Prot release in table form ([QUERY].tsv),
  • individual FASTA files with UniProt sequences for each BLAST hit,
  • one FASTA file containing all protein sequences, including the query sequence (all.fasta).
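
To work with these outputs afterwards, the per-query all.fasta files can be collected with a short script. This is a sketch under one assumption that this README does not confirm: that each query directory sits directly inside the output directory. The find_all_fasta name is hypothetical.

```python
from pathlib import Path


def find_all_fasta(out_directory):
    """Return the all.fasta paths found one level below out_directory.

    Assumes one sub-directory per query directly under out_directory.
    """
    return sorted(Path(out_directory).glob("*/all.fasta"))


if __name__ == "__main__":
    for fasta in find_all_fasta("./unknown_protein_folder/"):
        print(fasta.parent.name, "->", fasta)
```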

NOTE: --stype dna is not currently supported and may cause an infinite loop. Do not use --stype dna until this is updated.

If used at the beginning of a multi-step run, downstream commands will be run on each resulting collection of outputs.

For example: search_proteins.py will yield multiple all.fasta files (one for each query), each of which can be sent to both retrieve_annotations.py and alignment.py. This is why "blast annotate align" is a valid value for the --order optional argument.

Example from CASA.py:

  python -m CASA -i ./unknown_proteins.fasta -o ./unknown_protein_folder/ -ord blast align -s protein -nr 3

Example standalone:

  python -m search_proteins -i ./unknown_proteins.fasta -o ./ -s protein -nr 5

Usage:

usage: UniProt BLAST script [-h] [-i INFILE] [-o OUT_DIRECTORY] [-s STYPE] [-nr NUM_RES] [-db DATABASE]
                            [-bopts BLAST_OPTIONS]

BLASTs FASTA sequences against a given BLAST database.

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Full path of input file.
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Full path of output directory.
  -s STYPE, --stype STYPE
                        Sequence type ("protein" is currently the only option).
  -nr NUM_RES, --num_res NUM_RES
                        (optional) Number of results.
  -db DATABASE, --database DATABASE
                        (optional) Full file path to a protein FASTA file that can be used as a BLAST database.
                        makeblastdb will be run on this file if no BLAST database exists.
  -bopts BLAST_OPTIONS, --blast_options BLAST_OPTIONS
                        (optional) Bracketed, comma-separated list of valid blastp input parameters. Valid arguments
                        can be shown via CLI with "blastp -h". File locations (like -import_search_strategy) MUST be
                        full file paths (not relative). Example Usage: -bopts "[-threshold 0,-sorthits 4,-max_hsps 1]"
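
The bracketed, comma-separated format used by -bopts (and by -copts in alignment.py) can be illustrated with a small parser. This is a sketch of the format as described in the help text, not CASA's own parsing code. parse_bracketed_options is a hypothetical name, and this naive version would split incorrectly on values that themselves contain commas or spaces, such as some file paths.

```python
def parse_bracketed_options(text):
    """Split "[-a 1,-b 2]" into a flat argument list ["-a", "1", "-b", "2"]."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("options must be enclosed in square brackets")
    args = []
    for item in text[1:-1].split(","):
        # Each item is a flag optionally followed by its value.
        args.extend(item.split())
    return args


if __name__ == "__main__":
    print(parse_bracketed_options("[-threshold 0,-sorthits 4,-max_hsps 1]"))
    print(parse_bracketed_options("[--residuenumber,--iterations 3]"))
```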

retrieve_annotations.py

Takes one or more protein sequences (FASTA format) as input and retrieves annotations for sequences whose IDs exist in UniProt.

INPUT: A FASTA-formatted, CLUSTAL_NUM-formatted, or CLUSTAL-formatted file with at least one protein sequence.

OUTPUT: A collection of files that includes the following:

  • individual annotation files (.ann), one for each unique sequence in the input file,
  • one combined annotation file that includes all annotations for this collection of sequences (all.ann).

NOTE: all.ann can be used as input for clustal_to_svg.py. See ANNOTATION FORMAT for help formatting annotations by hand.

Example from CASA.py:

python -m CASA -i ./uniprot_proteins.fasta -o ./annotations_folder/ -ord annotate align svg -s protein -u TRUE -c FALSE -n FALSE

Example standalone:

python -m retrieve_annotations -i ./uniprot_proteins.fasta -o ./annotations_folder/

Usage:

usage: retrieve_annotations.py [-h] [-i INFILE] [-o OUT_DIRECTORY]

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Full path of input file.
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Full path of output directory.

alignment.py

Takes at least two protein sequences as input and aligns them using Clustal Omega.

INPUT: A FASTA-formatted file with at least two sequences.

OUTPUT: An alignment (.clustal) of the given FASTA file.
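
Because the input needs at least two sequences, counting FASTA header lines before starting is a cheap sanity check. This is a minimal sketch and not part of alignment.py. count_fasta_records is a hypothetical name, and it only counts lines that start with ">" without validating the sequences.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__":
    # Placeholder records for illustration only.
    example = """>seq_one
MKTAYIAKQR
>seq_two
MKTAYLAKQR
"""
    n = count_fasta_records(example.splitlines())
    print(f"{n} sequences; enough to align: {n >= 2}")
```

For a real file, pass the open file handle instead of a list of lines, for example count_fasta_records(open("three_proteins.fasta")).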

Example from CASA.py:

python -m CASA -i ./three_proteins.fasta -o ./alignments_folder/ -ord annotate align -s protein -t three_proteins

Example standalone:

python -m alignment -i ./three_proteins.fasta -o ./alignments_folder/ -s protein -t three_proteins

Usage:

usage: alignment.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-s STYPE] [-t TITLE] [-copts CLUSTAL_OPTIONS]

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Full path of input file.
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Full path of output directory. Must end with "/".
  -s STYPE, --stype STYPE
                        Sequence type ("protein" or "dna").
  -t TITLE, --title TITLE
                        Alignment title ([TITLE].clustal, [TITLE].pim).
  -copts CLUSTAL_OPTIONS, --clustal_options CLUSTAL_OPTIONS
                        (optional) Bracketed, comma-separated list of valid clustalo input parameters. Valid arguments
                        can be shown via CLI with "clustalo -h".File locations (like --hmm-in) MUST be full file paths
                        (not relative). Example Usage: -copts "[--residuenumber,--iterations 3]"

clustal_to_svg.py

Reformats a .clustal_num or .clustal alignment into an editable Inkscape SVG. Currently annotates conserved residues (automatic, not optional) and a given list of features (optional).

INPUT: A CLUSTAL or CLUSTAL_NUM file.

OUTPUT: A sequential set of SVGs (.svg), numbered 0, 1, 2, etc., with formatted alignments and associated conserved residues and/or annotations.

NOTE: These SVG outputs were designed to be edited in Inkscape. They retain full functionality when opened in Inkscape, including multiline text-box editability. They retain some functionality when opened in other SVG viewers/editors. They can be viewed and edited in Illustrator, but do not retain multiline text-box editability (each letter is its own text-box). They can also be viewed in browsers like Chrome.

Example from CASA.py:

python -m CASA -i ./proteins.fasta -o ./SVGs/ -ord align svg -s protein -u TRUE -c FALSE -n FALSE -a annotations.ann -f "Active site:blue,Propeptide:#000000"

Example standalone:

python -m clustal_to_svg -i ./alignment.clustal -o ./SVGs/ -u TRUE -c FALSE -n FALSE -a annotations.ann -f "Active site:blue,Propeptide:#000000"

Usage:

usage: clustal_to_svg.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-c CODES] [-n NUMS] [-u UNIPROT_FORMAT] [-a ANNOTATIONS]
                         [-f FEATURES]

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Full path of output directory.
  -c CODES, --codes CODES
                        When set to TRUE, includes Clustal identity codes at the bottom of each block.
  -n NUMS, --nums NUMS  When set to TRUE, includes total residue numbers at the end of each line.
  -u UNIPROT_FORMAT, --uniprot_format UNIPROT_FORMAT
                        When set to TRUE, truncates all accessions as if they were UniProt entries. Ex.
                        sp|P00784|PAPA1_CARPA -> PAPA1_CARPA
  -a ANNOTATIONS, --annotations ANNOTATIONS
                        If an annotation file is provided, it will be used to annotate the resulting SVG files.
  -f FEATURES, --features FEATURES
                        A comma-separated list of feature:color pairs to include in SVGs.Case sensitive. If features
                        include spaces, the list must be enclosed in quotes.If no features should be included, use: -f
                        NoneThe following example is default behavior. Ex. -f "Active site:#0000ff,Disulfide
                        bond:#e27441,Propeptide:#9e00f2,Signal:#2b7441"
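
The truncation described for -u can be pictured with the example from the help text. This is a sketch, not the code in clustal_to_svg.py. truncate_accession is a hypothetical name, and leaving names without exactly three "|"-separated fields unchanged is an assumption, since this README does not say how such names are handled.

```python
def truncate_accession(accession):
    """Return the entry name from a db|ACCESSION|ENTRY_NAME identifier.

    Identifiers that do not have exactly three '|'-separated fields are
    returned unchanged (an assumption made for this sketch).
    """
    parts = accession.split("|")
    return parts[2] if len(parts) == 3 else accession


if __name__ == "__main__":
    print(truncate_accession("sp|P00784|PAPA1_CARPA"))  # PAPA1_CARPA
```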

ANNOTATION FORMAT

Annotations can be added to an SVG with the -a or --annotations option in a run that calls clustal_to_svg.py.

A truncated but real example of a valid annotation file can be found in annotation_example.ann.

NOTE: If no annotation file is explicitly provided and retrieve_annotations.py is called during the run, clustal_to_svg.py will instead use the "all.ann" annotation file produced by retrieve_annotations.py. Annotation files provided via -a or --annotations override those retrieved by retrieve_annotations.py.

NOTE: The -f or --features option was added to clustal_to_svg.py on 2024.10.30. It lets users provide a comma-separated list of feature:color pairs to customize SVG annotations. Default behavior is identical to including the following option in a run that calls clustal_to_svg.py:

-f "Active site:#0000ff,Disulfide bond:#e27441,Propeptide:#9e00f2,Signal:#2b7441"
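
The feature:color format can be read as a mapping from feature name to color. The sketch below parses the default string shown above. It is illustrative only: parse_features is a hypothetical name, the real parsing in clustal_to_svg.py may differ, and the "None" handling follows the help text for -f.

```python
DEFAULT_FEATURES = (
    "Active site:#0000ff,Disulfide bond:#e27441,"
    "Propeptide:#9e00f2,Signal:#2b7441"
)


def parse_features(text):
    """Turn "Name:color,Name:color" into {name: color}; "None" gives {}."""
    if text == "None":
        return {}
    features = {}
    for pair in text.split(","):
        # Split on the last colon so the color is always the final field.
        name, sep, color = pair.rpartition(":")
        if not sep or not name or not color:
            raise ValueError(f"expected feature:color, got {pair!r}")
        features[name] = color
    return features


if __name__ == "__main__":
    for name, color in parse_features(DEFAULT_FEATURES).items():
        print(f"{name}: {color}")
```

Feature names are kept exactly as typed, which matches the help text's note that the list is case sensitive.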

Format:

Annotation files that include the following columns and VALUES (tab-delimited) can be used as inputs for clustal_to_svg.py:

	prot	whole_prot	type	location.start.value	location.end.value
ARBITRARY_INDEX	UNIPROT_FORMAT_ACC	FULL_ACCESSION	ANNOTATION_TYPE	START	END

Other columns, like "description", may be added for record-keeping, but they will not be used when adding annotations to SVGs.

Example:

	prot	whole_prot	type	location.start.value	location.end.value
0	PAPA1_CARPA	sp|P00784|PAPA1_CARPA	Active site	158	158
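
For annotations written by hand, a file in this layout can be produced with Python's csv module. This is a sketch that reproduces the example row above. write_annotations is a hypothetical name, and deriving the prot column from the last "|"-separated field of the full accession is an assumption based on the -u example in this README.

```python
import csv
import io

# Header from the format above; the first column name is empty (index column).
COLUMNS = ["", "prot", "whole_prot", "type",
           "location.start.value", "location.end.value"]


def write_annotations(rows, handle):
    """Write (whole_prot, type, start, end) tuples as tab-delimited lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for index, (whole_prot, feature_type, start, end) in enumerate(rows):
        prot = whole_prot.split("|")[-1]  # assumed UniProt-style truncation
        writer.writerow([index, prot, whole_prot, feature_type, start, end])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_annotations([("sp|P00784|PAPA1_CARPA", "Active site", 158, 158)], buffer)
    print(buffer.getvalue(), end="")
```

To write a file instead of printing, pass a handle opened with open("my_annotations.ann", "w", newline="").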
