A centralized tool manager for four scripts used to process and visualize protein sequences.
NOTE: Image from http://doi.org/10.1002/aps3.70009
See below for example(s). If you want to generate publication-quality, annotated protein sequence alignments, CASA might be your tool!
If you find CASA useful, please consider citing our paper: http://doi.org/10.1002/aps3.70009 (And don't forget to cite BLAST and Clustal Omega!)
Example data and figures can be found at the OSF repository: https://osf.io/xnmha/
NOTE: These scripts make use of EMBL-EBI and NCBI resources. References for tools and databases used here include:
UniProt:
The UniProt Consortium.
“UniProt: The Universal Protein Knowledgebase in 2023.”
Nucleic Acids Research 51, no. D1 (January 6, 2023): D523–31. https://doi.org/10.1093/nar/gkac1052.
NCBI:
Sayers, Eric W, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, et al.
“Database Resources of the National Center for Biotechnology Information.”
Nucleic Acids Research 50, no. D1 (December 1, 2021): D20–26. https://doi.org/10.1093/nar/gkab1112.
Clustal Omega:
Sievers, Fabian, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, et al.
“Fast, Scalable Generation of High‐quality Protein Multiple Sequence Alignments Using Clustal Omega.”
Molecular Systems Biology 7, no. 1 (January 2011): 539. https://doi.org/10.1038/msb.2011.75.
Sievers, Fabian, and Desmond G. Higgins.
“Clustal Omega for Making Accurate Alignments of Many Protein Sequences.”
Protein Science: A Publication of the Protein Society 27, no. 1 (January 2018): 135–45. https://doi.org/10.1002/pro.3290.
BLAST+:
Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden.
“BLAST+: Architecture and Applications.”
BMC Bioinformatics 10, no. 1 (December 2009): 1–9. https://doi.org/10.1186/1471-2105-10-421.
https://blast.ncbi.nlm.nih.gov/doc/blast-help/references.html#references
NOTE:
- This PNG was cropped and saved with a white background.
- Unedited CASA SVG outputs have transparent backgrounds and start with a 7.5in x 9in page size.
- Original SVG taken from our CASA OSF repository: https://osf.io/xnmha/
General:
- Internet connection (when running search_proteins.py and retrieve_annotations.py)
- Python 3.7+
- ~ 1 GB of storage for Swiss-Prot download and/or creation of BLAST database (if running search_proteins.py)
Command-line tools:
- Clustal Omega (http://www.clustal.org/omega/)
- UPDATE: clustal.org seems to have expired. Please download source code from https://github.com/FabianSievers/clustal-omega
- Windows users who intend to run Clustal Omega should use a Linux environment (such as WSL2) for now.
- NCBI BLAST+ (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
Python Libraries:
- pandas
- requests
- session_info
Before running, ensure that required command-line tools are on your PATH.
- Clustal Omega is required for alignment.py
- NCBI BLAST+ is required for search_proteins.py
Download and add the CASA directory to PATH and PYTHONPATH.
Other installation recommendations and protocols can be found in CASA/extra_protocols.pdf
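A quick way to confirm the PATH requirement above is to look the executables up from Python. This is a minimal sketch, not part of CASA: the executable names (clustalo, blastp, makeblastdb) are assumptions based on the standard Clustal Omega and BLAST+ distributions, and find_missing_tools is a hypothetical helper.

```python
import shutil

# Assumption: these are the executable names installed by the standard
# Clustal Omega and BLAST+ distributions. Adjust if your install differs.
REQUIRED_TOOLS = {
    "alignment.py": ["clustalo"],
    "search_proteins.py": ["blastp", "makeblastdb"],
}


def find_missing_tools(required=REQUIRED_TOOLS):
    """Return {script: [executables not found on PATH]}."""
    missing = {}
    for script, tools in required.items():
        absent = [tool for tool in tools if shutil.which(tool) is None]
        if absent:
            missing[script] = absent
    return missing


if __name__ == "__main__":
    for script, tools in find_missing_tools().items():
        print(f"{script}: not found on PATH: {', '.join(tools)}")
```

An empty result means every listed executable was found on PATH.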
Runs one or more of the below scripts in the order given.
INPUT: Depends on which tool(s) are being executed. Should be an acceptable input of the first tool executed. Inputs for runs that start with annotate.py may additionally be limited by what can be passed to downstream tools. Example: --order annotate svg -> must use .clustal or .clustal_num file as input (.fasta file cannot be used by clustal_to_svg.py)
OUTPUT: Depends on which tool(s) are being executed. Each tool will have its own output if it is included in a run. All outputs will be split into separate directories in a run that includes search_proteins.py.
NOTE: There are currently limited ways to run multiple tools at once (inputs and outputs will vary depending on start and end):
- blast annotate align svg
- blast annotate align
- blast annotate
- blast align annotate svg
- blast align annotate
- blast align svg
- blast align
- blast
- annotate align svg
- annotate align
- annotate svg
- annotate
- align annotate svg
- align annotate
- align svg
- align
- svg
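The list above can be checked programmatically before starting a long run. The following is a sketch under the assumption that the 17 combinations listed here are exhaustive; VALID_ORDERS and is_valid_order are hypothetical names and do not reflect how CASA.py validates --order internally.

```python
# The 17 tool orders listed in this README, copied verbatim.
VALID_ORDERS = {
    tuple(order.split())
    for order in (
        "blast annotate align svg",
        "blast annotate align",
        "blast annotate",
        "blast align annotate svg",
        "blast align annotate",
        "blast align svg",
        "blast align",
        "blast",
        "annotate align svg",
        "annotate align",
        "annotate svg",
        "annotate",
        "align annotate svg",
        "align annotate",
        "align svg",
        "align",
        "svg",
    )
}


def is_valid_order(order):
    """Return True if the sequence of tool names is one of the listed orders."""
    return tuple(order) in VALID_ORDERS


if __name__ == "__main__":
    print(is_valid_order(["annotate", "svg"]))
    print(is_valid_order(["svg", "align"]))
```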
Example: python -m CASA -i ./alignment.clustal -o ./svg_folder/ -ord annotate svg -u TRUE -c FALSE -n TRUE
Usage:
usage: CASA.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-ord ORDER [ORDER ...]] [-s STYPE] [-nr NUM_RES] [-t TITLE]
[-c CODES] [-n NUMS] [-u UNIPROT_FORMAT] [-a ANNOTATIONS] [-f FEATURES] [-db DATABASE]
[-bopts BLAST_OPTIONS] [-copts CLUSTAL_OPTIONS]
CASA Tool Manager
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory.
-ord ORDER [ORDER ...], --order ORDER [ORDER ...]
Order of tools to run (blast, annotate, align, svg). Ex. --order align svg
-s STYPE, --stype STYPE
Sequence type ("protein" is currently the only option).
-nr NUM_RES, --num_res NUM_RES
Number of results.
-t TITLE, --title TITLE
Alignment title ([TITLE].clustal, [TITLE].pim).
-c CODES, --codes CODES
When set to TRUE, includes Clustal identity codes at the bottom of each block.
-n NUMS, --nums NUMS When set to TRUE, includes total residue numbers at the end of each line.
-u UNIPROT_FORMAT, --uniprot_format UNIPROT_FORMAT
When set to TRUE, truncates all accessions as if they were UniProt entries. Ex.
sp|P00784|PAPA1_CARPA -> PAPA1_CARPA
-a ANNOTATIONS, --annotations ANNOTATIONS
If an annotation file is provided, it will be used to annotate the resulting SVG files.
-f FEATURES, --features FEATURES
A comma-separated list of feature:color pairs to include in SVGs. Case sensitive. If features
include spaces, the list must be enclosed in quotes. If no features should be included, use "-f None".
The following example is default behavior. Ex.
-f "Active site:#0000ff,Disulfide bond:#e27441,Propeptide:#9e00f2,Signal:#2b7441"
-db DATABASE, --database DATABASE
(optional) Full file path to a protein FASTA file that can be used as a BLAST database.
makeblastdb will be run on this file if no BLAST database exists.
-bopts BLAST_OPTIONS, --blast_options BLAST_OPTIONS
(optional) Bracketed, comma-separated list of valid blastp input parameters. Valid arguments
can be shown via CLI with "blastp -h". File locations (like -import_search_strategy) MUST be
full file paths (not relative). Example Usage: -bopts "[-threshold 0,-sorthits 4,-max_hsps 1]"
-copts CLUSTAL_OPTIONS, --clustal_options CLUSTAL_OPTIONS
(optional) Bracketed, comma-separated list of valid clustalo input parameters. Valid arguments
can be shown via CLI with "clustalo -h". File locations (like --hmm-in) MUST be full file paths
(not relative). Example Usage: -copts "[--residuenumber,--iterations 3]"
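The bracketed, comma-separated format used by -bopts and -copts can be illustrated with a small parser. This is only a sketch of how such a string could be split into command-line tokens; parse_bracketed_options is a hypothetical helper, and CASA's own parsing may differ (for instance in how it treats values that contain commas or spaces, which this sketch does not handle).

```python
def parse_bracketed_options(text):
    """Turn '[-a 1,-b 2]' into ['-a', '1', '-b', '2'].

    Limitation of this sketch: values containing commas or spaces
    (for example, file paths with spaces) are not supported.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("options must be enclosed in square brackets")
    tokens = []
    # Each comma-separated item is one option, optionally followed by a value.
    for item in text[1:-1].split(","):
        tokens.extend(item.split())
    return tokens


if __name__ == "__main__":
    print(parse_bracketed_options("[-threshold 0,-sorthits 4,-max_hsps 1]"))
    print(parse_bracketed_options("[--residuenumber,--iterations 3]"))
```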
Takes one or more protein sequences (FASTA format) as input and BLASTs them against a given database.
If no database is provided via the -db option, the current Swiss-Prot release (uniprotkb_refprotswissprot) will be downloaded and used.
Requires ~1 GB of storage for Swiss-Prot download and/or creation of BLAST database.
Swiss-Prot download information: Current Swiss-Prot release is verified/downloaded from:
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/
Swiss-Prot database and associated files are stored in the same directory as this script.
INPUT: FASTA-formatted file with at least one sequence.
OUTPUT: A set of directories - one for each sequence in the original input file - that contain the following:
- the BLAST results for that sequence (the query) against the current Swiss-Prot release in table form ([QUERY].tsv),
- individual FASTA files with UniProt sequences for each BLAST hit,
- one FASTA file containing all protein sequences, including the query sequence (all.fasta).
NOTE: --stype dna is not currently supported and may cause search_proteins.py to enter an infinite loop. Please do not use --stype dna until this is updated.
If used at the beginning of a multi-step run, downstream commands will be run on each resulting collection of outputs.
For example: search_proteins.py will yield multiple all.fasta (one for each query), which can be sent to both retrieve_annotations.py and alignment.py. This is why "blast annotate align" is a valid input for the --order optional argument.
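If you want to pass each per-query collection to a downstream tool yourself, the combined FASTA files can be collected with a short script. This sketch assumes only what is stated above, namely that each query directory contains a file named all.fasta; find_combined_fastas is a hypothetical helper, and the exact directory names are not assumed.

```python
from pathlib import Path


def find_combined_fastas(out_directory):
    """Return the sorted paths of every all.fasta below out_directory."""
    return sorted(Path(out_directory).rglob("all.fasta"))


if __name__ == "__main__":
    # Hypothetical output directory from an earlier search_proteins.py run.
    for fasta in find_combined_fastas("./unknown_protein_folder/"):
        print(fasta)
```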
Example from CASA.py:
python -m CASA -i ./unknown_proteins.fasta -o ./unknown_protein_folder/ -ord blast align -s protein -nr 3
Example standalone:
python -m search_proteins -i ./unknown_proteins.fasta -o ./ -s protein -nr 5
Usage:
usage: UniProt BLAST script [-h] [-i INFILE] [-o OUT_DIRECTORY] [-s STYPE] [-nr NUM_RES] [-db DATABASE]
[-bopts BLAST_OPTIONS]
BLASTs FASTA sequences against a given BLAST database.
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory.
-s STYPE, --stype STYPE
Sequence type ("protein" is currently the only option).
-nr NUM_RES, --num_res NUM_RES
(optional) Number of results.
-db DATABASE, --database DATABASE
(optional) Full file path to a protein FASTA file that can be used as a BLAST database.
makeblastdb will be run on this file if no BLAST database exists.
-bopts BLAST_OPTIONS, --blast_options BLAST_OPTIONS
(optional) Bracketed, comma-separated list of valid blastp input parameters. Valid arguments
can be shown via CLI with "blastp -h". File locations (like -import_search_strategy) MUST be
full file paths (not relative). Example Usage: -bopts "[-threshold 0,-sorthits 4,-max_hsps 1]"
Takes one or more protein sequences (FASTA format) as input and retrieves annotations for sequences whose IDs exist in UniProt.
INPUT: A FASTA-formatted, CLUSTAL_NUM-formatted, or CLUSTAL-formatted file with at least one protein sequence.
OUTPUT: A collection of files that includes the following:
- individual annotation files (.ann), one for each unique sequence in the input file,
- one combined annotation file that includes all annotations for this collection of sequences (all.ann).
NOTE: all.ann can be used as input for clustal_to_svg.py. See ANNOTATION FORMAT for help formatting annotations by hand.
Example from CASA.py:
python -m CASA -i ./uniprot_proteins.fasta -o ./annotations_folder/ -ord annotate align svg -s protein -u TRUE -c FALSE -n FALSE
Example standalone:
python -m retrieve_annotations -i ./uniprot_proteins.fasta -o ./annotations_folder/
Usage:
usage: retrieve_annotations.py [-h] [-i INFILE] [-o OUT_DIRECTORY]
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory.
Takes at least two protein sequences as input and aligns them using Clustal Omega.
INPUT: A FASTA-formatted file with at least two sequences.
OUTPUT: An alignment (.clustal) of the given FASTA file.
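Because the input must contain at least two sequences, it can be worth counting records before calling the aligner. The following is a minimal sketch that counts header lines (lines starting with ">"); count_fasta_records and has_enough_sequences are hypothetical helpers, not functions provided by alignment.py.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def has_enough_sequences(path, minimum=2):
    """Return True if the FASTA file holds at least `minimum` records."""
    return count_fasta_records(path) >= minimum
```

For example, has_enough_sequences("./three_proteins.fasta") should be True for a file with three records.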
Example from CASA.py:
python -m CASA -i ./three_proteins.fasta -o ./alignments_folder/ -ord annotate align -s protein -t three_proteins
Example standalone:
python -m alignment -i ./three_proteins.fasta -o ./alignments_folder/ -s protein -t three_proteins
Usage:
usage: alignment.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-s STYPE] [-t TITLE] [-copts CLUSTAL_OPTIONS]
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory. Must end with "/".
-s STYPE, --stype STYPE
Sequence type ("protein" or "dna").
-t TITLE, --title TITLE
Alignment title ([TITLE].clustal, [TITLE].pim).
-copts CLUSTAL_OPTIONS, --clustal_options CLUSTAL_OPTIONS
(optional) Bracketed, comma-separated list of valid clustalo input parameters. Valid arguments
can be shown via CLI with "clustalo -h". File locations (like --hmm-in) MUST be full file paths
(not relative). Example Usage: -copts "[--residuenumber,--iterations 3]"
Reformats a .clustal_num or .clustal alignment into an editable Inkscape SVG. Currently annotates conserved residues (automatic, not optional) and a given list of features (optional).
INPUT: A CLUSTAL or CLUSTAL_NUM file.
OUTPUT: A sequential set of SVGs (.svg), numbered 0, 1, 2, etc., with formatted alignments and associated conserved residues and/or annotations.
NOTE: These SVG outputs were designed to be edited in Inkscape. They retain full functionality when opened in Inkscape, including multiline text-box editability. They retain some functionality when opened in other SVG viewers/editors. They can be viewed and edited in Illustrator, but do not retain multiline text-box editability (each letter is its own text-box). They can also be viewed in browsers like Chrome.
Example from CASA.py:
python -m CASA -i ./proteins.fasta -o ./SVGs/ -ord align svg -s protein -u TRUE -c FALSE -n FALSE -a annotations.ann -f "Active site:blue,Propeptide:#000000"
Example standalone:
python -m clustal_to_svg -i ./alignment.clustal -o ./SVGs/ -u TRUE -c FALSE -n FALSE -a annotations.ann -f "Active site:blue,Propeptide:#000000"
Usage:
usage: clustal_to_svg.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-c CODES] [-n NUMS] [-u UNIPROT_FORMAT] [-a ANNOTATIONS]
[-f FEATURES]
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory.
-c CODES, --codes CODES
When set to TRUE, includes Clustal identity codes at the bottom of each block.
-n NUMS, --nums NUMS When set to TRUE, includes total residue numbers at the end of each line.
-u UNIPROT_FORMAT, --uniprot_format UNIPROT_FORMAT
When set to TRUE, truncates all accessions as if they were UniProt entries. Ex.
sp|P00784|PAPA1_CARPA -> PAPA1_CARPA
-a ANNOTATIONS, --annotations ANNOTATIONS
If an annotation file is provided, it will be used to annotate the resulting SVG files.
-f FEATURES, --features FEATURES
A comma-separated list of feature:color pairs to include in SVGs. Case sensitive. If features
include spaces, the list must be enclosed in quotes. If no features should be included, use "-f None".
The following example is default behavior. Ex.
-f "Active site:#0000ff,Disulfide bond:#e27441,Propeptide:#9e00f2,Signal:#2b7441"
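The effect of -u TRUE on a sequence name can be sketched as keeping the last pipe-separated field. This is an illustration of the documented example only; truncate_accession is a hypothetical helper, and how clustal_to_svg.py treats names without pipes is an assumption here (they are returned unchanged).

```python
def truncate_accession(accession):
    """Keep the text after the last '|', e.g. the UniProt entry name."""
    # Names without a pipe are returned unchanged (assumption of this sketch).
    return accession.rsplit("|", 1)[-1]


if __name__ == "__main__":
    print(truncate_accession("sp|P00784|PAPA1_CARPA"))
```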
Annotations can be added to an SVG with the -a or --annotations option in a run that calls clustal_to_svg.py.
A truncated, but real example of a valid annotation file can be found in annotation_example.ann.
NOTE: If no annotation file is explicitly provided and retrieve_annotations.py is called during the run, clustal_to_svg.py will instead use the "all.ann" annotation file retrieved by retrieve_annotations.py. Annotation files provided via -a or --annotations will override those retrieved by retrieve_annotations.py.
NOTE: The -f or --features option was added to clustal_to_svg.py on 2024.10.30. It allows for users to provide a comma-separated list of feature:color pairs that can be used to customize SVG annotations. Default behavior is identical to including the following option in a run that calls clustal_to_svg.py:
-f "Active site:#0000ff,Disulfide bond:#e27441,Propeptide:#9e00f2,Signal:#2b7441"
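To make the feature:color format concrete, here is a sketch of how such a string could be split into a mapping. parse_features is a hypothetical helper, not the function used by clustal_to_svg.py; it keeps names exactly as written because the option is case sensitive, and it treats the literal value None as "no features".

```python
# Default value quoted from this README.
DEFAULT_FEATURES = (
    "Active site:#0000ff,Disulfide bond:#e27441,"
    "Propeptide:#9e00f2,Signal:#2b7441"
)


def parse_features(text=DEFAULT_FEATURES):
    """Turn 'Name:color,Name:color' into {name: color}."""
    if text == "None":
        return {}
    features = {}
    for pair in text.split(","):
        # Split on the last colon so the color is always the final field.
        name, sep, color = pair.rpartition(":")
        if not sep or not name or not color:
            raise ValueError(f"expected feature:color, got {pair!r}")
        features[name] = color
    return features


if __name__ == "__main__":
    print(parse_features("Active site:blue,Propeptide:#000000"))
```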
Format:
Annotation files that include the following columns and VALUES (tab-delimited) can be used as inputs for clustal_to_svg.py:
                 prot                whole_prot      type             location.start.value  location.end.value
ARBITRARY_INDEX  UNIPROT_FORMAT_ACC  FULL_ACCESSION  ANNOTATION_TYPE  START                 END
Other columns, like "description" may be added for record-keeping, but they will not be used when adding annotations to SVGs.
Example:
   prot         whole_prot             type         location.start.value  location.end.value
0  PAPA1_CARPA  sp|P00784|PAPA1_CARPA  Active site  158                   158
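When writing an annotation file by hand, a short script helps keep the columns tab-delimited. The sketch below uses only the column names given above; write_annotations is a hypothetical helper, and the empty header for the leading index column is an assumption (the header row above lists five names for six values). Compare the result against annotation_example.ann before relying on it.

```python
import csv

COLUMNS = [
    "prot",
    "whole_prot",
    "type",
    "location.start.value",
    "location.end.value",
]


def write_annotations(path, rows):
    """Write rows (dicts keyed by COLUMNS) as a tab-delimited .ann file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumption: the leading index column has an empty header.
        writer.writerow([""] + COLUMNS)
        for index, row in enumerate(rows):
            writer.writerow([index] + [row[column] for column in COLUMNS])


if __name__ == "__main__":
    write_annotations(
        "my_annotations.ann",
        [
            {
                "prot": "PAPA1_CARPA",
                "whole_prot": "sp|P00784|PAPA1_CARPA",
                "type": "Active site",
                "location.start.value": 158,
                "location.end.value": 158,
            }
        ],
    )
```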
