Skip to content

Sipley/pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

212 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pipeline

Disclaimer

This was created in partial fulfillment of the requirements for a Scripting for Biologists graduate-level course, which I co-opted to improve the bioinformatics pipeline used in my MS thesis. I have generalized many parts of this pipeline but in it's current state it's probably only super useful for me. Check back later for a more generally useful pipeline.

Purpose

To survey the diversity of Venom Allergen-like Proteins (VAPs) expressed in Schistosoma mansoni

Known dependences

Python3 + modules
  • biopython + its dependencies
  • ftplib (slow)
  • os

If on mac and have pip, you can install all required Python modules with the following:

python3 -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
python3 -m pip install --user biopython
python3 -m pip install --user ftplib
python3 -m pip install --user os
python3 -m pip install --user wormbase-parasite
python3 -m pip install --user requests
BLAST+

BLAST+ 2.9.0 executables: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

After installing, add the BLAST+ executables to your path by inserting the following into your ~/.bash_profile:

PATH="/usr/local/ncbi/blast/bin:${PATH}"
export PATH

Then exit terminal & re-enter or run source ~/bash_profile

Optional

TransDecoder + dependencies

The easiest way to install TransDecoder and many other programs is through anaconda (available here).

With anaconda installed, simply run the following to install the appropriate version of TransDecoder:

conda config --add channels bioconda
conda install transdecoder=3.0.1 # Do not use most recent version

To install with anaconda on mac:

conda install hmmer

Pfam database: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release

mafft

If on mac, get mafft by running:

conda install -c bioconda mafft 
RAxML

If on mac, get RAxML by running:

conda install -c bioconda raxml

Don't forget to add to your ~/.bash_profile:

PATH="/Users/breanna/miniconda3/bin:$PATH"
export PATH
CD-HIT

If on mac, this should work:

conda install -c bioconda cd-hit 
SignalP

You'll have to install this yourself by requesting an academic download. Be sure you install the correct version for your system (if not mac Darwin). Please follow instructions carefully and make sure the exectuables are in your PATH; it's important that the file structure of the original download is conserved for SignalP to find the correct executables. It is strongly advised that you decompress the tarball for this download in your /usr/local/bin

If you downloaded the package to /usr/local/bin as sugguested, this should work on Mac to add the executables to your path:

cp ~/Downloads/signalp-4.1g.Darwin.tar.gz /usr/local/bin/.
cd /usr/local/bin
tar -xvzf ~/Downloads/signalp-4.1g.Darwin.tar.gz

Add this new directory to your path in ~/.bash_profile

# Adding SignalP
PATH="/usr/local/bin/signalp-4.1:${PATH}"
export PATH

YOU MUST READ THE *.readme FILE & change settings in 'signalp' appropriately. Pay special attention to my $outputDir---this MUST be writable by ALL users. Recommended settings:

# full path to the signalp-4.1 directory on your system (mandatory)
BEGIN {
    $ENV{SIGNALP} = '/usr/local/bin/signalp-4.1';
}

# determine where to store temporary files (must be writable to all users)
my $outputDir = "/var/tmp";
FTP (Optional)

If on mac, get ftp by running:

brew install inetutils
Other programs
  • "Normal" sed

If on mac, download by running:

brew install gnu-sed

Don't forget to add to path:

PATH="/usr/local/opt/gnu-sed/libexec/gnubin:$PATH"
  • R

Summary

  1. Acquire a large dataset of sequences
  2. Pull out Gene Family of Interest from all sequences (non-redundant)
  3. Infer the best gene tree for your Gene Family of Interest
  4. Color best gene tree based on signal peptide predictions
  5. Compare average rates of amino acid substitutions

Step 1: Acquire a large dataset of sequences

For this tutorial, we will use the publicly available transcriptome of Schistosoma mansoni

Please fork this repository and clone the directory from your GitHub to your local machine in your home directory ~. Then cd into the scripts directory and run pipeline.sh (ignore the warning message):

git clone https://github.com/<YOUR GITHUB USERNAME>/pipeline.git
cd pipeline
cd tmp

bash scripts/pipeline.sh

Output

  1. data/query.fasta

    This contains the protein sequence for the only solved structure of a 'Venom allergen-like Protein' from flatworms, to my knowledge.

  2. data/schistosoma_mansoni.PRJEA36577.WBPS13.CDS_transcripts.fa.gz

    This is the most recent transcriptome assembly for Schistosoma mansoni, which arguably has the most well-annotated genome of all flatworms.

This tutorial can be easily extended to a much larger number of transcriptomes

If you have the needed link, you can download a dataset containing 47 flatworms transcriptomes by running the following code in your terminal:

mkdir transcriptomes
cd transcriptomes
curl -L <LINK-TO-TRANSCRIPTOMES>?dl=1 > transcriptomes.zip
unzip transcriptomes.zip
rm transcriptomes.zip

To confirm that the transcriptomes were downloaded successfully, please run the following code (if you're on a mac and received a command not found error, please run brew install md5sha1sum and try again):

md5sum -c md5sum.txt

Your output should look like:

transcriptomes-MS.tgz: OK

If not, the transcriptomes were not downloaded correctly, and you should proceed with extreme caution.

Step 2: Pull out Gene Family of Interest from all sequences (non-redundant)

This will run automatically.

What heppened?

  1. Sequences expressed in Schistosoma mansoni that are homologous to the solved structure of Schistosoma mansoni were identified using BLAST+ & saved in a new tmp/ directory as potential_VAPs.fasta
  2. Potential VAP sequences were translated to protein and validated as VAPs via CRISP domain presence

Step 3: Infer the best gene tree for your gene family of interest

This will run automatically.

What happened?

  1. Redudant sequences were removed using 'cd-hit'

  2. The VAP protein sequences were aligned using the program 'mafft'

  3. Header names were cleaned up

  4. RAxML was employed

    b. Maximum likelihood trees inferred from 10 bootstrapped alignments (rapid) under WAG model of protein evolution c. ML search for best tree (Fast ML -> Slow ML -> Thorough ML) d. Draw Bootstrap Support Valus on best-scoring ML tree

Note: This is for proof-of-concept only; if running this pipeline for publication, you'll want to adjust # of bootstraps & change model to what best fits your data

Outputs:

RAxML_bestTree.WAG.10bp
RAxML_bipartitions.WAG.10bp
RAxML_bipartitionsBranchLabels.WAG.10bp
RAxML_bootstrap.WAG.10bp
RAxML_info.WAG.10bp

Can find some interesting info in RAxML_info.WAG.10bp.

Step 4: Color best gene tree based on signal peptide predictions

What happened?

  1. Signal peptides were predicted from full protein sequences using SignalP
  2. A .png file was created with the RAxML tree with signal peptide predictions mapped onto it

You will find a new directory tmp/ which will contain many files including:

Pfam-A.hmm					pfam-hits_VAPs.domtblout
Pfam-A.hmm.h3f					pfam-validate_VAPs.domtblout
Pfam-A.hmm.h3i					potential_VAPs.fasta
Pfam-A.hmm.h3m					potential_VAPs.fasta.transdecoder.bed
Pfam-A.hmm.h3p					potential_VAPs.fasta.transdecoder.cds
RAxML_bestTree.WAG.10bp				potential_VAPs.fasta.transdecoder.gff3
RAxML_bipartitions.WAG.10bp			potential_VAPs.fasta.transdecoder.pep
RAxML_bipartitionsBranchLabels.WAG.10bp		potential_VAPs.fasta.transdecoder_dir
RAxML_bootstrap.WAG.10bp			problematic-VAPs.txt
RAxML_info.WAG.10bp				sprot-hits_VAPs.blastp
VAPs.signalp					sprot.db.phr
VAPs.signalp.csv				sprot.db.pin
VAPs_unique_clean-names.fasta			sprot.db.pog
blast.db.nhr					sprot.db.psd
blast.db.nin					sprot.db.psi
blast.db.nog					sprot.db.psq
blast.db.nsd					test_moreThan75percentHit_names.txt
blast.db.nsi					transdecoder-VAP_names.txt
blast.db.nsq					transdecoder-complete_VAPs.pep
blast.tblastn					transdecoder-complete_VAPs_75-percent.pep
blast_hits.fasta				transdecoder_VAPs.pep
contigLure.txt					transdecoder_VAPs_unique.pep
mafft_S-mansoni_VAP.fasta			transdecoder_VAPs_unique.pep.clstr
mafft_S-mansoni_VAP_clean-names.fasta		tree_signalp.png
pfam-CRISP_names.txt				uniprot_sprot.fasta

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors