WARNING : STILL A WORK IN PROGRESS

This repo was made public only because of the GitHub Actions usage limits.

Creating a MASH Reference

The mash reference that can be downloaded from the mash documentation is for RefSeq version 70.

I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now.

RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

To replicate the methods:

Step 1. Download Datasets and Dataformat

wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat

Step 2. Download Mash

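wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xvf mash-Linux64-v2.3.tar
# later steps call datasets, dataformat, and mash by name; this assumes the archive unpacks into mash-Linux64-v2.3/
export PATH="$PWD:$PWD/mash-Linux64-v2.3:$PATH"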

Step 3. Get a list of all the genomes

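Note: the sed commands also change how some of the names are represented. They strip square brackets and quotes from organism names and turn "endosymbiont of " into "endosymbiont_of_".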

datasets summary genome taxon bacteria --reference --as-json-lines | \
  dataformat tsv genome --fields accession,organism-name --elide-header | \
  sed 's/\[//g' | \
  sed 's/\]//g' | \
  sed 's/["'\'']//g' | \
  sed 's/endosymbiont of /endosymbiont_of_/g' > \
  ids.txt
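
To see what the name clean-up in the pipeline above does, the same sed expressions can be run on a couple of sample lines. This is only a sketch: the `clean_names` helper and the accessions are made up for illustration and are not output from `datasets`.

```shell
#!/usr/bin/env bash
# clean_names: hypothetical helper wrapping the same sed expressions as Step 3.
clean_names() {
  sed 's/\[//g' |
  sed 's/\]//g' |
  sed 's/["'\'']//g' |
  sed 's/endosymbiont of /endosymbiont_of_/g'
}

# Made-up accessions in the accession<TAB>organism-name layout.
printf 'GCF_000000001.1\t[Clostridium] scindens\n' | clean_names
printf 'GCF_000000002.1\tendosymbiont of Acanthamoeba sp. UWC8\n' | clean_names
```

The brackets are dropped from the first name, and in the second the words "endosymbiont of Acanthamoeba" become a single word. Step 4 takes the second whitespace-separated field as the genus, so this keeps the host genus in the file name.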

Step 4. Download the reference files and sketch them

Note: Since this is done in GitHub Actions (GA), I need to keep everything below 30G. The best way to do this is to download and process each reference file individually, and then combine its sketch into the whole. This obviously does not need to be followed if not under those same limitations.

# ${version} is not defined in this snippet; set it first (assumed to be the RefSeq release number)
while read -r id ge sp rest
do
  # fields: accession, genus, species; fall back to "unknown" when the name is short
  if [ -z "$ge" ] ; then ge="unknown" ; fi
  if [ -z "$sp" ] ; then sp="unknown" ; fi

  datasets download genome accession "$id"
  unzip ncbi_dataset.zip
  cp ncbi_dataset/data/*/*_genomic.fna "${ge}_${sp}_${id}.fasta"
  if [ ! -f "RefSeqSketches_${version}.msh" ]
  then
    # first genome: start the combined sketch
    mash sketch "${ge}_${sp}_${id}.fasta" -o "RefSeqSketches_${version}"
  else
    # later genomes: sketch alone, then paste onto the combined sketch
    mash sketch "${ge}_${sp}_${id}.fasta" -o "${ge}_${sp}_${id}"
    mv "RefSeqSketches_${version}.msh" tmp.msh
    mash paste "RefSeqSketches_${version}" tmp.msh "${ge}_${sp}_${id}.msh"
    rm tmp.msh "${ge}_${sp}_${id}.msh"
  fi

  # clean up before the next genome so disk use stays low
  rm "${ge}_${sp}_${id}.fasta"
  rm -rf ncbi_dataset/
  rm ncbi_dataset.zip
  # these two come from the zip; run in a scratch directory so no other README.md is deleted
  rm README.md
  rm md5sum.txt
done < ids.txt
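
The loop above names each FASTA and sketch from the first three whitespace-separated fields of a line in ids.txt. A minimal sketch of just that step makes it possible to check the names before downloading anything. The `sketch_prefix` helper and the accessions are hypothetical.

```shell
#!/usr/bin/env bash
# sketch_prefix: hypothetical helper that mirrors how the loop builds
# ${ge}_${sp}_${id} from one line of ids.txt.
sketch_prefix() {
  local id ge sp rest
  # Split on whitespace: accession, genus, species, then anything else.
  read -r id ge sp rest <<< "$1"
  # Fall back to "unknown" when the organism name has fewer than two words.
  echo "${ge:-unknown}_${sp:-unknown}_${id}"
}

# Made-up accessions; prints one prefix per line.
sketch_prefix "$(printf 'GCF_000000001.1\tEscherichia coli str. K-12')"
sketch_prefix "$(printf 'GCF_000000002.1\tSalmonella')"
```

Anything after the species word (strain, substrain) is ignored, so the accession is what keeps the names unique.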

Step 5. Compress the sketch file

gzip RefSeqSketches_${version}.msh

Step 6. Use

I did not try to do anything TOO fancy.
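
As one possible way to use the result, the sketch can be decompressed and passed to `mash dist` as the reference. The snippet below is a sketch under assumptions, not necessarily how this repo uses the file: `sample.fasta` is a placeholder query, `top_hits` is a hypothetical helper, and it relies on `mash dist` writing tab-separated lines (reference, query, distance, p-value, shared hashes) with the distance in the third column.

```shell
#!/usr/bin/env bash
# top_hits: hypothetical helper; reads mash dist output on stdin and prints
# the N closest references (smallest distance in column 3). Default N is 5.
top_hits() {
  LC_ALL=C sort -t "$(printf '\t')" -g -k3,3 | head -n "${1:-5}"
}

# Intended use (needs mash, a decompressed sketch, and a query file):
#   gunzip RefSeqSketches_${version}.msh.gz
#   mash dist RefSeqSketches_${version}.msh sample.fasta | top_hits 5

# Demo on made-up lines in the same five-column layout.
{
  printf 'refA.fasta\tsample.fasta\t0.25\t0\t10/1000\n'
  printf 'refB.fasta\tsample.fasta\t0.01\t0\t800/1000\n'
} | top_hits 1
```

The demo prints only the refB.fasta line because it has the smaller distance. `sort -g` is used rather than `sort -n` so that distances written in scientific notation still sort correctly.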

