maren-ha/OffTargetPredictionBenchmark

Off-Target Prediction Benchmark Reproducibility

This repository accompanies the CRISPR off-target prediction benchmark manuscript. It is intended for readers who want to understand how the benchmark was assembled, rerun the main quantitative comparison from stored tool outputs, and reproduce the manuscript figures.

The Quarto documentation shows how to run the benchmark from precomputed tool predictions, explains the tools used in the benchmark and what to keep in mind when running them, and shows how the results in the manuscript are generated. It also includes pages showing how each manuscript figure is generated from the benchmark outputs.

Reproducibility Scope

This repository reproduces the benchmark analysis from standardized output tables from each tool, available for download via Zenodo (see below). It does not rerun each external off-target prediction tool from its native software environment. The individual tools have different installation procedures, command-line interfaces, reference genome requirements, and web/API interfaces. The commands and setup notes used to generate the standardized outputs are summarized in the Quarto page docs/06_tool_setup_reference.qmd.

The larger standardized output files, containing each tool's predictions and the scored candidate sites, are provided separately on Zenodo.

The expected layout is explained in config/zenodo_artifacts.yml and summarized in data/zenodo/README.md. In short, extract the Zenodo archive into data/zenodo/ so that its three data subfolders sit directly inside that directory. Once the archive has been downloaded or symlinked there, the code in this repository can rerun the canonical standard benchmark, regenerate the benchmark summary tables, and rebuild the manuscript figures from the benchmark outputs and the documented figure-specific inputs.

Repository Contents

The repository contains three main components of the analysis:

  • the canonical filtered truth table used for the human full-cohort benchmark, which is the basis for all analyses in the manuscript,
  • the Python code that evaluates standardized tool outputs against that truth table,
  • and a Quarto documentation site that illustrates the workflows and reproduces the manuscript figures one by one.

Repository Inputs

1. GitHub-tracked compact inputs

These are small enough to be included directly in the repository. The main one is:

  • data/manuscript/manuscript_primary.csv

This table is the canonical filtered ground truth set used in the standard human benchmark in the manuscript.
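Before running the benchmark, it can be useful to check that the truth table loads and to see which columns it provides. The following is a minimal sketch using only the standard library; the helper name `summarize_table` is hypothetical and not part of the repository, and no column names are assumed.

```python
import csv
from pathlib import Path


def summarize_table(csv_path):
    """Return (column_names, number_of_data_rows) for a CSV file.

    Hypothetical helper for illustration; not part of the repository code.
    """
    with Path(csv_path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        n_rows = sum(1 for _ in reader)  # consumes all data rows
        columns = list(reader.fieldnames or [])  # header is read on first access
    return columns, n_rows


if __name__ == "__main__":
    truth_path = Path("data/manuscript/manuscript_primary.csv")
    if truth_path.exists():
        columns, n_rows = summarize_table(truth_path)
        print(f"{n_rows} rows, columns: {columns}")
    else:
        print(f"{truth_path} not found; run this from the repository root.")
```

Run from the repository root, this prints the row count and header of the truth table without making any assumption about its schema.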

2. Zenodo-backed standardized tool outputs

The full benchmark also requires standardized contract files for each off-target prediction tool. These files are called

prediction_contract_<tool>.csv

and should be downloaded from Zenodo and saved or symlinked under:

data/zenodo/standard_tool_predictions/

There are additional Zenodo files for specific figures, which are too large to directly track in the GitHub repo. These include the scored candidate tables for Figure 2 and the no-bulge machine learning (ML) prediction tool contracts used by Figure 5 Panel C. The full expected file list is stored in config/zenodo_artifacts.yml; the standard benchmark contract filenames are also listed in config/tool_output_manifest.example.yml.
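To confirm that the standard benchmark contracts are in place, one option is to list the files matching the prediction_contract_<tool>.csv pattern. The sketch below assumes only the directory and naming pattern described above; the function name `list_contract_tools` is hypothetical, and the authoritative file list remains config/zenodo_artifacts.yml.

```python
from pathlib import Path

# Filename prefix taken from the prediction_contract_<tool>.csv pattern.
PREFIX = "prediction_contract_"


def list_contract_tools(contract_dir):
    """Return sorted tool names inferred from contract filenames.

    Hypothetical helper for illustration; returns an empty list if the
    directory does not exist.
    """
    directory = Path(contract_dir)
    if not directory.is_dir():
        return []
    tools = [path.stem[len(PREFIX):] for path in directory.glob(f"{PREFIX}*.csv")]
    return sorted(tools)


if __name__ == "__main__":
    found = list_contract_tools("data/zenodo/standard_tool_predictions")
    if found:
        print(f"Found {len(found)} contract file(s): {', '.join(found)}")
    else:
        print("No contract files found; download or symlink the Zenodo archive first.")
```

This only checks that files are present. It does not validate their contents against the expected contract format.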

Quick start

From the repository root:

pip install -e .
python scripts/run_manuscript_benchmark.py --help

Once the Zenodo files are in place, the canonical rerun is:

python scripts/run_manuscript_benchmark.py

To render the documentation site and execute the figure pages:

python scripts/render_docs.py

This command installs a project-local Jupyter kernel inside .jupyter/ and then runs Quarto with the repository's Python environment.

Reproducing the figures

  • Figures 3, 4, and 6 are rebuilt from benchmark run outputs.
  • Figure 6 also uses the no-bulge ML comparison recall-curve output for the machine learning tools.
  • Figure 5 starts from the standardized prediction contracts, because it measures whether each validated true site was evaluated by each tool before any rank cutoff is applied.
  • Figure 1 describes the benchmark cohort itself and therefore starts from the truth/input layer (GitHub-tracked file in the data folder).
  • Figure 2 requires a broader scored candidate layer than the benchmark summaries alone and is documented separately (file stored on Zenodo).
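The coverage idea behind the contract-based figure, whether a validated site appears anywhere in a tool's output regardless of rank, can be illustrated with a toy calculation. This is a sketch under stated assumptions and not the repository's figure code: the site keys, tool names, and the function `coverage_by_tool` are all hypothetical, and the real contract columns are documented in the Quarto pages.

```python
def coverage_by_tool(true_sites, predictions_by_tool):
    """Fraction of validated sites present in each tool's output, ignoring rank.

    Hypothetical helper for illustration. Sites are any hashable keys,
    here (guide, locus) tuples.
    """
    truth = set(true_sites)
    if not truth:
        raise ValueError("true_sites must not be empty")
    return {
        tool: len(truth & set(sites)) / len(truth)
        for tool, sites in predictions_by_tool.items()
    }


# Toy data: four validated sites and two made-up tools.
truth = [("g1", "chr1:100"), ("g1", "chr2:250"), ("g2", "chr5:42"), ("g2", "chr7:9")]
predictions = {
    # tool_a never evaluated ("g2", "chr7:9"), so it covers 3 of 4 sites
    "tool_a": [("g1", "chr1:100"), ("g1", "chr2:250"), ("g2", "chr5:42")],
    # tool_b evaluated all four validated sites plus one unvalidated candidate
    "tool_b": truth + [("g1", "chr9:1")],
}

coverage = coverage_by_tool(truth, predictions)
print(coverage)  # {'tool_a': 0.75, 'tool_b': 1.0}
```

Because no rank cutoff is applied, a site counts as covered however low the tool scored it. That is what distinguishes this measure from the recall-style summaries computed on benchmark run outputs.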

Repository structure

  • src/offtarget_benchmark/: benchmark runner, helper functions, and plotting code
  • scripts/: command-line scripts for running the benchmark and rendering docs
  • data/: compact benchmark inputs plus the Zenodo-backed data directories
  • results/benchmark_runs/: benchmark outputs used by the figure pages
  • docs/: Quarto tutorials and figure walkthroughs

About

Code and documentation to reproduce benchmark results of CRISPR-Cas9 off-target prediction tools
