This repository accompanies the CRISPR off-target prediction benchmark manuscript. It is intended for readers who want to understand how the benchmark was assembled, rerun the main quantitative comparison from stored tool outputs, and reproduce the manuscript figures.
The Quarto documentation in docs/ shows how to run the benchmark from precomputed tool predictions, describes the tools used in the benchmark and what to keep in mind when running them, and shows how the results in the manuscript are generated. It also includes one page per manuscript figure showing how that figure is generated from the benchmark outputs.
This repository reproduces the benchmark analysis from the standardized output
tables of each tool, which are available for download via Zenodo (see below). It does not rerun each external off-target prediction tool from its
native software environment. The individual tools have different installation
procedures, command-line interfaces, reference genome requirements, and web/API
interfaces. The commands and setup notes used to generate the standardized
outputs are summarized in the Quarto page docs/06_tool_setup_reference.qmd.
The larger standardized output files with the predictions of each tool and scored candidate sites are provided separately on Zenodo:
- Zenodo record: https://zenodo.org/records/20627722
- Data archive URL for direct download: https://zenodo.org/records/20627722/files/offtarget_prediction_benchmark_zenodo_artifacts_v1.zip?download=1
The expected layout is explained in config/zenodo_artifacts.yml and
summarized in data/zenodo/README.md. In short, the Zenodo archive should be extracted into
data/zenodo/ so that its three data subfolders sit directly inside that directory. Once the
archive contents have been extracted or symlinked there, the code in this repository can rerun the
canonical standard benchmark, regenerate the benchmark summary tables, and
rebuild the manuscript figures from the benchmark outputs and documented
figure-specific inputs.
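The download-and-extract step can be sketched in Python as follows. This is a minimal sketch and not part of the repository: the function names `download_archive` and `extract_archive` are hypothetical, only the archive URL is taken from the Zenodo record above, and it assumes the zip file contains the data subfolders at its top level.

```python
import urllib.request
import zipfile
from pathlib import Path

# Direct-download URL as listed in the Zenodo record above.
ARCHIVE_URL = (
    "https://zenodo.org/records/20627722/files/"
    "offtarget_prediction_benchmark_zenodo_artifacts_v1.zip?download=1"
)


def download_archive(url: str, dest: Path) -> Path:
    """Download the archive to dest unless a copy already exists (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract_archive(archive: Path, target_dir: Path) -> list[str]:
    """Extract the archive into target_dir and return its top-level entry names.

    Assumption: the data subfolders sit at the top level of the zip file.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        top_level = sorted({name.split("/")[0] for name in zf.namelist() if name})
    return top_level
```

From the repository root, a call such as `extract_archive(download_archive(ARCHIVE_URL, Path("downloads/artifacts.zip")), Path("data/zenodo"))` would place the subfolders under data/zenodo/; the `downloads/` location is an arbitrary choice. The returned list can be compared against the layout described in data/zenodo/README.md.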
The repository contains three main components of the analysis:
- the canonical filtered truth table used for the human full-cohort benchmark, which is the basis for all analysis in the manuscript,
- the Python code that evaluates standardized tool outputs against that truth table,
- and a Quarto documentation site that illustrates the workflows and reproduces the manuscript figures one by one.
The compact benchmark inputs are small enough to be included directly in the repository. The main one is:
data/manuscript/manuscript_primary.csv
This table is the canonical filtered ground truth set used in the standard human benchmark in the manuscript.
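Before running the benchmark it can help to check that the truth table is readable. The sketch below uses only the standard library and makes no assumption about column names; `summarize_csv` is a hypothetical helper, not a function from this repository.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (header, number_of_data_rows) for a CSV file (hypothetical helper)."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows at all
        n_rows = sum(1 for _ in reader)
    return header, n_rows
```

Calling `summarize_csv("data/manuscript/manuscript_primary.csv")` from the repository root returns the column names and the number of data rows in the truth table.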
The full benchmark also requires standardized contract files for each off-target prediction tool. These files are called
prediction_contract_<tool>.csv
and should be downloaded from Zenodo and saved or symlinked under:
data/zenodo/standard_tool_predictions/
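To confirm which contract files are in place, the directory can be scanned for the naming pattern above. This is a minimal sketch: `find_contract_tools` is a hypothetical helper, and it assumes the tool name is exactly the part of the filename between `prediction_contract_` and `.csv`.

```python
from pathlib import Path

PREFIX = "prediction_contract_"


def find_contract_tools(contract_dir):
    """List tool names for which a prediction_contract_<tool>.csv file exists.

    Assumption: the tool name is the filename stem with the prefix removed.
    """
    contract_dir = Path(contract_dir)
    tools = []
    for path in sorted(contract_dir.glob(f"{PREFIX}*.csv")):
        tools.append(path.stem[len(PREFIX):])
    return tools
```

Running `find_contract_tools("data/zenodo/standard_tool_predictions")` from the repository root gives a list that can be compared by eye against the filenames in config/tool_output_manifest.example.yml.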
There are additional Zenodo files for specific figures, which are too large
to directly track in the GitHub repo. These include the scored candidate tables
for Figure 2 and the no-bulge machine learning (ML) prediction tool contracts used by Figure 5 Panel C.
The full expected file list is stored in config/zenodo_artifacts.yml; the
standard benchmark contract filenames are also listed in config/tool_output_manifest.example.yml.
From the repository root:
pip install -e .
python scripts/run_manuscript_benchmark.py --help
Once the Zenodo files are in place, the canonical rerun is:
python scripts/run_manuscript_benchmark.py
To render the documentation site and execute the figure pages:
python scripts/render_docs.py
This command installs a project-local Jupyter kernel inside .jupyter/
and then runs Quarto with the repository's Python environment.
- Figures 3, 4, and 6 are rebuilt from benchmark run outputs.
- Figure 6 also uses the no-bulge ML comparison recall-curve output for the machine learning tools.
- Figure 5 starts from the standardized prediction contracts, because it measures whether each validated true site was evaluated by each tool before any rank cutoff is applied.
- Figure 1 describes the benchmark cohort itself and therefore starts from the truth/input layer (GitHub-tracked file in the data/ folder).
- Figure 2 requires a broader scored candidate layer than the benchmark summaries alone and is documented separately (file stored on Zenodo).
- src/offtarget_benchmark/: benchmark runner, helper functions, and plotting code
- scripts/: command-line scripts for running the benchmark and rendering docs
- data/: compact benchmark inputs plus the Zenodo-backed data directories
- results/benchmark_runs/: benchmark outputs used by the figure pages
- docs/: Quarto tutorials and figure walkthroughs