pyFIRE

Finding Informative Regulatory Elements — a Python implementation with single-cell support.

pyFIRE is a ground-up rewrite of FIRE (Elemento, Slonim & Tavazoie, Molecular Cell 2007), an information-theoretic framework for de novo discovery of DNA/RNA regulatory motifs. This version adds native single-cell capabilities inspired by VISION (DeTomaso & Yosef, Nature Communications 2019), integrates with the scverse ecosystem (scanpy / AnnData), and is pip-installable with numba-accelerated internals.

How it works

FIRE discovers motifs whose presence in regulatory sequences (promoters, UTRs, enhancers) is statistically associated with gene expression patterns. It uses mutual information as a non-parametric, assumption-free measure of dependence — capturing linear, nonlinear, and combinatorial relationships that correlation-based methods miss.

Bulk mode

Sequences + Expression
        │
        ▼
  ┌──────────────┐
  │ K-mer        │
  │ Screening    │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ IUPAC        │
  │ Optimization │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ Significance │
  │ Testing      │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ Distribution │
  │ & Combine    │
  └──────┬───────┘
         ▼
     FIREResult

K-mer Screening: Screen all 4^k k-mers by CMI(kmer; expression | length), then keep candidate seeds.
IUPAC Optimization: Greedily expand seeds with degenerate IUPAC codes to maximize CMI; filter redundant motifs.
Significance Testing: Permutation testing (z-scores, p-values) + jackknife robustness.
Distribution & Combine: Test positional/orientation bias and detect synergistic or redundant motif pairs.

Single-cell mode (scRNA-seq)

pyFIRE supports two single-cell workflows:

De novo discovery — discover motifs directly from scRNA-seq data using per-cell conditional mutual information. For each cell c and motif m, the score is CMI(motif_presence; expression_cell | length) across all genes. This produces a (n_cells, n_motifs) MI matrix.

AnnData + Sequences
        │
        ▼
  ┌──────────────┐
  │ Metacell     │
  │ Pre-screen   │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ Per-cell     │
  │ CMI Matrix   │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ Spatial      │
  │ Coherence    │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ IUPAC        │
  │ Optimization │
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │ Significance │
  │ Testing      │
  └──────┬───────┘
         ▼
  SingleCellDiscoveryResult

Metacell Pre-screen: Per-cluster pseudo-bulk CMI seed screen with permutation p-values and BH-FDR thresholding.
Per-cell CMI Matrix: Compute CMI(kmer; expression_cell | confounders) for each cell × candidate.
Spatial Coherence: Geary's C on the KNN graph with permutation p-values and BH-FDR thresholding.
IUPAC Optimization: Greedy motif optimization with strict max-frequency cap and CMI/MI redundancy filtering.
Significance Testing: Final Geary's C testing with full permutations and BH-FDR correction.

Post-hoc scoring — given motifs from a bulk run (or supplied externally), score per-cell motif activity as the weighted mean expression of each motif's target genes, then test for spatial coherence on a KNN graph using Geary's C autocorrelation. Motifs whose activity is smoothly distributed across transcriptionally similar cells are reported as significant.

Results are stored directly in AnnData objects.

Length bias correction

Sequences of different lengths have different probabilities of containing any given motif by chance. pyFIRE corrects for this by computing conditional mutual information — CMI(motif; expression | length) — throughout the pipeline, rather than raw MI. This conditions out the confounding effect of length while keeping motif counts as simple binary presence/absence (no density normalization needed).

In both bulk and single-cell modes, requested length correction is automatically disabled when sequence lengths are effectively constant (e.g., fixed-width promoter windows). In that case, pyFIRE falls back to a single conditioning state instead of over-correcting.

Installation

pip install -e "."

With single-cell support (anndata + scanpy):

pip install -e ".[singlecell]"

Development (includes pytest, ruff):

pip install -e ".[dev]"

Requirements

Python >= 3.10
numpy, scipy, pandas, numba, pyfaidx, click, tqdm, matplotlib
Optional: anndata, scanpy (for single-cell mode)

Quick start

Command line

# Discover motifs from regulatory sequences + expression data
# Default outputs go to <expression>_FIRE_<mode>/
# (e.g., expression.tsv + mode=dna -> expression_FIRE_dna/pyfire_results_dna.tsv + plots/matrix files)
pyfire discover --fasta promoters.fasta -e expression.tsv

# RNA mode (single-strand scanning + RC-informativeness seed filter)
pyfire discover --fasta utr3.fasta -e expression.tsv --mode rna

# With custom parameters
pyfire discover --fasta promoters.fasta -e expression.tsv \
    -k 7 \
    --n-seeds 200 \
    --seed-permutations 200 \
    --seed-pvalue 1e-5 \
    # optional alternative rule:
    # --seed-z-threshold 3.0 \
    --seed-max-freq 0.33 \
    # optional override for optimization-stage cap:
    # --max-freq 0.33 \
    --seed-stop-failures 5 \
    --n-permutations 10000 \
    --z-threshold 4.0 \
    --robustness-threshold 3 \
    --output-dir runs/run1

# Select expression columns directly from a multi-column table
# (supports names or 1-based indices)
pyfire discover --fasta promoters.fasta -e data_examples/test_DESeq_logFC.txt.gz \
    --cols GENE,log2FoldChange
# equivalent: --cols 8,2

# Disable automatic plotting
pyfire discover --fasta promoters.fasta -e expression.tsv --no-plot

# Scan sequences for known motifs
pyfire scan sequences.fasta motifs.txt -o scan_results.tsv

# Single-cell de novo discovery (with plots)
pyfire discover-sc pbmc.h5ad promoters.fasta \
    --groupby celltype \
    -k 7 \
    --mode dna \
    --seed-fdr 0.1 \
    --seed-max-freq 0.33 \
    --max-freq 0.5 \
    --n-permutations 5000 \
    -o pbmc_results.h5ad

# Re-plot from saved results without re-running discovery
pyfire draw results.tsv --format pdf --no-logos --lp-max 10

# Re-draw single-cell outputs without re-running discovery
pyfire draw-sc pbmc_scFIRE --groupby celltype --format all

Extract promoter/UTR FASTA

Use the CLI helpers (pyfire make-promoter-fasta, pyfire make-utr3-fasta, pyfire make-utr5-fasta) to build sequence FASTA files from gene/transcript identifiers. They support three identifier modes via --entity:

gene (symbol-level; one selected major isoform per symbol)
gene_id (ENSG-level; one selected major isoform per ENSG)
transcript_id (ENST-level; one record per transcript)

They also write a metadata TSV describing transcript selection, scores, and any skipped records. Metadata now begins with a # command: ... header containing the exact generating command.

1) Get annotation

Download GENCODE human annotation and place it in fasta/:

mkdir -p fasta
cd fasta
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/gencode.v49.basic.annotation.gtf.gz
cd ..

2) (Optional) Build transcript score table for isoform-aware selection

Example using data_examples/MDA_tx_abudance_salmon.txt and averaging MDAchr7 replicates:

python - <<'PY'
import pandas as pd
tx = pd.read_csv("data_examples/MDA_tx_abudance_salmon.txt", sep="\t")
tx["transcript_id"] = tx["Unnamed: 0"].str.replace(r"\.\d+$", "", regex=True)
tx["expression"] = tx[["MDAchr7_1", "MDAchr7_2"]].mean(axis=1)
tx[["transcript_id", "expression"]].to_csv("fasta/MDAchr7_tx_scores.tsv", sep="\t", index=False)
PY

3) Build promoters, 3'UTRs, and 5'UTRs

Example using DESeq2 output data_examples/MDAchr7_gene_logFC.txt:

# Optional but recommended for speed (local batch extraction)
# macOS: brew install bedtools
# Ubuntu/Debian: sudo apt-get install -y bedtools

# Promoters: 2.5 kb upstream + 0.5 kb downstream of TSS
pyfire make-promoter-fasta \
  --genes data_examples/MDAchr7_gene_logFC.txt \
  --column row \
  --gtf fasta/gencode.v49.basic.annotation.gtf.gz \
  --sequence-source genome \
  --genome fasta/GRCh38.primary_assembly.genome.fa \
  --entity gene \
  --upstream 2500 \
  --downstream 500 \
  --fetch-engine bedtools \
  --bed-out fasta/MDAchr7_promoters.bed \
  --isoform-table fasta/MDAchr7_tx_scores.tsv \
  --isoform-id-column transcript_id \
  --isoform-score-column expression \
  --out fasta/MDAchr7_promoters.fa

# 3'UTRs
pyfire make-utr3-fasta \
  --genes data_examples/MDAchr7_gene_logFC.txt \
  --column row \
  --gtf fasta/gencode.v49.basic.annotation.gtf.gz \
  --sequence-source genome \
  --genome fasta/GRCh38.primary_assembly.genome.fa \
  --entity gene \
  --isoform-table fasta/MDAchr7_tx_scores.tsv \
  --isoform-id-column transcript_id \
  --isoform-score-column expression \
  --out fasta/MDAchr7_utr3.fa

# 5'UTRs
pyfire make-utr5-fasta \
  --genes data_examples/MDAchr7_gene_logFC.txt \
  --column row \
  --gtf fasta/gencode.v49.basic.annotation.gtf.gz \
  --sequence-source genome \
  --genome fasta/GRCh38.primary_assembly.genome.fa \
  --entity gene \
  --isoform-table fasta/MDAchr7_tx_scores.tsv \
  --isoform-id-column transcript_id \
  --isoform-score-column expression \
  --out fasta/MDAchr7_utr5.fa

Notes:

Fastest promoter extraction is local genome + --fetch-engine bedtools (single batch call).
--sequence-source ensembl avoids downloading a full genome FASTA.
For local extraction, use --sequence-source genome --genome /path/to/genome.fa.
make-promoter-fasta supports --fetch-engine and optional --bed-out; UTR commands currently use direct interval extraction.
Metadata files are written as <out>.metadata.tsv by default.
In --entity gene mode, symbol headers are deduplicated by default (first selected record kept).

Python API — bulk discovery

from pyfire.discovery.pipeline import FIREPipeline
from pyfire.sequence.fasta import read_fasta_with_lengths
from pyfire.io.expression import read_expression_table, align_sequences_and_expression
from pyfire.io.results import save_results

# Load data
names, sequences, lengths = read_fasta_with_lengths("promoters.fasta")
genes, expression = read_expression_table("expression.tsv")

# Align to common gene set
common_names, common_seqs, common_expr = align_sequences_and_expression(
    names, sequences, genes, expression
)

# Run the pipeline
pipeline = FIREPipeline(k=7, n_top_seeds=100, n_permutations=10_000)
result = pipeline.run(common_seqs, common_expr)

# Inspect results
for motif, sig in zip(result.motifs, result.significance):
    print(f"{motif.pattern}  z={sig.z_score:.1f}  robustness={sig.robustness}/10")

# Save
save_results(result, "results.tsv")

Python API — single-cell de novo discovery

import scanpy as sc
from pyfire.sc.discovery import SingleCellDiscoveryPipeline

# Load scRNA-seq data (with PCA + neighbors already computed)
adata = sc.read_h5ad("pbmc.h5ad")

# Run de novo motif discovery directly from single-cell data
pipeline = SingleCellDiscoveryPipeline(
    k=7, groupby="celltype", n_permutations=5000,
)
result = pipeline.run(
    adata,
    gene_sequences=sequences,      # promoter sequences for genes in adata
    gene_names=gene_names,
)

# Results are stored in adata
print(adata.obsm["X_motif_cmi"].shape)  # (n_cells, n_motifs) per-cell CMI
print(adata.uns["pyfire_sc_discovery"]["significant_motifs"])

# Inspect discovered motifs
for motif in result.motifs:
    print(f"{motif.pattern}  mean_CMI={motif.mean_cmi:.4f}")
for name, ac in zip(result.motif_names, result.autocorrelation):
    print(f"{name}  C'={ac.c_prime:.3f}  z={ac.z_score:.1f}")

Python API — single-cell post-hoc scoring

import scanpy as sc
from pyfire.motif.iupac import IUPACPattern
from pyfire.sc.pipeline import SingleCellPipeline

# Load scRNA-seq data (with PCA + neighbors already computed)
adata = sc.read_h5ad("pbmc.h5ad")

# Define motifs (from bulk FIRE or known motifs)
patterns = [IUPACPattern("NACGTACGN"), IUPACPattern("NRYGAAACN")]

# Run single-cell analysis
sc_pipeline = SingleCellPipeline(n_neighbors=30, p_threshold=0.01)
result = sc_pipeline.run(
    adata,
    patterns=patterns,
    gene_sequences=sequences,      # promoter sequences for genes in adata
    gene_names=gene_names,
)

# Results are stored in adata
print(adata.obsm["X_motif_activity"].shape)  # (n_cells, n_motifs)
print(adata.uns["pyfire"]["significant_motifs"])

# Or access from the result object
for name, ac in zip(result.motif_names, result.autocorrelation):
    print(f"{name}  C'={ac.c_prime:.3f}  p={ac.p_value:.4f}")

Pseudo-bulk from single-cell data

from pyfire.io.anndata_bridge import adata_to_pseudobulk

# Aggregate by cell type for bulk-mode FIRE
expr_bulk, group_names, gene_names = adata_to_pseudobulk(adata, groupby="celltype")

# Then run the bulk pipeline on each column of the resulting matrix

Input formats

Sequences (FASTA)

Standard FASTA file with one regulatory sequence per gene. Sequence names must match the gene identifiers in the expression file.

>YAL001C
ACGTACGTACGTNNACGT...
>YAL002W
TGCATGCATGCATGCA...

Expression (TSV)

Tab-separated file with gene names in the first column and a continuous expression value in the second column. The expression value can be any continuous measure (log fold-change, mean expression, PC loading, etc.).

gene    expression
YAL001C 2.34
YAL002W -1.05
YAL003W 0.72

Motifs (for scanning)

Plain text file with one IUPAC motif per line:

ACGTRYY
WKGAAACN
TGASTCA

Pipeline parameters

Bulk (`discover`)

Parameter	Default	Description
`k`	7	K-mer length for initial screening
`n_top_seeds`	0	Maximum seeds to optimize after screening (0 = uncapped; screening still stops after consecutive non-significant seeds)
`mode`	`dna`	`dna`: both strands + canonical k-mers; `rna`: single-strand k-mers filtered to keep only seeds more informative than their reverse complement
`seed_permutations`	10,000	Permutations per seed for seed-significance tests during screening
`seed_pvalue`	1e-5	Seed p-value cutoff during screening (default enforces "better than all permutations")
`seed_max_freq`	0.33	Maximum allowed seed abundance during screening
`max_freq`	`seed_max_freq` (if omitted)	Maximum allowed motif abundance during IUPAC optimization
`seed_stop_after_failures`	5	Stop seed screening after this many consecutive non-significant seeds
`n_expr_bins`	auto	Expression bins; auto = `n_genes / (50 * n_motif_bins)`
`n_length_bins`	5	Requested sequence length bins for CMI conditioning (auto-reduced to 1 when lengths are near-constant)
`n_motif_bins`	2	Motif count bins (2 = binary presence/absence)
`n_permutations`	10,000	Shuffles for null distribution
`n_jackknife_trials`	10	Jackknife resamples for robustness
`z_threshold`	4.0	Minimum z-score to report a motif
`robustness_threshold`	3	Minimum jackknife passes (out of 10)
`add5` / `add3`	1 / 1	Flanking positions added during IUPAC optimization
`min_cmi_ratio`	5.0	CMI ratio threshold for redundancy filtering

Single-cell (`discover-sc`)

Parameter	Default	Description
`k`	7	K-mer length for initial screening
`groupby`	auto	Column in `adata.obs` for cluster labels (tries `leiden`, `louvain`, `cluster`, `celltype`)
`n_neighbors`	30	KNN neighbors for graph construction
`n_expr_bins`	5	Per-cell expression bins (fewer than bulk due to sparsity)
`n_length_bins`	5	Requested sequence-length bins for conditioning (auto-reduced/disabled when lengths are near-constant)
`n_permutations`	5,000	Permutations for final Geary's C significance
`p_threshold`	0.01	BH-FDR threshold for spatial/significance stages
`mode`	`dna`	`dna`: canonical seed k-mers + both-strand motif scanning; `rna`: strand-specific seeds + single-strand scanning
`seed_fdr`	0.1	Stage-1 seed BH-FDR threshold
`seed_max_freq`	0.33	Maximum allowed seed abundance during Stage-1
`add5` / `add3`	1 / 1	Flanking positions added during IUPAC optimization
`max_freq`	0.5	Strict maximum allowed motif frequency during optimization
`min_cmi_ratio`	5.0	CMI ratio threshold for redundancy filtering
`max_cells_optimize`	5,000	Subsample to this many cells during optimization

Output

Results TSV

One row per discovered motif:

Column	Description
`motif`	IUPAC pattern string
`seed`	Original k-mer seed
`cmi`	Conditional mutual information (bits)
`frequency`	Fraction of sequences containing the motif
`degeneracy`	Number of concrete sequences the pattern matches
`z_score`	Permutation z-score
`p_value`	Empirical p-value
`corrected_p_value`	Bonferroni-corrected p-value
`robustness`	Jackknife robustness score (0–10)
`position_z`	Z-score for positional bias
`position_p`	P-value for positional bias
`orientation_z`	Z-score for orientation (strand) bias
`orientation_p`	P-value for orientation bias

Single-cell results (in AnnData)

De novo discovery (discover-sc):

adata.obsm["X_motif_cmi"] — (n_cells, n_motifs) per-cell CMI matrix
adata.uns["pyfire_sc_discovery"] — dict with:
- motif metrics: motif_names, seeds, mean_cmi, frequencies
- spatial stats: gearys_c, gearys_c_prime, gearys_z, gearys_p, gearys_fdr
- significant set: significant_motifs
- run metadata: k, mode, confounders, n_permutations, seed_max_freq, max_freq
- diagnostics: seed_table, stage3_table, redundancy_log

Post-hoc scoring (known motifs):

adata.obsm["X_motif_activity"] — (n_cells, n_motifs) activity matrix
adata.uns["pyfire"] — dict with motif_names, gearys_c, gearys_c_prime, gearys_z, gearys_p, significant_motifs

Visualization

The discover command automatically generates figures alongside results (disable with --no-plot):

results.matrix.tsv — enrichment matrix (signed log10 p-values per motif × expression bin), used by the draw command
results.pdf / results.png — FIRE matrix figure: heatmap of motif enrichment across expression bins, with sequence logos, significance borders, and statistics columns
results.html — interactive HTML version with hover tooltips and inline base64 logos
results.motif_matches.tsv — flat report with one row per motif hit: motif, sequence name, 1-based position, strand, exact matched sequence, and sequence context with 20 bp flanks on both sides
command.txt — exact run command plus a ready-to-run pyfire draw ... template with optional plotting flags

The discover-sc command writes a structured output directory (default: <adata_stem>_scFIRE/):

run/: run metadata
- command.txt (exact run command + ready-to-run draw-sc command)
- sc_plot_config.json
- <adata>.scfire.h5ad
discovery/: core discovery tables
- sc_motifs.tsv, sc_seed_screen.tsv, sc_stage3.tsv, sc_redundancy.tsv, sc_orientation_filter.tsv
- sc_logos/ (motif logo SVGs used by HTML report)
matrices/: plotting/report matrices
- sc_motif_cmi.matrix.tsv, sc_cell_metadata.tsv, sc_gene_motif_matrix.tsv
targets/: motif target summaries (sc_motif_targets.tsv)
interactions/: interaction outputs
- sc_interactions.tsv, sc_interactions.heatmap.pdf, sc_interactions.html, sc_interaction_maps/
annotation/: motif annotation/match reports
- sc_motif_annotations.tsv, sc_mirna_annotations.tsv, sc_motif_matches.tsv
enrichment/: GMT outputs
- sc_gmt_motifs.tsv, sc_gmt_pairs.tsv, sc_gmt_pairs.pdf
plots/: static plot outputs
- sc_motif_heatmap.pdf, sc_embedding_<motif>.pdf, sc_autocorrelation.pdf
- sc_group_enrichment_<motif>.pdf (per-motif barplots by group; includes significance)
- sc_group_enrichment_stats.tsv (Kruskal-Wallis + one-vs-rest Mann-Whitney p/FDR table)
sc_report.html at output root

sc_report.html includes:

embedding explorer with interactive point-size and color-range controls + viridis legend
motif table with logo, annotation, and top GMT hit
seed diagnostics table (top 20)
group enrichment barplot panel by selected groupby metadata column
- metrics: high-activity fraction (score >= motif P75) or mean motif score
- one-vs-rest significance stars from BH-FDR-adjusted Mann-Whitney tests

Use pyfire draw to re-render figures from saved results with different visual parameters:

pyfire draw results.tsv --format pdf --lp-max 10 --no-logos --cell-width 24

Draw flag	Description
`--format`	Output format: `pdf`, `png`, `html`, or `all` (default)
`--matrix-file`	Override enrichment matrix path (default: `<results>.matrix.tsv`)
`--lp-max`	Color scale range (±value, default 20)
`--sig-threshold`	Significance threshold before Bonferroni (default 0.05)
`--cell-width` / `--cell-height`	Cell dimensions in points
`--logo-width`	Sequence logo column width in points
`--no-logos` / `--no-stats` / `--no-scale` / `--no-expression-bars`	Toggle sections
`--logo-axes` / `--no-logo-axes`	Show or hide numbered x-axis and y-axis (bits) for motif logos
`--logo-height-fraction`	Set logo height relative to row height (0-1]
`--logo-axes-left-pad`	Extra left padding before logo column when axes are shown (pt)
`--header-min` / `--header-max`	Override global min/max used by continuous header bars
`--header-font-size` / `--stat-font-size`	Font sizes in points
`--dpi`	PNG resolution

Use pyfire draw-sc to re-render single-cell plots/reports without rerunning discovery:

pyfire draw-sc pbmc_scFIRE --groupby celltype --format all

IUPAC codes

pyFIRE uses the standard IUPAC nucleotide ambiguity codes:

Code	Bases	Code	Bases
A	A	M	A, C
C	C	R	A, G
G	G	W	A, T
T	T	S	C, G
K	G, T	Y	C, T
V	A, C, G	H	A, C, T
D	A, G, T	B	C, G, T
N	A, C, G, T

During optimization, the algorithm tries all 15 codes at each position, keeping the change that maximizes CMI. Positions derived from the original k-mer seed are constrained to contain the original nucleotide.

Architecture

src/pyfire/
├── core/                     # Computational engines (numba-accelerated)
│   ├── information.py        # MI, CMI, batch MI/CMI
│   ├── information_percell.py # Per-cell CMI for single-cell discovery
│   ├── quantization.py       # Equal-population binning (bulk + per-cell)
│   └── permutation.py        # Permutation null distributions, jackknife
├── motif/                    # Motif representations
│   ├── iupac.py              # IUPACPattern class
│   ├── kmer.py               # 2-bit k-mer encoding and counting
│   └── pwm.py                # Position weight matrices
├── sequence/                 # Sequence handling
│   ├── fasta.py              # FASTA reading (pyfaidx)
│   └── scanner.py            # Motif occurrence scanning
├── discovery/                # Bulk FIRE pipeline
│   ├── screen.py             # K-mer CMI screening
│   ├── optimize.py           # IUPAC greedy optimization
│   ├── significance.py       # Permutation + jackknife testing
│   ├── distribution.py       # Position/orientation bias
│   ├── combine.py            # Motif-motif interactions
│   └── pipeline.py           # End-to-end orchestrator
├── sc/                       # Single-cell mode
│   ├── discovery.py          # De novo SC motif discovery (MI/CMI pipeline)
│   ├── scoring.py            # Per-cell motif activity (sparse-aware)
│   ├── neighbors.py          # KNN graph construction
│   ├── autocorrelation.py    # Geary's C on KNN graph
│   ├── accessibility.py      # scATAC (placeholder)
│   └── pipeline.py           # Post-hoc SC scoring orchestrator
├── io/                       # I/O utilities
│   ├── expression.py         # Expression table reading
│   ├── anndata_bridge.py     # AnnData integration
│   └── results.py            # Result export + enrichment matrix load/save
├── viz/                      # Visualization
│   ├── config.py             # PlotConfig dataclass (maps to CLI flags)
│   ├── _colors.py            # Blue→Black→Yellow FIRE diverging colormap
│   ├── _logo.py              # IC-weighted sequence logos (matplotlib TextToPath)
│   ├── _layout.py            # MatrixLayout grid engine
│   ├── matrix.py             # EnrichmentMatrix + FIREMatrixPlot (PDF/PNG)
│   ├── html.py               # FIREMatrixHTML (self-contained interactive HTML)
│   └── matrix_sc.py          # SC heatmap, UMAP embedding, autocorrelation bars
└── cli/
    └── main.py               # Click CLI (discover, scan, discover-sc, draw)

Testing

pytest tests/ -v

165 tests covering MI/CMI mathematical properties (including per-cell CMI), IUPAC pattern operations, k-mer encoding round-trips, motif scanning, the full bulk discovery pipeline (with planted-motif recovery and no-signal controls), single-cell de novo discovery (metacell screening, per-cell CMI, optimizer, end-to-end pipeline with planted motif), single-cell scoring, Geary's C autocorrelation, visualization (color interpolation, sequence logos, matrix plots, HTML output, single-cell plots), and CLI commands.

References

Elemento, O., Slonim, N. & Tavazoie, S. A universal framework for regulatory element discovery across all genomes and data types. Molecular Cell 28, 337–350 (2007).
DeTomaso, D. & Yosef, N. Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell Systems 12, 446–456 (2021).

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
data_examples		data_examples
gmt		gmt
motif_databases		motif_databases
paper		paper
scripts		scripts
src/pyfire		src/pyfire
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
MANUAL.md		MANUAL.md
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

pyFIRE

How it works

Bulk mode

Single-cell mode (scRNA-seq)

Length bias correction

Installation

Requirements

Quick start

Command line

Extract promoter/UTR FASTA

1) Get annotation

2) (Optional) Build transcript score table for isoform-aware selection

3) Build promoters, 3'UTRs, and 5'UTRs

Python API — bulk discovery

Python API — single-cell de novo discovery

Python API — single-cell post-hoc scoring

Pseudo-bulk from single-cell data

Input formats

Sequences (FASTA)

Expression (TSV)

Motifs (for scanning)

Pipeline parameters

Bulk (discover)

Single-cell (discover-sc)

Output

Results TSV

Single-cell results (in AnnData)

Visualization

IUPAC codes

Architecture

Testing

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Bulk (`discover`)

Single-cell (`discover-sc`)

Packages