Synthetic benchmark #62

Merged
ArcaneEmergence merged 41 commits into main from experiment/synth_bench on Jun 19, 2026

Conversation

@ArcaneEmergence

Collaborator

Major changes in simulation and the model:

Model changes

  • Size factor
  • Clonotype median aggregation
  • Memory-efficient credible interval FDR control
  • Save and load the model using pickle (issues with JAX version compatibility)
  • Extreme outlier removal
  • k-means outlier handling now simply takes the three highest points if they are not already clustered
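
As an illustration of the clonotype median aggregation item, here is a minimal NumPy sketch. The function name `clonotype_median` and its return layout are hypothetical and not taken from this PR; the model itself uses JAX arrays and may aggregate differently.

```python
import numpy as np


def clonotype_median(values, clone_ids):
    """Hypothetical helper: median of `values` per clonotype.

    Returns the sorted unique clone labels, one median per label,
    and the medians broadcast back to one entry per cell.
    """
    values = np.asarray(values, dtype=float)
    clone_ids = np.asarray(clone_ids)
    # Map arbitrary clone labels to contiguous indices 0..K-1.
    uniq, inverse = np.unique(clone_ids, return_inverse=True)
    medians = np.array([np.median(values[inverse == k]) for k in range(len(uniq))])
    return uniq, medians, medians[inverse]


uniq, med, per_cell = clonotype_median([1, 3, 10, 20, 30, 5], [7, 7, 2, 2, 2, 9])
```

The per-cell output keeps the original cell order, so it can be used alongside the raw counts without re-sorting.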

Simulator

  • Component variance is now sampled in relation to the mean
  • Ensure the sampled binder fraction and outlier proportion are close to the user-specified parameters
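
One way to keep realized proportions close to the requested ones is to assign a fixed number of labels by permutation instead of drawing each label independently. The sketch below assumes that approach; `assign_labels`, its label codes, and the choice to make binders and outliers disjoint are illustrative assumptions, not the simulator's actual code.

```python
import numpy as np


def assign_labels(n_cells, binder_frac, outlier_frac, seed=0):
    """Hypothetical helper: 0 = non-binder, 1 = binder, 2 = outlier."""
    n_binder = int(round(n_cells * binder_frac))
    n_outlier = int(round(n_cells * outlier_frac))
    if n_binder + n_outlier > n_cells:
        raise ValueError("binder and outlier fractions exceed the population")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cells)  # random order, each cell exactly once
    labels = np.zeros(n_cells, dtype=int)
    labels[order[:n_binder]] = 1
    labels[order[n_binder:n_binder + n_outlier]] = 2
    return labels


labels = assign_labels(1000, binder_frac=0.05, outlier_frac=0.01)
```

With independent Bernoulli draws the realized binder count would fluctuate around 50; here it is fixed up to rounding.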

Snakemake

  • Improved snakemake pipeline

Figure notebook

  • Added notebooks to reproduce figures
  • Added minimal set of csv files to recreate Figure 2

…ve and show figure optional, 3. legend frameon off and add titles, 4. changed indentation to spaces
…timized for HPC to avoid spawning many small jobs, instead uses multiprocessing to run in parallel. 2. Additional logging of figs and metrics 3. Changed folder structure depending on scenario and configuration in a yaml
- size factor calculation
- alpha offset
- incorporating negative control in a more flexible way
- exponential lr decay
- flag to log performance during training
- various clonotype info incorporations
- small improvements of plotting
… training data if zvalue > 100, but keep it for prediction

Copilot AI left a comment

Pull request overview

This PR introduces a revamped “synthetic benchmark” workflow (Snakemake + Slurm wrappers) and updates core simulation/model utilities to support new benchmarking requirements (e.g., variance sampling tied to mean, outlier handling, clonotype aggregation, and model pickling).

Changes:

  • Replace legacy synthetic benchmark snakefiles/slurm scripts with a scenario-driven snakefile_benchmark.smk + slurm_benchmark.sh and per-scenario YAML configs.
  • Add new simulator behavior (variance-from-mean model; tighter control of binder/outlier proportions) and new benchmark runner scripts (simulate_data.py, updated run_dextrademixer.py, new run_beam.py).
  • Extend core library utilities and the DextraDemixer model (metrics aggregation helpers, new preprocessing knobs, saving/loading via pickle, and updates to posterior/FDR logic).
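
The variance-from-mean model is described in the simulator docstring in this PR as `log(var) = a + b*log(mean) + Normal(0, resid_std^2)`. The sketch below follows that formula literally, treating `a` as the intercept on the log scale; the function name `sample_variance` and the use of `np.random.default_rng` for the `rng` argument are assumptions, and the defaults are copied from the docstring.

```python
import numpy as np


def sample_variance(mean, a=2.0221541172111164, b=1.6969075027280063,
                    resid_std=0.31049623532404225, rng=42):
    """Sketch: draw variances from a log-log linear model with Gaussian noise."""
    mean = np.asarray(mean, dtype=float)
    if np.any(mean <= 0):
        raise ValueError("mean must be strictly positive")
    gen = np.random.default_rng(rng)  # accepts an int seed, None, or a Generator
    noise = gen.normal(0.0, resid_std, size=mean.shape)
    # Exponentiate the log-scale prediction to get back to the variance scale.
    return np.exp(a + b * np.log(mean) + noise)


means = np.array([1.0, 10.0, 100.0])
variances = sample_variance(means)
noise_free = sample_variance(means, resid_std=0.0)
```

Setting `resid_std=0.0` reduces the model to the deterministic power law `exp(a) * mean**b`, which is a convenient sanity check.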

Reviewed changes

Copilot reviewed 35 out of 41 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| experiments/synthetic_benchmark/snakefile_run_timing | Removed legacy timing workflow snakefile |
| experiments/synthetic_benchmark/snakefile_run_simulation | Removed legacy simulation workflow snakefile |
| experiments/synthetic_benchmark/snakefile_benchmark.smk | New scenario-driven benchmark workflow (simulate → run tools → aggregate) |
| experiments/synthetic_benchmark/slurm_run_timing.sh | Removed legacy Slurm wrapper |
| experiments/synthetic_benchmark/slurm_run_simulation.sh | Removed legacy Slurm wrapper |
| experiments/synthetic_benchmark/slurm_benchmark.sh | New unified Slurm submission wrapper for benchmark scenarios |
| experiments/synthetic_benchmark/simulate_data.py | New CLI entrypoint for simulation generation |
| experiments/synthetic_benchmark/run_dextramixerkmeans.py | Removed legacy runner |
| experiments/synthetic_benchmark/run_dextramixer.py | Removed legacy runner |
| experiments/synthetic_benchmark/run_dextrademixer.py | Updated runner to new model/config + richer metric logging |
| experiments/synthetic_benchmark/run_beamt.py | Removed legacy runner |
| experiments/synthetic_benchmark/run_beam.py | New BEAM runner producing benchmark CSV outputs |
| experiments/synthetic_benchmark/environment.yaml | Removed old per-experiment conda environment |
| experiments/synthetic_benchmark/create_data_mean_variance_fold_increase.py | Removed old simulation script |
| experiments/synthetic_benchmark/benchmarks/synth_benchmark/config.yaml | New benchmark scenario configuration |
| experiments/synthetic_benchmark/benchmarks/scaling/config.yaml | New scaling scenario configuration |
| experiments/synthetic_benchmark/benchmarks/dropout/config.yaml | New dropout scenario configuration |
| experiments/synthetic_benchmark/aggregate_results.py | Simplified aggregation via shared aggregate_csv utility |
| experiments/hyperparameter_tuning/snakefile_run_optuna_multi_at_once | Removed legacy optuna workflow |
| experiments/hyperparameter_tuning/snakefile_run_optuna | Removed legacy optuna workflow |
| experiments/hyperparameter_tuning/slurm_run_optuna_multi_at_once.sh | Removed legacy optuna Slurm wrapper |
| experiments/hyperparameter_tuning/optuna_dextrademixer.py | Removed legacy optuna driver |
| experiments/hyperparameter_tuning/environment_optuna.yaml | Removed legacy optuna environment |
| experiments/hyperparameter_tuning/create_data_mean_variance_fold_increase.py | Removed legacy simulation script |
| experiments/hyperparameter_tuning/aggregate_results.py | Removed legacy aggregation script |
| experiments/.slurm/config.yaml | Updated Slurm profile submission template/resources |
| experiments/.slurm_one_node/status.py | New status checker script for one-node profile |
| experiments/.slurm_one_node/config.yaml | New one-node Slurm profile |
| environment_minimal.yaml | New pinned “minimal” environment for benchmarks/tests |
| environment_full.yaml | New pinned “full” environment export |
| env_minimal.def | New Apptainer definition for minimal env |
| env_full.def | New Apptainer definition for full env |
| dextrademixer/utils/utils.py | Added shared aggregation + metrics + Slurm helpers |
| dextrademixer/utils/simulation.py | Updated simulator: variance sampling + binder/outlier proportion control |
| dextrademixer/model/Dextrademixer.py | Model enhancements: size factors, outlier filtering, save/load, posterior logic changes |
| .gitignore | Updated ignore patterns for experiment outputs/assets |

Comment thread experiments/synthetic_benchmark/run_dextrademixer.py
Comment thread experiments/synthetic_benchmark/run_dextrademixer.py
Comment thread experiments/synthetic_benchmark/run_beam.py
Comment on lines +83 to +90
output:
    protected(
        "benchmarks/{scenario}/csv/Dextra{model_config}-"  # wildcard cannot be empty, therefore have to use a small hack here
        "{N},0.4,{po},{p},False,{mean_inc},None,{i}.csv",
    )
params:
    neg_ctrl_key=lambda wc: "neg_control" if wc.model_config == "Demixer+neg." else "None"
resources:
Comment on lines +41 to +45
resources:
    c=1,
    mem="8000M",
    node="",
    qos="cpu_preemptible" if PREEMPTIBLE else "cpu_normal",
Comment on lines 925 to 944
  clone = None if c is None else jnp.array(c, dtype=INT_DTYPE)
- self.data = {"x": jnp.array(x, dtype=INT_DTYPE),
-              "s": None if s is None else jnp.array(s, dtype=FLOAT_DTYPE),
-              "x_neg": None if neg_cont is None else jnp.array(neg_cont, dtype=FLOAT_DTYPE),
-              "clone": clone,
-              # If clone is not contiuous, then there will be problems with indexing
-              "clone_continuous": None if clone is None else jnp.searchsorted(jnp.unique(clone), clone),
-              "sigma": None if sigma is None else jnp.array(sigma, dtype=FLOAT_DTYPE),
+ zscore = jnp.abs((x - jnp.mean(x)) / jnp.std(x))
+ outlier_threshold = 100  # TODO Hardcoded
+ # With outliers
+ self.data_full = {"x": jnp.array(x, dtype=INT_DTYPE),
+                   "s": None if s is None else jnp.array(s, dtype=FLOAT_DTYPE),
+                   "x_neg": None if neg_cont is None else jnp.array(neg_cont, dtype=FLOAT_DTYPE),
+                   "clone": clone,
+                   # If clone is not contiuous, then there will be problems with indexing
+                   "clone_continuous": None if clone is None else jnp.searchsorted(jnp.unique(clone), clone),
+                   "sigma": None if sigma is None else jnp.array(sigma, dtype=FLOAT_DTYPE),
+                   }
+ # Without outliers
+ self.data = {"x": jnp.array(x[jnp.where(zscore < outlier_threshold)], dtype=INT_DTYPE),
+              "s": jnp.array(s[jnp.where(zscore < outlier_threshold)], dtype=FLOAT_DTYPE) if s is not None else None,
+              "x_neg": jnp.array(neg_cont[jnp.where(zscore < outlier_threshold)], dtype=FLOAT_DTYPE) if neg_cont is not None else None,
+              "clone": jnp.array(clone[jnp.where(zscore < outlier_threshold)], dtype=INT_DTYPE) if clone is not None else None,
+              "clone_continuous": None if clone is None else jnp.searchsorted(jnp.unique(clone), clone[jnp.where(zscore < outlier_threshold)]),
+              "sigma": None if sigma is None else jnp.array(sigma, dtype=FLOAT_DTYPE)[jnp.where(zscore < outlier_threshold)],
               }
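
The filtering above recomputes the same `jnp.where(...)` selection for every field and indexes the kept clones against the unique labels of the full data. A possible alternative, sketched here in NumPy under the assumption that `x` is one-dimensional, computes the boolean mask once and re-indexes clones against the labels that survive filtering. The helper name `split_outliers` and its return layout are hypothetical.

```python
import numpy as np


def split_outliers(x, clone=None, threshold=100.0):
    """Hypothetical helper: drop |z| >= threshold and re-index clone labels."""
    x = np.asarray(x, dtype=float)
    zscore = np.abs((x - x.mean()) / x.std())
    keep = zscore < threshold  # boolean mask, computed once and reused
    out = {"x": x[keep], "keep": keep, "clone": None, "clone_continuous": None}
    if clone is not None:
        kept_clone = np.asarray(clone)[keep]
        out["clone"] = kept_clone
        # Index against the surviving labels so the result is 0..K-1 with no gaps.
        out["clone_continuous"] = np.searchsorted(np.unique(kept_clone), kept_clone)
    return out


x = np.array([1.0, 2.0, 3.0, 1000.0])
clone = np.array([5, 5, 8, 2])
result = split_outliers(x, clone, threshold=1.5)
# Indexing against the unfiltered labels leaves a gap when a whole clone is removed.
gapped = np.searchsorted(np.unique(clone), clone[result["keep"]])
```

In this toy input the removed point is the only member of clone 2, so indexing against the unfiltered labels yields indices starting at 1 rather than 0. Note also that with a population standard deviation (the default for `jnp.std`), a single value's |z| cannot exceed (n-1)/sqrt(n), so a threshold of 100 can only trigger once there are more than about 10,000 observations.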
Comment on lines 1121 to 1122
super().preprocess_model_data(x=x, s=s, neg_cont=neg_cont, c=c, sigma=sigma, mode=mode,
                              alpha_model=alpha_model, **kwargs)
Comment on lines +233 to +236
  def fit_svi(self, guide='normal', svi_config: Dict[str, Union[int, float]] = None,
              nof_inits: int = 100, use_minimal_loss: bool = True, rng_key: int = 998777,
-             return_loss: bool = False) \
-         -> az.InferenceData:
+             y_true: Array = None) \
+         -> az.InferenceData:
Comment on lines +44 to +62
"""
Sample a realistic variance given a mean using the fitted power-law model:
    log(var) = a + b*log(mean) + Normal(0, resid_std^2)

Args:
    mean : float or np.ndarray
        Mean(s) at which to sample the variance. Must be > 0; broadcasting allowed.
    a : float, default 2.0221541172111164
        Proportionality constant (exp(intercept) from log–log OLS).
    b : float, default 1.6969075027280063
        Scaling exponent (slope from log–log OLS).
    resid_std : float, default 0.31049623532404225
        Residual standard deviation on the *log-variance* scale (σ from OLS residuals).
    rng : int | np.random.RandomState, default 42
        Source of randomness. If int, used as the seed. If None, uses SciPy/Numpy default RNG.
Returns:
    float or np.ndarray
        A sample of variance values with the same broadcasted shape as `mean`.
"""
Comment on lines +148 to +159
  if use_size_factor:
      pmhc_list = use_size_factor if isinstance(use_size_factor, list) else mdata[gex_key].var_names.tolist()
      x_plus = jnp.array(gex[:, pmhc_list].X.toarray(),
                         dtype=FLOAT_DTYPE)  # only used for size factor calculation
      s = self.calculate_size_factors(x_plus)
      del x_plus
  else:
      s = jnp.ones(x.shape[0], dtype=FLOAT_DTYPE)

  self._check_parameters(x, x_neg, c, sigma)
- self.model.preprocess_model_data(x=x, neg_cont=x_neg, c=c, sigma=sigma, mode=self.mode,
-                                  alpha_model=self.alpha_model, **kwargs)
+ self.model.preprocess_model_data(x=x, s=s, neg_cont=x_neg, c=c, sigma=sigma, mode=self.mode,
+                                  alpha_model=self.alpha_model, outlier_threshold=outlier_threshold, **kwargs)
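
The body of `calculate_size_factors` is not shown in this thread, so the following is only a generic illustration of what a per-cell size factor can look like: library-size normalization against the median total count. The function name `library_size_factors` and the handling of all-zero rows are assumptions and may differ from what the model does.

```python
import numpy as np


def library_size_factors(counts):
    """Hypothetical sketch: per-cell total count divided by the median total."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)  # one total per cell (row)
    positive = totals > 0
    if not positive.any():
        raise ValueError("all cells have zero counts")
    reference = np.median(totals[positive])
    # Cells with zero counts keep a neutral factor of 1.0.
    factors = np.ones_like(totals)
    factors[positive] = totals[positive] / reference
    return factors


factors = library_size_factors([[1, 1], [2, 2], [4, 4], [0, 0]])
```

A factor of 1.0 matches the `jnp.ones(...)` fallback used above when size factors are disabled, so downstream code can treat both cases uniformly.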
@ArcaneEmergence ArcaneEmergence marked this pull request as ready for review June 19, 2026 11:05
@ArcaneEmergence ArcaneEmergence merged commit 355267a into main Jun 19, 2026
1 check passed
2 participants