LAMF: Liquid Audience Measurement Framework

Reference implementation and reproducibility package for the paper "Liquid Audiences: Measuring Intent-Based Segmentation in Real Time" (Kasimov, 2025).

This repository provides the core framework described in the manuscript, the policies evaluated in Sections 5.1–5.5, the evaluation machinery (bootstrap CIs, Friedman omnibus, Nemenyi post-hoc), and scripts that reproduce every numerical result and figure in the paper.

What this is

LAMF is a measurement and governance layer for real-time intent-based segmentation. The paper makes three concrete claims:

Intent can be represented as a decaying segment-of-one state (half-life, TTL).
Governed policies (consent, caps, do-nothing, capped exploration) reduce trust-damaging contacts relative to always-on baselines.
The same governance pattern holds across three public benchmarks (MovieLens 1M, RetailRocket, REES46 Cosmetics) and a controlled simulation.

This repository is the executable version of those claims. A reviewer can clone it, install the dependencies, run the tests, run the benchmarks, and compare the numerical output against the tables and figures in the paper.
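
The first claim (intent as a decaying state with a half-life and a TTL) can be illustrated with a minimal sketch. This is not the code in lamf/intent.py, and Eq. 4 is not reproduced here: the class name DecayedIntent, the exponential form 0.5 ** (age / half_life), and the hard-expiry TTL rule are assumptions made for illustration only.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400.0


@dataclass
class DecayedIntent:
    """Hypothetical single-category intent score with half-life decay."""

    half_life_days: float
    ttl_days: float
    value: float = 0.0
    last_ts: float = 0.0  # seconds; events are assumed to arrive in time order

    def read(self, ts: float) -> float:
        """Return the score as seen at time `ts` without changing the state."""
        age_days = (ts - self.last_ts) / SECONDS_PER_DAY
        if age_days > self.ttl_days:
            return 0.0  # assumed TTL rule: stale state expires outright
        return self.value * 0.5 ** (age_days / self.half_life_days)

    def update(self, ts: float, weight: float) -> float:
        """Decay the stored score to `ts`, then add the new event weight."""
        self.value = self.read(ts) + weight
        self.last_ts = ts
        return self.value


state = DecayedIntent(half_life_days=1.0, ttl_days=30.0)
state.update(ts=0.0, weight=1.0)
one_day_later = state.read(ts=SECONDS_PER_DAY)  # one half-life has elapsed
after_second_event = state.update(ts=SECONDS_PER_DAY, weight=1.0)
```

With a one-day half-life, a unit-weight event is worth half as much a day later, so a second unit-weight event at that point brings the score to 1.5 rather than 2.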

Relationship to the paper

Paper section	Code
Section 3.2, Eq. 4 (decayed accumulator)	`lamf/intent.py`
Section 3.5 (four policies)	`lamf/policies.py`
Section 3.6 (evaluation protocol)	`lamf/evaluation.py`
Section 5.1–5.2 (simulation)	`lamf/simulation.py`
Section 5.3 (MovieLens 1M replay)	`scripts/run_benchmark.py --dataset movielens`
Section 5.4 (RetailRocket replay)	`scripts/run_benchmark.py --dataset retailrocket`
Section 5.5 (REES46 Cosmetics replay)	`scripts/run_benchmark.py --dataset rees46`
Friedman + Nemenyi (Table S2)	`lamf.evaluation.friedman_nemenyi`

Repository layout

lamf/                       # Core package
├── intent.py               # Decayed accumulator (Eq. 4), IntentState class
├── policies.py             # Static, Triggers, Liquid+Rules, Liquid+Bandits
├── evaluation.py           # Event-log replay, bootstrap, Friedman, Nemenyi
├── simulation.py           # Section 5.1 illustrative simulation
└── datasets.py             # Loaders for MovieLens, RetailRocket, REES46

scripts/
├── run_benchmark.py        # Unified CLI runner for all four experiments
└── reproduce_all.sh        # One-shot reproduction of all paper results

tests/
└── test_lamf.py            # 30 verification tests (pytest)

configs/
└── default.yaml            # Default hyperparameters (theta, half-lives, etc.)

docs/
├── DATASETS.md             # How to obtain each public dataset
└── EXTENDING.md            # How to add a new dataset or policy

examples/
└── rees46_quickstart.py    # Minimal working example end-to-end

data/raw/                   # Place downloaded datasets here (see DATASETS.md)
results/                    # Benchmark outputs land here

Quick start

Prerequisites

Python 3.9 or later
~4 GB RAM (REES46 is the largest workload)
Optional: ~8 GB disk space if all three public datasets are downloaded

Installation

git clone https://github.com/furkatkasimov/lamf.git
cd lamf
pip install -r requirements.txt
pip install -e .
# Or, to also install development tools:
pip install -e ".[dev]"

Verify installation (recommended before running benchmarks)

pytest tests/ -v

All 30 tests should pass in under 60 seconds.

Run the illustrative simulation

The simulation needs no external data and takes ~1 minute:

python scripts/run_benchmark.py --dataset simulation --output-dir results/simulation

This reproduces the simulation rows of Table S1 and Table S2. The data underlying Figure 2 is saved as sim_per_seed.csv; the bar chart itself must be plotted separately (see "Output format" below).

Run a public-data replay (REES46 Cosmetics example)

Download the REES46 Cosmetics Shop dataset from Kaggle (kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop) and unzip the CSV files into data/raw/rees46/. Then run:

python scripts/run_benchmark.py \
    --dataset rees46 \
    --data-dir data/raw/rees46 \
    --output-dir results/rees46

Runtime: ~1 minute on a modern laptop. See docs/DATASETS.md for the other datasets.

Reproduce every result in the paper

Once all datasets are in place:

bash scripts/reproduce_all.sh

This runs the simulation, MovieLens 1M, RetailRocket, and REES46 benchmarks in sequence and writes results to results/.

Output format

The replay benchmarks (MovieLens, RetailRocket, REES46) and the simulation benchmark write different sets of files, because the simulation aggregates across seeds rather than users.

Replay benchmarks (MovieLens, RetailRocket, REES46)

File	Description
`per_user.csv`	Raw per-user metric values
`summary.csv`	Policy means with 95% bootstrap CIs
`pvals.csv`	Paired-bootstrap P-values vs. Triggers
`friedman.csv`	Friedman omnibus ($\chi^2_F$, df, p, Kendall's W)
`posthoc_conv.csv`	Nemenyi pairwise adj. p-values — conversion
`posthoc_trust.csv`	Nemenyi pairwise adj. p-values — trust
`posthoc_vol.csv`	Nemenyi pairwise adj. p-values — volume
`fig_<name>_conversion.png`	Conversion bar chart with 95% CI
`fig_<name>_trust.png`	Trust bar chart with 95% CI
`fig_<name>_volume.png`	Volume bar chart with 95% CI
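
For readers who want to sanity-check the intervals in summary.csv against per_user.csv, the following is a minimal sketch of a percentile bootstrap for a mean. It is an assumption that the repository uses the percentile method; bootstrap_mean_ci is a hypothetical helper, not lamf.evaluation.bootstrap_summary, and its defaults merely echo the n_bootstrap=500 and seed 42 values mentioned elsewhere in this README.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_bootstrap: int = 500,
    alpha: float = 0.05,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) from a percentile bootstrap over users."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    n = len(values)
    # Resample users with replacement and record the mean of each resample.
    resampled_means: List[float] = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    lo_idx = int((alpha / 2) * n_bootstrap)
    hi_idx = min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))
    return mean(values), resampled_means[lo_idx], resampled_means[hi_idx]
```

Because the generator is seeded locally, calling the function twice on the same values returns the same interval.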

Simulation benchmark

File	Description
`sim_per_seed.csv`	Per-seed conversion and trust-event rates per policy
`sim_friedman.csv`	Friedman omnibus ($\chi^2_F$, df, p, Kendall's W)
`sim_posthoc_conv.csv`	Nemenyi pairwise adj. p-values — conversion
`sim_posthoc_trust.csv`	Nemenyi pairwise adj. p-values — trust
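
The omnibus columns in friedman.csv and sim_friedman.csv can be related to each other with a short sketch. The code below computes the textbook Friedman statistic and Kendall's W from a blocks-by-policies matrix (users or seeds as blocks). It is a hedged illustration rather than lamf.evaluation.friedman_nemenyi: it omits the tie correction that scipy.stats.friedmanchisquare applies, and it does not compute the p-value, because the standard library has no chi-square CDF.

```python
from typing import List, Sequence, Tuple


def average_ranks(row: Sequence[float]) -> List[float]:
    """Rank one block's values (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(
    blocks: Sequence[Sequence[float]],
) -> Tuple[float, int, float]:
    """Return (chi2_F, df, Kendall's W) for n blocks x k policies."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
    chi2 -= 3.0 * n * (k + 1)
    return chi2, k - 1, chi2 / (n * (k - 1))
```

Kendall's W is the statistic rescaled to the range 0 to 1, so it reads as an effect size: 1 means every block ranks the policies identically.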

The simulation does not produce bar-chart PNGs directly. Figure 2 from the paper can be reconstructed from sim_per_seed.csv using any plotting tool; mean and standard deviation across seeds are the relevant aggregates.
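
A minimal sketch of that aggregation step, using only the standard library. The column names policy, seed, conversion, and trust are assumptions about the layout of sim_per_seed.csv (check the header of your own output), and the inline CSV is a synthetic stand-in whose numbers are placeholders, not paper results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Mapping, Tuple


def aggregate_per_seed(
    rows: Iterable[Mapping[str, str]], metric: str
) -> Dict[str, Tuple[float, float]]:
    """Map each policy to (mean, sample std) of `metric` across seeds."""
    by_policy = defaultdict(list)
    for row in rows:
        by_policy[row["policy"]].append(float(row[metric]))
    return {
        policy: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for policy, vals in by_policy.items()
    }


# Synthetic stand-in for sim_per_seed.csv (assumed long format, made-up values).
demo_csv = io.StringIO(
    "policy,seed,conversion,trust\n"
    "A,0,0.10,0.30\n"
    "A,1,0.20,0.50\n"
    "B,0,0.40,0.00\n"
    "B,1,0.40,0.00\n"
)
bars = aggregate_per_seed(csv.DictReader(demo_csv), metric="conversion")
```

To run it on real output, pass csv.DictReader over the opened sim_per_seed.csv file instead of demo_csv; each (mean, std) pair gives one bar height and its error bar.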

How to add a new dataset

The expected input schema is a pandas DataFrame with columns:

user_id (int), item_id (int), ts (float, seconds), cat_idx (int), weight (float)

The pipeline also needs a dictionary item_to_cat_idx: Dict[int, int] (item ID to category index) and the number of categories. To add a new dataset:

Write a loader function in lamf/datasets.py that returns (events_df, item_to_cat_idx, n_categories).
Add a CLI option in scripts/run_benchmark.py.
Run the same evaluation pipeline:

from lamf.evaluation import evaluate_policies_on_event_log, bootstrap_summary, friedman_nemenyi
from lamf.datasets import load_your_new_dataset

events, item_to_cat, n_cats = load_your_new_dataset("data/raw/your_dataset")

per_user = evaluate_policies_on_event_log(
    events=events,
    item_to_cat_idx=item_to_cat,
    n_categories=n_cats,
    half_life_fast_days=1.0,   # tune for your domain
    half_life_slow_days=7.0,   # tune for your domain
    theta=0.55,
    theta_safe=0.65,
    top_k=10,
)

summary, pvals = bootstrap_summary(per_user, n_bootstrap=500)
omnibus, posthocs = friedman_nemenyi(per_user)

See docs/EXTENDING.md for a worked example adding the Taobao UserBehavior dataset.
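
A loader also has to produce item_to_cat_idx and n_categories. The sketch below shows one way to derive both from raw (item_id, category label) pairs. build_category_index is a hypothetical helper, not part of lamf/datasets.py, and it assumes that category indices should be contiguous integers starting at 0 and that an item keeps the first category it is seen with.

```python
from typing import Dict, Hashable, Iterable, Tuple


def build_category_index(
    item_categories: Iterable[Tuple[int, Hashable]],
) -> Tuple[Dict[int, int], int]:
    """Map raw category labels to indices 0..n-1 in first-seen order."""
    label_to_idx: Dict[Hashable, int] = {}
    item_to_cat_idx: Dict[int, int] = {}
    for item_id, label in item_categories:
        # A new label gets the next free index; a known label keeps its index.
        idx = label_to_idx.setdefault(label, len(label_to_idx))
        # Assumption: an item keeps the first category it was seen with.
        item_to_cat_idx.setdefault(item_id, idx)
    return item_to_cat_idx, len(label_to_idx)


item_to_cat, n_cats = build_category_index(
    [(10, "skincare"), (11, "makeup"), (12, "skincare"), (10, "makeup")]
)
```

The same label-to-index lookup can then fill the cat_idx column of the events DataFrame so that both structures agree.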

How to tune for a different domain

The following hyperparameters control policy behaviour; all are listed in configs/default.yaml:

Parameter	Default	Meaning
`half_life_fast_days`	1.0	Decay half-life for the fast/exploration accumulator
`half_life_slow_days`	7.0	Decay half-life for the slow/stable accumulator
`theta`	0.55	Activation threshold (min confidence to act)
`theta_safe`	0.65	Safety threshold (single vs top-k behaviour)
`top_k`	10	Size of the recommendation list above theta_safe
`bandit_epsilon`	0.05	Bandit exploration cap
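
How theta, theta_safe, and top_k might interact can be sketched as a three-band gate. This is a reading of the table above, not the logic in lamf/policies.py: the function name gated_recommendation, the single-item behaviour between the two thresholds, and the treatment of the boundaries as inclusive lower bounds are all assumptions.

```python
from typing import List, Sequence


def gated_recommendation(
    ranked_items: Sequence[int],
    confidence: float,
    theta: float = 0.55,
    theta_safe: float = 0.65,
    top_k: int = 10,
) -> List[int]:
    """Return the items a threshold-gated policy would emit (possibly none)."""
    if confidence < theta:
        return []  # below the activation threshold: do nothing
    if confidence < theta_safe:
        return list(ranked_items[:1])  # assumed: one item in the cautious band
    return list(ranked_items[:top_k])  # above the safety threshold: top-k list


candidates = list(range(20))  # placeholder item IDs, already ranked
silent = gated_recommendation(candidates, confidence=0.50)
single = gated_recommendation(candidates, confidence=0.60)
full = gated_recommendation(candidates, confidence=0.70)
```

Raising theta widens the do-nothing band and lowers volume; raising theta_safe keeps more decisions in the single-item band.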

Guidance from Section 5 (these values are also defined as dataset overrides in configs/default.yaml, which is the authoritative source):

Fast e-commerce browsing: half_life_fast ≈ 1 day, half_life_slow ≈ 7 days
Movie ratings / slower content consumption: half_life_fast ≈ 3 days, half_life_slow ≈ 14 days
Urgency-driven intents (price-checks): use shorter half-lives
Replenishment intents: use longer half-lives (weeks)

Interpreting the metrics

The three proxies reported are deliberately simple so they transfer across domains. They are not a substitute for field measurement; they diagnose the governance pattern of a policy, not its real-world profit.

Conversion proxy: fraction of decision points where the user's next-interacted item appears in the recommended set.
Trust proxy: fraction of decisions where the top-recommended item lies outside the user's two strongest slow-decay categories (an "off-category contact"); lower is better. A near-zero trust-proxy value for governed policies is partly a design consequence of confidence gating and category-aligned recommendation; interpret accordingly.
Volume proxy: fraction of decision points where the policy emits any recommendation.
Net reward (for framing only): conversion − 0.4 × trust. The 0.4 weight is a transparent compromise, not tuned; reweighting can change which policy has the highest net reward without new data.
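
The definitions above can be sketched on a toy decision log. The Decision record and proxy_metrics function are hypothetical names, not the repository's data model, and the sketch assumes that all three proxies share the same denominator (every decision point, including those where nothing is emitted).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Sequence


@dataclass
class Decision:
    """One decision point; field names are illustrative only."""

    recommended: List[int]          # emitted item IDs, empty if nothing was sent
    next_item: int                  # the item the user interacted with next
    top_category: Optional[int]     # category of the top-recommended item
    strong_categories: FrozenSet[int]  # user's two strongest slow-decay categories


def proxy_metrics(
    decisions: Sequence[Decision], trust_weight: float = 0.4
) -> Dict[str, float]:
    """Compute conversion, trust, volume, and net reward over decision points."""
    n = len(decisions)
    conversion = sum(d.next_item in d.recommended for d in decisions) / n
    trust = sum(
        bool(d.recommended) and d.top_category not in d.strong_categories
        for d in decisions
    ) / n
    volume = sum(bool(d.recommended) for d in decisions) / n
    return {
        "conversion": conversion,
        "trust": trust,
        "volume": volume,
        "net_reward": conversion - trust_weight * trust,
    }


strong = frozenset({0, 1})
log = [
    Decision([1, 2], next_item=2, top_category=0, strong_categories=strong),
    Decision([3], next_item=9, top_category=5, strong_categories=strong),
    Decision([], next_item=4, top_category=None, strong_categories=strong),
    Decision([7], next_item=7, top_category=1, strong_categories=strong),
]
metrics = proxy_metrics(log)
```

On this toy log two of four decisions convert, one is an off-category contact, and three emit something, giving conversion 0.5, trust 0.25, volume 0.75, and net reward 0.5 − 0.4 × 0.25 = 0.4. Passing a different trust_weight shows how the ranking of policies can shift without new data.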

Reproducibility notes

All random seeds are fixed (simulation seeds 0–9; bootstrap seed 42; evaluation seed 0). Running the same benchmark twice should produce byte-identical output.
The REES46 loader subsamples 15,000 users by default (random_seed=42); this matches the paper. Different subsamples produce slightly different numbers but the qualitative findings hold (see paper limitation 8).
Results in results/ match the tables and figures in the paper to within floating-point tolerance.
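
The seeded-subsample idea in the notes above can be sketched as follows. subsample_users is a hypothetical helper, not the REES46 loader itself, and it will not necessarily select the same users as the repository; it only illustrates how fixing the seed and sorting the IDs first makes the draw independent of file order.

```python
import random
from typing import Iterable, List


def subsample_users(
    user_ids: Iterable[int], n_users: int = 15_000, seed: int = 42
) -> List[int]:
    """Pick a reproducible subset of users (all users if there are too few)."""
    unique_sorted = sorted(set(user_ids))  # sort so the draw ignores input order
    if len(unique_sorted) <= n_users:
        return unique_sorted
    # A local Random instance avoids touching the global random state.
    return sorted(random.Random(seed).sample(unique_sorted, n_users))


forward = subsample_users(range(1000), n_users=100)
backward = subsample_users(reversed(range(1000)), n_users=100)
```

Here forward and backward contain the same 100 users even though the input order differs, which is the property a byte-identical rerun depends on.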

Citation

If you use this code, please cite:

@article{kasimov2025liquid,
  title   = {Liquid Audiences: Measuring Intent-Based Segmentation in Real Time},
  author  = {Kasimov, Furkat},
  year    = {2025},
  note    = {Preprint},
}

License

MIT. See LICENSE.

Contact

For questions about the framework or the paper, open an issue on this repository or contact the author via the ORCID on the manuscript.
