Honest headline. This program produced 0 deployable, 0 paper, and 0 live strategies. That is not the failure of the project — it is the result. The deliverable is (1) a rigorous, leakage-safe validation harness and (2) the reproducible, documented map of which data changes the answer. Under a strict zero-cost data constraint, no taker-tradable alpha survived out-of-sample validation. The binding constraint is the information set, not the method.
QuantLab Alpha is a local-first, stage-gated alpha research platform for a single operator on commodity Apple-Silicon hardware. Its architecture is four staged subsystems: an S1 tabular predictor, an S2 LLM governor that vetoes signals not supported by a local research corpus, S3 free-data feeds and broker abstractions, and S4 promotion-gated execution with a file-based kill switch and an append-only, byte-for-byte replayable audit log. The platform's delivered scientific contribution is not a profitable strategy. It is a reproducible validation harness — purged/embargoed cross-validation, weighted zero-mean R², adversarial feature validation, a seeded noise floor, probability of backtest overfitting, deflated Sharpe ratios, and stationary-bootstrap confidence intervals — together with the honest finding that under a zero-cost data constraint no taker-tradable alpha survives out-of-sample validation. Across an S1 tabular predictor and roughly thirteen closed signal-research branches culminating in a delta-neutral funding-carry capstone, the binding constraint is shown to be the information set, not the modelling method. We further show (§6) that this conclusion is predicted by four independent strands of the literature — post-publication anomaly decay, backtest-overfitting theory, the Grossman–Stiglitz/Berk–Green equilibrium, and the empirical failure of crowdsourced retail alpha — and that no public quantitative system survives the same validation gate. This repository is research-only: it reports 0 deployable strategies and authorizes no paper or live trading.
The problem is the standard one stated honestly: can a single operator, using only free or self-authenticated data and local compute, build something that survives hedge-fund-grade out-of-sample validation? Most retail "alpha" evaporates the moment transaction costs, survivorship bias, and overfitting controls are applied. The contribution of this work is to apply those controls rigorously, document every failure, and isolate why each one fails.
The platform rests on a two-layer thesis: numeric prediction and evidence-based governance should be orthogonal responsibilities.
- S1 is the only authoritative source of numeric forecasts. It produces a prediction and a confidence; it never decides to trade.
- S2 is an LLM governor that may only veto or explain. Every `pass` verdict must cite at least one chunk of the local research corpus or be downgraded to `insufficient_evidence`. The governor never originates a trade.
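The citation rule for S2 can be sketched in a few lines. This is a minimal illustration, not the repository's governor: the `Verdict` type, the `enforce_citation` name, and the check that cited chunk IDs actually exist in the corpus are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical governor output; field names are illustrative."""
    decision: str                       # "pass", "veto", or "insufficient_evidence"
    explanation: str
    cited_chunks: tuple[str, ...] = ()


def enforce_citation(verdict: Verdict, corpus_chunk_ids: set[str]) -> Verdict:
    """Downgrade a `pass` that cites no known corpus chunk."""
    if verdict.decision != "pass":
        # Vetoes and explanations are never upgraded or altered.
        return verdict
    # Keep only citations that resolve to a real chunk (assumed check).
    valid = tuple(c for c in verdict.cited_chunks if c in corpus_chunk_ids)
    if valid:
        return Verdict("pass", verdict.explanation, valid)
    return Verdict("insufficient_evidence", verdict.explanation, ())
```

Note that the function can only keep or weaken a verdict, which mirrors the constraint that the governor never originates a trade.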
This separation gives four concrete research questions.
| ID | Research question | Repository evidence |
|---|---|---|
| RQ1 | Can leakage-safe tabular models clear a pre-registered weighted zero-mean R² gate on Jane-Street-style targets? | src/quant_research_stack/alpha/, docs/research/VALIDATION_RUNBOOK.md |
| RQ2 | Do candidate features survive adversarial validation and a seeded random-noise floor? | alpha/adversarial.py, alpha/metrics.py |
| RQ3 | Can an LLM governor veto unsupported trades while being forced to cite local research chunks? | src/quant_research_stack/governor/, ADR 0005 |
| RQ4 | Can staged promotion prevent accidental live trading before paper/shadow evidence exists? | docs/runbooks/, ADR 0002 |
The answer to RQ1 is no, not under free data (S1 sits below its gate). RQ2–RQ4 are answered affirmatively as infrastructure: the controls work and they correctly reject fragile results. The remainder of this paper is the evidence.
```mermaid
flowchart LR
    raw[raw free data] --> feat[features + meta-features]
    feat --> S1[S1 tabular predictor]
    S1 --> S2[S2 LLM governor<br/>GBNF + paper citations]
    S2 --> S4[S4 execution<br/>QUANTLAB_STAGE gate]
    S4 --> audit[(append-only audit log)]
```
Stage gating. A single operator-controlled environment variable,
QUANTLAB_STAGE, selects the broker class loaded at process start: paper
(simulated brokers), live_shadow (read-only real account routed to a null broker),
or live (real money). In-process self-promotion is forbidden by design — promotion
requires a signed runbook commit, an edited .env, and a process restart
(ADR 0002,
stage promotion runbook).
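A minimal sketch of the stage gate follows, assuming the broker class is chosen exactly once from `QUANTLAB_STAGE` at process start. The broker class names and `load_broker` are hypothetical, and failing closed on a missing or unknown value is an assumption of this example rather than documented behaviour.

```python
from collections.abc import Mapping


class SimulatedBroker:
    name = "simulated"   # paper stage


class NullBroker:
    name = "null"        # live_shadow: real account read-only, orders go nowhere


class RealBroker:
    name = "real"        # live stage


_BROKER_BY_STAGE = {
    "paper": SimulatedBroker,
    "live_shadow": NullBroker,
    "live": RealBroker,
}


def load_broker(env: Mapping[str, str]):
    """Pick the broker class once, from the environment, at process start."""
    stage = env.get("QUANTLAB_STAGE")
    if stage not in _BROKER_BY_STAGE:
        # Assumed behaviour: refuse to start rather than guess a stage.
        raise RuntimeError(f"unknown or missing QUANTLAB_STAGE: {stage!r}")
    return _BROKER_BY_STAGE[stage]()
```

Calling `load_broker(os.environ)` only at startup means no code path can switch brokers in-process, so promotion necessarily goes through an edited environment and a restart.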
Kill-switch precedence. A KILL_TRADING file in the repository root, a stale
model, a data outage, or a drawdown breach halts trading; the kill path takes
precedence over every other decision in S4
(ADR 0014,
kill switch runbook).
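Kill-switch precedence can be illustrated as a guard evaluated before any other S4 logic. The sketch below is an assumption-laden example: the function names, the health parameters, and the ordering of the non-file halt reasons among themselves are hypothetical; only the `KILL_TRADING` file name and the four halt conditions come from the text.

```python
from pathlib import Path
from typing import Optional


def halt_reason(repo_root: Path, model_age_s: float, max_model_age_s: float,
                data_feed_ok: bool, drawdown: float,
                max_drawdown: float) -> Optional[str]:
    """Return why trading must halt, or None if every check passes."""
    if (repo_root / "KILL_TRADING").exists():
        return "kill_file"
    if model_age_s > max_model_age_s:
        return "stale_model"
    if not data_feed_ok:
        return "data_outage"
    if drawdown >= max_drawdown:
        return "drawdown_breach"
    return None


def decide(repo_root: Path, signal: str, **health) -> str:
    """The kill path is checked first, so no signal can override it."""
    reason = halt_reason(repo_root, **health)
    if reason is not None:
        return f"HALT:{reason}"
    return f"ORDER:{signal}"
```

Because the file check needs no running service, an operator can halt trading with a single `touch KILL_TRADING` even if the rest of the system is misbehaving.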
Audit and replay. Every S2/S3/S4 decision lands in an append-only JSONL audit log. Each rotation is made read-only after closing, and replay of the log must reproduce the same decision sequence byte-for-byte. The full design rationale is in the platform design spec and the two-tier separation in ADR 0001.
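The append-only log and byte-for-byte replay can be sketched as below. The record schema (`inputs`, `decision`), the function names, and the canonical JSON settings are assumptions for illustration; the key idea is that deterministic serialization makes a byte comparison meaningful.

```python
import json
import os
import stat
from pathlib import Path
from typing import Callable


def _canonical(record: dict) -> bytes:
    # Sorted keys and fixed separators give one byte sequence per record.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8") + b"\n"


def append_decision(log_path: Path, record: dict) -> None:
    """Append one JSONL record; existing bytes are never rewritten."""
    with open(log_path, "ab") as fh:
        fh.write(_canonical(record))


def close_rotation(log_path: Path) -> None:
    """Make a finished rotation read-only."""
    os.chmod(log_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def replay_matches(log_path: Path, decide: Callable[[dict], str]) -> bool:
    """Recompute each decision from its logged inputs and compare bytes."""
    original = log_path.read_bytes()
    replayed = b""
    for raw in original.splitlines():
        record = json.loads(raw)
        record["decision"] = decide(record["inputs"])
        replayed += _canonical(record)
    return replayed == original
```

A replay that diverges by even one byte signals either nondeterminism in the decision path or a tampered log, which is why the comparison is on bytes rather than on parsed values.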
The harness is the deliverable, so the methodology section is precise. Each control links to the module that implements it and to the validation runbook.
Implemented in alpha/cv.py. For fold
Purging removes training rows whose label horizon overlaps the validation window (so a label observed in training does not depend on a price realized inside the validation set). The embargo additionally drops a buffer of rows immediately after each validation block, blocking serial-correlation leakage across the split. Combinatorial purged cross-validation (CPCV) generalizes this to multiple held-out groups so the overfitting estimate (§3.6) sees many train/test recombinations (López de Prado, 2018).
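The purge and embargo rules described here can be sketched on integer row indices. This is a simplified single-block illustration under stated assumptions: rows are time-ordered, row `i`'s label is realized over rows `i` through `i + horizon`, and the validation block is the half-open range `[val_start, val_end)`. The function name and signature are hypothetical and are not taken from `alpha/cv.py`.

```python
def purged_embargoed_train_indices(n_rows: int, val_start: int, val_end: int,
                                   horizon: int, embargo: int) -> list[int]:
    """Training row indices for one validation block [val_start, val_end)."""
    train = []
    for i in range(n_rows):
        if val_start <= i < val_end:
            continue  # validation row, never in training
        if i < val_start and i + horizon >= val_start:
            continue  # purged: label horizon reaches into the validation window
        if val_end <= i < val_end + embargo:
            continue  # embargoed: buffer immediately after the validation block
        train.append(i)
    return train


# 20 rows, validation on rows 8..11, 2-row label horizon, 3-row embargo.
print(purged_embargoed_train_indices(20, 8, 12, horizon=2, embargo=3))
# → [0, 1, 2, 3, 4, 5, 15, 16, 17, 18, 19]
```

Rows 6 and 7 are purged because their labels depend on prices inside the validation block, and rows 12 to 14 are dropped by the embargo.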
Implemented in alpha/metrics.py. The Jane-Street-style score is

$$R^2 = 1 - \frac{\sum_i w_i\,(y_i - \hat{y}_i)^2}{\sum_i w_i\,y_i^2}$$

where $y_i$ is the realized target, $\hat{y}_i$ the forecast, and $w_i$ the sample weight.

The denominator is taken about zero, not about the sample mean. A score above zero means the model beats the naive "predict zero return" baseline; the metric is negative when a model is worse than that baseline, which several S1 base learners are on the holdout (§4.1).
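A minimal sketch of the score, assuming plain Python sequences of equal length; the function name is illustrative and is not claimed to match `alpha/metrics.py`. It is one minus the weighted squared error divided by the weighted squared target, with both sums taken about zero.

```python
from typing import Sequence


def weighted_zero_mean_r2(y_true: Sequence[float], y_pred: Sequence[float],
                          weights: Sequence[float]) -> float:
    """1 - sum(w * (y - yhat)^2) / sum(w * y^2), denominator about zero."""
    num = sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights))
    den = sum(w * y ** 2 for y, w in zip(y_true, weights))
    if den == 0:
        raise ValueError("weighted sum of squared targets is zero")
    return 1.0 - num / den


y = [0.01, -0.02, 0.03]
w = [1.0, 2.0, 1.0]
print(weighted_zero_mean_r2(y, [0.0, 0.0, 0.0], w))  # → 0.0
```

Predicting zero everywhere scores exactly 0, which is what makes zero the natural pass/fail reference for a return forecast.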
Implemented in
alpha/adversarial.py. A classifier is
trained to distinguish train rows from holdout rows on a per-feature basis. Any feature
whose train-vs-holdout classifier AUC > 0.6 is dropped or transformed, because such
a feature carries a regime shift the model would exploit in-sample and lose on the
holdout.
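The per-feature screen can be approximated without training a model. The sketch below swaps the classifier for a univariate rank AUC (the Mann-Whitney statistic), which equals the AUC of the best monotone single-feature classifier; a tree-based classifier, as a real pipeline might use, can also catch non-monotone shifts that this shortcut misses. Function names are hypothetical, and only the 0.6 threshold comes from the text.

```python
from typing import Mapping, Sequence


def rank_auc(train_values: Sequence[float], holdout_values: Sequence[float]) -> float:
    """P(holdout value > train value), counting ties as one half."""
    wins = 0.0
    for h in holdout_values:
        for t in train_values:
            if h > t:
                wins += 1.0
            elif h == t:
                wins += 0.5
    return wins / (len(train_values) * len(holdout_values))


def shifted_features(train: Mapping[str, Sequence[float]],
                     holdout: Mapping[str, Sequence[float]],
                     threshold: float = 0.6) -> list[str]:
    """Names of features whose train-vs-holdout separability exceeds threshold."""
    flagged = []
    for name in train:
        auc = rank_auc(train[name], holdout[name])
        # A shift in either direction is equally separable.
        if max(auc, 1.0 - auc) > threshold:
            flagged.append(name)
    return flagged
```

The pairwise loop is quadratic and only suitable for illustration; on millions of rows a sort-based rank computation or a sampled classifier would be needed.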
A seeded Gaussian feature
Implemented in alpha/stacking.py. Let
and the stacked forecast is
Implemented in
crypto_research/perps/validation.py.
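Combinatorially symmetric cross-validation (CSCV; Bailey, Borwein, López de Prado & Zhu, 2017) splits performance into many train/test pairs, ranks the in-sample best configuration, and measures how often it underperforms out of sample. With $\bar\omega_c \in (0, 1)$ the relative out-of-sample rank of the in-sample best configuration on split $c$ and $\lambda_c = \ln\frac{\bar\omega_c}{1 - \bar\omega_c}$ its logit,

$$\mathrm{PBO} = \frac{1}{|C|} \sum_{c \in C} \mathbf{1}\left[\lambda_c \le 0\right].$$

A high PBO means the apparent best strategy is most likely an artifact of selection over many trials.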
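A compact sketch of the CSCV estimate follows, in pure Python for clarity. It assumes a returns matrix with one row per period and one column per configuration, uses the per-period Sharpe ratio as the performance metric, and drops any rows left over after splitting into equal blocks. The function names are illustrative and are not taken from `crypto_research/perps/validation.py`.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def _sharpe(xs: list[float]) -> float:
    sd = pstdev(xs)
    return mean(xs) / sd if sd > 0 else 0.0


def pbo_cscv(returns: list[list[float]], n_blocks: int = 4) -> float:
    """Share of symmetric splits where the in-sample best ranks at or below
    the out-of-sample median (logit of relative rank <= 0)."""
    n_strats = len(returns[0])
    size = len(returns) // n_blocks
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]
    logits = []
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [r for b in is_blocks for r in blocks[b]]
        oos_rows = [r for b in range(n_blocks) if b not in is_blocks for r in blocks[b]]
        is_perf = [_sharpe([returns[r][s] for r in is_rows]) for s in range(n_strats)]
        oos_perf = [_sharpe([returns[r][s] for r in oos_rows]) for s in range(n_strats)]
        best = max(range(n_strats), key=is_perf.__getitem__)
        # Ascending rank of the in-sample winner among OOS results, 1..N.
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    return sum(1 for lam in logits if lam <= 0) / len(logits)
```

With `n_blocks` blocks there are C(n_blocks, n_blocks/2) splits, so the estimate becomes smoother as the block count grows, at combinatorial cost.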
Also in
crypto_research/perps/validation.py.
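The deflated Sharpe ratio corrects a measured Sharpe $\widehat{SR}$ for selection over $N$ trials and for non-normal returns (Bailey & López de Prado, 2014):

$$\mathrm{DSR} = \Phi\!\left(\frac{(\widehat{SR} - SR_0)\sqrt{T - 1}}{\sqrt{1 - \hat\gamma_3\,\widehat{SR} + \frac{\hat\gamma_4 - 1}{4}\,\widehat{SR}^2}}\right), \qquad SR_0 = \sqrt{V}\left((1 - \gamma)\,\Phi^{-1}\!\left(1 - \tfrac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1 - \tfrac{1}{N e}\right)\right)$$

where $T$ is the number of return observations, $\hat\gamma_3$ and $\hat\gamma_4$ are the sample skewness and kurtosis of returns, $V$ is the variance of the Sharpe estimates across the $N$ trials, $\gamma$ is the Euler–Mascheroni constant, and $\Phi$ is the standard normal CDF.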
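The deflation step can be sketched with the standard library alone, following the published Bailey and López de Prado (2014) formula. Parameter names are illustrative, the Sharpe inputs are per-period (not annualized), and `n_trials` must be at least 2; none of this is claimed to mirror the repository's implementation.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sr_variance: float) -> float:
    """Expected best Sharpe among n_trials zero-skill trials (n_trials >= 2)."""
    return math.sqrt(sr_variance) * (
        (1.0 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )


def deflated_sharpe(sr: float, n_obs: int, skew: float, kurt: float,
                    n_trials: int, sr_variance: float) -> float:
    """Probability that the true Sharpe exceeds the selection-adjusted bar."""
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    # Non-normality adjustment; kurt is raw kurtosis (3.0 for a normal).
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)
```

Holding the measured Sharpe fixed, raising `n_trials` raises the bar `sr0` and lowers the deflated value, which is the mechanism by which a large search invalidates its own best result.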
A stationary bootstrap (Politis & Romano, 1994) resamples blocks of returns (preserving serial dependence) to build a confidence interval on the Sharpe ratio. The pre-registered gate requires the lower bound of the CI to be strictly positive — a point estimate is not enough.
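A minimal sketch of the stationary bootstrap and the resulting percentile interval is shown below. The mean block length, the number of resamples, the percentile method, and all function names are assumptions for illustration; the gate check is simply whether the returned lower bound is strictly positive.

```python
import random
from statistics import mean, pstdev


def stationary_bootstrap_indices(n: int, mean_block: float,
                                 rng: random.Random) -> list[int]:
    """Politis-Romano resampling: blocks of geometric length, wrapping around."""
    p = 1.0 / mean_block
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))       # start a new block
        else:
            idx.append((idx[-1] + 1) % n)      # continue the current block
    return idx


def sharpe_ci(returns: list[float], mean_block: float = 10.0,
              n_boot: int = 1000, alpha: float = 0.05,
              seed: int = 0) -> tuple[float, float]:
    """Percentile confidence interval for the per-period Sharpe ratio."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [returns[i] for i in
                  stationary_bootstrap_indices(len(returns), mean_block, rng)]
        sd = pstdev(sample)
        stats.append(mean(sample) / sd if sd > 0 else 0.0)
    stats.sort()
    lo = stats[int(alpha / 2.0 * n_boot)]
    hi = stats[min(n_boot - 1, int((1.0 - alpha / 2.0) * n_boot))]
    return lo, hi
```

Seeding the generator keeps the interval reproducible across runs, so a gate decision based on `lo > 0` does not change between reruns of the same data.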
Implemented in
crypto_research/funding/carry.py.
For a delta-neutral long-spot / short-perp position, the per-period net return decomposes
as
where
The latest S1 run (experiments/alpha_s1/20260523-160541) was trained on
4,011,392 rows and evaluated on a permanent 1,008,656-row holdout with 3 CV
folds, on 79 features surviving adversarial (§3.3) and noise-floor (§3.4)
filtering. The pre-registered milestone gate is holdout weighted zero-mean R² ≥
0.012.
| Model | Holdout weighted zero-mean R² |
|---|---|
| CatBoost | +0.0062 |
| Ridge | +0.0039 |
| LightGBM | +0.0025 |
| Sequence (1D-CNN) | −0.0007 |
| XGBoost | −0.0093 |
| MLP | −0.0094 |
| Ensemble (holdout) | +0.0055 |
Verdict: below gate, not released. The ensemble holdout score of 0.0055 is less than half the 0.012 gate. Three base learners (Sequence 1D-CNN, XGBoost, MLP) are negative on the holdout, i.e. worse than predicting zero. CatBoost, LightGBM, and the Ridge baseline carry a faint, real but sub-gate signal; nothing here is deployable. See the S1 implementation plan.
Beyond S1, roughly thirteen independent signal-research branches were opened and closed. Every one was killed by a documented mechanism — never by hand-waving. The verdicts group into four recurring failure modes: noise floor, subsumed by vol-targeting / regime exposure, predictive-but-untradable (markout below cost), and data-blocked / survivorship-unsafe.
| Branch | Verdict | Why it failed | Reference |
|---|---|---|---|
| OHLCV cross-sectional (6 iterations) | closed | noise floor; PSR/DSR kill the faint holdout flicker | note |
| VRP index | closed | real but subsumed by HMM regime (orthogonal residual Sharpe ≈ 0) | intake |
| HMM single-index | closed | strong static fit but delay-sensitive and refit-unstable | intake |
| Event-macro FOMC | closed | real + placebo-clean but subsumed by vol-targeting; below gate | note |
| Microstructure L2 / L1 / tick | closed | predictive but untradable (markout < spread + fee) | note |
| Futures carry / term-structure | rejected (data) | no native curve available for free | intake |
| Options-IV cross-sectional | rejected (data) | only free return source is survivorship-biased | intake |
| EDGAR 10-K text features | closed | clean (PIT + survivorship) but annual = too low-frequency | intake |
| Zero-cost allocators v1 / v2 | DO_NOT_ADVANCE | cleared literal gate but crypto-regime-carried / ETH-concentrated and bootstrap-fragile | intake |
The full ledger, the four-wall taxonomy, and the consolidated arc are in the zero-cost alpha search close-out and the signal-research program review. Outcome: 0 production, 0 paper, 0 live candidates.
The freshest and most instructive result is a delta-neutral funding-rate carry: long spot / short USDT-M perp on BTC + ETH, collecting the 8h funding that longs persistently pay shorts, on free Binance Vision data over 2020-01..2026-04. It was the first branch to structurally escape the cost wall (it is held, not a taker bet) and the subsumption wall (it is carry, not vol-timing), and the first to clear its data audits.
Data audit — PASS. 6,936 funding settlements; a 2,312-row/asset daily carry panel
with zero missing days; spot Vision-daily reconciled to on-disk 1-minute data to 0.0%.
Two bugs that would have fabricated a result were caught by the harness: an
exact-timestamp 8h join silently dropped ~45% of settlements (millisecond jitter in
calc_time), and a pooled book annualized 8h data at 365 instead of 1095. Both were
fixed and regression-tested.
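Both bugs are easy to reproduce in miniature. The sketch below assumes millisecond epoch timestamps, a hypothetical 60-second tolerance for snapping `calc_time` to the 8h settlement grid, and simple (non-compounded) annualization; the function names are illustrative.

```python
from typing import Optional, Sequence

EIGHT_HOURS_MS = 8 * 60 * 60 * 1000
PERIODS_PER_YEAR_8H = 3 * 365   # 1095 settlements per year, not 365


def snap_to_8h_grid(calc_time_ms: int, tolerance_ms: int = 60_000) -> Optional[int]:
    """Map a jittered settlement timestamp to its 8h grid point, or None."""
    nearest = (calc_time_ms + EIGHT_HOURS_MS // 2) // EIGHT_HOURS_MS * EIGHT_HOURS_MS
    if abs(calc_time_ms - nearest) <= tolerance_ms:
        return nearest
    return None   # too far from any settlement time to be matched safely


def annualize_mean(per_period_returns: Sequence[float], periods_per_year: int) -> float:
    """Simple annualization: mean per-period return times periods per year."""
    return sum(per_period_returns) / len(per_period_returns) * periods_per_year


# A settlement stamped 7 ms late would be lost by an exact-equality join.
print(snap_to_8h_grid(28_800_007) == 28_800_000)  # → True
```

Joining on the snapped key instead of the raw timestamp keeps the jittered rows, and using 1095 rather than 365 avoids understating an 8h series by a factor of three.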
Backtest — net-positive in 6 of 7 years. After 10 bps spot + 5 bps perp taker cost, 8h-marked pooled annual net returns were 2020 +23.8%, 2021 +40.2%, 2022 +2.4%, 2023 +8.2%, 2024 +13.3%, 2025 +5.1%, 2026 −0.26% (partial year). The unlevered pooled Sharpe is ≈ 8.6 at ≈ 14%/yr.
The high Sharpe is real, not a marking illusion. Re-marking the carry on the true 8h funding-settlement grid (rather than a smoothed daily close) did not deflate the Sharpe: it moved 8.56 → 8.61. The spot-perp basis is genuinely tight even at 8h resolution, so the figure is not a basis-variance artifact. The danger is elsewhere.
The real risk is the fat left tail under leverage. Unlevered, the carry is capital-inefficient (100% margin on both legs). Any leverage used to fix that introduces short-perp liquidation in crashes. Under an isolated-margin liquidation model the stressed pooled returns are 3× → −17%/yr, 5× → −38%/yr, 10× → −90%/yr. This is a calm-period Sharpe that does not price its own tail — pennies in front of a steamroller.
Verdict: DO_NOT_ADVANCE. It fails the pre-registered 2026 regime gate (2026 YTD net −0.16% on the daily-close run; −0.26% 8h-marked) and the edge is decaying with crowding (2024 +13% → 2025 +5% → 2026 ≈ 0). See the funding-carry negative result, the funding-carry intake, and the realism results.
Every failure in this program reduces to one or more of four walls. Only the last is a methodology issue; the first three are data-access issues.
- Cost wall. The signal is real but its markout is below realistic transaction cost (all microstructure channels).
- Subsumption wall. A single-index risk-timing signal is already captured by vol-targeting or regime exposure, leaving ≈ 0 orthogonal residual (VRP, HMM, FOMC, the allocators' equity sleeves).
- Data-access / survivorship wall. Structurally new channels need entitled or survivorship-safe data unavailable for free (futures curve, options-IV cross-section, 10-Q labels, point-in-time fundamentals).
- Frequency / sample wall. Clean free data exists but at too low a frequency for robust inference (EDGAR 10-K annual: only a handful of holdout cross-sections).
The funding-carry capstone added a fifth pattern that does not reduce to these: a real, free, market-neutral carry that is regime-decaying and tail-dominated under the leverage needed to make it efficient.
Meta-conclusion: data acquisition is the binding constraint. Methodology and validation are not the bottleneck — the controls work, and they repeatedly rejected fragile results before they could mislead. What is missing is information the free tier cannot supply. The full argument and the paid-data path are in the zero-cost alpha search close-out, the zero-cost constraint note, and the paid-data acquisition recommendation.
A natural objection is that public repositories or published systems already achieve
what this program could not. A multi-source survey (companion report:
reports/2026-06-02-COMPETITIVE-LANDSCAPE-PUBLIC-QUANT.md)
finds the opposite: no public artifact contains a deployable, costed, capacity-aware,
gate-surviving alpha. The visible ecosystem is infrastructure (backtest/execution
engines such as NautilusTrader, backtrader, Zipline),
research pipelines (Microsoft Qlib; Yang et al.,
2020, whose published benchmarks report ranking ICs of
This section formalizes why — and shows that four independent strands of the academic
and industry record reproduce this program's four-wall thesis (§5). Each subsection pairs
a closed-form model with the wall it explains; all figures are regenerated by
scripts/make_landscape_figures.py from the
equations alone (no market data), so they are fully reproducible.
Let
where
The practical consequence: with
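McLean & Pontiff (2016) tracked 97 cross-sectional predictors from publication into live markets. Decomposing the in-sample return $\bar r_{\text{IS}}$ using the paper's reported 26% out-of-sample and 58% post-publication declines:

$$\bar r_{\text{IS}} = \underbrace{0.26\,\bar r_{\text{IS}}}_{\text{statistical bias}} + \underbrace{0.32\,\bar r_{\text{IS}}}_{\text{publication decay}} + \underbrace{0.42\,\bar r_{\text{IS}}}_{\text{surviving}}$$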
i.e. roughly a quarter of the edge was never real and a further third is arbitraged away once the signal is public (McLean & Pontiff, 2016). Critically, the liquid, easy-to-arbitrage anomalies — precisely those tradable from free OHLCV — decay the most, exactly as the limits-of-arbitrage view predicts (Shleifer & Vishny, 1997).
The crowding limit is the Grossman–Stiglitz / "Red-Queen" equilibrium: if a fraction
is the Grossman–Stiglitz impossibility result (Grossman & Stiglitz, 1980) and Lo's
adaptive-markets view of strategy proliferation and decay (Lo, 2004); LLM-assisted
research drives
The standard error of a cross-sectional information coefficient over
A real cross-sectional IC is small — Qlib's published benchmarks sit at
Let a signal have gross information ratio
When the per-trade markout is below
The Berk–Green (2004) rational-markets mechanism implies decreasing returns to scale:
alpha net of price impact falls in deployed capital
so a capacity-constrained anomaly a small operator can access (micro-caps) has an
optimal size
| Quantity reported | Public repos / cookbooks | This program |
|---|---|---|
| Sharpe | gross, un-deflated, best-of-$N$ | net, deflated against the expected best-of-$N$ under the null |
| Sample | in-sample / favourable window | purged-embargoed OOS holdout |
| Costs | often omitted | |
| Capacity | ignored | required (§6.5) |
| Decay | unmodelled | |
Reading the same market through stricter instruments yields a stricter — and more honest — number. The external record (anomaly decay, backtest-overfitting theory, Quantopian, pod-shop practice, LLM homogenization) does not contradict this program's result; it predicts it. The corollary stands: the binding constraint is the information set, not the method — the genuine winners win on entitled data, execution infrastructure, capacity-constrained niches, and portfolio construction across many decayed sleeves, none of which a free-data solo operator can manufacture from a better model.
The §6.1–6.6 results are theory. We also ran them as a measured experiment: a grid of
~1.06 M systematically-enumerated single-asset strategies (21 cited signal families × 8
lookbacks × 8 thresholds × 3 volatility estimators × 2 position modes × 3 holding periods,
across 11 universes), each backtested net-of-cost on real 2015–2026 daily data and passed
through PBO + Deflated Sharpe + a price-level permutation control. At the reported N = 1,000
sampled tier (the conclusion only sharpens with N, since the bar grows as √(2 ln N)):
| Measured | Value |
|---|---|
| Best in-sample Sharpe | 0.96 |
| Expected max under the null, √(2 ln N)·σ | 1.02 (the best is below the chance bar) |
| Probability of Backtest Overfitting (PBO) | 0.69 |
| Deflated Sharpe of the best strategy | 0.44 (gate ≥ 0.95) |
| Strategies surviving the deflated bar | 0 / 1000 |
| Permutation MCPT p-value | 1.0 (real best 0.83 < permuted mean 1.35) |
The "best" strategy sits exactly where a zero-skill search would place it — confirming the
selection-bias mechanism of §6.1 on live backtests. Full write-up and reproduction:
strategy-zoo overfitting report
and verdict; harness under
src/quant_research_stack/strategy_benchmark/zoo/.
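The chance bar in the table can be reproduced with a few lines of arithmetic. In this sketch the cross-trial Sharpe dispersion σ is not taken from the report: it is back-solved from the tabulated 1.02 at N = 1,000, which is an assumption made purely to show how the bar scales with the number of trials.

```python
import math


def null_max_sharpe(n_trials: int, sigma: float) -> float:
    """Approximate expected best Sharpe among n_trials zero-skill strategies."""
    return math.sqrt(2.0 * math.log(n_trials)) * sigma


# Assumed: sigma implied by the reported bar of 1.02 at N = 1,000.
SIGMA_IMPLIED = 1.02 / math.sqrt(2.0 * math.log(1_000))

for n in (1_000, 10_000, 1_060_000):
    print(n, round(null_max_sharpe(n, SIGMA_IMPLIED), 2))
```

Since the bar only rises with N, a best in-sample Sharpe of 0.96 that already sits below the N = 1,000 bar falls further short at the full grid size.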
The capstone result regenerates in seconds from cached free data:

```bash
make mvp
```

`make mvp` runs the funding-carry v1 pipeline, the realism (8h-marked + liquidation) pass, and regenerates the four figures embedded above (figures/*.png). It prints the honest verdict, funding-carry = DO_NOT_ADVANCE (research_only), and writes reports to reports/signal_research/funding_carry_v1/.
Environment. Dependencies are managed with uv; all entry points run with
PYTHONPATH=src.
```bash
uv sync --extra dev
PYTHONPATH=src uv run pytest -q
ruff check src scripts tests
mypy src
```

Artifact integrity. Every S1 artifact is hashed under
experiments/alpha_s1/20260523-160541/_artifact_sha256.json,
so a clean checkout can verify byte-for-byte that the reported numbers come from the
committed models. Audit replay: replaying the append-only audit log must reproduce
the same decision sequence byte-for-byte (§2).
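Verifying artifacts against a hash manifest can be sketched as follows. The manifest layout is an assumption: the example treats it as a flat JSON object mapping paths (relative to the manifest's directory) to hex SHA-256 digests, which may differ from the actual structure of `_artifact_sha256.json`. Function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the relative paths that are missing or whose hash differs."""
    expected = json.loads(manifest_path.read_text(encoding="utf-8"))
    root = manifest_path.parent
    return [
        rel for rel, want in sorted(expected.items())
        if not (root / rel).is_file() or sha256_of(root / rel) != want
    ]
```

An empty list means every listed artifact is present and byte-identical to what was hashed; any other result names the files to investigate.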
- No deployable alpha exists. 0 deployable, 0 paper, 0 live strategies. Nothing in this repository is authorized for paper or live trading, and no result here constitutes investment advice.
- S1 is below its gate. Holdout weighted zero-mean R² of 0.0055 vs a 0.012 gate; three base learners (Sequence 1D-CNN, XGBoost, MLP) are negative on the holdout.
- The funding carry is tail-dominated and decaying. Its high Sharpe is a calm-regime figure; under the leverage needed for capital efficiency it carries a crash-liquidation tail (−17%/−38%/−90% at 3×/5×/10×), and it fails the pre-registered 2026 regime gate as the edge crowds out.
- Free-data scope. Sub-millisecond equity HFT, native futures curves, true options chains, and survivorship-safe equity fundamentals are all out of scope on free tiers.
- What would change the answer. Paid, survivorship-safe data (e.g. Sharadar) would reopen the data-blocked branches; the dormant ingestion scaffold is ready. The feasibility analysis and kill criteria are in the Sharadar data-purchase feasibility study.
Architecture Decision Records — docs/architecture/adrs/
- ADR 0001 — two-tier tabular / LLM separation
- ADR 0002 — three-stage promotion gate
- ADR 0003 — GBNF-constrained LLM output
- ADR 0005 — LLM governor citation requirement
- ADR 0010 — fill model and fixed-bps slippage
- ADR 0014 — kill-switch precedence
Specs — docs/superpowers/specs/
Plans — docs/superpowers/plans/
Negative-result notes — docs/research/
- OHLCV cross-sectional alpha
- Event-macro FOMC
- Microstructure L2/L1/tick
- Funding-carry capstone
- Zero-cost alpha search close-out
- Signal-research program review
Research intakes — docs/research/intake/
- VRP index v1
- HMM single-index v1
- Funding-carry v1
- Futures carry / term-structure v1
- Options-IV features v1
- EDGAR 10-K text features v1
- Zero-cost deployable v1
Runbooks — docs/runbooks/
Program reports — reports/
- Program review — the data-entitlement constraint
- Competitive landscape — public & published quant systems (basis of §6)
Cited in author–year form throughout §3 and §6.
- Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4), 39–69. https://doi.org/10.21314/JCF.2016.322
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5), 94–107. https://doi.org/10.3905/jpm.2014.40.5.094
- Berk, J. B., & Green, R. C. (2004). Mutual Fund Flows and Performance in Rational Markets. Journal of Political Economy, 112(6), 1269–1295. https://doi.org/10.1086/424739
- Grossman, S. J., & Stiglitz, J. E. (1980). On the Impossibility of Informationally Efficient Markets. American Economic Review, 70(3), 393–408.
- Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1), 5–68. https://doi.org/10.1093/rfs/hhv059
- Lo, A. W. (2004). The Adaptive Markets Hypothesis. Journal of Portfolio Management, 30(5), 15–29.
- López de Prado, M. (2018). Advances in Financial Machine Learning. Hoboken, NJ: Wiley. (purged/embargoed CV, CPCV — §3.1, §3.6)
- McLean, R. D., & Pontiff, J. (2016). Does Academic Research Destroy Stock Return Predictability? Journal of Finance, 71(1), 5–32. https://doi.org/10.1111/jofi.12365
- Politis, D. N., & Romano, J. P. (1994). The Stationary Bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. https://doi.org/10.1080/01621459.1994.10476870
- Shleifer, A., & Vishny, R. W. (1997). The Limits of Arbitrage. Journal of Finance, 52(1), 35–55. https://doi.org/10.1111/j.1540-6261.1997.tb03807.x
- Yang, X., Liu, W., Zhou, D., Bian, J., & Liu, T.-Y. (2020). Qlib: An AI-oriented Quantitative Investment Platform. arXiv:2009.11189. https://arxiv.org/abs/2009.11189
Non-peer-reviewed; used for industry context and corroboration, not as primary evidence.
- AI-driven alpha decay under algorithmic monoculture — arXiv:2605.23905 (2026); IBKR Quant: LLMs and the shortening shelf life of copyable alpha.
- Backtest-bias surveys — The three ways backtests lie; Backtest overfitting & live performance.
- Crowdsourced-alpha & crowding case studies — The rise and fall of Quantopian; Factor decay & pod shops.
- Platforms surveyed — Microsoft Qlib · NautilusTrader · awesome-systematic-trading.
This repository is not a regulated investment advisor and produces no investment advice. It is a research system reporting a negative result. Real-money trading is gated behind operator-only promotion controls and is solely the operator's responsibility.












