Result (real FI-2010). DeepLOB trained on the real FI-2010 benchmark (ZScore CF_7) reaches test macro-F1 0.68 → 0.80 as the prediction horizon grows from k=10 to k=100 events — the longer-horizon-is- easier pattern the literature reports. Numbers are on a 50k-train / 20k-test event subset over 6 epochs (for a CPU/MPS-feasible run); scaling events + epochs closes the gap to the paper's ~0.83 at k=10. Reproduce with
make fi2010. Separately, the trading-eval tests encode the thesis that a perfectly accurate predictor still loses money once half-spread + queue/latency costs hit tick-sized edges.
The tick-strata / temporal-decay tooling is shown on synthetic labels below — FI-2010's normalized arrays don't expose per-instrument tick-size or year metadata, so that diagnostic needs raw LOBSTER-style data.
Thesis. Reimplement DeepLOB (Zhang, Zohren & Roberts, 2019) and a modern transformer (TLOB, 2025) for short-horizon mid-price direction on limit-order-book data — then do the part everyone skips: failure analysis (tick-size strata, signal decay over time) and a transaction-cost-aware trading evaluation that shows whether directional accuracy survives as P&L.
A DeepLOB clone is now a baseline, not an achievement. Your contribution is the rigorous "when/why does it work, and does the edge survive costs?" analysis.
flowchart LR
F["FI-2010 / LOBSTER"] --> W["100×40 windows<br/>horizon labels (leakage-safe)"]
W --> MO["DeepLOB · TLOB"]
MO --> TR["Train<br/>macro-F1 · class weights · early stop"]
TR --> EV["Trade eval<br/>cost + queue/latency → net P&L · Deflated Sharpe"]
TR --> ST["Failure analysis<br/>tick strata · temporal decay"]
project1_deep_lob/
├── IMPLEMENTATION_LOGIC.md
├── DATA.md
├── LLM_PROMPTS.md
├── requirements.txt
├── src/models/deeplob.py # IMPLEMENTED: faithful DeepLOB (PyTorch)
├── src/models/tlob.py # IMPLEMENTED: TLOB dual-attention transformer (einops)
├── src/data/fi2010.py # IMPLEMENTED: FI-2010 loader, horizon labels, leakage-safe windowing
├── src/train.py # IMPLEMENTED: Hydra trainer (class weights, macro-F1 early stop, wandb)
├── src/backtest/trade_eval.py # IMPLEMENTED: cost-aware net P&L, turnover, Deflated Sharpe
└── src/eval/strata.py # IMPLEMENTED: tick-strata + temporal-decay F1 + plots
All modules implemented with tests (CPU-only, tiny tensors; no network/GPU; FI-2010 windowing tested on a
synthetic array). DeepLOB and TLOB share the (B,1,100,40)->(B,3) contract so they are directly comparable.
The trading evaluation makes the project's thesis concrete, its tests show a perfectly accurate predictor
still loses money once half-spread + queue/latency costs are charged against tick-sized edges. Drop in
FI-2010 (Train_*/Test_* files under data/) and run python -m src.train to reproduce.
make setup && make check # install, then lint + typecheck + test (CI parity)
make train # after dropping FI-2010 Train_*/Test_* files into data/- Accuracy ≠ P&L. The trading evaluation deliberately charges half-spread + a queue/latency penalty on turnover, so a high-macro-F1 model can still lose money, the failure mode most LOB repos omit.
- Shared I/O for DeepLOB and TLOB makes the two a clean, controlled comparison rather than apples-to-oranges.
- Limitation: FI-2010 is dated and the figure here is illustrative until the dataset is dropped into
data/. - What I'd do next: real FI-2010 + LOBSTER validation; queue-position modeling from message data; multi-horizon joint training; calibrating the cost model to a specific venue.
- Zhang, Z., Zohren, S. & Roberts, S. (2019). DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. IEEE Trans. Signal Processing.
- Berti, L. & Kasneci, G. (2025). TLOB: A Dual-Attention Transformer for Limit Order Book forecasting. (replicated here)
- Ntakaris, A. et al. (2018). Benchmark dataset for mid-price forecasting of limit order book data (FI-2010). J. Forecasting.
- López de Prado, M. (2018). Advances in Financial Machine Learning (Deflated Sharpe, honest backtesting). Wiley.

