Serve simulation under gunicorn (fix concurrent-sim slowdown) + DMP_API_URL alias #11

Open: RyadT wants to merge 1 commit into main from ryad/sim-prod-perf-dmp-fix

@RyadT RyadT commented May 30, 2026

Contributor

Why

Investigated a report that simulations feel ~2x slower on prod (covidmod.isi.jhu.edu) than localhost for the same Barnsdall zone (74002 / min-pop 5000 / greedy-weight, ~7000 people, 720h month-long run).

Measured findings:

  • The simulator compute is the same speed on prod and local; the in-process DMP is confirmed working on both (timeline_source_counts = {dmp: 24-26k, fallback: 0} on every prod run — not the slow per-infection HTTP path).
  • The ~2x appears under concurrent load. The Flask dev-server entrypoint (app.run(threaded=True)) runs CPU-bound sims on threads that share one GIL, so overlapping sims serialize. Localhost is single-user; prod is a shared multi-user host, so the contention shows there.

Reproduced directly: two concurrent 720h runs on the Flask server took 56s and 59s respectively (vs ~31s for a solo run).
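
The serialization effect can be illustrated outside the simulator. The following is a minimal sketch, not the simulator's code: `cpu_bound_sim` is a hypothetical stand-in for one run, and the behaviour described assumes a standard CPython build with the GIL enabled. Two copies of a pure-Python workload are run at once, first on threads and then on separate processes.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound_sim(steps: int) -> int:
    """Hypothetical stand-in for one sim run: pure-Python arithmetic that holds the GIL."""
    total = 0
    for i in range(steps):
        total += i * i % 7
    return total


def run_concurrently(executor_cls, steps: int, runs: int = 2):
    """Run `runs` copies of the workload at once; return (results, wall-clock seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=runs) as pool:
        results = list(pool.map(cpu_bound_sim, [steps] * runs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    steps = 5_000_000
    _, threaded = run_concurrently(ThreadPoolExecutor, steps)
    _, multiproc = run_concurrently(ProcessPoolExecutor, steps)
    print(f"2 runs on threads:   {threaded:.2f}s")
    print(f"2 runs on processes: {multiproc:.2f}s")
```

On a multi-core machine the threaded pair would be expected to take roughly as long as running both workloads back to back, while the process pair should finish in about the time of one. Exact timings depend on the host.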

What

  1. gunicorn entrypoint (Dockerfile, requirements.txt): run the sim server under gunicorn with multiple worker processes instead of the Flask dev server. CPU-bound sims now run on separate cores instead of serializing on one GIL.
    • WEB_CONCURRENCY controls worker count (default 4; override per host).
    • --timeout 0 because each /simulation/ request streams SSE for the full run duration — a non-zero timeout would kill long runs mid-stream.
  2. DMP env-var alias (simulator/config.py): accept DMP_API_URL as an alias for DMP_API_BASE_URL. The deploy compose sets DMP_API_URL, so without this the HTTP-fallback base URL silently defaulted to localhost:8000 (nothing listens there in the sim container) and could never reach the real dmp service if the in-process DMP ever became unavailable. In-process path is unaffected; this only repairs the fallback target. Backward-compatible (does not require any Deploy change).
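
The alias lookup in item 2 might look roughly like the sketch below. This is an illustration, not the code in simulator/config.py: the function name `resolve_dmp_base_url`, the precedence (DMP_API_BASE_URL wins when both are set), the `http://` scheme on the default, and the trailing-slash trimming are all assumptions.

```python
import os
from typing import Mapping, Optional

# Assumed default; the PR only states that the fallback pointed at localhost:8000.
DEFAULT_DMP_BASE_URL = "http://localhost:8000"


def resolve_dmp_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HTTP-fallback base URL for the DMP service.

    Checks the canonical variable first, then the alias set by the deploy
    compose, and falls back to the default only when neither is set.
    """
    env = os.environ if env is None else env
    for name in ("DMP_API_BASE_URL", "DMP_API_URL"):
        value = env.get(name, "").strip()
        if value:
            return value.rstrip("/")  # assumed normalization: drop trailing slash
    return DEFAULT_DMP_BASE_URL
```

Checking the canonical name first keeps existing deployments that already set DMP_API_BASE_URL behaving exactly as before, which matches the backward-compatibility claim above.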

Verification

Ran the app under gunicorn --worker-class sync --timeout 0 with WEB_CONCURRENCY=2 locally:

| Scenario | Flask dev server (old) | gunicorn 2 workers (new) |
| --- | --- | --- |
| Solo 720h | ~31-40s | 40s |
| 2 concurrent 720h | 56s & 59s (~2x) | 41s & 44s (~1x) |

SSE progress streaming confirmed working through gunicorn; single-sim latency unchanged.
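
For reference, the gunicorn settings exercised above could also be expressed as a gunicorn config module. This is a sketch under assumptions: the PR sets these options in the Dockerfile command rather than a config file, and the helper `worker_count` and its fallback behaviour for invalid values are hypothetical. `workers`, `worker_class`, and `timeout` are standard gunicorn setting names.

```python
import os


def worker_count(env=None, default: int = 4) -> int:
    """Read WEB_CONCURRENCY, falling back to `default` when unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("WEB_CONCURRENCY", "").strip()
    return int(raw) if raw.isdigit() and int(raw) > 0 else default


# Settings gunicorn picks up from a config module.
workers = worker_count()  # one process per concurrent CPU-bound sim
worker_class = "sync"     # matches the --worker-class used in verification
timeout = 0               # 0 disables the worker timeout so long SSE streams survive
```

A file like this would be loaded with `gunicorn -c gunicorn.conf.py <module>:<app>`, where `<module>:<app>` is a placeholder for the application's actual import path.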

Notes

  • Single-sim latency is single-core-bound and does not change; this fixes throughput/latency under concurrency.
  • The deploy-entrypoint test (tests/test_deploy_entrypoints.py) only asserts imports, so it is unaffected by the CMD change.

🤖 Generated with Claude Code

The Flask dev-server entrypoint (app.run, threaded=True) serializes concurrent
CPU-bound sims on a single GIL: two overlapping 720h runs each slowed from ~31s
solo to ~56-59s (~2x). Switch the container entrypoint to gunicorn with multiple
worker processes so concurrent sims run on separate cores. Verified locally: 2
concurrent 720h runs stay ~41-44s each instead of doubling. WEB_CONCURRENCY sets
the worker count; --timeout 0 is required because each /simulation/ request
streams Server-Sent Events for the full run duration.

Also accept DMP_API_URL as an alias for DMP_API_BASE_URL: the deploy compose
sets DMP_API_URL, so without the alias the HTTP fallback base_url silently
defaulted to localhost:8000 (nothing listens there in the sim container) and
could never reach the real dmp service if the in-process DMP became unavailable.
The in-process path is unaffected; this only fixes the fallback target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>