Serve simulation under gunicorn (fix concurrent-sim slowdown) + DMP_API_URL alias #11
Open
RyadT wants to merge 1 commit into
The Flask dev-server entrypoint (app.run, threaded=True) serializes concurrent CPU-bound sims on a single GIL: two overlapping 720h runs each slowed from ~31s solo to ~56-59s (~2x). Switch the container entrypoint to gunicorn with multiple worker processes so concurrent sims run on separate cores. Verified locally: 2 concurrent 720h runs stay ~41-44s each instead of doubling. WEB_CONCURRENCY sets the worker count; --timeout 0 is required because each /simulation/ request streams Server-Sent Events for the full run duration.

Also accept DMP_API_URL as an alias for DMP_API_BASE_URL: the deploy compose sets DMP_API_URL, so without the alias the HTTP fallback base_url silently defaulted to localhost:8000 (nothing listens there in the sim container) and could never reach the real dmp service if the in-process DMP became unavailable. The in-process path is unaffected; this only fixes the fallback target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
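The alias lookup described in the commit message could look roughly like the sketch below. It is not the code from simulator/config.py: the function name resolve_dmp_base_url, the http:// scheme on the default, and the precedence (DMP_API_BASE_URL first, then DMP_API_URL) are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Assumed default: the PR only says the fallback pointed at localhost:8000;
# the http:// scheme is a guess.
DEFAULT_DMP_BASE_URL = "http://localhost:8000"


def resolve_dmp_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HTTP-fallback base URL for the DMP service (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumed precedence: the canonical name wins, the alias is consulted second.
    for name in ("DMP_API_BASE_URL", "DMP_API_URL"):
        value = env.get(name, "").strip()
        if value:
            # Drop a trailing slash so callers can append paths uniformly.
            return value.rstrip("/")
    # Neither variable is set (or both are blank): fall back to the default.
    return DEFAULT_DMP_BASE_URL
```

With a lookup of this shape, a compose file that sets only DMP_API_URL no longer falls through to the localhost default, while hosts that already set DMP_API_BASE_URL behave as before.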
Why
Investigated a report that simulations feel ~2x slower on prod (covidmod.isi.jhu.edu) than localhost for the same Barnsdall zone (74002 / min-pop 5000 / greedy-weight, ~7000 people, 720h month-long run).
Measured findings:
- timeline_source_counts={dmp: 24-26k, fallback: 0} on every prod run, so prod is not on the slow per-infection HTTP path.
- The Flask dev server (app.run(threaded=True)) runs CPU-bound sims on threads that share one GIL, so overlapping sims serialize. Localhost is single-user; prod is a shared multi-user host, so the contention shows there.

Reproduced directly: two concurrent 720h runs on the Flask server took 56s and 59s (vs ~31s solo).
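The serialization effect can be reproduced outside the simulator with a small standalone sketch. The cpu_bound workload and run_concurrently helper below are illustrative stand-ins, not simulator code. On a standard (GIL) CPython build, the thread-pool run is expected to take roughly as long as running the tasks back to back, while the process-pool run can use separate cores.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    """Pure-Python busy loop standing in for a CPU-bound simulation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_concurrently(executor_cls, n_tasks: int, n: int):
    """Run n_tasks copies of cpu_bound(n) on the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=n_tasks) as pool:
        results = list(pool.map(cpu_bound, [n] * n_tasks))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Two overlapping "sims": threads share one GIL, processes do not.
    _, thread_s = run_concurrently(ThreadPoolExecutor, 2, 2_000_000)
    _, process_s = run_concurrently(ProcessPoolExecutor, 2, 2_000_000)
    print(f"threads:   {thread_s:.2f}s")
    print(f"processes: {process_s:.2f}s")
```

Exact timings depend on the machine and core count, so the script prints measurements rather than asserting a ratio.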
What
- Dockerfile, requirements.txt: run the sim server under gunicorn with multiple worker processes instead of the Flask dev server. CPU-bound sims now run on separate cores instead of serializing on one GIL.
  - WEB_CONCURRENCY controls worker count (default 4; override per host).
  - Uses --timeout 0 because each /simulation/ request streams SSE for the full run duration; a non-zero timeout would kill long runs mid-stream.
- simulator/config.py: accept DMP_API_URL as an alias for DMP_API_BASE_URL. The deploy compose sets DMP_API_URL, so without this the HTTP-fallback base URL silently defaulted to localhost:8000 (nothing listens there in the sim container) and could never reach the real dmp service if the in-process DMP ever became unavailable. In-process path is unaffected; this only repairs the fallback target. Backward-compatible (does not require any deploy change).

Verification
Ran the app under gunicorn --worker-class sync --timeout 0 with WEB_CONCURRENCY=2 locally: two concurrent 720h runs stayed at ~41-44s each instead of doubling. SSE progress streaming confirmed working through gunicorn; single-sim latency unchanged.
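For reference, the same settings can also be expressed as a gunicorn config file instead of command-line flags. This is a sketch only: the PR sets these through the Dockerfile CMD, and the worker_count helper and its handling of invalid values are assumptions. The settings names workers, worker_class, and timeout are standard gunicorn config options.

```python
# gunicorn.conf.py (hypothetical; the PR configures gunicorn via the Dockerfile CMD)
import os
from typing import Mapping


def worker_count(env: Mapping[str, str] = os.environ, default: int = 4) -> int:
    """Read WEB_CONCURRENCY, falling back to the PR's stated default of 4."""
    raw = env.get("WEB_CONCURRENCY", "")
    try:
        n = int(raw)
    except ValueError:
        # Unset or non-numeric: use the default (assumed behaviour).
        return default
    return n if n > 0 else default


# One process per worker, so concurrent CPU-bound sims do not share a GIL.
workers = worker_count()
worker_class = "sync"
# 0 disables the worker timeout; needed because each /simulation/ request
# streams SSE for the whole run.
timeout = 0
```

A file like this would be picked up with gunicorn -c gunicorn.conf.py followed by the application module, which is not named in this PR and is left out here.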
Notes
tests/test_deploy_entrypoints.py only asserts imports, so it is unaffected by the CMD change.

🤖 Generated with Claude Code