chore(ext-api): bump heap -Xmx1g to -Xmx2g by zbnerd · Pull Request #1293 · zbnerd/probabilistic-valuation-engine

zbnerd · 2026-06-16T08:05:12Z

Summary

scripts/systemd/maple-external-api.service: ExecStart -Xmx1g → -Xmx2g
.claude/skills/pipeline-test/SKILL.md: nohup -Xmx1g → -Xmx2g (+ benchmark comment)
The other 3 modules (calculator/sync/cleanup) stay at 1g; GC <3% there, which is sufficient

Why

GC was the real bottleneck. At 1g:

major GC every 2.4s, 22% CPU on garbage
102 files/s item-equipment, batch_wait 2.7s
heap 86% used (885MB / 1024MB)

At 2g:

GC 7% CPU
150 files/s item-equipment, batch_wait 1.6s
heap 49% used (1.06GB / 2.15GB)
calc downstream: 186 → 362 users/s

Raising in-flight from 100 to 250 had no effect on throughput (reviewed separately in an earlier PR). Heap is the effective gate.
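The before/after figures above can be turned into relative changes with a few lines of arithmetic. This is only a sketch over the numbers quoted in this description; the helper names `pct_change` and `heap_used_pct` are illustrative and not part of the repository.

```python
# Figures copied from the PR description; nothing here is newly measured.
BEFORE = {"gc_cpu_pct": 22, "files_per_s": 102, "batch_wait_s": 2.7, "calc_users_per_s": 186}
AFTER = {"gc_cpu_pct": 7, "files_per_s": 150, "batch_wait_s": 1.6, "calc_users_per_s": 362}


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def heap_used_pct(used: float, total: float) -> float:
    """Heap utilisation in percent; used and total must share a unit."""
    return used / total * 100.0


if __name__ == "__main__":
    for key in BEFORE:
        change = pct_change(BEFORE[key], AFTER[key])
        print(f"{key}: {BEFORE[key]} -> {AFTER[key]} ({change:+.0f}%)")
    # 1g figures are quoted in MB, 2g figures in GB; each ratio uses one unit.
    print(f"heap at 1g: {heap_used_pct(885, 1024):.0f}% used")
    print(f"heap at 2g: {heap_used_pct(1.06, 2.15):.0f}% used")
```

On these inputs, item-equipment throughput rises by roughly 47% and calc downstream by roughly 95%, while the heap ratios reproduce the 86% and 49% quoted above.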

Test plan

Restart ext-api with 2g, then confirm health UP on all 4 modules
item-equipment phase runs normally, progressing 14K → 85K
rate 150 files/s sustained
heap 62% used (stable)
GC 7% CPU (stable)
No OOM
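The "health UP on all 4 modules" step could be scripted roughly as below. This is a sketch under assumptions: the `/actuator/health` path and the JSON `status` field follow the Spring Boot Actuator default and are not confirmed by this PR, and only the two ports that the commit messages tie to a module (8081 for external-api, 8082 for calculator) are listed; the other two modules would need their own entries.

```python
import json
import urllib.request
from typing import Callable, Dict

# Assumed endpoints: path and port-to-module mapping are not stated in full by the PR.
HEALTH_URLS: Dict[str, str] = {
    "external-api": "http://localhost:8081/actuator/health",
    "calculator": "http://localhost:8082/actuator/health",
}


def fetch_status(url: str, timeout: float = 3.0) -> str:
    """Return the "status" field of a health JSON document, or "DOWN" on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status", "UNKNOWN")
    except (OSError, ValueError, AttributeError):
        # Connection errors, HTTP errors, invalid JSON, or a non-object JSON body.
        return "DOWN"


def unhealthy(urls: Dict[str, str], fetch: Callable[[str], str] = fetch_status) -> Dict[str, str]:
    """Map each module whose status is not "UP" to the status that was observed."""
    result: Dict[str, str] = {}
    for name, url in urls.items():
        status = fetch(url)
        if status != "UP":
            result[name] = status
    return result
```

Calling `unhealthy(HEALTH_URLS)` after the restart returns an empty dict when every listed module reports UP. The `fetch` parameter exists so the check can be exercised without a running service.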

🤖 Generated with Claude Code

Four Type=simple units (maple-{external-api,calculator,synchronizer,cleanup}) that run as the maple system user and source both /opt/maple/.env and the per-module /opt/maple/.env.<module> file (so MINIO_ACCESS_KEY / MINIO_SECRET_KEY get the correct SA credentials for each module).

- maple-cleanup.service bakes in -Dstorage.backend=minio (StorageConfig matchIfMissing=true otherwise silently falls back to LocalFs).
- Hardening: NoNewPrivileges, ProtectSystem=full, ProtectHome.
- Restart=on-failure, RestartSec=5, SuccessExitStatus=143.
- Logs to /var/log/maple/<module>.log (+ -error.log).

scripts/install-systemd-units.sh: idempotent installer. Verifies root, MAPLE_HOME, jars, .env, and 4 per-module .env files. Creates maple user, /var/log/maple, copies units to /etc/systemd/system/, daemon-reload, enable. Does NOT start the services; that is the operator's call.

scripts/* and scripts/systemd/* added to .gitignore allowlist so the new files are tracked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add START_MODE variable to switch between nohup (default, unchanged) and systemd (new) module startup.

- New 'Startup mode' section near the top of the skill explains the two modes and when to use each.
- Step 3 (Start modules) is now wrapped in an if/else on START_MODE. The 4 existing nohup blocks are preserved verbatim inside the 'nohup' branch. The 'systemd' branch uses systemctl start on the 4 maple-*.service units and waits on the same 4 health-check ports.
- Step 10 (Cleanup) also branches: nohup mode uses lsof+kill, systemd mode uses systemctl stop.
- Added a note that the -Dstorage.backend=minio JVM flag for module-cleanup is baked into maple-cleanup.service ExecStart, so it is not needed at runtime in systemd mode.

The systemd units assume a previous scripts/install-systemd-units.sh run on the target host; see scripts/systemd/ for the unit definitions.

Also widen .gitignore to allow .claude/skills/ so skills (which were already being modified in past commits) can be tracked without -f.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…to per-module subdir

Original units referenced /opt/maple/build/libs/module-X.jar (top-level) but actual JARs live at /opt/maple/module-X/build/libs/module-X.jar. Without this fix every service would fail with 'no main manifest attribute'.

…tall time

Install script previously copied unit files verbatim. Units had hardcoded /opt/maple paths which broke install on any non-prod host. Now the script does sed substitution before install, so the same units can be installed at any MAPLE_HOME.
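The install-time substitution described here can be illustrated with a short Python rendering of the same idea. It is not the script's actual sed expression, which this page does not show; `render_unit`, the sample unit text, and the `/usr/bin/java` path are hypothetical, and `module-X` is the placeholder used in the commit messages.

```python
import re

# Hypothetical unit excerpt; the real unit files live under scripts/systemd/.
SAMPLE_UNIT = """\
EnvironmentFile=/opt/maple/.env
ExecStart=/usr/bin/java -Xmx2g -jar /opt/maple/module-X/build/libs/module-X.jar
StandardOutput=append:/var/log/maple/external-api.log
"""


def render_unit(unit_text: str, maple_home: str, default_home: str = "/opt/maple") -> str:
    """Replace the hardcoded default home with maple_home wherever it appears as a path prefix."""
    home = maple_home.rstrip("/")
    # Match default_home only when followed by "/", whitespace, or end of text,
    # so a sibling directory such as /opt/maple-old is left alone.
    pattern = re.escape(default_home.rstrip("/")) + r"(?=/|\s|$)"
    # A callable replacement avoids backslash interpretation in the target path.
    return re.sub(pattern, lambda _match: home, unit_text)
```

With `render_unit(SAMPLE_UNIT, "/srv/maple")`, the EnvironmentFile and ExecStart paths move to /srv/maple while the /var/log/maple log path is untouched, since it never contained the default home.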

…e hairpin

The maple-network bridge on this host blocks hairpin NAT to host-bound ports 8081-8084 (modules), 9092 (kafka) and 5432 (airflow-db) because host kernel/firewall rules reject return packets from the container-side gateway (172.20.0.1 / 10.0.0.1) to host-bound TCP sockets. SSH/HTTP/HTTPS work because coolify proxies them; module ports do not.

Symptoms fixed:
- All 8 Prometheus scrape targets returned health=down (DNS NXDOMAIN for module:port names, deadline-exceeded for host.docker.internal).
- Airflow daily_collection_pipeline DAG runs FAILED in 2m20s because the HttpSensor (check_external_api) at host.docker.internal:8081 timed out, before the trigger task could even run.

Resolution:
- Prometheus: switch to network_mode: host, targets become localhost:8081..8084. No other in-Docker services are scraped (alertmanager/node-exporter containers are not running), so the loss of maple-network DNS is harmless.
- Airflow scheduler + webserver: switch to network_mode: host.
- Airflow DB connection string now uses airflow-db bridge IP 172.20.0.2 directly.
- Webserver binds host port 8180 via AIRFLOW__WEBSERVER__WEB_SERVER_PORT (avoids clashing with coolify on 8080).
- root user required for host networking.
- Airflow connections (external_api, calculator) updated to localhost:8081/8082.
- Kafka bootstrap_servers in DAG now read from KAFKA_BOOTSTRAP_SERVERS env (172.20.0.4:9092) so it still works without host.docker.internal.

Verified:
- Prometheus: 3 of 4 expected targets UP (8081/8082/8083). 8084 (rest-controller) remains DOWN with 404 (expected: no actuator).
- Airflow: fresh trigger of daily_collection_pipeline completes check_external_api and trigger_daily_collection; wait_for_completion sensor is rescheduling normally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
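Reading the Kafka bootstrap servers from the environment, as this commit describes for the DAG, might look like the sketch below. The function name `kafka_bootstrap_servers` is hypothetical, and using 172.20.0.4:9092 as a fallback when the variable is unset is an assumption; the commit message only quotes that address as the value.

```python
import os
from typing import List, Mapping, Optional

# Address quoted in the commit message; treating it as a default is an assumption.
DEFAULT_BOOTSTRAP = "172.20.0.4:9092"


def kafka_bootstrap_servers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the broker list from KAFKA_BOOTSTRAP_SERVERS, split on commas."""
    source = os.environ if env is None else env
    raw = source.get("KAFKA_BOOTSTRAP_SERVERS", DEFAULT_BOOTSTRAP)
    servers = [item.strip() for item in raw.split(",") if item.strip()]
    if not servers:
        # Fail loudly rather than connecting to nothing.
        raise ValueError("KAFKA_BOOTSTRAP_SERVERS is set but empty")
    return servers
```

Passing a mapping instead of relying on `os.environ` keeps the lookup testable outside the Airflow container.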

…/_SUCCESS markers

Root cause: ext-api policy lacked s3:DeleteObject, causing 403 when ext-api tried to delete the _RUNNING marker after ranking fetch completed. Pipeline stuck.

Audit of every objectStorage.* call across all 4 modules:

- ext-api: adds s3:DeleteObject on runs/*, snapshots/*, ocid-mapping/*. ChunkFileManager.deleteRunningMarker() and cleanupOnFailure() both do s3:DeleteObject on `<runKey>/_RUNNING` and `<runKey>/_SUCCESS`. OcidLookupPhase.deleteOldMappingFiles() does deleteByPrefix on `ocid-mapping/`. Bucket-level s3:ListBucket unchanged.
- calculator: adds s3:GetObject on calculator/runs/*. CalculatorChunkProcessingCoordinator does objectStorage.exists() (s3:HeadObject, satisfied by s3:GetObject in MinIO) and CalculationResultWriter does putStream (s3:PutObject, unchanged). The read-side GetObject on its own output prefix was previously missing.
- synchronizer: no change. Read-only: get() on runs/*, calculator/runs/*, ocid-mapping/*. Already covered.
- cleanup: no change. objectStorage.delete() on event.objectKey from ConsumedChunkInbox (runs/* and calculator/runs/* prefixes), and deleteByPrefix in RunCleanupService. Already had Get+Delete on both prefixes.

Live MinIO policy refreshed via 'bootstrap.sh --rotate' on 2026-06-15.

Structural test: ./gradlew :module-infra:test --tests "*MinioPolicyJsonTest*" — 7/7 pass.
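The missing-permission failure can be illustrated with a toy policy check. This is a simplified sketch, not MinIO's evaluator: it ignores Deny statements, conditions, and action wildcards. The bucket name `maple` and the presence of s3:GetObject and s3:PutObject in the ext-api statement are assumptions; the commit only confirms that s3:DeleteObject was added on the three prefixes and that s3:ListBucket stayed at bucket level.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict

BUCKET = "maple"  # hypothetical bucket name

# Illustrative ext-api policy after the fix; Get/Put are assumed, Delete is from the commit.
EXT_API_POLICY: Dict[str, Any] = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}/{prefix}/*"
                for prefix in ("runs", "snapshots", "ocid-mapping")
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
    ],
}


def is_allowed(policy: Dict[str, Any], action: str, resource: str) -> bool:
    """True if some Allow statement lists the action and a Resource pattern matches."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        if action not in statement["Action"]:
            continue
        if any(fnmatchcase(resource, pattern) for pattern in statement["Resource"]):
            return True
    return False
```

Under this sketch, deleting a `_RUNNING` marker under `runs/` is permitted only once s3:DeleteObject appears in the statement; with that action absent, the check returns False, which corresponds to the 403 described above.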

GC was the bottleneck. At 1g heap, major GC fires every 2.4s and burns 22% CPU. With 2g, GC drops to 7% and item-equipment throughput climbs from 102 to 150 files/s (calc downstream from 186 to 362 users/s). Verified 2026-06-16: heap 49% used at 2g, no OOM risk. Other 3 modules stay at 1g; verified GC <3% on each.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


zbnerd and others added 8 commits June 15, 2026 05:30

chore(gitignore): narrow .claude/skills/ negation to pipeline-test/ only

f416b8a

zbnerd merged commit 9cf1195 into develop Jun 16, 2026

This was referenced Jun 16, 2026

fix(race+throughput): close 3 throughput-limiting gaps #1294

Merged

chore(release): merge release-0.6.16 into master #1295

Merged

zbnerd merged 8 commits into develop from chore/ext-api-heap-2g
