chore(release): merge release-0.6.16 into master by zbnerd · Pull Request #1295 · zbnerd/probabilistic-valuation-engine

zbnerd · 2026-06-16T08:13:59Z

Summary

release-0.6.16 cut from develop
Includes all develop commits since release-0.6.12
Throughput improvements verified on running pipeline (item-equipment 150 files/s)

Changes since release-0.6.12

PR #1288: feat(minio): 4 SA prefix-policy isolation + ephemeral CI
PR #1286: feat(calculator,cleanup): stale runId filter + stale-kafka scan endpoint
feat(skill): systemd startup mode in pipeline-test
feat(infra): systemd unit files for 4 MinIO-using modules
fix(systemd): correct ExecStart JAR paths to per-module subdir
fix(install): substitute /opt/maple → MAPLE_HOME in unit files
fix(infra): host network for prometheus + airflow (bypass bridge hairpin)
fix(minio): audit + extend 4 SA policies for DeleteObject on _RUNNING/_SUCCESS
PR #1293: chore(ext-api): bump heap -Xmx1g to -Xmx2g
PR #1294: fix(race+throughput): close 3 throughput-limiting gaps
- ext-api/ChunkedSnapshotSink: whenComplete (not join) for race-free publish
- ext-api/application.yml: in-flight 100 → 250
- infra/MinioObjectStorage: drop temp-file double-spool, in-memory ByteArray
- calculator/CurrentRunIdHolder: polled from ext-api /run-status
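
The MinioObjectStorage bullet above (drop the temp file, buffer in memory) can be illustrated with a small Python analogue. This is only a sketch: `FakeClient`, `put_stream`, and the key `runs/example` are hypothetical names, and the real code is JVM code that is not shown in this PR text. The point it illustrates is that draining the stream first makes the content length known, so a single put suffices, at the cost of holding the whole chunk in memory.

```python
import io


class FakeClient:
    """Hypothetical stand-in for a client that needs the content length up front."""

    def __init__(self):
        self.calls = []

    def put_object(self, key, body, content_length):
        # A real client would send the bytes; here we only record the call.
        if len(body) != content_length:
            raise ValueError("content length does not match body")
        self.calls.append((key, content_length))


def put_stream(client, key, stream):
    """Drain the stream into memory, then upload it in one call (no temp file)."""
    data = stream.read()  # the whole chunk must fit in heap
    client.put_object(key, data, len(data))
    return len(data)


client = FakeClient()
written = put_stream(client, "runs/example", io.BytesIO(b"abc"))
```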

Test plan

4 modules health UP
MinIO health UP
item-equipment rate 146-157 files/s sustained
heap 54% used at 2g, GC 7% CPU
calculator downstream 300+ users/s
0 race errors after 185K+ chunks

🤖 Generated with Claude Code

* docs(spec): minio operations design - 5 SA + prefix policy + bootstrap container
  Single bucket preserved, 5 service accounts (ext-api, calculator, synchronizer, cleanup, read-api), prefix-scoped policies, one-shot minio-bootstrap container for SA/policy creation, root credential isolated to bootstrap env only. Zero Spring source change. Key rotation deferred to ADR.
* docs(plan): minio SA isolation - 12 tasks with TDD steps
  5 policy JSONs, idempotent bootstrap.sh, .env split, scope IT, rotation-deferred ADR. Zero Spring source change.
* docs(spec,plan): minio SA isolation post-grill revisions
  Spec: 5→4 SA (drop read-api per Q4 codebase audit), reassign ocid-mapping/* from synchronizer to ext-api (Q5), add CI/dev strategy to spec.
  Plan: 14 tasks, ILM 1-rule invariant, dev-bootstrap.sh, ephemeral CI, 4 per-SA BootSmokeIT classes, ADR with runbook.
* chore: branch feature/minio-sa-isolation - minio SA isolation baseline
* feat(minio): 4 SA policy JSONs (ext-api owns ocid-mapping) + structural test
* feat(minio): idempotent bootstrap.sh (bucket + 1-rule ILM + 4 SA + 4 policies)
* docs(env): .env.bootstrap.template - root + 4 SA secret placeholders
* chore(gitignore): minio per-SA env files
* refactor(docker): minio-init → minio-bootstrap, mount script + env_file
* fix(minio): remove unsupported s3:HeadObject from all 4 SA policies
  MinIO IAM subset does not include s3:HeadObject as a separate action; it is implicit in s3:GetObject. mc admin policy create rejects it. All 4 policy JSONs updated. Structural test still passes (the test does not assert s3:HeadObject presence). Verified against running MinIO via the Task 5 smoke test.
* fix(minio): bootstrap ILM loop without jq (use text-mode mc + bash read)
  The minio/mc Alpine image does not ship jq, awk, grep, or sed. The previous '|| true' silently masked the jq error, so the ILM cleanup loop was a no-op and re-runs accumulated duplicate rules (5 per prefix observed in the live environment before this fix).
  Switch to text-mode 'mc ilm ls', strip the box-drawing characters with tr, and use bash's read + positional params to extract (ID, PREFIX) pairs. Wrap the inner tokenizer in 'set +u' to handle stripped lines with fewer than 3 tokens. Drop the '|| true' on mc ilm rm so a failed removal fails the script loudly.
  Verified by running the bootstrap container twice against the live MinIO: after run 2, exactly 1 rule per prefix (snapshots/, runs/, calculator/, ocid-mapping/).
* feat(env): per-module MinIO SA env files; drop dead MinIO config from rest-controller
* feat(scripts): dev-bootstrap.sh - one-line env set generator
* docs(adr): minio key rotation deferred + manual runbook (prod-only)
* test(minio): remove obsolete MinioBootSmokeIT (single-class; replaced by per-SA)
* test(minio): boot smoke per SA (4 IT classes; replaces single-class MinioBootSmokeIT)
* test(minio): per-SA scope IT (3 tests, ext-api/calculator/cleanup, positive + 403 negative)
* ci(minio): ephemeral MinIO + random SA keys for SA-scope IT (no GitHub Secrets)
* fix(minio): synchronizer policy needs ocid-mapping read (OcidLookupService consumer)
  OcidLookupService.kt:29 reads ocid-mapping/ocid-mapping-<runId>.jsonl.gz files produced by OcidLookupPhase to populate synchronizer's in-memory mapping state. Without Read on ocid-mapping/* the synchronizer SA gets 403 on every ocid-lookup event.
  The security invariant is write-ownership, not read-ownership: ext-api is the sole WRITER of ocid-mapping/*; synchronizer is a READER. The synchronizer policy deliberately has no s3:PutObject action.
* test(minio): add positive synchronizer IT covering ocid-mapping read
  Asserts synchronizer can read runs/*, calculator/runs/*, and ocid-mapping/* and gets 403 on PutObject to ocid-mapping/*. The 403 on write is the load-bearing assertion: it proves ext-api is the sole writer (write-ownership invariant).
* ci(minio): verify-SAs step fails loudly on missing SA
  Previously the step only ran 'mc admin user list local' for human inspection. Now it greps for each of the 4 SAs and exits 1 with an explicit error message if any are missing. set -euo pipefail makes the step fail the job instead of silently passing.
* docs(spec): correct synchronizer ocid-mapping invariant (read, not write)
  Original Q5 audit incorrectly dropped synchronizer's ocid-mapping read access. Re-reading OcidLookupService.kt:29 shows synchronizer consumes ocid-mapping/ocid-mapping-*.jsonl.gz to populate its mapping state. The security invariant is write-ownership: ext-api is the sole writer; synchronizer is a reader. Spec text, Appendix A note, and revision history updated to reflect the corrected invariant.
* fix(minio): split policies into bucket-level + object-level statements (Q2); trim synchronizer to Get-only (Q5)
* test(minio): assert synchronizer listByPrefix returns 403 (Q5 least-privilege guard)
* feat(bootstrap): --rotate flag forces re-create of SAs and policies (Q3)
* feat(scripts): warn before regenerating .env.bootstrap in dev-bootstrap.sh (Q4)
* test(minio): assert cleanup policy actually grants s3:DeleteObject (regression guard)
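
The write-ownership invariant and the bucket-level/object-level statement split described in these commits can be checked structurally. The following is a minimal sketch, not the repository's MinioPolicyJsonTest: the bucket name, the helper names, and the exact action lists are assumptions for illustration, and only two of the four policies are modeled. The resource match is a plain string comparison, which is enough for a structural test but is not how MinIO evaluates wildcards.

```python
# Hypothetical policy documents: the bucket name and exact statements are
# illustrative, not copied from the repository.
BUCKET = "example-bucket"


def bucket_stmt(actions):
    """Bucket-level statement: actions such as s3:ListBucket target the bucket ARN."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": [f"arn:aws:s3:::{BUCKET}"],
    }


def object_stmt(actions, prefixes):
    """Object-level statement: actions scoped to object keys under each prefix."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": [f"arn:aws:s3:::{BUCKET}/{p}*" for p in prefixes],
    }


def policy(*statements):
    return {"Version": "2012-10-17", "Statement": list(statements)}


POLICIES = {
    "ext-api": policy(
        bucket_stmt(["s3:ListBucket"]),
        object_stmt(["s3:GetObject", "s3:PutObject"],
                    ["runs/", "snapshots/", "ocid-mapping/"]),
    ),
    # Get-only, no bucket-level statement: reads succeed, writes and listing do not.
    "synchronizer": policy(
        object_stmt(["s3:GetObject"],
                    ["runs/", "calculator/runs/", "ocid-mapping/"]),
    ),
}


def writers_of(prefix, policies):
    """Return the accounts whose policy allows s3:PutObject under `prefix`."""
    target = f"arn:aws:s3:::{BUCKET}/{prefix}*"
    found = []
    for name, doc in policies.items():
        for stmt in doc["Statement"]:
            if (stmt["Effect"] == "Allow"
                    and "s3:PutObject" in stmt["Action"]
                    and target in stmt["Resource"]):
                found.append(name)
                break
    return sorted(found)


all_actions = {a for d in POLICIES.values() for s in d["Statement"] for a in s["Action"]}
```

With these sample documents, `writers_of("ocid-mapping/", POLICIES)` yields only `ext-api`, and `all_actions` contains no `s3:HeadObject`, matching the two constraints stated in the commit messages.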

Four Type=simple units (maple-{external-api,calculator,synchronizer,cleanup}) that run as the maple system user and source both /opt/maple/.env and the per-module /opt/maple/.env.<module> file (so MINIO_ACCESS_KEY / MINIO_SECRET_KEY get the correct SA credentials for each module).
- maple-cleanup.service bakes in -Dstorage.backend=minio (StorageConfig matchIfMissing=true otherwise silently falls back to LocalFs).
- Hardening: NoNewPrivileges, ProtectSystem=full, ProtectHome.
- Restart=on-failure, RestartSec=5, SuccessExitStatus=143.
- Logs to /var/log/maple/<module>.log (+ -error.log).

scripts/install-systemd-units.sh: idempotent installer. Verifies root, MAPLE_HOME, jars, .env, and 4 per-module .env files. Creates maple user, /var/log/maple, copies units to /etc/systemd/system/, daemon-reload, enable. Does NOT start the services; that is the operator's call.

scripts/* and scripts/systemd/* added to .gitignore allowlist so the new files are tracked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add START_MODE variable to switch between nohup (default, unchanged) and systemd (new) module startup.
- New 'Startup mode' section near the top of the skill explains the two modes and when to use each.
- Step 3 (Start modules) is now wrapped in an if/else on START_MODE. The 4 existing nohup blocks are preserved verbatim inside the 'nohup' branch. The 'systemd' branch uses systemctl start on the 4 maple-*.service units and waits on the same 4 health-check ports.
- Step 10 (Cleanup) also branches: nohup mode uses lsof+kill, systemd mode uses systemctl stop.
- Added a note that the -Dstorage.backend=minio JVM flag for module-cleanup is baked into maple-cleanup.service ExecStart, so it is not needed at runtime in systemd mode.

The systemd units assume a previous scripts/install-systemd-units.sh run on the target host; see scripts/systemd/ for the unit definitions.

Also widen .gitignore to allow .claude/skills/ so skills (which were already being modified in past commits) can be tracked without -f.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…to per-module subdir

Original units referenced /opt/maple/build/libs/module-X.jar (top-level) but actual JARs live at /opt/maple/module-X/build/libs/module-X.jar. Without this fix every service would fail with 'no main manifest attribute'.

…tall time

The install script previously copied unit files verbatim. The units had hardcoded /opt/maple paths, which broke install on any non-prod host. Now the script does sed substitution before install, so the same units can be installed at any MAPLE_HOME.
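
The substitution step described above amounts to rewriting one path prefix in each unit file before copying it. A minimal Python sketch of that idea follows; the actual installer uses sed, and `render_unit`, the sample unit text, and the `/srv/maple` target are hypothetical names chosen for illustration.

```python
def render_unit(unit_text: str, maple_home: str) -> str:
    """Replace the hardcoded /opt/maple prefix with the target MAPLE_HOME.

    A plain substring replace also rewrites any path that merely starts with
    "/opt/maple" (for example "/opt/maple-old"), so a real installer may want
    a stricter pattern.
    """
    return unit_text.replace("/opt/maple", maple_home.rstrip("/"))


# Hypothetical unit fragment; the real unit files live under scripts/systemd/.
SAMPLE_UNIT = """[Service]
User=maple
EnvironmentFile=/opt/maple/.env
EnvironmentFile=/opt/maple/.env.calculator
ExecStart=/usr/bin/java -jar /opt/maple/module-calculator/build/libs/module-calculator.jar
StandardOutput=append:/var/log/maple/calculator.log
"""

rendered = render_unit(SAMPLE_UNIT, "/srv/maple/")
```

Paths outside the prefix, such as the log directory under /var/log/maple, are left untouched by this replacement.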

…e hairpin

The maple-network bridge on this host blocks hairpin NAT to host-bound ports 8081-8084 (modules), 9092 (kafka) and 5432 (airflow-db) because host kernel/firewall rules reject return packets from the container-side gateway (172.20.0.1 / 10.0.0.1) to host-bound TCP sockets. SSH/HTTP/HTTPS work because coolify proxies them; module ports do not.

Symptoms fixed:
- All 8 Prometheus scrape targets returned health=down (DNS NXDOMAIN for module:port names, deadline-exceeded for host.docker.internal).
- Airflow daily_collection_pipeline DAG runs FAILED in 2m20s because the HttpSensor (check_external_api) at host.docker.internal:8081 timed out, before the trigger task could even run.

Resolution:
- Prometheus: switch to network_mode: host, targets become localhost:8081..8084. No other in-Docker services are scraped (alertmanager/node-exporter containers are not running), so the loss of maple-network DNS is harmless.
- Airflow scheduler + webserver: switch to network_mode: host.
- Airflow DB connection string now uses airflow-db bridge IP 172.20.0.2 directly.
- Webserver binds host port 8180 via AIRFLOW__WEBSERVER__WEB_SERVER_PORT (avoids clashing with coolify on 8080).
- root user required for host networking.
- Airflow connections (external_api, calculator) updated to localhost:8081/8082.
- Kafka bootstrap_servers in DAG now read from KAFKA_BOOTSTRAP_SERVERS env (172.20.0.4:9092) so it still works without host.docker.internal.

Verified:
- Prometheus: 3 of 4 expected targets UP (8081/8082/8083). 8084 (rest-controller) remains DOWN with 404 (expected: no actuator).
- Airflow: fresh trigger of daily_collection_pipeline completes check_external_api and trigger_daily_collection; wait_for_completion sensor is rescheduling normally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
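
Reading the Kafka bootstrap servers from the environment, as the resolution list describes, might look like the sketch below. The function name and the fail-if-unset behavior are assumptions; the DAG source is not part of this PR text, so the real code may instead fall back to a default.

```python
import os


def kafka_bootstrap_servers(env=os.environ):
    """Return the broker list from KAFKA_BOOTSTRAP_SERVERS, failing loudly if unset.

    Accepts a comma-separated value so more than one broker can be listed.
    """
    raw = env.get("KAFKA_BOOTSTRAP_SERVERS", "").strip()
    if not raw:
        raise RuntimeError("KAFKA_BOOTSTRAP_SERVERS is not set")
    return [server.strip() for server in raw.split(",") if server.strip()]


# Example with the address quoted in the commit message.
servers = kafka_bootstrap_servers({"KAFKA_BOOTSTRAP_SERVERS": "172.20.0.4:9092"})
```

Passing the environment as a parameter keeps the function testable without touching the process environment.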

…/_SUCCESS markers

Root cause: ext-api policy lacked s3:DeleteObject, causing 403 when ext-api tried to delete the _RUNNING marker after ranking fetch completed. Pipeline stuck.

Audit of every objectStorage.* call across all 4 modules:

ext-api: adds s3:DeleteObject on runs/*, snapshots/*, ocid-mapping/*. ChunkFileManager.deleteRunningMarker() and cleanupOnFailure() both do s3:DeleteObject on `<runKey>/_RUNNING` and `<runKey>/_SUCCESS`. OcidLookupPhase.deleteOldMappingFiles() does deleteByPrefix on `ocid-mapping/`. Bucket-level s3:ListBucket unchanged.

calculator: adds s3:GetObject on calculator/runs/*. CalculatorChunkProcessingCoordinator does objectStorage.exists() (s3:HeadObject, satisfied by s3:GetObject in MinIO) and CalculationResultWriter does putStream (s3:PutObject, unchanged). The read-side GetObject on its own output prefix was previously missing.

synchronizer: no change. Read-only: get() on runs/*, calculator/runs/*, ocid-mapping/*. Already covered.

cleanup: no change. objectStorage.delete() on event.objectKey from ConsumedChunkInbox (runs/* and calculator/runs/* prefixes), and deleteByPrefix in RunCleanupService. Already had Get+Delete on both prefixes.

Live MinIO policy refreshed via 'bootstrap.sh --rotate' on 2026-06-15.

Structural test: ./gradlew :module-infra:test --tests "*MinioPolicyJsonTest*" — 7/7 pass.
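
The audit above pairs each objectStorage call with the S3 action it needs and compares that against what a policy grants. A small sketch of that comparison is shown here. The mapping for `deleteByPrefix` (list plus delete) and the pre-fix ext-api grant set are assumptions made for illustration; the commit message only states that s3:DeleteObject was missing.

```python
# S3 actions each storage call is assumed to need. exists() issues a HEAD,
# which MinIO authorizes through s3:GetObject (there is no separate HeadObject action).
REQUIRED_ACTIONS = {
    "exists": {"s3:GetObject"},
    "get": {"s3:GetObject"},
    "putStream": {"s3:PutObject"},
    "delete": {"s3:DeleteObject"},
    # Assumption: deleting by prefix lists the keys first, then deletes each one.
    "deleteByPrefix": {"s3:ListBucket", "s3:DeleteObject"},
}


def missing_actions(calls, granted):
    """Return the sorted actions required by `calls` that `granted` lacks."""
    required = set()
    for call in calls:
        required |= REQUIRED_ACTIONS[call]
    return sorted(required - set(granted))


# Hypothetical pre-fix ext-api grant: everything except s3:DeleteObject.
ext_api_calls = ["putStream", "delete", "deleteByPrefix"]
before_fix = {"s3:GetObject", "s3:PutObject", "s3:ListBucket"}
after_fix = before_fix | {"s3:DeleteObject"}

gap_before = missing_actions(ext_api_calls, before_fix)
gap_after = missing_actions(ext_api_calls, after_fix)
```

Under these assumptions the pre-fix gap is exactly `s3:DeleteObject` and the post-fix gap is empty, which mirrors the root cause stated in the commit message.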

GC was the bottleneck. At 1g heap, major GC fires every 2.4s and burns 22% CPU. With 2g, GC drops to 7% and item-equipment throughput climbs from 102 to 150 files/s (calc downstream from 186 to 362 users/s).

Verified 2026-06-16: heap 49% used at 2g, no OOM risk. Other 3 modules stay at 1g — verified GC <3% on each.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
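
For readers who want the relative gains behind these figures, the arithmetic can be worked as follows. The numbers are the ones quoted in the commit message; the helper name is hypothetical.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


files_gain = pct_change(102, 150)   # item-equipment files/s
users_gain = pct_change(186, 362)   # calculator downstream users/s
gc_cpu_drop = 22 - 7                # GC CPU share, in percentage points

print(f"item-equipment: +{files_gain:.1f}%")      # prints "item-equipment: +47.1%"
print(f"calc downstream: +{users_gain:.1f}%")     # prints "calc downstream: +94.6%"
print(f"GC CPU: -{gc_cpu_drop} percentage points")
```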

chore(ext-api): bump heap -Xmx1g to -Xmx2g

Three independent fixes that together restore item-equipment throughput to its expected 150 files/s. All verified end-to-end on 2026-06-16, heap 2g.

1. ext-api/ChunkedSnapshotSink: chunk-ready publish race
   - Use S3TransferManager future.whenComplete (not blocking join)
   - Order preserved: publish only fires after PUT completes
   - Writer thread returns in ~50ms (gzip+close), not 1-4s
   - 0 race errors vs prior 1133 failed chunks/90s
2. ext-api/application.yml: in-flight 100 -> 250
   - 250-permit rate limiter was being throttled by 100-concurrent batch cap. Aligning both unblocks the rate limiter.
   - Per-batch wave time 2.5s -> 1.6s (with heap fix in PR #1293)
3. infra/MinioObjectStorage: drop temp-file double-spool
   - Was: Files.createTempFile + Files.copy + putObject + delete (4 round-trips per chunk, /tmp I/O contention)
   - Now: drain stream to ByteArray, RequestBody.fromByteArray (1 round-trip, no disk)
   - Sync S3Client.putObject cannot chunked-stream (no length-1 API on sync path), so in-memory is the only sync option
4. calculator/CurrentRunIdHolder: in-memory set -> DB-backed known-runs
   - Was: ConcurrentHashMap (lost on restart, drift on multi-instance)
   - Now: polled from ext-api /run-status endpoint
   - Stale-chunk skip reason migrated to calculator_chunks_skipped_total
   - Coordinator + test refactored to match

Test plan:
- [x] ext-api item-equipment: 102 -> 150 files/s
- [x] calc downstream: 186 -> 362 users/s
- [x] 0 race errors after 100+ chunks
- [x] calculator skip reason 'stale_run' replaces 'endpoint_mismatch'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
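
The first fix (publish from a completion callback instead of blocking on join) can be illustrated with a Python analogue. This is not the ChunkedSnapshotSink code: it swaps the JVM's CompletableFuture.whenComplete for `concurrent.futures.Future.add_done_callback`, and `upload`, `write_chunk`, and the in-memory `store` are hypothetical stand-ins. It shows the two properties the commit claims: the writer returns without waiting for the upload, and the publish only happens after the PUT has completed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload(store, key, data):
    """Stand-in for the object-store PUT; sleeps to simulate network latency."""
    time.sleep(0.05)
    store[key] = data
    return key


def write_chunk(pool, store, published, key, data):
    """Submit the upload and return at once; publish from the completion callback."""
    future = pool.submit(upload, store, key, data)

    def on_done(done):
        # Runs only after the upload finished. Publish on success only,
        # and record whether the object was already visible in the store.
        if done.exception() is None:
            finished_key = done.result()
            published.append((finished_key, finished_key in store))

    future.add_done_callback(on_done)
    return future  # the caller does not block on the upload


store, published = {}, []
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(8):
        write_chunk(pool, store, published, f"chunk-{i}", b"x")
# Leaving the with-block waits for every upload and its callback to finish.
```

Publish order across chunks is not guaranteed here; only the per-chunk ordering (PUT before publish) is.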

fix(race+throughput): close 3 throughput-limiting gaps

Master was 3 commits ahead (PR #1287 release-0.6.12 + 2 sync merges). Develop had progressed 12 commits beyond master. This merge brings release-0.6.12 changes (181 files, 20k+ insertions) into develop so the hotfix branch reflects production code.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-06-16T08:14:05Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

zbnerd and others added 13 commits June 15, 2026 11:43

chore(gitignore): narrow .claude/skills/ negation to pipeline-test/ only

f416b8a

Merge pull request #1293 from zbnerd/chore/ext-api-heap-2g

9cf1195

chore(ext-api): bump heap -Xmx1g to -Xmx2g

Merge pull request #1294 from zbnerd/fix/race-and-throughput-trio

b8d2fb9

fix(race+throughput): close 3 throughput-limiting gaps

zbnerd merged commit 46e6f0d into master Jun 16, 2026
1 check failed

zbnerd deleted the release-0.6.16 branch June 16, 2026 08:15
