Skip to content

feat: cross-platform CI matrix, sandbox hardening, and operator tooling#535

Open
somethingwithproof wants to merge 19 commits intoCacti:developfrom
somethingwithproof:feat/distro-test-matrix
Open

feat: cross-platform CI matrix, sandbox hardening, and operator tooling#535
somethingwithproof wants to merge 19 commits intoCacti:developfrom
somethingwithproof:feat/distro-test-matrix

Conversation

@somethingwithproof
Copy link
Copy Markdown
Contributor

@somethingwithproof somethingwithproof commented Apr 14, 2026

Consolidated PR for Cacti 1.3 platform modernization, sandbox hardening, operator tooling, and security review follow-up. Matches maintainer's consolidation pattern from #523.

Scope

A. Cross-platform CI matrix (original)

Adds a CI lane that builds spine inside 10 Linux distro containers plus macOS, Windows (advisory), and FreeBSD 14. Ships scripts/test-distros.sh for local reproduction and docs/platforms.md listing supported platforms with per-distro build-dep install commands.

Matrix: rockylinux:9, rockylinux:8, almalinux:9, fedora:latest, debian:12, debian:trixie, ubuntu:22.04, ubuntu:24.04, opensuse/leap:15, alpine:3.20.

B. Sandbox and hardening (original + expanded)

  • seccomp allowlist with TSYNC + x86/x32 arch coverage, TIOCSTI deny, CLONE_NEWUSER deny
  • landlock ruleset narrowed to /tmp/spine, /run/spine, /var/run/spine
  • PR_SET_DUMPABLE + PR_SET_NO_NEW_PRIVS applied unconditionally at main() entry
  • AppArmor profile without /tmp/** catch-all; explicit /root and /etc/shadow denies
  • SELinux policy module with permissive-by-default documentation
  • libaudit hook with per-event rate limit (10/60s/type)
  • systemd unit with LimitCORE=0, LimitMEMLOCK=infinity, NoNewPrivileges, ProtectSystem=strict
  • --mlock flag to pin credentials against swap
  • LD_/DYLD_ prefix strip in child environment, PERL5OPT/PYTHONSTARTUP/RUBYOPT/NODE_OPTIONS scrubbed

C. Security review follow-up (new)

Findings from an in-depth review, consolidated from closed #538/#539/#540:

  • C1 poller.c: guard kill(pid, SIGKILL) against nft_pchild returning -1 (broadcast-kill prevention)
  • H1 util.c: enforce SECURITY.md-documented 0600 and owner match on spine.conf; open with O_NOFOLLOW|O_CLOEXEC
  • H2 error.c: async-signal-safe fatal handler (pre-formatted messages + write(2) + _exit); SIGPIPE ignored process-wide
  • H3 multi-file: FD_CLOEXEC on sockets, MySQL session, audit netlink, log fd
  • H7 util.c: length-bounded config tokenizer replacing sscanf("%15s %255s"); range checks on DB_Port / CircuitBreakerThreshold
  • H8 ping.c: strncopy in get_namebyhost (fixes 1-byte OOB write on overlength token)
  • H9 snmp.c: SNMPv3 contextEngineID carried as binary with explicit length (no strlen truncation at 0x00)
  • Misc: umask(027) at startup; spine_scrub_secrets() called from every exit path; SNMP community/password redaction in command-error logs; mysql_fetch_row null guards; snprintf accumulator truncation guards; php_readpipe buffer-full break; circuit breaker threshold clamp; nft_popen close_cleanup null check replacing assert

D. Build fixes (applied during CI iteration)

  • <sys/prctl.h> under __linux__ for prctl calls
  • <stdint.h> under HAVE_LANDLOCK for uint64_t
  • _GNU_SOURCE guard in platform_posix.c and platform_process_posix.c for pthread_setname_np / pipe2 on clang
  • install-apt-deps validator regex fixed (tr -d set, - escaped) so package lists with spaces are accepted
  • strncpymemcpy in strncopy/die/spine_log to silence -Wstringop-truncation
  • db_escape trim_limit tightened for -Wformat-truncation
  • Integration/FreeBSD fixtures chmod 0600 on spine.conf for H1 enforcement

Tests

New unit tests: test_poller_pid_guard, test_config_perms, test_signal_handler, test_cloexec, test_get_namebyhost, test_snmp_engine_id, test_config_parser, test_startup_umask, test_scrub_secrets, test_snprintf_guard, test_redact_args, test_spine_audit. Extended test_circuit_breaker for threshold clamp. Extended test_env_scrub for LD_/DYLD_ prefix strip and interpreter init vars.

Full suite: ctest --test-dir build --output-on-failure → 30/30 pass locally.

Local reproduction

bash scripts/test-distros.sh               # all
bash scripts/test-distros.sh rockylinux:9  # one

Build logs land under build-reports/.

Fixes

openSUSE Leap 15 default gcc 7.5 rejects -std=c17; workflow and script install gcc13 and export CC=gcc-13.

The Rocky 9 pipe2 fix landed on #523 separately since it addresses a Rocky-Linux compile failure the maintainer flagged.

Consolidation note

Superseded: #538 (critical-security), #539 (config-parser), #540 (memory-concurrency). All commits fast-forwarded onto this branch. Matches the consolidation pattern from #523 to avoid merge-order conflict risk across util.c / spine.c / platform layers / CI files.

Copilot AI review requested due to automatic review settings April 14, 2026 21:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR significantly expands CI and platform validation for Spine by adding a Docker-based Linux distro build matrix plus additional OS lanes, along with local reproduction tooling and platform documentation. It also introduces substantial portability/runtime work (new platform abstraction layer, new unit/integration tests, and systemd sd_notify support) and updates packaging/build scripts toward the CMake/Ninja path.

Changes:

  • Add a cross-distro CI build/test matrix (10 Linux containers + macOS + Windows advisory + FreeBSD) and supporting security/quality workflows.
  • Add local reproduction tooling (scripts/test-distros.sh) and new/expanded unit + integration tests for platform and networking behavior.
  • Add/extend platform portability code (platform layer, systemd notify wrapper) and update packaging/build scripts toward CMake/Ninja.

Reviewed changes

Copilot reviewed 112 out of 125 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
.github/actions/install-apt-deps/action.yml Composite action to install apt dependencies for workflows.
.github/cppcheck-baseline.txt Placeholder baseline for cppcheck outputs.
.github/instructions/instructions.md Updates repo contribution/CI guardrail guidance.
.github/nightly-leak-baseline.json Baseline thresholds for leak-trend gating.
.github/perf-baseline.json Baseline thresholds for perf regression checks.
.github/scripts/check-leak-trend.py Parses sanitizer/valgrind logs and enforces thresholds.
.github/scripts/check-unsafe-api-additions.sh CI guard against introducing banned unsafe C APIs.
.github/scripts/check-workflow-policy.py Workflow hygiene enforcement (pinned actions, set -euo pipefail, no `curl
.github/scripts/clang_tidy_to_sarif.py Converts clang-tidy output into SARIF.
.github/scripts/cppcheck_to_sarif.py Converts cppcheck output into SARIF.
.github/workflows/codeql.yml Adds CodeQL scanning workflow.
.github/workflows/coverage.yml Adds gcc+lcov coverage workflow and minimum coverage gate.
.github/workflows/distro-matrix.yml Adds the multi-distro container build matrix plus macOS/Windows/FreeBSD lanes.
.github/workflows/fuzzing.yml Adds sanitizer-based CLI fuzz smoke workflow.
.github/workflows/integration.yml Adds DB integration, net-snmp compatibility builds, and Docker-based integration tests.
.github/workflows/release-verification.yml Adds hardened release verification and install smoke checks.
.github/workflows/security-posture.yml Adds security posture checks (TruffleHog/Semgrep/Scorecard + workflow policy gate).
.github/workflows/weekly.yml Adds weekly deep checks (repro build, include graph, spell check, etc.).
.clang-tidy Enables clang-tidy check set for the repo.
.dockerignore Updates Docker build context exclusions (e.g., ignore build/).
CHANGELOG Records platform/build/security/testing updates in the changelog.
CMakePresets.json Adds CI-oriented CMake preset definitions.
Dockerfile Moves container build to CMake/Ninja-based build/install.
Dockerfile.dev Moves dev container build to CMake/Ninja and sanitizer-friendly flags.
Makefile.am Removes autotools Makefile.am (transition away from autotools).
SECURITY.md Adds a project security policy and deployment recommendations.
VERSION Adds/updates the canonical project version file.
bootstrap Removes legacy autotools bootstrap script.
config/config.h.cmake.in Adds a CMake-generated config.h template.
docs/platform-idioms.md Documents platform-specific implementation idioms.
docs/platforms.md Documents supported platforms and per-distro dependency install commands.
etc/spine.1 Adds a generated man page for spine.
etc/spine.conf.dist Adds a distributed default config file.
etc/systemd/spine.service Adds a systemd service unit for Spine with Type=notify and hardening.
etc/systemd/spine.timer Adds a systemd timer unit for periodic poll cycles.
package Updates packaging script to validate via CMake instead of autotools.
packaging/README.md Updates packaging docs to reflect CMake/Ninja build path.
packaging/debian/README Notes Debian packaging uses CMake+Ninja.
packaging/debian/control Switches Debian build-deps from autotools to cmake/ninja.
packaging/debian/rules Switches Debian build rules to cmake+ninja and CMake options.
packaging/rpm/README Notes RPM spec uses CMake+Ninja.
packaging/rpm/spine.spec Switches RPM build/install steps from autotools to CMake+Ninja.
scripts/copyright_year.sh Refactors/normalizes formatting and flow of copyright updater.
scripts/test-distros.sh Adds local Docker-based multi-distro build/test driver.
scripts/verify.sh Updates verification script to use CMake build + scan-build lanes.
src/common.h Removes older Cygwin workarounds and includes platform header.
src/error.c Uses platform localtime wrapper in signal handler path.
src/error.h Adds header for signal handler install/uninstall APIs.
src/keywords.c Adds/refactors keyword mapping implementation into its own TU.
src/keywords.h Adds header for keyword lookup APIs.
src/locks.h Adds header for lock/mutex APIs.
src/nft_popen.h Adds spine_build_child_env() API documentation declaration.
src/php.h Adds header for PHP process APIs.
src/ping.h Removes large Cygwin ICMP struct block and adds ping_init() declaration.
src/ping_ipv6_scope.c Adds standalone IPv6 link-local scope-id resolver TU.
src/ping_validate.c Adds standalone ICMP echo payload validator TU.
src/ping_wire.h Defines on-wire ping payload layout and validator signature.
src/poller.c Uses platform fd wrappers, expands buffers safely, improves OOM/DB-conn failure handling.
src/poller.h Adds header for poller APIs.
src/snmp.c Uses platform setenv wrapper; tightens initialization and sensitive-zeroing behavior.
src/snmp.h Adds header for SNMP APIs.
src/spine.c Adds systemd notify hooks, SIGHUP/SIGTERM behavior, platform init/cleanup, and portability tweaks.
src/spine.h Enforces include ordering, widens SNMP protocol buffers, updates PID type, removes cygwin fields.
src/spine_sem.h Adds pthread-based portable counting semaphore wrapper.
src/sql.c Replaces sleep/usleep with platform sleep wrappers; improves SSL option typing and db_escape bounds logic.
src/sql.h Minor macro formatting/indent changes for MySQL option-setting helper.
src/systemd_notify.c Adds systemd sd_notify wrapper implementation with no-op stubs when absent.
src/systemd_notify.h Adds systemd notify wrapper header and API docs.
src/util.h Removes add_slashes() declaration (Cygwin-related removal).
src/platform/platform.h Adds platform API header (init, sleeps, env, localtime, pid, tty detection).
src/platform/platform_common.c Adds init/cleanup reference counting.
src/platform/platform_error.h Adds platform error-string API header.
src/platform/platform_error_posix.c Adds POSIX strerror_r wrapper implementation.
src/platform/platform_error_win.c Adds Windows FormatMessageA error-string wrapper implementation.
src/platform/platform_fd.h Adds portable fd read/write/wait API header.
src/platform/platform_fd_posix.c Adds POSIX fd wrapper implementation (select-based wait).
src/platform/platform_fd_win.c Adds Windows fd wrapper implementation (PeekNamedPipe wait).
src/platform/platform_icmp.h Adds platform ICMP abstraction header.
src/platform/platform_icmp_posix.c Adds POSIX ICMP abstraction implementation (forwards into ping.c helpers).
src/platform/platform_posix.c Adds POSIX platform backend (env, localtime, sleep, pid, tty checks).
src/platform/platform_process.h Adds portable process API header (pipe/spawn/wait/terminate).
src/platform/platform_process_posix.c Adds POSIX process backend (pipe2/pipe+cloexec, posix_spawn retry).
src/platform/platform_socket.h Adds portable socket API header.
src/platform/platform_socket_posix.c Adds POSIX socket wrapper implementation.
src/platform/platform_socket_win.c Adds Windows socket wrapper implementation and error-domain helpers.
src/platform/platform_win.c Adds Windows platform backend (WSA init, sleeps incl. sub-ms, pid, tty checks).
tests/integration/test_ipv6_transport.sh Adds integration test for IPv6-targeted poll path behavior.
tests/integration/test_output_regex.sh Refactors helper functions/compose usage and improves robustness.
tests/snmpv3/db/init.sql Updates SNMPv3 auth protocol from SHA-256 to SHA.
tests/snmpv3/docker-compose.yml Updates SNMPv3 healthcheck auth protocol from SHA-256 to SHA.
tests/snmpv3/scripts/run_test.sh Formatting/robustness improvements in SNMPv3 test driver script.
tests/snmpv3/snmpd/snmpv3.conf Updates user creation to SHA (compatibility).
tests/unit/Makefile Reworks unit test Makefile to build/run multiple platform smoke tests.
tests/unit/test_build_fixes.c Removes cygwinshloc field usage in unit fixture struct.
tests/unit/test_ping_ipv6_scope.c Adds unit tests for IPv6 scope-id behavior (POSIX; no-op main on Windows).
tests/unit/test_ping_reply_validation.c Adds unit tests for ICMP reply payload validator.
tests/unit/test_platform_dns.c Adds unit tests around getaddrinfo numeric/dual-stack behavior.
tests/unit/test_platform_env.c Adds unit tests for platform setenv wrapper semantics.
tests/unit/test_platform_error.c Adds unit tests for platform error-string wrapper behavior.
tests/unit/test_platform_fd.c Adds unit tests for fd read/write/wait wrappers.
tests/unit/test_platform_helpers.h Adds minimal assertion helpers for platform/unit tests.
tests/unit/test_platform_process.c Adds unit tests for process helpers (spawn/wait/terminate, Windows UTF-8 path).
tests/unit/test_platform_time.c Adds unit tests for platform init/localtime/sleep helpers.
tests/unit/test_systemd_notify.c Adds unit tests for systemd notify wrapper idempotency/null-safety.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/distro-matrix.yml Outdated
Comment thread .github/workflows/distro-matrix.yml
Comment thread .github/workflows/distro-matrix.yml
Comment thread .github/workflows/distro-matrix.yml Outdated
Comment thread .github/workflows/distro-matrix.yml
Comment thread tests/unit/test_systemd_notify.c
Comment thread tests/unit/test_platform_process.c
Comment thread tests/integration/test_ipv6_transport.sh Outdated
Comment thread scripts/test-distros.sh Outdated
@somethingwithproof
Copy link
Copy Markdown
Contributor Author

Latest push extends the matrix to cover every supported OS as a first-class citizen with explicit tier classification.

  • Tier 1 (primary, blocking CI): Ubuntu 22.04/24.04, Debian 12, Rocky 9, Alma 9, Fedora latest, macOS.
  • Tier 2 (supported, blocking CI): Rocky 8, Debian trixie, openSUSE Leap 15, Alpine 3.20, FreeBSD 14.
  • Tier 3 (advisory, non-blocking): NetBSD 10, OpenBSD 7.5, Windows MSYS2/MinGW, UBI 9. New CI lanes for NetBSD and OpenBSD use `cross-platform-actions/action`.
  • Tier 4 (experimental, no CI): AIX and Solaris/illumos. Compile guards exist; tracking issue Platform support: AIX and Solaris feasibility (Tier 4) #536 covers what's needed to reach Tier 3.

Source side: extended the BSD macro guards in `src/ping.c` and `src/platform/platform_process_posix.c` to include `DragonFly` so DragonFly inherits the same `arc4random`/`pipe2` paths as the other BSDs.

`docs/platforms.md` is rewritten around the four tiers with install commands for every entry and an issue-reporting label scheme.

@github-advanced-security
Copy link
Copy Markdown

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Comment thread .github/actions/install-apt-deps/action.yml Fixed
Copy link
Copy Markdown

@github-advanced-security github-advanced-security AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cppcheck found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link
Copy Markdown
Contributor Author

@somethingwithproof somethingwithproof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## 📋 Security Analysis Summary

This PR significantly enhances Spine's security posture by introducing platform-specific sandboxing (pledge, unveil, PR_SET_NO_NEW_PRIVS) and a comprehensive multi-distro CI matrix. The automation infrastructure is well-structured and follows modern CI/CD best practices.

🔍 General Feedback

  • Sandbox Integration: Excellent addition of spine_sandbox_* hooks. The OpenBSD implementation correctly uses unveil to restrict filesystem access and pledge to narrow the syscall surface.
  • Linux Hardening: Using PR_SET_NO_NEW_PRIVS on Linux is a great low-cost win to prevent privilege escalation via execve.
  • CI Hygiene: Workflows use pinned SHAs for actions and limited GITHUB_TOKEN permissions (contents: read).
  • Matrix Coverage: The Docker-based distro matrix ensures consistent security behavior across diverse toolchains (glibc vs musl, different GCC versions).

Detailed Findings

  • LOW: Command Injection Risk in Developer Scripts (scripts/test-distros.sh, scripts/test-workflows.sh)

    • These scripts pass command-line arguments directly to shell execution points (docker run ... sh -c and act -j).
    • While these are local developer tools, they could be exploited if a developer is tricked into running them with malicious arguments.
    • Recommendation: Consider simple validation or quoting for arguments used in shell contexts.
  • LOW: Supply Chain Risk (GitHub Workflows)

    • The linux job in distro-matrix.yml uses distro tags (e.g., rockylinux:9) without SHA-256 digests.
    • Recommendation: For maximum security, pin container images by digest to prevent poisoning of the base image.

@somethingwithproof somethingwithproof changed the title ci: add Docker-based multi-distro build matrix feat: cross-platform CI matrix, sandbox hardening, and operator tooling Apr 15, 2026
Comment thread src/poller.c Fixed
Comment thread src/platform/platform_win.c Fixed
Comment thread src/poller.c Fixed
Comment thread src/poller.c Fixed
Comment thread src/poller.c Fixed
Comment thread src/ping.c Fixed
Comment thread src/php.c Fixed
Comment thread src/circuit_breaker.c Fixed
Comment thread src/circuit_breaker.c Fixed
Comment thread src/circuit_breaker.c Fixed
Comment thread build-fix/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-fix/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-mariadb/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-mariadb/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-preflight/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-uv/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-uv/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread src/ping.c
return HOST_DOWN;
}
#else
static int ping_icmp_ipv6(spine_spine_host_t *host, ping_t *ping) {
Comment thread src/ping.c
*
*/
int ping_tcp(spine_spine_host_t *host, ping_t *ping) {
double begin_time, end_time, total_time;
Comment thread src/poller.c Fixed
Comment thread build-cov/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
Comment thread build-cov/CMakeFiles/4.3.1/CompilerIdC/CMakeCCompilerId.c Fixed
@somethingwithproof somethingwithproof force-pushed the feat/distro-test-matrix branch 2 times, most recently from c8e7207 to 4f66b29 Compare April 20, 2026 06:13
This squashes the entire asynchronous architecture rewrite, including:
- libuv event loop foundation and stage-based pipeline
- Persistent PHP script server socket pool
- PCRE2 JIT regex cache
- Global OID multiplexing with WRR scheduling
- Adaptive token-bucket governors
- Safe libuv memory lifecycles and telemetry sink
The dummy 'SELECT 1' that used to live in stage_flush was test
scaffolding I removed, but that was also the only thing forcing the
batcher to commit on cycle end. With just the 500ms internal timer,
a cycle that completed in <500ms left pending writes in memory;
process exit and db_disconnect then ran before they hit MySQL.

Call spine_async_batch_flush() at pipeline tear-down, before the
uv_walk/force-close pass, and let uv_run drain the resulting
mysql_real_query_cont callbacks. The real flush now runs at the one
point in the process lifecycle where we know every poll result is
in-queue and the DB handle is still live.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
ares_process_fd can call back into sock_state_cb ->
spine_async_dns_upsert_watcher -> spine_async_dns_remove_watcher,
which frees the current watcher mid-call. Any dereference of
`watcher` after ares_process_fd is UAF. The current code happens to
read fields only before the call, but there is no comment documenting
that invariant and a future refactor would break it silently.

Capture the fd explicitly before the call, document that runtime is
the only pointer safe to use post-call, and make spine_async_dns_on_poll
self-contained. No functional change.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Before uv_loop_close, spine runs db_disconnect on every MYSQL handle.
An async query that arrives between the flush drain and the db close
would succeed at submission, then fire its callback after the handle
was freed - use-after-free.

Add a process-wide shutdown fence:

- spine_async_mysql_shutdown_begin() flips a volatile flag; new
  queries via spine_async_mysql_query return -ESHUTDOWN.
- In-flight queries keep their existing uv_poll chain so the uv_run
  drain completes them normally.
- Non-HAVE_MYSQL_ASYNC build provides the symbol as a no-op so
  callers in spine.c do not need the compile-time guard.

Spine calls the fence immediately before the batch flush so the
flush's own callbacks still run, and then no further submissions are
possible while the drain is in progress.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
M2: setrlimit(RLIMIT_CORE, 0) silently ignored its return. On systems
with /etc/security/limits.conf enforcing a non-zero hard minimum, it
left the limit unchanged and the operator had no signal. Log to
stderr on failure so post-mortem cores have a documented cause.

M6: Dependabot weekly cadence for docker left a 7-day exposure
window on base-image CVEs. Switch to daily for docker; github-actions
stays weekly because action SHAs churn slowly and noisy.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Two #define SPINE_BZ / #undef pairs existed — one per scrub helper —
each selecting explicit_bzero vs spine_volatile_bzero based on
HAVE_EXPLICIT_BZERO. Collapse them into one static inline spine_bzero
with a single #ifdef. Every scrub call routes through it; adding a
new scrubbable field no longer requires remembering which macro
pair a file is using.

Also extend target_t scrub to cover snmp_engine_id and snmp_context.
They are not strictly secret, but they are per-device identifiers
operators prefer not to surface in a core dump alongside a password.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The prior ordering flipped the async-mysql shutdown fence BEFORE
calling spine_async_batch_flush. spine_async_batch_flush submits
pending batched writes through spine_async_mysql_query; with the
fence already set, those submissions returned -ESHUTDOWN and the
entire pending batch was silently dropped - the same data-loss
mode the dummy-flush removal was supposed to fix.

Swap the order: flush first, drain callbacks, then close the gate.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The prior flag list included -3p / -3x / --3authPassphrase /
--3privPassphrase. Those are not real net-snmp options; they were
added based on assumption, not documentation. Tests asserted
redaction against strings that never appear in real invocations.

Replace with the actual net-snmp key-carrying short flags from
snmp.conf(5):
  -3m  master auth key
  -3M  master priv key
  -3k  localized auth key
  -3K  localized priv key

Drop the fabricated long flags. Update tests to exercise the real
flag forms.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Prior code used UV_RUN_NOWAIT in a 300-iteration spin, which returned
immediately regardless of outstanding requests. The 3-second deadline
was fiction: 300 iterations completed in microseconds, not seconds,
and the runtime was freed while requests were still in flight. The
next on_resolved dereferenced ctx->runtime (now freed) to decrement
inflight - UAF.

Switch to UV_RUN_ONCE driven by a wall-clock deadline (time(NULL)+3).
UV_RUN_ONCE blocks until one libuv event fires or the loop has no
pending work, so the 3-second budget is actually spent waiting.

If the deadline still expires with requests in flight, DO NOT free
runtime. The outstanding callbacks carry a pointer to it; freeing
now is UAF. Leak the struct (process is terminating) and log a
warning so operators can see the stuck resolver.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Three issues flagged in self-review:

1. The reap pass only fired from spine_cb_record. On an idle daemon
   (no poll failures), the reap never ran and old entries from a
   prior failure burst sat in the hash for the process lifetime.
   Add the same reap call to spine_cb_should_skip, which is invoked
   every cycle for every host. Rate limiting via
   SPINE_CB_REAP_INTERVAL already prevents thrash.

2. The reap sweep held the table mutex O(N). A large unhealthy
   population stalled other threads. Add SPINE_CB_REAP_BATCH (256)
   cap; unscanned entries wait for the next sweep.

3. Tests could not drive the reap deterministically because it used
   time(2). Add spine_cb_set_clock_for_test(fn) and thread the clock
   through every time(NULL) caller via spine_cb_clock().

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Two self-review findings:

- g_spine_uv_leaked_handles was a plain int. Single-threaded today,
  but a future refactor that walks the loop from a worker thread
  would race on increment. Switch to _Atomic int with relaxed-order
  fetch_add; this is only an operator diagnostic so no synchronisation
  guarantees are required beyond 'no lost increments'.

- g_async_mysql_shutting_down was a plain volatile int. Worker
  threads on weakly-ordered architectures (ARM, POWER) might observe
  stale 0 after the main thread sets it. Switch to atomic_int with
  explicit release on the set and acquire on the read so the
  happens-before ordering is enforced across thread boundaries.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Prior test hardcoded 512. If spine.h BUFSIZE rises, the test stops
exercising the production-width truncation path. Use BUFSIZE with a
fallback for the case where the production header is not yet
available to the test TU, and guard with #error if BUFSIZE is ever
shrunk below the fallback.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Operators had no visibility into:

- how many CB entries exist, how many have been reaped over time,
  and how many trips have occurred (spine_cb_get_stats)
- how many async queries got refused after the shutdown fence
  flipped (spine_async_mysql_shutdown_refused_count)

Both are cheap atomic counters. The CB counters live behind the
table lock since they are read together with an exact HASH_COUNT;
the mysql refusal counter uses atomic_ulong for lock-free increment
on the hot submission path.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Three test files fill the gaps called out in the self-review:

- test_spine_cb_reap.c: drives the age-reap path through the mock
  clock seam. Asserts a healthy host occupies no hash entry, a trip
  records one, eviction at idle-threshold drops a recovered entry,
  and a still-cooling entry survives the idle threshold.

- test_async_mysql_shutdown.c: verifies spine_async_mysql_query
  returns a negative error after spine_async_mysql_shutdown_begin.
  API-only; works on both HAVE_MYSQL_ASYNC and fallback builds.

- test_async_exec_shell.c: source-scan invariant that async_exec.c
  still uses /bin/sh -c unconditionally. If anyone reintroduces the
  dropped metachar-detect hybrid (needs_shell, async_exec_parse_argv)
  the test fails. Defensive against review-refresh regressions.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
CRITICAL: tests/unit/test_async_exec_shell.c, test_spine_cb_reap.c,
and test_async_mysql_shutdown.c were not listed in cmake/SpineTests.cmake
and therefore did not build or run. Add matching add_executable +
add_test entries with the correct link libraries and source-file
dependencies.

HIGH: test_async_exec_shell.c now reads SPINE_SOURCE_ROOT from a
compile-time define set by CMake to CMAKE_SOURCE_DIR, so the source-
scan works from any ctest CWD rather than relying on 'run from
project root'.

HIGH: add spine_async_mysql_shutdown_reset_for_test so unit tests
can flip the fence more than once. Production callers must not use
this; the production fence stays a one-way latch.

HIGH: replace ASSERT_TRUE(count >= 0) tautology in the shutdown test
with a before/after delta and build-matrix-aware expected values
(HAVE_MYSQL_ASYNC bumps the counter by 1; fallback leaves it zero).

HIGH: add a source-scan test that asserts
spine_async_batch_flush() appears before
spine_async_mysql_shutdown_begin() in spine.c. Reversing the order
silently drops the pending batch at shutdown and now fails the
test.

HIGH: spine_uv_run_bounded caps each shutdown drain at 3 wall-clock
seconds. Three previously-unbounded uv_run(UV_RUN_DEFAULT) calls
could hang shutdown indefinitely on a wedged callback.

MEDIUM: spine_cb_init resets spine_cb_last_reap so tests cycling
init/shutdown/init under a rewound mock clock do not carry stale
reap-cadence state forward.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
DRY: the 3-second per-phase drain budget was repeated 4 times in
spine.c and once in async_dns.c with a magic literal. Define
SPINE_SHUTDOWN_DRAIN_SECS next to spine_uv_run_bounded and point the
async_dns comment at it. The helper no longer takes a seconds
parameter - every caller wants the same budget.

KISS: drops four repeated '(loop, 3)' call-site arguments in spine.c
in favour of a single-argument helper. Behaviour unchanged.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
atoi has three silent failure modes on the DB-sourced config values:

- Non-numeric input returns 0 without any signal. A typo in the DB
  silently resets a setting to zero.
- Integer overflow is undefined behaviour on C99+. A value of
  "9999999999" invokes UB on 32-bit int platforms.
- No distinction between "0" and "bogus".

spine_atoi uses strtol with errno detection: ERANGE clamps to
INT_MAX / INT_MIN rather than wrapping, non-numeric input (end == s)
returns 0 preserving the original contract for callers that already
treat 0 as 'use default'. 20 call sites in util.c swapped.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
CRITICAL: spine_atoi (now spine_parse_int) returned INT_MAX on large
negative overflow because the ERANGE branch dominated the sign check.
A config value of '-9999999999999999999' silently became INT_MAX, a
sign-flip that corrupts every setting passed through it. Split the
ERANGE branch on sign so overflow in either direction clamps to the
matching bound. Renamed to spine_parse_int so 'grep atoi' is no
longer ambiguous.

HIGH: new test CMake entries linked only against spine_hardening.
util.c / circuit_breaker.c / async_mysql.c transitively depend on
mysql, net-snmp, and libaudit; without linking them, the test
binaries fail the undefined-reference stage. Add spine_mysql,
spine_netsnmp, Threads, and the conditional audit target to every
new test target. Call spine_require_mysql / spine_require_netsnmp
first to register those INTERFACE libraries.

HIGH: atomic-style consistency. g_spine_uv_leaked_handles was
_Atomic int while the other atomics used atomic_int / atomic_ulong.
Switch to atomic_int so every counter matches the stdatomic.h
typedefs.

MEDIUM: spine_uv_run_bounded now measures the deadline with
uv_hrtime (monotonic nanoseconds) instead of time(2), so the 3-
second budget is real sub-second resolution rather than the 0-1s
jitter time(2) produces.

MEDIUM: PATH_MAX fallback for AIX/HP-UX builds that leave it
undefined. 4096 matches Linux default.

MEDIUM: shutdown metrics log line. One SPINE_LOG at teardown prints
libuv_leaked, cb_entries, cb_trips, cb_reaped, and
mysql_refused_after_shutdown. Grep-friendly for operators without a
telemetry endpoint.

LOW: SPINE_SHUTDOWN_DRAIN_SECS moved to src/spine_shutdown.h so
spine.c and async_dns.c share the constant. Raising the budget in
one place updates every subsystem.

LOW: test_async_mysql_shutdown uses a real uv_loop_init rather than
a synthetic (uv_loop_t *)1. Future refactor that dereferences the
loop before the fence check now fails cleanly against an empty
loop instead of segfaulting.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
HIGH: shutdown metrics SPINE_LOG always at INFO flooded the operator
log on cron-driven restarts. Emit at INFO only when any counter is
non-zero (the signal worth preserving); otherwise log at DEBUG.

HIGH: promote spine_parse_int from static to extern with declaration
in util.h. Future config-parser TUs reuse the safety contract
instead of duplicating it.

LOW: hoist 1000000000ULL literal in spine_uv_run_bounded to
SPINE_NS_PER_SEC in spine_shutdown.h alongside the drain-secs
macro. Matches naming of surrounding timeouts; readers do not have
to count zeros.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants