Skip to content

fix: memory and concurrency cleanup#540

Closed
somethingwithproof wants to merge 210 commits into
Cacti:developfrom
somethingwithproof:fix/memory-and-concurrency
Closed

fix: memory and concurrency cleanup#540
somethingwithproof wants to merge 210 commits into
Cacti:developfrom
somethingwithproof:fix/memory-and-concurrency

Conversation

@somethingwithproof
Copy link
Copy Markdown
Contributor

Memory and concurrency cleanup from the security review.

Scope

  • log: initialize flogmessage before strftime to prevent reading uninitialized stack on strftime failure.
  • util: guard die() strncat against size underflow; replace strcat in auth/priv list builder with bounded snprintf-append; guard snprintf accumulators against truncation in append_hostrange and related sites.
  • breaker: stack-copy state before unlock so logging survives a future concurrent delete; clamp CircuitBreakerThreshold to [1, 1000000].
  • nft_popen: replace assert(prev != NULL) with a runtime null check for concurrent unlink (safer under NDEBUG).
  • db: null-guard mysql_fetch_row dereferences at six hot-loop call sites.
  • php: break readpipe loop when the result buffer is full (prevents 0-size read spin).
  • log: redact SNMP community strings and auth args (-c, -u, -a, -x, -p, --password, --secret) from command-error logs.
  • signals: set sa.sa_flags explicitly per sigaction block.
  • poller: null-guard details[device_counter] in the worker thread.

Verification

21/21 tests pass. New tests: test_snprintf_guard, test_redact_args; extended test_circuit_breaker with clamp case.

- Add CMakeLists.txt mirroring configure.ac feature checks
- Add config/config.h.cmake.in template for cmake builds
- Add build-cmake-linux CI job (gcc/clang matrix)
- Add build-windows CI job (MSYS2/MinGW-w64, continue-on-error)
- Windows crash dump collection via WER LocalDumps
- Gate -Wall by compiler ID (GNU/Clang vs MSVC)
- Use CMAKE_DL_LIBS instead of hardcoded -ldl
- Parse net-snmp-config --libs into proper NETSNMP_LIBRARIES
- Add SNMP_LOCALNAME compile check for feature parity
- No C source files modified

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
spine.h depends on types from common.h (MYSQL, pid_t, size_t,
pthread types, RESULTS_BUFFER from config.h). Add a compile-time
guard that produces a clear error if spine.h is included without
common.h, rather than cascading type errors.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
strncasecmp() returns 0 on match, but the comparisons used the
raw return value as truthy, inverting the logic. TCP matched
non-TCP strings and vice versa.

Also fix: comparisons against 'hostname' instead of 'token'
(lines 1020, 1032), and strncasecmp length 3 for 4-char strings
"TCP6"/"UDP6" (should be 4).

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
getarg(opt, &argv) advances the argv pointer on each call. Three
successive calls in the --mode handler consumed three argv entries
instead of one, corrupting subsequent argument parsing when using
the space-separated form (--mode online).

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
SECURITY.md promises spine refuses to start when spine.conf has any
group or world bit set or is owned by someone other than root or the
startup euid. The old soft-warn path leaked credentials on a
misconfigured deploy. Open the file with O_NOFOLLOW so a planted
symlink at /etc/spine.conf cannot redirect credential loading, add
O_CLOEXEC so the fd does not leak into poll-script children, and
treat ELOOP, mode drift, and owner mismatch as fatal.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The previous handler called stdio, time(3), localtime(3), strftime(3),
and exit(3). None are async-signal-safe, so a SIGSEGV mid-allocation
could deadlock inside libc. Pre-format per-signal messages at startup,
emit with write(2), and use _exit(128 + signo) on SIGSEGV/SIGBUS/SIGFPE/
SIGABRT to skip atexit handlers. Drop SIGPIPE from the fatal set and
ignore it process-wide so a poll script closing stdout early no longer
kills spine. Remove the redundant legacy signal() loop.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
…e into children

Every nft_popen'd poll script used to inherit the full descriptor table,
including privileged ICMP sockets, the authenticated MySQL connection,
the audit netlink fd, and the log file. A poll script that shells out
could reuse any of them. Set cloexec at creation time: SOCK_CLOEXEC
atomically on Linux/BSD with fcntl fallback for macOS, fcntl after each
raw ICMP socket(), the MySQL net.fd after connect, the audit fd after
audit_open(), and O_CLOEXEC on the log open().

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Two bugs: strncpy(name->hostname, hostname, sizeof(name->hostname))
leaves the buffer unterminated when the source equals or exceeds the
bound, and strncpy(...) followed by name->hostname[strlen(token)] = 0
writes one-past-the-buffer for tokens >= sizeof. Funnel both branches
through strncopy(), which caps at the bound and always NUL-terminates.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
SNMPv3 engine IDs are opaque octet strings (RFC 3411). strlen() on the
engine-ID buffer truncates at the first 0x00, which any RFC-compliant
engine ID is free to contain. Decode the Cacti hex-string form at DB
load time, carry bytes + length in target_t and host_t, and pass both
to snmp_host_init(). The hex path stays as a transitional fallback.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Copilot AI review requested due to automatic review settings April 17, 2026 19:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses a broad set of security-review findings focused on memory safety, concurrency robustness, and safer cross-platform runtime behavior.

Changes:

  • Adds new platform abstraction layers (process, fd, socket, error, ICMP) and integrates them into hot paths (PHP, SNMP, DB retry sleeps).
  • Introduces new security/ops integrations (systemd sd_notify helpers, audit logging with rate limiting, sandbox stubs/implementations, SELinux/AppArmor assets).
  • Updates build/release/tooling ecosystem (Docker, Debian/RPM packaging, local test scripts, CI workflows, Nix flake, docs) to align with the CMake/Ninja build path.

Reviewed changes

Copilot reviewed 123 out of 185 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
src/spine_audit.c / src/spine_audit.h Adds libaudit wrapper + rate limiting for auditable security events
src/sql.c Adds DB write short-circuit for dry-run; improves retry sleep portability; tightens SSL verification; sets CLOEXEC on MySQL socket
src/platform/* Introduces portable wrappers for sleep/env, fds, sockets, spawn, ICMP, and error formatting
src/snmp_engine_id.c / src/snmp.c / src/snmp.h Adds standalone engine-id decoder and extends SNMP session init to support binary engine IDs
src/systemd_notify.[ch] / etc/systemd/* Adds sd_notify integration and systemd unit/timer templates
scripts/* / packaging/* / Dockerfile* / .github/workflows/* Moves build/test/release automation towards CMake/Ninja and expands security posture automation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/spine_audit.c
* -1 means "not yet attempted"; -2 means "attempted and failed, stop
* retrying" so a non-audit-enabled kernel doesn't burn syscalls forever. */
#ifdef HAVE_LIBAUDIT
static int g_audit_fd = -1;
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_audit_fd and the rate-limit buckets (g_buckets) are mutated without any synchronization. If spine_audit_event() can be reached concurrently from multiple threads (e.g., worker threads tripping the circuit breaker), this introduces data races (undefined behavior) and can also leak/duplicate audit fds due to concurrent audit_open() calls. Mandatory fix: guard rate_allow() + the lazy-open block with a mutex (or use atomics + double-checked locking for g_audit_fd), and make the per-bucket counters thread-safe (mutex per bucket or a single global lock).

Copilot uses AI. Check for mistakes.
Comment thread src/spine_audit.c
int notified;
};

static struct rate_bucket g_buckets[AUDIT_BUCKET__COUNT];
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_audit_fd and the rate-limit buckets (g_buckets) are mutated without any synchronization. If spine_audit_event() can be reached concurrently from multiple threads (e.g., worker threads tripping the circuit breaker), this introduces data races (undefined behavior) and can also leak/duplicate audit fds due to concurrent audit_open() calls. Mandatory fix: guard rate_allow() + the lazy-open block with a mutex (or use atomics + double-checked locking for g_audit_fd), and make the per-bucket counters thread-safe (mutex per bucket or a single global lock).

Copilot uses AI. Check for mistakes.
Comment thread src/spine_audit.c
Comment on lines +66 to +99
static int rate_allow(enum audit_bucket b, const char *op) {
time_t now = g_clock();
struct rate_bucket *rb = &g_buckets[b];

if (rb->window_start == 0 || now - rb->window_start >= SPINE_AUDIT_RATE_WINDOW_SECS) {
rb->window_start = now;
rb->count = 0;
rb->notified = 0;
}

if (rb->count < SPINE_AUDIT_RATE_MAX) {
rb->count++;
return 1;
}

if (!rb->notified) {
fprintf(stderr, "NOTE: audit rate limit reached for op=%s\n",
op ? op : "(null)");
rb->notified = 1;
}
return 0;
}

void spine_audit_event(const char *op, const char *detail, int result) {
if (!rate_allow(classify(op), op)) return;

#ifdef HAVE_LIBAUDIT
if (g_audit_fd == -2) return;
if (g_audit_fd == -1) {
g_audit_fd = audit_open();
if (g_audit_fd < 0) {
g_audit_fd = -2;
return;
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_audit_fd and the rate-limit buckets (g_buckets) are mutated without any synchronization. If spine_audit_event() can be reached concurrently from multiple threads (e.g., worker threads tripping the circuit breaker), this introduces data races (undefined behavior) and can also leak/duplicate audit fds due to concurrent audit_open() calls. Mandatory fix: guard rate_allow() + the lazy-open block with a mutex (or use atomics + double-checked locking for g_audit_fd), and make the per-bucket counters thread-safe (mutex per bucket or a single global lock).

Copilot uses AI. Check for mistakes.
Comment thread src/sql.c
Comment on lines +46 to +54
static void spine_mysql_set_cloexec(MYSQL *mysql) {
#ifndef _WIN32
if (mysql == NULL) return;
int fd = (int) mysql->net.fd;
if (fd < 0) return;
int fl = fcntl(fd, F_GETFD);
if (fl >= 0) {
(void) fcntl(fd, F_SETFD, fl | FD_CLOEXEC);
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This accesses mysql->net.fd, which is an internal struct member and not part of the public MySQL C API. Even if it works on current MySQL/MariaDB client libs, it’s brittle across connector versions and can break compilation (or silently change semantics) when the library updates its internal layout. Prefer the supported API when available (e.g., mysql_get_socket() behind a feature test) and fall back to best-effort behavior if the API isn’t present.

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +35
static void spine_nanosleep_us(unsigned long microseconds) {
struct timespec req;
req.tv_sec = (time_t)(microseconds / 1000000UL);
req.tv_nsec = (long)((microseconds % 1000000UL) * 1000UL);
while (nanosleep(&req, &req) == -1 && (req.tv_sec > 0 || req.tv_nsec > 0)) {
/* Resume after EINTR with remaining time. */
}
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These retry loops don’t check errno == EINTR. If nanosleep() fails for any other reason (e.g., EINVAL due to an unexpected platform quirk), the loop can spin indefinitely because the “remaining time” condition is not a reliable proxy for “interrupted by signal.” Mandatory fix: retry only when nanosleep() fails with errno == EINTR; otherwise break/return.

Copilot uses AI. Check for mistakes.
Comment thread src/snmp_engine_id.c
hex += 2;
hex_len -= 2;
}
if (hex_len == 0 || (hex_len % 2) != 0 || (int)(hex_len / 2) > bin_cap) {
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(int)(hex_len / 2) > bin_cap can overflow/truncate when hex_len is large (size_t → int), potentially turning an oversized input into a “passes” case. Fix by keeping the comparison in size_t space (e.g., compare (hex_len / 2) to (size_t)bin_cap) after the existing bin_cap <= 0 guard.

Suggested change
if (hex_len == 0 || (hex_len % 2) != 0 || (int)(hex_len / 2) > bin_cap) {
if (hex_len == 0 || (hex_len % 2) != 0 || (hex_len / 2) > (size_t)bin_cap) {

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,308 @@
name: Static Analysis
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title/description focuses on “memory and concurrency cleanup from the security review”, but the diff also introduces substantial build/packaging/CI/workflow infrastructure (multiple new workflows, Nix flake, Docker and packaging conversion to CMake/Ninja, SELinux/AppArmor assets, systemd units/timer, devcontainer, etc.). This scope expansion makes the PR harder to review and reason about operationally. Mandatory: either update the PR description to explicitly list these additional deliverables, or split the infrastructure/build-system work into a separate PR from the runtime safety fixes.

Copilot uses AI. Check for mistakes.
Comment thread package

# Sanity checks
[ ! -e configure.ac ] && echo "ERROR: Your current working directory must be the SVN check out of Spine" && exit -1
[ ! -e CMakeLists.txt ] && echo "ERROR: Your current working directory must be a Spine source checkout" && exit -1
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exit -1 is not portable in shell scripts (exit codes are 0–255; -1 becomes 255). Use exit 1 (or another explicit positive code) for predictable behavior across shells and CI environments.

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +106
#elif defined(__sun) || defined(__sun__)
(void) pthread_setname_np(pthread_self(), name);
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition redundantly checks defined(__sun) twice. This should be deduplicated (e.g., defined(__sun) || defined(__sun__) if you intended both macros, or just one if only one is expected in practice) to reduce noise and avoid confusion during future portability work.

Copilot uses AI. Check for mistakes.
Comment thread src/sql.c
Comment on lines +338 to +344
#if defined(MARIADB_BASE_VERSION) || defined(MARIADB_VERSION_ID)
# define SPINE_SSL_VERIFY_T my_bool
#elif defined(MYSQL_VERSION_ID) && MYSQL_VERSION_ID >= 80000
# define SPINE_SSL_VERIFY_T bool
#else
# define SPINE_SSL_VERIFY_T my_bool
#endif
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPINE_SSL_VERIFY_T is defined inside a .c file but not #undef’d afterward. Even though it’s in a single translation unit, leaving helper macros defined beyond their usage increases the chance of accidental reuse/collision in later edits. Consider #undef SPINE_SSL_VERIFY_T after the last use (optional but helps keep preprocessor scope tight).

Copilot uses AI. Check for mistakes.
size_t body_len = BUFSIZE + 64;
char *body = malloc(body_len + 2);
ASSERT_TRUE(body != NULL);
memcpy(body, "DB_Host ", 8);
Comment thread tests/unit/test_config_parser.c Fixed
Comment thread tests/unit/test_config_parser.c Fixed
Comment thread tests/unit/test_config_parser.c Fixed
Comment thread src/spine.c
remaining = MEGA_BUFSIZE - (qp - querybuf);
qp += snprintf(qp, remaining, " WHERE disabled = ''");
size_t remaining = (size_t)(MEGA_BUFSIZE - (qp - querybuf));
int n;
Comment thread src/util.c
sqlp += n; remaining -= (size_t)n;

sqlp += append_hostrange(sqlp, "host_id");
remaining = BUFSIZE - (size_t)(sqlp - sqlbuf);
Comment thread src/util.c
@@ -711,12 +765,30 @@
set.end_host_id,
num_rows));
} else {
size_t remaining = BUFSIZE;
int n;
Comment thread src/util.c
if (buff[i] != '\0') {
spine_config_warn("WARNING: %s:%d embedded NUL byte rejected\n",
file, lineno);
return -1;
Comment thread src/util.c

static int long_flag_is_cred(const char *flag, size_t flag_len) {
int i;
size_t n;

static int long_flag_is_cred(const char *flag, size_t flag_len) {
int i;
size_t n;
@somethingwithproof somethingwithproof force-pushed the fix/memory-and-concurrency branch from 98ab52a to 34169c8 Compare April 17, 2026 20:37
char *body = malloc(body_len + 2);
ASSERT_TRUE(body != NULL);
memcpy(body, "DB_Host ", 8);
memset(body + 8, 'x', body_len - 8);
ASSERT_TRUE(body != NULL);
memcpy(body, "DB_Host ", 8);
memset(body + 8, 'x', body_len - 8);
body[body_len] = '\n';
memcpy(body, "DB_Host ", 8);
memset(body + 8, 'x', body_len - 8);
body[body_len] = '\n';
body[body_len + 1] = '\0';
H1 (SECURITY.md) refuses startup when spine.conf has any group or world
bit set or is owned by someone other than root or the startup euid.  The
integration fixtures shipped the file at the COPY/checkout default of
0644, which H1 correctly rejects.

Dockerfile: add RUN chmod 0600 after the COPY so the baked conf (owned
by root, mode 0600) satisfies both H1 checks.  This fixes the Docker
Integration Tests, Poll Timing Benchmark, and build-cmake-linux-
sanitizers jobs that use the production image without a bind-mount
override.

docker-compose.yml: the bind-mount of tests/snmpv3/spine/spine.conf
arrives at 0644 from the git checkout and cannot be chmod'd through a
:ro mount.  Switch to a staging mount at spine.conf.src, set user: "0:0"
so the process euid is root, and add a wrapper entrypoint that installs
the staged conf to /tmp/spine.conf at 0600 before exec-ing spine.

ci.yml / distro-matrix.yml: add chmod 0600 tests/snmpv3/spine/spine.conf
before the cmake build in both FreeBSD jobs so ctest passes on FreeBSD 14
(Tier 1).

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
…hecks

read_spine_config() relied on sscanf("%15s %255s"), which silently
truncated keywords > 15 bytes and values > 255 bytes and dropped the
remainder of any value containing whitespace. The new tokenizer reads
whole lines, rejects over-length lines, embedded NULs, and overlong
keywords with a warning, and preserves interior whitespace so passwords
with spaces round-trip. DB_Port, RDB_Port, DB_UseSSL, RDB_UseSSL, and
CircuitBreakerThreshold now range-check via strtoul. Config open uses
O_NOFOLLOW|O_CLOEXEC so a swapped symlink cannot redirect credential
reads, and ELOOP is reported distinctly. Security-relevant warnings
route through spine_config_warn() so systemd captures them unconditionally.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Spine inherits whatever umask the service manager hands it, commonly
0022 on systemd. A 0022 mask leaves log files group-readable, which
matters because spine.log captures query errors that can include hosts,
user names, and SQL fragments. umask(027) is installed immediately after
spine_capture_startup_euid(), before any fd is opened, so the log file,
pid file, and any temp files land as 0640. The prior umask is stashed
and logged at debug level so operators can see what they inherited.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The inline scrub in main() only fired on the happy path. die() bypassed
it, leaving credentials in memory for any post-mortem that catches the
fatal-error exit (core dump, journal rotation, etc.). Extracted the logic
to spine_scrub_secrets() in util.c so both paths go through the same
code. The helper uses explicit_bzero() when available (glibc >= 2.25,
musl, *BSD) and a volatile-pointer memset fallback elsewhere. Signal
handlers remain untouched; they must stay AS-safe and scrubbing from
them is not.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
mkstemp(3) honours the process umask, so on Linux with umask 022 temp
files land at 0644.  open(2) mode applies only when O_CREAT creates a
new file; re-opening an existing mkstemp path with O_CREAT|O_TRUNC,0600
leaves the inode mode untouched.  H1 then rejects every test config
with rc=-1 before the parser is even reached.

Replace write_conf with make_test_conf, which calls fchmod(fd, 0600) on
the mkstemp fd before writing content, matching the pattern used in
test_config_perms.c.  Also add umask(077) and spine_capture_startup_euid()
at the top of main so the owner check passes (the test process creates
the file, so geteuid() matches st_uid).

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
If strftime returns 0 the buffer is left untouched, so the subsequent
strncat in spine_log would read uninitialized stack bytes. Zero the
first byte up front so the buffer is always a valid C string.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
spine_auth and spine_priv are fixed BUFSIZE arrays. The strcat-based
list builder has no bounds check, so a large net-snmp capability set
could walk off the end. Replace with a local APPEND_CSV_TOKEN macro
that tracks used length and truncates instead of overflowing.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The per-device details slot can be cleared by reload or delete under
LOCK_THDET while a worker is still running. The completion block
dereferences details[device_counter] unconditionally, which segfaults
if the slot was freed. Hold LOCK_THDET while testing for NULL and
skip the accounting with a debug log when the slot is gone.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
The trip path dropped spine_cb_lock and then logged host_id and
skip_cycles. A concurrent reload can free the entry between unlock
and log, turning the second read into a use-after-free. Snapshot
both values into stack locals under the lock, unlock, then log from
the locals.

Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
Signed-off-by: Thomas Vincent <thomasvincent@gmail.com>
@somethingwithproof somethingwithproof force-pushed the fix/memory-and-concurrency branch from 34169c8 to 9c595a3 Compare April 18, 2026 00:32
@somethingwithproof
Copy link
Copy Markdown
Contributor Author

Superseded by consolidated PR #535. All 11 memory/concurrency cleanup commits are now on feat/distro-test-matrix via fast-forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants