Skip to content

Harden RTSP/RTP stack against malformed input; add UDP, IPv6, and keepalive#7

Merged
steelbrain merged 53 commits into
mainfrom
steelbrain/audit-hardening
May 30, 2026
Merged

Harden RTSP/RTP stack against malformed input; add UDP, IPv6, and keepalive#7
steelbrain merged 53 commits into
mainfrom
steelbrain/audit-hardening

Conversation

@steelbrain

Copy link
Copy Markdown
Owner

Summary

A security/robustness audit-and-hardening pass over the RTSP/RTP/codec stack,
plus three new transport-level capabilities. The headline goal: a malformed or
hostile camera (or anything on the path) should never be able to crash the
process or silently stall a stream — it should degrade to a diagnostic and keep
going. Along the way this lands real UDP transport, IPv6 on both transports, and
automatic session keepalive.

51 commits since 0.2.0 (main @ bc2ccd3). Work is split into small, individually
reviewed commits; see CHANGELOG.md for the user-facing summary.

Hardening / security

  • Stop the codec, SDP, and RTSP parsers from trapping (crashing the process)
    on malformed or hostile input — out-of-range reads, bad lengths, and truncated
    buffers now error or are tolerated instead of aborting.
  • Don't tear down the stream on a single bad packet — drop the packet, keep going.
  • Bound depacketizer memory and reset depacketizer state on failure, so a
    malformed stream can't grow memory without limit.
  • Validate the RTCP sender-report SSRC before anchoring the RTP/NTP timeline,
    so a stray report can't corrupt timestamps.
  • Strengthen RTSP Digest auth: correctly escape quoted-string parameters and fix
    the 401 retry path.
  • Harden H.264 SPS parsing, the H.265 depacketizer + parameter-set parsing, RTP
    packet/context handling, and saturate loss counters to avoid overflow traps.
  • Make ntpUnixEpoch private (was an accidentally-public top-level symbol).

Postel's-law tolerance for real cameras

  • Tolerate sloppy SDP and lenient H.265 fmtp; make AAC depacketization tolerant
    of real-camera deviations.
  • Ignore the RTP MARK bit on H.264 SEI packets (some cameras set it on a trailing
    SEI, splitting the access unit early).
  • Tolerate a complete (un-fragmented) NAL that arrives inside an FU wrapper.
  • Accept a scheme-less Content-Base header by resolving it against the request URL.
  • Fix the static L16 payload clock rate to 44100 Hz; normalize audio frame data
    to a standalone Data.

New capabilities

  • Real UDP transport for RTP/RTCP. Transport.udp now streams end-to-end over
    an Apple Network-framework socket pair (even RTP port + consecutive odd RTCP
    port, negotiated via client_port), replacing the non-functional scaffolding.
    Best-effort NAT hole-punch after PLAY.
  • IPv6 on both transports, including bracketed IPv6 literals in the URL host
    (rtsp://[2001:db8::1]:554/stream).
  • Automatic RTSP keepalive while streaming: GET_PARAMETER when advertised,
    otherwise OPTIONS, at ~half the negotiated session timeout. Each round-trip
    reported via onDiagnostic at .info.
  • Add a connect timeout to the RTSP TCP transport so an unreachable/refused
    camera fails fast instead of parking in .waiting for the full timeout.

Diagnostics

  • Surface transport-level failures that previously stalled or died silently —
    e.g. a terminal UDP receive error (ICMP port-unreachable from a NAT'd camera).
  • Broaden onDiagnostic coverage of the drop/recover paths (unnegotiated
    channels, FU-reassembly anomalies, audio/metadata SETUP or init failures, sprop
    parse fallback, pre-PLAY interleaved media, partially-parsable SDP). Each
    condition is rate-limited to fire once so a misbehaving camera can't flood the
    consumer.

Testing

  • New live integration test suite: drives RTSPClientSession end-to-end
    against an ffmpeg-published stream relayed by mediamtx, exercising
    H.264/H.265/AAC over both RTSP-interleaved TCP and UDP, including IPv6.
  • CI installs ffmpeg and mediamtx to run them.
  • Regression tests added alongside each fix.

steelbrain added 30 commits May 30, 2026 05:56
RFC 3551 payload types 10 and 11 (L16) run at 44.1 kHz. They were
defined as 441000 Hz, producing presentation timestamps 10x too fast
for cameras that rely on the static rtpmap.
A camera or proxy could crash the client with crafted responses:
- SETUP interleaved/server_port "first-second" pairs trapped on
  UInt8/UInt16 overflow when first was 255/65535; compare as Int.
- A huge Content-Length (e.g. Int.max) overflowed the body-length
  addition and trapped; negatives were silently trusted. Reject
  negatives and bound the length against bytes remaining.
- Tolerate repeated spaces in the status line (RTSP/1.0  200 OK).
- serializeInterleaved now documents its 65535-byte contract instead
  of trapping opaquely on the 16-bit length field.
- Validate URL port with UInt16(exactly:) instead of trapping on
  out-of-range ports (e.g. rtsp://host:99999/).
- Apply the 2xx success check to the post-401 auth retry; previously a
  4xx/5xx on the retry was returned as success, confusing downstream
  parsing instead of surfacing sessionSetupFailed.
- Make audio timeline initialization best-effort: a malformed audio
  clock rate now disables audio with a diagnostic instead of aborting
  the whole session (mirrors the metadata path).
UDP was never wired up: no socket pair, no client_port negotiation,
and the receive loop only reads TCP-interleaved data, so a .udp session
would PLAY and hang forever. start() now fails fast with a clear error,
the .udp case is deprecated to warn at compile time, and the unused
UDPTransport.swift stub is removed. Also drop the write-only readPos
field in RTSPTransportConnection.
- Compute the RTP payload range from absolute indices so payload access
  is correct for Data slices with a nonzero startIndex (latent today,
  since callers pass startIndex==0, but fragile).
- Treat sequence delta 0x8000 as out-of-order (top-bit-set half) instead
  of reporting a spurious 32768-packet loss at the boundary.
- Avoid UInt8/UInt16 +1 traps in StreamContext description strings.
A flaky or hostile stream can accumulate more than 65535 lost packets
before a frame is emitted, trapping the UInt16 loss arithmetic. Clamp
loss accumulation in the H.264, H.265, AAC, and metadata depacketizers
so the counter pins at the max instead of crashing.
- Scaling lists 0-5 are 4x4 (16 coeffs) and 6+ are 8x8 (64); the size
  was keyed off the list count (always 64), over-consuming bits and
  corrupting dimensions/VUI on High-profile cameras that send scaling
  matrices. Key it off the list index instead.
- Run scaling-list scale math in Int so a large signed exp-Golomb delta
  can't overflow Int32 and trap.
- Compute pixel dimensions in Int and validate they fit UInt16 (>0) so
  malformed crop/size values throw cleanly instead of trapping.
- Guard num_units_in_tick*2 against UInt32 overflow.
- Bound the num_ref_frames loop so a bogus count can't spin.
- Reject an Aggregation Packet (type 48) that arrives mid-fragment; it
  would otherwise finalize an access unit whose open FU NALEntry still
  carries the Int.max piece-index sentinel and trap on the slice.
- Bound the attacker-controlled HRD cpb_cnt (<=31), num_long_term_ref_pics
  (<=32), and PPS tile column/row loops so malformed parameter sets can't
  spin or overflow n+1 on UInt32.
- Advance the nonce-count: authorize() is now mutating and persists nc,
  so each qop=auth request sends a unique, increasing nc instead of
  always 00000001 (strict servers reject the replay).
- Honor the algorithm: support MD5, MD5-sess, SHA-256, SHA-256-sess and
  echo algorithm= in the header (was hard-wired to MD5).
- Match qop by token, so a server offering only auth-int no longer
  triggers a bogus qop=auth response.
- Reject a challenge with an empty nonce instead of emitting a useless
  Authorization header.
Adds regression tests for nc increment, qop matching, SHA-256, and the
empty-nonce case.
PCM/G.726/L16/G.723 depacketizers handed the public AudioFrame a Data
slice of the receive buffer with a nonzero startIndex; consumers doing
index-based access would read wrong bytes. Copy into a standalone Data
(startIndex 0), matching the AAC and video paths.
NWConnection has no built-in connect deadline, so a host that accepts
the socket but never reaches .ready would hang connect() forever. Race
the state handler against a 15s timeout on the same serial queue, with a
one-shot guard for exact-once resume, and tear down the half-open socket
on timeout. Also stop force-unwrapping the port.
The video, audio, and application SETUP blocks were near-identical; fold
them into one setupStreamTCP() helper (-31 lines). The helper also adopts
the server's interleaved channel when it renumbers to a different free
even channel, instead of always assuming our requested channel — a camera
that renumbers would otherwise deliver media on a channel we never mapped,
and all of its frames would be silently dropped. Drops the now-dead UDP
header branches and the manual Session header (sendRequest adds it).
Port of retina ff771fe. Some cameras (e.g. "H264DVR 1.0" firmware) set
the MARK bit on a trailing SEI, splitting the access unit early. Exclude
SEI (NAL type 6) from canEndAU alongside SPS/PPS, and update the FU-A
test vectors that relied on an SEI ending an AU to use non-IDR slices.

Refs scottlamb/moonfire-nvr#352
Port of retina 8ff7a0f + e1c0bbf. A Fragmentation Unit (FU-A for H.264,
FU for H.265) with both the START and END bits set is forbidden by
RFC 6184 section 5.8 / RFC 7798 section 4.4.3, but some cameras wrap
small NALs this way. Treat such a single-fragment FU as one complete
NAL (hoisting the end-of-NAL finalization out of the start/continuation
branches) instead of erroring out the stream.
Port of retina 6972ac4. Some cameras (e.g. Anjvision) send a Content-Base
or Content-Location like "192.168.1.10:554/stream0/" with no rtsp://
scheme. Resolve those against the request URL's scheme instead of using
them verbatim (which broke control-URL resolution). Absolute Content-Base
headers are unaffected.

Refs scottlamb/moonfire-nvr#356
Mark ff771fe, 8ff7a0f (+e1c0bbf), and 6972ac4 as ported (commits
42010e4, 66ab85e, b6ff670). Document 6339bd6 and the API-exposure
commits as evaluated and intentionally skipped (feature / already
covered by IPCamKit's public surface).
The live integration suite publishes a synthetic stream with ffmpeg to a
mediamtx RTSP server and pulls it back through RTSPClientSession, so both
tools must be on PATH in CI.
End-to-end tests that drive the real RTSPClientSession: ffmpeg publishes
a synthetic H.264/H.265/AAC stream to a mediamtx RTSP server, and the
client pulls it back over RTSP-interleaved TCP, asserting keyframes,
SPS/PPS(/VPS), and 48 kHz audio.

The harness manages mediamtx + ffmpeg via Foundation.Process with bounded
waits, a per-test SIGKILL watchdog, and pkill teardown so a stalled
exchange can never hang the suite. Requires ffmpeg and mediamtx on PATH.
Long pull sessions were dropped by cameras after their session timeout
because only DESCRIBE/SETUP/PLAY were ever sent. The streamFrames reader
now demultiplexes interleaved RTP from RTSP responses, routing each
response to its CSeq waiter, so control requests can interleave with RTP
on the one TCP connection.

On that foundation:
- OPTIONS after connect discovers GET_PARAMETER support (else OPTIONS is
  the keepalive method);
- a periodic keepalive runs at ~half the SETUP-advertised session timeout
  while streaming;
- stop() now routes its TEARDOWN through the reader instead of issuing a
  competing read, fixing a ~60s block when called during active streaming
  (the reader used to consume the response). A live test exercises stop()
  mid-stream to guard the fix.
Adds a live test that proves keepalive sustains a session past a server
read timeout: ffmpeg publishes to a mediamtx configured with readTimeout=3s
(which drops idle readers), and with a 1s keepalive the client keeps
receiving frames well past 6s. Uses an internal RTSPClientSession keepalive
interval override so the cadence is short and deterministic.

Also moves the fixture's blocking process/socket work (port reservation,
launch, port-wait) onto a dedicated thread. Done on a Swift Concurrency
cooperative thread, that blocking work could wedge the shared pool under the
test runner's default parallelism and deadlock the whole suite; confining it
to a private queue keeps the cooperative pool free.
Reserves an even-RTP / odd-RTCP local UDP socket pair (client_port), connects
to the server's RTP/RTCP ports, hole-punches, and delivers datagrams via a
DispatchSource drain loop on a private queue. IPv4 only; bounded bind retries
and recv draining; idempotent close.
Wire UDP end to end: SETUP advertises client_port from a bound even/odd socket
pair, connects to the server's server_port, and hole-punches after PLAY. The
receive loop is generalized — shared routeRTP/routeRTCP feed both the TCP
interleaved reader and a new UDP reader that multiplexes per-stream RTP/RTCP
sockets with TCP control. Keepalive/TEARDOWN responses route over TCP (handled
directly, off the lossy data buffer) for UDP sessions too. Peer host selection
prefers a numeric IPv4 source. Drops the .udp rejection and deprecation.
Exercise the full UDP pipeline against mediamtx: client_port negotiation, the
bound RTP/RTCP socket pair, multiplexed reception, and graceful stop() while
RTP flows on UDP and control on TCP.
mediamtx ran dual [tcp, udp], so a UDP test could not prove the client used
UDP (the server would serve either). Each test now runs mediamtx in one RTP
mode: TCP tests use rtspTransports [tcp], UDP tests use [udp]. A single-mode
server 461-rejects a SETUP for the other transport, so a passing test proves
the matching client path. rtspTransports also gates the publisher, so the
ffmpeg publish leg now follows suit (UDP publish for a UDP-only server). The
fixture transports argument is now required (no dual default).
Replace the Darwin/BSD-socket UDPPair with an NWListener/NWConnection
implementation, so the UDP RTP/RTCP path no longer imports Darwin and
handles IPv4 and IPv6 peers natively (notably [::1] loopback).

The even/odd local port pair is reserved by NWListeners held across the
SETUP round-trip, then handed to NWConnections bound to the same local
ports and connected to the server's server_port — a connected UDP flow,
so the kernel keeps the symmetric-RTP peer filtering and holePunch()/
sends can omit the address. bind()/connect() are now async to avoid
blocking a cooperative thread on the Network framework callbacks.

RTSPConnection gains resolvedRemote() (host + family) so the session
picks the local socket family from the control peer and selects an
RTP/RTCP peer of the matching family.
steelbrain added 23 commits May 30, 2026 11:55
Foundation's URL parser surfaces a bracketed IPv6 host verbatim
(rtsp://[::1]:554/… -> [::1]), but NWEndpoint.Host expects the bare
address and rejects the bracketed form, so the control connection (and
the UDP peer fallback / family detection that reuse the host) failed on
any IPv6 URL. Normalize via a testable unbracketedHost() that strips the
surrounding brackets while preserving a link-local zone id.
Rewrite the live fixture's loopback port-reservation and TCP-probe
helpers from raw BSD sockets to Network framework (NWListener/
NWConnection), making them address-family-aware so one fixture drives
the client over both 127.0.0.1 and [::1]. The test file no longer
imports Darwin.

Add H.264-over-TCP and H.264-over-UDP tests on IPv6 (mediamtx bound on
[::], pulled from rtsp://[::1]:PORT), completing the TCP/UDP x IPv4/IPv6
matrix; each asserts a keyframe plus SPS/PPS.
Audit API.md against the public source surface:

- Correct keepalive diagnostic severity (success is .info, failure .warning).
- Add the referenced-but-undefined TcpStreamContext / UdpStreamContext.
- Document IPv6: both transports work over IPv4/IPv6, and the url host
  accepts a bracketed IPv6 literal (incl. link-local zone ids).
- Update the IPCamKit.swift header comment to note .udp / IPv4/IPv6.
It was accidentally exposed as a public top-level constant but is only used internally within Timestamp.swift.
Cap the unparsed read buffer (4 MiB) and Content-Length (2 MiB) so a
camera that never frames a message, or declares an absurd body, can no
longer exhaust memory. Trim header field names (not just values) so
"CSeq : 1" still routes; tolerate bare-LF line endings and tab/junk in
the status line (Postel). Scan for the header terminator in place via
withUnsafeBytes instead of copying the whole buffer each call, which was
O(n^2) as a body streams in.

Addresses audit findings #2/#3/#15/#37/#39/#42/#56. Adds regression
tests including a mixed CRLF-headers/bare-LF-terminator case.
Match SDP attribute names case-insensitively (cameras emit a=Control:/
a=RTPMAP:), normalized once at parse time with case-insensitive lookup
helpers, so an off-case control attribute no longer SETUPs the wrong
track and an off-case rtpmap no longer drops the whole stream. Retain
session-level b= (was silently dropped). Tolerate tab/repeated-space
separators in the m= line and fmt list.

Addresses audit findings #43/#44/#45/#48. Adds regression tests.
Escape backslash/quote in Digest quoted-string header fields and
unescape them when parsing the challenge (RFC 7616), so credentials or a
server-echoed realm containing a quote no longer corrupt the
Authorization header. Reject control characters in challenge values to
prevent control-byte injection into the request. Send Basic pre-emptively
after the first 401 instead of paying a round-trip on every request.
Preserve Session and User-Agent on the 401 retry so mid-session re-auth
on SETUP/PLAY works, and iterate all WWW-Authenticate challenges so
Digest is chosen over Basic regardless of ordering.

Addresses audit findings #10/#13/#38/#40/#41. Adds regression tests.
A foreign or early RTCP Sender Report under the default dropPackets
policy was anchoring the timeline origin before its SSRC was validated,
corrupting NPT for the whole session. Decide the SSRC policy first and
only place() an SR we keep; warn once per stream on the first dropped
unknown-SSRC SR (wiring up the previously-dead latch). Clamp
NtpTimestamp.date to 1970 for pre-epoch/unsynced values instead of
wrapping to a far-future date. Drop the bogus streamIndex<255 channel
guard and remove dead RTCP code (asTyped/TypedPacketRef/
ReceiverReportRef/packets()).

Addresses audit findings #9/#33/#34/#35/#51. Adds regression tests.
A VPS/SPS/PPS NAL over 65535 bytes trapped building the HEVC record;
reject it up front, mirroring the H264 guard. SimpleAudioDepacketizer
frameLength now returns nil on an oversized payload instead of a
precondition trap. Compute the H.265 conformance-window crop in 64-bit
so malformed offsets can't wrap a 32-bit add to a wrong/zero size, and
require the crop to leave a positive picture. Tighten bit_depth_*_minus8
to the spec range [0,6] (7/8 also collided with the reserved
HEVC-record bit).

Addresses audit findings #4/#19/#31/#32. Adds regression tests for the
NAL-size and oversized-payload guards.
Match fmtp keys case-insensitively (cameras emit Sprop-VPS, tx-mode=srst)
and skip valueless/empty tokens instead of rejecting the whole fmtp, so a
stray flag or trailing ';' no longer drops the parameter sets. Parse
sprop-vps/sps/pps as the RFC 7798 comma-separated list and strip a
leading Annex B start code some cameras prepend. Fail loud on
sprop-max-don-diff>0 (DONL/DOND decoding is unimplemented) rather than
silently emitting corrupt frames.

Addresses audit findings #8/#14/#26/#27/#28/#29. Adds regression tests.
Trust the AudioSpecificConfig over the rtpmap channel count (RFC 3640
makes the ASC authoritative; the rtpmap field is informational and often
wrong/omitted), and accept an RTP clock that is a small integer multiple
of the ASC sampling frequency (HE-AAC/SBR) — both previously dropped
audio silently. Enforce the 16-bit AU-headers-length invariant the code
already documents instead of truncating an off-spec length, and compare
the full 32-bit fragment timestamp so a low-16-bit collision can't
mis-stitch. Remove the dead SBR/PS branch and the unreachable
channelConfiguration==0 guard.

Addresses audit findings #5/#6/#17/#18/#20/#54. Adds regression tests.
Cap a single accumulated H.264/H.265 access unit at 16 MiB so a hostile
camera can't stream FU continuations (or same-timestamp NALs) forever and
exhaust memory: every NAL-body append now flows through one checked
helper and every buffer clear through one reset helper, so the byte
accumulator can't drift. Reset the depacketizer to a clean state on a
failed push so a later push can't build on a half-mutated buffer
(prerequisite for making per-packet errors recoverable). Drop the dead
validateOrder function and the duplicate SPS RBSP decode in
parseSPSAndPPS (reuse parseSPS's already-decoded values).

Addresses audit findings #7/#22/#24/#25/#30. Adds cap regression tests.
A malformed, SSRC-mismatched, or timeline-rejected RTP/RTCP packet now
drops with a rate-limited (once-per-stream) diagnostic instead of
finishing the whole AsyncThrowingStream — a single bad packet from a
sloppy or hostile camera on a lossy link no longer kills video, audio,
and metadata. InorderParser is the single decision point: it drops these
cases and throws only under the opt-in .abortSession SSRC policy;
routeRTP catches depacketizer push/pull failures and drops the
packet/frame rather than rethrowing. Per-packet diagnostics are latched
so a misbehaving camera can't flood the consumer. Narrow the RTP-Info
seq seed to ignore only the seq=0;rtptime=0 placeholder, restoring
initial-loss detection for cameras that start at seq 1.

Addresses audit findings #0/#1/#11/#50. Adds InorderParser regression
tests.
Reject a second concurrent frames() consumer (it would race the same
connection and corrupt the RTP pipeline) with a clear invalidState
error. Cancel each keepalive/control timeout task as soon as its
response resolves, so a fast response no longer leaves a sleeping task
lingering for the full timeout. Tolerate Session timeout=0 ("no expiry")
and a single-value interleaved= transport param instead of aborting
SETUP.

Addresses audit findings #12/#46/#47/#49. Adds SETUP regression tests.
Fail fast on a refused/unreachable camera: NWConnection parks such
connections in .waiting, which connect() ignored and so burned the full
15s timeout — give it a short grace window then surface the real cause.
Surface a terminal UDP receive error (e.g. an ICMP port-unreachable for
a NAT'd camera) via a diagnostic rather than letting the receive loop
die silently. Count and report interleaved media a camera streams on the
control channel before its PLAY response completes (those frames are
dropped).

Addresses audit findings #59/#60/#61.
Fix the RTSPClientSession usage example (frames(), not the non-existent
videoFrames()) and note it is single-consume. Document that the loss
field is a forward RTP-sequence delta (may be large/spurious), the
push/pull drain contract on the H.264/H.265 depacketizers, the
ApplicationDepacketizer single-packet-doc-after-loss edge, and two
deliberately-omitted transport behaviours (no per-read idle timeout; UDP
connect cancellation observed only at the connect timeout).

Addresses audit findings #21/#23/#36/#52.
InternalError mirrored upstream retina's rich ErrorInt context but had
zero references anywhere; the live error path uses RTSPError and the
context-carrying DepacketizeError. The context types it referenced
(ConnectionContext/StreamContext/PacketContext/RtspMessageContext) stay,
still used via PacketContext and DepacketizeError. Net -95 lines.

Addresses audit finding #55 (dead-taxonomy half; the public RTSPError
catch-all is left as-is now that per-packet errors surface as
diagnostics).
Surface conditions that previously dropped data silently: interleaved
data on an unnegotiated channel, the FU-reassembly anomalies the video
depacketizer already tracked (inconsistent FU NAL header, single-fragment
FU) but never reported despite the comments saying callers should, and
SDP streams that fail to parse while others succeed. All rate-limited to
once per condition so a misbehaving camera can't flood the consumer.

Completes diagnostic coverage of the drop/recover paths. Adds a
parseDescribe regression test.
Add a live integration test that drives a real ffmpeg->mediamtx session,
confirms a frames() consumer is delivering, then asserts a second
concurrent frames() is rejected with .invalidState and yields nothing —
covering the single-consumer guard (#12) end to end.

Also surface a diagnostic when the SDP advertised sprop parameter sets
but they failed to parse and the depacketizer fell back to in-band
parameters (previously silent).
@steelbrain steelbrain merged commit fd3577a into main May 30, 2026
2 checks passed
@steelbrain steelbrain deleted the steelbrain/audit-hardening branch June 1, 2026 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant