Harden RTSP/RTP stack against malformed input; add UDP, IPv6, and keepalive#7
Merged
Conversation
RFC 3551 payload types 10 and 11 (L16) run at 44.1 kHz. They were defined as 441000 Hz, producing presentation timestamps 10x too fast for cameras that rely on the static rtpmap.
A camera or proxy could crash the client with crafted responses: - SETUP interleaved/server_port "first-second" pairs trapped on UInt8/UInt16 overflow when first was 255/65535; compare as Int. - A huge Content-Length (e.g. Int.max) overflowed the body-length addition and trapped; negatives were silently trusted. Reject negatives and bound the length against bytes remaining. - Tolerate repeated spaces in the status line (RTSP/1.0 200 OK). - serializeInterleaved now documents its 65535-byte contract instead of trapping opaquely on the 16-bit length field.
- Validate URL port with UInt16(exactly:) instead of trapping on out-of-range ports (e.g. rtsp://host:99999/). - Apply the 2xx success check to the post-401 auth retry; previously a 4xx/5xx on the retry was returned as success, confusing downstream parsing instead of surfacing sessionSetupFailed. - Make audio timeline initialization best-effort: a malformed audio clock rate now disables audio with a diagnostic instead of aborting the whole session (mirrors the metadata path).
UDP was never wired up: no socket pair, no client_port negotiation, and the receive loop only reads TCP-interleaved data, so a .udp session would PLAY and hang forever. start() now fails fast with a clear error, the .udp case is deprecated to warn at compile time, and the unused UDPTransport.swift stub is removed. Also drop the write-only readPos field in RTSPTransportConnection.
- Compute the RTP payload range from absolute indices so payload access is correct for Data slices with a nonzero startIndex (latent today, since callers pass startIndex==0, but fragile). - Treat sequence delta 0x8000 as out-of-order (top-bit-set half) instead of reporting a spurious 32768-packet loss at the boundary. - Avoid UInt8/UInt16 +1 traps in StreamContext description strings.
A flaky or hostile stream can accumulate more than 65535 lost packets before a frame is emitted, trapping the UInt16 loss arithmetic. Clamp loss accumulation in the H.264, H.265, AAC, and metadata depacketizers so the counter pins at the max instead of crashing.
- Scaling lists 0-5 are 4x4 (16 coeffs) and 6+ are 8x8 (64); the size was keyed off the list count (always 64), over-consuming bits and corrupting dimensions/VUI on High-profile cameras that send scaling matrices. Key it off the list index instead. - Run scaling-list scale math in Int so a large signed exp-Golomb delta can't overflow Int32 and trap. - Compute pixel dimensions in Int and validate they fit UInt16 (>0) so malformed crop/size values throw cleanly instead of trapping. - Guard num_units_in_tick*2 against UInt32 overflow. - Bound the num_ref_frames loop so a bogus count can't spin.
- Reject an Aggregation Packet (type 48) that arrives mid-fragment; it would otherwise finalize an access unit whose open FU NALEntry still carries the Int.max piece-index sentinel and trap on the slice. - Bound the attacker-controlled HRD cpb_cnt (<=31), num_long_term_ref_pics (<=32), and PPS tile column/row loops so malformed parameter sets can't spin or overflow n+1 on UInt32.
- Advance the nonce-count: authorize() is now mutating and persists nc, so each qop=auth request sends a unique, increasing nc instead of always 00000001 (strict servers reject the replay). - Honor the algorithm: support MD5, MD5-sess, SHA-256, SHA-256-sess and echo algorithm= in the header (was hard-wired to MD5). - Match qop by token, so a server offering only auth-int no longer triggers a bogus qop=auth response. - Reject a challenge with an empty nonce instead of emitting a useless Authorization header. Adds regression tests for nc increment, qop matching, SHA-256, and the empty-nonce case.
PCM/G.726/L16/G.723 depacketizers handed the public AudioFrame a Data slice of the receive buffer with a nonzero startIndex; consumers doing index-based access would read wrong bytes. Copy into a standalone Data (startIndex 0), matching the AAC and video paths.
NWConnection has no built-in connect deadline, so a host that accepts the socket but never reaches .ready would hang connect() forever. Race the state handler against a 15s timeout on the same serial queue, with a one-shot guard for exact-once resume, and tear down the half-open socket on timeout. Also stop force-unwrapping the port.
The video, audio, and application SETUP blocks were near-identical; fold them into one setupStreamTCP() helper (-31 lines). The helper also adopts the server's interleaved channel when it renumbers to a different free even channel, instead of always assuming our requested channel — a camera that renumbers would otherwise deliver media on a channel we never mapped, and all of its frames would be silently dropped. Drops the now-dead UDP header branches and the manual Session header (sendRequest adds it).
Port of retina ff771fe. Some cameras (e.g. "H264DVR 1.0" firmware) set the MARK bit on a trailing SEI, splitting the access unit early. Exclude SEI (NAL type 6) from canEndAU alongside SPS/PPS, and update the FU-A test vectors that relied on an SEI ending an AU to use non-IDR slices. Refs scottlamb/moonfire-nvr#352
Port of retina 8ff7a0f + e1c0bbf. A Fragmentation Unit (FU-A for H.264, FU for H.265) with both the START and END bits set is forbidden by RFC 6184 section 5.8 / RFC 7798 section 4.4.3, but some cameras wrap small NALs this way. Treat such a single-fragment FU as one complete NAL (hoisting the end-of-NAL finalization out of the start/continuation branches) instead of erroring out the stream.
Port of retina 6972ac4. Some cameras (e.g. Anjvision) send a Content-Base or Content-Location like "192.168.1.10:554/stream0/" with no rtsp:// scheme. Resolve those against the request URL's scheme instead of using them verbatim (which broke control-URL resolution). Absolute Content-Base headers are unaffected. Refs scottlamb/moonfire-nvr#356
The live integration suite publishes a synthetic stream with ffmpeg to a mediamtx RTSP server and pulls it back through RTSPClientSession, so both tools must be on PATH in CI.
End-to-end tests that drive the real RTSPClientSession: ffmpeg publishes a synthetic H.264/H.265/AAC stream to a mediamtx RTSP server, and the client pulls it back over RTSP-interleaved TCP, asserting keyframes, SPS/PPS(/VPS), and 48 kHz audio. The harness manages mediamtx + ffmpeg via Foundation.Process with bounded waits, a per-test SIGKILL watchdog, and pkill teardown so a stalled exchange can never hang the suite. Requires ffmpeg and mediamtx on PATH.
Long pull sessions were dropped by cameras after their session timeout because only DESCRIBE/SETUP/PLAY were ever sent. The streamFrames reader now demultiplexes interleaved RTP from RTSP responses, routing each response to its CSeq waiter, so control requests can interleave with RTP on the one TCP connection. On that foundation: - OPTIONS after connect discovers GET_PARAMETER support (else OPTIONS is the keepalive method); - a periodic keepalive runs at ~half the SETUP-advertised session timeout while streaming; - stop() now routes its TEARDOWN through the reader instead of issuing a competing read, fixing a ~60s block when called during active streaming (the reader used to consume the response). A live test exercises stop() mid-stream to guard the fix.
Adds a live test that proves keepalive sustains a session past a server read timeout: ffmpeg publishes to a mediamtx configured with readTimeout=3s (which drops idle readers), and with a 1s keepalive the client keeps receiving frames well past 6s. Uses an internal RTSPClientSession keepalive interval override so the cadence is short and deterministic. Also moves the fixture's blocking process/socket work (port reservation, launch, port-wait) onto a dedicated thread. Done on a Swift Concurrency cooperative thread, that blocking work could wedge the shared pool under the test runner's default parallelism and deadlock the whole suite; confining it to a private queue keeps the cooperative pool free.
Reserves an even-RTP / odd-RTCP local UDP socket pair (client_port), connects to the server's RTP/RTCP ports, hole-punches, and delivers datagrams via a DispatchSource drain loop on a private queue. IPv4 only; bounded bind retries and recv draining; idempotent close.
Wire UDP end to end: SETUP advertises client_port from a bound even/odd socket pair, connects to the server's server_port, and hole-punches after PLAY. The receive loop is generalized — shared routeRTP/routeRTCP feed both the TCP interleaved reader and a new UDP reader that multiplexes per-stream RTP/RTCP sockets with TCP control. Keepalive/TEARDOWN responses route over TCP (handled directly, off the lossy data buffer) for UDP sessions too. Peer host selection prefers a numeric IPv4 source. Drops the .udp rejection and deprecation.
Exercise the full UDP pipeline against mediamtx: client_port negotiation, the bound RTP/RTCP socket pair, multiplexed reception, and graceful stop() while RTP flows on UDP and control on TCP.
mediamtx ran dual [tcp, udp], so a UDP test could not prove the client used UDP (the server would serve either). Each test now runs mediamtx in one RTP mode: TCP tests use rtspTransports [tcp], UDP tests use [udp]. A single-mode server 461-rejects a SETUP for the other transport, so a passing test proves the matching client path. rtspTransports also gates the publisher, so the ffmpeg publish leg now follows suit (UDP publish for a UDP-only server). The fixture transports argument is now required (no dual default).
Replace the Darwin/BSD-socket UDPPair with an NWListener/NWConnection implementation, so the UDP RTP/RTCP path no longer imports Darwin and handles IPv4 and IPv6 peers natively (notably [::1] loopback). The even/odd local port pair is reserved by NWListeners held across the SETUP round-trip, then handed to NWConnections bound to the same local ports and connected to the server's server_port — a connected UDP flow, so the kernel keeps the symmetric-RTP peer filtering and holePunch()/ sends can omit the address. bind()/connect() are now async to avoid blocking a cooperative thread on the Network framework callbacks. RTSPConnection gains resolvedRemote() (host + family) so the session picks the local socket family from the control peer and selects an RTP/RTCP peer of the matching family.
Foundation's URL parser surfaces a bracketed IPv6 host verbatim (rtsp://[::1]:554/… -> [::1]), but NWEndpoint.Host expects the bare address and rejects the bracketed form, so the control connection (and the UDP peer fallback / family detection that reuse the host) failed on any IPv6 URL. Normalize via a testable unbracketedHost() that strips the surrounding brackets while preserving a link-local zone id.
Rewrite the live fixture's loopback port-reservation and TCP-probe helpers from raw BSD sockets to Network framework (NWListener/ NWConnection), making them address-family-aware so one fixture drives the client over both 127.0.0.1 and [::1]. The test file no longer imports Darwin. Add H.264-over-TCP and H.264-over-UDP tests on IPv6 (mediamtx bound on [::], pulled from rtsp://[::1]:PORT), completing the TCP/UDP x IPv4/IPv6 matrix; each asserts a keyframe plus SPS/PPS.
Audit API.md against the public source surface: - Correct keepalive diagnostic severity (success is .info, failure .warning). - Add the referenced-but-undefined TcpStreamContext / UdpStreamContext. - Document IPv6: both transports work over IPv4/IPv6, and the url host accepts a bracketed IPv6 literal (incl. link-local zone ids). - Update the IPCamKit.swift header comment to note .udp / IPv4/IPv6.
It was accidentally exposed as a public top-level constant but is only used internally within Timestamp.swift.
Cap the unparsed read buffer (4 MiB) and Content-Length (2 MiB) so a camera that never frames a message, or declares an absurd body, can no longer exhaust memory. Trim header field names (not just values) so "CSeq : 1" still routes; tolerate bare-LF line endings and tab/junk in the status line (Postel). Scan for the header terminator in place via withUnsafeBytes instead of copying the whole buffer each call, which was O(n^2) as a body streams in. Addresses audit findings #2/#3/#15/#37/#39/#42/#56. Adds regression tests including a mixed CRLF-headers/bare-LF-terminator case.
Match SDP attribute names case-insensitively (cameras emit a=Control:/ a=RTPMAP:), normalized once at parse time with case-insensitive lookup helpers, so an off-case control attribute no longer SETUPs the wrong track and an off-case rtpmap no longer drops the whole stream. Retain session-level b= (was silently dropped). Tolerate tab/repeated-space separators in the m= line and fmt list. Addresses audit findings #43/#44/#45/#48. Adds regression tests.
Escape backslash/quote in Digest quoted-string header fields and unescape them when parsing the challenge (RFC 7616), so credentials or a server-echoed realm containing a quote no longer corrupt the Authorization header. Reject control characters in challenge values to prevent control-byte injection into the request. Send Basic pre-emptively after the first 401 instead of paying a round-trip on every request. Preserve Session and User-Agent on the 401 retry so mid-session re-auth on SETUP/PLAY works, and iterate all WWW-Authenticate challenges so Digest is chosen over Basic regardless of ordering. Addresses audit findings #10/#13/#38/#40/#41. Adds regression tests.
A foreign or early RTCP Sender Report under the default dropPackets policy was anchoring the timeline origin before its SSRC was validated, corrupting NPT for the whole session. Decide the SSRC policy first and only place() an SR we keep; warn once per stream on the first dropped unknown-SSRC SR (wiring up the previously-dead latch). Clamp NtpTimestamp.date to 1970 for pre-epoch/unsynced values instead of wrapping to a far-future date. Drop the bogus streamIndex<255 channel guard and remove dead RTCP code (asTyped/TypedPacketRef/ ReceiverReportRef/packets()). Addresses audit findings #9/#33/#34/#35/#51. Adds regression tests.
A VPS/SPS/PPS NAL over 65535 bytes trapped building the HEVC record; reject it up front, mirroring the H264 guard. SimpleAudioDepacketizer frameLength now returns nil on an oversized payload instead of a precondition trap. Compute the H.265 conformance-window crop in 64-bit so malformed offsets can't wrap a 32-bit add to a wrong/zero size, and require the crop to leave a positive picture. Tighten bit_depth_*_minus8 to the spec range [0,6] (7/8 also collided with the reserved HEVC-record bit). Addresses audit findings #4/#19/#31/#32. Adds regression tests for the NAL-size and oversized-payload guards.
Match fmtp keys case-insensitively (cameras emit Sprop-VPS, tx-mode=srst) and skip valueless/empty tokens instead of rejecting the whole fmtp, so a stray flag or trailing ';' no longer drops the parameter sets. Parse sprop-vps/sps/pps as the RFC 7798 comma-separated list and strip a leading Annex B start code some cameras prepend. Fail loud on sprop-max-don-diff>0 (DONL/DOND decoding is unimplemented) rather than silently emitting corrupt frames. Addresses audit findings #8/#14/#26/#27/#28/#29. Adds regression tests.
Trust the AudioSpecificConfig over the rtpmap channel count (RFC 3640 makes the ASC authoritative; the rtpmap field is informational and often wrong/omitted), and accept an RTP clock that is a small integer multiple of the ASC sampling frequency (HE-AAC/SBR) — both previously dropped audio silently. Enforce the 16-bit AU-headers-length invariant the code already documents instead of truncating an off-spec length, and compare the full 32-bit fragment timestamp so a low-16-bit collision can't mis-stitch. Remove the dead SBR/PS branch and the unreachable channelConfiguration==0 guard. Addresses audit findings #5/#6/#17/#18/#20/#54. Adds regression tests.
Cap a single accumulated H.264/H.265 access unit at 16 MiB so a hostile camera can't stream FU continuations (or same-timestamp NALs) forever and exhaust memory: every NAL-body append now flows through one checked helper and every buffer clear through one reset helper, so the byte accumulator can't drift. Reset the depacketizer to a clean state on a failed push so a later push can't build on a half-mutated buffer (prerequisite for making per-packet errors recoverable). Drop the dead validateOrder function and the duplicate SPS RBSP decode in parseSPSAndPPS (reuse parseSPS's already-decoded values). Addresses audit findings #7/#22/#24/#25/#30. Adds cap regression tests.
A malformed, SSRC-mismatched, or timeline-rejected RTP/RTCP packet now drops with a rate-limited (once-per-stream) diagnostic instead of finishing the whole AsyncThrowingStream — a single bad packet from a sloppy or hostile camera on a lossy link no longer kills video, audio, and metadata. InorderParser is the single decision point: it drops these cases and throws only under the opt-in .abortSession SSRC policy; routeRTP catches depacketizer push/pull failures and drops the packet/frame rather than rethrowing. Per-packet diagnostics are latched so a misbehaving camera can't flood the consumer. Narrow the RTP-Info seq seed to ignore only the seq=0;rtptime=0 placeholder, restoring initial-loss detection for cameras that start at seq 1. Addresses audit findings #0/#1/#11/#50. Adds InorderParser regression tests.
Reject a second concurrent frames() consumer (it would race the same
connection and corrupt the RTP pipeline) with a clear invalidState
error. Cancel each keepalive/control timeout task as soon as its
response resolves, so a fast response no longer leaves a sleeping task
lingering for the full timeout. Tolerate Session timeout=0 ("no expiry")
and a single-value interleaved= transport param instead of aborting
SETUP.
Addresses audit findings #12/#46/#47/#49. Adds SETUP regression tests.
Fail fast on a refused/unreachable camera: NWConnection parks such connections in .waiting, which connect() ignored and so burned the full 15s timeout — give it a short grace window then surface the real cause. Surface a terminal UDP receive error (e.g. an ICMP port-unreachable for a NAT'd camera) via a diagnostic rather than letting the receive loop die silently. Count and report interleaved media a camera streams on the control channel before its PLAY response completes (those frames are dropped). Addresses audit findings #59/#60/#61.
Fix the RTSPClientSession usage example (frames(), not the non-existent videoFrames()) and note it is single-consume. Document that the loss field is a forward RTP-sequence delta (may be large/spurious), the push/pull drain contract on the H.264/H.265 depacketizers, the ApplicationDepacketizer single-packet-doc-after-loss edge, and two deliberately-omitted transport behaviours (no per-read idle timeout; UDP connect cancellation observed only at the connect timeout). Addresses audit findings #21/#23/#36/#52.
InternalError mirrored upstream retina's rich ErrorInt context but had zero references anywhere; the live error path uses RTSPError and the context-carrying DepacketizeError. The context types it referenced (ConnectionContext/StreamContext/PacketContext/RtspMessageContext) stay, still used via PacketContext and DepacketizeError. Net -95 lines. Addresses audit finding #55 (dead-taxonomy half; the public RTSPError catch-all is left as-is now that per-packet errors surface as diagnostics).
Surface conditions that previously dropped data silently: interleaved data on an unnegotiated channel, the FU-reassembly anomalies the video depacketizer already tracked (inconsistent FU NAL header, single-fragment FU) but never reported despite the comments saying callers should, and SDP streams that fail to parse while others succeed. All rate-limited to once per condition so a misbehaving camera can't flood the consumer. Completes diagnostic coverage of the drop/recover paths. Adds a parseDescribe regression test.
Add a live integration test that drives a real ffmpeg->mediamtx session, confirms a frames() consumer is delivering, then asserts a second concurrent frames() is rejected with .invalidState and yields nothing — covering the single-consumer guard (#12) end to end. Also surface a diagnostic when the SDP advertised sprop parameter sets but they failed to parse and the depacketizer fell back to in-band parameters (previously silent).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A security/robustness audit-and-hardening pass over the RTSP/RTP/codec stack,
plus three new transport-level capabilities. The headline goal: a malformed or
hostile camera (or anything on the path) should never be able to crash the
process or silently stall a stream — it should degrade to a diagnostic and keep
going. Along the way this lands real UDP transport, IPv6 on both transports, and
automatic session keepalive.
51 commits since 0.2.0 (
main@ bc2ccd3). Work is split into small, individuallyreviewed commits; see
CHANGELOG.mdfor the user-facing summary.Hardening / security
on malformed or hostile input — out-of-range reads, bad lengths, and truncated
buffers now error or are tolerated instead of aborting.
malformed stream can't grow memory without limit.
so a stray report can't corrupt timestamps.
the 401 retry path.
packet/context handling, and saturate loss counters to avoid overflow traps.
ntpUnixEpochprivate (was an accidentally-public top-level symbol).Postel's-law tolerance for real cameras
fmtp; make AAC depacketization tolerantof real-camera deviations.
SEI, splitting the access unit early).
Content-Baseheader by resolving it against the request URL.to a standalone
Data.New capabilities
Transport.udpnow streams end-to-end overan Apple Network-framework socket pair (even RTP port + consecutive odd RTCP
port, negotiated via
client_port), replacing the non-functional scaffolding.Best-effort NAT hole-punch after
PLAY.(
rtsp://[2001:db8::1]:554/stream).GET_PARAMETERwhen advertised,otherwise
OPTIONS, at ~half the negotiated session timeout. Each round-tripreported via
onDiagnosticat.info.camera fails fast instead of parking in
.waitingfor the full timeout.Diagnostics
e.g. a terminal UDP receive error (ICMP port-unreachable from a NAT'd camera).
onDiagnosticcoverage of the drop/recover paths (unnegotiatedchannels, FU-reassembly anomalies, audio/metadata SETUP or init failures, sprop
parse fallback, pre-
PLAYinterleaved media, partially-parsable SDP). Eachcondition is rate-limited to fire once so a misbehaving camera can't flood the
consumer.
Testing
RTSPClientSessionend-to-endagainst an
ffmpeg-published stream relayed bymediamtx, exercisingH.264/H.265/AAC over both RTSP-interleaved TCP and UDP, including IPv6.
ffmpegandmediamtxto run them.