Environment
- nats.java 2.25.3 (also reproduces against current
main — no diff in the relevant files)
- nats-server 2.14.0 (also seen on 2.12.x)
- High-cardinality pull consumer: ~600 distinct subjects, ~500 msg/s sustained
- Client loop:
while (it.hasNext()) handle(it.next()) over iterate(batchSize, Duration.ofSeconds(2))
Symptom
After hours-to-days of healthy operation, the consume loop silently stops making progress. No exception, no onError, no log. Server-side nats consumer info shows Waiting Pulls = 0 and delivered.last_active growing without bound. App-level message timestamp does not advance. The only mitigation found is to dispose the subscription and rebuild from a fresh ConsumerContext + pull request.
Analysis (two compounding bugs)
Bug 1 — iterate() per-call maxWait budget, not per-iterator
In NatsJetStreamPullSubscription._iterate(batchSize, maxWaitNanos) (line 240), one _pull(...) is issued (line 260), then a nested Iterator is returned whose hasNext() calls _nextUnmanaged(maxWaitNanos, pullSubject) on line 280, passing the full maxWaitNanos every time.
Inside _nextUnmanaged (line 138+), start = NatsSystemClock.nanoTime() is captured at the top (line 139) and used to compute timeLeftNanos within that call. So the deadline resets on every hasNext().
A tight client loop while (hasNext()) next() can therefore wait indefinitely — the 2-second maxWait is per-hasNext(), not per-iterator. The Javadoc on iterate() suggests a per-iterator deadline; the implementation doesn't enforce one.
Bug 2 — STATUS_TERMINUS from a stale pullSubject is silently dropped
_pull() increments pullSubjectIdHolder.incrementAndGet() (line 62) so each pull request uses a unique inbox like _INBOX.xxx.42.
In _nextUnmanaged (lines 160-164):
case STATUS_TERMINUS:
// if there is a match, the status applies otherwise it's ignored
if (pullSubject.equals(msg.getSubject())) {
return messages;
}
break;
When the client tears down one iterator and starts another (next batch), a late 408 Pull Expired from the previous pull (id 41) can arrive during the new pull (id 42). The equality check fails, the terminator is silently dropped, and the while (batchLeft > 0 && timeLeftNanos > 0) loop continues — eating the remainder of the budget waiting for messages that won't come on the new inbox either.
Under low subject cardinality / low traffic, the window for a late 408 to overlap with a new pull is small and this is hard to hit. With ~600 subjects and ~500 msg/s the overlap is routine; the client effectively stops noticing pull expirations.
Server-side observability
The server-side Waiting Pulls = 0 finding is consistent with this diagnosis — the server did expire the pull and did send the 408. The client just didn't react to it.
Workaround we are using
App-side wedge watchdog (separate thread) that observes "no message delivered for N seconds" and force-rebuilds the subscription from a fresh ConsumerContext. Effective but obviously a band-aid; ~50 forced restarts per hour on the affected consumer.
Suggested fixes (non-exhaustive)
- In
_iterate(), track a per-iterator deadline computed once at iterator construction; pass timeLeftNanos = deadline - now() into _nextUnmanaged instead of always passing maxWaitNanos.
- In
_nextUnmanaged, when a STATUS_TERMINUS for a stale pullSubject arrives, treat it as a signal that the budget is gone (the server has clearly moved on) — at minimum, do not credit the ignored message back to timeLeftNanos.
- Document that
iterate(batch, maxWait) has a per-call (not per-iterator) deadline, until (1) lands, so consumers don't misinterpret the contract.
Happy to provide
- Reproducer (high-cardinality pull consumer + synthetic traffic) if helpful.
- Thread dumps showing
_nextUnmanaged blocked beyond maxWait.
Thanks!
Environment
main— no diff in the relevant files)while (it.hasNext()) handle(it.next())overiterate(batchSize, Duration.ofSeconds(2))Symptom
After hours-to-days of healthy operation, the consume loop silently stops making progress. No exception, no
onError, no log. Server-sidenats consumer infoshowsWaiting Pulls = 0anddelivered.last_activegrowing without bound. App-level message timestamp does not advance. The only mitigation found is to dispose the subscription and rebuild from a freshConsumerContext+ pull request.Analysis (two compounding bugs)
Bug 1 —
iterate()per-callmaxWaitbudget, not per-iteratorIn
NatsJetStreamPullSubscription._iterate(batchSize, maxWaitNanos)(line 240), one_pull(...)is issued (line 260), then a nestedIteratoris returned whosehasNext()calls_nextUnmanaged(maxWaitNanos, pullSubject)on line 280, passing the fullmaxWaitNanosevery time.Inside
_nextUnmanaged(line 138+),start = NatsSystemClock.nanoTime()is captured at the top (line 139) and used to computetimeLeftNanoswithin that call. So the deadline resets on everyhasNext().A tight client loop
while (hasNext()) next()can therefore wait indefinitely — the 2-secondmaxWaitis per-hasNext(), not per-iterator. The Javadoc oniterate()suggests a per-iterator deadline; the implementation doesn't enforce one.Bug 2 —
STATUS_TERMINUSfrom a stalepullSubjectis silently dropped_pull()incrementspullSubjectIdHolder.incrementAndGet()(line 62) so each pull request uses a unique inbox like_INBOX.xxx.42.In
_nextUnmanaged(lines 160-164):When the client tears down one iterator and starts another (next batch), a late
408 Pull Expiredfrom the previous pull (id 41) can arrive during the new pull (id 42). The equality check fails, the terminator is silently dropped, and thewhile (batchLeft > 0 && timeLeftNanos > 0)loop continues — eating the remainder of the budget waiting for messages that won't come on the new inbox either.Under low subject cardinality / low traffic, the window for a late 408 to overlap with a new pull is small and this is hard to hit. With ~600 subjects and ~500 msg/s the overlap is routine; the client effectively stops noticing pull expirations.
Server-side observability
The server-side
Waiting Pulls = 0finding is consistent with this diagnosis — the server did expire the pull and did send the 408. The client just didn't react to it.Workaround we are using
App-side wedge watchdog (separate thread) that observes "no message delivered for N seconds" and force-rebuilds the subscription from a fresh
ConsumerContext. Effective but obviously a band-aid; ~50 forced restarts per hour on the affected consumer.Suggested fixes (non-exhaustive)
_iterate(), track a per-iterator deadline computed once at iterator construction; passtimeLeftNanos = deadline - now()into_nextUnmanagedinstead of always passingmaxWaitNanos._nextUnmanaged, when aSTATUS_TERMINUSfor a stalepullSubjectarrives, treat it as a signal that the budget is gone (the server has clearly moved on) — at minimum, do not credit the ignored message back totimeLeftNanos.iterate(batch, maxWait)has a per-call (not per-iterator) deadline, until (1) lands, so consumers don't misinterpret the contract.Happy to provide
_nextUnmanagedblocked beyondmaxWait.Thanks!