Skip to content

NatsJetStreamPullSubscription.iterate() can block far beyond maxWait under high subject cardinality #1576

@mrmx

Description

@mrmx

Environment

  • nats.java 2.25.3 (also reproduces against current main — no diff in the relevant files)
  • nats-server 2.14.0 (also seen on 2.12.x)
  • High-cardinality pull consumer: ~600 distinct subjects, ~500 msg/s sustained
  • Client loop: while (it.hasNext()) handle(it.next()) over iterate(batchSize, Duration.ofSeconds(2))

Symptom

After hours-to-days of healthy operation, the consume loop silently stops making progress. No exception, no onError, no log. Server-side nats consumer info shows Waiting Pulls = 0 and delivered.last_active growing without bound. App-level message timestamp does not advance. The only mitigation found is to dispose the subscription and rebuild from a fresh ConsumerContext + pull request.

Analysis (two compounding bugs)

Bug 1 — iterate() per-call maxWait budget, not per-iterator

In NatsJetStreamPullSubscription._iterate(batchSize, maxWaitNanos) (line 240), one _pull(...) is issued (line 260), then a nested Iterator is returned whose hasNext() calls _nextUnmanaged(maxWaitNanos, pullSubject) on line 280, passing the full maxWaitNanos every time.

Inside _nextUnmanaged (line 138+), start = NatsSystemClock.nanoTime() is captured at the top (line 139) and used to compute timeLeftNanos within that call. So the deadline resets on every hasNext().

A tight client loop while (hasNext()) next() can therefore wait indefinitely — the 2-second maxWait is per-hasNext(), not per-iterator. The Javadoc on iterate() suggests a per-iterator deadline; the implementation doesn't enforce one.

Bug 2 — STATUS_TERMINUS from a stale pullSubject is silently dropped

_pull() increments pullSubjectIdHolder.incrementAndGet() (line 62) so each pull request uses a unique inbox like _INBOX.xxx.42.

In _nextUnmanaged (lines 160-164):

case STATUS_TERMINUS:
    // if there is a match, the status applies otherwise it's ignored
    if (pullSubject.equals(msg.getSubject())) {
        return messages;
    }
    break;

When the client tears down one iterator and starts another (next batch), a late 408 Pull Expired from the previous pull (id 41) can arrive during the new pull (id 42). The equality check fails, the terminator is silently dropped, and the while (batchLeft > 0 && timeLeftNanos > 0) loop continues — eating the remainder of the budget waiting for messages that won't come on the new inbox either.

Under low subject cardinality / low traffic, the window for a late 408 to overlap with a new pull is small and this is hard to hit. With ~600 subjects and ~500 msg/s the overlap is routine; the client effectively stops noticing pull expirations.

Server-side observability

The server-side Waiting Pulls = 0 finding is consistent with this diagnosis — the server did expire the pull and did send the 408. The client just didn't react to it.

Workaround we are using

App-side wedge watchdog (separate thread) that observes "no message delivered for N seconds" and force-rebuilds the subscription from a fresh ConsumerContext. Effective but obviously a band-aid; ~50 forced restarts per hour on the affected consumer.

Suggested fixes (non-exhaustive)

  1. In _iterate(), track a per-iterator deadline computed once at iterator construction; pass timeLeftNanos = deadline - now() into _nextUnmanaged instead of always passing maxWaitNanos.
  2. In _nextUnmanaged, when a STATUS_TERMINUS for a stale pullSubject arrives, treat it as a signal that the budget is gone (the server has clearly moved on) — at minimum, do not credit the ignored message back to timeLeftNanos.
  3. Document that iterate(batch, maxWait) has a per-call (not per-iterator) deadline, until (1) lands, so consumers don't misinterpret the contract.

Happy to provide

  • Reproducer (high-cardinality pull consumer + synthetic traffic) if helpful.
  • Thread dumps showing _nextUnmanaged blocked beyond maxWait.

Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions