NatsJetStreamPullSubscription.iterate() can block far beyond `maxWait` under high subject cardinality

## Environment
- nats.java 2.25.3 (also reproduces against current `main` — no diff in the relevant files)
- nats-server 2.14.0 (also seen on 2.12.x)
- High-cardinality pull consumer: ~600 distinct subjects, ~500 msg/s sustained
- Client loop: `while (it.hasNext()) handle(it.next())` over `iterate(batchSize, Duration.ofSeconds(2))`

## Symptom
After hours-to-days of healthy operation, the consume loop silently stops making progress. No exception, no `onError`, no log. Server-side `nats consumer info` shows `Waiting Pulls = 0` and `delivered.last_active` growing without bound. App-level message timestamp does not advance. The only mitigation found is to dispose the subscription and rebuild from a fresh `ConsumerContext` + pull request.

## Analysis (two compounding bugs)

### Bug 1 — `iterate()` per-call `maxWait` budget, not per-iterator

In `NatsJetStreamPullSubscription._iterate(batchSize, maxWaitNanos)` (line 240), one `_pull(...)` is issued (line 260), then a nested `Iterator` is returned whose `hasNext()` calls `_nextUnmanaged(maxWaitNanos, pullSubject)` on line 280, passing the **full** `maxWaitNanos` every time.

Inside `_nextUnmanaged` (line 138+), `start = NatsSystemClock.nanoTime()` is captured at the top (line 139) and used to compute `timeLeftNanos` within that call. So the deadline resets on every `hasNext()`.

A tight client loop `while (hasNext()) next()` can therefore wait indefinitely — the 2-second `maxWait` is per-`hasNext()`, not per-iterator. The Javadoc on `iterate()` suggests a per-iterator deadline; the implementation doesn't enforce one.

### Bug 2 — `STATUS_TERMINUS` from a stale `pullSubject` is silently dropped

`_pull()` increments `pullSubjectIdHolder.incrementAndGet()` (line 62) so each pull request uses a unique inbox like `_INBOX.xxx.42`.

In `_nextUnmanaged` (lines 160-164):

```java
case STATUS_TERMINUS:
    // if there is a match, the status applies otherwise it's ignored
    if (pullSubject.equals(msg.getSubject())) {
        return messages;
    }
    break;
```

When the client tears down one iterator and starts another (next batch), a late `408 Pull Expired` from the **previous** pull (id 41) can arrive during the new pull (id 42). The equality check fails, the terminator is silently dropped, and the `while (batchLeft > 0 && timeLeftNanos > 0)` loop continues — eating the remainder of the budget waiting for messages that won't come on the new inbox either.

Under low subject cardinality / low traffic, the window for a late 408 to overlap with a new pull is small and this is hard to hit. With ~600 subjects and ~500 msg/s the overlap is routine; the client effectively stops noticing pull expirations.

## Server-side observability

The server-side `Waiting Pulls = 0` finding is consistent with this diagnosis — the server *did* expire the pull and *did* send the 408. The client just didn't react to it.

## Workaround we are using

App-side wedge watchdog (separate thread) that observes "no message delivered for N seconds" and force-rebuilds the subscription from a fresh `ConsumerContext`. Effective but obviously a band-aid; ~50 forced restarts per hour on the affected consumer.

## Suggested fixes (non-exhaustive)

1. In `_iterate()`, track a per-iterator deadline computed once at iterator construction; pass `timeLeftNanos = deadline - now()` into `_nextUnmanaged` instead of always passing `maxWaitNanos`.
2. In `_nextUnmanaged`, when a `STATUS_TERMINUS` for a *stale* `pullSubject` arrives, treat it as a signal that the budget is gone (the server has clearly moved on) — at minimum, do not credit the ignored message back to `timeLeftNanos`.
3. Document that `iterate(batch, maxWait)` has a per-call (not per-iterator) deadline, until (1) lands, so consumers don't misinterpret the contract.

## Happy to provide
- Reproducer (high-cardinality pull consumer + synthetic traffic) if helpful.
- Thread dumps showing `_nextUnmanaged` blocked beyond `maxWait`.

Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NatsJetStreamPullSubscription.iterate() can block far beyond `maxWait` under high subject cardinality #1576

Environment

Symptom

Analysis (two compounding bugs)

Bug 1 — `iterate()` per-call `maxWait` budget, not per-iterator

Bug 2 — `STATUS_TERMINUS` from a stale `pullSubject` is silently dropped

Server-side observability

Workaround we are using

Suggested fixes (non-exhaustive)

Happy to provide

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

NatsJetStreamPullSubscription.iterate() can block far beyond maxWait under high subject cardinality #1576

Description

Environment

Symptom

Analysis (two compounding bugs)

Bug 1 — iterate() per-call maxWait budget, not per-iterator

Bug 2 — STATUS_TERMINUS from a stale pullSubject is silently dropped

Server-side observability

Workaround we are using

Suggested fixes (non-exhaustive)

Happy to provide

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

NatsJetStreamPullSubscription.iterate() can block far beyond `maxWait` under high subject cardinality #1576

Bug 1 — `iterate()` per-call `maxWait` budget, not per-iterator

Bug 2 — `STATUS_TERMINUS` from a stale `pullSubject` is silently dropped