Skip to content

Conversation

@baelter
Copy link
Member

@baelter baelter commented Dec 19, 2025

Problem

When a client crashes (e.g., ctrl-c), the proxy can send duplicate Channel::Close frames to the upstream broker:

  1. Client sends Channel::Close for a channel
  2. Proxy forwards it to upstream
  3. Client crashes before receiving CloseOk
  4. Proxy's close_all_upstream_channels sends another Channel::Close with reason "CLIENT_DISCONNECTED"

The upstream broker sees Channel::Close for a channel that's already closing/closed, causing error:

operation channel.close caused a connection exception channel_error: "expected 'channel.open'"

This closes the entire upstream connection, affecting all clients sharing that connection.

Evidence from investigation

tcpdump on upstream showed: "back to back Channel.Close reply=CLIENT_DISCONNECTED frames"

Reliable reproduction: rapidly open/close channels, then kill client with ctrl-c.

Solution

Delete the channel from @channel_map immediately when forwarding Channel::Close, rather than waiting for CloseOk. This way, if the client crashes, close_all_upstream_channels won't find the channel and won't send a duplicate close.

This is simpler than using nil as a sentinel value (as proposed in #233) - just remove the channel from the map when it starts closing, regardless of which side initiated the close.

Changes

  • Add explicit handling for Channel::Close in read_loop: delete from map and forward to upstream
  • Combine Channel::Close and Channel::CloseOk cases in write method since both just delete from map
  • Add specs that reproduce the issue

Test plan

  • New spec: "does not send duplicate channel.close frames when client crashes after sending close" - opens 10 channels, closes all rapidly, crashes
  • New spec: "does not send duplicate channel.close when upstream initiates close and client crashes"
  • All existing specs pass

Add two specs that verify the proxy does not send duplicate Channel::Close
frames to upstream when clients crash:

1. Client-initiated close: Opens multiple channels, sends Channel::Close for
   all of them, then crashes before receiving CloseOk. Without the fix, this
   causes duplicate Channel::Close frames which triggers RabbitMQ to close
   the connection with "expected 'channel.open'" error.

2. Upstream-initiated close: Triggers a channel error (consume from non-existent
   queue), receives Channel::Close from upstream, then crashes before sending
   CloseOk. The proxy should have already sent CloseOk to upstream.

These specs reproduce the issue reported where client crashes could cause
the upstream connection to be closed, affecting all clients sharing that
connection.
@baelter baelter requested a review from a team as a code owner December 19, 2025 11:48
@baelter baelter force-pushed the channel_close_duplicates_v2 branch 2 times, most recently from 7d01825 to 59ee86f Compare December 19, 2025 12:15
When a client sends Channel::Close and then crashes before receiving
CloseOk, the proxy would send a duplicate Channel::Close from
close_all_upstream_channels, causing the upstream to close the
connection with "expected 'channel.open'" error.

Fix: Delete the channel from @channel_map immediately when forwarding
Channel::Close, rather than waiting for CloseOk. This way, if the client
crashes, close_all_upstream_channels won't find the channel and won't
send a duplicate close.

This is simpler than using nil as a sentinel value - just remove the
channel from the map when it starts closing, regardless of which side
initiated the close.
Copy link
Member

@carlhoerberg carlhoerberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! Should we retype @channel_map = Hash(UInt16, UpstreamChannel?).new so that it cant be nil?

Was the nil value never used before either?

@chadknutson
Copy link

Specs are new to me [added to my "learn someday" list]. For now, I took the 2 specs that you added and ran them (with slight tweaks) as stand-alone functions.
spec 1: I see the channel.close error for master amqproxy but not for branch channel_close_duplicates. That seems very good!
spec 2: when amqproxy's connection is closed by RMQ due to the channel error, it VERY quickly opens a new connection to the upstream. Thus, I'm not sure this spec tests anything meaningful right now. Maybe we can have 2 sockets, each with a connection to amqproxy. close 1 of them, and then verify that the other remained connected? Not sure exactly how to do that.

@chadknutson
Copy link

chadknutson commented Dec 19, 2025

Proposed change to the specs:

  • Since we are hard-coding the user as guest/guest, we don't need a proxy_url, right? Just use those creds for the amqp-client connection. Or use the proxy_url creds for the socket connection. Just make sure that vhost is the same for both methods.
  • Open a connection using amqp-client.cr at the beginning of the specs. Then check that this connection is still open at the end of the specs. E.g.
connA = AMQP::Client.new(proxy_url).connect
# do the socket stuff
# ...
connA.closed?.should be_false

The channel_map no longer stores nil values after the previous fix changed
from assigning nil to using delete(). This simplifies the type from
Hash(UInt16, UpstreamChannel?) to Hash(UInt16, UpstreamChannel) and removes
the unnecessary .try call when closing channels.
@baelter
Copy link
Member Author

baelter commented Dec 19, 2025

Good! Should we retype @channel_map = Hash(UInt16, UpstreamChannel?).new so that it cant be nil?

Was the nil value never used before either?

Good catch, nil was used to communicate that a channel was closing. The new approach just deletes from the map instead, which is cleaner imo. Updated the type.

Added a canary connection in both channel.close specs that stays open
throughout the test. If the bug exists (duplicate Channel::Close sent),
the upstream connection would close, affecting ALL clients including the
canary. Verifying the canary remains open proves the upstream connection
didn't close due to duplicate frames.

This addresses Chad's concern that checking upstream_connections.should eq 1
alone might not be meaningful if the proxy quickly reconnects.
@baelter
Copy link
Member Author

baelter commented Dec 19, 2025

Open a connection using amqp-client.cr at the beginning of the specs

Good idea! Both specs now open a second connection that stays alive throughout the test. If the bug exists and sends duplicate Channel::Close, it would close the upstream connection and kill this connection too. Verifying it stays open proves the fix works.

Copy link

@chadknutson chadknutson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants