HCD-241: DSE 6.8/9 upgradeability #2240

Open
szymon-miezal wants to merge 4 commits into main from hcd-241

@szymon-miezal szymon-miezal commented Feb 19, 2026

This PR addresses critical compatibility issues discovered during DSE to HCD (Hyper-Converged Database) upgrade scenarios. The changes focus on ensuring smooth interoperability between DSE 6.8+ nodes and HCD nodes during mixed-version cluster operations.

What Problems Do These Changes Fix?

1. Enhanced Debugging Capabilities for Upgrade Scenarios

Problem: When upgrading from DSE to HCD, troubleshooting connection and gossip issues was difficult due to insufficient logging.

Solution: Added comprehensive logging throughout the handshake and gossip processes, including:

  • Port selection logic for DSE → HCD upgrades
  • Messaging version negotiation during handshake
  • Protocol proposals from DSE peers
  • Gossip state values (both loaded and saved)
  • Connection initialization events

Why it matters: This gives operators visibility into what's happening during the upgrade process, making it much easier to diagnose and resolve issues in production environments.


2. Gossip Deserialization Crash Prevention

Problem: Nodes would crash with ArrayIndexOutOfBoundsException when processing gossip messages from pre-4.0 nodes. This happened because the code assumed certain delimiters would always be present in address and status values, but older nodes sent data in different formats. Both INTERNAL_ADDRESS_AND_PORT and NATIVE_ADDRESS_AND_PORT could arrive without port delimiters (e.g., "10.0.0.1" instead of "10.0.0.1:7000"), and STATUS_WITH_PORT could arrive without additional port information.

Solution: Added defensive length checks after splitting gossip values to gracefully handle cases where expected delimiters are missing:

  • Internal and native IP addresses without ports (e.g., "10.0.0.1" vs "10.0.0.1:7000")
  • Status values without additional port information

Why it matters: Prevents cluster instability and crashes during mixed-version operations.
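
For illustration, the defensive length checks described above might look like the following minimal sketch. The class and method names (`GossipValues`, `addressWithoutPort`, `portOf`, `statusName`) are hypothetical and are not taken from this PR; the sketch also assumes IPv4 or hostname values, since IPv6 literals contain additional colons and would need different handling.

```java
import java.util.OptionalInt;

/** Hypothetical helper illustrating length checks after String.split; not the PR's code. */
final class GossipValues
{
    private GossipValues() {}

    /** "10.0.0.1:7000" -> "10.0.0.1"; "10.0.0.1" -> "10.0.0.1" (IPv4/hostname only). */
    static String addressWithoutPort(String value)
    {
        String[] parts = value.split(":");
        // split() drops trailing empty strings, so the array may be empty (e.g. for ":")
        return parts.length > 0 ? parts[0] : value;
    }

    /** Returns the port when the delimiter and a numeric port are present, otherwise empty. */
    static OptionalInt portOf(String value)
    {
        String[] parts = value.split(":");
        if (parts.length < 2)
            return OptionalInt.empty(); // value arrived without a port delimiter
        try
        {
            return OptionalInt.of(Integer.parseInt(parts[1]));
        }
        catch (NumberFormatException e)
        {
            return OptionalInt.empty();
        }
    }

    /** "NORMAL,10.0.0.1:7000" -> "NORMAL"; "NORMAL" -> "NORMAL". */
    static String statusName(String value)
    {
        String[] parts = value.split(",");
        return parts.length > 0 ? parts[0] : value;
    }
}
```

The point of the sketch is only that every array index is guarded by a length check, so a value without the expected delimiter falls back to a sensible default instead of throwing ArrayIndexOutOfBoundsException.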


3. DSE 6.8+ PING Request Compatibility

Problem: During startup connectivity checks, HCD nodes were attempting to send PING requests to DSE 6.8+ peers, but these versions don't support PING_REQ messages (similar to DSE 6.x and Cassandra 3.x). This resulted in noisy error logs during cluster startup.

Solution: Extended the existing version detection logic to recognize and skip PING requests for DSE 6.8+ peers during the startup connectivity check phase.

Why it matters: Eliminates unnecessary error logs when HCD nodes join or restart in a mixed DSE/HCD environment.
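
A rough sketch of the idea, under stated assumptions: peers are keyed by a plain string, their known messaging version is an `Integer` (null when unknown), and `MIN_VERSION_WITH_PING` is a made-up threshold. None of these names or values come from this PR, and the real detection logic for DSE peers may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical filter deciding which peers receive a startup PING; not the PR's code. */
final class StartupPingFilter
{
    // Assumed threshold for illustration only: peers below it are treated as not understanding PING_REQ.
    static final int MIN_VERSION_WITH_PING = 12;

    private StartupPingFilter() {}

    /** Returns the peers whose known messaging version is at least MIN_VERSION_WITH_PING. */
    static List<String> peersToPing(Map<String, Integer> peerVersions)
    {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : peerVersions.entrySet())
        {
            Integer version = entry.getValue();
            // Assumption: peers with an unknown (null) version are skipped as well.
            if (version != null && version >= MIN_VERSION_WITH_PING)
                result.add(entry.getKey());
        }
        return result;
    }
}
```

Filtering before sending, rather than catching failures afterwards, is what keeps the error logs quiet in a mixed cluster.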


4. User-Defined Type Preservation During Schema Migration

Problem: When keyspace schemas were updated during migration, User-Defined Types (UDTs) from the previous schema could be lost if they weren't explicitly present in the new schema. This caused failures when inherited tables depended on these types.

Solution: Modified the schema transformation logic to preserve UDTs from the previous schema when they don't exist in the new schema, ensuring dependent tables continue to function correctly.

Why it matters: Prevents data model corruption and application failures during schema migrations. This is particularly important for complex schemas that use inheritance and UDTs.
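
The preservation rule can be sketched as a simple merge. Here types are modeled as a name-to-definition map of strings, which is a deliberate simplification; `TypePreservation` and `mergeTypes` are hypothetical names and do not reflect the actual schema classes touched by this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of carrying over UDTs missing from an updated schema; not the PR's code. */
final class TypePreservation
{
    private TypePreservation() {}

    /**
     * Returns the updated types plus any type from the previous schema
     * whose name does not appear in the updated schema.
     */
    static Map<String, String> mergeTypes(Map<String, String> previous, Map<String, String> updated)
    {
        Map<String, String> merged = new LinkedHashMap<>(updated);
        for (Map.Entry<String, String> entry : previous.entrySet())
            merged.putIfAbsent(entry.getKey(), entry.getValue()); // the updated definition wins when both exist
        return merged;
    }
}
```

In this sketch a type present in both schemas takes its definition from the new schema, while a type present only in the previous schema is carried over unchanged.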


Overall Impact

These changes collectively improve the robustness and reliability of DSE to HCD upgrades by:

  • Making issues easier to diagnose through better logging
  • Preventing crashes from malformed gossip data
  • Ensuring protocol compatibility across versions
  • Preserving schema integrity during migrations

Latest test run: http://10.169.74.112:8081/job/ds-cassandra-build/2117/
Baseline: http://10.169.74.112:8081/job/ds-cassandra-build/10/

TODO: attach CNDB PR

github-actions bot commented Feb 19, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

Message<PingRequest> large = Message.out(PING_REQ, PingRequest.forLarge);
for (InetAddressAndPort peer : peers)
{
    logger.info("Peer {} has version {}", peer, MessagingService.instance().versions.get(peer));
Author

That seems like something worth keeping.

@szymon-miezal
Author

TODO before merging: alter the commit message of cdc2d1a

@szymon-miezal changed the title from "HCD-241: DSE 6.8/9 upgradeability (WIP)" to "HCD-241: DSE 6.8/9 upgradeability" on Feb 20, 2026
@szymon-miezal szymon-miezal self-assigned this Feb 20, 2026
int peerMessagingVersion = msg.maxMessagingVersion;
logger.trace("received second handshake message from peer {}, msg = {}", settings.connectTo, msg);

logger.debug("Summary of messaing versions while connecting to {} " +
Author

typo

- Add logging to debug the port selection for DSE -> HCD upgrade
- Add logging to debug the messaging version selection during handshake
- Logging in outbound connection at handshake success
- Add the node address to logs
- Log the protocol proposed by DSE peer
- Log the value gossip fails on
- Log the loaded/saved gossip values
- Logs for Initiate, Accept and ConfirmOutboundPre40
- Logs for InboundConnectionInitiator

Adds tests validating the fix for ArrayIndexOutOfBoundsException that
occurred when deserializing gossip state from pre-4.0 nodes containing
address/port values without expected delimiters.

The bug manifested when filterOutgoingState() attempted to split values
like "10.0.0.1" or "NORMAL" and blindly access array indices [1] or [0]
that didn't exist, causing crashes during gossip message processing in
mixed-version clusters.

The fix adds length checks after splitting to gracefully handle:
- IP addresses without ports (e.g., "10.0.0.1" vs "10.0.0.1:7000")
- Status values without tokens (e.g., "NORMAL" vs "NORMAL,10.0.0.1:7000")

…ivity check

DSE 6.8 and later versions don't support PING_REQ messages, similar to
DSE 6.x and Cassandra 3.x. This change extends the existing logic to
detect and skip PING requests for DSE 6.8+ peers during the startup
cluster connectivity check phase.

When updating keyspace schemas, UDTs from the previous schema are now
preserved if they don't exist in the new schema.
This prevents issues where inherited tables depend on types that would
otherwise be lost during schema transformations.
@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2240 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

Test | Explanation | Runs | Upstream
o.a.c.index.sai.cql.VectorCompaction100dTest.testCompactionWithEnoughRowsForPQAndDeleteARow[dc true] | REGRESSION | 🔴 | 0 / 11
o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[eb false] | REGRESSION | 🔴 | 0 / 11

Found 4 known test failures
