@tonygermano
Member

This addresses an upsert race condition that occurred when saving plugin properties (e.g., Data Pruner settings, third-party plugins) in environments with a read/write split database configuration where the read-only connection points to a replica.

The Problem:
The prior code attempted to determine whether to INSERT or UPDATE by first checking for the property's existence using the read-only database connection. Since updating all properties for a plugin involves deleting them all first, if this DELETE operation had not yet propagated to the replica, the read-only check would incorrectly indicate the property still existed.

The Result:
An UPDATE statement would be attempted, which would fail to match any rows (since the data had already been deleted from the primary) and silently return zero rows updated. This failure was not being checked, leading to data loss for the affected property.

The Solution:
This change eliminates the preliminary read check. The code now attempts an UPDATE first and, if the update affects zero rows, performs an INSERT. Because the decision no longer relies on a possibly stale read from the replica, the result is correct regardless of replication latency.
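
The difference between the two approaches can be sketched with an in-memory stand-in for a primary and a lagging replica. Everything in this sketch (the class, the method names, the Map-based "databases") is hypothetical and only illustrates the control flow; it is not the OIE code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a primary database and a lagging replica.
// None of these names come from the OIE codebase.
class UpsertSketch {
    final Map<String, String> primary = new HashMap<>();
    final Map<String, String> replica = new HashMap<>(); // stale until replicate() runs

    // Copies the primary's state to the replica (simulates replication catching up).
    void replicate() {
        replica.clear();
        replica.putAll(primary);
    }

    // Simulates DELETE of all properties on the primary.
    void deleteAll() {
        primary.clear();
    }

    // Simulates UPDATE ... WHERE name = ?, returning the number of rows affected.
    int update(String name, String value) {
        if (!primary.containsKey(name)) {
            return 0;
        }
        primary.put(name, value);
        return 1;
    }

    // Simulates INSERT, failing on a duplicate key.
    void insert(String name, String value) {
        if (primary.putIfAbsent(name, value) != null) {
            throw new IllegalStateException("duplicate key: " + name);
        }
    }

    // Old approach: choose INSERT or UPDATE from a read against the replica
    // and ignore the UPDATE's row count.
    void saveCheckThenWrite(String name, String value) {
        if (replica.containsKey(name)) {
            update(name, value); // may affect 0 rows; result is not checked
        } else {
            insert(name, value);
        }
    }

    // New approach: UPDATE first, INSERT only when no row was affected.
    void saveUpdateThenInsert(String name, String value) {
        if (update(name, value) == 0) {
            insert(name, value);
        }
    }
}
```

In this model, calling `deleteAll()` without a following `replicate()` reproduces the stale replica: `saveCheckThenWrite` then leaves the primary without the property, while `saveUpdateThenInsert` stores it.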

See https://sqlperformance.com/2020/09/locking/upsert-anti-pattern

Issue: Innovar-Healthcare/BridgeLink#66

Signed-off-by: Tony Germano <tony@germano.name>
Contributor

@mgaffigan mgaffigan left a comment

LGTM

@tonygermano tonygermano requested review from a team, gibson9583, kayyagari, kpalang and pacmano1 and removed request for a team October 22, 2025 23:01
Contributor

@pacmano1 pacmano1 left a comment

I would expect the "fix" to be never sending requests to the read-only replica for things related to configuration. Is that programmatically a much heavier lift? e.g. "if the ultimate intention of an action is to update something, never use the read-only replica".

@jonbartels
Contributor

jonbartels commented Oct 22, 2025

Code LGTM

A reference - The Configuration.updateProperty and Configuration.insertProperty SQL mappings for each DB can be found here: https://github.com/search?q=repo%3AOpenIntegrationEngine%2Fengine%20updateProperty&type=code

I reviewed the mappings to ensure there wasn't some funky SQL that would interfere with this solution on other DBs. Looks OK. This was a review and not actively tested across the supported DB engines though.

I have some questions:

  1. Is this practical to unit or integration test?
  2. What code smells or patterns could indicate this problem exists in other workflows?
  3. (long question) would a change to getProperty potentially address this systemically?

Consider the function definition. Would implementing something like getProperty(prop, bool forWrite) potentially reduce the risk of this problem happening elsewhere?
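
As a rough illustration of that idea, a `forWrite` flag could route the read to the primary whenever the result will drive a write. The class below is a hypothetical sketch using in-memory maps in place of the two database connections; the names and the signature are assumptions, not the existing `getProperty` API.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: the maps stand in for the read/write (primary) and
// read-only (replica) connections. Not taken from the OIE codebase.
class ReplicaAwareReader {
    private final Map<String, String> primary;
    private final Map<String, String> replica;

    ReplicaAwareReader(Map<String, String> primary, Map<String, String> replica) {
        this.primary = primary;
        this.replica = replica;
    }

    // When forWrite is true the lookup goes to the primary, so a decision that
    // leads to a write never depends on possibly stale replica data.
    Optional<String> getProperty(String name, boolean forWrite) {
        Map<String, String> source = forWrite ? primary : replica;
        return Optional.ofNullable(source.get(name));
    }
}
```

This only narrows the stale-read window; a check followed by a separate write can still race with other writers, even when the check runs against the primary.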

@tonygermano
Member Author

> I would expect the "fix" to be never sending requests to the read-only replica for things related to configuration. Is that programmatically a much heavier lift? e.g. "if the ultimate intention of an action is to update something, never use the read-only replica".

This change removes the query against the read-only replica by removing the existence query entirely. Instead, the UPDATE itself serves as the existence check: it either updates the row (at which point the save is finished) or reports 0 rows affected (indicating that an insert is needed).

@tonygermano
Member Author

> Code LGTM
>
> A reference - The Configuration.updateProperty and Configuration.insertProperty SQL mappings for each DB can be found here: https://github.com/search?q=repo%3AOpenIntegrationEngine%2Fengine%20updateProperty&type=code
>
> I reviewed the mappings to ensure there wasn't some funky SQL that would interfere with this solution on other DBs. Looks OK. This was a review and not actively tested across the supported DB engines though.
>
> I have some questions:
>
>   1. Is this practical to unit or integration test?
>   2. What code smells or patterns could indicate this problem exists in other workflows?
>   3. (long question) would a change to getProperty potentially address this systemically?
>
> Consider the function definition. Would implementing something like getProperty(prop, bool forWrite) potentially reduce the risk of this problem happening elsewhere?

  1. I'm not sure. I'd need to think about that.
  2. Check out the link in my PR description. Calling getProperty to make the determination of whether to insert or update is an anti-pattern.
  3. Because of the answer to (2), I don't think this is necessary.

@mgaffigan
Contributor

In re: testing, this sort of TOCTOU (time-of-check to time-of-use)/data propagation bug is notoriously hard to test for. I would suggest the best practice is to rely on atomic statements rather than trying to enforce data consistency guarantees in application code. (Put differently: make it the DBMS's problem, but do so in the simplest possible way to avoid implementation issues.)

For an example of how to prove data consistency correctness, see TLA+, but I don't imagine it is practical to do so.

@mgaffigan
Contributor

> I would expect the "fix" to be never sending requests to the read-only replica for things related to configuration. Is that programmatically a much heavier lift? e.g. "if the ultimate intention of an action is to update something, never use the read-only replica".

I agree with this thought, but I don't think it should block this PR. This PR still moves us to a better place, and does not apparently introduce any new undesirable behavior.

I'm not sure what the "goal" is with the read replicas - are they meant for HA or for keeping dashboard query reads off the primary? Most real-world traffic I've seen from Mirth is 90% writes (excluding dashboard searches).

Contributor

@pacmano1 pacmano1 left a comment

Will approve, but wrapping these operations in transactions might be a longer-term objective.

@kayyagari kayyagari left a comment

A transaction boundary would prevent any other race conditions that might arise when the same database is used by multiple OIE instances.
Performing all of these operations on a single SqlSession (SqlConfig.getInstance().getSqlSessionManager().openSession(TransactionIsolationLevel.READ_COMMITTED)) would be a correct approach.

I agree that the current fix is good enough to roll out immediately and that this enhancement can be made later.
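
For illustration, the transaction boundary described above might look roughly like the following when written against plain JDBC instead of the MyBatis SqlSession. The table name, column names, and class are placeholders assumed for this sketch and do not reflect the actual OIE schema or SQL mappings.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch: table and column names are placeholders, not the OIE schema.
class TransactionalUpsert {
    static final String UPDATE_SQL =
            "UPDATE plugin_property SET value = ? WHERE category = ? AND name = ?";
    static final String INSERT_SQL =
            "INSERT INTO plugin_property (category, name, value) VALUES (?, ?, ?)";

    // Runs UPDATE-then-INSERT inside one READ COMMITTED transaction on the
    // given connection, which is assumed to point at the primary.
    static void save(Connection conn, String category, String name, String value)
            throws SQLException {
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        conn.setAutoCommit(false);
        try {
            int updated;
            try (PreparedStatement update = conn.prepareStatement(UPDATE_SQL)) {
                update.setString(1, value);
                update.setString(2, category);
                update.setString(3, name);
                updated = update.executeUpdate();
            }
            if (updated == 0) {
                // No existing row matched, so insert a new one.
                try (PreparedStatement insert = conn.prepareStatement(INSERT_SQL)) {
                    insert.setString(1, category);
                    insert.setString(2, name);
                    insert.setString(3, value);
                    insert.executeUpdate();
                }
            }
            conn.commit();
        } catch (SQLException e) {
            // Undo any partial work before reporting the failure.
            conn.rollback();
            throw e;
        }
    }
}
```

The sketch only shows where the boundary would sit. Whether READ COMMITTED is sufficient for several instances writing the same property depends on the constraints defined on the table, which are not covered here.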

@tonygermano tonygermano merged commit 5ff9715 into OpenIntegrationEngine:main Oct 23, 2025
2 checks passed
@tonygermano tonygermano deleted the bug/save-property-race-condition branch October 23, 2025 01:16
@tonygermano tonygermano added this to the Next Release milestone Dec 2, 2025