Fix potential data loss saving plugin properties #195

tonygermano · 2025-10-22T19:45:49Z

This addresses an upsert race condition that occurred when saving plugin properties (e.g., Data Pruner settings, third-party plugins) in environments with a read/write split database configuration where the read-only connection points to a replica.

The Problem:
The prior code attempted to determine whether to INSERT or UPDATE by first checking for the property's existence using the read-only database connection. Since updating all properties for a plugin involves deleting them all first, if this DELETE operation had not yet propagated to the replica, the read-only check would incorrectly indicate the property still existed.

The Result:
An UPDATE statement would be attempted, which would fail to match any rows (since the data had already been deleted from the primary) and silently return zero rows updated. This failure was not being checked, leading to data loss for the affected property.

The Solution:
This change eliminates the preliminary read check. It now attempts an UPDATE first; if the update affects zero rows, it performs an INSERT. Because both statements run against the primary, the outcome no longer depends on replication latency.
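
A minimal in-memory sketch of the two approaches, for illustration only. The class and method names (PrimaryStore, PropertySaver, and so on) are hypothetical stand-ins and not the engine's actual code; the replica is modelled as a plain map that still holds a row the primary has already deleted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the primary database. */
final class PrimaryStore {
    private final Map<String, String> rows = new HashMap<>();

    /** Mimics UPDATE: returns the number of rows affected (0 or 1). */
    int update(String key, String value) {
        if (!rows.containsKey(key)) {
            return 0;
        }
        rows.put(key, value);
        return 1;
    }

    /** Mimics INSERT of a new row. */
    void insert(String key, String value) {
        rows.put(key, value);
    }

    /** Mimics the DELETE of all properties for a plugin. */
    void deleteAll() {
        rows.clear();
    }

    String get(String key) {
        return rows.get(key);
    }
}

final class PropertySaver {
    /** Old approach: branch on an existence check that a lagging replica may answer. */
    static void saveWithExistenceCheck(PrimaryStore primary,
                                       Map<String, String> replicaSnapshot,
                                       String key, String value) {
        if (replicaSnapshot.containsKey(key)) {
            primary.update(key, value); // affected-row count ignored, as in the bug
        } else {
            primary.insert(key, value);
        }
    }

    /** New approach: UPDATE first, INSERT only when no row was affected. */
    static void saveUpdateThenInsert(PrimaryStore primary, String key, String value) {
        if (primary.update(key, value) == 0) {
            primary.insert(key, value);
        }
    }
}
```

With a stale replica snapshot, saveWithExistenceCheck leaves the primary without the row, while saveUpdateThenInsert stores it because it never consults the replica.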

See https://sqlperformance.com/2020/09/locking/upsert-anti-pattern

Issue: Innovar-Healthcare/BridgeLink#66

Signed-off-by: Tony Germano <tony@germano.name>

mgaffigan

LGTM

pacmano1

I would expect the "fix" to be never sending requests to the read-only replica for things related to configuration. Is that programmatically a much heavier lift? E.g., "if the ultimate intention of an action is to update something, never use the read-only replica".

jonbartels · 2025-10-22T23:21:10Z

Code LGTM

A reference - The Configuration.updateProperty and Configuration.insertProperty SQL mappings for each DB can be found here: https://github.com/search?q=repo%3AOpenIntegrationEngine%2Fengine%20updateProperty&type=code

I reviewed the mappings to ensure there wasn't some funky SQL that would interfere with this solution on other DBs. Looks OK. This was a review and not actively tested across the supported DB engines though.

I have some questions:

1. Is this practical to unit or integration test?
2. What code smells or patterns could indicate this problem exists in other workflows?
3. (long question) Would a change to getProperty potentially address this systemically?

Consider the function definition. Would implementing something like getProperty(prop, bool forWrite) potentially reduce the risk of this problem happening elsewhere?
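
To make the proposed signature concrete, here is a hypothetical sketch of what a forWrite flag could look like. RoutedPropertyReader and its two maps are illustrative assumptions standing in for the primary and replica connections; this is not the engine's existing getProperty.

```java
import java.util.Map;

/** Hypothetical reader that picks a data source based on the caller's intent. */
final class RoutedPropertyReader {
    private final Map<String, String> primary;
    private final Map<String, String> replica;

    RoutedPropertyReader(Map<String, String> primary, Map<String, String> replica) {
        this.primary = primary;
        this.replica = replica;
    }

    /**
     * When forWrite is true the read goes to the primary, so a decision that
     * leads to a write is never based on replica data that may lag behind.
     */
    String getProperty(String key, boolean forWrite) {
        Map<String, String> source = forWrite ? primary : replica;
        return source.get(key);
    }
}
```

Routing the read to the primary would avoid replica lag, but a check followed by a separate write can still race with another writer, so the flag alone does not make check-then-write safe.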

tonygermano · 2025-10-22T23:28:06Z

> I would expect the "fix" to be never sending requests to the read-only replica for things related to configuration. Is that programmatically a much heavier lift? E.g., "if the ultimate intention of an action is to update something, never use the read-only replica".

This change removes the query against the read-only replica by removing the query for existence entirely. Instead the UPDATE replaces the existence query by either updating the row (at which point it's finished) or returning 0 rows affected (indicating an insert is needed.)

tonygermano · 2025-10-22T23:35:40Z

> I have some questions:
>
> 1. Is this practical to unit or integration test?

I'm not sure. I'd need to think about that.

> 2. What code smells or patterns could indicate this problem exists in other workflows?

Check out the link in my PR description. Calling getProperty to decide whether to insert or update is an anti-pattern.

> 3. (long question) Would a change to getProperty potentially address this systemically? Consider the function definition. Would implementing something like getProperty(prop, bool forWrite) potentially reduce the risk of this problem happening elsewhere?

Because of the answer to (2), I don't think this is necessary.

mgaffigan · 2025-10-22T23:51:24Z

In re: testing, this sort of TOCTOU/data propagation bug is notoriously hard to test for. I would suggest the best practice is to rely on atomic statements without trying to make data consistency guarantees. (Put differently: make it the DBMS's problem, but do so in the simplest possible way to avoid implementation issues)

For an example of how to prove data consistency correctness, see TLA+, but I don't imagine it is practical to do so.

mgaffigan · 2025-10-22T23:57:01Z

> I would expect the "fix" to be never sending requests to the read-only replica for things related to configuration. Is that programmatically a much heavier lift? E.g., "if the ultimate intention of an action is to update something, never use the read-only replica".

I agree with this thought, but I don't think it should block this PR. This PR still moves us to a better place, and does not apparently introduce any new undesirable behavior.

I'm not sure what the "goal" is with the read replicas. Are they meant for HA, or for keeping dashboard query reads off the primary? Most real-world traffic I've seen from Mirth is 90% writes (excluding dashboard searches).

pacmano1

Will approve, but wrapping these operations in transactions might be a longer-term objective.

kayyagari

A transaction boundary would prevent any other race conditions that might arise when the same database is used by multiple OIE instances.
Performing all these operations on a SqlSession (SqlConfig.getInstance().getSqlSessionManager().openSession(TransactionIsolationLevel.READ_COMMITTED)) would be a correct approach.

I agree that the current fix is good enough to roll out immediately and that this enhancement can be made later.
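
As a rough illustration of the transaction boundary suggested above, the sketch below uses plain JDBC rather than the MyBatis SqlSession mentioned in the comment. The class name and the configuration table with category, name and value columns are assumptions made for the example, not taken from the engine's SQL mappings.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper: runs the UPDATE and the fallback INSERT in one transaction. */
final class TransactionalPropertySaver {
    // Table and column names are assumptions for illustration only.
    private static final String UPDATE_SQL =
            "UPDATE configuration SET value = ? WHERE category = ? AND name = ?";
    private static final String INSERT_SQL =
            "INSERT INTO configuration (value, category, name) VALUES (?, ?, ?)";

    static void save(Connection conn, String category, String name, String value)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        conn.setAutoCommit(false); // start of the transaction boundary
        try {
            // UPDATE first; fall back to INSERT when no row was affected.
            if (execute(conn, UPDATE_SQL, value, category, name) == 0) {
                execute(conn, INSERT_SQL, value, category, name);
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // undo both statements on any failure
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }

    /** Binds the three parameters and returns the affected-row count. */
    private static int execute(Connection conn, String sql,
                               String value, String category, String name)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, value);
            ps.setString(2, category);
            ps.setString(3, name);
            return ps.executeUpdate();
        }
    }
}
```

Two instances can still both see zero rows updated and both attempt the INSERT under READ COMMITTED, so this sketch presumes a unique constraint on (category, name) that turns the second INSERT into an error the caller can retry.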

mgaffigan approved these changes Oct 22, 2025

tonygermano requested review from a team, gibson9583, kayyagari, kpalang and pacmano1 and removed request for a team October 22, 2025 23:01

pacmano1 reviewed Oct 22, 2025

jonbartels approved these changes Oct 22, 2025

pacmano1 approved these changes Oct 23, 2025

kayyagari approved these changes Oct 23, 2025

tonygermano merged commit 5ff9715 into OpenIntegrationEngine:main Oct 23, 2025
2 checks passed

tonygermano deleted the bug/save-property-race-condition branch October 23, 2025 01:16

tonygermano added this to the Next Release milestone Dec 2, 2025
