Add apache client reloading on trust and key store file change by atrocities · Pull Request #2941 · palantir/dialogue

atrocities · 2026-04-01T20:25:50Z

Support automatic reloading of keystore and truststore per apache hc5 client upon changes to these files.

Adds a refresh mechanism similar to the DNS resolution polling feature, and uses hashes of the keystore and truststore contents as a new cache key component. Key material refreshing happens at the DialogueChannel level, and the hashed data is passed down into the apache cache level. A change in key material contents results in invalidation of entries in the apache cache, causing the hc5 clients to reload with the new key/truststore contents.

Changes in the location and nature of the key material constitute a configuration change, which would also cause a reload.
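
To illustrate the cache-key idea described above, here is a minimal sketch, not the PR's actual code. `HashKeyedClientCache`, `ClientKey`, and the string stand-in for a client are hypothetical names, and the real apache cache key has more components than shown here. It assumes Java 16+ for records.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the apache cache: the key carries a hash of the store
// contents, so a content change produces a different key and a freshly built client.
final class HashKeyedClientCache {
    // Illustrative key only; an empty hash simply means "no store hash in the key".
    record ClientKey(String channelName, Optional<String> sslStoreHash) {}

    private final Map<ClientKey, String> clients = new ConcurrentHashMap<>();
    private final AtomicInteger builds = new AtomicInteger();

    String getOrCreate(String channelName, Optional<String> sslStoreHash) {
        return clients.computeIfAbsent(
                new ClientKey(channelName, sslStoreHash),
                key -> "client-" + builds.incrementAndGet()); // pretend to build a client
    }

    // Number of clients built so far, useful for observing cache misses.
    int builds() {
        return builds.get();
    }
}
```

Note that this sketch never evicts the entry for the old hash; the invalidation and cleanup of stale entries described above is left out for brevity.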

changelog-app · 2026-04-01T20:25:56Z

Generate changelog in `changelog/@unreleased`

Type (Select exactly one)

Feature (Adding new functionality)
Improvement (Improving existing functionality)
Fix (Fixing an issue with existing functionality)
Break (Creating a new major version by breaking public APIs)
Deprecation (Removing functionality in a non-breaking way)
Migration (Automatically moving data/functionality to a new system)

Description

Add apache client reloading on trust and key store file change

Check the box to generate changelog(s)

Generate changelog entry

bjlaub

Some initial thoughts:

what happens to in-flight requests when we detect a change to the key/trust material and reload the hc5 client? I believe we're using pooled connections; I'm not sure what will happen to in-flight requests using a connection from the pool if the client is closed before those requests finish. Probably worth adding some tests for this.
watching for changes to the keystore in addition to the truststore adds a small bit of complexity here; I wonder if it's really necessary?

bjlaub · 2026-04-06T16:40:13Z

+    }
+
+    private static HashCode hashFile(Path path) throws IOException {
+        byte[] bytes = Files.readAllBytes(path);


nit: we could possibly avoid pulling the whole file into memory by using guava's HashingInputStream instead. I don't have a sense for how large these files will get in practice, but it's probably not a huge concern.

afaict, the size of these keystore files is usually a couple kb - I haven't found any enormous ones.

fwiw, this was copied from the way that witchcraft does hashing of configuration files, which is using readAllBytes
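
For reference, a streaming variant of the hashing discussed above could look like the sketch below. It uses the JDK's `DigestInputStream` rather than guava's `HashingInputStream` so that it has no dependencies; `StreamingFileHash`, the choice of SHA-256, and the hex output are assumptions for illustration, not what the PR does. It assumes Java 17+ for `HexFormat`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper: hashes a file in fixed-size chunks instead of readAllBytes.
final class StreamingFileHash {
    static String sha256Hex(Path path) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
        try (InputStream in = new DigestInputStream(Files.newInputStream(path), digest)) {
            byte[] buffer = new byte[8192];
            // Each read feeds the bytes it returns into the digest as a side effect.
            while (in.read(buffer) != -1) {
                // nothing to do, we only want the digest
            }
        }
        return HexFormat.of().formatHex(digest.digest());
    }
}
```

For files of a couple kb the difference from `readAllBytes` is likely negligible, which matches the conclusion in the thread.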

bjlaub · 2026-04-06T16:57:55Z


        Optional<InetAddress> resolvedAddress();

+        Optional<String> sslStoreHash();


what happens if this is empty?

That would cause the apache cache key to also not contain an sslStoreHash - it's not used in the construction of the actual client itself.

To that end, whether to include this argument as part of the ChannelArgs is debatable. It's not strictly a 'channel arg', but it feels like the best place for it in the way that things are currently structured.

bjlaub · 2026-04-06T19:09:01Z

-
-        ApacheCacheEntry apacheClient = getApacheClient(request);
+        Refreshable<SslStoreMetadata> storeMetadata =
+                KeystoreSupport.pollForChanges(channelCacheRequest.serviceConf().security());


I think there's some nuance to this we should think about, particularly since it was adapted from the DNS polling stuff (disclaimer: this code is all a bit confusing and it's taking me a while to wrap my head around it again, so I could be wrong):

This will create one MetadataPollingTask per cached DialogueChannel, where the work of each polling task is tied directly to reading bytes from the configured keystore/truststore on disk. But in practice, we may have many DialogueChannel instances that share the same truststore on disk, so with a large number of channels we end up mostly repeating work at fixed intervals.

The DialogueDnsResolutionWorker operates similarly (we create one worker task per channel and run them at fixed intervals), and also may repeat work for services that share URIs, though this seems less likely (e.g. we are less likely to have 10 different channels with overlapping host names to resolve, since they likely represent 10 different services with distinct host names, though that's not guaranteed). Another important difference is that we implicitly rely on the JVM DNS cache, so even though we poll at 1-second intervals in the DNS case, lookups may end up being very fast if the JVM has already cached results. Again, not guaranteed, and eventually we will make a network call.

I might be overthinking this, perhaps the cost to compute the hash is cheap enough that many polling tasks over the same files on disk just isn't worth optimizing for. Ideally we might be able to create one polling task per keystore/truststore on disk, and have it update any number of refreshables (connected to the hc5 clients or whatever) to get them to reload trust material when it changes.

That is a very valid concern. I'll have a look into deduping existing tasks, because yes - these file operations, while not terribly resource intensive, are also not exactly free.

If it's not complicated (trust/keystore combos map to multiple DialogueChannels that need to be invalidated), then I do think it's worth this optimization.
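
One possible shape for the "one polling task per store on disk" idea is sketched below. This is an assumption-laden illustration, not the PR's design: `SharedStorePollers`, `subscribe`, and `pollAll` are hypothetical names, `Arrays.hashCode` is a stand-in for a real content hash, and scheduling is left to the caller (a single scheduled task would call `pollAll` at a fixed interval).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.IntConsumer;

// Hypothetical registry: channels that share a store path share one poller.
final class SharedStorePollers {
    private final Map<Path, Poller> pollers = new ConcurrentHashMap<>();

    // Registers a listener for a store; the file is read once per poll no matter
    // how many listeners are attached to the same path.
    void subscribe(Path storePath, IntConsumer onChange) {
        pollers.computeIfAbsent(storePath.toAbsolutePath().normalize(), Poller::new)
                .listeners
                .add(onChange);
    }

    // Intended to be invoked by one scheduled task at a fixed interval.
    void pollAll() {
        pollers.values().forEach(Poller::pollOnce);
    }

    int pollerCount() {
        return pollers.size();
    }

    private static final class Poller {
        private final Path path;
        private final List<IntConsumer> listeners = new CopyOnWriteArrayList<>();
        private Integer lastHash; // null until the first successful read

        Poller(Path path) {
            this.path = path;
        }

        synchronized void pollOnce() {
            int hash;
            try {
                hash = Arrays.hashCode(Files.readAllBytes(path)); // stand-in content hash
            } catch (IOException e) {
                return; // keep the last known value if the file is briefly unreadable
            }
            if (lastHash == null || lastHash != hash) {
                lastHash = hash;
                listeners.forEach(listener -> listener.accept(hash));
            }
        }
    }
}
```

A listener added after the first poll only hears about the next change in this sketch; a real implementation would probably hand new subscribers the current value, which is what a refreshable gives for free.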

bjlaub · 2026-04-06T19:12:15Z

+    private static final class MetadataPollingTask implements Runnable {
+
+        private final SslConfiguration sslConfiguration;
+        private final SettableRefreshable<SslStoreMetadata> metadataRefreshable;


DialogueDnsResolutionWorker stores a WeakReference to the output refreshable so that if it has been garbage collected we can avoid doing any work and exit early. Should we do something similar here?

+1 - otherwise, I think the cleaner will also never remove the scheduled future nor the task itself?
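
A rough sketch of the weak-reference pattern being suggested, under the assumption that the task only needs to push a value into some output holder. `WeakTargetPollingTask` is a hypothetical name, and an `AtomicReference<String>` stands in for the output refreshable; this is not how `DialogueDnsResolutionWorker` is actually written.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical polling task that does not keep its output alive.
final class WeakTargetPollingTask implements Runnable {
    private final WeakReference<AtomicReference<String>> target;
    private final Supplier<String> reader; // e.g. hashes the stores on disk
    private volatile boolean finished;

    WeakTargetPollingTask(WeakReference<AtomicReference<String>> target, Supplier<String> reader) {
        this.target = target;
        this.reader = reader;
    }

    @Override
    public void run() {
        AtomicReference<String> output = target.get();
        if (output == null) {
            // Nobody can observe updates any more: skip the file read entirely.
            finished = true;
            return;
        }
        output.set(reader.get());
    }

    // The scheduler (or a cleaner) can use this to cancel the scheduled future.
    boolean isFinished() {
        return finished;
    }
}
```

The task itself cannot cancel its own schedule in this sketch, so something still has to cancel the `ScheduledFuture` once `isFinished()` flips or the output is collected.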

atrocities · 2026-04-06T21:09:24Z

Thanks much for having a look @bjlaub. I'll do the following:

Try to coalesce MetadataPollingTasks by the paths of the key/truststore on disk instead of having one for every DialogueChannel.
Add a test to demonstrate what happens when the key material changes. My thought is that it ought to allow any existing requests to finish, and then eventually be closed and cleaned up. On that note - will want to test for that as well.
Figure out whether or not we're actually using the keystore (not truststore) in dialogue.
Investigate WeakReference in KeystoreSupport.
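
The "let existing requests finish, then close" behaviour mentioned in the list above could be modelled roughly as below. This is a sketch of the intended semantics only: `DrainingHandle` is a hypothetical name, and whether the hc5 client and its connection pool actually behave this way on close is exactly what the planned test needs to establish.

```java
// Hypothetical wrapper around a client: retire() stops new requests, and the
// close action runs once the last in-flight request has been released.
final class DrainingHandle {
    private final Runnable closeAction;
    private int inFlight;
    private boolean retired;
    private boolean closed;

    DrainingHandle(Runnable closeAction) {
        this.closeAction = closeAction;
    }

    // Returns false once retired, so callers know to use the replacement client.
    synchronized boolean tryAcquire() {
        if (retired) {
            return false;
        }
        inFlight++;
        return true;
    }

    // Must be called exactly once for every successful tryAcquire().
    synchronized void release() {
        inFlight--;
        closeIfDrained();
    }

    // Called when the key material changed and a new client has been built.
    synchronized void retire() {
        retired = true;
        closeIfDrained();
    }

    synchronized boolean isClosed() {
        return closed;
    }

    private void closeIfDrained() {
        if (retired && inFlight == 0 && !closed) {
            closed = true;
            closeAction.run();
        }
    }
}
```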

aldexis

Pretty nice work - I like the tests!

aldexis · 2026-04-08T12:19:12Z


        DialogueDnsResolver dnsResolver();

+        Optional<String> sslStoreHash();


nit: Add a comment here explaining that this exists to force a cache miss when the ssl stores contents change, to force a reload of their values, even if the hash itself isn't directly used. Probably same for ChannelArgs#sslStoreHash (or refer to this method from there)

aldexis · 2026-04-08T12:20:15Z

-                            return NodeSelectionStrategyChannel.create(cf, targetChannels);
-                        }
-                    }));
+            LimitedChannel keystoreUpdatingChannel = createKeystoreUpdatingChannel(


Technically, this is updating for both keystore and truststore, isn't it? If so, thoughts on naming it createSslStoresUpdatingChannel?

aldexis · 2026-04-08T12:21:23Z

+        private static LimitedChannel createKeystoreUpdatingChannel(
+                Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) {
+            return new SupplierChannel(cf.storeMetadata().map(new Function<SslStoreMetadata, LimitedChannel>() {
+
+                @Override
+                public LimitedChannel apply(SslStoreMetadata storeMetadata) {
+                    return delegateSupplier.apply(storeMetadata);
+                }
+            }));
+        }


nit: Fwiw, I think this can be simplified to

Suggested change

-        private static LimitedChannel createKeystoreUpdatingChannel(
-                Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) {
-            return new SupplierChannel(cf.storeMetadata().map(new Function<SslStoreMetadata, LimitedChannel>() {
-
-                @Override
-                public LimitedChannel apply(SslStoreMetadata storeMetadata) {
-                    return delegateSupplier.apply(storeMetadata);
-                }
-            }));
-        }
+        private static LimitedChannel createKeystoreUpdatingChannel(
+                Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) {
+            return new SupplierChannel(cf.storeMetadata().map(delegateSupplier));
+        }

Yes, thanks! The reason was that there were previously some debug log lines in the overridden apply.

aldexis · 2026-04-08T12:31:49Z

@@ -142,15 +144,8 @@ DialogueChannel getNonReloadingChannel(
    }

    private DialogueChannel createNonLiveReloadingChannel(ChannelCacheKey channelCacheRequest) {


Regarding some comments I made last week when discussing this PR, I had a look at other places this was used, and realized that this is used in ReloadingClientFactory#perHost. I'm not sure that's correct fwiw, since there are things beyond the uris that may update, which makes me wonder whether we should have this createNonLiveReloadingChannel in the first place.

It's also used through the ReloadingClientFactory#getNonReloading method, which might be used in a few places, including through the legacy DialogueClients#create

This also made me realize that we might want to handle

dialogue/dialogue-clients/src/main/java/com/palantir/dialogue/clients/ReloadingClientFactory.java
Lines 83 to 110 in 099b0a2

    public Channel getNonReloadingChannel(String channelName, ClientConfiguration input) {
        ClientConfiguration clientConf = hydrate(input);
        ApacheHttpClientChannels.ClientBuilder clientBuilder = ApacheHttpClientChannels.clientBuilder()
                .clientConfiguration(clientConf)
                .clientName(channelName)
                .dnsResolver(params.dnsResolver());
        params.blockingExecutor().ifPresent(clientBuilder::executor);
        ApacheHttpClientChannels.CloseableClient apacheClient = clientBuilder.build();
        return DialogueChannel.builder()
                .channelName(channelName)
                .clientConfiguration(clientConf)
                .uris(DnsSupport.pollForChanges(
                                params.dnsNodeDiscovery(),
                                DnsPollingSpec.clientConfig(channelName),
                                params.dnsResolver(),
                                params.dnsRefreshInterval(),
                                params.taggedMetrics(),
                                Refreshable.only(clientConf))
                        .map(dnsResult -> DnsSupport.getTargetUris(
                                channelName,
                                dnsResult.config().uris(),
                                dnsResult.config().proxy(),
                                dnsResult.resolvedHosts(),
                                params.taggedMetrics())))
                .factory(args -> ApacheHttpClientChannels.createSingleUri(args, apacheClient))
                .deadlineEnforcement(params.deadlineEnforcement())
                .build();
    }

? (honestly this whole client creation codepath should get refactored - we shouldn't be creating this many different clients in so many places)

aldexis · 2026-04-08T12:33:27Z

+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.function.Supplier;
+
+final class KeystoreSupport {


This is for both keystore and truststore, right? (e.g. SslStoresSupport?)

Yes, will rename.

aldexis · 2026-04-08T12:47:02Z

+
+class KeystoreSupportTest {
+    @Test
+    void pollForChanges_updates_on_change(@TempDir Path tempDir) throws IOException {


This seems to only test changes on the truststore. Should we also write another test for keystore changes?

aldexis · 2026-04-08T12:47:27Z

+                assertThat(updated).isNotEqualTo(initial);
+                assertThat(updated.trustStore().hash()).isNotEqualTo(initialTrustHash);


nit: may be interesting to also validate the hash is the one we expect from the file we wrote?

aldexis · 2026-04-08T12:50:33Z

+            Files.move(trustStore, movedTrustStore);
+            Thread.sleep(150);


Technically, we aren't actually testing that we did a reload during this period, but I don't think we have any visible way to observe this?
Makes me wonder whether we may want to add a metric for failing to refresh the ssl stores (in which case, we could await the metric increasing, then move the file back and verify it updates)

Good idea, will add metrics in the same way as the dns polling does.
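
A minimal sketch of what such metrics could look like on the polling task, so that a test can await a failure count instead of sleeping. `MeteredStorePoll` and its counters are hypothetical; the PR would presumably use the same tagged metrics as the DNS polling, which are not shown here, and `Arrays.hashCode` is again only a stand-in for the real hash.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical polling task that absorbs read failures and counts them.
final class MeteredStorePoll implements Runnable {
    private final Path store;
    private final AtomicLong refreshes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private volatile int lastHash;

    MeteredStorePoll(Path store) {
        this.store = store;
    }

    @Override
    public void run() {
        try {
            lastHash = Arrays.hashCode(Files.readAllBytes(store)); // stand-in content hash
            refreshes.incrementAndGet();
        } catch (IOException | RuntimeException e) {
            // Never let an exception escape: a scheduled task that throws stops running.
            failures.incrementAndGet();
        }
    }

    long refreshes() {
        return refreshes.get();
    }

    long failures() {
        return failures.get();
    }

    int lastHash() {
        return lastHash;
    }
}
```

With counters like these, the test could wait for `failures()` to increase after the file is moved away, then move it back and wait for `refreshes()` to increase.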

aldexis · 2026-04-08T12:51:10Z

+            Files.move(movedTrustStore, trustStore);
+            Files.write(trustStore, new byte[] {9}, java.nio.file.StandardOpenOption.APPEND);


vnit: (update before moving back, that way it's atomic - doesn't matter much but is slightly cleaner?)

Suggested change

-            Files.move(movedTrustStore, trustStore);
-            Files.write(trustStore, new byte[] {9}, java.nio.file.StandardOpenOption.APPEND);
+            Files.write(trustStore, new byte[] {9}, StandardOpenOption.APPEND);
+            Files.move(movedTrustStore, trustStore);
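
The general "modify first, then move into place" pattern behind this nit can be sketched as follows. `AtomicStoreReplace` is a hypothetical helper, not part of the PR or its tests. It relies on `StandardCopyOption.ATOMIC_MOVE`, whose behaviour when the target already exists is implementation specific according to the JDK documentation (on typical POSIX filesystems the rename replaces the target).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: a poller never observes a half-written store, because the
// new contents only become visible at the target path via a single rename.
final class AtomicStoreReplace {
    static void replace(Path target, byte[] newContents) throws IOException {
        // Stage in the same directory so the rename stays on one filesystem.
        Path staged = Files.createTempFile(target.toAbsolutePath().getParent(), "store", ".tmp");
        try {
            Files.write(staged, newContents);
            Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(staged); // no-op after a successful move
        }
    }
}
```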

aldexis · 2026-04-08T12:51:51Z

+        KeyStore trustStore = KeyStore.getInstance("JKS");
+        trustStore.load(null, new char[0]);
+        trustStore.setCertificateEntry("cert", certificate);
+        try (java.io.OutputStream outputStream = Files.newOutputStream(trustStorePath)) {


Suggested change

-        try (java.io.OutputStream outputStream = Files.newOutputStream(trustStorePath)) {
+        try (OutputStream outputStream = Files.newOutputStream(trustStorePath)) {

atrocities added 6 commits March 31, 2026 23:14

Add entry points for ssl file awareness

3046e48

Initial KeystoreSupport and metadata

f6c2878

Compose a store refresh around the DNS channel refresh

ac9797a

Fixup key polling and hashing

7d854d7

Fixup KeystoreSupport

e0c70c2

Add integration test for cert swapping

4dc7256

atrocities added 7 commits April 1, 2026 15:01

Spotless

49c964e

Remove temp test

e5c4837

Fix implicit dep, style todo

930707f

Absorb exceptions in keystore polling loop

b9d939e

Use consistent metadata throughout ChannelCache creation

f3e90f4

Remove file size from metadata, compute on creation

8654792

Rpotless

34661ec

bjlaub reviewed Apr 6, 2026


Use WeakReference for KeystoreSupport

065a1ab

aldexis reviewed Apr 8, 2026


atrocities added 9 commits April 8, 2026 12:57

Add test for in-flight requests

e914d4d

Small changes and renames for better relevance

06808eb

Add test confirming that keystore path changes causes reload

0e9d6a8

Test equivalence, and keystore in KeystoreSupportTest

9ef2627

Fixup some of the SslStoresSupport tests

803ca37

Add refresh and failure metrics

d8fb280

Apply recommendations for SslStoresSupportTest

10be772

spotless

3982502

Add metrics md

5016807

