Add apache client reloading on trust and key store file change #2941

Draft
atrocities wants to merge 23 commits into develop from atrocities/live-ssl-reload

atrocities commented Apr 1, 2026

Support automatic reloading of the keystore and truststore for each Apache hc5 client when these files change.

Adds a refresh mechanism similar to the DNS resolution polling feature, and uses hashes of the keystore and truststore contents as a new cache key component. Key material refreshing happens at the DialogueChannel level, and the hashed data is passed down into the apache cache level. A change in key material contents results in invalidation of entries in the apache cache, causing the hc5 clients to reload with the new key/truststore contents.

Changes in the location and nature of the key material constitute a configuration change, which would also cause a reload.
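
To illustrate the cache-key idea described above, here is a minimal sketch, assuming a simple map-backed cache. `ClientCacheKey` and `ClientCache` are hypothetical names for illustration only, not the classes in this PR, and the hash values are placeholders.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache key: a client name plus an optional hash of the store contents. */
final class ClientCacheKey {
    private final String clientName;
    private final Optional<String> sslStoreHash;

    ClientCacheKey(String clientName, Optional<String> sslStoreHash) {
        this.clientName = clientName;
        this.sslStoreHash = sslStoreHash;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ClientCacheKey)) {
            return false;
        }
        ClientCacheKey that = (ClientCacheKey) other;
        return clientName.equals(that.clientName) && sslStoreHash.equals(that.sslStoreHash);
    }

    @Override
    public int hashCode() {
        return Objects.hash(clientName, sslStoreHash);
    }
}

/** Hypothetical cache that builds a value at most once per distinct key. */
final class ClientCache<V> {
    private final Map<ClientCacheKey, V> entries = new ConcurrentHashMap<>();
    private final Function<ClientCacheKey, V> factory;

    ClientCache(Function<ClientCacheKey, V> factory) {
        this.factory = factory;
    }

    V get(ClientCacheKey key) {
        // The hash participates in equals/hashCode, so changed store contents
        // produce a different key and therefore a cache miss.
        return entries.computeIfAbsent(key, factory);
    }
}
```

The sketch never evicts the entry for the old hash; a real cache would also need to invalidate and close the stale client.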

changelog-app Bot commented Apr 1, 2026

Generate changelog in changelog/@unreleased

Type (Select exactly one)

  • Feature (Adding new functionality)
  • Improvement (Improving existing functionality)
  • Fix (Fixing an issue with existing functionality)
  • Break (Creating a new major version by breaking public APIs)
  • Deprecation (Removing functionality in a non-breaking way)
  • Migration (Automatically moving data/functionality to a new system)

Description

Add apache client reloading on trust and key store file change

Check the box to generate changelog(s)

  • Generate changelog entry

Contributor

@bjlaub bjlaub left a comment

Some initial thoughts:

  • what happens to in-flight requests when we detect a change to the key/trust material and reload the hc5 client? I believe we're using pooled connections; I'm not sure what will happen to in-flight requests using a connection from the pool if the client is closed before those requests finish. Probably worth adding some tests for this.
  • watching for changes to the keystore in addition to the truststore adds a small bit of complexity here; I wonder if it's really necessary?

}

private static HashCode hashFile(Path path) throws IOException {
    byte[] bytes = Files.readAllBytes(path);

Contributor

nit: we could possibly avoid pulling the whole file into memory by using guava's HashingInputStream instead. I don't have a sense for how large these files will get in practice, but it's probably not a huge concern.

Author

afaict, these keystore files are usually a couple of KB in size - I haven't found any enormous ones.

fwiw, this was copied from the way that witchcraft hashes configuration files, which uses readAllBytes.
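
For reference, a minimal sketch of the streaming alternative raised in this thread. It uses the JDK's `DigestInputStream` as a stand-in for Guava's `HashingInputStream`, so it runs without extra dependencies. The class name `StreamingFileHash` and the choice of SHA-256 are assumptions for illustration, not what the PR does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class StreamingFileHash {
    private StreamingFileHash() {}

    /** Hashes a stream in 8 KiB chunks so the full contents are never held in memory. */
    static String sha256Hex(InputStream input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            try (DigestInputStream hashing = new DigestInputStream(input, digest)) {
                byte[] buffer = new byte[8192];
                while (hashing.read(buffer) != -1) {
                    // Reading updates the digest; the bytes themselves are discarded.
                }
            }
            return toHex(digest.digest());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Convenience overload for a file on disk. */
    static String sha256Hex(Path path) {
        try (InputStream input = Files.newInputStream(path)) {
            return sha256Hex(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    }
}
```

For files of a few KB the difference from `readAllBytes` is likely negligible, which matches the conclusion above.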


Optional<InetAddress> resolvedAddress();

Optional<String> sslStoreHash();

Contributor

what happens if this is empty?

Author

That would cause the apache cache key to also not contain an sslStoreHash - it's not used in the construction of the actual client itself.

To that end, whether to include this argument as part of the ChannelArgs is debatable. It's not strictly a 'channel arg', but it feels like the best place for it in the way that things are currently structured.


ApacheCacheEntry apacheClient = getApacheClient(request);
Refreshable<SslStoreMetadata> storeMetadata =
        KeystoreSupport.pollForChanges(channelCacheRequest.serviceConf().security());

Contributor

I think there's some nuance to this we should think about, particularly since it was adapted from the DNS polling stuff (disclaimer: this code is all a bit confusing and it's taking me a while to wrap my head around it again, so I could be wrong):

  • This will create one MetadataPollingTask per cached DialogueChannel, where the work of each polling task is tied directly to reading bytes from the configured keystore/truststore on disk. But in practice, we may have many DialogueChannel instances that share the same truststore on disk, so with a large number of channels we end up mostly repeating work at fixed intervals.
  • The DialogueDnsResolutionWorker operates similarly (we create one worker task per channel and run them at fixed intervals), and also may repeat work for services that share URIs, though this seems less likely (e.g. we are less likely to have 10 different channels with overlapping host names to resolve, since they likely represent 10 different services with distinct host names, though that's not guaranteed). Another important difference is that we implicitly rely on the JVM DNS cache, so even though we poll at 1-second intervals in the DNS case, lookups may end up being very fast if the JVM has already cached results. Again, not guaranteed, and eventually we will make a network call.

I might be overthinking this; perhaps the cost to compute the hash is cheap enough that many polling tasks over the same files on disk just aren't worth optimizing for. Ideally we might be able to create one polling task per keystore/truststore on disk, and have it update any number of refreshables (connected to the hc5 clients or whatever) to get them to reload trust material when it changes.

Author

That is a very valid concern. I'll have a look into deduping existing tasks, because yes - these file operations, while not terribly resource intensive, are also not exactly free.

If it's not complicated (trust/keystore combos map to multiple DialogueChannels that need to be invalidated), then I do think it's worth this optimization.
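
One possible shape for the "one polling task per store on disk" idea discussed in this thread, as a rough sketch. `SharedStorePoller` is a hypothetical name, the hash function is injected so the example stays self-contained, and scheduling is left to the caller (for instance a single `scheduleWithFixedDelay(poller::pollOnce, ...)`).

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical poller that hashes each distinct store path once per tick, however many channels use it. */
final class SharedStorePoller {
    private static final class Watch {
        final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();
        volatile String lastHash;
    }

    private final Map<Path, Watch> watches = new ConcurrentHashMap<>();
    private final Function<Path, String> hasher;

    SharedStorePoller(Function<Path, String> hasher) {
        this.hasher = hasher;
    }

    /** Registers a listener; listeners for the same normalized path share one Watch. */
    void subscribe(Path path, Consumer<String> listener) {
        Path key = path.toAbsolutePath().normalize();
        Watch watch = watches.computeIfAbsent(key, ignored -> new Watch());
        watch.listeners.add(listener);
        String known = watch.lastHash;
        if (known != null) {
            // Late subscribers receive the most recent hash immediately.
            listener.accept(known);
        }
    }

    /** One polling tick, intended to be called from a single scheduler thread. */
    void pollOnce() {
        watches.forEach((path, watch) -> {
            String current = hasher.apply(path);
            if (!current.equals(watch.lastHash)) {
                watch.lastHash = current;
                watch.listeners.forEach(listener -> listener.accept(current));
            }
        });
    }
}
```

With this layout the cost per tick scales with the number of distinct files rather than the number of channels. Removing listeners for closed channels is not shown.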

private static final class MetadataPollingTask implements Runnable {

    private final SslConfiguration sslConfiguration;
    private final SettableRefreshable<SslStoreMetadata> metadataRefreshable;

Contributor

DialogueDnsResolutionWorker stores a WeakReference to the output refreshable so that if it has been garbage collected we can avoid doing any work and exit early. Should we do something similar here?

Contributor

+1 - otherwise, I think the cleaner will also never remove the scheduled future nor the task itself?
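
A rough sketch of the WeakReference pattern mentioned above, assuming the task is handed its own scheduled future so it can cancel itself. `WeakPollingTask` and its methods are hypothetical; this is not how DialogueDnsResolutionWorker is actually written.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical polling task that holds its output weakly and stops once the output is unreachable. */
final class WeakPollingTask<T> implements Runnable {
    private final Supplier<T> source;
    private final WeakReference<Consumer<T>> output;
    private volatile Future<?> scheduled;

    WeakPollingTask(Supplier<T> source, WeakReference<Consumer<T>> output) {
        this.source = source;
        this.output = output;
    }

    /** Records the future returned by the scheduler so the task can cancel itself later. */
    void setScheduled(Future<?> future) {
        this.scheduled = future;
    }

    @Override
    public void run() {
        Consumer<T> target = output.get();
        if (target == null) {
            // Nobody can observe updates anymore: cancel instead of polling forever.
            Future<?> future = scheduled;
            if (future != null) {
                future.cancel(false);
            }
            return;
        }
        target.accept(source.get());
    }
}
```

Because the task holds no strong reference to its output, the scheduler keeping the task alive does not by itself keep the output reachable.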

@atrocities
Author

Thanks much for having a look, @bjlaub. I'll do the following:

  • Try to coalesce MetadataPollingTasks by the paths of the key/truststore on disk instead of having one for every DialogueChannel.
  • Add a test to demonstrate what happens when the key material changes. My thought is that the old client ought to allow any existing requests to finish, and then eventually be closed and cleaned up. On that note - I will want to test for that as well.
  • Figure out whether or not we're actually using the keystore (not truststore) in dialogue.
  • Investigate WeakReference in KeystoreSupport.
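
One way the "let existing requests finish, then close" behaviour from the list above could be modelled is reference counting. The following is only a sketch under that assumption; `DrainingClient` is a hypothetical wrapper, and whether the hc5 client or its connection pool already provides equivalent guarantees is exactly what the planned test would need to establish.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper: the close action runs only after retire() and after all leases are released. */
final class DrainingClient {
    private final Runnable closeAction;
    // Starts at 1: the cache's own reference, dropped by retire().
    private final AtomicInteger references = new AtomicInteger(1);
    private final AtomicBoolean retired = new AtomicBoolean();

    DrainingClient(Runnable closeAction) {
        this.closeAction = closeAction;
    }

    /** Called before sending a request; returns false once the client has been retired. */
    boolean tryAcquire() {
        while (!retired.get()) {
            int current = references.get();
            if (current == 0) {
                return false;
            }
            if (references.compareAndSet(current, current + 1)) {
                return true;
            }
        }
        return false;
    }

    /** Called when a request completes, successfully or not. */
    void release() {
        if (references.decrementAndGet() == 0) {
            closeAction.run();
        }
    }

    /** Called when the key material changes; safe to call more than once. */
    void retire() {
        if (retired.compareAndSet(false, true)) {
            release();
        }
    }
}
```

A caller that gets `false` from `tryAcquire()` would look up the replacement client and retry there.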

Contributor

@aldexis aldexis left a comment

Pretty nice work - I like the tests!


DialogueDnsResolver dnsResolver();

Optional<String> sslStoreHash();

Contributor

nit: Add a comment here explaining that this exists to force a cache miss when the SSL stores' contents change, forcing a reload of their values, even though the hash itself isn't directly used. Probably the same for ChannelArgs#sslStoreHash (or refer to this method from there).

return NodeSelectionStrategyChannel.create(cf, targetChannels);
}
}));
LimitedChannel keystoreUpdatingChannel = createKeystoreUpdatingChannel(

Contributor

Technically, this is updating for both keystore and truststore, isn't it? If so, thoughts on naming it createSslStoresUpdatingChannel?

Comment on lines +297 to +306
private static LimitedChannel createKeystoreUpdatingChannel(
        Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) {
    return new SupplierChannel(cf.storeMetadata().map(new Function<SslStoreMetadata, LimitedChannel>() {

        @Override
        public LimitedChannel apply(SslStoreMetadata storeMetadata) {
            return delegateSupplier.apply(storeMetadata);
        }
    }));
}

Contributor

nit: Fwiw, I think this can be simplified to

Suggested change
- private static LimitedChannel createKeystoreUpdatingChannel(
-         Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) {
-     return new SupplierChannel(cf.storeMetadata().map(new Function<SslStoreMetadata, LimitedChannel>() {
-         @Override
-         public LimitedChannel apply(SslStoreMetadata storeMetadata) {
-             return delegateSupplier.apply(storeMetadata);
-         }
-     }));
- }
+ private static LimitedChannel createKeystoreUpdatingChannel(
+         Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) {
+     return new SupplierChannel(cf.storeMetadata().map(delegateSupplier));
+ }

Author

Yes, thanks! The reason was that there were previously some debug log lines in the overridden apply.

@@ -142,15 +144,8 @@ DialogueChannel getNonReloadingChannel(
}

private DialogueChannel createNonLiveReloadingChannel(ChannelCacheKey channelCacheRequest) {

Contributor

Regarding some comments I made last week when discussing this PR, I had a look at other places this was used, and realized that this is used in ReloadingClientFactory#perHost. I'm not sure that's correct fwiw, since there are things beyond the uris that may update, which makes me wonder whether we should have this createNonLiveReloadingChannel in the first place.

It's also used through the ReloadingClientFactory#getNonReloading method, which might be used in a few places, including through the legacy DialogueClients#create

This also made me realize that we might want to handle

public Channel getNonReloadingChannel(String channelName, ClientConfiguration input) {
    ClientConfiguration clientConf = hydrate(input);
    ApacheHttpClientChannels.ClientBuilder clientBuilder = ApacheHttpClientChannels.clientBuilder()
            .clientConfiguration(clientConf)
            .clientName(channelName)
            .dnsResolver(params.dnsResolver());
    params.blockingExecutor().ifPresent(clientBuilder::executor);
    ApacheHttpClientChannels.CloseableClient apacheClient = clientBuilder.build();
    return DialogueChannel.builder()
            .channelName(channelName)
            .clientConfiguration(clientConf)
            .uris(DnsSupport.pollForChanges(
                            params.dnsNodeDiscovery(),
                            DnsPollingSpec.clientConfig(channelName),
                            params.dnsResolver(),
                            params.dnsRefreshInterval(),
                            params.taggedMetrics(),
                            Refreshable.only(clientConf))
                    .map(dnsResult -> DnsSupport.getTargetUris(
                            channelName,
                            dnsResult.config().uris(),
                            dnsResult.config().proxy(),
                            dnsResult.resolvedHosts(),
                            params.taggedMetrics())))
            .factory(args -> ApacheHttpClientChannels.createSingleUri(args, apacheClient))
            .deadlineEnforcement(params.deadlineEnforcement())
            .build();
}
? (honestly this whole client creation codepath should get refactored - we shouldn't be creating this many different clients in so many places)

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class KeystoreSupport {

Contributor

This is for both keystore and truststore, right? (e.g. SslStoresSupport?)

Author

Yes, will rename.


class KeystoreSupportTest {
    @Test
    void pollForChanges_updates_on_change(@TempDir Path tempDir) throws IOException {

Contributor

This seems to only test changes on the truststore. Should we also write another test for keystore changes?

Comment on lines +60 to +61
assertThat(updated).isNotEqualTo(initial);
assertThat(updated.trustStore().hash()).isNotEqualTo(initialTrustHash);

Contributor

nit: may be interesting to also validate the hash is the one we expect from the file we wrote?

Comment on lines +84 to +85
Files.move(trustStore, movedTrustStore);
Thread.sleep(150);

Contributor

Technically, we aren't actually testing that we did a reload during this period, but I don't think we have any visible way to observe this?
Makes me wonder whether we may want to add a metric for failing to refresh the ssl stores (in which case, we could await the metric increasing, then move the file back and verify it updates)

Author

Good idea, will add metrics in the same way as the dns polling does.
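
As a sketch of what an observable refresh failure could look like: the task below counts failed polls so a test can wait for the counter to increase before moving the file back. `CountingStorePoll` is a hypothetical name, and a plain `AtomicLong` stands in for whatever meter the DNS polling code uses.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical polling task that counts failed refreshes; an AtomicLong stands in for a real meter. */
final class CountingStorePoll implements Runnable {
    private final Callable<String> readHash;
    private final Consumer<String> onChange;
    private final AtomicLong failures = new AtomicLong();
    private String lastHash; // only touched by the polling thread

    CountingStorePoll(Callable<String> readHash, Consumer<String> onChange) {
        this.readHash = readHash;
        this.onChange = onChange;
    }

    @Override
    public void run() {
        try {
            String current = readHash.call();
            if (!current.equals(lastHash)) {
                lastHash = current;
                onChange.accept(current);
            }
        } catch (Exception e) {
            // Keep the previous value and record the failure so tests and dashboards can observe it.
            failures.incrementAndGet();
        }
    }

    long failureCount() {
        return failures.get();
    }
}
```

In a test, the assertion would then be "failure count went up while the file was missing, and the next successful poll published the new hash".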

Comment on lines +87 to +88
Files.move(movedTrustStore, trustStore);
Files.write(trustStore, new byte[] {9}, java.nio.file.StandardOpenOption.APPEND);

Contributor

vnit: (update before moving back, that way it's atomic - doesn't matter much but is slightly cleaner?)

Suggested change
- Files.move(movedTrustStore, trustStore);
- Files.write(trustStore, new byte[] {9}, java.nio.file.StandardOpenOption.APPEND);
+ Files.write(trustStore, new byte[] {9}, StandardOpenOption.APPEND);
+ Files.move(movedTrustStore, trustStore);
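
For context on the atomicity point, a minimal sketch of the usual "stage, then rename" pattern, where the poller sees either the old or the new contents but never a half-written file. `AtomicStoreWrite` is a hypothetical helper, not part of the PR or its tests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicStoreWrite {
    private AtomicStoreWrite() {}

    /** Writes to a temp file in the same directory, then renames it over the target. */
    static void replaceContents(Path target, byte[] contents) throws IOException {
        Path directory = target.toAbsolutePath().getParent();
        Path staged = Files.createTempFile(directory, "store-", ".tmp");
        try {
            Files.write(staged, contents);
            // ATOMIC_MOVE requests a rename and throws AtomicMoveNotSupportedException where unsupported.
            // Whether an existing target is replaced is implementation specific; POSIX rename replaces it.
            Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up the staged file on failure.
            Files.deleteIfExists(staged);
        }
    }
}
```

Staging in the same directory keeps the rename on one filesystem, which is what makes an atomic move possible.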

KeyStore trustStore = KeyStore.getInstance("JKS");
trustStore.load(null, new char[0]);
trustStore.setCertificateEntry("cert", certificate);
try (java.io.OutputStream outputStream = Files.newOutputStream(trustStorePath)) {

Contributor

Suggested change
- try (java.io.OutputStream outputStream = Files.newOutputStream(trustStorePath)) {
+ try (OutputStream outputStream = Files.newOutputStream(trustStorePath)) {
