Add apache client reloading on trust and key store file change#2941
Add apache client reloading on trust and key store file change#2941atrocities wants to merge 23 commits into
Conversation
Generate changelog in
|
bjlaub
left a comment
There was a problem hiding this comment.
Some initial thoughts:
- what happens to in-flight requests when we detect a change to the key/trust material and reload the hc5 client? I believe we're using pooled connections; I'm not sure what will happen to in-flight requests using connection from the pool if the client is closed before those requests finish. Probably worth adding some tests for this.
- watching for changes to the keystore in addition to the truststore adds a small bit of complexity here; I wonder if it's really necessary?
| } | ||
|
|
||
| private static HashCode hashFile(Path path) throws IOException { | ||
| byte[] bytes = Files.readAllBytes(path); |
There was a problem hiding this comment.
nit: we could possibly avoid pulling the whole file into memory by using guava's HashingInputStream instead. I don't have a sense for how large these files will get in practice, but it's probably not a huge concern.
There was a problem hiding this comment.
afaict, the size of these keystore files is usually a couple kb - I haven't found any enormous ones.
fwiw, this was copied from the way that witchcraft does hashing of configuration files, which is using readAllBytes
|
|
||
| Optional<InetAddress> resolvedAddress(); | ||
|
|
||
| Optional<String> sslStoreHash(); |
There was a problem hiding this comment.
what happens if this is empty?
There was a problem hiding this comment.
That would cause the apache cache key to also not contain an sslStoreHash - it's not used in the construction of the actual client itself.
To that end, whether to include this argument as part of the ChannelArgs is debatable. It's not strictly a 'channel arg', but it feels like the best place for it in the way that things are currently structured.
|
|
||
| ApacheCacheEntry apacheClient = getApacheClient(request); | ||
| Refreshable<SslStoreMetadata> storeMetadata = | ||
| KeystoreSupport.pollForChanges(channelCacheRequest.serviceConf().security()); |
There was a problem hiding this comment.
I think there's some nuance to this we should think about, particularly since it was adapted from the DNS polling stuff (disclaimer: this code is all a bit confusing and it's taking me a while to wrap my head around it again, so I could be wrong):
- This will create one
MetadataPollingTaskper cached DialogueChannel, where the work of each polling task is tied directly to reading bytes from the configured keystore/truststore on disk. But in practice, we may have many DialogueChannel instances that share the same truststore on disk, so with a large number of channels we end up mostly repeating work at fixed intervals. - The
DialogueDnsResolutionWorkeroperates similarly (we create one worker task per channel and run them at fixed intervals), and also may repeat work for services that share URIs, though this seems less likely (e.g. we are less likely to have 10 difference channels with overlapping host names to resolve, since they likely represent 10 different services with distinct host names, though that's not guaranteed). Another important difference is that we implicitly rely on the JVM DNS cache, so even though we poll at 1-second intervals in the DNS case, lookups may end up being very fast if the JVM has already cached results. Again, not guaranteed, and eventually we will make a network call.
I might be overthinking this, perhaps the cost to compute the hash is cheap enough that many polling tasks over the same files on disk just isn't worth optimizing for. Ideally we might be able to create one polling task per keystore/truststore on disk, and have it update any number of refreshables (connected to the hc5 clients or whatever) to get them to reload trust material when it changes.
There was a problem hiding this comment.
That is a very valid concern. I'll have a look into deduping existing tasks, because yes - these file operations, while not terribly resource intensive, are also not exactly free.
If it's not complicated (trust/keystore combos map to multiple DialogueChannels that need to be invalidated), then I do think it's worth this optimization.
| private static final class MetadataPollingTask implements Runnable { | ||
|
|
||
| private final SslConfiguration sslConfiguration; | ||
| private final SettableRefreshable<SslStoreMetadata> metadataRefreshable; |
There was a problem hiding this comment.
DialogueDnsResolutionWorker stores a WeakReference to the output refreshable so that if it has been garbage collected we can avoid doing any work and exit early. Should we do something similar here?
There was a problem hiding this comment.
+1 - otherwise, I think the cleaner will also never remove the scheduled future nor the task itself?
|
Thanks much for having a look @bjlaub . I'll do the following:
|
aldexis
left a comment
There was a problem hiding this comment.
Pretty nice work - I like the tests!
|
|
||
| DialogueDnsResolver dnsResolver(); | ||
|
|
||
| Optional<String> sslStoreHash(); |
There was a problem hiding this comment.
nit: Add a comment here explaining that this exists to force a cache miss when the ssl stores contents change, to force a reload of their values, even if the hash itself isn't directly used. Probably same for ChannelArgs#sslStoreHash (or refer to this method from there)
| return NodeSelectionStrategyChannel.create(cf, targetChannels); | ||
| } | ||
| })); | ||
| LimitedChannel keystoreUpdatingChannel = createKeystoreUpdatingChannel( |
There was a problem hiding this comment.
Technically, this is updating for both keystore and truststore, isn't it? If so, thoughts on naming it createSslStoresUpdatingChannel?
| private static LimitedChannel createKeystoreUpdatingChannel( | ||
| Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) { | ||
| return new SupplierChannel(cf.storeMetadata().map(new Function<SslStoreMetadata, LimitedChannel>() { | ||
|
|
||
| @Override | ||
| public LimitedChannel apply(SslStoreMetadata storeMetadata) { | ||
| return delegateSupplier.apply(storeMetadata); | ||
| } | ||
| })); | ||
| } |
There was a problem hiding this comment.
nit: Fwiw, I think this can be simplified to
| private static LimitedChannel createKeystoreUpdatingChannel( | |
| Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) { | |
| return new SupplierChannel(cf.storeMetadata().map(new Function<SslStoreMetadata, LimitedChannel>() { | |
| @Override | |
| public LimitedChannel apply(SslStoreMetadata storeMetadata) { | |
| return delegateSupplier.apply(storeMetadata); | |
| } | |
| })); | |
| } | |
| private static LimitedChannel createKeystoreUpdatingChannel( | |
| Config cf, Meter _reloadMeter, Function<SslStoreMetadata, LimitedChannel> delegateSupplier) { | |
| return new SupplierChannel(cf.storeMetadata().map(delegateSupplier)); | |
| } |
There was a problem hiding this comment.
Yes, thanks! The reason was that there were previously some debug loglines in the overriden apply.
| @@ -142,15 +144,8 @@ DialogueChannel getNonReloadingChannel( | |||
| } | |||
|
|
|||
| private DialogueChannel createNonLiveReloadingChannel(ChannelCacheKey channelCacheRequest) { | |||
There was a problem hiding this comment.
Regarding some comments I made last week when discussing this PR, I had a look at other places this was used, and realized that this is used in ReloadingClientFactory#perHost. I'm not sure that's correct fwiw, since there are things beyond the uris that may update, which makes me wonder whether we should have this createNonLiveReloadingChannel in the first place.
It's also used through the ReloadingClientFactory#getNonReloading method, which might be used in a few places, including through the legacy DialogueClients#create
This also made me realize that we might want to handle
? (honestly this whole client creation codepath should get refactored - we shouldn't be creating this many different clients in so many places)| import java.util.concurrent.atomic.AtomicBoolean; | ||
| import java.util.function.Supplier; | ||
|
|
||
| final class KeystoreSupport { |
There was a problem hiding this comment.
This is for both keystore and truststore, right? (e.g. SslStoresSupport?)
|
|
||
| class KeystoreSupportTest { | ||
| @Test | ||
| void pollForChanges_updates_on_change(@TempDir Path tempDir) throws IOException { |
There was a problem hiding this comment.
This seems to only test changes on the truststore. Should we also write another test for keystore changes?
| assertThat(updated).isNotEqualTo(initial); | ||
| assertThat(updated.trustStore().hash()).isNotEqualTo(initialTrustHash); |
There was a problem hiding this comment.
nit: may be interesting to also validate the hash is the one we expect from the file we wrote?
| Files.move(trustStore, movedTrustStore); | ||
| Thread.sleep(150); |
There was a problem hiding this comment.
Technically, we aren't actually testing that we did a reload during this period, but I don't think we have any visible way to observe this?
Makes me wonder whether we may want to add a metric for failing to refresh the ssl stores (in which case, we could await the metric increasing, then move the file back and verify it updates)
There was a problem hiding this comment.
Good idea, will add metrics in the same way as the dns polling does.
| Files.move(movedTrustStore, trustStore); | ||
| Files.write(trustStore, new byte[] {9}, java.nio.file.StandardOpenOption.APPEND); |
There was a problem hiding this comment.
vnit: (update before moving back, that way it's atomic - doesn't matter much but is slightly cleaner?)
| Files.move(movedTrustStore, trustStore); | |
| Files.write(trustStore, new byte[] {9}, java.nio.file.StandardOpenOption.APPEND); | |
| Files.write(trustStore, new byte[] {9}, StandardOpenOption.APPEND); | |
| Files.move(movedTrustStore, trustStore); |
| KeyStore trustStore = KeyStore.getInstance("JKS"); | ||
| trustStore.load(null, new char[0]); | ||
| trustStore.setCertificateEntry("cert", certificate); | ||
| try (java.io.OutputStream outputStream = Files.newOutputStream(trustStorePath)) { |
There was a problem hiding this comment.
| try (java.io.OutputStream outputStream = Files.newOutputStream(trustStorePath)) { | |
| try (OutputStream outputStream = Files.newOutputStream(trustStorePath)) { |
Support automatic reloading of keystore and truststore per apache hc5 client upon changes to these files.
Adds a refresh mechanism similar to the DNS resolution polling feature, and uses hashes of the keystore and truststore contents as a new cache key component. Key material refreshing happens at the DialogueChannel level, and the hashed data is passed down into the apache cache level. A change in key material contents results in invalidation of entries in the apache cache, causing the hc5 clients to reload with the new key/truststore contents.
Changes in the location and nature of the key material constitute a configuration change, which would also cause a reload.