infra: aggressive ClickHouse idle-baseline tuning (HOL-24) #106
Merged
Drops ClickHouse idle memory from ~1.14 GiB to ~600 MiB (warm) / ~350 MiB (cold), thread count from 746 to ~80, and disk usage from 6.8 GB to 1.4 GB on the dev volume.

Specifically:
- max_thread_pool_size 10000 -> 128 (this is the big one: the default of 10000 means CH allocated 700+ threads on a 16-core idle box)
- background_pool_size 16 -> 14 (kept ratio*size >= 28 to clear MergeTree sanity checks like number_of_free_entries_in_pool_to_execute_optimize_entire_partition=25)
- Other background pools cut to 4 each (or 16 for schedule_pool, which needs headroom or AsyncLoader stalls during startup with our 25+ default-DB tables)
- max_concurrent_queries 100 -> 20
- asynchronous_metrics_update_period_s 1 -> 30
- All system *_log tables disabled via remove="remove" (text_log was hoarding 5.5 GB on disk; query_log/trace_log/etc were chatty for no operator benefit on a self-hosted single-tenant deploy)
- listen_host 0.0.0.0 added (silences a noisy IPv6 listen warning)

Floor analysis: the CH binary alone occupies ~580 MiB in shared library mappings + code segments (MemoryShared 309 + MemoryCode 272). Working heap rounds out to ~620-700 MiB on a warm idle box. Going lower would require either a custom-compiled minimal CH build or a different storage engine entirely.

Note: the on-disk system log data must be wiped manually for existing volumes (these volumes pre-date the config disable). The 5.4 GiB of text_log/opentelemetry_span_log/etc data is reclaimed by deleting metadata/system/*_log.sql + their store/ UUIDs.

Refs HOL-24.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary

Drops ClickHouse idle memory from ~1.14 GiB to ~600 MiB warm / ~350 MiB cold, thread count from 746 to ~80, and disk usage from 6.8 GB to 1.4 GB on the dev volume.

What changed (in infra/docker/config.xml)

- max_thread_pool_size 10000 → 128 (the big lever: the default of 10000 was producing 700+ idle threads on a 16-core box)
- background_pool_size 16 → 14 (kept ratio×size ≥ 28 to clear MergeTree sanity checks like number_of_free_entries_in_pool_to_execute_optimize_entire_partition=25)
- Other background pools cut to 4 each (or 16 for background_schedule_pool_size, which needs headroom or AsyncLoader stalls during startup with our 25+ default-DB tables)
- max_concurrent_queries 100 → 20
- asynchronous_metrics_update_period_s 1 → 30
- All system *_log tables disabled via remove="remove" (text_log was hoarding 5.5 GB on disk; query_log/trace_log/etc were chatty for no operator benefit on a self-hosted single-tenant deploy)
- listen_host 0.0.0.0 added (silences a noisy IPv6 listen warning)

Floor analysis

The ClickHouse binary alone occupies ~580 MiB in shared library mappings + code segments (MemoryShared 309 + MemoryCode 272). Working heap rounds out to ~620–700 MiB on a warm idle box. Going lower would require either a custom-compiled minimal CH build or a different storage engine entirely; that is the open question being explored in HOL-25 onward (Postgres-migration EPIC).

Operational note

The on-disk system-log data must be wiped manually on existing volumes (those volumes pre-date the config disable). The 5.4 GiB of text_log/opentelemetry_span_log/etc data is reclaimed by deleting metadata/system/*_log.sql + their store/ UUIDs. New deployments don't need this step: the disable takes effect at first start.

Test plan

- /health returns Healthy

Closes HOL-24.
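
For readers who want a concrete picture of the settings listed under "What changed", the fragment below is a rough sketch of how they might be expressed in a ClickHouse server config. It is not the actual contents of infra/docker/config.xml. The values come from the description above, but the specific "other" background pools chosen here, and the assumption that background_merges_mutations_concurrency_ratio is left at its default of 2 (so that 14 × 2 = 28), are illustrative guesses.

```xml
<!-- Illustrative sketch only: values follow the PR description, element selection is partly assumed. -->
<clickhouse>
    <!-- Global thread pool cap: 10000 -> 128 -->
    <max_thread_pool_size>128</max_thread_pool_size>

    <!-- Merge/mutation pool: 16 -> 14. Assuming the concurrency ratio stays
         at its default of 2, this yields 28 slots, above the threshold of 25. -->
    <background_pool_size>14</background_pool_size>

    <!-- Schedule pool keeps 16 threads so startup table loading has headroom -->
    <background_schedule_pool_size>16</background_schedule_pool_size>

    <!-- "Other background pools cut to 4 each": which pools were touched
         is an assumption; these three are examples of real pool settings. -->
    <background_move_pool_size>4</background_move_pool_size>
    <background_fetches_pool_size>4</background_fetches_pool_size>
    <background_common_pool_size>4</background_common_pool_size>

    <!-- Query concurrency: 100 -> 20 -->
    <max_concurrent_queries>20</max_concurrent_queries>

    <!-- Asynchronous metrics refresh interval in seconds: 1 -> 30 -->
    <asynchronous_metrics_update_period_s>30</asynchronous_metrics_update_period_s>

    <!-- System log tables named in the description, disabled by removing
         their config sections; further *_log sections follow the same pattern. -->
    <text_log remove="remove"/>
    <query_log remove="remove"/>
    <trace_log remove="remove"/>
    <opentelemetry_span_log remove="remove"/>

    <!-- Listen on IPv4 only, which avoids the IPv6 listen warning -->
    <listen_host>0.0.0.0</listen_host>
</clickhouse>
```

The remove attribute is normally evaluated when an override file is merged over the base configuration, so this sketch assumes the file is loaded that way.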
🤖 Generated with Claude Code