Major refactoring of Reloader to use controller-runtime and fix multiple bugs #1071
base: master
Conversation
…isters and add Argo Rollouts support
…arity and consistency across packages
…dedicated files and remove redundant code
…d improve ldflags injection, fix defer resp.Body.Close() usage, replace os.Setenv with t.Setenv in tests, correct error message casing, and adjust Dockerfile and Makefile for cmd/reloader structure
@waseem-h Can you please add a "Functionality Parity Matrix", i.e. a table of every behavior the old controller had, marking how it's implemented now and how it's tested?
@rasheedamir Sure, I'll update the description to include that.
@rasheedamir I've added test coverage information and details about the new components and what they replaced.
… publishing and update RBAC to include watch permission
@waseem-h Can't the coverage be 100% for all components? And is the core business logic separated out into small, unit-testable pieces of code?
@rasheedamir The original coverage was less than 50% overall, and most components didn't have tests at all. The new tests cover all of the critical scenarios and functions, but not getters, setters, or constructor functions, which is why coverage is not 100%.
@TheiLLeniumStudios Can we add a minimal load/stress test suite as part of this migration? Since Reloader is widely adopted and used by enterprise customers, we should validate that moving to controller-runtime reconcilers doesn't regress performance or stability. A small, repeatable test bed would let us run identical scenarios against both the legacy implementation and the new one and compare key metrics (throughput, reconcile latency, API call volume, resource usage, and error rate) as an objective release gate. A good stress/load test for this migration should answer two questions:

Below is a concrete design + criteria you can implement as a repeatable “test bed” and run A/B (old controller vs new controller).

What to measure (the “contract”)

Correctness criteria

Stability criteria

Performance criteria (A/B comparison)
Track these and compare old vs new:

Acceptance style:

Test bed design (repeatable)

Environment

Make it deterministic:
Objects generator (load driver)
Write a small Go tool (or k6 + kubernetes client wrapper) that can:
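As a rough illustration of such a driver (purely a sketch assuming client-go, a kubeconfig in the default location, and a hypothetical `load-test` namespace with a pre-created `hot-source` ConfigMap), an S1-style burst generator could look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location; in CI this would point at a kind cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Burst scenario: update one "hot" ConfigMap many times in quick succession,
	// stamping each revision so update-to-reload latency can be measured later.
	const namespace, name, updates = "load-test", "hot-source", 140
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for i := 0; i < updates; i++ {
		cm.Data = map[string]string{
			"revision":  fmt.Sprintf("%d", i),
			"updatedAt": time.Now().Format(time.RFC3339Nano),
		}
		if cm, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```

The same skeleton extends to fan-out (many referencing Deployments), churn, and multi-namespace scenarios by swapping out the object generator and the update loop.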
Core stress scenarios (we want 6–8)

S1 — Burst updates on one hot source
Goal: detect duplicate restarts, queue storms, hot loops.

S2 — Many sources, moderate fan-out
Goal: typical enterprise pattern.

S3 — High cardinality namespaces/tenants
Goal: cache/index scalability + multi-tenant behavior.

S4 — Metadata-only updates and no-op updates
Goal: predicate correctness + API friendliness.

S5 — Delete/recreate churn
Goal: tombstone/finalizer correctness.

S6 — Controller restart during load
Goal: resilience, missed-event recovery.

S7 — API server pressure / transient failures
Goal: backoff + retry correctness.

S8 — Large object size / big secrets/configmaps
Goal: serialization/patch overhead.
Instrumentation (so results are objective)

Minimum instrumentation

Kubernetes API call volume
Best sources:
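One low-effort option for the instrumentation piece (a sketch, not part of this PR) is to register a couple of load-test metrics on controller-runtime's Prometheus registry, which typically also exposes client-go REST metrics such as rest_client_requests_total that can serve as the API-call-volume signal. The metric names below are made up:

```go
package loadtest

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// reloadsTriggered counts workload restarts observed per scenario and workload.
	reloadsTriggered = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "reloader_loadtest_reloads_total",
			Help: "Workload restarts triggered during a load-test scenario.",
		},
		[]string{"scenario", "workload"},
	)
	// reconcileLatency tracks the delay from source update to workload patch.
	reconcileLatency = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "reloader_loadtest_reconcile_latency_seconds",
			Help:    "Latency from ConfigMap/Secret update to the resulting workload patch.",
			Buckets: prometheus.DefBuckets,
		},
	)
)

func init() {
	// metrics.Registry is controller-runtime's global registry; everything
	// registered here shows up on the manager's /metrics endpoint.
	metrics.Registry.MustRegister(reloadsTriggered, reconcileLatency)
}
```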
Reporting format (A/B comparable)
For each scenario produce a small JSON + markdown summary:
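One possible shape for that summary (illustrative only; the struct and field names below are assumptions, not an existing Reloader type):

```go
package report

import (
	"encoding/json"
	"os"
)

// ScenarioReport captures the per-scenario numbers so old and new runs can be diffed.
type ScenarioReport struct {
	Scenario          string  `json:"scenario"`       // e.g. "S2-fan-out"
	Implementation    string  `json:"implementation"` // "legacy" or "controller-runtime"
	ExpectedReloads   int     `json:"expectedReloads"`
	ObservedReloads   int     `json:"observedReloads"`
	DuplicateReloads  int     `json:"duplicateReloads"`
	Reconciles        int     `json:"reconciles"`
	APICalls          int     `json:"apiCalls"`
	P95LatencySeconds float64 `json:"p95ReconcileLatencySeconds"`
	Errors            int     `json:"errors"`
}

// Write dumps the report as indented JSON for later comparison.
func (r ScenarioReport) Write(path string) error {
	data, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}
```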
Then do a diff: old vs new.

Suggested pass/fail gates (practical defaults)
You can tune these, but start here:

Implementation approach
Pull request overview
This PR represents a major architectural refactoring of the Reloader project, migrating from custom informer controllers to controller-runtime reconcilers while fixing multiple critical bugs. The refactoring improves code maintainability and test coverage, and addresses issues with case sensitivity, annotation precedence, deployment pause handling, and resource detection.
Key Changes:
- Migrated from custom informers to controller-runtime with dedicated reconcilers (ConfigMap, Secret, Deployment, Namespace), as sketched below
- Fixed case-insensitive resource comparison, annotation precedence bugs, and persistent pause handling
- Achieved ~94.4% test coverage for core reload logic and ~56% for controllers with comprehensive unit and e2e tests
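For readers unfamiliar with the target pattern, the first bullet above boils down to registering reconcilers with a controller-runtime manager. The following is a generic sketch of that pattern, not this PR's actual code (the real reconcilers live under internal/pkg/controller):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ConfigMapReconciler is a generic example of a controller-runtime reconciler.
type ConfigMapReconciler struct {
	client.Client
}

func (r *ConfigMapReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cm corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cm); err != nil {
		// Ignore not-found errors: the object may have been deleted after the event was queued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ...hash the data and trigger reloads for referencing workloads here...
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&ConfigMapReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The manager's shared cache is what the later load-test comments credit for the drop in GET calls.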
Reviewed changes
Copilot reviewed 85 out of 119 changed files in this pull request and generated no comments.
Summary per file
| File | Description |
|---|---|
| internal/pkg/testutil/fixtures.go | New test utility fixtures for creating test workload objects |
| internal/pkg/reload/strategy_test.go | Unit tests for reload strategies (EnvVar and Annotation) |
| internal/pkg/reload/strategy.go | Reload strategy implementations for triggering workload restarts |
| internal/pkg/reload/service_test.go | Comprehensive tests for reload service orchestration logic |
| internal/pkg/reload/service.go | Service orchestrating reload logic for ConfigMaps and Secrets |
| internal/pkg/reload/resource_type_test.go | Tests for resource type enumeration |
| internal/pkg/reload/resource_type.go | Resource type definitions (ConfigMap, Secret) |
| internal/pkg/reload/predicate_test.go | Tests for event filtering predicates |
| internal/pkg/reload/predicate.go | Event filtering predicates for controller-runtime |
| internal/pkg/reload/pause_test.go | Tests for deployment pause handling logic |
| internal/pkg/reload/pause.go | Deployment pause/resume handler implementation |
| internal/pkg/reload/matcher_test.go | Tests for annotation matching and reload decision logic |
| internal/pkg/reload/matcher.go | Matcher determining if workloads should be reloaded based on annotations |
| internal/pkg/reload/hasher_test.go | Tests for ConfigMap/Secret content hashing |
| internal/pkg/reload/hasher.go | SHA1 hasher for detecting ConfigMap/Secret changes (see the sketch after this table) |
| internal/pkg/reload/decision_test.go | Tests for reload decision filtering |
| internal/pkg/reload/decision.go | Reload decision data structures |
| internal/pkg/reload/change.go | Resource change event types and interfaces |
| internal/pkg/options/flags.go | Removed legacy flags file (replaced by config package) |
| internal/pkg/openshift/detect.go | DeploymentConfig API detection for OpenShift |
| internal/pkg/metrics/prometheus_test.go | Tests for Prometheus metrics recording |
| internal/pkg/metrics/prometheus.go | Prometheus metrics collection with namespace tracking |
| internal/pkg/metadata/publisher.go | Metadata ConfigMap publisher for reloader build/config info |
| internal/pkg/metadata/metadata_test.go | Tests for metadata ConfigMap creation |
| internal/pkg/metadata/metadata.go | Metadata structures for build info and configuration |
| internal/pkg/leadership/* | Removed legacy leadership election files |
| internal/pkg/handler/* | Removed legacy handler files (replaced by controller package) |
| internal/pkg/events/recorder_test.go | Tests for Kubernetes event recording |
| internal/pkg/events/recorder.go | Event recorder wrapper for workload reload events |
| internal/pkg/crypto/* | Removed crypto package (functionality moved to reload.Hasher) |
| internal/pkg/controller/test_helpers_test.go | Test helper functions for controller unit tests |
| internal/pkg/controller/secret_reconciler_test.go | Unit tests for Secret reconciler |
| internal/pkg/controller/secret_reconciler.go | Secret reconciler implementation using controller-runtime |
| internal/pkg/constants/enums.go | Removed legacy enum constants |
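The hasher listed above (internal/pkg/reload/hasher.go) is described as a SHA1 hasher for detecting ConfigMap/Secret changes. A minimal sketch of what such content hashing can look like, with the caveat that the function name and exact field handling are assumptions rather than the PR's code:

```go
package reload

import (
	"crypto/sha1"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// HashConfigMap returns a stable SHA1 digest of a ConfigMap's contents so a
// reconciler can tell whether the data actually changed between events.
func HashConfigMap(cm *corev1.ConfigMap) string {
	keys := make([]string, 0, len(cm.Data)+len(cm.BinaryData))
	for k := range cm.Data {
		keys = append(keys, k)
	}
	for k := range cm.BinaryData {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for determinism

	h := sha1.New()
	for _, k := range keys {
		h.Write([]byte(k))
		if v, ok := cm.Data[k]; ok {
			h.Write([]byte(v))
		} else {
			h.Write(cm.BinaryData[k])
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}
```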
@rasheedamir I've added metrics to both the old and new implementations to help see how they hold up against each other. Here is a list of the metrics and the scenarios that I've tested:

Metrics

Scenarios

I'll share the results of each scenario in a separate comment.
S1–S8: per-scenario results (posted as screenshots).
To summarize the test results above:

S1 (Burst): Fires 141 ConfigMap/Secret updates in quick succession to test event handling under load. The new implementation handles the burst with half the reconciles and roughly 70% fewer calls to the API server, especially GET calls, thanks to how controller-runtime caches objects.

S2 (Fan-out): Creates 1 ConfigMap referenced by 50 deployments, then updates it 8 times (expecting 400 total reloads). This one is interesting: the old/current implementation missed a lot of reloads (241 out of 400), while the new implementation reconciled every one of them. The higher API call count here is because it actually reconciled everything, which requires a few more calls. Reconcile time under pressure also went down from 10s in the old implementation to 0.2s while keeping up with everything.

S3 (High cardinality): Spreads 142 updates across 11 different namespaces to test cross-namespace handling. Similar to S1, reconciles were cut in half with fewer API calls, while every reload request was handled properly.

S4 (No-op): Updates ConfigMap annotations without changing the actual data (143 times). The new implementation is much smarter here: it doesn't even enter the reconcile loop for annotation-only changes (0 reconciles vs 186 in the old). Both correctly skip the actual reloads, but the new one doesn't waste any cycles checking, because cached objects are filtered out using predicates.

S5 (Churn): Creates and deletes deployments while firing ConfigMap updates to simulate a busy cluster. The new implementation handles workload churn much better because it doesn't perform any duplicate reloads (the old one did 275 reconciles vs 26 for the new one, which is the expected number). The old implementation seems to get confused when deployments are created/deleted during updates.

S6 (Restart): Kills the Reloader pod mid-test while updates are happening to test recovery. Both survive restarts without issues (of course some events are missed here and there, but the behavior is the same across both).

S7 (API pressure): Fires 131 updates from 10 concurrent goroutines to stress-test the controller. Under heavy load, the new implementation still makes around 70% fewer API calls while reloading everything properly.

S8 (Large objects): Uses 100KB+ ConfigMaps to test handling of large objects. Works fine for both implementations; the new one uses fewer API calls, as expected.

@rasheedamir Let me know if this gives the team enough detail to take a deeper look into the changes and review them.
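For context on the S4 result, predicate-based filtering is a standard controller-runtime mechanism. A minimal sketch (not necessarily the PR's exact predicate) that lets a ConfigMap update through only when its data actually changed:

```go
package reload

import (
	"reflect"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// DataChangedPredicate lets an update through only when the ConfigMap's Data
// or BinaryData actually changed, so annotation-only edits never reach the
// reconcile queue.
func DataChangedPredicate() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCM, okOld := e.ObjectOld.(*corev1.ConfigMap)
			newCM, okNew := e.ObjectNew.(*corev1.ConfigMap)
			if !okOld || !okNew {
				return true // not a ConfigMap; let other filters decide
			}
			return !reflect.DeepEqual(oldCM.Data, newCM.Data) ||
				!reflect.DeepEqual(oldCM.BinaryData, newCM.BinaryData)
		},
	}
}
```

Such a predicate would typically be attached with WithEventFilter or builder.WithPredicates when the controller is built.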
…ad to de-duplicate a lot of code
@waseem-h Great work! Can you please commit the load tests and add a pipeline step that executes them when the "load-tests" label is added to the PR?
@waseem-h Please make a separate PR for the load tests and commit them to the existing codebase, so we can use that as the reference and merge it.
Issues Fixed
- `--resources-to-ignore=configMaps` (case sensitivity)
- Annotation precedence when `auto: true` is also set

Changes

Bug Fixes
- `configMaps` now matches `configmaps` (case-insensitive resource comparison; sketched below)
- The `paused-at` annotation is tracked, and paused deployments are resumed even after a restart
- `ResourceChange`
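To illustrate the case-sensitivity fix (a sketch only, not the PR's code), the ignore-list comparison needs to be case-insensitive so that `configMaps` in `--resources-to-ignore` matches the canonical `configmaps`:

```go
package main

import (
	"fmt"
	"strings"
)

// isIgnored reports whether resource is in the ignore list, comparing case-insensitively.
func isIgnored(resource string, ignoreList []string) bool {
	for _, ignored := range ignoreList {
		if strings.EqualFold(resource, ignored) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isIgnored("configmaps", []string{"configMaps"})) // true
}
```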
Tests

Test coverage (pulled using `go tool cover`)

No breaking changes
Other improvements
- Uses `viper` for configuration and flag management (sketched below)
- 31MB (from 68MB)
- 67.8MB (from 103MB)
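A minimal sketch of flag and environment handling with viper and pflag (the flag names and struct here are illustrative; the actual config package in this PR will differ):

```go
package config

import (
	"strings"

	"github.com/spf13/pflag"
	"github.com/spf13/viper"
)

// Options holds a couple of illustrative settings only.
type Options struct {
	ResourcesToIgnore []string
	ReloadStrategy    string
}

// Load binds command-line flags and RELOADER_* environment variables through viper.
func Load() (*Options, error) {
	pflag.StringSlice("resources-to-ignore", nil, "resource kinds Reloader should skip")
	pflag.String("reload-strategy", "env-vars", "how workloads are restarted")
	pflag.Parse()

	if err := viper.BindPFlags(pflag.CommandLine); err != nil {
		return nil, err
	}
	viper.SetEnvPrefix("RELOADER")
	viper.SetEnvKeyReplacer(strings.NewReplacer("-", "_")) // e.g. RELOADER_RELOAD_STRATEGY
	viper.AutomaticEnv()

	return &Options{
		ResourcesToIgnore: viper.GetStringSlice("resources-to-ignore"),
		ReloadStrategy:    viper.GetString("reload-strategy"),
	}, nil
}
```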
Architecture

New components
High-level flow comparison
New detailed flow