@TheiLLeniumStudios TheiLLeniumStudios commented Dec 28, 2025

Issues Fixed

Changes

Bug Fixes

  1. Case-insensitive resource comparison - configMaps now matches configmaps
  2. Correct annotation precedence - Explicit reload annotations checked before auto
  3. Persistent pause handling - DeploymentReconciler watches paused-at annotation and resumes paused deployments even after restart
  4. Unified ConfigMap/Secret handling - Both use identical reconcilers and ResourceChange
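As an illustration of fix 1, a case-insensitive comparison can be sketched with `strings.EqualFold`. This is a minimal sketch, not the code in this PR: `sameResource` and `isIgnored` are hypothetical helper names, and the idea that the comparison is applied to an ignore list is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// sameResource reports whether two resource names refer to the same
// resource kind, ignoring case and surrounding whitespace, so that
// "configMaps" matches "configmaps". (Hypothetical helper.)
func sameResource(a, b string) bool {
	return strings.EqualFold(strings.TrimSpace(a), strings.TrimSpace(b))
}

// isIgnored reports whether resource appears in the ignored list,
// using the case-insensitive comparison above. (Hypothetical helper.)
func isIgnored(resource string, ignored []string) bool {
	for _, r := range ignored {
		if sameResource(resource, r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isIgnored("configmaps", []string{"configMaps"}))
	fmt.Println(isIgnored("secrets", []string{"configMaps"}))
}
```

`strings.EqualFold` avoids allocating lowercased copies and handles Unicode case folding, which is why it is usually preferred over comparing `strings.ToLower` results.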

Tests

  • Dedicated e2e tests to test every logical flow
  • Detailed unit tests for every package

Test coverage (pulled using go tool cover)

| Component | Coverage | Test File | Key Test Cases |
|---|---|---|---|
| Matcher | 94.4% | reload/matcher_test.go | Auto/explicit/search annotation, regex patterns, ignore, exclude list, precedence |
| Hasher | 94.4% | reload/hasher_test.go | ConfigMap/Secret hash, deterministic output, empty/nil handling |
| Strategy | 94.4% | reload/strategy_test.go | EnvVar apply/naming, Annotation apply, deletion handling |
| Service | 94.4% | reload/service_test.go | Process ConfigMap/Secret, container detection (volume/env/envFrom/init), multi-workload |
| Predicate | 94.4% | reload/predicate_test.go | Namespace filter, label selector, ignore annotation, hash change detection |
| Pause | 94.4% | reload/pause_test.go | Should pause, get period, apply/check/clear pause |
| Deployment | 63.4% | workload/deployment_test.go | Getters, annotations, volumes, uses ConfigMap/Secret (all patterns) |
| DaemonSet | 63.4% | workload/daemonset_test.go | Getters, annotations, volumes, uses ConfigMap/Secret |
| StatefulSet | 63.4% | workload/statefulset_test.go | Getters, annotations, volumes, uses ConfigMap/Secret |
| Job | 63.4% | workload/job_test.go | Getters, uses ConfigMap/Secret, deep copy |
| CronJob | 63.4% | workload/cronjob_test.go | Getters, uses ConfigMap/Secret, deep copy |
| Rollout | 63.4% | workload/rollout_test.go | Getters, strategy (restart/rollout), uses ConfigMap/Secret |
| Registry | 63.4% | workload/registry_test.go | With/without Argo, ListerFor, FromObject all types |
| ConfigMapReconciler | 56.0% | controller/configmap_reconciler_test.go | Not found, ignored namespace, auto/explicit, volume/projected/search |
| SecretReconciler | 56.0% | controller/secret_reconciler_test.go | Same as ConfigMap + ServiceAccount token ignore |
| NamespaceReconciler | 56.0% | controller/namespace_reconciler_test.go | Add/remove cache, label changes, multiple selectors |
| Retry | 56.0% | controller/retry_test.go | All workload types, both strategies, conflict handling |
| Filter | 56.0% | controller/filter_test.go | Create event predicate, combined filters |
| Config | 97.2% | config/config_test.go | Defaults, validation, normalization |
| Flags | 97.2% | config/flags_test.go | All flags, env vars, boolean strings, comma lists, selectors |
| Alerter | 86.1% | alerting/alerter_test.go | All 4 sinks, structured JSON, webhook errors |
| Webhook | 90.5% | webhook/webhook_test.go | Payload format, HTTP errors |
| Events | 100% | events/recorder_test.go | Success/failure event recording |
| Metrics | 68.4% | metrics/prometheus_test.go | Counter increments, namespace metrics |
| Metadata | 85.4% | metadata/metadata_test.go | Meta-info ConfigMap creation |
| E2E | - | test/e2e/e2e_test.go | ConfigMap/Secret update, auto-reload-all, DaemonSet/StatefulSet reload, label-only change, multiple ConfigMaps, auto+explicit annotation, auto disabled |

No breaking changes

  • All existing annotations work the same way
  • CLI flags remain compatible

Other improvements

  • Migrated from logrus to zerolog + logr for standardized logging
  • Upgraded all go packages to latest versions
  • Added support for structured output for Raw alerts
  • Used viper for configuration and flag management
  • Graceful shutdown
  • Reloader binary size reduced to 31MB (from 68MB)
  • Container image size reduced to 67.8MB (from 103MB)

Architecture

  • Migrated from custom informer controller to controller-runtime reconcilers
  • Dedicated reconcilers: ConfigMapReconciler, SecretReconciler, DeploymentReconciler, NamespaceReconciler
  • Uses controller-runtime's built-in leader election
  • Cleaner separation: config → reload (matcher/hasher/strategy) → workload

New components

| Component | Package | Responsibility | Replaces |
|---|---|---|---|
| Matcher | reload | Evaluates annotations to decide if workload should reload | common.ContainsRecursive() scattered logic |
| Hasher | reload | Computes SHA hash of ConfigMap/Secret content | util.GetSHAFromConfigmap/Secret() |
| Strategy | reload | Applies reload (EnvVar or Annotation) to pod template | handler.updateContainerEnvVars/Annotations() |
| Service | reload | Orchestrates reload: list workloads → match → decide | handler.doRollingUpgrade() |
| Predicate | reload | Filters events by hash change, namespace, labels | Inline checks in handler/*.go |
| WorkloadAccessor | workload | Unified interface for all workload types | callbacks.ItemsFunc type-specific callbacks |
| Registry | workload | Factory for creating workload accessors | handler.getWorkloadItems() switch statement |
| Lister | workload | Lists all workloads in a namespace | Inline listing in handler.doRollingUpgrade() |
| ConfigMapReconciler | controller | Watches ConfigMaps, triggers reload on change | controller.Start() + informer handlers |
| SecretReconciler | controller | Watches Secrets, triggers reload on change | controller.Start() + informer handlers |
| DeploymentReconciler | controller | Watches paused deployments, resumes after period | handler.checkMissingPauseTimers() in-memory |
| NamespaceReconciler | controller | Caches namespaces matching label selectors | common.IsNamespaceAllowed() per-request |
| ReloadHandler | controller | Common reload workflow for both reconcilers | handler.{Create,Update,Delete} handlers |
| Config | config | Single struct for all configuration | options.Collectors + scattered globals |
| Alerter | alerting | Interface + implementations for alert sinks | alerts.SendAlert() monolithic function |
| Recorder | events | Records K8s events on workloads | handler.sendEvent() |
| Publisher | metadata | Publishes reloader-meta-info ConfigMap | common.CreateMetaInfo() |
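To make the Hasher's "deterministic output" requirement concrete, here is a minimal sketch of order-independent hashing of ConfigMap-style data. It uses SHA-1 because the file summary in this PR describes the hasher that way; the serialization format (sorted keys, length-prefixed fields) and the name `hashData` are assumptions for illustration, not the format the PR uses.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashData returns a hex-encoded SHA-1 digest of a ConfigMap-like data map.
// Keys are sorted first so the digest does not depend on Go's randomized
// map iteration order. (Hypothetical helper; format is an assumption.)
func hashData(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha1.New()
	for _, k := range keys {
		// Length-prefix each field so that ("ab", "c") and ("a", "bc")
		// cannot serialize to the same byte stream.
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string]string{"app.properties": "x=1", "log.level": "info"}
	b := map[string]string{"log.level": "info", "app.properties": "x=1"}
	fmt.Println(hashData(a) == hashData(b)) // same content, same hash
	fmt.Println(hashData(nil) == hashData(map[string]string{}))
}
```

Treating nil and empty maps identically keeps "empty/nil handling" from producing a spurious hash change.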

High-level flow comparison

  OLD:                                NEW:
  ┌─────────────────────┐                  ┌─────────────────────┐
  │ controller.Start()  │                  │ controller-runtime  │
  │ (custom informers)  │                  │ Manager             │
  └──────────┬──────────┘                  └──────────┬──────────┘
             │                                        │
             ▼                                        ▼
  ┌─────────────────────┐                  ┌─────────────────────┐
  │ handler.Create/     │                  │ ConfigMapReconciler │
  │ Update/Delete()     │                  │ SecretReconciler    │
  │ (one big file)      │                  │ DeploymentReconciler│
  └──────────┬──────────┘                  └──────────┬──────────┘
             │                                        │
             ▼                                        ▼
  ┌─────────────────────┐                  ┌─────────────────────┐
  │ handler.            │                  │ ReloadHandler       │
  │ doRollingUpgrade()  │                  │ (common workflow)   │
  │ (800+ lines)        │                  └──────────┬──────────┘
  └──────────┬──────────┘                             │
             │                             ┌──────────┴──────────┐
             │                             ▼                     ▼
             │                  ┌─────────────────┐   ┌─────────────────┐
             │                  │ reload.Service  │   │ reload.Matcher  │
             │                  │ (orchestration) │   │ (decisions)     │
             │                  └────────┬────────┘   └─────────────────┘
             │                           │
             ▼                           ▼
  ┌─────────────────────┐     ┌─────────────────────┐
  │ callbacks.ItemsFunc │     │ workload.Accessor   │
  │ (per-type callbacks)│     │ (unified interface) │
  └─────────────────────┘     └─────────────────────┘

New detailed flow

  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                         ConfigMap or Secret Updated                          │
  └──────────────────────────────────┬──────────────────────────────────────────┘
                                     │
                                     ▼
                      ┌──────────────────────────┐
                      │   Event Filter (Filter)   │
                      │  ┌────────────────────┐  │
                      │  │ Namespace ignored? │──┼──Yes──► Skip
                      │  └─────────┬──────────┘  │
                      │            │ No          │
                      │  ┌─────────▼──────────┐  │
                      │  │ Resource selectors │──┼──No match──► Skip
                      │  │     match?         │  │
                      │  └─────────┬──────────┘  │
                      │            │ Yes         │
                      │  ┌─────────▼──────────┐  │
                      │  │  Hash changed?     │──┼──No──► Skip (no-op)
                      │  └─────────┬──────────┘  │
                      └────────────┼─────────────┘
                                   │ Yes
                                   ▼
                      ┌──────────────────────────┐
                      │     Reconciler.Reconcile │
                      └────────────┬─────────────┘
                                   │
                                   ▼
                      ┌──────────────────────────┐
                      │  ReloadHandler.Process   │
                      └────────────┬─────────────┘
                                   │
                ┌──────────────────┴──────────────────┐
                ▼                                     ▼
     ┌────────────────────┐               ┌────────────────────┐
     │  Lister.List()     │               │ ReloadService      │
     │  Get all workloads │               │ .Process()         │
     │  in namespace      │               │                    │
     └─────────┬──────────┘               └─────────┬──────────┘
               │                                    │
               └──────────────┬─────────────────────┘
                              ▼
                ┌──────────────────────────┐
                │   For each workload:     │
                │   Matcher.ShouldReload() │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │   Filter Decisions       │
                │   (ShouldReload = true)  │
                └────────────┬─────────────┘
                             │
             ┌───────────────┴───────────────┐
             │                               │
             ▼                               ▼
  ┌─────────────────────┐         ┌─────────────────────┐
  │ Webhook configured? │──Yes──► │  Send webhook       │
  └──────────┬──────────┘         │  notification       │
             │ No                 └─────────────────────┘
             ▼
  ┌─────────────────────┐
  │  Apply Reloads      │
  │  (update workloads) │
  └──────────┬──────────┘
             │
             ▼
  ┌─────────────────────┐
  │ For each decision:  │
  │ • Update workload   │
  │ • Record metrics    │
  │ • Send event        │
  │ • Send alert        │
  └─────────────────────┘
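The Event Filter stage in the diagram above (namespace ignored → selector match → hash changed) can be sketched as an ordered chain of checks that stops at the first skip. This is a dependency-free illustration only: the `event` and `filter` types and all function names are hypothetical, and the PR itself builds on controller-runtime predicates rather than these types.

```go
package main

import "fmt"

// event is a hypothetical, simplified stand-in for an update event.
type event struct {
	Namespace string
	Labels    map[string]string
	OldHash   string
	NewHash   string
}

// filter returns "" to let the event through, or a skip reason.
type filter func(e event) string

// ignoreNamespaces skips events from any of the listed namespaces.
func ignoreNamespaces(ignored ...string) filter {
	set := make(map[string]bool, len(ignored))
	for _, ns := range ignored {
		set[ns] = true
	}
	return func(e event) string {
		if set[e.Namespace] {
			return "namespace ignored"
		}
		return ""
	}
}

// requireLabel skips events whose resource lacks the given label value.
func requireLabel(key, value string) filter {
	return func(e event) string {
		if e.Labels[key] != value {
			return "selector mismatch"
		}
		return ""
	}
}

// hashChanged skips no-op updates where the content hash is unchanged.
func hashChanged() filter {
	return func(e event) string {
		if e.OldHash == e.NewHash {
			return "hash unchanged"
		}
		return ""
	}
}

// shouldReconcile applies filters in order and stops at the first skip.
func shouldReconcile(e event, filters ...filter) (bool, string) {
	for _, f := range filters {
		if reason := f(e); reason != "" {
			return false, reason
		}
	}
	return true, ""
}

func main() {
	chain := []filter{
		ignoreNamespaces("kube-system"),
		requireLabel("team", "payments"),
		hashChanged(),
	}
	e := event{
		Namespace: "prod",
		Labels:    map[string]string{"team": "payments"},
		OldHash:   "abc",
		NewHash:   "abc",
	}
	ok, reason := shouldReconcile(e, chain...)
	fmt.Println(ok, reason)
}
```

Putting the cheapest checks first (namespace, labels) means the hash comparison only runs for events that could actually lead to a reload.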

…d improve ldflags injection, fix defer resp.Body.Close() usage, replace os.Setenv with t.Setenv in tests, correct error message casing, and adjust Dockerfile and Makefile for cmd/reloader structure
@rasheedamir
Member

rasheedamir commented Dec 28, 2025

@waseem-h Can you please add a "Functionality Parity Matrix", i.e. a table of every behavior the old controller had, marking how it's implemented now and how it's tested?

@TheiLLeniumStudios
Author

@rasheedamir sure, I'll update the description to include that

@TheiLLeniumStudios
Author

@rasheedamir added test coverage information and details about the new components and what they replaced

@rasheedamir
Member

@waseem-h can't the coverage be 100% for all components? And is the core business logic separated out into small unit testable pieces of code?

@TheiLLeniumStudios
Author

@rasheedamir the original coverage was less than 50% for all of the code, and most components didn't even have tests. The new tests cover all the scenarios and the functions that are critical, but not the getters, setters, or constructor functions, which is why coverage is not 100%.
As for the separation, the entire refactor is done so that each core piece of logic is split into a smaller package with its own tests, so we have much better test coverage than before. I'll check whether there are more test cases we need and add them if so.

@rasheedamir
Member

rasheedamir commented Dec 28, 2025

@TheiLLeniumStudios Can we add a minimal load/stress test suite as part of this migration? Since Reloader is widely adopted and used by enterprise customers, we should validate that moving to controller-runtime reconcilers doesn’t regress performance or stability. A small, repeatable test bed would let us run identical scenarios against both the legacy implementation and the new one and compare key metrics (throughput, reconcile latency, API call volume, resource usage, and error rate) as an objective release gate.

A good stress/load test for this migration should answer two questions:

  1. Does it still do the right thing under load? (no missed / duplicate actions)
  2. Does it stay stable and “API-friendly”? (no hot loops, no API storms, predictable CPU/mem)

Below is a concrete design + criteria you can implement as a repeatable “test bed” and run A/B (old controller vs new controller).

What to measure (the “contract”)

Correctness criteria

  • No missed actions: every intended target gets restarted/updated exactly when expected.
  • No duplicate actions: a burst of source updates should not cause multiple restarts for the same workload unless intended.
  • Idempotency: repeating the same update (same data hash) results in no action.
  • Ordering invariants (if any): e.g., status/event written after action, finalizer runs once.

Stability criteria

  • No hot loops: reconcile rate returns to baseline after the load stops.
  • Bounded error rate: transient errors may happen, but must converge (no stuck requeues).

Performance criteria (A/B comparison)

Track these and compare old vs new:

  • p50/p95/p99 “source update → action applied” latency
  • Actions/sec throughput (e.g., restarts per second)
  • Reconcile/sec (per controller + per key)
  • Kubernetes API call volume (GET/LIST/WATCH/PATCH/UPDATE) — ideally from apiserver metrics or client metrics
  • Controller CPU/memory steady-state and peak
  • Workqueue depth / rate-limiter behavior (if exposed)

Acceptance style:

  • “New is not worse than old by more than X%” (typical: ≤10–20% on p95 latency and CPU; ≤10% on API call volume; 0 correctness regressions).

Test bed design (repeatable)

Environment

  • Kind/k3d with real controller pods + Prometheus scraping (better)
  • OpenShift staging cluster for final gate (best for real enterprise environments)

Make it deterministic:

  • fixed seed for randomization
  • fixed resource counts + step durations
  • record controller version + config

Objects generator (load driver)

Write a small Go tool (or k6 + kubernetes client wrapper) that can:

  • Create N namespaces
  • Create M workloads per namespace (Deployments/StatefulSets/DaemonSets)
  • Create K ConfigMaps/Secrets and link them to workloads via labels/annotations your controller uses
  • Apply update patterns (below) at controlled rates

Core stress scenarios (we want 6–8)

S1 — Burst updates on one hot source

Goal: detect duplicate restarts, queue storms, hot loops.

  • 1 ConfigMap → 500 workloads
  • Update CM 50 times in 10 seconds
  • Expected: either (a) 1 restart per workload total if you dedupe by hash/time window, or (b) at least bounded restarts per your spec.
  • Measure: duplicates, p95 latency, reconcile/sec spikes, API calls/sec.

S2 — Many sources, moderate fan-out

Goal: typical enterprise pattern.

  • 200 ConfigMaps + 200 Secrets, each affecting 5–20 workloads
  • Update rate: 5 updates/sec for 5 minutes
  • Expected: stable throughput, no backlog growth, no increasing latency trend.

S3 — High cardinality namespaces/tenants

Goal: cache/index scalability + multi-tenant behavior.

  • 200 namespaces × 20 workloads = 4,000 workloads
  • Periodic updates across namespaces
  • Expected: no memory blow-up, list calls don’t explode, per-namespace isolation is respected.

S4 — Metadata-only updates and no-op updates

Goal: predicate correctness + API friendliness.

  • Apply label/annotation changes that should not trigger actions
  • Re-apply same CM data (same hash) repeatedly
  • Expected: 0 actions, minimal reconcile churn.

S5 — Delete/recreate churn

Goal: tombstone/finalizer correctness.

  • Delete and recreate workloads and sources while updates are happening
  • Expected: no stuck finalizers, no panics, no leaked state, no “ghost” restarts.

S6 — Controller restart during load

Goal: resilience, missed-event recovery.

  • Run S2; restart controller pod mid-stream (or scale down/up)
  • Expected: system converges; any missed updates are reconciled via resync/drift logic (if applicable).

S7 — API server pressure / transient failures

Goal: backoff + retry correctness.

  • Inject failures (network policies, apiserver QPS limits, or deliberate 429/5xx simulation if possible)
  • Expected: bounded error rate, no tight requeue loops, eventual convergence.

S8 — Large object size / big secrets/configmaps

Goal: serialization/patch overhead.

  • CM/Secret sizes near typical limits (within K8s constraints)
  • Update at moderate rate
  • Expected: latency increases but remains bounded; no memory spikes.

Instrumentation (so results are objective)

Minimum instrumentation

  • Controller logs with a correlation ID per event (optional)
  • Custom Prometheus metrics:
    • reconcile_total{result=success|error}
    • reconcile_duration_seconds histogram
    • actions_total{type=restart, kind=deployment,...}
    • action_latency_seconds histogram (“source change observed → patch applied”)
    • dedupe_hits_total (if you dedupe)
  • Capture controller resource usage (CPU/mem) from Kubernetes metrics

Kubernetes API call volume

Best sources:

  • apiserver metrics (apiserver_request_total, latency histograms)
  • client-go metrics (if enabled)
  • audit logs (heavier; use only if needed)

Reporting format (A/B comparable)

For each scenario produce a small JSON + markdown summary:

  • Scenario params: namespaces/workloads/sources/update rate/duration
  • Correctness: missed=0, duplicates=0, unexpected=0
  • Latency: p50/p95/p99
  • Throughput: actions/sec
  • Controller CPU/mem: peak + steady
  • API calls/sec: GET/LIST/WATCH/PATCH/UPDATE totals

Then do a diff: old vs new.
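A possible shape for the per-scenario JSON summary and the old-vs-new diff is sketched below. All field names and JSON keys are assumptions chosen to mirror the bullets above; they are not an agreed schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Latency holds end-to-end action latency percentiles in seconds.
type Latency struct {
	P50 float64 `json:"p50"`
	P95 float64 `json:"p95"`
	P99 float64 `json:"p99"`
}

// Report is a hypothetical per-scenario summary for one controller version.
type Report struct {
	Scenario      string         `json:"scenario"`
	Version       string         `json:"version"`
	Missed        int            `json:"missed"`
	Duplicates    int            `json:"duplicates"`
	Unexpected    int            `json:"unexpected"`
	Latency       Latency        `json:"latency_seconds"`
	ActionsPerSec float64        `json:"actions_per_sec"`
	APICalls      map[string]int `json:"api_calls"` // keyed by verb: GET, LIST, ...
}

// deltaPercent returns the relative change from oldVal to newVal in percent.
// It returns 0 when oldVal is 0 to avoid dividing by zero.
func deltaPercent(oldVal, newVal float64) float64 {
	if oldVal == 0 {
		return 0
	}
	return (newVal - oldVal) * 100 / oldVal
}

// encode renders a report as indented JSON.
func encode(r Report) (string, error) {
	b, err := json.MarshalIndent(r, "", "  ")
	return string(b), err
}

// decode parses a report previously produced by encode.
func decode(s string) (Report, error) {
	var r Report
	err := json.Unmarshal([]byte(s), &r)
	return r, err
}

func main() {
	oldR := Report{Scenario: "S1", Version: "old", Latency: Latency{P95: 0.010},
		APICalls: map[string]int{"GET": 700, "PATCH": 270}}
	newR := Report{Scenario: "S1", Version: "new", Latency: Latency{P95: 0.009},
		APICalls: map[string]int{"GET": 25, "PATCH": 270}}

	out, err := encode(newR)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	fmt.Printf("p95 latency change: %+.1f%%\n", deltaPercent(oldR.Latency.P95, newR.Latency.P95))
}
```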

Suggested pass/fail gates (practical defaults)

You can tune these, but start here:

  • Correctness: missed = 0, duplicates = 0 (or within explicitly documented dedupe semantics)
  • No hot loop: reconcile/sec returns within 2× baseline within 60s after load stops
  • Latency regression: p95 action latency ≤ +20% vs old
  • API calls regression: total API calls ≤ +15% vs old; LIST calls should not balloon
  • Resource regression: CPU ≤ +20%, memory ≤ +15% vs old under same scenario
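The gates above translate fairly directly into a pass/fail check. The following is a minimal sketch using the suggested default thresholds; the `Metrics` struct, its field names, and the `evaluate` function are hypothetical, and real inputs would come from the collected scenario reports.

```go
package main

import "fmt"

// Metrics holds the per-scenario numbers compared between versions.
// Field names are illustrative, not taken from the PR.
type Metrics struct {
	Missed, Duplicates    int
	LatencyP95Seconds     float64
	APICalls              float64
	CPUCores              float64
	MemoryBytes           float64
	BaselineReconcileRate float64 // reconciles/sec before load
	PostLoadReconcileRate float64 // reconciles/sec 60s after load stops
}

// exceeds reports whether newVal is more than maxIncrease (e.g. 0.20 for
// +20%) above oldVal.
func exceeds(oldVal, newVal, maxIncrease float64) bool {
	return newVal > oldVal*(1+maxIncrease)
}

// evaluate returns the names of the failed gates; an empty result means pass.
func evaluate(oldM, newM Metrics) []string {
	var failed []string
	if newM.Missed != 0 || newM.Duplicates != 0 {
		failed = append(failed, "correctness")
	}
	if newM.PostLoadReconcileRate > 2*newM.BaselineReconcileRate {
		failed = append(failed, "hot loop")
	}
	if exceeds(oldM.LatencyP95Seconds, newM.LatencyP95Seconds, 0.20) {
		failed = append(failed, "p95 latency")
	}
	if exceeds(oldM.APICalls, newM.APICalls, 0.15) {
		failed = append(failed, "API calls")
	}
	if exceeds(oldM.CPUCores, newM.CPUCores, 0.20) {
		failed = append(failed, "CPU")
	}
	if exceeds(oldM.MemoryBytes, newM.MemoryBytes, 0.15) {
		failed = append(failed, "memory")
	}
	return failed
}

func main() {
	// Hypothetical numbers for illustration only.
	oldM := Metrics{LatencyP95Seconds: 1.0, APICalls: 1000, CPUCores: 0.5,
		MemoryBytes: 100e6, BaselineReconcileRate: 1, PostLoadReconcileRate: 1}
	newM := oldM
	newM.LatencyP95Seconds = 1.5
	fmt.Println(evaluate(oldM, newM))
}
```

Returning the list of failed gates rather than a single boolean lets the report say which regression blocked the release.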

Implementation approach

  • A hack/loadtest/ folder with:
    • generator tool (creates objects + applies update patterns)
    • runner script (deploy controller version A, run scenarios, collect metrics; repeat for version B)
    • prometheus scrape config + report generator
  • CI runs a small subset (S1 + S4 + S6 with reduced scale).
  • Nightly / pre-release runs full suite at realistic scale.

Copilot AI left a comment

Pull request overview

This PR represents a major architectural refactoring of the Reloader project, migrating from custom informer controllers to controller-runtime reconcilers while fixing multiple critical bugs. The refactoring improves code maintainability, test coverage, and addresses issues with case sensitivity, annotation precedence, deployment pause handling, and resource detection.

Key Changes:

  • Migrated from custom informers to controller-runtime with dedicated reconcilers (ConfigMap, Secret, Deployment, Namespace)
  • Fixed case-insensitive resource comparison, annotation precedence bugs, and persistent pause handling
  • Achieved ~94.4% test coverage for core reload logic and ~56% for controllers with comprehensive unit and e2e tests

Reviewed changes

Copilot reviewed 85 out of 119 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| internal/pkg/testutil/fixtures.go | New test utility fixtures for creating test workload objects |
| internal/pkg/reload/strategy_test.go | Unit tests for reload strategies (EnvVar and Annotation) |
| internal/pkg/reload/strategy.go | Reload strategy implementations for triggering workload restarts |
| internal/pkg/reload/service_test.go | Comprehensive tests for reload service orchestration logic |
| internal/pkg/reload/service.go | Service orchestrating reload logic for ConfigMaps and Secrets |
| internal/pkg/reload/resource_type_test.go | Tests for resource type enumeration |
| internal/pkg/reload/resource_type.go | Resource type definitions (ConfigMap, Secret) |
| internal/pkg/reload/predicate_test.go | Tests for event filtering predicates |
| internal/pkg/reload/predicate.go | Event filtering predicates for controller-runtime |
| internal/pkg/reload/pause_test.go | Tests for deployment pause handling logic |
| internal/pkg/reload/pause.go | Deployment pause/resume handler implementation |
| internal/pkg/reload/matcher_test.go | Tests for annotation matching and reload decision logic |
| internal/pkg/reload/matcher.go | Matcher determining if workloads should be reloaded based on annotations |
| internal/pkg/reload/hasher_test.go | Tests for ConfigMap/Secret content hashing |
| internal/pkg/reload/hasher.go | SHA1 hasher for detecting ConfigMap/Secret changes |
| internal/pkg/reload/decision_test.go | Tests for reload decision filtering |
| internal/pkg/reload/decision.go | Reload decision data structures |
| internal/pkg/reload/change.go | Resource change event types and interfaces |
| internal/pkg/options/flags.go | Removed legacy flags file (replaced by config package) |
| internal/pkg/openshift/detect.go | DeploymentConfig API detection for OpenShift |
| internal/pkg/metrics/prometheus_test.go | Tests for Prometheus metrics recording |
| internal/pkg/metrics/prometheus.go | Prometheus metrics collection with namespace tracking |
| internal/pkg/metadata/publisher.go | Metadata ConfigMap publisher for reloader build/config info |
| internal/pkg/metadata/metadata_test.go | Tests for metadata ConfigMap creation |
| internal/pkg/metadata/metadata.go | Metadata structures for build info and configuration |
| internal/pkg/leadership/* | Removed legacy leadership election files |
| internal/pkg/handler/* | Removed legacy handler files (replaced by controller package) |
| internal/pkg/events/recorder_test.go | Tests for Kubernetes event recording |
| internal/pkg/events/recorder.go | Event recorder wrapper for workload reload events |
| internal/pkg/crypto/* | Removed crypto package (functionality moved to reload.Hasher) |
| internal/pkg/controller/test_helpers_test.go | Test helper functions for controller unit tests |
| internal/pkg/controller/secret_reconciler_test.go | Unit tests for Secret reconciler |
| internal/pkg/controller/secret_reconciler.go | Secret reconciler implementation using controller-runtime |
| internal/pkg/constants/enums.go | Removed legacy enum constants |

@TheiLLeniumStudios
Author

@rasheedamir I've added metrics to both the old and new implementations so we can see how they hold up against each other. Here is a list of the metrics and the scenarios that I've tested:

Metrics

| Metric | Description | Better Value |
|---|---|---|
| reloader_reconcile_total | Number of reconciliation loops executed | Lower = better event batching/deduplication |
| reloader_reconcile_duration_seconds | Time spent in each reconcile loop (p50/p95/p99) | Lower = faster processing |
| reloader_action_total | Reload actions triggered on workloads | Should match expected = correct behavior |
| reloader_action_latency_seconds | End-to-end time from config change to workload restart (p50/p95/p99) | Lower = faster user-facing response |
| reloader_reload_executed_total | Successful vs failed reload operations | Higher success rate = more reliable |
| reloader_workloads_scanned_total | Workloads checked for ConfigMap/Secret references | Informational |
| reloader_workloads_matched_total | Workloads that matched and needed reload | Should match action_total |
| reloader_errors_total | Total errors encountered | Zero = no issues |
| reloader_skipped_total | Updates skipped (e.g., no data change) | Informational |
| rest_client_requests_total | Total API server calls | Lower = less API server load |
| rest_client_requests_get | GET requests to API server | Lower = better caching |
| rest_client_requests_patch | PATCH requests (workload restarts) | Should correlate with reloads |
| rest_client_requests_errors | Failed API requests | Zero = no API issues |

Scenarios

| ID | Scenario | What it Tests |
|---|---|---|
| S1 | Burst Updates | 100 rapid ConfigMap/Secret updates |
| S2 | Fan-Out | 1 ConfigMap referenced by 50 Deployments |
| S3 | High Cardinality | Resources spread across 10 namespaces |
| S4 | No-Op Updates | Annotation-only changes (no data change) |
| S5 | Workload Churn | Deployments created/deleted during updates |
| S6 | Controller Restart | Controller pod killed mid-test |
| S7 | API Pressure | 20 concurrent update bursts |
| S8 | Large Objects | ConfigMaps > 100KB |

I'll share results of each scenario in a separate comment

@TheiLLeniumStudios
Author

S1:

================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S1
Generated:    2026-01-04 16:16:52
Status:       PASS
Test:         S1: 141 burst updates, each triggers 1 deployment reload

--------------------------------------------------------------------------------
                           EXPECTED VALUES
--------------------------------------------------------------------------------
Expected Action Total:          141
Expected Reload Executed Total: 141

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                           282          140          -     -     -     pass
reconcile_duration_p50 (s)             0.0095       0.0034          -     -     -     pass
reconcile_duration_p95 (s)             0.0240       0.0087          -     -     -     pass
reconcile_duration_p99 (s)             0.0388       0.0099          -     -     -     pass
action_total                              141          140        141     ✓     ✓     pass
action_latency_p50 (s)                 0.0051       0.0050          -     -     -     pass
action_latency_p95 (s)                 0.0096       0.0096          -     -     -     pass
action_latency_p99 (s)                 0.0232       0.0100          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                     141          140        141     ✓     ✓     pass
workloads_scanned_total                  1410         1400          -     -     -     pass
workloads_matched_total                   141          140        141     ✓     ✓     pass
rest_client_requests_total                992          304          -     -     -     pass
rest_client_requests_get                  710           24          -     -     -     pass
rest_client_requests_patch                272          270          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 1            8          -     -     -     pass
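For readers of these reports, the Old✓/New✓ columns ("meets expected value within 15%") can be read as a relative-tolerance check. The sketch below is an assumption about how such a check might be written, not the report tool's actual code; `meetsExpected` is a hypothetical name. It is consistent with the rows shown in this thread: 140 against an expected 141 passes, while 241 against an expected 400 does not.

```go
package main

import (
	"fmt"
	"math"
)

// meetsExpected reports whether actual is within the given relative
// tolerance (e.g. 0.15 for 15%) of expected. With an expected value of
// zero, only an actual value of zero passes.
func meetsExpected(actual, expected, tolerance float64) bool {
	if expected == 0 {
		return actual == 0
	}
	return math.Abs(actual-expected)/math.Abs(expected) <= tolerance
}

func main() {
	fmt.Println(meetsExpected(140, 141, 0.15)) // S1 new action_total vs expected
	fmt.Println(meetsExpected(241, 400, 0.15)) // S2 old action_total vs expected
}
```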

@TheiLLeniumStudios
Author

S2:

================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S2
Generated:    2026-01-04 16:22:49
Status:       WARN
Summary:      2 metrics with warning: rest_client_requests_total, rest_client_requests_patch
Test:         S2: 8 updates × 50 deployments = 400 expected reloads

--------------------------------------------------------------------------------
                           EXPECTED VALUES
--------------------------------------------------------------------------------
Expected Action Total:          400
Expected Reload Executed Total: 400

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                             8            8          -     -     -     pass
reconcile_duration_p50 (s)            10.0000       0.2070          -     -     -     pass
reconcile_duration_p95 (s)            10.0000       0.4722          -     -     -     pass
reconcile_duration_p99 (s)            10.0000       0.7340          -     -     -     pass
action_total                              241          400        400     ✗     ✓     pass
action_latency_p50 (s)                 0.1329       0.0051          -     -     -     pass
action_latency_p95 (s)                 0.4375       0.0098          -     -     -     pass
action_latency_p99 (s)                 3.4375       0.0344          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                     243          400        400     ✗     ✓     pass
workloads_scanned_total                   250          400        400     ✗     ✓     pass
workloads_matched_total                   200          400        400     ✗     ✓     pass
rest_client_requests_total                501          824          -     -     -     warn
rest_client_requests_get                   26           24          -     -     -     pass
rest_client_requests_patch                425          750          -     -     -     warn
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 1            8          -     -     -     pass

@TheiLLeniumStudios
Author

S3:

================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S3
Generated:    2026-01-04 16:33:07
Status:       PASS
Test:         S3: 142 updates across 11 namespaces

--------------------------------------------------------------------------------
                           EXPECTED VALUES
--------------------------------------------------------------------------------
Expected Action Total:          142
Expected Reload Executed Total: 142

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                           284          142          -     -     -     pass
reconcile_duration_p50 (s)             0.0108       0.0035          -     -     -     pass
reconcile_duration_p95 (s)             0.0242       0.0090          -     -     -     pass
reconcile_duration_p99 (s)             0.0386       0.0196          -     -     -     pass
action_total                              142          142        142     ✓     ✓     pass
action_latency_p50 (s)                 0.0051       0.0051          -     -     -     pass
action_latency_p95 (s)                 0.0096       0.0096          -     -     -     pass
action_latency_p99 (s)                 0.0226       0.0228          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                     142          142        142     ✓     ✓     pass
workloads_scanned_total                  2840         2840          -     -     -     pass
workloads_matched_total                   142          142          -     -     -     pass
rest_client_requests_total                999          308          -     -     -     pass
rest_client_requests_get                  715           24          -     -     -     pass
rest_client_requests_patch                178          179          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 1            8          -     -     -     pass

@TheiLLeniumStudios
Author

S4:

================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S4
Generated:    2026-01-04 16:37:35
Status:       PASS
Test:         S4: 143 no-op updates, all should be skipped

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                           286            0          -     -     -     pass
reconcile_duration_p50 (s)             0.0005       0.0000          -     -     -     pass
reconcile_duration_p95 (s)             0.0009       0.0000          -     -     -     pass
reconcile_duration_p99 (s)             0.0010       0.0000          -     -     -     pass
action_total                                0            0          -     -     -     pass
action_latency_p50 (s)                 0.0000       0.0000          -     -     -     pass
action_latency_p95 (s)                 0.0000       0.0000          -     -     -     pass
action_latency_p99 (s)                 0.0000       0.0000          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                       0            0          -     -     -     pass
workloads_scanned_total                     0            0          -     -     -     pass
workloads_matched_total                     0            0          -     -     -     pass
rest_client_requests_total                  5           12          -     -     -     pass
rest_client_requests_get                    5           12          -     -     -     pass
rest_client_requests_patch                  0            0          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 1            4          -     -     -     pass

@TheiLLeniumStudios
Author

S5:


================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S5
Generated:    2026-01-04 16:42:10
Status:       PASS
Test:         S5: Churn test - 0 deploys created, 0 deleted, 26 CM updates

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                            62           26          -     -     -     pass
reconcile_duration_p50 (s)             0.6846       0.0033          -     -     -     pass
reconcile_duration_p95 (s)             0.9685       0.0120          -     -     -     pass
reconcile_duration_p99 (s)             0.9937       0.0224          -     -     -     pass
action_total                              275           26          -     -     -     pass
action_latency_p50 (s)                 0.0056       0.0052          -     -     -     pass
action_latency_p95 (s)                 0.1603       0.0099          -     -     -     pass
action_latency_p99 (s)                 0.2321       0.0404          -     -     -     pass
errors_total                                3            0          -     -     -     pass
reload_executed_total                     272           26          -     -     -     pass
workloads_scanned_total                   305           26          -     -     -     pass
workloads_matched_total                   269           26          -     -     -     pass
rest_client_requests_total                695           75          -     -     -     pass
rest_client_requests_get                  148           24          -     -     -     pass
rest_client_requests_patch                434           50          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 4            8          -     -     -     pass

@TheiLLeniumStudios
Author

S6:

================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S6
Generated:    2026-01-04 16:46:10
Status:       PASS
Test:         S6: Restart test - 141 updates during restart

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                           120           58          -     -     -     pass
reconcile_duration_p50 (s)             0.0133       0.0036          -     -     -     pass
reconcile_duration_p95 (s)             0.0242       0.0093          -     -     -     pass
reconcile_duration_p99 (s)             0.0725       0.3561          -     -     -     pass
action_total                               60           58          -     -     -     pass
action_latency_p50 (s)                 0.0051       0.0050          -     -     -     pass
action_latency_p95 (s)                 0.0097       0.0095          -     -     -     pass
action_latency_p99 (s)                 0.0280       0.0099          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                      62           56          -     -     -     pass
workloads_scanned_total                  1200         1151          -     -     -     pass
workloads_matched_total                    60           58          -     -     -     pass
rest_client_requests_total                  0           12          -     -     -     pass
rest_client_requests_get                    0           12          -     -     -     pass
rest_client_requests_patch                  0            0          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 0            4          -     -     -     pass

@TheiLLeniumStudios
Author

S7:

================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S7
Generated:    2026-01-04 16:51:53
Status:       PASS
Test:         S7: 131 concurrent updates from 10 goroutines

--------------------------------------------------------------------------------
                           EXPECTED VALUES
--------------------------------------------------------------------------------
Expected Action Total:          131
Expected Reload Executed Total: 131

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                           276          131          -     -     -     pass
reconcile_duration_p50 (s)             0.0163       0.0039          -     -     -     pass
reconcile_duration_p95 (s)             0.0247       0.0122          -     -     -     pass
reconcile_duration_p99 (s)             0.0476       0.4222          -     -     -     pass
action_total                              138          131        131     ✓     ✓     pass
action_latency_p50 (s)                 0.0050       0.0052          -     -     -     pass
action_latency_p95 (s)                 0.0096       0.0099          -     -     -     pass
action_latency_p99 (s)                 0.0100       0.2035          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                     138          131        131     ✓     ✓     pass
workloads_scanned_total                  6900         6550          -     -     -     pass
workloads_matched_total                   138          131          -     -     -     pass
rest_client_requests_total                971          286          -     -     -     pass
rest_client_requests_get                  695           24          -     -     -     pass
rest_client_requests_patch                228          218          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 1            8          -     -     -     pass

@TheiLLeniumStudios
Author

S8:


================================================================================
                     RELOADER A/B COMPARISON REPORT
================================================================================

Scenario:     S8
Generated:    2026-01-04 16:56:06
Status:       PASS
Test:         S8: 28 large object (100KB) updates

--------------------------------------------------------------------------------
                           EXPECTED VALUES
--------------------------------------------------------------------------------
Expected Action Total:          28
Expected Reload Executed Total: 28

--------------------------------------------------------------------------------
                           METRIC COMPARISONS
--------------------------------------------------------------------------------
(Old✓/New✓ = meets expected value within 15%)

Metric                                    Old          New   Expected  Old✓  New✓   Status
------                                    ---          ---   --------  ----  ----   ------
reconcile_total                            56           28          -     -     -     pass
reconcile_duration_p50 (s)             0.0161       0.0039          -     -     -     pass
reconcile_duration_p95 (s)             0.0248       0.0098          -     -     -     pass
reconcile_duration_p99 (s)             0.0865       0.0870          -     -     -     pass
action_total                               28           28         28     ✓     ✓     pass
action_latency_p50 (s)                 0.0050       0.0052          -     -     -     pass
action_latency_p95 (s)                 0.0095       0.0099          -     -     -     pass
action_latency_p99 (s)                 0.0099       0.0870          -     -     -     pass
errors_total                                0            0          -     -     -     pass
reload_executed_total                      28           28         28     ✓     ✓     pass
workloads_scanned_total                   280          280          -     -     -     pass
workloads_matched_total                    28           28          -     -     -     pass
rest_client_requests_total                201           80          -     -     -     pass
rest_client_requests_get                  145           24          -     -     -     pass
rest_client_requests_patch                 46           47          -     -     -     pass
rest_client_requests_put                    0            0          -     -     -     pass
rest_client_requests_errors                 1            8          -     -     -     pass

@TheiLLeniumStudios
Author

To summarize the test results above:

S1 (Burst)

Fires 141 ConfigMap/Secret updates in quick succession to test event handling under load.

The new implementation handles burst updates with half the reconciles and about 70% fewer calls to the API server, mainly because of the reduction in GET calls that comes from how controller-runtime caches objects.

S2 (Fan-out)

Creates 1 ConfigMap referenced by 50 deployments, then updates it 8 times (expecting 400 total reloads).

This one is interesting: the old (current) implementation missed a large share of the reloads (it performed only 241 of the 400), while the new implementation reconciled every one of them. The higher number of API calls here (824 vs. 501) follows from that: it completed all 400 reloads, which requires more PATCH calls. The p99 reconcile time also went down from 10s under pressure in the old implementation to about 0.73s, while keeping up with everything.
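The pass/fail marks in the S2 report can be reproduced with a small sketch. It assumes that "within 15%" means a relative difference, `|actual - expected| / expected <= 0.15`; the exact formula used by the test harness is not shown in this thread, and `withinTolerance` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether actual lies within the given relative
// tolerance of expected. The relative-difference formula is an assumption
// about how the report computes its Old✓/New✓ columns.
func withinTolerance(actual, expected, tol float64) bool {
	if expected == 0 {
		return actual == 0
	}
	return math.Abs(actual-expected)/expected <= tol
}

func main() {
	// S2: 1 ConfigMap referenced by 50 deployments, updated 8 times.
	const deployments, updates = 50, 8
	expected := float64(deployments * updates) // 400 expected reloads

	fmt.Println(withinTolerance(241, expected, 0.15)) // old action_total: false
	fmt.Println(withinTolerance(400, expected, 0.15)) // new action_total: true
}
```

Under the same assumption, the S7 old value of 138 against an expected 131 is about 5% off, which is consistent with the ✓ shown in that report.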

S3 (High cardinality)

Spreads 142 updates across 11 different namespaces to test cross-namespace handling.

Similar to S1, the reconciles were cut in half (284 to 142) with fewer API calls, while every reload request was still handled properly.

S4 (No-op)

Updates ConfigMap annotations without changing actual data (143 times).

The new implementation is much smarter here: it doesn't even trigger the reconcile loop for annotation-only changes (0 reconciles vs. 286 in the old). Both correctly skip the actual reloads, but the new one doesn't waste any cycles checking. This comes from filtering events on cached objects with predicates.
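The idea behind that filtering can be sketched with the standard library alone. This is a simplified model and not the PR's actual predicate code: `configMap`, `dataHash`, and `shouldReconcile` are hypothetical names, and the real implementation would plug a comparable check into a controller-runtime update predicate.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configMap is a hypothetical, trimmed-down stand-in for a Kubernetes ConfigMap.
type configMap struct {
	Annotations map[string]string
	Data        map[string]string
}

// dataHash hashes only the Data field. Keys are visited in sorted order so
// the result is deterministic regardless of map iteration order.
func dataHash(cm configMap) string {
	keys := make([]string, 0, len(cm.Data))
	for k := range cm.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		v := cm.Data[k]
		// Length prefixes keep ("ab", "c") distinct from ("a", "bc").
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(v), v)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// shouldReconcile models an update filter: the event is passed on only when
// the data hash changed, so annotation-only updates are dropped early.
func shouldReconcile(oldCM, newCM configMap) bool {
	return dataHash(oldCM) != dataHash(newCM)
}

func main() {
	base := configMap{
		Annotations: map[string]string{"touched": "1"},
		Data:        map[string]string{"key": "value"},
	}
	annotationOnly := configMap{
		Annotations: map[string]string{"touched": "2"},
		Data:        map[string]string{"key": "value"},
	}
	dataChanged := configMap{
		Annotations: map[string]string{"touched": "1"},
		Data:        map[string]string{"key": "new-value"},
	}

	fmt.Println(shouldReconcile(base, annotationOnly)) // false: S4-style no-op update
	fmt.Println(shouldReconcile(base, dataChanged))    // true: real data change
}
```

Because the check runs before anything is enqueued, a no-op update in this model never reaches the reconcile loop, which matches the 0 reconciles reported for the new implementation in S4.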

S5 (Churn)

Creates and deletes deployments while firing ConfigMap updates to simulate a busy cluster.

The new implementation handles workload churn much better because it doesn't do any duplicate reloads (the old one performed 275 actions vs. 26 for the new one, which was the expected count). The old one seems to get confused when deployments are created or deleted during updates.

S6 (Restart)

Kills the reloader pod mid-test while updates are happening to test recovery.

Both survive restarts without issues (of course missing some events here and there, but the behavior is the same across both).

S7 (API Pressure)

Fires 131 updates from 10 concurrent goroutines to stress test the controller.

Under heavy load, the new implementation still makes around 70% fewer API calls (286 vs. 971) while reloading everything properly.
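The API-call reduction can be checked directly from the `rest_client_requests_total` rows in the reports above. The snippet below is only a worked calculation; `reductionPct` is a hypothetical helper and not part of the load-test tooling.

```go
package main

import "fmt"

// reductionPct returns how much smaller newCalls is than oldCalls, as a
// percentage of oldCalls. It returns 0 when oldCalls is 0 to avoid dividing
// by zero.
func reductionPct(oldCalls, newCalls float64) float64 {
	if oldCalls == 0 {
		return 0
	}
	return (oldCalls - newCalls) / oldCalls * 100
}

func main() {
	// rest_client_requests_total (old, new) taken from the reports above.
	scenarios := []struct {
		name     string
		old, new float64
	}{
		{"S3", 999, 308},
		{"S7", 971, 286},
		{"S8", 201, 80},
	}
	for _, s := range scenarios {
		fmt.Printf("%s: %.1f%% fewer API calls\n", s.name, reductionPct(s.old, s.new))
	}
}
```

With these inputs the reduction comes out at roughly 69% for S3, 71% for S7, and 60% for S8.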

S8 (Large Objects)

Uses 100KB+ ConfigMaps to test handling of large objects.

Works fine for both implementations. The new one uses fewer API calls, as expected.

@rasheedamir Let me know if this gives the team enough detail to take a deeper look into the changes and review them.

@rasheedamir
Member

@waseem-h great work! Can you please commit the load tests and add a step in the pipeline to execute them when the "load-tests" label is added to the PR?

@rasheedamir
Member

@waseem-h please make a separate PR for the load tests and commit them to the existing codebase, so that we can use that as the reference and merge it.
