Major refactoring of Reloader to use controller-runtime and fix multiple bugs #1071
base: master
Conversation
…isters and add Argo Rollouts support
…arity and consistency across packages
…dedicated files and remove redundant code
…d improve ldflags injection, fix defer resp.Body.Close() usage, replace os.Setenv with t.Setenv in tests, correct error message casing, and adjust Dockerfile and Makefile for cmd/reloader structure
@waseem-h Can you please add a "Functionality Parity Matrix", i.e. a table of every behavior the old controller had, marking how it's implemented now and how it's tested?
@rasheedamir Sure, I'll update the description to include that.
@rasheedamir I've added test coverage information and details about the new components and what they replaced.
… publishing and update RBAC to include watch permission
@waseem-h Can't the coverage be 100% for all components? And is the core business logic separated out into small, unit-testable pieces of code?
@rasheedamir The original coverage was less than 50% overall, and most components didn't have tests at all. The new tests cover all of the critical scenarios and functions, but not getters, setters, or constructor functions, which is why coverage is not 100%.
@TheiLLeniumStudios Can we add a minimal load/stress test suite as part of this migration? Since Reloader is widely adopted and used by enterprise customers, we should validate that moving to controller-runtime reconcilers doesn't regress performance or stability. A small, repeatable test bed would let us run identical scenarios against both the legacy implementation and the new one and compare key metrics (throughput, reconcile latency, API call volume, resource usage, and error rate) as an objective release gate. A good stress/load test for this migration should answer two questions:

Below is a concrete design + criteria you can implement as a repeatable “test bed” and run A/B (old controller vs new controller).

What to measure (the “contract”)

Correctness criteria

Stability criteria

Performance criteria (A/B comparison)
Track these and compare old vs new:

Acceptance style:

Test bed design (repeatable)

Environment

Make it deterministic:
Objects generator (load driver)
Write a small Go tool (or k6 + kubernetes client wrapper) that can:
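As a rough illustration of such a driver (purely a sketch assuming client-go, a kubeconfig in the default location, and a hypothetical `load-test` namespace with a pre-created `hot-source` ConfigMap), an S1-style burst generator could look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location; in CI this would point at a kind cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Burst scenario: update one "hot" ConfigMap many times in quick succession,
	// stamping each revision so update-to-reload latency can be measured later.
	const namespace, name, updates = "load-test", "hot-source", 140
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for i := 0; i < updates; i++ {
		cm.Data = map[string]string{
			"revision":  fmt.Sprintf("%d", i),
			"updatedAt": time.Now().Format(time.RFC3339Nano),
		}
		if cm, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```

The same skeleton extends to fan-out (many referencing Deployments), churn, and multi-namespace scenarios by swapping out the object generator and the update loop.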
Core stress scenarios (we want 6–8)

S1 — Burst updates on one hot source
Goal: detect duplicate restarts, queue storms, hot loops.

S2 — Many sources, moderate fan-out
Goal: typical enterprise pattern.

S3 — High cardinality namespaces/tenants
Goal: cache/index scalability + multi-tenant behavior.

S4 — Metadata-only updates and no-op updates
Goal: predicate correctness + API friendliness.

S5 — Delete/recreate churn
Goal: tombstone/finalizer correctness.

S6 — Controller restart during load
Goal: resilience, missed-event recovery.

S7 — API server pressure / transient failures
Goal: backoff + retry correctness.

S8 — Large object size / big secrets/configmaps
Goal: serialization/patch overhead.
Instrumentation (so results are objective)

Minimum instrumentation

Kubernetes API call volume
Best sources:
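One low-effort option for the instrumentation piece (a sketch, not part of this PR) is to register a couple of load-test metrics on controller-runtime's Prometheus registry, which typically also exposes client-go REST metrics such as rest_client_requests_total that can serve as the API-call-volume signal. The metric names below are made up:

```go
package loadtest

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// reloadsTriggered counts workload restarts observed per scenario and workload.
	reloadsTriggered = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "reloader_loadtest_reloads_total",
			Help: "Workload restarts triggered during a load-test scenario.",
		},
		[]string{"scenario", "workload"},
	)
	// reconcileLatency tracks the delay from source update to workload patch.
	reconcileLatency = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "reloader_loadtest_reconcile_latency_seconds",
			Help:    "Latency from ConfigMap/Secret update to the resulting workload patch.",
			Buckets: prometheus.DefBuckets,
		},
	)
)

func init() {
	// metrics.Registry is controller-runtime's global registry; everything
	// registered here shows up on the manager's /metrics endpoint.
	metrics.Registry.MustRegister(reloadsTriggered, reconcileLatency)
}
```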
Reporting format (A/B comparable)
For each scenario produce a small JSON + markdown summary:
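One possible shape for that summary (illustrative only; the struct and field names below are assumptions, not an existing Reloader type):

```go
package report

import (
	"encoding/json"
	"os"
)

// ScenarioReport captures the per-scenario numbers so old and new runs can be diffed.
type ScenarioReport struct {
	Scenario          string  `json:"scenario"`       // e.g. "S2-fan-out"
	Implementation    string  `json:"implementation"` // "legacy" or "controller-runtime"
	ExpectedReloads   int     `json:"expectedReloads"`
	ObservedReloads   int     `json:"observedReloads"`
	DuplicateReloads  int     `json:"duplicateReloads"`
	Reconciles        int     `json:"reconciles"`
	APICalls          int     `json:"apiCalls"`
	P95LatencySeconds float64 `json:"p95ReconcileLatencySeconds"`
	Errors            int     `json:"errors"`
}

// Write dumps the report as indented JSON for later comparison.
func (r ScenarioReport) Write(path string) error {
	data, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}
```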
Then do a diff: old vs new.

Suggested pass/fail gates (practical defaults)
You can tune these, but start here:

Implementation approach
Pull request overview
This PR represents a major architectural refactoring of the Reloader project, migrating from custom informer controllers to controller-runtime reconcilers while fixing multiple critical bugs. The refactoring improves code maintainability and test coverage, and addresses issues with case sensitivity, annotation precedence, deployment pause handling, and resource detection.
Key Changes:
- Migrated from custom informers to controller-runtime with dedicated reconcilers (ConfigMap, Secret, Deployment, Namespace), as sketched below
- Fixed case-insensitive resource comparison, annotation precedence bugs, and persistent pause handling
- Achieved ~94.4% test coverage for core reload logic and ~56% for controllers with comprehensive unit and e2e tests
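For readers unfamiliar with the target pattern, the first bullet above boils down to registering reconcilers with a controller-runtime manager. The following is a generic sketch of that pattern, not this PR's actual code (the real reconcilers live under internal/pkg/controller):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ConfigMapReconciler is a generic example of a controller-runtime reconciler.
type ConfigMapReconciler struct {
	client.Client
}

func (r *ConfigMapReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cm corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cm); err != nil {
		// Ignore not-found errors: the object may have been deleted after the event was queued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ...hash the data and trigger reloads for referencing workloads here...
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&ConfigMapReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The manager's shared cache is what the later load-test comments credit for the drop in GET calls.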
Reviewed changes
Copilot reviewed 85 out of 119 changed files in this pull request and generated no comments.
Summary per file
| File | Description |
|---|---|
| internal/pkg/testutil/fixtures.go | New test utility fixtures for creating test workload objects |
| internal/pkg/reload/strategy_test.go | Unit tests for reload strategies (EnvVar and Annotation) |
| internal/pkg/reload/strategy.go | Reload strategy implementations for triggering workload restarts |
| internal/pkg/reload/service_test.go | Comprehensive tests for reload service orchestration logic |
| internal/pkg/reload/service.go | Service orchestrating reload logic for ConfigMaps and Secrets |
| internal/pkg/reload/resource_type_test.go | Tests for resource type enumeration |
| internal/pkg/reload/resource_type.go | Resource type definitions (ConfigMap, Secret) |
| internal/pkg/reload/predicate_test.go | Tests for event filtering predicates |
| internal/pkg/reload/predicate.go | Event filtering predicates for controller-runtime |
| internal/pkg/reload/pause_test.go | Tests for deployment pause handling logic |
| internal/pkg/reload/pause.go | Deployment pause/resume handler implementation |
| internal/pkg/reload/matcher_test.go | Tests for annotation matching and reload decision logic |
| internal/pkg/reload/matcher.go | Matcher determining if workloads should be reloaded based on annotations |
| internal/pkg/reload/hasher_test.go | Tests for ConfigMap/Secret content hashing |
| internal/pkg/reload/hasher.go | SHA1 hasher for detecting ConfigMap/Secret changes (see the sketch after this table) |
| internal/pkg/reload/decision_test.go | Tests for reload decision filtering |
| internal/pkg/reload/decision.go | Reload decision data structures |
| internal/pkg/reload/change.go | Resource change event types and interfaces |
| internal/pkg/options/flags.go | Removed legacy flags file (replaced by config package) |
| internal/pkg/openshift/detect.go | DeploymentConfig API detection for OpenShift |
| internal/pkg/metrics/prometheus_test.go | Tests for Prometheus metrics recording |
| internal/pkg/metrics/prometheus.go | Prometheus metrics collection with namespace tracking |
| internal/pkg/metadata/publisher.go | Metadata ConfigMap publisher for reloader build/config info |
| internal/pkg/metadata/metadata_test.go | Tests for metadata ConfigMap creation |
| internal/pkg/metadata/metadata.go | Metadata structures for build info and configuration |
| internal/pkg/leadership/* | Removed legacy leadership election files |
| internal/pkg/handler/* | Removed legacy handler files (replaced by controller package) |
| internal/pkg/events/recorder_test.go | Tests for Kubernetes event recording |
| internal/pkg/events/recorder.go | Event recorder wrapper for workload reload events |
| internal/pkg/crypto/* | Removed crypto package (functionality moved to reload.Hasher) |
| internal/pkg/controller/test_helpers_test.go | Test helper functions for controller unit tests |
| internal/pkg/controller/secret_reconciler_test.go | Unit tests for Secret reconciler |
| internal/pkg/controller/secret_reconciler.go | Secret reconciler implementation using controller-runtime |
| internal/pkg/constants/enums.go | Removed legacy enum constants |
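The hasher listed above (internal/pkg/reload/hasher.go) is described as a SHA1 hasher for detecting ConfigMap/Secret changes. A minimal sketch of what such content hashing can look like, with the caveat that the function name and exact field handling are assumptions rather than the PR's code:

```go
package reload

import (
	"crypto/sha1"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// HashConfigMap returns a stable SHA1 digest of a ConfigMap's contents so a
// reconciler can tell whether the data actually changed between events.
func HashConfigMap(cm *corev1.ConfigMap) string {
	keys := make([]string, 0, len(cm.Data)+len(cm.BinaryData))
	for k := range cm.Data {
		keys = append(keys, k)
	}
	for k := range cm.BinaryData {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for determinism

	h := sha1.New()
	for _, k := range keys {
		h.Write([]byte(k))
		if v, ok := cm.Data[k]; ok {
			h.Write([]byte(v))
		} else {
			h.Write(cm.BinaryData[k])
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}
```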
@rasheedamir I've added metrics to both the old and new implementations to help see how they hold up against each other. Here is a list of the metrics and the scenarios that I've tested:

Metrics

Scenarios

I'll share the results of each scenario in a separate comment.
S1–S8: per-scenario results (posted as screenshots).
To summarize the test results above:

S1 (Burst): Fires 141 ConfigMap/Secret updates in quick succession to test event handling under load. The new implementation handles the burst with half the reconciles and roughly 70% fewer calls to the API server, especially GET calls, thanks to how controller-runtime caches objects.

S2 (Fan-out): Creates 1 ConfigMap referenced by 50 deployments, then updates it 8 times (expecting 400 total reloads). This one is interesting: the old/current implementation missed a lot of reloads (241 out of 400), while the new implementation reconciled every one of them. The higher API call count here is because it actually reconciled everything, which requires a few more calls. Reconcile time under pressure also went down from 10s in the old implementation to 0.2s while keeping up with everything.

S3 (High cardinality): Spreads 142 updates across 11 different namespaces to test cross-namespace handling. Similar to S1, reconciles were cut in half with fewer API calls, while every reload request was handled properly.

S4 (No-op): Updates ConfigMap annotations without changing the actual data (143 times). The new implementation is much smarter here: it doesn't even enter the reconcile loop for annotation-only changes (0 reconciles vs 186 in the old). Both correctly skip the actual reloads, but the new one doesn't waste any cycles checking, because cached objects are filtered out using predicates.

S5 (Churn): Creates and deletes deployments while firing ConfigMap updates to simulate a busy cluster. The new implementation handles workload churn much better because it doesn't perform any duplicate reloads (the old one did 275 reconciles vs 26 for the new one, which is the expected number). The old implementation seems to get confused when deployments are created/deleted during updates.

S6 (Restart): Kills the Reloader pod mid-test while updates are happening to test recovery. Both survive restarts without issues (of course some events are missed here and there, but the behavior is the same across both).

S7 (API pressure): Fires 131 updates from 10 concurrent goroutines to stress-test the controller. Under heavy load, the new implementation still makes around 70% fewer API calls while reloading everything properly.

S8 (Large objects): Uses 100KB+ ConfigMaps to test handling of large objects. Works fine for both implementations; the new one uses fewer API calls, as expected.

@rasheedamir Let me know if this gives the team enough detail to take a deeper look into the changes and review them.
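For context on the S4 result, predicate-based filtering is a standard controller-runtime mechanism. A minimal sketch (not necessarily the PR's exact predicate) that lets a ConfigMap update through only when its data actually changed:

```go
package reload

import (
	"reflect"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// DataChangedPredicate lets an update through only when the ConfigMap's Data
// or BinaryData actually changed, so annotation-only edits never reach the
// reconcile queue.
func DataChangedPredicate() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCM, okOld := e.ObjectOld.(*corev1.ConfigMap)
			newCM, okNew := e.ObjectNew.(*corev1.ConfigMap)
			if !okOld || !okNew {
				return true // not a ConfigMap; let other filters decide
			}
			return !reflect.DeepEqual(oldCM.Data, newCM.Data) ||
				!reflect.DeepEqual(oldCM.BinaryData, newCM.BinaryData)
		},
	}
}
```

Such a predicate would typically be attached with WithEventFilter or builder.WithPredicates when the controller is built.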
…ad to de-duplicate a lot of code
@waseem-h Great work! Can you please commit the load tests and add a pipeline step that executes them when the "load-tests" label is added to the PR?
@waseem-h Please make a separate PR for the load tests and commit them to the existing codebase, so we can use that as the reference and merge it.
Issues Fixed
- `--resources-to-ignore=configMaps` (case sensitivity)
- Annotation precedence when `auto: true` is also set

Changes

Bug Fixes
- `configMaps` now matches `configmaps` (case-insensitive resource comparison; sketched below)
- The `paused-at` annotation is tracked, and paused deployments are resumed even after a restart
- `ResourceChange`
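To illustrate the case-sensitivity fix (a sketch only, not the PR's code), the ignore-list comparison needs to be case-insensitive so that `configMaps` in `--resources-to-ignore` matches the canonical `configmaps`:

```go
package main

import (
	"fmt"
	"strings"
)

// isIgnored reports whether resource is in the ignore list, comparing case-insensitively.
func isIgnored(resource string, ignoreList []string) bool {
	for _, ignored := range ignoreList {
		if strings.EqualFold(resource, ignored) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isIgnored("configmaps", []string{"configMaps"})) // true
}
```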
Tests

Test coverage (pulled using `go tool cover`)

No breaking changes
Other improvements
- Uses `viper` for configuration and flag management (sketched below)
- 31MB (from 68MB)
- 67.8MB (from 103MB)
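A minimal sketch of flag and environment handling with viper and pflag (the flag names and struct here are illustrative; the actual config package in this PR will differ):

```go
package config

import (
	"strings"

	"github.com/spf13/pflag"
	"github.com/spf13/viper"
)

// Options holds a couple of illustrative settings only.
type Options struct {
	ResourcesToIgnore []string
	ReloadStrategy    string
}

// Load binds command-line flags and RELOADER_* environment variables through viper.
func Load() (*Options, error) {
	pflag.StringSlice("resources-to-ignore", nil, "resource kinds Reloader should skip")
	pflag.String("reload-strategy", "env-vars", "how workloads are restarted")
	pflag.Parse()

	if err := viper.BindPFlags(pflag.CommandLine); err != nil {
		return nil, err
	}
	viper.SetEnvPrefix("RELOADER")
	viper.SetEnvKeyReplacer(strings.NewReplacer("-", "_")) // e.g. RELOADER_RELOAD_STRATEGY
	viper.AutomaticEnv()

	return &Options{
		ResourcesToIgnore: viper.GetStringSlice("resources-to-ignore"),
		ReloadStrategy:    viper.GetString("reload-strategy"),
	}, nil
}
```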
Architecture

New components
High-level flow comparison
New detailed flow