fix(k8s): validate shardTaskPatches early and unblock finalizer on deletion #1049
ashishpatel26 wants to merge 2 commits into
…letion

Fixes opensandbox-group#1019.

Problem 1 - schema bypass: shardTaskPatches uses []runtime.RawExtension, so the API server accepts payloads like `args: 3600` (integer) even though TaskSpec.Args is []string. The mismatch is only discovered during reconcile when strategicpatch.StrategicMergePatch fails, generating a confusing error.

Fix: add ValidateShardTaskPatches() to the TaskSchedulingStrategy interface and DefaultTaskSchedulingStrategy implementation. In the reconcile loop, call it before adding the finalizer. If invalid, write an InvalidShardPatch condition to status and return without error (no requeue) so the controller does not spin.

Problem 2 - finalizer deadlock: because the merge failure prevents the controller from reaching the finalizer-cleanup code, a BatchSandbox with an invalid patch and a DeletionTimestamp is stuck in Terminating forever.

Fix: on the deletion path, run the same validation first. If patches are invalid, clear the FinalizerTaskCleanup finalizer immediately and record the error in status so the resource can be garbage-collected without manual intervention.

New tests:
- TestDefaultTaskSchedulingStrategy_ValidateShardTaskPatches (5 sub-cases)
- TestReconcile_InvalidShardTaskPatches_SetsConditionAndDoesNotProceed
- TestReconcile_InvalidShardTaskPatches_OnDeletion_ClearsFinalizer
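To illustrate the schema bypass, here is a minimal sketch of early type validation using only `encoding/json`. It is a simplification: the PR describes validating through a strategic merge plus unmarshal, whereas this only decodes each raw patch into a struct. The `TaskSpec` type below is a hypothetical, trimmed-down stand-in, and `validatePatches` is an illustrative name, not the PR's function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskSpec is a hypothetical stand-in for the real TaskSpec; only the Args
// field mentioned in the PR description is modeled here.
type TaskSpec struct {
	Args []string `json:"args,omitempty"`
}

// validatePatches decodes every raw JSON patch into TaskSpec and returns an
// error naming the index of the first patch that does not fit the schema.
func validatePatches(patches [][]byte) error {
	for i, raw := range patches {
		var spec TaskSpec
		if err := json.Unmarshal(raw, &spec); err != nil {
			return fmt.Errorf("shardTaskPatches[%d]: %w", i, err)
		}
	}
	return nil
}
```

With this check, a payload such as `{"args":3600}` is rejected up front because a JSON number cannot be decoded into `[]string`, while `{"args":["3600"]}` passes.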
Changed directories: kubernetes.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds early validation for shardTaskPatches to prevent type-mismatch/malformed patch data from causing late reconcile failures, and surfaces invalid patches via a dedicated status condition while ensuring deletion is not blocked by finalizers.
Changes:
- Added `ValidateShardTaskPatches()` to the task scheduling strategy and implemented it in the default strategy using strategic merge + unmarshal checks.
- Updated the controller reconcile flow to set an `InvalidShardPatch` status condition early and to clear the finalizer on deletion when patches are invalid.
- Added unit/integration-style tests covering validation and reconciler behavior for invalid shard patches.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go | Implements shard patch validation via strategic merge and schema/type checks. |
| kubernetes/internal/controller/strategy/task_scheduling_strategy.go | Extends strategy interface with ValidateShardTaskPatches(). |
| kubernetes/apis/sandbox/v1alpha1/batchsandbox_types.go | Adds InvalidShardPatch to condition enum and constants. |
| kubernetes/internal/controller/batchsandbox_controller.go | Validates patches early, sets status condition, and unblocks deletion by removing finalizer when patches are invalid. |
| kubernetes/internal/controller/strategy/task_scheduling_strategy_default_test.go | Adds unit tests for shard patch validation behavior. |
| kubernetes/internal/controller/batchsandbox_pause_resume_test.go | Adds reconciler tests for invalid patches (status condition + finalizer behavior). |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e60c5a4d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Thanks for the reviews. Addressing all feedback:
- Codex P1 – CRD regen: Valid. The …
- Codex P2 / Copilot – Clear …
- Codex P2 / Copilot – Raw patch bytes in status/logs: Valid concern. Will truncate …
- Copilot – Ignored …
- Copilot / Codex P2 – Swallow …
- Codex P2 – Validate only patches within …
- Codex P2 – Cleanup existing schedulers on invalid-patch deletion path: Valid edge case. Will call …
- Propagate updateStatus errors instead of swallowing them
- Clear InvalidShardPatch condition when patches become valid
- Add hasCondition helper to batchsandbox_status.go
- Move deleteTaskScheduler before finalizer check on deletion path
- Truncate raw patch bytes in error messages (max 200 chars)
- Handle json.Marshal error in ValidateShardTaskPatches
- Add InvalidShardPatch to CRD condition enum
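The condition handling in this commit (set on failure, clear once patches are valid again) could look roughly like the sketch below. Everything here is an assumption for illustration: the `Condition` struct is a simplified stand-in for the real status condition type, `syncInvalidPatchCondition` and `errExamplePatch` are hypothetical names, and only `hasCondition` and `InvalidShardPatch` are names taken from the commit message.

```go
package main

import "errors"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string
	Message string
}

const conditionInvalidShardPatch = "InvalidShardPatch"

// errExamplePatch is a made-up validation error used only for demonstration.
var errExamplePatch = errors.New("shardTaskPatches[0]: args must be []string")

// hasCondition reports whether a condition of the given type is present.
func hasCondition(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return true
		}
	}
	return false
}

// syncInvalidPatchCondition returns a new slice in which the InvalidShardPatch
// condition is present exactly when validationErr is non-nil.
func syncInvalidPatchCondition(conds []Condition, validationErr error) []Condition {
	out := make([]Condition, 0, len(conds)+1)
	for _, c := range conds {
		if c.Type != conditionInvalidShardPatch {
			out = append(out, c) // keep unrelated conditions untouched
		}
	}
	if validationErr != nil {
		out = append(out, Condition{
			Type:    conditionInvalidShardPatch,
			Status:  "True",
			Message: validationErr.Error(),
		})
	}
	return out
}
```

Building a fresh slice rather than editing in place keeps the stale condition from lingering after the user fixes the resource, and leaves other conditions unchanged.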
- Remove ValidateShardTaskPatches contamination (belongs in opensandbox-group#1049)
- Change rate limiter base delay 5ms -> 100ms to reduce retry burst
- Narrow status requeue guard to Replicas>0 && Ready==0 to avoid unnecessary reconciles before pods are scheduled
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b12f29f0d6
    patchSummary := patch.Raw
    if len(patchSummary) > 200 {
        patchSummary = append(patchSummary[:200], []byte("...(truncated)")...)
Avoid mutating shardTaskPatches while truncating diagnostics
For any patch whose raw JSON is longer than about 214 bytes, patchSummary still shares the backing array with patch.Raw, so this append(patchSummary[:200], ...) overwrites bytes 200 onward in the actual patch before StrategicMergePatch validates it. A valid long shardTaskPatches entry can therefore be rejected as malformed or, worse, used later in the same reconcile with altered command/env contents; make the summary from a copied slice or string instead of slicing the live patch bytes.
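The aliasing problem described in this comment can be reproduced in isolation. The sketch below contrasts the flagged pattern with one possible fix that builds the summary as a string; the function names and the `limit` parameter are illustrative, not taken from the PR.

```go
package main

// truncateAliased reproduces the flagged pattern: raw[:limit] shares raw's
// backing array, so when capacity allows, append writes the suffix over
// raw[limit:] and corrupts the caller's data.
func truncateAliased(raw []byte, limit int) []byte {
	if len(raw) > limit {
		return append(raw[:limit], []byte("...(truncated)")...)
	}
	return raw
}

// truncateCopied builds the summary as a string. The string conversion
// copies the bytes, so raw is never written to.
func truncateCopied(raw []byte, limit int) string {
	if len(raw) > limit {
		return string(raw[:limit]) + "...(truncated)"
	}
	return string(raw)
}
```

Because the suffix is 14 bytes long, the in-place overwrite happens whenever the slice's capacity is at least `limit + 14`, which matches the "about 214 bytes" threshold mentioned above for a 200-byte limit.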
Summary
Two bugs fixed:

1. shardTaskPatches bypasses schema validation

`shardTaskPatches []runtime.RawExtension` accepts arbitrary JSON: invalid payloads (e.g. `args: 3600` instead of `args: ["3600"]`) pass API admission but fail during reconcile when merged into TaskSpec.

Fix: add `ValidateShardTaskPatches()` to `TaskSchedulingStrategy`. It is called before adding the finalizer; on failure, the controller writes an `InvalidShardPatch` condition to status and stops requeuing until the user corrects the resource.

2. Finalizer deadlock on deletion

If the `shardTaskPatches` merge fails during the deletion path, the finalizer is never cleared, leaving the resource stuck in `Terminating` forever (the finalizer must be patched out manually).

Fix: in the deletion path, clear `FinalizerTaskCleanup` immediately when patch validation fails, record the condition in status, and return cleanly.

Test plan
- `go test ./internal/controller/... ./internal/controller/strategy/...`
- `ValidateShardTaskPatches`: valid, invalid type, malformed JSON, second-patch-invalid
- Reconcile sets `InvalidShardPatch` condition and does not proceed

Fixes #1019
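The control flow described in the summary can be sketched as a small, dependency-free model. This is not the controller's code: the `sandbox` struct, `reconcileStep`, `removeString`, and the finalizer string are all hypothetical placeholders, and the real reconciler works against Kubernetes objects and the API client rather than plain structs.

```go
package main

// finalizerTaskCleanup is a placeholder; the real finalizer string is not
// shown in this PR text.
const finalizerTaskCleanup = "example.invalid/task-cleanup"

// sandbox is a hypothetical, minimal model of the reconciled resource.
type sandbox struct {
	Deleting   bool     // true when a deletion timestamp is set
	Finalizers []string // finalizers currently on the object
	Conditions []string // condition types currently in status
}

// removeString returns a copy of items without any occurrence of target.
func removeString(items []string, target string) []string {
	out := make([]string, 0, len(items))
	for _, it := range items {
		if it != target {
			out = append(out, it)
		}
	}
	return out
}

// reconcileStep models the early check: with valid patches the normal flow
// continues; with invalid patches the condition is recorded, the finalizer is
// dropped if the object is being deleted, and reconcile stops without requeue.
func reconcileStep(sb *sandbox, patchesValid bool) (proceed bool) {
	if patchesValid {
		return true
	}
	sb.Conditions = append(removeString(sb.Conditions, "InvalidShardPatch"), "InvalidShardPatch")
	if sb.Deleting {
		sb.Finalizers = removeString(sb.Finalizers, finalizerTaskCleanup)
	}
	return false
}
```

Running validation before the finalizer is added means an invalid resource that is not being deleted never acquires the finalizer in the first place, so the deletion branch only matters for objects that already carry it.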