Skip to content

fix(k8s): validate shardTaskPatches early and unblock finalizer on deletion#1049

Open
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-1019-v2
Open

fix(k8s): validate shardTaskPatches early and unblock finalizer on deletion#1049
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-1019-v2

Conversation

@ashishpatel26

Copy link
Copy Markdown

Summary

Two bugs fixed:

1. shardTaskPatches bypasses schema validation
shardTaskPatches []runtime.RawExtension accepts arbitrary JSON — invalid payloads (e.g. args: 3600 instead of args: ["3600"]) pass API admission but fail during reconcile when merged into TaskSpec.

Fix: add ValidateShardTaskPatches() to TaskSchedulingStrategy. Called before adding the finalizer — on failure, writes an InvalidShardPatch condition to status and stops requeuing until the user corrects the resource.

2. Finalizer deadlock on deletion
If shardTaskPatches merge fails during the deletion path, the finalizer is never cleared, leaving the resource stuck in Terminating forever (must manually patch finalizer out).

Fix: in the deletion path, clear FinalizerTaskCleanup immediately when patch validation fails, record the condition in status, and return cleanly.

Test plan

  • go test ./internal/controller/... ./internal/controller/strategy/...
    • 5 sub-cases for ValidateShardTaskPatches: valid, invalid type, malformed JSON, second-patch-invalid
    • Reconcile with invalid patches sets InvalidShardPatch condition and does not proceed
    • Deletion with invalid patches clears finalizer and unblocks termination

Fixes #1019

…letion

Fixes opensandbox-group#1019.

Problem 1 - schema bypass: shardTaskPatches uses []runtime.RawExtension, so
the API server accepts payloads like `args: 3600` (integer) even though
TaskSpec.Args is []string. The mismatch is only discovered during reconcile
when strategicpatch.StrategicMergePatch fails, generating a confusing error.

Fix: add ValidateShardTaskPatches() to the TaskSchedulingStrategy interface and
DefaultTaskSchedulingStrategy implementation. In the reconcile loop, call it
before adding the finalizer. If invalid, write an InvalidShardPatch condition to
status and return without error (no requeue) so the controller does not spin.

Problem 2 - finalizer deadlock: because the merge failure prevents the
controller from reaching the finalizer-cleanup code, a BatchSandbox with an
invalid patch and a DeletionTimestamp is stuck in Terminating forever.

Fix: on the deletion path, run the same validation first. If patches are
invalid, clear the FinalizerTaskCleanup finalizer immediately and record the
error in status so the resource can be garbage-collected without manual
intervention.

New tests:
- TestDefaultTaskSchedulingStrategy_ValidateShardTaskPatches (5 sub-cases)
- TestReconcile_InvalidShardTaskPatches_SetsConditionAndDoesNotProceed
- TestReconcile_InvalidShardTaskPatches_OnDeletion_ClearsFinalizer
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

  • component/k8s ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • documentation - Improvements or additions to documentation
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ashishpatel26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds early validation for shardTaskPatches to prevent type-mismatch/malformed patch data from causing late reconcile failures, and surfaces invalid patches via a dedicated status condition while ensuring deletion is not blocked by finalizers.

Changes:

  • Added ValidateShardTaskPatches() to the task scheduling strategy and implemented it in the default strategy using strategic merge + unmarshal checks.
  • Updated the controller reconcile flow to set an InvalidShardPatch status condition early and to clear the finalizer on deletion when patches are invalid.
  • Added unit/integration-style tests covering validation and reconciler behavior for invalid shard patches.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go Implements shard patch validation via strategic merge and schema/type checks.
kubernetes/internal/controller/strategy/task_scheduling_strategy.go Extends strategy interface with ValidateShardTaskPatches().
kubernetes/apis/sandbox/v1alpha1/batchsandbox_types.go Adds InvalidShardPatch to condition enum and constants.
kubernetes/internal/controller/batchsandbox_controller.go Validates patches early, sets status condition, and unblocks deletion by removing finalizer when patches are invalid.
kubernetes/internal/controller/strategy/task_scheduling_strategy_default_test.go Adds unit tests for shard patch validation behavior.
kubernetes/internal/controller/batchsandbox_pause_resume_test.go Adds reconciler tests for invalid patches (status condition + finalizer behavior).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go Outdated
Comment thread kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go Outdated
Comment thread kubernetes/internal/controller/batchsandbox_controller.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e60c5a4d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kubernetes/apis/sandbox/v1alpha1/batchsandbox_types.go
Comment thread kubernetes/internal/controller/batchsandbox_controller.go
Comment thread kubernetes/internal/controller/batchsandbox_controller.go
Comment thread kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go Outdated
Comment thread kubernetes/internal/controller/batchsandbox_controller.go Outdated
@ashishpatel26

Copy link
Copy Markdown
Author

Thanks for the reviews. Addressing all feedback:

Codex P1 – CRD regen: Valid. The BatchSandboxConditionType kubebuilder enum annotation was updated but config/crd/bases/ manifests were not regenerated. Will add a make generate manifests step and commit the updated CRD.

Codex P2 / Copilot – Clear InvalidShardPatch after fix: Valid. Once patches are corrected and reconcile succeeds, the condition is never removed. Will add a clearConditionInStatus call on the happy path after validation passes.

Codex P2 / Copilot – Raw patch bytes in status/logs: Valid concern. Will truncate patch.Raw to a safe length (e.g. first 200 bytes) in the error format string before writing to status.

Copilot – Ignored json.Marshal error on zero-value: Will handle the error and return early rather than silently passing nil/empty bytes to the merge operation.

Copilot / Codex P2 – Swallow updateStatus error: Valid. The _ = suppression means a transient API error won't requeue. Will propagate the error and return it so the controller retries.

Codex P2 – Validate only patches within replicas count: Intentional: validating all patches eagerly catches user errors even before scale-up, and the cost is minimal (JSON marshal of a zero struct). Happy to scope it to [:replicas] if the project prefers lazy validation.

Codex P2 – Cleanup existing schedulers on invalid-patch deletion path: Valid edge case. Will call r.deleteTaskScheduler unconditionally before returning on the deletion path regardless of whether a scheduler is already registered.

- Propagate updateStatus errors instead of swallowing them
- Clear InvalidShardPatch condition when patches become valid
- Add hasCondition helper to batchsandbox_status.go
- Move deleteTaskScheduler before finalizer check on deletion path
- Truncate raw patch bytes in error messages (max 200 chars)
- Handle json.Marshal error in ValidateShardTaskPatches
- Add InvalidShardPatch to CRD condition enum
ashishpatel26 added a commit to ashishpatel26/OpenSandbox that referenced this pull request Jun 12, 2026
- Remove ValidateShardTaskPatches contamination (belongs in opensandbox-group#1049)
- Change rate limiter base delay 5ms -> 100ms to reduce retry burst
- Narrow status requeue guard to Replicas>0 && Ready==0 to avoid
  unnecessary reconciles before pods are scheduled

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b12f29f0d6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +59 to +61
patchSummary := patch.Raw
if len(patchSummary) > 200 {
patchSummary = append(patchSummary[:200], []byte("...(truncated)")...)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid mutating shardTaskPatches while truncating diagnostics

For any patch whose raw JSON is longer than about 214 bytes, patchSummary still shares the backing array with patch.Raw, so this append(patchSummary[:200], ...) overwrites bytes 200 onward in the actual patch before StrategicMergePatch validates it. A valid long shardTaskPatches entry can therefore be rejected as malformed or, worse, used later in the same reconcile with altered command/env contents; make the summary from a copied slice or string instead of slicing the live patch bytes.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants