Skip to content

fix(k8s): validate shardTaskPatches early and unblock finalizer on deletion#1040

Closed
ashishpatel26 wants to merge 5 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-1019-shardtaskpatches-validation
Closed

fix(k8s): validate shardTaskPatches early and unblock finalizer on deletion#1040
ashishpatel26 wants to merge 5 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-1019-shardtaskpatches-validation

Conversation

@ashishpatel26

Copy link
Copy Markdown

Summary

Fixes #1019.

  • Fix 1 (schema validation): Add ValidateShardTaskPatches() to the TaskSchedulingStrategy interface and DefaultTaskSchedulingStrategy. Call it in the reconcile loop before adding the FinalizerTaskCleanup finalizer. If any patch cannot be merged into a zero-value TaskTemplateSpec (e.g. args: 3600 instead of args: [3600]), write an InvalidShardPatch condition to status and return without error, so the controller stops requeueing until the user corrects the resource.
  • Fix 2 (finalizer deadlock): On the deletion path (DeletionTimestamp != nil), run the same validation first. If patches are invalid, remove FinalizerTaskCleanup immediately and record the error in status so the resource is no longer stuck in Terminating.
  • New condition type: BatchSandboxConditionInvalidShardPatch = InvalidShardPatch added to BatchSandboxConditionType.

Test plan

  • TestDefaultTaskSchedulingStrategy_ValidateShardTaskPatches — 5 sub-cases covering no patches, valid string args, integer args (invalid), malformed JSON, and second-patch-invalid detection.
  • TestReconcile_InvalidShardTaskPatches_SetsConditionAndDoesNotProceed — reconcile with invalid patches sets InvalidShardPatch condition, returns no error, and does not add finalizer.
  • TestReconcile_InvalidShardTaskPatches_OnDeletion_ClearsFinalizer — reconcile of a terminating sandbox with invalid patches clears the finalizer so the resource is garbage-collected.

When a Volume with pvc backend and readOnly=True is mounted, only the
VolumeMount readOnly field was set. Some CSI drivers (e.g.
mountpoint-s3-csi-driver) require readOnly on the PersistentVolumeClaim
volume source as well, so the mount remained writable despite the flag.

Fix: also set persistentVolumeClaim.readOnly=true in the pod volumes list
when the volume is declared read-only.

Additionally change the dedup key from claim_name alone to
(claim_name, read_only) so the same PVC can be mounted both read-only
and read-write within the same pod without one mount silently overriding
the other.

Fixes opensandbox-group#545
When the proxy forwards POST/PUT/PATCH/DELETE requests it streams the
body via an async generator, causing httpx to use Transfer-Encoding:
chunked.  The original Content-Length header was still forwarded, so
Go's net/http on execd read exactly Content-Length bytes and discarded
the remainder of the chunked stream.  Files larger than ~18 KB appeared
to succeed (HTTP 200) but were silently truncated on the sandbox
filesystem.

Fix: pop content-length from forwarded headers whenever stream_body is
True so httpx is free to drive chunk framing without a conflicting
length constraint.

Fixes opensandbox-group#1016
When createDirectories is called on a path whose parent components
already exist (e.g. /tmp), MakeDir was unconditionally calling
ChmodFile on the target after os.MkdirAll, even if the directory
was not newly created. For system directories the process does not
own this causes 'chmod: operation not permitted'.

Fix: stat the target before MkdirAll; if it already exists, skip
ChmodFile entirely. Only newly-created directories receive the
requested permission. This matches POSIX mkdir -p --mode semantics.

Fixes opensandbox-group#1024
…ix 504 timeout

ConnectionConfig.get_base_url() was unconditionally appending /_API_VERSION ("v1")
to the domain, causing all SDK requests to hit /v1/sandboxes instead of /sandboxes.
In deployments where the gateway routes /v1/* to a synchronous blocking endpoint,
this caused 504 Gateway Timeout when pod startup exceeded the proxy idle timeout (~60s).

The generated API client already uses correct relative paths (e.g. /sandboxes), so
the base URL should be the plain domain without any version prefix. The server
registers routes at both / and /v1 with the same async handler, so removing /v1
from the base URL routes traffic to the non-prefixed async endpoint that returns
202 Accepted immediately and lets the SDK poll for readiness via check_ready().

Fixes opensandbox-group#591
…letion

Fixes opensandbox-group#1019.

Problem 1 - schema bypass: shardTaskPatches uses []runtime.RawExtension, so
the API server accepts payloads like `args: 3600` (integer) even though
TaskSpec.Args is []string. The mismatch is only discovered during reconcile
when strategicpatch.StrategicMergePatch fails, generating a confusing error.

Fix: add ValidateShardTaskPatches() to the TaskSchedulingStrategy interface and
DefaultTaskSchedulingStrategy implementation. In the reconcile loop, call it
before adding the finalizer. If invalid, write an InvalidShardPatch condition to
status and return without error (no requeue) so the controller does not spin.

Problem 2 - finalizer deadlock: because the merge failure prevents the
controller from reaching the finalizer-cleanup code, a BatchSandbox with an
invalid patch and a DeletionTimestamp is stuck in Terminating forever.

Fix: on the deletion path, run the same validation first. If patches are
invalid, clear the FinalizerTaskCleanup finalizer immediately and record the
error in status so the resource can be garbage-collected without manual
intervention.

New tests:
- TestDefaultTaskSchedulingStrategy_ValidateShardTaskPatches (5 sub-cases)
- TestReconcile_InvalidShardTaskPatches_SetsConditionAndDoesNotProceed
- TestReconcile_InvalidShardTaskPatches_OnDeletion_ClearsFinalizer
Copilot AI review requested due to automatic review settings June 12, 2026 14:06
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: components、kubernetes、sdks、server.

📋 Recommended labels (based on changed files):

  • component/execd ⬅️
  • component/k8s ⬅️
  • component/server ⬅️
  • sdks ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • documentation - Improvements or additions to documentation
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ashishpatel26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses several correctness and reliability issues across the server proxy, Kubernetes controller, SDK base URL construction, and filesystem utilities.

Changes:

  • Strip Content-Length when proxying streamed request bodies to prevent backend truncation.
  • Improve Kubernetes PVC volume mounting behavior (RO/RW support) and add shard-task patch validation with controller status reporting.
  • Update Python SDK get_base_url() to avoid appending /v1, and make MakeDir idempotent by skipping chmod/chown on pre-existing directories.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
server/tests/test_routes_proxy.py Adds regression tests ensuring Content-Length is stripped for streamed proxy requests.
server/tests/k8s/test_volume_helper.py Adds new tests for PVC/host volume application behavior in pod specs.
server/tests/k8s/test_batchsandbox_provider.py Extends existing test to assert PVC readOnly propagates to the volume spec.
server/opensandbox_server/services/k8s/volume_helper.py Updates PVC volume handling to support RO/RW mounts and improves logging.
server/opensandbox_server/api/proxy.py Strips Content-Length when streaming request bodies.
sdks/sandbox/python/tests/test_connection_config.py Updates/extends tests for base URL generation without /v1 and with slash stripping.
sdks/sandbox/python/src/opensandbox/config/connection_sync.py Changes get_base_url() to stop appending /v1 and adds documentation.
sdks/sandbox/python/src/opensandbox/config/connection.py Same as above for async config: stop appending /v1 and add documentation.
kubernetes/internal/controller/strategy/task_scheduling_strategy_default_test.go Adds unit tests for shard-task-patch validation.
kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go Implements shard-task-patch validation using strategic merge patch.
kubernetes/internal/controller/strategy/task_scheduling_strategy.go Extends interface with ValidateShardTaskPatches().
kubernetes/internal/controller/batchsandbox_pause_resume_test.go Adds reconciler tests for invalid shard-task-patches and deletion finalizer unblock.
kubernetes/internal/controller/batchsandbox_controller.go Validates shard-task-patches early; sets status condition and unblocks deletion finalizer.
kubernetes/apis/sandbox/v1alpha1/batchsandbox_types.go Adds InvalidShardPatch condition type to CRD.
components/execd/pkg/web/controller/utils_windows.go Makes MakeDir skip chmod/chown for pre-existing directories (Windows build).
components/execd/pkg/web/controller/utils_test.go Adds tests for MakeDir creation/idempotency/nested behavior.
components/execd/pkg/web/controller/utils.go Same MakeDir idempotency change for non-Windows build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +58 to +60
pvc_key = (pvc_claim_name, vol.read_only)

if pvc_key not in pvc_to_volume_name:
Comment on lines 66 to +70
pod_volumes.append({
"name": vol_name,
"persistentVolumeClaim": {
"claimName": pvc_claim_name,
},
"persistentVolumeClaim": pvc_volume,
})
pvc_to_volume_name[pvc_claim_name] = vol_name
pvc_to_volume_name[pvc_key] = vol_name
Comment on lines +74 to 77
"name": pvc_to_volume_name[pvc_key],
"mountPath": vol.mount_path,
"readOnly": vol.read_only,
}
v.get("name") for v in pod_volumes if isinstance(v, dict)
}
# Key: (claim_name, read_only) so the same PVC can be mounted both RO and RW.
pvc_to_volume_name: Dict[tuple, str] = {}
return f"{self.protocol}://{domain}/{self._API_VERSION}"
if domain.startswith("http://") or domain.startswith("https://"):
return domain.rstrip("/")
return f"{self.protocol}://{domain}"
Comment on lines +153 to +160
log.Error(patchErr, "invalid shardTaskPatches detected, updating status condition")
statusCopy := batchSbx.Status.DeepCopy()
setConditionInStatus(statusCopy, sandboxv1alpha1.BatchSandboxConditionInvalidShardPatch,
sandboxv1alpha1.ConditionTrue, "InvalidShardPatch", patchErr.Error())
_ = r.updateStatus(ctx, batchSbx, statusCopy)
// Return without error so the controller does not keep requeueing;
// the resource must be corrected by the user.
return ctrl.Result{}, nil
Comment on lines +162 to +163
_, statErr := os.Stat(abs)
alreadyExisted := statErr == nil
Comment on lines +169 to +171
if alreadyExisted {
return nil
}

after, err := os.Stat(existing)
require.NoError(t, err)
require.Equal(t, before.Mode(), after.Mode(), "MakeDir must not chmod a pre-existing directory")

afterParent, err := os.Stat(parent)
require.NoError(t, err)
require.Equal(t, beforeParent.Mode(), afterParent.Mode(), "MakeDir must not chmod the pre-existing parent")

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f369376e98

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


// BatchSandboxConditionType represents the type of BatchSandbox condition.
// +kubebuilder:validation:Enum=Ready;Progressing;Paused;PauseFailed;ResumeFailed;PodFailed
// +kubebuilder:validation:Enum=Ready;Progressing;Paused;PauseFailed;ResumeFailed;PodFailed;InvalidShardPatch

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Regenerate CRDs for the new condition

Adding InvalidShardPatch here is not enough for deployed clusters: the generated CRD schemas still restrict condition types to the old enum (checked kubernetes/config/crd/bases/sandbox.opensandbox.io_batchsandboxes.yaml lines 173-180 and the Helm CRD template lines 182-189, neither includes InvalidShardPatch). When the controller reaches the new invalid-patch path and writes this status condition, the API server will reject the status update under the installed CRD, so the condition will not be surfaced in real Kubernetes deployments. Please regenerate/update the CRD YAMLs alongside this API type change.

Useful? React with 👍 / 👎.

// Validate shardTaskPatches early so that a type-mismatch is surfaced
// immediately as a status condition rather than causing a silent merge
// failure deep in the reconcile loop.
if patchErr := taskStrategy.ValidateShardTaskPatches(); patchErr != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear the invalid-patch condition after validation succeeds

When a user fixes shardTaskPatches, this validation branch is skipped but nothing removes the previously written InvalidShardPatch=True condition. The normal status path starts from batchSbx.Status.DeepCopy() in buildRuntimeView, so that stale condition is carried into later status patches even after scheduling succeeds, leaving corrected BatchSandboxes permanently reported as invalid.

Useful? React with 👍 / 👎.

statusCopy := batchSbx.Status.DeepCopy()
setConditionInStatus(statusCopy, sandboxv1alpha1.BatchSandboxConditionInvalidShardPatch,
sandboxv1alpha1.ConditionTrue, "InvalidShardPatch", patchErr.Error())
_ = r.updateStatus(ctx, batchSbx, statusCopy)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return status update errors from invalid-patch path

If this status patch fails for a transient API error or conflict while handling an invalid shardTaskPatches update, the reconciler still returns nil and disables requeueing, so the user never sees the InvalidShardPatch condition that is meant to explain why reconciliation stopped. Please return or otherwise requeue on updateStatus errors in this branch instead of discarding them.

Useful? React with 👍 / 👎.

@ashishpatel26

Copy link
Copy Markdown
Author

Superseded by #1049 — clean single-issue branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants