fix(k8s): validate shardTaskPatches early and unblock finalizer on deletion#1040
Conversation
When a Volume with pvc backend and readOnly=True is mounted, only the VolumeMount readOnly field was set. Some CSI drivers (e.g. mountpoint-s3-csi-driver) require readOnly on the PersistentVolumeClaim volume source as well, so the mount remained writable despite the flag. Fix: also set persistentVolumeClaim.readOnly=true in the pod volumes list when the volume is declared read-only. Additionally change the dedup key from claim_name alone to (claim_name, read_only) so the same PVC can be mounted both read-only and read-write within the same pod without one mount silently overriding the other. Fixes opensandbox-group#545
When the proxy forwards POST/PUT/PATCH/DELETE requests it streams the body via an async generator, causing httpx to use Transfer-Encoding: chunked. The original Content-Length header was still forwarded, so Go's net/http on execd read exactly Content-Length bytes and discarded the remainder of the chunked stream. Files larger than ~18 KB appeared to succeed (HTTP 200) but were silently truncated on the sandbox filesystem. Fix: pop content-length from forwarded headers whenever stream_body is True so httpx is free to drive chunk framing without a conflicting length constraint. Fixes opensandbox-group#1016
When createDirectories is called on a path whose parent components already exist (e.g. /tmp), MakeDir was unconditionally calling ChmodFile on the target after os.MkdirAll, even if the directory was not newly created. For system directories the process does not own this causes 'chmod: operation not permitted'. Fix: stat the target before MkdirAll; if it already exists, skip ChmodFile entirely. Only newly-created directories receive the requested permission. This matches POSIX mkdir -p --mode semantics. Fixes opensandbox-group#1024
…ix 504 timeout
ConnectionConfig.get_base_url() was unconditionally appending /_API_VERSION ("v1")
to the domain, causing all SDK requests to hit /v1/sandboxes instead of /sandboxes.
In deployments where the gateway routes /v1/* to a synchronous blocking endpoint,
this caused 504 Gateway Timeout when pod startup exceeded the proxy idle timeout (~60s).
The generated API client already uses correct relative paths (e.g. /sandboxes), so
the base URL should be the plain domain without any version prefix. The server
registers routes at both / and /v1 with the same async handler, so removing /v1
from the base URL routes traffic to the non-prefixed async endpoint that returns
202 Accepted immediately and lets the SDK poll for readiness via check_ready().
Fixes opensandbox-group#591
…letion Fixes opensandbox-group#1019. Problem 1 - schema bypass: shardTaskPatches uses []runtime.RawExtension, so the API server accepts payloads like `args: 3600` (integer) even though TaskSpec.Args is []string. The mismatch is only discovered during reconcile when strategicpatch.StrategicMergePatch fails, generating a confusing error. Fix: add ValidateShardTaskPatches() to the TaskSchedulingStrategy interface and DefaultTaskSchedulingStrategy implementation. In the reconcile loop, call it before adding the finalizer. If invalid, write an InvalidShardPatch condition to status and return without error (no requeue) so the controller does not spin. Problem 2 - finalizer deadlock: because the merge failure prevents the controller from reaching the finalizer-cleanup code, a BatchSandbox with an invalid patch and a DeletionTimestamp is stuck in Terminating forever. Fix: on the deletion path, run the same validation first. If patches are invalid, clear the FinalizerTaskCleanup finalizer immediately and record the error in status so the resource can be garbage-collected without manual intervention. New tests: - TestDefaultTaskSchedulingStrategy_ValidateShardTaskPatches (5 sub-cases) - TestReconcile_InvalidShardTaskPatches_SetsConditionAndDoesNotProceed - TestReconcile_InvalidShardTaskPatches_OnDeletion_ClearsFinalizer
|
Changed directories: components、kubernetes、sdks、server. 📋 Recommended labels (based on changed files):
Other available labels:
💡 Tip: Use |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR addresses several correctness and reliability issues across the server proxy, Kubernetes controller, SDK base URL construction, and filesystem utilities.
Changes:
- Strip
Content-Lengthwhen proxying streamed request bodies to prevent backend truncation. - Improve Kubernetes PVC volume mounting behavior (RO/RW support) and add shard-task patch validation with controller status reporting.
- Update Python SDK
get_base_url()to avoid appending/v1, and makeMakeDiridempotent by skipping chmod/chown on pre-existing directories.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| server/tests/test_routes_proxy.py | Adds regression tests ensuring Content-Length is stripped for streamed proxy requests. |
| server/tests/k8s/test_volume_helper.py | Adds new tests for PVC/host volume application behavior in pod specs. |
| server/tests/k8s/test_batchsandbox_provider.py | Extends existing test to assert PVC readOnly propagates to the volume spec. |
| server/opensandbox_server/services/k8s/volume_helper.py | Updates PVC volume handling to support RO/RW mounts and improves logging. |
| server/opensandbox_server/api/proxy.py | Strips Content-Length when streaming request bodies. |
| sdks/sandbox/python/tests/test_connection_config.py | Updates/extends tests for base URL generation without /v1 and with slash stripping. |
| sdks/sandbox/python/src/opensandbox/config/connection_sync.py | Changes get_base_url() to stop appending /v1 and adds documentation. |
| sdks/sandbox/python/src/opensandbox/config/connection.py | Same as above for async config: stop appending /v1 and add documentation. |
| kubernetes/internal/controller/strategy/task_scheduling_strategy_default_test.go | Adds unit tests for shard-task-patch validation. |
| kubernetes/internal/controller/strategy/task_scheduling_strategy_default.go | Implements shard-task-patch validation using strategic merge patch. |
| kubernetes/internal/controller/strategy/task_scheduling_strategy.go | Extends interface with ValidateShardTaskPatches(). |
| kubernetes/internal/controller/batchsandbox_pause_resume_test.go | Adds reconciler tests for invalid shard-task-patches and deletion finalizer unblock. |
| kubernetes/internal/controller/batchsandbox_controller.go | Validates shard-task-patches early; sets status condition and unblocks deletion finalizer. |
| kubernetes/apis/sandbox/v1alpha1/batchsandbox_types.go | Adds InvalidShardPatch condition type to CRD. |
| components/execd/pkg/web/controller/utils_windows.go | Makes MakeDir skip chmod/chown for pre-existing directories (Windows build). |
| components/execd/pkg/web/controller/utils_test.go | Adds tests for MakeDir creation/idempotency/nested behavior. |
| components/execd/pkg/web/controller/utils.go | Same MakeDir idempotency change for non-Windows build. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pvc_key = (pvc_claim_name, vol.read_only) | ||
|
|
||
| if pvc_key not in pvc_to_volume_name: |
| pod_volumes.append({ | ||
| "name": vol_name, | ||
| "persistentVolumeClaim": { | ||
| "claimName": pvc_claim_name, | ||
| }, | ||
| "persistentVolumeClaim": pvc_volume, | ||
| }) | ||
| pvc_to_volume_name[pvc_claim_name] = vol_name | ||
| pvc_to_volume_name[pvc_key] = vol_name |
| "name": pvc_to_volume_name[pvc_key], | ||
| "mountPath": vol.mount_path, | ||
| "readOnly": vol.read_only, | ||
| } |
| v.get("name") for v in pod_volumes if isinstance(v, dict) | ||
| } | ||
| # Key: (claim_name, read_only) so the same PVC can be mounted both RO and RW. | ||
| pvc_to_volume_name: Dict[tuple, str] = {} |
| return f"{self.protocol}://{domain}/{self._API_VERSION}" | ||
| if domain.startswith("http://") or domain.startswith("https://"): | ||
| return domain.rstrip("/") | ||
| return f"{self.protocol}://{domain}" |
| log.Error(patchErr, "invalid shardTaskPatches detected, updating status condition") | ||
| statusCopy := batchSbx.Status.DeepCopy() | ||
| setConditionInStatus(statusCopy, sandboxv1alpha1.BatchSandboxConditionInvalidShardPatch, | ||
| sandboxv1alpha1.ConditionTrue, "InvalidShardPatch", patchErr.Error()) | ||
| _ = r.updateStatus(ctx, batchSbx, statusCopy) | ||
| // Return without error so the controller does not keep requeueing; | ||
| // the resource must be corrected by the user. | ||
| return ctrl.Result{}, nil |
| _, statErr := os.Stat(abs) | ||
| alreadyExisted := statErr == nil |
| if alreadyExisted { | ||
| return nil | ||
| } |
|
|
||
| after, err := os.Stat(existing) | ||
| require.NoError(t, err) | ||
| require.Equal(t, before.Mode(), after.Mode(), "MakeDir must not chmod a pre-existing directory") |
|
|
||
| afterParent, err := os.Stat(parent) | ||
| require.NoError(t, err) | ||
| require.Equal(t, beforeParent.Mode(), afterParent.Mode(), "MakeDir must not chmod the pre-existing parent") |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f369376e98
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| // BatchSandboxConditionType represents the type of BatchSandbox condition. | ||
| // +kubebuilder:validation:Enum=Ready;Progressing;Paused;PauseFailed;ResumeFailed;PodFailed | ||
| // +kubebuilder:validation:Enum=Ready;Progressing;Paused;PauseFailed;ResumeFailed;PodFailed;InvalidShardPatch |
There was a problem hiding this comment.
Regenerate CRDs for the new condition
Adding InvalidShardPatch here is not enough for deployed clusters: the generated CRD schemas still restrict condition types to the old enum (checked kubernetes/config/crd/bases/sandbox.opensandbox.io_batchsandboxes.yaml lines 173-180 and the Helm CRD template lines 182-189, neither includes InvalidShardPatch). When the controller reaches the new invalid-patch path and writes this status condition, the API server will reject the status update under the installed CRD, so the condition will not be surfaced in real Kubernetes deployments. Please regenerate/update the CRD YAMLs alongside this API type change.
Useful? React with 👍 / 👎.
| // Validate shardTaskPatches early so that a type-mismatch is surfaced | ||
| // immediately as a status condition rather than causing a silent merge | ||
| // failure deep in the reconcile loop. | ||
| if patchErr := taskStrategy.ValidateShardTaskPatches(); patchErr != nil { |
There was a problem hiding this comment.
Clear the invalid-patch condition after validation succeeds
When a user fixes shardTaskPatches, this validation branch is skipped but nothing removes the previously written InvalidShardPatch=True condition. The normal status path starts from batchSbx.Status.DeepCopy() in buildRuntimeView, so that stale condition is carried into later status patches even after scheduling succeeds, leaving corrected BatchSandboxes permanently reported as invalid.
Useful? React with 👍 / 👎.
| statusCopy := batchSbx.Status.DeepCopy() | ||
| setConditionInStatus(statusCopy, sandboxv1alpha1.BatchSandboxConditionInvalidShardPatch, | ||
| sandboxv1alpha1.ConditionTrue, "InvalidShardPatch", patchErr.Error()) | ||
| _ = r.updateStatus(ctx, batchSbx, statusCopy) |
There was a problem hiding this comment.
Return status update errors from invalid-patch path
If this status patch fails for a transient API error or conflict while handling an invalid shardTaskPatches update, the reconciler still returns nil and disables requeueing, so the user never sees the InvalidShardPatch condition that is meant to explain why reconciliation stopped. Please return or otherwise requeue on updateStatus errors in this branch instead of discarding them.
Useful? React with 👍 / 👎.
|
Superseded by #1049 — clean single-issue branch |
Summary
Fixes #1019.
ValidateShardTaskPatches()to theTaskSchedulingStrategyinterface andDefaultTaskSchedulingStrategy. Call it in the reconcile loop before adding theFinalizerTaskCleanupfinalizer. If any patch cannot be merged into a zero-valueTaskTemplateSpec(e.g.args: 3600instead ofargs: [3600]), write anInvalidShardPatchcondition to status and return without error, so the controller stops requeueing until the user corrects the resource.DeletionTimestamp != nil), run the same validation first. If patches are invalid, removeFinalizerTaskCleanupimmediately and record the error in status so the resource is no longer stuck inTerminating.BatchSandboxConditionInvalidShardPatch = InvalidShardPatchadded toBatchSandboxConditionType.Test plan
TestDefaultTaskSchedulingStrategy_ValidateShardTaskPatches— 5 sub-cases covering no patches, valid string args, integer args (invalid), malformed JSON, and second-patch-invalid detection.TestReconcile_InvalidShardTaskPatches_SetsConditionAndDoesNotProceed— reconcile with invalid patches setsInvalidShardPatchcondition, returns no error, and does not add finalizer.TestReconcile_InvalidShardTaskPatches_OnDeletion_ClearsFinalizer— reconcile of a terminating sandbox with invalid patches clears the finalizer so the resource is garbage-collected.