DOC-1581 inference provider production path #2348

Open
djwfyi wants to merge 5 commits into main from DOC-1581/inference-production-path

djwfyi commented Jun 30, 2026

Collaborator

Content Description

Adds the Inference Provider production path for teams building managed model-serving endpoints on GPU infrastructure. Includes a new production guide, a model-serving runtimes integration page, GPU and inference autoscaling patterns (KEDA and Prometheus), a guide to installing the GPU Operator inside tenant clusters, and a readability pass on the new content.

Preview Link

Internal Reference

Partially addresses DOC-1581

AI review: mention @claude in a comment to request a review or changes. See CONTRIBUTING.md for available commands.

FORK LIMITATION: @claude does not work on pull requests opened from forks. GitHub Actions cannot access the required secrets for fork-originated PRs. To use AI review, push your branch directly to this repository.

@netlify /docs

djwfyi and others added 4 commits June 30, 2026 13:16
Fix awkward phrasing, informal terms, and unclear word choices across
inference-provider.mdx, model-serving.mdx, and gpu-hpa-dcgm.mdx.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
djwfyi requested a review from a team as a code owner June 30, 2026 19:15

netlify Bot commented Jun 30, 2026


Deploy Preview for vcluster-docs-site ready!

Name Link
🔨 Latest commit 3b515b8
🔍 Latest deploy log https://app.netlify.com/projects/vcluster-docs-site/deploys/6a44314d3d3d2c00078cf68e
😎 Deploy Preview https://deploy-preview-2348--vcluster-docs-site.netlify.app/docs

To edit notification comments on pull requests, go to your Netlify project configuration.

djwfyi commented Jun 30, 2026

Collaborator Author

Remaining work and known gaps from DOC-1581

This PR covers Phases 1–3 of the issue. The items below are either out of scope for this PR or deferred.

Done in this PR

  • inference-provider.mdx — new production guide (Day 0/1/2)
  • model-serving.mdx — tenant-runtime pattern, smoke-test deployment, model storage, Gateway API endpoint exposure
  • gpu-hpa-dcgm.mdx — serving-metrics section: KEDA, queue depth, request concurrency, TTFT, latency
  • gpu-and-accelerator-support.mdx — GPU Operator install guide inside tenant clusters
  • Cross-links from overview.mdx, ai-cloud.mdx, gpu-cloud-platform.mdx
  • Architecture diagram (inference-provider.svg)

Not done — cross-repo (separate PRs needed)

  • vmetal-docs Phase 4: inference-specific GPU capacity design in docs/operate/gpu-fleet.mdx (hardware SKUs for serving tiers, per-tenant node types, warm pool guidance, OS image/driver rollout for model-serving fleets) and docs/overview.mdx (add inference provider path)
  • vnode-docs Phase 5: inference isolation section in docs/concepts/tenancy-models.mdx and docs/concepts/vcluster-integration.mdx (when vNode matters for inference: untrusted adapters, custom containers, privileged workloads)
  • Cross-links from vmetal-docs and vnode-docs back to the inference provider path in vcluster-docs

Not done — vcluster-docs (future issues)

  • Security/trust model page (Phase 6): dedicated page or section with buyer-facing validation checklist — what an enterprise buyer can verify about isolation boundaries, RBAC, quota enforcement, and audit trail
  • Detailed Day 2 operations content: the Day 2 table links out, but there is no prose yet for capacity exhaustion handling, endpoint drain and rollout, stuck model pods, GPU burn-in/validation, node reclaim on customer offboarding, or regional failover
  • SEO pass (Phase 7): natural placement of inference-provider search terms and final cross-link sweep across the three repos
  • End-to-end "serve a model" walkthrough: model-serving.mdx covers the runtime pattern but not the full step-by-step from vMetal node prep through validated external endpoint; the issue scoped this as a potential child issue

djwfyi self-assigned this Jun 30, 2026

djwfyi commented Jun 30, 2026

Collaborator Author

Reciprocal dependency: vmetal-docs#29 (DOC-1581 Phase 4)

The vMetal half of this inference provider journey is now up in loft-sh/vmetal-docs#29 (DOC-1581 Phase 4). It adds an "Inference provider capacity" section to the GPU fleet ops page and cross-links the vMetal docs back to the inference provider production path created here.

Merge order: vmetal-docs#29 links to https://www.vcluster.com/docs/vcluster/production-guide/inference-provider, which doesn't exist until this PR merges and deploys. The links are external URLs, so no build gate fails either way, but please merge #2348 first (or at the same time) so the vMetal links don't point at an unpublished page.

This PR already links into the vMetal pages (operate/gpu-fleet, deploy/gpu-quickstart), which are live, so there's no dependency in this direction.

Retarget the Day 0 and Day 2 vMetal cross-links to the new
#inference-provider-capacity anchor added in vmetal-docs#29, and tie
the endpoint readiness warm-pool guidance to vMetal's hardware-layer
warm pool against on-demand provisioning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>