From 6567f8668bb8028197611fd92c25279214dc3f8a Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Fri, 30 Jan 2026 14:16:22 -0800 Subject: [PATCH 1/7] Operator revamp phase 1 - Create and edit snippets for shared object storage content - Updates to BYOB page to clarify that it applies to all deployments - Add a new Requirements page for Self-Managed - Uses existing snippet imports extensively - Clear cross-references to reference architecture - Strong BYOB integration - Sections: - Software version requirements - Hardware requirements - Kubernetes (with OpenShift mention) - MySQL database (with config parameters and creation instructions) - Redis - Object storage (with BYOB provisioning link and configuration transition) - Networking (with DNS and load balancer info) - SSL/TLS - License - Next steps (links to cloud, on-prem, airgapped guides) - Create new directories for details specific to Cloud or on-premise deployments, will be populated in phases 3 and 4s - Update navigation --- docs.json | 3 +- .../secure-storage-connector.mdx | 4 + .../hosting/self-managed/requirements.mdx | 127 ++++++++++++++++++ snippets/en/_includes/byob-context-note.mdx | 8 ++ .../en/_includes/byob-provisioning-link.mdx | 10 ++ .../object-storage-configuration-intro.mdx | 7 + ...lf-managed-object-storage-requirements.mdx | 9 +- ...pi-key-create-additional-single-tenant.mdx | 10 ++ 8 files changed, 173 insertions(+), 5 deletions(-) create mode 100644 platform/hosting/self-managed/requirements.mdx create mode 100644 snippets/en/_includes/byob-context-note.mdx create mode 100644 snippets/en/_includes/byob-provisioning-link.mdx create mode 100644 snippets/en/_includes/object-storage-configuration-intro.mdx create mode 100644 snippets/en/_includes/service-account-api-key-create-additional-single-tenant.mdx diff --git a/docs.json b/docs.json index e976b3d5ab..399f6619fc 100644 --- a/docs.json +++ b/docs.json @@ -79,6 +79,7 @@ "pages": [ "platform/hosting/hosting-options/self-managed", "platform/hosting/self-managed/ref-arch", + "platform/hosting/self-managed/requirements", { "group": "Run W&B Server on Kubernetes", "pages": [ @@ -2634,4 +2635,4 @@ } ], "baseUrl": "https://docs.wandb.ai" -} \ No newline at end of file +} diff --git a/platform/hosting/data-security/secure-storage-connector.mdx b/platform/hosting/data-security/secure-storage-connector.mdx index a0e5b04369..a9cb5b3558 100644 --- a/platform/hosting/data-security/secure-storage-connector.mdx +++ b/platform/hosting/data-security/secure-storage-connector.mdx @@ -2,6 +2,10 @@ title: Bring your own bucket (BYOB) --- +import ByobContextNote from "/snippets/en/_includes/byob-context-note.mdx"; + + + ## Overview Bring your own bucket (BYOB) allows you to store W&B artifacts and other related sensitive data in your own cloud or on-prem infrastructure. In case of [Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud) or [Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud), data that you store in your bucket is not copied to the W&B managed infrastructure. 
diff --git a/platform/hosting/self-managed/requirements.mdx b/platform/hosting/self-managed/requirements.mdx new file mode 100644 index 0000000000..2984f2f781 --- /dev/null +++ b/platform/hosting/self-managed/requirements.mdx @@ -0,0 +1,127 @@ +--- +title: Self-Managed infrastructure requirements +description: Infrastructure and software requirements for W&B Self-Managed deployments +--- + +import SelfManagedVersionRequirements from "/snippets/en/_includes/self-managed-version-requirements.mdx"; +import SelfManagedHardwareRequirements from "/snippets/en/_includes/self-managed-hardware-requirements.mdx"; +import SelfManagedMysqlRequirements from "/snippets/en/_includes/self-managed-mysql-requirements.mdx"; +import SelfManagedMysqlDatabaseCreation from "/snippets/en/_includes/self-managed-mysql-database-creation.mdx"; +import SelfManagedRedisRequirements from "/snippets/en/_includes/self-managed-redis-requirements.mdx"; +import SelfManagedObjectStorageRequirements from "/snippets/en/_includes/self-managed-object-storage-requirements.mdx"; +import ByobProvisioningLink from "/snippets/en/_includes/byob-provisioning-link.mdx"; +import SelfManagedNetworkingRequirements from "/snippets/en/_includes/self-managed-networking-requirements.mdx"; +import SelfManagedSslTlsRequirements from "/snippets/en/_includes/self-managed-ssl-tls-requirements.mdx"; + +This page provides a comprehensive overview of the infrastructure and software requirements for deploying W&B Self-Managed. Review these requirements before beginning your deployment. + + +W&B recommends fully managed deployment options such as [W&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. + + +For complete architectural guidance, see the [reference architecture](/platform/hosting/self-managed/ref-arch/). + +## Software version requirements + + + +## Hardware requirements + + + +For detailed sizing recommendations based on your use case (Models only, Weave only, or both), see the [reference architecture sizing section](/platform/hosting/self-managed/ref-arch/#sizing). + +## Kubernetes + +W&B Server is deployed as a [Kubernetes Operator](/platform/hosting/self-managed/operator/) that manages multiple pods. Your Kubernetes cluster must meet these requirements: + +- **Version**: See [Software version requirements](#software-version-requirements) above +- **Ingress controller**: A fully configured and functioning ingress controller (Nginx, Istio, Traefik, or cloud provider ingress) +- **Persistent volumes**: Capability to provision persistent volumes +- **CPU architecture**: Intel or AMD 64-bit (ARM is not supported) + +W&B supports deployment on [OpenShift Kubernetes clusters](https://www.redhat.com/en/technologies/cloud-computing/openshift) in cloud, on-premises, and air-gapped environments. For specific configuration instructions, see the [OpenShift section](/platform/hosting/self-managed/operator/#openshift-kubernetes-clusters) in the Operator guide. + +For complete Kubernetes requirements including load balancer and ingress configuration, see the [reference architecture Kubernetes section](/platform/hosting/self-managed/ref-arch/#kubernetes). 
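+
+To spot-check that an existing cluster meets these basics before you deploy, you can run a few read-only commands. This is an optional sanity check rather than an exhaustive validation, and output formats vary by Kubernetes distribution.
+
+```shell
+# Confirm the control plane version and node CPU architecture (amd64 required)
+kubectl version
+kubectl get nodes -L kubernetes.io/arch
+
+# Confirm that an ingress controller and a storage class are available
+kubectl get ingressclass
+kubectl get storageclass
+```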
+ +## MySQL database + + + +**W&B strongly recommends using managed database services** such as AWS RDS Aurora MySQL, Google Cloud SQL for MySQL, or Azure Database for MySQL for production deployments. Managed services provide automated backups, monitoring, high availability, patching, and significantly reduce operational complexity. + +### MySQL configuration parameters + +If you are running your own MySQL instance, configure MySQL with these settings: + +``` +binlog_format = 'ROW' +binlog_row_image = 'MINIMAL' +innodb_flush_log_at_trx_commit = 1 +innodb_online_alter_log_max_size = 268435456 +max_prepared_stmt_count = 1048576 +sort_buffer_size = '67108864' +sync_binlog = 1 +``` + +These settings have been validated by W&B for optimal performance and reliability. + +### Database creation + +For instructions to manually create the MySQL database and user: + + + +For additional considerations including backups, performance, monitoring, and availability, see the [reference architecture MySQL section](/platform/hosting/self-managed/ref-arch/#mysql). + +## Redis + + + +W&B can connect to a Redis instance in the following environments: + +- [AWS Elasticache](https://aws.amazon.com/pm/elasticache/) +- [Google Cloud Memory Store](https://cloud.google.com/memorystore?hl=en) +- [Azure Cache for Redis](https://azure.microsoft.com/en-us/products/cache) +- Redis deployment hosted in your cloud or on-premises infrastructure + +## Object storage + + + + + +### Configure W&B to use your bucket + +After provisioning your bucket, you will configure W&B to use it through the Operator's Helm values. See the [Operator object storage configuration section](/platform/hosting/self-managed/operator/#object-storage-bucket) for details. + +## Networking + + + +### DNS + +The fully qualified domain name (FQDN) of the W&B deployment must resolve to the IP address of the ingress/load balancer using an A record. + +### Load balancer and ingress + +The W&B Kubernetes Operator exposes services using a Kubernetes ingress controller, which routes to service endpoints based on URL paths. The ingress controller must be accessible by all machines that execute machine learning payloads or access the service through web browsers. + +For detailed load balancer options, ingress controller requirements, and configuration examples, see the [reference architecture load balancer section](/platform/hosting/self-managed/ref-arch/#load-balancer-and-ingress). + +## SSL/TLS + + + +## License + +A valid W&B Server license is required for all Self-Managed deployments. See [Obtain your W&B Server license](/platform/hosting/hosting-options/self-managed#obtain-your-wb-server-license) for instructions. 
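+
+Before you move on to a deployment guide, you can optionally verify the external dependencies from a machine inside your network. The commands below are a sketch; the hostnames are placeholders, so substitute your own MySQL, Redis, and DNS endpoints.
+
+```shell
+# MySQL: confirm the required settings are in effect
+mysql -h db.example.com -u wandb -p -e \
+  "SHOW VARIABLES WHERE Variable_name IN ('binlog_format','innodb_flush_log_at_trx_commit','sync_binlog');"
+
+# Redis: confirm the instance is reachable
+redis-cli -h redis.example.com -p 6379 ping
+
+# DNS: confirm the FQDN resolves to your ingress or load balancer
+dig +short wandb.example.com
+```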
+ +## Next steps + +After ensuring your infrastructure meets these requirements: + +- **Cloud deployments**: See [Deploy with Terraform on public cloud](/platform/hosting/self-managed/cloud-deployments/terraform) for AWS, Google Cloud, or Azure deployments using Terraform modules +- **On-premises deployments**: See [Deploy on Kubernetes](/platform/hosting/self-managed/on-premises-deployments/kubernetes) for standard on-premises deployments +- **Air-gapped deployments**: See [Deploy in air-gapped environment](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped) for disconnected environments +- **All deployment methods**: See [Deploy with Kubernetes Operator](/platform/hosting/self-managed/operator) for the core operator deployment guide diff --git a/snippets/en/_includes/byob-context-note.mdx b/snippets/en/_includes/byob-context-note.mdx new file mode 100644 index 0000000000..0cae9fedb1 --- /dev/null +++ b/snippets/en/_includes/byob-context-note.mdx @@ -0,0 +1,8 @@ + +**This guide applies to all W&B deployment types:** +- **Multi-tenant Cloud**: Team-level BYOB +- **Dedicated Cloud**: Instance and team-level BYOB +- **Self-Managed**: Instance and team-level BYOB + +The bucket provisioning instructions in this guide are the same regardless of your deployment type. + diff --git a/snippets/en/_includes/byob-provisioning-link.mdx b/snippets/en/_includes/byob-provisioning-link.mdx new file mode 100644 index 0000000000..b85b92bfdf --- /dev/null +++ b/snippets/en/_includes/byob-provisioning-link.mdx @@ -0,0 +1,10 @@ +### Provision your storage bucket + +Before configuring W&B, provision your object storage bucket with proper IAM policies, CORS configuration, and access credentials. + +**See the [Bring Your Own Bucket (BYOB) guide](/platform/hosting/data-security/secure-storage-connector) for detailed step-by-step provisioning instructions for:** +- Amazon S3 (including IAM policies and bucket policies) +- Google Cloud Storage (including PubSub notifications) +- Azure Blob Storage (including managed identities) +- CoreWeave AI Object Storage +- S3-compatible storage (MinIO Enterprise, NetApp StorageGRID, and other enterprise solutions) diff --git a/snippets/en/_includes/object-storage-configuration-intro.mdx b/snippets/en/_includes/object-storage-configuration-intro.mdx new file mode 100644 index 0000000000..b8c2e8534e --- /dev/null +++ b/snippets/en/_includes/object-storage-configuration-intro.mdx @@ -0,0 +1,7 @@ +### Configure W&B to use your bucket + +Once your bucket is provisioned, configure W&B to use it: + +- **Self-Managed deployments**: See the [Operator configuration guide](/platform/hosting/self-managed/operator#object-storage-bucket) for Helm values configuration +- **Dedicated Cloud**: Configure via the [System Console](/platform/hosting/iam/sso#system-console) +- **Multi-tenant Cloud**: Configure when creating or editing a team diff --git a/snippets/en/_includes/self-managed-object-storage-requirements.mdx b/snippets/en/_includes/self-managed-object-storage-requirements.mdx index bf81224897..90df7fddc9 100644 --- a/snippets/en/_includes/self-managed-object-storage-requirements.mdx +++ b/snippets/en/_includes/self-managed-object-storage-requirements.mdx @@ -1,15 +1,16 @@ W&B requires object storage with pre-signed URL and CORS support. 
-For production deployments, W&B recommends using managed object storage services: +**Recommended storage providers:** - [Amazon S3](https://aws.amazon.com/s3/): Object storage service offering industry-leading scalability, data availability, security, and performance. - [Google Cloud Storage](https://cloud.google.com/storage): Managed service for storing unstructured data at scale. - [Azure Blob Storage](https://azure.microsoft.com/products/storage/blobs): Cloud-based object storage solution for storing massive amounts of unstructured data. - [CoreWeave AI Object Storage](https://docs.coreweave.com/docs/products/storage/object-storage): High-performance, S3-compatible object storage service optimized for AI workloads. - -For self-hosted object storage options, see the [bare-metal guide object storage section](/platform/hosting/self-managed/bare-metal/#object-storage) for detailed setup instructions including CORS configuration and enterprise alternatives. +- Enterprise S3-compatible storage: [MinIO Enterprise (AIStor)](https://min.io/product/aistor), [NetApp StorageGRID](https://www.netapp.com/data-storage/storagegrid/), or other enterprise-grade solutions -MinIO Open Source is in [maintenance mode](https://github.com/minio/minio) with no active development or pre-compiled binaries. For production deployments, W&B recommends using managed object storage services or enterprise-grade S3-compatible solutions. +MinIO Open Source is in [maintenance mode](https://github.com/minio/minio) with no active development or pre-compiled binaries. For production deployments, W&B recommends using managed object storage services or enterprise S3-compatible solutions such as MinIO Enterprise (AIStor). +For detailed bucket provisioning instructions including IAM policies, CORS configuration, and access setup, see the [Bring Your Own Bucket (BYOB) guide](/platform/hosting/data-security/secure-storage-connector). + See the [reference architecture object storage section](/platform/hosting/self-managed/ref-arch/#object-storage) for complete requirements. diff --git a/snippets/en/_includes/service-account-api-key-create-additional-single-tenant.mdx b/snippets/en/_includes/service-account-api-key-create-additional-single-tenant.mdx new file mode 100644 index 0000000000..e6b1115fab --- /dev/null +++ b/snippets/en/_includes/service-account-api-key-create-additional-single-tenant.mdx @@ -0,0 +1,10 @@ +To create an API key owned by a service account: + +1. Navigate to the **Service Accounts** tab in your team or organization settings. +2. Find the service account in the list. +3. Click the action menu (`...`), then click **Create API key**. +4. Provide a name for the API key, then click **Create**. +5. Copy the displayed API key immediately and store it securely. +6. Click **Done**. + +You can create multiple API keys for a single service account to support different environments or workflows. 
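+
+For example, a non-interactive workload such as a CI job can authenticate with a service account API key through environment variables. The values below are placeholders; `WANDB_BASE_URL` is only needed for Dedicated Cloud or Self-Managed instances.
+
+```shell
+export WANDB_API_KEY="<service-account-api-key>"
+export WANDB_BASE_URL="https://wandb.example.com"
+
+# The W&B SDK and CLI read these variables automatically; an explicit login is optional:
+wandb login "$WANDB_API_KEY"
+```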
From 19785806b8a73d3ad60c0d3d5e940c17e54373d8 Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Fri, 30 Jan 2026 14:47:01 -0800 Subject: [PATCH 2/7] Operator revamp phase 2 - Move and edit Operator landing page - Update title - Use shared BYOB content - Adjust Before you begin - Update links for file moves - Adjust support storage provider list - Improve cross-references - Where possible, tabify content for Helm and Terraform to simplify the giew - Update navigation --- docs.json | 9 +- platform/hosting/operator.mdx | 4 +- platform/hosting/self-managed/operator.mdx | 1091 ++++++++++++++++++++ 3 files changed, 1095 insertions(+), 9 deletions(-) create mode 100644 platform/hosting/self-managed/operator.mdx diff --git a/docs.json b/docs.json index 399f6619fc..29a8bd06cd 100644 --- a/docs.json +++ b/docs.json @@ -80,13 +80,7 @@ "platform/hosting/hosting-options/self-managed", "platform/hosting/self-managed/ref-arch", "platform/hosting/self-managed/requirements", - { - "group": "Run W&B Server on Kubernetes", - "pages": [ - "platform/hosting/operator", - "platform/hosting/self-managed/operator-airgapped" - ] - }, + "platform/hosting/self-managed/operator", { "group": "Install on public cloud", "pages": [ @@ -96,6 +90,7 @@ ] }, "platform/hosting/self-managed/bare-metal", + "platform/hosting/self-managed/operator-airgapped", "platform/hosting/server-upgrade-process", "platform/hosting/self-managed/disable-automatic-app-version-updates" ] diff --git a/platform/hosting/operator.mdx b/platform/hosting/operator.mdx index f779d56b3c..26390de7c5 100644 --- a/platform/hosting/operator.mdx +++ b/platform/hosting/operator.mdx @@ -100,7 +100,7 @@ W&B supports deployment on [OpenShift Kubernetes clusters](https://www.redhat.co W&B recommends you install with the official W&B Helm chart. -#### Run the container as an un-privileged user +#### Run the container as an unprivileged user By default, containers use a `$UID` of 999. Specify `$UID` >= 100000 and a `$GID` of 0 if your orchestrator requires the container run with a non-root user. @@ -137,7 +137,7 @@ api: If needed, configure a custom security context for other components like `app` or `console`. For details, see [Custom security context](#custom-security-context). -## Deploy W&B Server application +## Deploy W&B Server **The W&B Kubernetes Operator with Helm is the recommended installation method** for all W&B Self-Managed deployments, including cloud, on-premises, and air-gapped environments. diff --git a/platform/hosting/self-managed/operator.mdx b/platform/hosting/self-managed/operator.mdx new file mode 100644 index 0000000000..15a3b8335e --- /dev/null +++ b/platform/hosting/self-managed/operator.mdx @@ -0,0 +1,1091 @@ +--- +description: Deploy W&B Platform with Kubernetes Operator +title: Deploy W&B with Kubernetes Operator +--- + +import SelfManagedMysqlRequirements from "/snippets/en/_includes/self-managed-mysql-requirements.mdx"; +import SelfManagedRedisRequirements from "/snippets/en/_includes/self-managed-redis-requirements.mdx"; +import SelfManagedObjectStorageRequirements from "/snippets/en/_includes/self-managed-object-storage-requirements.mdx"; +import ByobProvisioningLink from "/snippets/en/_includes/byob-provisioning-link.mdx"; +import SelfManagedVerifyInstallation from "/snippets/en/_includes/self-managed-verify-installation.mdx"; + +## W&B Kubernetes Operator + +Use the W&B Kubernetes Operator to simplify deploying, administering, troubleshooting, and scaling your W&B Server deployments on Kubernetes. 
You can think of the operator as a smart assistant for your W&B instance.
+
+The W&B Server architecture and design continuously evolves to expand AI developer tooling capabilities, and to provide appropriate primitives for high performance, better scalability, and easier administration. That evolution applies to the compute services, relevant storage, and the connectivity between them. To help facilitate continuous updates and improvements across deployment types, W&B uses a Kubernetes operator.
+
+
+W&B uses the operator to deploy and manage Dedicated Cloud instances on AWS, Google Cloud and Azure public clouds.
+
+
+For more information about Kubernetes operators, see [Operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) in the Kubernetes documentation.
+
+### Reasons for the architecture shift
+Historically, the W&B application was deployed as a single deployment and pod within a Kubernetes cluster or as a single Docker container. W&B has recommended, and continues to recommend, externalizing the database and object store. Externalizing the database and object store decouples the application's state.
+
+As the application grew, the need to evolve from a monolithic container to a distributed system (microservices) became apparent. This change facilitates backend logic handling and seamlessly introduces built-in Kubernetes infrastructure capabilities. Distributed systems also support deploying new services essential for additional features that W&B relies on.
+
+Before 2024, any Kubernetes-related change required manually updating the [terraform-kubernetes-wandb](https://github.com/wandb/terraform-kubernetes-wandb) Terraform module. Updating the module meant ensuring compatibility across cloud providers, configuring the necessary Terraform variables, and executing a Terraform apply for each backend or Kubernetes-level change.
+
+This process was not scalable, since W&B Support had to assist each customer with upgrading their Terraform module.
+
+The solution was to implement an operator that connects to a central [deploy.wandb.ai](https://deploy.wandb.ai) server to request the latest specification changes for a given release channel and apply them. Updates are received as long as the license is valid. [Helm](https://helm.sh/) is used as both the deployment mechanism for the W&B operator and the means for the operator to handle all configuration templating of the W&B Kubernetes stack (Helm-ception).
+
+### How it works
+You can install the operator with Helm or from source. See [charts/operator](https://github.com/wandb/helm-charts/tree/main/charts/operator) for detailed instructions.
+
+The installation process creates a deployment called `controller-manager` and uses a [custom resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) definition named `weightsandbiases.apps.wandb.com` (shortName: `wandb`), which takes a single `spec` and applies it to the cluster:
+
+```yaml
+apiVersion: apiextensions.k8s.io/v1
+kind: CustomResourceDefinition
+metadata:
+  name: weightsandbiases.apps.wandb.com
+```
+
+The `controller-manager` installs [charts/operator-wandb](https://github.com/wandb/helm-charts/tree/main/charts/operator-wandb) based on the spec of the custom resource, the release channel, and a user-defined config. The configuration specification hierarchy enables maximum configuration flexibility at the user end and enables W&B to release new images, configurations, features, and Helm updates automatically.
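+
+For example, once the operator is running you can inspect the CRD and the custom resource it reconciles. The resource and namespace names below match the defaults used in the examples on this page and may differ in your cluster.
+
+```shell
+# The custom resource definition registered by the operator
+kubectl get crd weightsandbiases.apps.wandb.com
+
+# The custom resource (shortName: wandb) that carries the spec
+kubectl get wandb --all-namespaces
+kubectl describe weightsandbiases wandb -n default
+```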
+ +Refer to the [configuration specification hierarchy](#configuration-specification-hierarchy) and [configuration reference](#configuration-reference-for-wb-operator) for configuration options. + +The deployment consists of multiple pods, one per service. Each pod's name is prefixed with `wandb-`. + +### Configuration specification hierarchy +Configuration specifications follow a hierarchical model where higher-level specifications override lower-level ones. Here's how it works: + +- **Release Channel Values**: This base level configuration sets default values and configurations based on the release channel set by W&B for the deployment. +- **User Input Values**: Users can override the default settings provided by the Release Channel Spec through the System Console. +- **Custom Resource Values**: The highest level of specification, which comes from the user. Any values specified here override both the User Input and Release Channel specifications. For a detailed description of the configuration options, see [Configuration Reference](#configuration-reference-for-wb-operator). + +This hierarchical model ensures that configurations are flexible and customizable to meet varying needs while maintaining a manageable and systematic approach to upgrades and changes. + +## Before you begin + +Before deploying W&B with the Kubernetes Operator, ensure your infrastructure meets all requirements: + +1. **Review infrastructure requirements**: See the [Self-Managed infrastructure requirements](/platform/hosting/self-managed/requirements/) page for comprehensive details on: + - Software version requirements (Kubernetes, MySQL, Redis, Helm) + - Hardware requirements (CPU architecture, sizing recommendations) + - Kubernetes cluster configuration + - Networking, SSL/TLS, and DNS requirements +2. **Obtain a W&B Server license**: See [Obtain your W&B Server license](/platform/hosting/hosting-options/self-managed#obtain-your-wb-server-license). +3. **Provision external services**: Set up MySQL, Redis, and object storage before deployment. + +For additional context, see the [reference architecture](/platform/hosting/self-managed/ref-arch/) page. + +### MySQL Database + + +For complete MySQL setup instructions including configuration parameters and database creation, see the [MySQL section in the requirements page](/platform/hosting/self-managed/requirements/#mysql-database). + +### Redis + + +See the [External Redis configuration section](#external-redis) for details on how to configure an external Redis instance in Helm values. + +### Object storage + + + + +See the [Object storage configuration section](#object-storage-bucket) for details on how to configure object storage in Helm values. + +### OpenShift Kubernetes clusters + +W&B supports deployment on [OpenShift Kubernetes clusters](https://www.redhat.com/en/technologies/cloud-computing/openshift) in cloud, on-premises, and air-gapped environments. + + +W&B recommends you install with the official W&B Helm chart. + + +#### Run the container as an un-privileged user + +By default, containers use a `$UID` of 999. Specify `$UID` >= 100000 and a `$GID` of 0 if your orchestrator requires the container run with a non-root user. + + +W&B must start as the root group (`$GID=0`) for file system permissions to function properly. + + +Configure security contexts for each W&B component. 
For example, to configure the API component: + +```yaml +api: + install: true + image: + repository: wandb/megabinary + tag: 0.74.1 + pod: + securityContext: + fsGroup: 10001 + fsGroupChangePolicy: Always + runAsGroup: 0 + runAsNonRoot: true + runAsUser: 10001 + seccompProfile: + type: RuntimeDefault + container: + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + privileged: false + readOnlyRootFilesystem: false +``` + +If needed, configure a custom security context for other components like `app` or `console`. For details, see [Custom security context](#custom-security-context). + +## Deploy W&B Server application + + +**The W&B Kubernetes Operator with Helm is the recommended installation method** for all W&B Self-Managed deployments, including cloud, on-premises, and air-gapped environments. + + +Choose your deployment method: + + + + +W&B provides a Helm Chart to deploy the W&B Kubernetes operator to a Kubernetes cluster. This approach allows you to deploy W&B Server with Helm CLI or a continuous delivery tool like ArgoCD. + +For deployment-specific considerations, also see: +- [Deploy on Kubernetes](/platform/hosting/self-managed/on-premises-deployments/kubernetes/) for standard on-premises environments +- [Deploy in air-gapped environment](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped/) for disconnected environments +- [Deploy with Terraform on public cloud](/platform/hosting/self-managed/cloud-deployments/terraform/) for AWS, Google Cloud, or Azure + +Follow these steps to install the W&B Kubernetes Operator with Helm CLI: + +1. Add the W&B Helm repository. The W&B Helm chart is available in the W&B Helm repository: + ```shell + helm repo add wandb https://charts.wandb.ai + helm repo update + ``` +2. Install the Operator on a Kubernetes cluster: + ```shell + helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace + ``` +3. Configure the W&B operator custom resource to trigger the W&B Server installation. Create a file named `operator.yaml` with your W&B deployment configuration. Refer to [Configuration Reference](#configuration-reference-for-wb-server) for all available options. + + Here's a minimal example configuration: + + ```yaml + apiVersion: apps.wandb.com/v1 + kind: WeightsAndBiases + metadata: + labels: + app.kubernetes.io/name: weightsandbiases + app.kubernetes.io/instance: wandb + name: wandb + namespace: default + spec: + values: + global: + host: https:// + license: eyJhbGnUzaH...j9ZieKQ2x5GGfw + bucket: +
+ mysql: + + ingress: + annotations: + + ``` + +4. Start the Operator with your custom configuration so that it can install, configure, and manage the W&B Server application: + + ```shell + kubectl apply -f operator.yaml + ``` + + Wait until the deployment completes. This takes a few minutes. + +5. To verify the installation using the web UI, create the first admin user account, then follow the verification steps outlined in [Verify the installation](#verify-the-installation). + + + + + +Deploy W&B using Terraform for infrastructure-as-code deployments. Choose between: +- **Helm Terraform Module**: Deploys the operator to existing Kubernetes infrastructure +- **Cloud Terraform Modules**: Complete infrastructure + application deployment for AWS, Google Cloud, and Azure + +For deployment-specific considerations, also see: +- [Deploy on Kubernetes](/platform/hosting/self-managed/on-premises-deployments/kubernetes/) for standard on-premises environments +- [Deploy in air-gapped environment](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped/) for disconnected environments +- [Deploy with Terraform on public cloud](/platform/hosting/self-managed/cloud-deployments/terraform/) for AWS, Google Cloud, or Azure + +#### Helm Terraform Module + +This method allows for customized deployments tailored to specific requirements, leveraging Terraform's infrastructure-as-code approach for consistency and repeatability. The official W&B Helm-based Terraform Module is located [here](https://registry.terraform.io/modules/wandb/wandb/helm/latest). + +The following code can be used as a starting point and includes all necessary configuration options for a production grade deployment: + +```hcl +module "wandb" { + source = "wandb/wandb/helm" + + spec = { + values = { + global = { + host = "https://" + license = "eyJhbGnUzaH...j9ZieKQ2x5GGfw" + + bucket = { +
+ } + + mysql = { + + } + } + + ingress = { + annotations = { + "a" = "b" + "x" = "y" + } + } + } + } +} +``` + +Note that the configuration options are the same as described in [Configuration Reference](#configuration-reference-for-wb-server), but that the syntax has to follow the HashiCorp Configuration Language (HCL). The Terraform module creates the W&B custom resource definition (CRD). + +To see how W&B themselves use the Helm Terraform module to deploy Dedicated Cloud installations for customers, follow these links: +- [AWS](https://github.com/wandb/terraform-aws-wandb/blob/45e1d746f53e78e73e68f911a1f8cad5408e74b6/main.tf#L225) +- [Azure](https://github.com/wandb/terraform-azurerm-wandb/blob/170e03136b6b6fc758102d59dacda99768854045/main.tf#L155) +- [Google Cloud](https://github.com/wandb/terraform-google-wandb/blob/49ddc3383df4cefc04337a2ae784f57ce2a2c699/main.tf#L189) + +#### Cloud Terraform Modules + +W&B provides a set of Terraform Modules for AWS, Google Cloud and Azure. These modules deploy entire infrastructures including Kubernetes clusters, load balancers, MySQL databases and so on as well as the W&B Server application. The W&B Kubernetes Operator is already pre-baked with these official W&B cloud-specific Terraform Modules with the following versions: + +| Terraform Registry | Source Code | Version | +| ------------------------------------------------------------------- | ------------------------------------------------ | ------- | +| [AWS](https://registry.terraform.io/modules/wandb/wandb/aws/latest) | https://github.com/wandb/terraform-aws-wandb | v4.0.0+ | +| [Azure](https://github.com/wandb/terraform-azurerm-wandb) | https://github.com/wandb/terraform-azurerm-wandb | v2.0.0+ | +| [Google Cloud](https://github.com/wandb/terraform-google-wandb) | https://github.com/wandb/terraform-google-wandb | v2.0.0+ | + +This integration ensures that W&B Kubernetes Operator is ready to use for your instance with minimal setup, providing a streamlined path to deploying and managing W&B Server in your cloud environment. + +For detailed instructions on using these cloud-specific modules, see [Deploy with Terraform on public cloud](/platform/hosting/self-managed/cloud-deployments/terraform/). + + + + +### Verify the installation + + + +## Access the W&B Management Console +The W&B Kubernetes operator comes with a management console. It is located at `${HOST_URI}/console`, for example `https://wandb.company-name.com/console`. + +There are two ways to log in to the management console: + + + +1. Open the W&B application in the browser and login. Log in to the W&B application with `${HOST_URI}/`, for example `https://wandb.company-name.com/` +2. Access the console. Click on the icon in the top right corner and then click **System console**. Only users with admin privileges can see the **System console** entry. + + + System console access + + + + +W&B recommends you access the console using the following steps only if Option 1 does not work. + + +1. Open console application in browser. Open the above described URL, which redirects you to the login screen: + + Direct system console access + +2. Retrieve the password from the Kubernetes secret that the installation generates: + ```shell + kubectl get secret wandb-password -o jsonpath='{.data.password}' | base64 -d + ``` + Copy the password. +3. Login to the console. Paste the copied password, then click **Login**. + + + +## Update the W&B Kubernetes operator +This section describes how to update the W&B Kubernetes operator. 
+
+* Updating the W&B Kubernetes operator does not update the W&B Server application.
+* If you use a Helm chart that does not use the W&B Kubernetes operator, see [Migrate Self-Managed instances to W&B Operator](#migrate-self-managed-instances-to-wb-operator) before you follow these instructions to update the W&B operator.
+
+
+Copy and paste the code snippets below into your terminal.
+
+1. First, update the repo with [`helm repo update`](https://helm.sh/docs/helm/helm_repo_update/):
+    ```shell
+    helm repo update
+    ```
+
+2. Next, update the Helm chart with [`helm upgrade`](https://helm.sh/docs/helm/helm_upgrade/):
+    ```shell
+    helm upgrade operator wandb/operator -n wandb-cr --reuse-values
+    ```
+
+## Update the W&B Server application
+You no longer need to update the W&B Server application manually if you use the W&B Kubernetes operator.
+
+The operator automatically updates your W&B Server application when W&B releases a new version of the software.
+
+
+## Migrate Self-Managed instances to W&B Operator
+The following sections describe how to migrate from self-managing your own W&B Server installation to using the W&B Operator to do this for you. The migration process depends on how you installed W&B Server:
+
+
+The W&B Operator is the default and recommended installation method for W&B Server. Reach out to [Customer Support](mailto:support@wandb.com) or your W&B team if you have any questions.
+
+
+- If you used the official W&B Cloud Terraform Modules, navigate to the appropriate documentation and follow the steps there:
+  - [AWS](#migrate-to-operator-based-aws-terraform-modules)
+  - [Google Cloud](#migrate-to-operator-based-google-cloud-terraform-modules)
+  - [Azure](#migrate-to-operator-based-azure-terraform-modules)
+- If you used the [W&B Non-Operator Helm chart](https://github.com/wandb/helm-charts/tree/main/charts/wandb), continue [here](#migrate-to-operator-based-helm-chart).
+- If you used the [W&B Non-Operator Helm chart with Terraform](https://registry.terraform.io/modules/wandb/wandb/kubernetes/latest), continue [here](#migrate-to-operator-based-terraform-helm-chart).
+- If you created the Kubernetes resources with manifests, continue [here](#migrate-to-operator-based-helm-chart).
+
+
+### Migrate to Operator-based AWS Terraform Modules
+
+For a detailed description of the migration process, continue [here](https://github.com/wandb/helm-charts/tree/main/charts/operator-wandb).
+
+### Migrate to Operator-based Google Cloud Terraform Modules
+
+Reach out to [Customer Support](mailto:support@wandb.com) or your W&B team if you have any questions or need assistance.
+
+
+### Migrate to Operator-based Azure Terraform Modules
+
+Reach out to [Customer Support](mailto:support@wandb.com) or your W&B team if you have any questions or need assistance.
+
+### Migrate to Operator-based Helm chart
+
+Follow these steps to migrate to the Operator-based Helm chart:
+
+1. Get the current W&B configuration. If W&B was deployed with a non-operator-based version of the Helm chart, export the values like this:
+    ```shell
+    helm get values wandb
+    ```
+    If W&B was deployed with Kubernetes manifests, export the values like this:
+    ```shell
+    kubectl get deployment wandb -o yaml
+    ```
+    You now have all the configuration values you need for the next step.
+
+2. Create a file called `operator.yaml`. Follow the format described in the [Configuration Reference](#configuration-reference-for-wb-operator). Use the values from step 1.
+
+3. Scale the current deployment to 0 pods.
This step stops the current deployment.
+    ```shell
+    kubectl scale --replicas=0 deployment wandb
+    ```
+4. Update the Helm chart repo:
+    ```shell
+    helm repo update
+    ```
+5. Install the new Helm chart:
+    ```shell
+    helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace
+    ```
+6. Configure the new Helm chart and trigger the W&B application deployment. Apply the new configuration:
+    ```shell
+    kubectl apply -f operator.yaml
+    ```
+    The deployment takes a few minutes to complete.
+
+7. Verify the installation. Make sure that everything works by following the steps in [Verify the installation](#verify-the-installation).
+
+8. Remove the old installation. Uninstall the old Helm chart or delete the resources that were created with manifests.
+
+### Migrate to Operator-based Terraform Helm chart
+
+Follow these steps to migrate to the Operator-based Terraform Helm chart:
+
+
+1. Prepare the Terraform config. Replace the Terraform code from the old deployment in your Terraform config with the configuration described in [Helm Terraform Module](#helm-terraform-module). Set the same variables as before. Do not change your `.tfvars` file if you have one.
+2. Execute the Terraform run. Execute `terraform init`, `plan`, and `apply`.
+3. Verify the installation. Make sure that everything works by following the steps in [Verify the installation](#verify-the-installation).
+4. Remove the old installation. Uninstall the old Helm chart or delete the resources that were created with manifests.
+
+
+
+## Configuration Reference for W&B Server
+
+This section describes the configuration options for the W&B Server application. The application receives its configuration as a custom resource definition named [WeightsAndBiases](#how-it-works). Some configuration options are exposed with the configuration below; others need to be set as environment variables.
+
+The documentation has two lists of environment variables: [basic](/platform/hosting/env-vars/) and [advanced](/platform/hosting/iam/advanced_env_vars/). Only use environment variables if the configuration option that you need is not exposed using the Helm Chart.
+
+### Basic example
+This example defines the minimum set of values required for W&B. For a more realistic production example, see [Complete example](#complete-example).
+
+This YAML file defines the desired state of your W&B deployment, including the version, environment variables, external resources like databases, and other necessary settings.
+
+```yaml
+apiVersion: apps.wandb.com/v1
+kind: WeightsAndBiases
+metadata:
+  labels:
+    app.kubernetes.io/name: weightsandbiases
+    app.kubernetes.io/instance: wandb
+  name: wandb
+  namespace: default
+spec:
+  values:
+    global:
+      host: https://
+      license: eyJhbGnUzaH...j9ZieKQ2x5GGfw
+      bucket: 
+ mysql: + + ingress: + annotations: + +``` + +Find the full set of values in the [W&B Helm repository](https://github.com/wandb/helm-charts/blob/main/charts/operator-wandb/values.yaml). **Change only those values you need to override**. + +### Complete example +This example configuration deploys W&B to Google Cloud Anthos using Google Cloud Storage: + +```yaml +apiVersion: apps.wandb.com/v1 +kind: WeightsAndBiases +metadata: + labels: + app.kubernetes.io/name: weightsandbiases + app.kubernetes.io/instance: wandb + name: wandb + namespace: default +spec: + values: + global: + host: https://abc-wandb.sandbox-gcp.wandb.ml + bucket: + name: abc-wandb-moving-pipefish + provider: gcs + mysql: + database: wandb_local + host: 10.218.0.2 + name: wandb_local + password: 8wtX6cJHizAZvYScjDzZcUarK4zZGjpV + port: 3306 + user: wandb + redis: + host: redis.example.com + port: 6379 + password: password + api: + enabled: true + glue: + enabled: true + executor: + enabled: true + license: eyJhbGnUzaHgyQjQyQWhEU3...ZieKQ2x5GGfw + ingress: + annotations: + ingress.gcp.kubernetes.io/pre-shared-cert: abc-wandb-cert-creative-puma + kubernetes.io/ingress.class: gce + kubernetes.io/ingress.global-static-ip-name: abc-wandb-operator-address +``` + +### Host +```yaml + # Provide the FQDN with protocol +global: + # example host name, replace with your own + host: https://wandb.example.com +``` + +### Object storage (bucket) + +**AWS** +```yaml +global: + bucket: + provider: "s3" + name: "" + kmsKey: "" + region: "" +``` + +**Google Cloud** +```yaml +global: + bucket: + provider: "gcs" + name: "" +``` + +**Azure** +```yaml +global: + bucket: + provider: "az" + name: "" + secretKey: "" +``` + +**Other providers (Minio, Ceph, and other S3-compatible storage)** + +For other S3 compatible providers, set the bucket configuration as follows: +```yaml +global: + bucket: + # Example values, replace with your own + provider: s3 + name: storage.example.com + kmsKey: null + path: wandb + region: default + accessKey: 5WOA500...P5DK7I + secretKey: HDKYe4Q...JAp1YyjysnX +``` + +For S3-compatible storage hosted outside of AWS, `kmsKey` must be `null`. + +To reference `accessKey` and `secretKey` from a secret: +```yaml +global: + bucket: + # Example values, replace with your own + provider: s3 + name: storage.example.com + kmsKey: null + path: wandb + region: default + secret: + secretName: bucket-secret + accessKeyName: ACCESS_KEY + secretKeyName: SECRET_KEY +``` + +### MySQL + +```yaml +global: + mysql: + # Example values, replace with your own + host: db.example.com + port: 3306 + database: wandb_local + user: wandb + password: 8wtX6cJH...ZcUarK4zZGjpV +``` + +To reference the `password` from a secret: +```yaml +global: + mysql: + # Example values, replace with your own + host: db.example.com + port: 3306 + database: wandb_local + user: wandb + passwordSecret: + name: database-secret + passwordKey: MYSQL_WANDB_PASSWORD +``` + +### License + +```yaml +global: + # Example license, replace with your own + license: eyJhbGnUzaHgyQjQy...VFnPS_KETXg1hi +``` + +To reference the `license` from a secret: +```yaml +global: + licenseSecret: + name: license-secret + key: CUSTOMER_WANDB_LICENSE +``` + +### Ingress + +To identify the ingress class, see this FAQ [entry](#how-to-identify-the-kubernetes-ingress-class). 
+ +**Without TLS** + +```yaml +global: +# IMPORTANT: Ingress is on the same level in the YAML as 'global' (not a child) +ingress: + class: "" +``` + +**With TLS** + +Create a secret that contains the certificate + +```console +kubectl create secret tls wandb-ingress-tls --key wandb-ingress-tls.key --cert wandb-ingress-tls.crt +``` + +Reference the secret in the ingress configuration +```yaml +global: +# IMPORTANT: Ingress is on the same level in the YAML as 'global' (not a child) +ingress: + class: "" + annotations: + {} + # kubernetes.io/ingress.class: nginx + # kubernetes.io/tls-acme: "true" + tls: + - secretName: wandb-ingress-tls + hosts: + - +``` + +In case of Nginx you might have to add the following annotation: + +``` +ingress: + annotations: + nginx.ingress.kubernetes.io/proxy-body-size: 0 +``` + +### Custom Kubernetes ServiceAccounts + +Specify custom Kubernetes service accounts to run the W&B pods. + +The following snippet creates a service account as part of the deployment with the specified name: + +```yaml +app: + serviceAccount: + name: custom-service-account + create: true + +parquet: + serviceAccount: + name: custom-service-account + create: true + +global: + ... +``` +The subsystems "app" and "parquet" run under the specified service account. The other subsystems run under the default service account. + +If the service account already exists on the cluster, set `create: false`: + +```yaml +app: + serviceAccount: + name: custom-service-account + create: false + +parquet: + serviceAccount: + name: custom-service-account + create: false + +global: + ... +``` + +You can specify service accounts on different subsystems such as app, parquet, console, and others: + +```yaml +app: + serviceAccount: + name: custom-service-account + create: true + +console: + serviceAccount: + name: custom-service-account + create: true + +global: + ... +``` + +The service accounts can be different between the subsystems: + +```yaml +app: + serviceAccount: + name: custom-service-account + create: false + +console: + serviceAccount: + name: another-custom-service-account + create: true + +global: + ... +``` + +### External Redis + +```yaml +redis: + install: false + +global: + redis: + host: "" + port: 6379 + password: "" + parameters: {} + caCert: "" +``` + +To reference the `password` from a secret: + +```console +kubectl create secret generic redis-secret --from-literal=redis-password=supersecret +``` + +Reference it in below configuration: +```yaml +redis: + install: false + +global: + redis: + host: redis.example + port: 9001 + auth: + enabled: true + secret: redis-secret + key: redis-password +``` + +### LDAP + + +LDAP configuration support in the current Helm chart is limited. Contact W&B Support or your AISE for assistance configuring LDAP. + + +Configure LDAP by setting environment variables in `global.extraEnv`: + +```yaml +global: + extraEnv: + LDAP_ADDRESS: ldaps://ldap.company.example.com + LDAP_BASE_DN: cn=accounts,dc=company,dc=example,dc=com + LDAP_USER_BASE_DN: cn=users,cn=accounts,dc=company,dc=example,dc=com + LDAP_GROUP_BASE_DN: cn=groups,cn=accounts,dc=company,dc=example,dc=com + LDAP_BIND_DN: uid=ldapbind,cn=sysaccounts,cn=etc,dc=company,dc=example,dc=com + LDAP_BIND_PW: ******************** + LDAP_ATTRIBUTES: email=mail,name=cn + LDAP_TLS_ENABLE: "true" + LDAP_LOGIN: "true" + LDAP_USER_OBJECT_CLASS: user + LDAP_GROUP_OBJECT_CLASS: group +``` + + +This legacy approach is no longer recommended. This section is provided for reference. 
+ +**Without TLS** +```yaml +global: + ldap: + enabled: true + # LDAP server address including "ldap://" or "ldaps://" + host: + # LDAP search base to use for finding users + baseDN: + # LDAP user to bind with (if not using anonymous bind) + bindDN: + # Secret name and key with LDAP password to bind with (if not using anonymous bind) + bindPW: + # LDAP attribute for email and group ID attribute names as comma separated string values. + attributes: + # LDAP group allow list + groupAllowList: + # Enable LDAP TLS + tls: false +``` + +**With TLS** + +The LDAP TLS cert configuration requires a config map pre-created with the certificate content. + +To create the config map you can use the following command: + +```console +kubectl create configmap ldap-tls-cert --from-file=certificate.crt +``` + +And use the config map in the YAML like the example below + +```yaml +global: + ldap: + enabled: true + # LDAP server address including "ldap://" or "ldaps://" + host: + # LDAP search base to use for finding users + baseDN: + # LDAP user to bind with (if not using anonymous bind) + bindDN: + # Secret name and key with LDAP password to bind with (if not using anonymous bind) + bindPW: + # LDAP attribute for email and group ID attribute names as comma separated string values. + attributes: + # LDAP group allow list + groupAllowList: + # Enable LDAP TLS + tls: true + # ConfigMap name and key with CA certificate for LDAP server + tlsCert: + configMap: + name: "ldap-tls-cert" + key: "certificate.crt" +``` + + +### OIDC SSO + +```yaml +global: + auth: + sessionLengthHours: 720 + oidc: + clientId: "" + secret: "" + # Only include if your IdP requires it. + authMethod: "" + issuer: "" +``` + +`authMethod` is optional. + +### SMTP + +```yaml +global: + email: + smtp: + host: "" + port: 587 + user: "" + password: "" +``` + +### Environment Variables +```yaml +global: + extraEnv: + GLOBAL_ENV: "example" +``` + +### Custom certificate authority +`customCACerts` is a list and can take many certificates. Certificate authorities specified in `customCACerts` only apply to the W&B Server application. + +```yaml +global: + customCACerts: + - | + -----BEGIN CERTIFICATE----- + MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ + MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0 + MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG + SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg + P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0= + -----END CERTIFICATE----- + - | + -----BEGIN CERTIFICATE----- + MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ + MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0 + MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU + SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU + aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ== + -----END CERTIFICATE----- +``` + +CA certificates can also be stored in a ConfigMap: +```yaml +global: + caCertsConfigMap: custom-ca-certs +``` + +The ConfigMap must look like this: +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: custom-ca-certs +data: + ca-cert1.crt: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- + ca-cert2.crt: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- +``` + + +If using a ConfigMap, each key in the ConfigMap must end with `.crt` (for example, `my-cert.crt` or `ca-cert1.crt`). This naming convention is required for `update-ca-certificates` to parse and add each certificate to the system CA store. 
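+
+For example, you can create such a ConfigMap directly from certificate files; each `--from-file` key defaults to the file name, so name the files with a `.crt` extension. The file and namespace names below are placeholders.
+
+```shell
+kubectl create configmap custom-ca-certs \
+  --from-file=ca-cert1.crt \
+  --from-file=ca-cert2.crt \
+  -n default
+```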
+ + +### Custom security context + +Each W&B component supports custom security context configurations of the following form: + +```yaml +pod: + securityContext: + runAsNonRoot: true + runAsUser: 1001 + runAsGroup: 0 + fsGroup: 1001 + fsGroupChangePolicy: Always + seccompProfile: + type: RuntimeDefault +container: + securityContext: + capabilities: + drop: + - ALL + readOnlyRootFilesystem: false + allowPrivilegeEscalation: false +``` + + +The only valid value for `runAsGroup:` is `0`. Any other value is an error. + + + +For example, to configure the application pod, add a section `app` to your configuration: + +```yaml +global: + ... +app: + pod: + securityContext: + runAsNonRoot: true + runAsUser: 1001 + runAsGroup: 0 + fsGroup: 1001 + fsGroupChangePolicy: Always + seccompProfile: + type: RuntimeDefault + container: + securityContext: + capabilities: + drop: + - ALL + readOnlyRootFilesystem: false + allowPrivilegeEscalation: false +``` + +The same concept applies to `console`, `weave`, `weave-trace` and `parquet`. + +## Configuration Reference for W&B Operator + +This section describes configuration options for W&B Kubernetes operator (`wandb-controller-manager`). The operator receives its configuration in the form of a YAML file. + +By default, the W&B Kubernetes operator does not need a configuration file. Create a configuration file if required. For example, you might need a configuration file to specify custom certificate authorities, deploy in an air gap environment and so forth. + +Find the full list of spec customization [in the Helm repository](https://github.com/wandb/helm-charts/blob/main/charts/operator/values.yaml). + +### Custom CA +A custom certificate authority (`customCACerts`), is a list and can take many certificates. Those certificate authorities when added only apply to the W&B Kubernetes operator (`wandb-controller-manager`). + +```yaml +customCACerts: +- | + -----BEGIN CERTIFICATE----- + MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ + MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0 + MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG + SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg + P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0= + -----END CERTIFICATE----- +- | + -----BEGIN CERTIFICATE----- + MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ + MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0 + MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU + SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU + aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ== + -----END CERTIFICATE----- +``` + +CA certificates can also be stored in a ConfigMap: +```yaml +caCertsConfigMap: custom-ca-certs +``` + +The ConfigMap must look like this: +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: custom-ca-certs +data: + ca-cert1.crt: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- + ca-cert2.crt: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- +``` + + +Each key in the ConfigMap must end with `.crt` (for example, `my-cert.crt` or `ca-cert1.crt`). This naming convention is required for `update-ca-certificates` to parse and add each certificate to the system CA store. + + +## FAQ + +### What is the purpose/role of each individual pod? +* **`wandb-app`**: the core of W&B, including the GraphQL API and frontend application. It powers most of our platform's functionality. 
+* **`wandb-console`**: the administration console, accessed via `/console`. +* **`wandb-otel`**: the OpenTelemetry agent, which collects metrics and logs from resources at the Kubernetes layer for display in the administration console. +* **`wandb-prometheus`**: the Prometheus server, which captures metrics from various components for display in the administration console. +* **`wandb-parquet`**: a backend microservice separate from the `wandb-app` pod that exports database data to object storage in Parquet format. +* **`wandb-weave`**: another backend microservice that loads query tables in the UI and supports various core app features. +* **`wandb-weave-trace`**: a framework for tracking, experimenting with, evaluating, deploying, and improving LLM-based applications. The framework is accessed via the `wandb-app` pod. + +### How to get the W&B Operator Console password +See [Accessing the W&B Kubernetes Operator Management Console](#access-the-wb-management-console). + + +### How to access the W&B Operator Console if Ingress doesn't work + +Execute the following command on a host that can reach the Kubernetes cluster: + +```console +kubectl port-forward svc/wandb-console 8082 +``` + +Access the console in the browser with `https://localhost:8082/` console. + +See [Accessing the W&B Kubernetes Operator Management Console](#access-the-wb-management-console) on how to get the password (Option 2). + +### How to view W&B Server logs + +The application pod is named **wandb-app-xxx**. + +```console +kubectl get pods +kubectl logs wandb-XXXXX-XXXXX +``` + +### How to identify the Kubernetes ingress class + +You can get the ingress class installed in your cluster by running + +```console +kubectl get ingressclass +``` From 90948936bf0ceebc823bd7e9842a9a06bc7a734f Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Fri, 30 Jan 2026 15:49:29 -0800 Subject: [PATCH 3/7] Operator revamp phase 3 - Create consolidated Terraform guide - Shared intro, explanation of modules - Adds a cloud provider selector - Platform-specific content in tabs - Prerequisites - General steps - Recommendations - Specific details about Redis, message broker - Links to more resources - Update navigation --- docs.json | 9 +- .../cloud-deployments/terraform.mdx | 698 ++++++++++++++++++ 2 files changed, 699 insertions(+), 8 deletions(-) create mode 100644 platform/hosting/self-managed/cloud-deployments/terraform.mdx diff --git a/docs.json b/docs.json index 29a8bd06cd..d2790fb16d 100644 --- a/docs.json +++ b/docs.json @@ -81,14 +81,7 @@ "platform/hosting/self-managed/ref-arch", "platform/hosting/self-managed/requirements", "platform/hosting/self-managed/operator", - { - "group": "Install on public cloud", - "pages": [ - "platform/hosting/self-managed/aws-tf", - "platform/hosting/self-managed/gcp-tf", - "platform/hosting/self-managed/azure-tf" - ] - }, + "platform/hosting/self-managed/cloud-deployments/terraform", "platform/hosting/self-managed/bare-metal", "platform/hosting/self-managed/operator-airgapped", "platform/hosting/server-upgrade-process", diff --git a/platform/hosting/self-managed/cloud-deployments/terraform.mdx b/platform/hosting/self-managed/cloud-deployments/terraform.mdx new file mode 100644 index 0000000000..766f33678d --- /dev/null +++ b/platform/hosting/self-managed/cloud-deployments/terraform.mdx @@ -0,0 +1,698 @@ +--- +description: Deploy W&B Platform on public cloud with Terraform +title: Deploy with Terraform on public cloud +--- + + +W&B recommends fully managed deployment options such as [W&B Multi-tenant 
Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. + + +W&B provides Terraform modules for deploying the platform on public cloud providers. These modules automate the provisioning of infrastructure and installation of W&B Server. + +Before you start, W&B recommends that you choose one of the [remote backends](https://developer.hashicorp.com/terraform/language/backend) available for Terraform to store the [State File](https://developer.hashicorp.com/terraform/language/state). The State File is the necessary resource to roll out upgrades or make changes in your deployment without recreating all components. + +Select your cloud provider: + + + + +W&B recommends using the [W&B Server AWS Terraform Module](https://registry.terraform.io/modules/wandb/wandb/aws/latest) to deploy the platform on AWS. + +The Terraform Module deploys the following mandatory components: + +- Load Balancer +- AWS Identity & Access Management (IAM) +- AWS Key Management System (KMS) +- Amazon Aurora MySQL +- Amazon VPC +- Amazon S3 +- Amazon Route53 +- Amazon Certificate Manager (ACM) +- Amazon Elastic Load Balancing (ALB) +- Amazon Secrets Manager + +Optional components include: + +- Elastic Cache for Redis +- SQS + +## Pre-requisite permissions + +The account that runs Terraform needs to be able to create all components described above and permission to create **IAM Policies** and **IAM Roles** and assign roles to resources. + +## General steps + +The steps in this section are common for any deployment option. + +1. Prepare the development environment. + - Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) + - W&B recommend creating a Git repository for version control. +2. Create the `terraform.tfvars` file. + + The `tvfars` file content can be customized according to the installation type, but the minimum recommended will look like the example below. + + ```bash + namespace = "wandb" + license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz" + subdomain = "wandb-aws" + domain_name = "wandb.ml" + zone_id = "xxxxxxxxxxxxxxxx" + allowed_inbound_cidr = ["0.0.0.0/0"] + allowed_inbound_ipv6_cidr = ["::/0"] + eks_cluster_version = "1.29" + ``` + + Ensure to define variables in your `tvfars` file before you deploy because the `namespace` variable is a string that prefixes all resources created by Terraform. + + The combination of `subdomain` and `domain` will form the FQDN that W&B will be configured. In the example above, the W&B FQDN will be `wandb-aws.wandb.ml` and the DNS `zone_id` where the FQDN record will be created. + + Both `allowed_inbound_cidr` and `allowed_inbound_ipv6_cidr` also require setting. In the module, this is a mandatory input. The proceeding example permits access from any source to the W&B installation. + +3. Create the file `versions.tf` + + This file will contain the Terraform and Terraform provider versions required to deploy W&B in AWS: + + ```bash + provider "aws" { + region = "eu-central-1" + + default_tags { + tags = { + GithubRepo = "terraform-aws-wandb" + GithubOrg = "wandb" + Enviroment = "Example" + Example = "PublicDnsExternal" + } + } + } + ``` + + Refer to the [Terraform Official Documentation](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#provider-configuration) to configure the AWS provider. 
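+
+   The example above configures the AWS provider itself. A minimal sketch of the version pins this step refers to might look like the following; the constraint values are illustrative assumptions, so align them with the versions documented by the W&B module:
+
+   ```bash
+   terraform {
+     required_version = "~> 1.9"
+
+     required_providers {
+       aws = {
+         source  = "hashicorp/aws"
+         version = "~> 5.0"
+       }
+     }
+   }
+   ```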
+ + Optionally, but highly recommended, add the [remote backend configuration](https://developer.hashicorp.com/terraform/language/backend) mentioned at the beginning of this documentation. + +4. Create the file `variables.tf` + + For every option configured in the `terraform.tfvars` Terraform requires a correspondent variable declaration. + + ``` + variable "namespace" { + type = string + description = "Name prefix used for resources" + } + + variable "domain_name" { + type = string + description = "Domain name used to access instance." + } + + variable "subdomain" { + type = string + default = null + description = "Subdomain for accessing the Weights & Biases UI." + } + + variable "license" { + type = string + } + + variable "zone_id" { + type = string + description = "Domain for creating the Weights & Biases subdomain on." + } + + variable "allowed_inbound_cidr" { + description = "CIDRs allowed to access wandb-server." + nullable = false + type = list(string) + } + + variable "allowed_inbound_ipv6_cidr" { + description = "CIDRs allowed to access wandb-server." + nullable = false + type = list(string) + } + + variable "eks_cluster_version" { + description = "EKS cluster kubernetes version" + nullable = false + type = string + } + ``` + +## Recommended deployment + +This is the most straightforward deployment option configuration that creates all mandatory components and installs in the Kubernetes Cluster the latest version of W&B. + +1. Create the `main.tf` + + In the same directory where you created the files in the General Steps, create a file `main.tf` with the following content: + + ``` + module "wandb_infra" { + source = "wandb/wandb/aws" + version = "~>7.0" + + namespace = var.namespace + domain_name = var.domain_name + subdomain = var.subdomain + zone_id = var.zone_id + + allowed_inbound_cidr = var.allowed_inbound_cidr + allowed_inbound_ipv6_cidr = var.allowed_inbound_ipv6_cidr + + public_access = true + external_dns = true + kubernetes_public_access = true + kubernetes_public_access_cidrs = ["0.0.0.0/0"] + eks_cluster_version = var.eks_cluster_version + } + + data "aws_eks_cluster" "eks_cluster_id" { + name = module.wandb_infra.cluster_name + } + + data "aws_eks_cluster_auth" "eks_cluster_auth" { + name = module.wandb_infra.cluster_name + } + + provider "kubernetes" { + host = data.aws_eks_cluster.eks_cluster_id.endpoint + cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster_id.certificate_authority.0.data) + token = data.aws_eks_cluster_auth.eks_cluster_auth.token + } + + + provider "helm" { + kubernetes { + host = data.aws_eks_cluster.eks_cluster_id.endpoint + cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster_id.certificate_authority.0.data) + token = data.aws_eks_cluster_auth.eks_cluster_auth.token + } + } + + output "url" { + value = module.wandb_infra.url + } + + output "bucket" { + value = module.wandb_infra.bucket_name + } + ``` + +2. Deploy W&B + + To deploy W&B, execute the following commands: + + ``` + terraform init + terraform apply -var-file=terraform.tfvars + ``` + +## Enable Redis + +To use Redis to cache SQL queries and speed up the application response when loading metrics, add the option `create_elasticache_subnet = true` to the `main.tf` file: + +``` +module "wandb_infra" { + source = "wandb/wandb/aws" + version = "~>7.0" + + namespace = var.namespace + domain_name = var.domain_name + subdomain = var.subdomain + zone_id = var.zone_id + create_elasticache_subnet = true +} +[...] 
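+# The "[...]" above stands for the remainder of the main.tf configuration from
+# the recommended deployment (data sources, providers, and outputs), which is
+# unchanged; only `create_elasticache_subnet = true` is new in this step.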
+``` + +## Enable message broker (queue) + +To enable an external message broker using SQS, add the option `use_internal_queue = false` to the `main.tf` file: + + +This is optional because W&B includes an embedded broker. This option does not bring a performance improvement. + + +``` +module "wandb_infra" { + source = "wandb/wandb/aws" + version = "~>7.0" + + namespace = var.namespace + domain_name = var.domain_name + subdomain = var.subdomain + zone_id = var.zone_id + use_internal_queue = false + +[...] +} +``` + +## Additional resources + +- [AWS Terraform Module documentation](https://registry.terraform.io/modules/wandb/wandb/aws/latest) +- [AWS Terraform Module source code](https://github.com/wandb/terraform-aws-wandb) +- [Migrate to operator-based AWS Terraform modules](/platform/hosting/self-managed/aws-tf/#migrate-to-operator-based-aws-terraform-modules) + + + + + +W&B recommends using the [W&B Server Google Cloud Terraform Module](https://registry.terraform.io/modules/wandb/wandb/google/latest) to deploy the platform on Google Cloud. + +The module documentation is extensive and contains all available options that can be used. + +The Terraform Module deploys the following mandatory components: + +- VPC +- Cloud SQL for MySQL +- Cloud Storage Bucket +- Google Kubernetes Engine +- KMS Crypto Key +- Load Balancer + +Optional components include: + +- Memory store for Redis +- Pub/Sub messages system + +## Pre-requisite permissions + +The account that will run Terraform needs to have the role `roles/owner` in the Google Cloud project used. + +## General steps + +The steps in this section are common for any deployment option. + +1. Prepare the development environment. + - Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) + - W&B recommends creating a Git repository with the code that will be used, but you can keep your files locally. + - Create a project in [Google Cloud Console](https://console.cloud.google.com/) + - Authenticate with Google Cloud (make sure to [install gcloud](https://cloud.google.com/sdk/docs/install) before): + `gcloud auth application-default login` +2. Create the `terraform.tfvars` file. + + The `tvfars` file content can be customized according to the installation type, but the minimum recommended will look like the example below. + + ```bash + project_id = "wandb-project" + region = "europe-west2" + zone = "europe-west2-a" + namespace = "wandb" + license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz" + subdomain = "wandb-gcp" + domain_name = "wandb.ml" + ``` + + The variables defined here need to be decided before the deployment. The `namespace` variable will be a string that will prefix all resources created by Terraform. + + The combination of `subdomain` and `domain` will form the FQDN that W&B will be configured. In the example above, the W&B FQDN will be `wandb-gcp.wandb.ml` + +3. Create the file `variables.tf` + + For every option configured in the `terraform.tfvars` Terraform requires a correspondent variable declaration. + + ``` + variable "project_id" { + type = string + description = "Project ID" + } + + variable "region" { + type = string + description = "Google region" + } + + variable "zone" { + type = string + description = "Google zone" + } + + variable "namespace" { + type = string + description = "Namespace prefix used for resources" + } + + variable "domain_name" { + type = string + description = "Domain name for accessing the Weights & Biases UI." 
+ } + + variable "subdomain" { + type = string + description = "Subdomain for access the Weights & Biases UI." + } + + variable "license" { + type = string + description = "W&B License" + } + ``` + +## Recommended deployment + +This is the most straightforward deployment option configuration that creates all mandatory components and installs in the Kubernetes Cluster the latest version of W&B. + +1. Create the `main.tf` + + In the same directory where you created the files in the General Steps, create a file `main.tf` with the following content: + + ``` + provider "google" { + project = var.project_id + region = var.region + zone = var.zone + } + + provider "google-beta" { + project = var.project_id + region = var.region + zone = var.zone + } + + data "google_client_config" "current" {} + + provider "kubernetes" { + host = "https://${module.wandb.cluster_endpoint}" + cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate) + token = data.google_client_config.current.access_token + } + + # Spin up all required services + module "wandb" { + source = "wandb/wandb/google" + version = "~> 5.0" + + namespace = var.namespace + license = var.license + domain_name = var.domain_name + subdomain = var.subdomain + } + + # You'll want to update your DNS with the provisioned IP address + output "url" { + value = module.wandb.url + } + + output "address" { + value = module.wandb.address + } + + output "bucket_name" { + value = module.wandb.bucket_name + } + ``` + +2. Deploy W&B + + To deploy W&B, execute the following commands: + + ``` + terraform init + terraform apply -var-file=terraform.tfvars + ``` + +## Enable Redis + +To use Redis to cache SQL queries and speed up the application response when loading metrics, add the option `create_redis = true` to the `main.tf` file: + +``` +[...] + +module "wandb" { + source = "wandb/wandb/google" + version = "~> 5.0" + + namespace = var.namespace + license = var.license + domain_name = var.domain_name + subdomain = var.subdomain + create_redis = true +} +[...] +``` + +## Enable message broker (queue) + +To enable an external message broker using Pub/Sub, add the option `use_internal_queue = false` to the `main.tf` file: + + +This is optional because W&B includes an embedded broker. This option does not bring a performance improvement. + + +``` +[...] + +module "wandb" { + source = "wandb/wandb/google" + version = "~> 5.0" + + namespace = var.namespace + license = var.license + domain_name = var.domain_name + subdomain = var.subdomain + use_internal_queue = false +} + +[...] +``` + +## Additional resources + +- [Google Cloud Terraform Module documentation](https://registry.terraform.io/modules/wandb/wandb/google/latest) +- [Google Cloud Terraform Module source code](https://github.com/wandb/terraform-google-wandb) + + + + + +W&B recommends using the [W&B Server Azure Terraform Module](https://registry.terraform.io/modules/wandb/wandb/azurerm/latest) to deploy the platform on Azure. + +The module documentation is extensive and contains all available options that can be used. 
+ +The Terraform Module deploys the following mandatory components: + +- Azure Resource Group +- Azure Virtual Network (VPC) +- Azure MySQL Flexible Server +- Azure Storage Account & Blob Storage +- Azure Kubernetes Service +- Azure Application Gateway + +Optional components include: + +- Azure Cache for Redis +- Azure Event Grid + +## Pre-requisite permissions + +The simplest way to get the AzureRM provider configured is via [Azure CLI](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/azure_cli) but in case of automation using [Azure Service Principal](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/service_principal_client_secret) can also be useful. + +Regardless of the authentication method used, the account that will run Terraform needs to be able to create all components described above. + +## General steps + +The steps in this section are common for any deployment option. + +1. Prepare the development environment. + - Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) + - W&B recommends creating a Git repository with the code that will be used, but you can keep your files locally. + +2. Create the `terraform.tfvars` file. + + The `tvfars` file content can be customized according to the installation type, but the minimum recommended will look like the example below. + + ```bash + namespace = "wandb" + wandb_license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz" + subdomain = "wandb-azure" + domain_name = "wandb.ml" + location = "westeurope" + ``` + + The variables defined here need to be decided before the deployment. The `namespace` variable will be a string that will prefix all resources created by Terraform. + + The combination of `subdomain` and `domain` will form the FQDN that W&B will be configured. In the example above, the W&B FQDN will be `wandb-azure.wandb.ml`. + +3. Create the file `versions.tf` + + This file will contain the Terraform and Terraform provider versions required to deploy W&B in Azure: + + ```bash + terraform { + required_version = "~> 1.3" + + required_providers { + azurerm = { + source = "hashicorp/azurerm" + version = "~> 3.17" + } + } + } + ``` + + Refer to the [Terraform Official Documentation](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs) to configure the Azure provider. + + Optionally, but highly recommended, add the [remote backend configuration](https://developer.hashicorp.com/terraform/language/backend) mentioned at the beginning of this documentation. + +4. Create the file `variables.tf` + + For every option configured in the `terraform.tfvars` Terraform requires a correspondent variable declaration. + + ```bash + variable "namespace" { + type = string + description = "String used for prefix resources." + } + + variable "location" { + type = string + description = "Azure Resource Group location" + } + + variable "domain_name" { + type = string + description = "Domain for accessing the Weights & Biases UI." + } + + variable "subdomain" { + type = string + default = null + description = "Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route." + } + + variable "license" { + type = string + description = "Your wandb/local license" + } + ``` + +## Recommended deployment + +This is the most straightforward deployment option configuration that creates all mandatory components and installs in the Kubernetes Cluster the latest version of W&B. + +1. 
Create the `main.tf` + + In the same directory where you created the files in the General Steps, create a file `main.tf` with the following content: + + ```bash + provider "azurerm" { + features {} + } + + provider "kubernetes" { + host = module.wandb.cluster_host + cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate) + client_key = base64decode(module.wandb.cluster_client_key) + client_certificate = base64decode(module.wandb.cluster_client_certificate) + } + + provider "helm" { + kubernetes { + host = module.wandb.cluster_host + cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate) + client_key = base64decode(module.wandb.cluster_client_key) + client_certificate = base64decode(module.wandb.cluster_client_certificate) + } + } + + # Spin up all required services + module "wandb" { + source = "wandb/wandb/azurerm" + version = "~> 1.2" + + namespace = var.namespace + location = var.location + license = var.license + domain_name = var.domain_name + subdomain = var.subdomain + + deletion_protection = false + + tags = { + "Example" : "PublicDns" + } + } + + output "address" { + value = module.wandb.address + } + + output "url" { + value = module.wandb.url + } + ``` + +2. Deploy W&B + + To deploy W&B, execute the following commands: + + ``` + terraform init + terraform apply -var-file=terraform.tfvars + ``` + +## Enable Redis + +To use Redis to cache SQL queries and speed up the application response when loading metrics, add the option `create_redis = true` to the `main.tf` file: + +```bash +# Spin up all required services +module "wandb" { + source = "wandb/wandb/azurerm" + version = "~> 1.2" + + namespace = var.namespace + location = var.location + license = var.license + domain_name = var.domain_name + subdomain = var.subdomain + + create_redis = true + [...] +} +``` + +## Enable message broker (queue) + +To enable an external message broker using Azure Event Grid, add the option `use_internal_queue = false` to the `main.tf` file: + + +This is optional because W&B includes an embedded broker. This option does not bring a performance improvement. + + +```bash +# Spin up all required services +module "wandb" { + source = "wandb/wandb/azurerm" + version = "~> 1.2" + + namespace = var.namespace + location = var.location + license = var.license + domain_name = var.domain_name + subdomain = var.subdomain + + use_internal_queue = false + [...] +} +``` + +## Additional resources + +- [Azure Terraform Module documentation](https://registry.terraform.io/modules/wandb/wandb/azurerm/latest) +- [Azure Terraform Module source code](https://github.com/wandb/terraform-azurerm-wandb) + + + + +## Other deployment options + +You can combine multiple deployment options by adding all configurations to the same file. Each Terraform module provides several options that can be combined with the standard options and the minimal configuration found in the recommended deployment section. 
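+
+For example, a single AWS module block that enables both the Redis cache and the external message broker described above might look like the following sketch; the same pattern applies to the Google Cloud and Azure modules using their respective option names:
+
+```
+module "wandb_infra" {
+  source  = "wandb/wandb/aws"
+  version = "~>7.0"
+
+  namespace   = var.namespace
+  domain_name = var.domain_name
+  subdomain   = var.subdomain
+  zone_id     = var.zone_id
+
+  # Options from the Redis and message broker sections, combined in one file
+  create_elasticache_subnet = true
+  use_internal_queue        = false
+}
+```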
+ +Refer to the module documentation for your cloud provider for the full list of available options: +- [AWS Module documentation](https://registry.terraform.io/modules/wandb/wandb/aws/latest) +- [Google Cloud Module documentation](https://registry.terraform.io/modules/wandb/wandb/google/latest) +- [Azure Module documentation](https://registry.terraform.io/modules/wandb/wandb/azurerm/latest) From ae58ee25e23e7c92c98480c98ca148b131290d34 Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Mon, 2 Feb 2026 16:19:59 -0800 Subject: [PATCH 4/7] Operator revamp phase 4 - Create connected and airgapped on-prem guides - Update requirements - Update navigation This resolves some links that were intentionally left broken after the last phase --- docs.json | 4 +- .../kubernetes-airgapped.mdx | 855 ++++++++++++++++++ .../on-premises-deployments/kubernetes.mdx | 193 ++++ platform/hosting/self-managed/operator.mdx | 2 +- 4 files changed, 1051 insertions(+), 3 deletions(-) create mode 100644 platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx create mode 100644 platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx diff --git a/docs.json b/docs.json index 6aeea38659..dcaff0fc75 100644 --- a/docs.json +++ b/docs.json @@ -82,8 +82,8 @@ "platform/hosting/self-managed/requirements", "platform/hosting/self-managed/operator", "platform/hosting/self-managed/cloud-deployments/terraform", - "platform/hosting/self-managed/bare-metal", - "platform/hosting/self-managed/operator-airgapped", + "platform/hosting/self-managed/on-premises-deployments/kubernetes", + "platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped", "platform/hosting/server-upgrade-process", "platform/hosting/self-managed/disable-automatic-app-version-updates" ] diff --git a/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx b/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx new file mode 100644 index 0000000000..652f525827 --- /dev/null +++ b/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx @@ -0,0 +1,855 @@ +--- +description: Deploy W&B Platform in air-gapped and disconnected Kubernetes environments +title: Deploy on Air-Gapped Kubernetes +--- + +import SelfManagedVersionRequirements from "/snippets/en/_includes/self-managed-version-requirements.mdx"; +import SelfManagedSslTlsRequirements from "/snippets/en/_includes/self-managed-ssl-tls-requirements.mdx"; +import SelfManagedMysqlRequirements from "/snippets/en/_includes/self-managed-mysql-requirements.mdx"; +import SelfManagedMysqlDatabaseCreation from "/snippets/en/_includes/self-managed-mysql-database-creation.mdx"; +import SelfManagedRedisRequirements from "/snippets/en/_includes/self-managed-redis-requirements.mdx"; +import SelfManagedObjectStorageRequirements from "/snippets/en/_includes/self-managed-object-storage-requirements.mdx"; +import SelfManagedHardwareRequirements from "/snippets/en/_includes/self-managed-hardware-requirements.mdx"; +import SelfManagedVerifyInstallation from "/snippets/en/_includes/self-managed-verify-installation.mdx"; + +## Introduction + +This guide provides step-by-step instructions to deploy the W&B Platform in air-gapped, fully disconnected, or restricted network customer-managed environments. 
Air-gapped deployments are common in: + +- Secure government facilities +- Financial institutions with strict network isolation +- Healthcare organizations with compliance requirements +- Industrial control systems (ICS) environments +- Research facilities with classified networks + +Use an internal container registry and Helm repository to host the required W&B images and charts. Run these commands in a shell console with proper access to the Kubernetes cluster. + +You can adapt these commands to work with any CI/CD tooling you use to deploy Kubernetes applications. + +For standard on-premises Kubernetes deployments with internet connectivity, see [Deploy on On-Premises Kubernetes](/platform/hosting/self-managed/on-premises-deployments/kubernetes). + +## Prerequisites + +Before starting, ensure your air-gapped environment meets the following requirements. + +### Version requirements + + + +### SSL/TLS requirements + + + +### Hardware requirements + + + +### MySQL database + + + +For MySQL configuration parameters for self-managed instances, see the [reference architecture MySQL configuration section](/platform/hosting/self-managed/ref-arch#mysql-configuration-parameters). + +### Redis + + + +### Object storage + + + +For detailed object storage provisioning guidance, see the [Bring Your Own Bucket (BYOB)](/platform/hosting/data-security/secure-storage-connector) guide. In air-gapped environments, you'll typically use on-premises S3-compatible storage such as MinIO Enterprise, NetApp StorageGRID, or Dell ECS. + +### Air-gapped specific requirements + +In addition to the standard requirements above, air-gapped deployments require: + +- **Internal container registry**: Access to a private container registry (Harbor, JFrog Artifactory, Nexus, etc.) with all required W&B images +- **Internal Helm repository**: Access to a private Helm chart repository with W&B Helm charts +- **Image transfer capability**: A method to transfer container images from an internet-connected system to your air-gapped registry +- **License file**: A valid W&B Enterprise license (contact your W&B account team) + +For complete infrastructure requirements, including networking and load balancer configuration, see the [reference architecture](/platform/hosting/self-managed/ref-arch#infrastructure-requirements). + +## Prepare your air-gapped environment + +### Step 1: Set up internal container registry + +For a successful air-gapped deployment, all required container images must be available in your air-gapped container registry. + + +You are responsible for tracking the W&B Operator's requirements and maintaining your container registry with updated images regularly. For the most current list of required container images and versions, refer to the Helm chart, or contact [W&B Support](mailto:support@wandb.com) or your assigned W&B support engineer. 
+ + +#### Core W&B component containers + +The following core images are required: + +- [`docker.io/wandb/controller`](https://hub.docker.com/r/wandb/controller) - W&B Kubernetes Operator +- [`docker.io/wandb/local`](https://hub.docker.com/r/wandb/local) - W&B application server +- [`docker.io/wandb/console`](https://hub.docker.com/r/wandb/console) - W&B management console +- [`docker.io/wandb/megabinary`](https://hub.docker.com/r/wandb/megabinary) - W&B microservices (API, executor, glue, parquet) + +#### Dependency containers + +The following third-party dependency images are required: + +- [`docker.io/bitnamilegacy/redis`](https://hub.docker.com/r/bitnamilegacy/redis) - Required for local Redis deployment during testing and development. For production Redis requirements, see the [Redis section](#redis) in Prerequisites. +- [`docker.io/otel/opentelemetry-collector-contrib`](https://hub.docker.com/r/otel/opentelemetry-collector-contrib) - OpenTelemetry agent for collecting metrics and logs +- [`quay.io/prometheus/prometheus`](https://quay.io/repository/prometheus/prometheus) - Prometheus for metrics collection +- [`quay.io/prometheus-operator/prometheus-config-reloader`](https://quay.io/repository/prometheus-operator/prometheus-config-reloader) - Prometheus dependency + +#### Get the complete image list + +To extract the complete list of required images and versions from the Helm chart: + +1. On an internet-connected system, download the W&B Helm charts from the [W&B Helm charts repository](https://github.com/wandb/helm-charts): + + ```bash + # Clone the helm-charts repository + git clone https://github.com/wandb/helm-charts.git + cd helm-charts + ``` + +2. Inspect the `values.yaml` files to identify all container images and their versions: + + ```bash + # Extract image references from the operator chart + helm show values charts/operator | grep -E "repository:|tag:" | grep -v "^#" + + # Extract image references from the platform chart + helm show values charts/operator-wandb | grep -E "repository:|tag:" | grep -v "^#" + ``` + + Alternatively, use this command to extract just the repository names (without version tags): + + ```bash + helm show values charts/operator-wandb \ + | awk -F': *' '/^[[:space:]]*repository:/{print $2}' \ + | grep -v "^#" \ + | sort -u + ``` + + The list of repositories will look similar to the following: + + ```text + wandb/controller + wandb/local + wandb/console + wandb/megabinary + wandb/weave-python + wandb/weave-trace + otel/opentelemetry-collector-contrib + prometheus/prometheus + prometheus-operator/prometheus-config-reloader + bitnamilegacy/redis + ``` + + To get the specific version tags for each image, use the first command above (`grep -E "repository:|tag:"`), which will show both repository names and their corresponding version tags. + +#### Transfer images to air-gapped registry + +1. On an internet-connected system, pull and save all required images. + + + Replace version numbers in the examples below with the actual versions from your Helm chart inspection in step 2 above. The versions shown here are examples and will become outdated. 
+ + + Use shell variables to manage versions consistently: + + ```bash + # Set version variables (update these based on your Helm chart versions) + CONTROLLER_VERSION="1.13.3" + APP_VERSION="0.59.2" + CONSOLE_VERSION="2.12.2" + + # Pull images + docker pull wandb/controller:${CONTROLLER_VERSION} + docker pull wandb/local:${APP_VERSION} + docker pull wandb/console:${CONSOLE_VERSION} + docker pull wandb/megabinary:${APP_VERSION} + # ... pull all other required images with their versions + + # Save images to .tar files + docker save wandb/controller:${CONTROLLER_VERSION} -o wandb-controller-${CONTROLLER_VERSION}.tar + docker save wandb/local:${APP_VERSION} -o wandb-local-${APP_VERSION}.tar + docker save wandb/console:${CONSOLE_VERSION} -o wandb-console-${CONSOLE_VERSION}.tar + docker save wandb/megabinary:${APP_VERSION} -o wandb-megabinary-${APP_VERSION}.tar + # ... save all other images + ``` + +2. Transfer the `.tar` files to your air-gapped environment using your approved method (USB drive, secure file transfer, etc.). + +3. In your air-gapped environment, load and push images to your internal registry: + + ```bash + # Set the same version variables used above + CONTROLLER_VERSION="1.13.3" + APP_VERSION="0.59.2" + CONSOLE_VERSION="2.12.2" + INTERNAL_REGISTRY="registry.yourdomain.com" + + # Load images + docker load -i wandb-controller-${CONTROLLER_VERSION}.tar + docker load -i wandb-local-${APP_VERSION}.tar + docker load -i wandb-console-${CONSOLE_VERSION}.tar + docker load -i wandb-megabinary-${APP_VERSION}.tar + # ... load all other images + + # Tag for internal registry + docker tag wandb/controller:${CONTROLLER_VERSION} ${INTERNAL_REGISTRY}/wandb/controller:${CONTROLLER_VERSION} + docker tag wandb/local:${APP_VERSION} ${INTERNAL_REGISTRY}/wandb/local:${APP_VERSION} + docker tag wandb/console:${CONSOLE_VERSION} ${INTERNAL_REGISTRY}/wandb/console:${CONSOLE_VERSION} + docker tag wandb/megabinary:${APP_VERSION} ${INTERNAL_REGISTRY}/wandb/megabinary:${APP_VERSION} + # ... tag all other images + + # Push to internal registry + docker push ${INTERNAL_REGISTRY}/wandb/controller:${CONTROLLER_VERSION} + docker push ${INTERNAL_REGISTRY}/wandb/local:${APP_VERSION} + docker push ${INTERNAL_REGISTRY}/wandb/console:${CONSOLE_VERSION} + docker push ${INTERNAL_REGISTRY}/wandb/megabinary:${APP_VERSION} + # ... push all other images + ``` + +### Step 2: Set up internal Helm chart repository + +Along with the container images, ensure the following Helm charts are available in your internal Helm repository: + +- [W&B Operator chart](https://github.com/wandb/helm-charts/tree/main/charts/operator) +- [W&B Platform chart](https://github.com/wandb/helm-charts/tree/main/charts/operator-wandb) + +1. On an internet-connected system, download the charts: + + ```bash + # Add W&B Helm repository + helm repo add wandb https://wandb.github.io/helm-charts + helm repo update + + # Download the charts + helm pull wandb/operator --version 1.13.3 + helm pull wandb/operator-wandb --version 0.18.0 + ``` + +2. Transfer the `.tgz` chart files to your air-gapped environment and upload them to your internal Helm repository according to your repository's procedures. + + The `operator` chart deploys the W&B Kubernetes Operator (Controller Manager). The `operator-wandb` chart deploys the W&B Platform using the values configured in the Custom Resource (CR). + +### Step 3: Configure Helm repository access + +1. 
In your air-gapped environment, configure Helm to use your internal repository: + + ```bash + helm repo add local-repo https://charts.yourdomain.com + helm repo update + ``` + +2. Verify the charts are available: + + ```bash + helm search repo local-repo/operator + helm search repo local-repo/operator-wandb + ``` + +## Deploy W&B in air-gapped environment + +### Step 4: Install the Kubernetes Operator + +The W&B Kubernetes Operator (controller manager) manages the W&B platform components. To install it in an air-gapped environment, configure it to use your internal container registry. + +1. Create a `values.yaml` file with the following content: + + ```yaml + image: + repository: registry.yourdomain.com/wandb/controller + tag: 1.13.3 + + airgapped: true + ``` + + + Replace the repository and tag with the actual versions you transferred to your internal registry in Step 1. The version shown here (`1.13.3`) is an example and will become outdated. + + +2. Install the operator and Custom Resource Definition (CRD): + + ```bash + helm upgrade --install operator local-repo/operator \ + --namespace wandb \ + --create-namespace \ + --values values.yaml + ``` + +3. Verify the operator is running: + + ```bash + kubectl get pods -n wandb + ``` + + You should see the operator pod in a `Running` state. + +For full details about supported values, refer to the [Kubernetes operator GitHub repository values file](https://github.com/wandb/helm-charts/blob/main/charts/operator/values.yaml). + +### Step 5: Set up MySQL database + +Before configuring the W&B Custom Resource, set up an external MySQL database. For production deployments, W&B strongly recommends using managed database services where available. However, if you are running your own MySQL instance, create the database and user: + + + +For MySQL configuration parameters, see the [reference architecture MySQL configuration section](/platform/hosting/self-managed/ref-arch#mysql-configuration-parameters). + +### Step 6: Configure W&B Custom Resource + +After installing the W&B Kubernetes Operator, configure the Custom Resource (CR) to point to your internal Helm repository and container registry. + +This configuration ensures the Kubernetes operator uses your internal registry and repository when deploying the required components of the W&B platform. + + +The example configuration below includes image version tags that will become outdated. Replace all `tag:` values with the actual versions you transferred to your internal registry in Step 1. 
+ + +Create a file named `wandb.yaml` with the following content: + +```yaml +apiVersion: apps.wandb.com/v1 +kind: WeightsAndBiases +metadata: + labels: + app.kubernetes.io/instance: wandb + app.kubernetes.io/name: weightsandbiases + name: wandb + namespace: wandb + +spec: + chart: + url: https://charts.yourdomain.com + name: operator-wandb + version: 0.18.0 + + values: + global: + host: https://wandb.yourdomain.com + license: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + + bucket: + accessKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + secretKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + name: s3.yourdomain.com:9000 + path: wandb + provider: s3 + region: us-east-1 + + mysql: + database: wandb + host: mysql.yourdomain.com + password: + port: 3306 + user: wandb + + redis: + host: redis.yourdomain.com + port: 6379 + password: + + api: + enabled: true + + glue: + enabled: true + + executor: + enabled: true + + extraEnv: + ENABLE_REGISTRY_UI: 'true' + + # Configure all component images to use internal registry + app: + image: + repository: registry.yourdomain.com/wandb/local + tag: 0.59.2 + + console: + image: + repository: registry.yourdomain.com/wandb/console + tag: 2.12.2 + + api: + image: + repository: registry.yourdomain.com/wandb/megabinary + tag: 0.59.2 + + executor: + image: + repository: registry.yourdomain.com/wandb/megabinary + tag: 0.59.2 + + glue: + image: + repository: registry.yourdomain.com/wandb/megabinary + tag: 0.59.2 + + parquet: + image: + repository: registry.yourdomain.com/wandb/megabinary + tag: 0.59.2 + + weave: + image: + repository: registry.yourdomain.com/wandb/weave-python + tag: 0.59.2 + + otel: + image: + repository: registry.yourdomain.com/otel/opentelemetry-collector-contrib + tag: 0.97.0 + + prometheus: + server: + image: + repository: registry.yourdomain.com/prometheus/prometheus + tag: v2.47.0 + configmapReload: + prometheus: + image: + repository: registry.yourdomain.com/prometheus-operator/prometheus-config-reloader + tag: v0.67.0 + + ingress: + annotations: + nginx.ingress.kubernetes.io/proxy-body-size: 0 + class: nginx +``` + + +Replace all placeholder values (hostnames, passwords, tags, etc.) with your actual configuration values. The example above shows the most commonly used components. + + +Depending on your deployment needs, you may also need to configure image repositories for additional components such as: +- `settingsMigrationJob` +- `weave-trace` +- `filestream` +- `flat-runs-table` + +Refer to the [W&B Helm repository values file](https://github.com/wandb/helm-charts/blob/main/charts/operator-wandb/values.yaml) for the complete list of configurable components. + +### Step 7: Deploy the W&B platform + +1. Apply the W&B Custom Resource to deploy the platform: + + ```bash + kubectl apply -f wandb.yaml + ``` + +2. Monitor the deployment progress: + + ```bash + # Watch pods being created + kubectl get pods -n wandb --watch + + # Check deployment status + kubectl get weightsandbiases -n wandb + + # View operator logs + kubectl logs -n wandb deployment/wandb-operator-controller-manager + ``` + + The deployment may take several minutes as the operator creates all necessary components. + +## OpenShift configuration + +W&B fully supports deployment on air-gapped OpenShift Kubernetes clusters. OpenShift deployments require additional security context configurations due to OpenShift's stricter security policies. + +### OpenShift security context constraints + +OpenShift uses Security Context Constraints (SCCs) to control pod permissions. 
By default, OpenShift assigns the `restricted` SCC to pods, which prevents running as root and requires specific user IDs. + +#### Option 1: Use restricted SCC (recommended) + +Configure W&B components to run with the restricted SCC by setting appropriate security contexts in your Custom Resource: + +```yaml +spec: + values: + # Configure security contexts for all pods + app: + podSecurityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + + console: + podSecurityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + + # Repeat for other components: api, executor, glue, parquet, weave +``` + +#### Option 2: Create custom SCC (if required) + +If your deployment requires capabilities not available in the `restricted` SCC, create a custom SCC: + +```yaml +apiVersion: security.openshift.io/v1 +kind: SecurityContextConstraints +metadata: + name: wandb-scc +allowHostDirVolumePlugin: false +allowHostIPC: false +allowHostNetwork: false +allowHostPID: false +allowHostPorts: false +allowPrivilegeEscalation: false +allowPrivilegedContainer: false +allowedCapabilities: [] +defaultAddCapabilities: [] +fsGroup: + type: MustRunAs + ranges: + - min: 1000 + max: 65535 +readOnlyRootFilesystem: false +requiredDropCapabilities: + - ALL +runAsUser: + type: MustRunAsRange + uidRangeMin: 1000 + uidRangeMax: 65535 +seLinuxContext: + type: MustRunAs +supplementalGroups: + type: RunAsAny +volumes: + - configMap + - downwardAPI + - emptyDir + - persistentVolumeClaim + - projected + - secret +``` + +1. Apply the SCC: + + ```bash + oc apply -f wandb-scc.yaml + ``` + +2. Bind the SCC to the W&B service accounts: + + ```bash + oc adm policy add-scc-to-user wandb-scc -z wandb-app -n wandb + oc adm policy add-scc-to-user wandb-scc -z wandb-console -n wandb + ``` + +### OpenShift routes + +OpenShift uses Routes instead of standard Kubernetes Ingress. Configure W&B to use OpenShift Routes: + +```yaml +spec: + values: + ingress: + enabled: false + + route: + enabled: true + host: wandb.apps.openshift.yourdomain.com + tls: + enabled: true + termination: edge + insecureEdgeTerminationPolicy: Redirect +``` + +### OpenShift image pull configuration + +If your OpenShift cluster uses an internal image registry with authentication: + +1. Create an image pull secret: + + ```bash + kubectl create secret docker-registry wandb-registry-secret \ + --docker-server=registry.yourdomain.com \ + --docker-username= \ + --docker-password= \ + --namespace=wandb + ``` + +2. Reference the secret in your Custom Resource: + + ```yaml + spec: + values: + imagePullSecrets: + - name: wandb-registry-secret + ``` + +### OpenShift complete example + +Here's a complete example CR for OpenShift air-gapped deployment: + + +Replace all `tag:` values in this example with the actual versions you transferred to your internal registry in Step 1. The versions shown are examples and will become outdated. 
+ + +```yaml +apiVersion: apps.wandb.com/v1 +kind: WeightsAndBiases +metadata: + name: wandb + namespace: wandb + +spec: + chart: + url: https://charts.yourdomain.com + name: operator-wandb + version: 0.18.0 + + values: + global: + host: https://wandb.apps.openshift.yourdomain.com + license: + + bucket: + accessKey: + secretKey: + name: s3.yourdomain.com:9000 + path: wandb + provider: s3 + region: us-east-1 + + mysql: + database: wandb + host: mysql.yourdomain.com + password: + port: 3306 + user: wandb + + redis: + host: redis.yourdomain.com + port: 6379 + password: + + # OpenShift-specific: Use Routes instead of Ingress + ingress: + enabled: false + + route: + enabled: true + host: wandb.apps.openshift.yourdomain.com + tls: + enabled: true + termination: edge + + # Image pull secret for internal registry + imagePullSecrets: + - name: wandb-registry-secret + + # Security contexts for OpenShift restricted SCC + app: + image: + repository: registry.yourdomain.com/wandb/local + tag: 0.59.2 + podSecurityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + + console: + image: + repository: registry.yourdomain.com/wandb/console + tag: 2.12.2 + podSecurityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + + # Repeat security contexts for: api, executor, glue, parquet, weave + # (abbreviated for clarity) +``` + + + +Contact [W&B Support](mailto:support@wandb.com) or your assigned W&B support engineer for comprehensive OpenShift configuration examples tailored to your security requirements. + + +## Verify your installation + +After deploying W&B, verify the installation is working correctly: + + + +### Additional air-gapped verification + +For air-gapped deployments, also verify: + +1. **Image pull**: Confirm all pods successfully pulled images from your internal registry: + + ```bash + kubectl get pods -n wandb -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].image}{"\n"}{end}' + ``` + + All images should point to your internal registry and all pods should be in `Running` state. + +2. **External connectivity**: Verify W&B is not attempting external connections (it shouldn't in air-gapped mode): + + ```bash + kubectl logs -n wandb deployment/wandb-app --tail=100 | grep -i "connection" + ``` + +3. **License validation**: Access the W&B console and verify your license is active. + +## Troubleshooting + +### Image pull errors + +If pods fail to pull images: + +1. Verify images exist in your internal registry +2. Check image pull secret is correctly configured +3. Verify network connectivity from Kubernetes nodes to registry +4. Check registry authentication credentials + + ```bash + # Test image pull manually + kubectl run test-pull --image=registry.yourdomain.com/wandb/local:0.59.2 --namespace=wandb + kubectl logs test-pull -n wandb + kubectl delete pod test-pull -n wandb + ``` + +### OpenShift SCC errors + +If pods fail with permission errors on OpenShift: + +```bash +# Check which SCC is being used +oc get pod -n wandb -o yaml | grep scc + +# Check service account permissions +oc describe scc wandb-scc +oc get rolebinding -n wandb +``` + +### Helm chart not found + +If the operator cannot find the platform chart: + +1. 
Verify the chart repository URL in the Custom Resource +2. Check that the operator pod can reach your internal Helm repository +3. Verify the chart exists in your repository: + + ```bash + helm search repo local-repo/operator-wandb + ``` + +## Frequently asked questions + +### Can I use a different ingress class? + +Yes, configure your ingress class by modifying the ingress settings in your Custom Resource: + +```yaml +spec: + values: + ingress: + class: your-ingress-class +``` + +### How do I handle certificate bundles with multiple certificates? + +Split the certificates into multiple entries in the `customCACerts` section: + +```yaml +spec: + values: + customCACerts: + cert1.crt: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- + cert2.crt: | + -----BEGIN CERTIFICATE----- + ... + -----END CERTIFICATE----- +``` + +### How do I prevent automatic updates? + +Configure the operator to not automatically update W&B: + +1. Set `airgapped: true` in the operator installation (this disables automatic update checks) +2. Control version updates by manually updating the `spec.chart.version` in your Custom Resource +3. Optionally, disable automatic updates from the W&B System Console + +See [Disable automatic app version updates](/platform/hosting/self-managed/disable-automatic-app-version-updates) for more details. + + +W&B strongly recommends customers with Self-Managed instances update their deployments with the latest release at minimum once per quarter to maintain support and receive the latest features, performance improvements, and fixes. W&B supports a major release for 12 months from its initial release date. Refer to [Release policies and processes](/release-notes/release-policies). + + +### Does the deployment work with no connection to public repositories? + +Yes. When `airgapped: true` is set in the operator configuration, the Kubernetes operator uses only your internal resources and does not attempt to connect to public repositories. + +### How do I update W&B in an air-gapped environment? + +To update W&B: + +1. Pull new container images on an internet-connected system +2. Transfer images to your air-gapped registry +3. Upload new Helm charts to your internal repository +4. Update the `spec.chart.version` and image tags in your Custom Resource +5. Apply the updated Custom Resource + + The operator will perform a rolling update of the W&B components. + +## Next steps + +After successful deployment: + +1. **Configure user authentication**: Set up [SSO](/platform/hosting/iam/sso) or other authentication methods +2. **Set up monitoring**: Configure monitoring for your W&B instance and infrastructure +3. **Plan for updates**: Review the [Server upgrade process](/platform/hosting/server-upgrade-process) and establish an update cadence +4. **Configure backups**: Establish backup procedures for your MySQL database +5. 
**Document your process**: Create runbooks for your specific air-gapped update procedures + +## Getting help + +If you encounter issues during deployment: + +- Review the [Reference Architecture](/platform/hosting/self-managed/ref-arch) for infrastructure guidance +- Check the [Operator guide](/platform/hosting/self-managed/operator) for configuration details +- Contact [W&B Support](mailto:support@wandb.com) or your assigned W&B support engineer +- For OpenShift-specific issues, reference Red Hat OpenShift documentation diff --git a/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx b/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx new file mode 100644 index 0000000000..cc47ce6c6e --- /dev/null +++ b/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx @@ -0,0 +1,193 @@ +--- +description: Deploy W&B Platform on on-premises Kubernetes infrastructure +title: Deploy on On-Premises Kubernetes +--- + +import ByobProvisioningLink from "/snippets/en/_includes/byob-provisioning-link.mdx"; +import SelfManagedVerifyInstallation from "/snippets/en/_includes/self-managed-verify-installation.mdx"; + + +W&B recommends fully managed deployment options such as [W&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. + + +This guide provides instructions for deploying W&B Platform on on-premises Kubernetes infrastructure, including datacenter and private cloud environments. + +For air-gapped or fully disconnected environments, see the [Deploy on Air-Gapped Kubernetes](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped) guide. + +Reach out to the W&B [Sales](mailto:contact@wandb.com) to learn more. + +## Prerequisites + +Before deploying W&B on your on-premises Kubernetes infrastructure, ensure your environment meets all requirements. For complete details, see the [Requirements](/platform/hosting/self-managed/requirements) page, which covers: + +- Kubernetes cluster requirements (versions, ingress, persistent volumes) +- MySQL database configuration +- Redis requirements +- Object storage setup +- SSL/TLS certificates +- Networking and load balancer configuration +- Hardware sizing recommendations + +Additionally, refer to the [Reference Architecture](/platform/hosting/self-managed/ref-arch) for infrastructure guidelines and best practices. + +### Object storage provisioning + +W&B requires S3-compatible object storage. For detailed provisioning instructions for various storage providers, see: + + + +After provisioning your object storage, you'll configure it in the W&B Custom Resource as described in the deployment steps below. + +## Deploy W&B with the Kubernetes Operator + +The recommended method for deploying W&B on Kubernetes is using the **W&B Kubernetes Operator**. The operator manages the W&B platform components and simplifies deployment, updates, and maintenance. + +### Choose your deployment method + +The W&B Operator can be deployed using two methods: + +1. **Helm CLI** - Direct deployment using Helm commands +2. **Terraform** - Infrastructure-as-code deployment using Terraform + +For complete deployment instructions, including step-by-step guides for both methods, see [Deploy W&B with Kubernetes Operator](/platform/hosting/self-managed/operator). 
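+
+As a quick orientation, the Helm CLI path typically amounts to adding the W&B chart repository, installing the operator chart into its own namespace, and then describing your instance in a Custom Resource. The following sketch assumes the public chart repository is reachable from your cluster; all W&B-specific settings live in the Custom Resource covered by the operator guide:
+
+```bash
+# Add the W&B Helm repository and install the W&B Kubernetes Operator
+helm repo add wandb https://wandb.github.io/helm-charts
+helm repo update
+helm upgrade --install operator wandb/operator --namespace wandb --create-namespace
+```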
+ +The operator guide covers: +- Installing the operator +- Configuring the W&B Custom Resource (CR) +- Object storage configuration +- MySQL and Redis connection settings +- SSL/TLS certificate configuration +- Ingress and networking setup +- Updates and maintenance + +### On-premises specific considerations + +When deploying on on-premises infrastructure, pay special attention to the following: + +#### Load balancer configuration + +On-premises Kubernetes clusters typically require manual load balancer configuration. Options include: + +- **External load balancer**: Configure an existing hardware or software load balancer (F5, HAProxy, etc.) +- **Nginx Ingress Controller**: Deploy nginx-ingress-controller with NodePort or host networking +- **MetalLB**: For bare-metal Kubernetes clusters, MetalLB provides load balancer services + +For detailed load balancer configuration examples, see the [Reference Architecture networking section](/platform/hosting/self-managed/ref-arch#networking). + +#### Persistent storage + +Ensure your Kubernetes cluster has a StorageClass configured for persistent volumes. W&B components may require persistent storage for caching and temporary data. + +Common on-premises storage options: +- NFS-based storage classes +- Ceph/Rook storage +- Local persistent volumes +- Enterprise storage solutions (NetApp, Pure Storage, etc.) + +#### DNS and certificate management + +For on-premises deployments: +- Configure internal DNS records to point to your W&B hostname +- Provision SSL/TLS certificates from your internal Certificate Authority (CA) +- If using self-signed certificates, configure the operator to trust your CA certificate + +See the [SSL/TLS requirements](/platform/hosting/self-managed/requirements#ssl-tls) for certificate configuration details. + +### OpenShift deployments + +W&B fully supports deployment on OpenShift Kubernetes clusters. OpenShift deployments require additional security context configurations due to OpenShift's stricter security policies. + +For OpenShift-specific configuration details, see: +- [Operator guide OpenShift section](/platform/hosting/self-managed/operator#openshift-kubernetes-clusters) +- [Deploy on Air-Gapped Kubernetes](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped#openshift-configuration) for comprehensive OpenShift examples + +## Object storage configuration + +After provisioning your object storage bucket (see [Prerequisites](#object-storage-provisioning) above), configure it in your W&B Custom Resource. + +### AWS S3 (on-premises) + +For on-premises AWS S3 (via Outposts or compatible storage): + +```yaml +bucket: + kmsKey: # Optional KMS key for encryption + name: # Example: wandb + path: "" # Keep as empty string + provider: s3 + region: # Example: us-east-1 +``` + +### S3-compatible storage (MinIO, Ceph, NetApp, etc.) + +For S3-compatible storage systems: + +```yaml +bucket: + kmsKey: null + name: # Example: s3.example.com:9000 + path: # Example: wandb + provider: s3 + region: # Example: us-east-1 +``` + +To enable TLS for S3-compatible storage, append `?tls=true` to the bucket path: + +```yaml +bucket: + path: "wandb?tls=true" +``` + + +The certificate must be trusted. Self-signed certificates require additional configuration. See the [SSL/TLS requirements](/platform/hosting/self-managed/requirements#ssl-tls) for details. + + +### Important considerations for on-premises object storage + +When running your own object storage, consider: + +1. 
**Storage capacity and performance**: Monitor disk capacity carefully. Average W&B usage results in tens to hundreds of gigabytes. Heavy usage could result in petabytes of storage consumption. +2. **Fault tolerance**: At minimum, use RAID arrays for physical disks. For S3-compatible storage, use distributed or highly available configurations. +3. **Availability**: Configure monitoring to ensure the storage remains available. + +### MinIO considerations + + +MinIO Open Source is in [maintenance mode](https://github.com/minio/minio) with no active development. Pre-compiled binaries are no longer provided, and only critical security fixes are considered case-by-case. For production deployments, W&B recommends using managed object storage services or [MinIO Enterprise (AIStor)](https://min.io/product/aistor). + + +Enterprise alternatives for on-premises object storage include: +- [Amazon S3 on Outposts](https://aws.amazon.com/s3/outposts/) +- [NetApp StorageGRID](https://www.netapp.com/data-storage/storagegrid/) +- MinIO Enterprise (AIStor) +- [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) + +If you are using an existing MinIO deployment or MinIO Enterprise, you can create a bucket using the MinIO client: + +```bash +mc config host add local http://$MINIO_HOST:$MINIO_PORT "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" --api s3v4 +mc mb --region=us-east-1 local/wandb-files +``` + +## Verify your installation + +After deploying W&B, verify the installation is working correctly: + + + +## Next steps + +After successful deployment: + +1. **Configure user authentication**: Set up [SSO](/platform/hosting/iam/sso) or other authentication methods +2. **Set up monitoring**: Configure monitoring for your W&B instance and infrastructure +3. **Plan for updates**: Review the [Server upgrade process](/platform/hosting/server-upgrade-process) and establish an update cadence +4. 
**Configure backups**: Establish backup procedures for your MySQL database + +## Getting help + +If you encounter issues during deployment: + +- Check the [Reference Architecture](/platform/hosting/self-managed/ref-arch) for infrastructure guidance +- Review the [Operator guide](/platform/hosting/self-managed/operator) for configuration details +- Contact [W&B Support](mailto:support@wandb.com) or your assigned W&B support engineer diff --git a/platform/hosting/self-managed/operator.mdx b/platform/hosting/self-managed/operator.mdx index 15a3b8335e..165669c76f 100644 --- a/platform/hosting/self-managed/operator.mdx +++ b/platform/hosting/self-managed/operator.mdx @@ -113,7 +113,7 @@ api: install: true image: repository: wandb/megabinary - tag: 0.74.1 + tag: 0.74.1 # Replace with your actual version pod: securityContext: fsGroup: 10001 From 559d904cfda61f22d0442cf2ff491907120853e1 Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Mon, 2 Feb 2026 16:26:33 -0800 Subject: [PATCH 5/7] Operator revamp phase 5 Add redirects for files we moved, renamed, or consolidated Temporarily, redirect old Japanese / Korean destinations to English destinations --- docs.json | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/docs.json b/docs.json index dcaff0fc75..d94f39658b 100644 --- a/docs.json +++ b/docs.json @@ -2619,6 +2619,78 @@ { "source": "/weave/guides/tools/limits", "destination": "/weave/details/limits" + }, + { + "source": "/platform/hosting/operator", + "destination": "/platform/hosting/self-managed/operator" + }, + { + "source": "/platform/hosting/self-managed/bare-metal", + "destination": "/platform/hosting/self-managed/on-premises-deployments/kubernetes" + }, + { + "source": "/platform/hosting/self-managed/operator-airgapped", + "destination": "/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped" + }, + { + "source": "/platform/hosting/self-managed/aws-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/platform/hosting/self-managed/gcp-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/platform/hosting/self-managed/azure-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/ja/platform/hosting/operator", + "destination": "/platform/hosting/self-managed/operator" + }, + { + "source": "/ja/platform/hosting/self-managed/bare-metal", + "destination": "/platform/hosting/self-managed/on-premises-deployments/kubernetes" + }, + { + "source": "/ja/platform/hosting/self-managed/operator-airgapped", + "destination": "/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped" + }, + { + "source": "/ja/platform/hosting/self-managed/aws-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/ja/platform/hosting/self-managed/gcp-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/ja/platform/hosting/self-managed/azure-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/ko/platform/hosting/operator", + "destination": "/platform/hosting/self-managed/operator" + }, + { + "source": "/ko/platform/hosting/self-managed/bare-metal", + "destination": "/platform/hosting/self-managed/on-premises-deployments/kubernetes" + }, + { + "source": "/ko/platform/hosting/self-managed/operator-airgapped", + 
"destination": "/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped" + }, + { + "source": "/ko/platform/hosting/self-managed/aws-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/ko/platform/hosting/self-managed/gcp-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" + }, + { + "source": "/ko/platform/hosting/self-managed/azure-tf", + "destination": "/platform/hosting/self-managed/cloud-deployments/terraform" } ], "baseUrl": "https://docs.wandb.ai" From d2312a94f3cf2cb40b0d001c5ac8cda254aa7a2b Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Mon, 2 Feb 2026 16:32:29 -0800 Subject: [PATCH 6/7] Operator revamp phase 6 Delete old files we no longer need --- platform/hosting/operator.mdx | 1078 ----------------- platform/hosting/self-managed/aws-tf.mdx | 629 ---------- platform/hosting/self-managed/azure-tf.mdx | 223 ---- platform/hosting/self-managed/bare-metal.mdx | 169 --- platform/hosting/self-managed/gcp-tf.mdx | 312 ----- .../kubernetes-airgapped.mdx | 1 - .../self-managed/operator-airgapped.mdx | 306 ----- 7 files changed, 2718 deletions(-) delete mode 100644 platform/hosting/operator.mdx delete mode 100644 platform/hosting/self-managed/aws-tf.mdx delete mode 100644 platform/hosting/self-managed/azure-tf.mdx delete mode 100644 platform/hosting/self-managed/bare-metal.mdx delete mode 100644 platform/hosting/self-managed/gcp-tf.mdx delete mode 100644 platform/hosting/self-managed/operator-airgapped.mdx diff --git a/platform/hosting/operator.mdx b/platform/hosting/operator.mdx deleted file mode 100644 index 26390de7c5..0000000000 --- a/platform/hosting/operator.mdx +++ /dev/null @@ -1,1078 +0,0 @@ ---- -description: Deploy W&B Platform with Kubernetes Operator -title: Run W&B Server on Kubernetes ---- - -import SelfManagedNetworkingRequirements from "/snippets/en/_includes/self-managed-networking-requirements.mdx"; -import SelfManagedSslTlsRequirements from "/snippets/en/_includes/self-managed-ssl-tls-requirements.mdx"; -import SelfManagedMysqlRequirements from "/snippets/en/_includes/self-managed-mysql-requirements.mdx"; -import SelfManagedRedisRequirements from "/snippets/en/_includes/self-managed-redis-requirements.mdx"; -import SelfManagedObjectStorageRequirements from "/snippets/en/_includes/self-managed-object-storage-requirements.mdx"; -import SelfManagedVerifyInstallation from "/snippets/en/_includes/self-managed-verify-installation.mdx"; - -## W&B Kubernetes Operator - -Use the W&B Kubernetes Operator to simplify deploying, administering, troubleshooting, and scaling your W&B Server deployments on Kubernetes. You can think of the operator as a smart assistant for your W&B instance. - -The W&B Server architecture and design continuously evolves to expand AI developer tooling capabilities, and to provide appropriate primitives for high performance, better scalability, and easier administration. That evolution applies to the compute services, relevant storage and the connectivity between them. To help facilitate continuous updates and improvements across deployment types, W&B users a Kubernetes operator. - - -W&B uses the operator to deploy and manage Dedicated Cloud instances on AWS, Google Cloud and Azure public clouds. - - -For more information about Kubernetes operators, see [Operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) in the Kubernetes documentation. 
- -### Reasons for the architecture shift -Historically, the W&B application was deployed as a single deployment and pod within a Kubernetes Cluster or a single Docker container. W&B has, and continues to recommend, to externalize the Database and Object Store. Externalizing the Database and Object store decouples the application's state. - -As the application grew, the need to evolve from a monolithic container to a distributed system (microservices) was apparent. This change facilitates backend logic handling and seamlessly introduces built-in Kubernetes infrastructure capabilities. Distributed systems also supports deploying new services essential for additional features that W&B relies on. - -Before 2024, any Kubernetes-related change required manually updating the [terraform-kubernetes-wandb](https://github.com/wandb/terraform-kubernetes-wandb) Terraform module. Updating the Terraform module ensures compatibility across cloud providers, configuring necessary Terraform variables, and executing a Terraform apply for each backend or Kubernetes-level change. - -This process was not scalable since W&B Support had to assist each customer with upgrading their Terraform module. - -The solution was to implement an operator that connects to a central [deploy.wandb.ai](https://deploy.wandb.ai) server to request the latest specification changes for a given release channel and apply them. Updates are received as long as the license is valid. [Helm](https://helm.sh/) is used as both the deployment mechanism for the W&B operator and the means for the operator to handle all configuration templating of the W&B Kubernetes stack, Helm-ception. - -### How it works -You can install the operator with helm or from the source. See [charts/operator](https://github.com/wandb/helm-charts/tree/main/charts/operator) for detailed instructions. - -The installation process creates a deployment called `controller-manager` and uses a [custom resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) definition named `weightsandbiases.apps.wandb.com` (shortName: `wandb`), that takes a single `spec` and applies it to the cluster: - -```yaml -apiVersion: apiextensions.k8s.io/v1 -kind: CustomResourceDefinition -metadata: - name: weightsandbiases.apps.wandb.com -``` - -The `controller-manager` installs [charts/operator-wandb](https://github.com/wandb/helm-charts/tree/main/charts/operator-wandb) based on the spec of the custom resource, release channel, and a user defined config. The configuration specification hierarchy enables maximum configuration flexibility at the user end and enables W&B to release new images, configurations, features, and Helm updates automatically. - -Refer to the [configuration specification hierarchy](#configuration-specification-hierarchy) and [configuration reference](#configuration-reference-for-wb-operator) for configuration options. - -The deployment consists of multiple pods, one per service. Each pod's name is prefixed with `wandb-`. - -### Configuration specification hierarchy -Configuration specifications follow a hierarchical model where higher-level specifications override lower-level ones. Here’s how it works: - -- **Release Channel Values**: This base level configuration sets default values and configurations based on the release channel set by W&B for the deployment. -- **User Input Values**: Users can override the default settings provided by the Release Channel Spec through the System Console. 
-- **Custom Resource Values**: The highest level of specification, which comes from the user. Any values specified here override both the User Input and Release Channel specifications. For a detailed description of the configuration options, see [Configuration Reference](#configuration-reference-for-wb-operator). - -This hierarchical model ensures that configurations are flexible and customizable to meet varying needs while maintaining a manageable and systematic approach to upgrades and changes. - -## Before you begin -1. Refer to the [reference architecture](/platform/hosting/self-managed/ref-arch/#infrastructure-requirements) for complete infrastructure requirements, including: - - Software version requirements (Kubernetes, MySQL, Redis, Helm) - - Hardware requirements (CPU architecture, sizing recommendations) - - Networking, SSL/TLS, and DNS requirements -1. [Obtain a valid W&B Server license](/platform/hosting/hosting-options/self-managed#obtain-your-wb-server-license). -1. See the following sections and the [bare-metal installation guide](/platform/hosting/self-managed/bare-metal/) for detailed instructions to set up and configure W&B Self-Managed. Depending on the installation method, you might need to install additional software or meet additional requirements. - -### MySQL Database - - -### Redis - - -See the [External Redis configuration section](#external-redis) for details on how to configure an external Redis instance. - -### Object storage - - -See the [Object storage configuration section](#object-storage-bucket) for details on how to configure object storage in Helm values. - -### Networking requirements - - -For load balancer and ingress controller options and configuration examples, see the [reference architecture load balancer section](/platform/hosting/self-managed/ref-arch/#load-balancer-and-ingress). - -### SSL/TLS requirements - - -### Air-gapped installations -See the [Deploy W&B in airgapped environment with Kubernetes](/platform/hosting/self-managed/operator-airgapped/) tutorial on how to install the W&B Kubernetes Operator in an airgapped environment. - -### OpenShift Kubernetes clusters - -W&B supports deployment on [OpenShift Kubernetes clusters](https://www.redhat.com/en/technologies/cloud-computing/openshift) in cloud, on-premises, and air-gapped environments. - - -W&B recommends you install with the official W&B Helm chart. - - -#### Run the container as an unprivileged user - -By default, containers use a `$UID` of 999. Specify `$UID` >= 100000 and a `$GID` of 0 if your orchestrator requires the container run with a non-root user. - - -W&B must start as the root group (`$GID=0`) for file system permissions to function properly. - - -Configure security contexts for each W&B component. For example, to configure the API component: - -```yaml -api: - install: true - image: - repository: wandb/megabinary - tag: 0.74.1 - pod: - securityContext: - fsGroup: 10001 - fsGroupChangePolicy: Always - runAsGroup: 0 - runAsNonRoot: true - runAsUser: 10001 - seccompProfile: - type: RuntimeDefault - container: - securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: - - ALL - privileged: false - readOnlyRootFilesystem: false -``` - -If needed, configure a custom security context for other components like `app` or `console`. For details, see [Custom security context](#custom-security-context). 
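
The same structure applies to the other components. For example, here is a sketch for the `console` component that reuses the values from the API example above; adjust `runAsUser` and `fsGroup` to a UID range your cluster allows:

```yaml
console:
  pod:
    securityContext:
      fsGroup: 10001
      fsGroupChangePolicy: Always
      runAsGroup: 0
      runAsNonRoot: true
      runAsUser: 10001
      seccompProfile:
        type: RuntimeDefault
  container:
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      privileged: false
      readOnlyRootFilesystem: false
```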
- -## Deploy W&B Server - - -**The W&B Kubernetes Operator with Helm is the recommended installation method** for all W&B Self-Managed deployments, including cloud, on-premises, and air-gapped environments. - - -This section describes different ways to deploy the W&B Kubernetes operator: -- **Helm CLI**: Direct deployment using Helm commands -- **Helm Terraform Module**: Infrastructure-as-code deployment -- **W&B Cloud Terraform Modules**: Complete infrastructure + application deployment for AWS, Google Cloud, and Azure - -For deployment-specific considerations, also see: -- [Deploy W&B Platform On-premises](/platform/hosting/self-managed/bare-metal/) for datacenter/bare-metal environments -- [Kubernetes operator for air-gapped instances](/platform/hosting/self-managed/operator-airgapped/) for disconnected environments - -### Deploy W&B with Helm CLI -W&B provides a Helm Chart to deploy the W&B Kubernetes operator to a Kubernetes cluster. This approach allows you to deploy W&B Server with Helm CLI or a continuous delivery tool like ArgoCD. Make sure that the above mentioned requirements are in place. - -Follow those steps to install the W&B Kubernetes Operator with Helm CLI: - -1. Add the W&B Helm repository. The W&B Helm chart is available in the W&B Helm repository: - ```shell - helm repo add wandb https://charts.wandb.ai - helm repo update - ``` -2. Install the Operator on a Kubernetes cluster: - ```shell - helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace - ``` -3. Configure the W&B operator custom resource to trigger the W&B Server installation. Create a file named `operator.yaml` with your W&B deployment configuration. Refer to [Configuration Reference](#configuration-reference-for-wb-server) for all available options. - - Here's a minimal example configuration: - - ```yaml - apiVersion: apps.wandb.com/v1 - kind: WeightsAndBiases - metadata: - labels: - app.kubernetes.io/name: weightsandbiases - app.kubernetes.io/instance: wandb - name: wandb - namespace: default - spec: - values: - global: - host: https:// - license: eyJhbGnUzaH...j9ZieKQ2x5GGfw - bucket: -
- mysql: - - ingress: - annotations: - - ``` - -4. Start the Operator with your custom configuration so that it can install, configure, and manage the W&B Server application: - - ```shell - kubectl apply -f operator.yaml - ``` - - Wait until the deployment completes. This takes a few minutes. - -5. To verify the installation using the web UI, create the first admin user account, then follow the verification steps outlined in [Verify the installation](#verify-the-installation). - - -### Deploy W&B with Helm Terraform Module - -This method allows for customized deployments tailored to specific requirements, leveraging Terraform's infrastructure-as-code approach for consistency and repeatability. The official W&B Helm-based Terraform Module is located [here](https://registry.terraform.iohttps://github.com/wandb/helm-charts/tree/main/charts/operator-wandb). - -The following code can be used as a starting point and includes all necessary configuration options for a production grade deployment. - -```hcl -module "wandb" { - source = "wandb/wandb/helm" - - spec = { - values = { - global = { - host = "https://" - license = "eyJhbGnUzaH...j9ZieKQ2x5GGfw" - - bucket = { -
- } - - mysql = { - - } - } - - ingress = { - annotations = { - "a" = "b" - "x" = "y" - } - } - } - } -} -``` - -Note that the configuration options are the same as described in [Configuration Reference](#configuration-reference-for-wb-operator), but that the syntax has to follow the HashiCorp Configuration Language (HCL). The Terraform module creates the W&B custom resource definition (CRD). - -To see how W&B&Biases themselves use the Helm Terraform module to deploy “Dedicated Cloud” installations for customers, follow those links: -- [AWS](https://github.com/wandb/terraform-aws-wandb/blob/45e1d746f53e78e73e68f911a1f8cad5408e74b6/main.tf#L225) -- [Azure](https://github.com/wandb/terraform-azurerm-wandb/blob/170e03136b6b6fc758102d59dacda99768854045/main.tf#L155) -- [Google Cloud](https://github.com/wandb/terraform-google-wandb/blob/49ddc3383df4cefc04337a2ae784f57ce2a2c699/main.tf#L189) - -### Deploy W&B with W&B Cloud Terraform modules - -W&B provides a set of Terraform Modules for AWS, Google Cloud and Azure. Those modules deploy entire infrastructures including Kubernetes clusters, load balancers, MySQL databases and so on as well as the W&B Server application. The W&B Kubernetes Operator is already pre-baked with those official W&B cloud-specific Terraform Modules with the following versions: - -| Terraform Registry | Source Code | Version | -| ------------------------------------------------------------------- | ------------------------------------------------ | ------- | -| [AWS](https://registry.terraform.io/modules/wandb/wandb/aws/latest) | https://github.com/wandb/terraform-aws-wandb | v4.0.0+ | -| [Azure](https://github.com/wandb/terraform-azurerm-wandb) | https://github.com/wandb/terraform-azurerm-wandb | v2.0.0+ | -| [Google Cloud](https://github.com/wandb/terraform-google-wandb) | https://github.com/wandb/terraform-google-wandb | v2.0.0+ | - -This integration ensures that W&B Kubernetes Operator is ready to use for your instance with minimal setup, providing a streamlined path to deploying and managing W&B Server in your cloud environment. - -For a detailed description on how to use these modules, refer to the [Self-Managed installations section](/platform/hosting/hosting-options/self-managed#deploy-wb-server-within-self-managed-cloud-accounts) in the docs. - -### Verify the installation - - - -## Access the W&B Management Console -The W&B Kubernetes operator comes with a management console. It is located at `${HOST_URI}/console`, for example `https://wandb.company-name.com/console`. - -There are two ways to log in to the management console: - - - -1. Open the W&B application in the browser and login. Log in to the W&B application with `${HOST_URI}/`, for example `https://wandb.company-name.com/` -2. Access the console. Click on the icon in the top right corner and then click **System console**. Only users with admin privileges can see the **System console** entry. - - - System console access - - - - -W&B recommends you access the console using the following steps only if Option 1 does not work. - - -1. Open console application in browser. Open the above described URL, which redirects you to the login screen: - - Direct system console access - -2. Retrieve the password from the Kubernetes secret that the installation generates: - ```shell - kubectl get secret wandb-password -o jsonpath='{.data.password}' | base64 -d - ``` - Copy the password. -3. Login to the console. Paste the copied password, then click **Login**. 
- - - -## Update the W&B Kubernetes operator -This section describes how to update the W&B Kubernetes operator. - - -* Updating the W&B Kubernetes operator does not update the W&B server application. -* See the instructions [here](#migrate-Self-Managed-instances-to-wb-operator) if you use a Helm chart that does not user the W&B Kubernetes operator before you follow the proceeding instructions to update the W&B operator. - - -Copy and paste the code snippets below into your terminal. - -1. First, update the repo with [`helm repo update`](https://helm.sh/docs/helm/helm_repo_update/): - ```shell - helm repo update - ``` - -2. Next, update the Helm chart with [`helm upgrade`](https://helm.sh/docs/helm/helm_upgrade/): - ```shell - helm upgrade operator wandb/operator -n wandb-cr --reuse-values - ``` - -## Update the W&B Server application -You no longer need to update W&B Server application if you use the W&B Kubernetes operator. - -The operator automatically updates your W&B Server application when a new version of the software of W&B is released. - - -## Migrate Self-Managed instances to W&B Operator -The proceeding section describe how to migrate from self-managing your own W&B Server installation to using the W&B Operator to do this for you. The migration process depends on how you installed W&B Server: - - -The W&B Operator is the default and recommended installation method for W&B Server. Reach out to [Customer Support](mailto:support@wandb.com) or your W&B team if you have any questions. - - -- If you used the official W&B Cloud Terraform Modules, navigate to the appropriate documentation and follow the steps there: - - [AWS](#migrate-to-operator-based-aws-terraform-modules) - - [Google Cloud](#migrate-to-operator-based-google-cloud-terraform-modules) - - [Azure](#migrate-to-operator-based-azure-terraform-modules) -- If you used the [W&B Non-Operator Helm chart](https://github.com/wandb/helm-charts/tree/main/charts/wandb), continue [here](#migrate-to-operator-based-helm-chart). -- If you used the [W&B Non-Operator Helm chart with Terraform](https://registry.terraform.io/modules/wandb/wandb/kubernetes/latest), continue [here](#migrate-to-operator-based-terraform-helm-chart). -- If you created the Kubernetes resources with manifests, continue [here](#migrate-to-operator-based-helm-chart). - - -### Migrate to Operator-based AWS Terraform Modules - -For a detailed description of the migration process, continue [here](https://github.com/wandb/helm-charts/tree/main/charts/operator-wandb). - -### Migrate to Operator-based Google Cloud Terraform Modules - -Reach out to [Customer Support](mailto:support@wandb.com) or your W&B team if you have any questions or need assistance. - - -### Migrate to Operator-based Azure Terraform Modules - -Reach out to [Customer Support](mailto:support@wandb.com) or your W&B team if you have any questions or need assistance. - -### Migrate to Operator-based Helm chart - -Follow these steps to migrate to the Operator-based Helm chart: - -1. Get the current W&B configuration. If W&B was deployed with an non-operator-based version of the Helm chart, export the values like this: - ```shell - helm get values wandb - ``` - If W&B was deployed with Kubernetes manifests, export the values like this: - ```shell - kubectl get deployment wandb -o yaml - ``` - You now have all the configuration values you need for the next step. - -2. Create a file called `operator.yaml`. Follow the format described in the [Configuration Reference](#configuration-reference-for-wb-operator). 
Use the values from step 1. - -3. Scale the current deployment to 0 pods. This step is stops the current deployment. - ```shell - kubectl scale --replicas=0 deployment wandb - ``` -4. Update the Helm chart repo: - ```shell - helm repo update - ``` -5. Install the new Helm chart: - ```shell - helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace - ``` -6. Configure the new helm chart and trigger W&B application deployment. Apply the new configuration. - ```shell - kubectl apply -f operator.yaml - ``` - The deployment takes a few minutes to complete. - -7. Verify the installation. Make sure that everything works by following the steps in [Verify the installation](#verify-the-installation). - -8. Remove to old installation. Uninstall the old helm chart or delete the resources that were created with manifests. - -### Migrate to Operator-based Terraform Helm chart - -Follow these steps to migrate to the Operator-based Helm chart: - - -1. Prepare Terraform config. Replace the Terraform code from the old deployment in your Terraform config with the one that is described [here](#deploy-wb-with-helm-terraform-module). Set the same variables as before. Do not change .tfvars file if you have one. -2. Execute Terraform run. Execute terraform init, plan and apply -3. Verify the installation. Make sure that everything works by following the steps in [Verify the installation](#verify-the-installation). -4. Remove to old installation. Uninstall the old helm chart or delete the resources that were created with manifests. - - - -## Configuration Reference for W&B Server - -This section describes the configuration options for W&B Server application. The application receives its configuration as custom resource definition named [WeightsAndBiases](#how-it-works). Some configuration options are exposed with the below configuration, some need to be set as environment variables. - -The documentation has two lists of environment variables: [basic](/platform/hosting/env-vars/) and [advanced](/platform/hosting/iam/advanced_env_vars/). Only use environment variables if the configuration option that you need is not exposed using the Helm Chart. - -### Basic example -This example defines the minimum set of values required for W&B. For a more realistic production example, see [Complete example](#complete-example). - -This YAML file defines the desired state of your W&B deployment, including the version, environment variables, external resources like databases, and other necessary settings. - -```yaml -apiVersion: apps.wandb.com/v1 -kind: WeightsAndBiases -metadata: - labels: - app.kubernetes.io/name: weightsandbiases - app.kubernetes.io/instance: wandb - name: wandb - namespace: default -spec: - values: - global: - host: https:// - license: eyJhbGnUzaH...j9ZieKQ2x5GGfw - bucket: -
- mysql: - - ingress: - annotations: - -``` - -Find the full set of values in the [W&B Helm repository](https://github.com/wandb/helm-charts/blob/main/charts/operator-wandb/values.yaml). **Change only those values you need to override**. - -### Complete example -This example configuration deploys W&B to Google Cloud Anthos using Google Cloud Storage: - -```yaml -apiVersion: apps.wandb.com/v1 -kind: WeightsAndBiases -metadata: - labels: - app.kubernetes.io/name: weightsandbiases - app.kubernetes.io/instance: wandb - name: wandb - namespace: default -spec: - values: - global: - host: https://abc-wandb.sandbox-gcp.wandb.ml - bucket: - name: abc-wandb-moving-pipefish - provider: gcs - mysql: - database: wandb_local - host: 10.218.0.2 - name: wandb_local - password: 8wtX6cJHizAZvYScjDzZcUarK4zZGjpV - port: 3306 - user: wandb - redis: - host: redis.example.com - port: 6379 - password: password - api: - enabled: true - glue: - enabled: true - executor: - enabled: true - license: eyJhbGnUzaHgyQjQyQWhEU3...ZieKQ2x5GGfw - ingress: - annotations: - ingress.gcp.kubernetes.io/pre-shared-cert: abc-wandb-cert-creative-puma - kubernetes.io/ingress.class: gce - kubernetes.io/ingress.global-static-ip-name: abc-wandb-operator-address -``` - -### Host -```yaml - # Provide the FQDN with protocol -global: - # example host name, replace with your own - host: https://wandb.example.com -``` - -### Object storage (bucket) - -**AWS** -```yaml -global: - bucket: - provider: "s3" - name: "" - kmsKey: "" - region: "" -``` - -**Google Cloud** -```yaml -global: - bucket: - provider: "gcs" - name: "" -``` - -**Azure** -```yaml -global: - bucket: - provider: "az" - name: "" - secretKey: "" -``` - -**Other providers (Minio, Ceph, etc.)** - -For other S3 compatible providers, set the bucket configuration as follows: -```yaml -global: - bucket: - # Example values, replace with your own - provider: s3 - name: storage.example.com - kmsKey: null - path: wandb - region: default - accessKey: 5WOA500...P5DK7I - secretKey: HDKYe4Q...JAp1YyjysnX -``` - -For S3-compatible storage hosted outside of AWS, `kmsKey` must be `null`. - -To reference `accessKey` and `secretKey` from a secret: -```yaml -global: - bucket: - # Example values, replace with your own - provider: s3 - name: storage.example.com - kmsKey: null - path: wandb - region: default - secret: - secretName: bucket-secret - accessKeyName: ACCESS_KEY - secretKeyName: SECRET_KEY -``` - -### MySQL - -```yaml -global: - mysql: - # Example values, replace with your own - host: db.example.com - port: 3306 - database: wandb_local - user: wandb - password: 8wtX6cJH...ZcUarK4zZGjpV -``` - -To reference the `password` from a secret: -```yaml -global: - mysql: - # Example values, replace with your own - host: db.example.com - port: 3306 - database: wandb_local - user: wandb - passwordSecret: - name: database-secret - passwordKey: MYSQL_WANDB_PASSWORD -``` - -### License - -```yaml -global: - # Example license, replace with your own - license: eyJhbGnUzaHgyQjQy...VFnPS_KETXg1hi -``` - -To reference the `license` from a secret: -```yaml -global: - licenseSecret: - name: license-secret - key: CUSTOMER_WANDB_LICENSE -``` - -### Ingress - -To identify the ingress class, see this FAQ [entry](#how-to-identify-the-kubernetes-ingress-class). 
- -**Without TLS** - -```yaml -global: -# IMPORTANT: Ingress is on the same level in the YAML as ‘global’ (not a child) -ingress: - class: "" -``` - -**With TLS** - -Create a secret that contains the certificate - -```console -kubectl create secret tls wandb-ingress-tls --key wandb-ingress-tls.key --cert wandb-ingress-tls.crt -``` - -Reference the secret in the ingress configuration -```yaml -global: -# IMPORTANT: Ingress is on the same level in the YAML as ‘global’ (not a child) -ingress: - class: "" - annotations: - {} - # kubernetes.io/ingress.class: nginx - # kubernetes.io/tls-acme: "true" - tls: - - secretName: wandb-ingress-tls - hosts: - - -``` - -In case of Nginx you might have to add the following annotation: - -``` -ingress: - annotations: - nginx.ingress.kubernetes.io/proxy-body-size: 0 -``` - -### Custom Kubernetes ServiceAccounts - -Specify custom Kubernetes service accounts to run the W&B pods. - -The following snippet creates a service account as part of the deployment with the specified name: - -```yaml -app: - serviceAccount: - name: custom-service-account - create: true - -parquet: - serviceAccount: - name: custom-service-account - create: true - -global: - ... -``` -The subsystems "app" and "parquet" run under the specified service account. The other subsystems run under the default service account. - -If the service account already exists on the cluster, set `create: false`: - -```yaml -app: - serviceAccount: - name: custom-service-account - create: false - -parquet: - serviceAccount: - name: custom-service-account - create: false - -global: - ... -``` - -You can specify service accounts on different subsystems such as app, parquet, console, and others: - -```yaml -app: - serviceAccount: - name: custom-service-account - create: true - -console: - serviceAccount: - name: custom-service-account - create: true - -global: - ... -``` - -The service accounts can be different between the subsystems: - -```yaml -app: - serviceAccount: - name: custom-service-account - create: false - -console: - serviceAccount: - name: another-custom-service-account - create: true - -global: - ... -``` - -### External Redis - -```yaml -redis: - install: false - -global: - redis: - host: "" - port: 6379 - password: "" - parameters: {} - caCert: "" -``` - -To reference the `password` from a secret: - -```console -kubectl create secret generic redis-secret --from-literal=redis-password=supersecret -``` - -Reference it in below configuration: -```yaml -redis: - install: false - -global: - redis: - host: redis.example - port: 9001 - auth: - enabled: true - secret: redis-secret - key: redis-password -``` - -### LDAP - - -LDAP configuration support in the current Helm chart is limited. Contact W&B Support or your AISE for assistance configuring LDAP. - - -Configure LDAP by setting environment variables in `global.extraEnv`: - -```yaml -global: - extraEnv: - LDAP_ADDRESS: ldaps://ldap.company.example.com - LDAP_BASE_DN: cn=accounts,dc=company,dc=example,dc=com - LDAP_USER_BASE_DN: cn=users,cn=accounts,dc=company,dc=example,dc=com - LDAP_GROUP_BASE_DN: cn=groups,cn=accounts,dc=company,dc=example,dc=com - LDAP_BIND_DN: uid=ldapbind,cn=sysaccounts,cn=etc,dc=company,dc=example,dc=com - LDAP_BIND_PW: ******************** - LDAP_ATTRIBUTES: email=mail,name=cn - LDAP_TLS_ENABLE: "true" - LDAP_LOGIN: "true" - LDAP_USER_OBJECT_CLASS: user - LDAP_GROUP_OBJECT_CLASS: group -``` - - -This legacy approach is no longer recommended. This section is provided for reference. 
- -**Without TLS** -```yaml -global: - ldap: - enabled: true - # LDAP server address including "ldap://" or "ldaps://" - host: - # LDAP search base to use for finding users - baseDN: - # LDAP user to bind with (if not using anonymous bind) - bindDN: - # Secret name and key with LDAP password to bind with (if not using anonymous bind) - bindPW: - # LDAP attribute for email and group ID attribute names as comma separated string values. - attributes: - # LDAP group allow list - groupAllowList: - # Enable LDAP TLS - tls: false -``` - -**With TLS** - -The LDAP TLS cert configuration requires a config map pre-created with the certificate content. - -To create the config map you can use the following command: - -```console -kubectl create configmap ldap-tls-cert --from-file=certificate.crt -``` - -And use the config map in the YAML like the example below - -```yaml -global: - ldap: - enabled: true - # LDAP server address including "ldap://" or "ldaps://" - host: - # LDAP search base to use for finding users - baseDN: - # LDAP user to bind with (if not using anonymous bind) - bindDN: - # Secret name and key with LDAP password to bind with (if not using anonymous bind) - bindPW: - # LDAP attribute for email and group ID attribute names as comma separated string values. - attributes: - # LDAP group allow list - groupAllowList: - # Enable LDAP TLS - tls: true - # ConfigMap name and key with CA certificate for LDAP server - tlsCert: - configMap: - name: "ldap-tls-cert" - key: "certificate.crt" -``` - - -### OIDC SSO - -```yaml -global: - auth: - sessionLengthHours: 720 - oidc: - clientId: "" - secret: "" - # Only include if your IdP requires it. - authMethod: "" - issuer: "" -``` - -`authMethod` is optional. - -### SMTP - -```yaml -global: - email: - smtp: - host: "" - port: 587 - user: "" - password: "" -``` - -### Environment Variables -```yaml -global: - extraEnv: - GLOBAL_ENV: "example" -``` - -### Custom certificate authority -`customCACerts` is a list and can take many certificates. Certificate authorities specified in `customCACerts` only apply to the W&B Server application. - -```yaml -global: - customCACerts: - - | - -----BEGIN CERTIFICATE----- - MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ - MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0 - MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG - SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg - P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0= - -----END CERTIFICATE----- - - | - -----BEGIN CERTIFICATE----- - MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ - MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0 - MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU - SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU - aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ== - -----END CERTIFICATE----- -``` - -CA certificates can also be stored in a ConfigMap: -```yaml -global: - caCertsConfigMap: custom-ca-certs -``` - -The ConfigMap must look like this: -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: custom-ca-certs -data: - ca-cert1.crt: | - -----BEGIN CERTIFICATE----- - ... - -----END CERTIFICATE----- - ca-cert2.crt: | - -----BEGIN CERTIFICATE----- - ... - -----END CERTIFICATE----- -``` - - -If using a ConfigMap, each key in the ConfigMap must end with `.crt` (for example, `my-cert.crt` or `ca-cert1.crt`). This naming convention is required for `update-ca-certificates` to parse and add each certificate to the system CA store. 
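
For reference, a ConfigMap in this shape can be created directly from certificate files on disk. The file names below reuse the example key names shown above; substitute your own certificate files:

```console
kubectl create configmap custom-ca-certs --from-file=ca-cert1.crt --from-file=ca-cert2.crt
```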
- - -### Custom security context - -Each W&B component supports custom security context configurations of the following form: - -```yaml -pod: - securityContext: - runAsNonRoot: true - runAsUser: 1001 - runAsGroup: 0 - fsGroup: 1001 - fsGroupChangePolicy: Always - seccompProfile: - type: RuntimeDefault -container: - securityContext: - capabilities: - drop: - - ALL - readOnlyRootFilesystem: false - allowPrivilegeEscalation: false -``` - - -The only valid value for `runAsGroup:` is `0`. Any other value is an error. - - - -For example, to configure the application pod, add a section `app` to your configuration: - -```yaml -global: - ... -app: - pod: - securityContext: - runAsNonRoot: true - runAsUser: 1001 - runAsGroup: 0 - fsGroup: 1001 - fsGroupChangePolicy: Always - seccompProfile: - type: RuntimeDefault - container: - securityContext: - capabilities: - drop: - - ALL - readOnlyRootFilesystem: false - allowPrivilegeEscalation: false -``` - -The same concept applies to `console`, `weave`, `weave-trace` and `parquet`. - -## Configuration Reference for W&B Operator - -This section describes configuration options for W&B Kubernetes operator (`wandb-controller-manager`). The operator receives its configuration in the form of a YAML file. - -By default, the W&B Kubernetes operator does not need a configuration file. Create a configuration file if required. For example, you might need a configuration file to specify custom certificate authorities, deploy in an air gap environment and so forth. - -Find the full list of spec customization [in the Helm repository](https://github.com/wandb/helm-charts/blob/main/charts/operator/values.yaml). - -### Custom CA -A custom certificate authority (`customCACerts`), is a list and can take many certificates. Those certificate authorities when added only apply to the W&B Kubernetes operator (`wandb-controller-manager`). - -```yaml -customCACerts: -- | - -----BEGIN CERTIFICATE----- - MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ - MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0 - MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG - SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg - P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0= - -----END CERTIFICATE----- -- | - -----BEGIN CERTIFICATE----- - MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ - MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0 - MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU - SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU - aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ== - -----END CERTIFICATE----- -``` - -CA certificates can also be stored in a ConfigMap: -```yaml -caCertsConfigMap: custom-ca-certs -``` - -The ConfigMap must look like this: -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: custom-ca-certs -data: - ca-cert1.crt: | - -----BEGIN CERTIFICATE----- - ... - -----END CERTIFICATE----- - ca-cert2.crt: | - -----BEGIN CERTIFICATE----- - ... - -----END CERTIFICATE----- -``` - - -Each key in the ConfigMap must end with `.crt` (e.g., `my-cert.crt` or `ca-cert1.crt`). This naming convention is required for `update-ca-certificates` to parse and add each certificate to the system CA store. - - -## FAQ - -### What is the purpose/role of each individual pod? -* **`wandb-app`**: the core of W&B, including the GraphQL API and frontend application. It powers most of our platform’s functionality. -* **`wandb-console`**: the administration console, accessed via `/console`. 
-* **`wandb-otel`**: the OpenTelemetry agent, which collects metrics and logs from resources at the Kubernetes layer for display in the administration console. -* **`wandb-prometheus`**: the Prometheus server, which captures metrics from various components for display in the administration console. -* **`wandb-parquet`**: a backend microservice separate from the `wandb-app` pod that exports database data to object storage in Parquet format. -* **`wandb-weave`**: another backend microservice that loads query tables in the UI and supports various core app features. -* **`wandb-weave-trace`**: a framework for tracking, experimenting with, evaluating, deploying, and improving LLM-based applications. The framework is accessed via the `wandb-app` pod. - -### How to get the W&B Operator Console password -See [Accessing the W&B Kubernetes Operator Management Console](#access-the-wb-management-console). - - -### How to access the W&B Operator Console if Ingress doesn’t work - -Execute the following command on a host that can reach the Kubernetes cluster: - -```console -kubectl port-forward svc/wandb-console 8082 -``` - -Access the console in the browser with `https://localhost:8082/` console. - -See [Accessing the W&B Kubernetes Operator Management Console](#access-the-wb-management-console) on how to get the password (Option 2). - -### How to view W&B Server logs - -The application pod is named **wandb-app-xxx**. - -```console -kubectl get pods -kubectl logs wandb-XXXXX-XXXXX -``` - -### How to identify the Kubernetes ingress class - -You can get the ingress class installed in your cluster by running - -```console -kubectl get ingressclass -``` diff --git a/platform/hosting/self-managed/aws-tf.mdx b/platform/hosting/self-managed/aws-tf.mdx deleted file mode 100644 index 4af0429f9c..0000000000 --- a/platform/hosting/self-managed/aws-tf.mdx +++ /dev/null @@ -1,629 +0,0 @@ ---- -description: Hosting W&B Server on AWS. -title: Deploy W&B Platform on AWS ---- - - -W&B recommends fully managed deployment options such as [W&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. - - -W&B recommends using the [W&B Server AWS Terraform Module](https://registry.terraform.io/modules/wandb/wandb/aws/latest) to deploy the platform on AWS. - -Before you start, W&B recommends that you choose one of the [remote backends](https://developer.hashicorp.com/terraform/language/backend) available for Terraform to store the [State File](https://developer.hashicorp.com/terraform/language/state). - -The State File is the necessary resource to roll out upgrades or make changes in your deployment without recreating all components. - -The Terraform Module deploys the following `mandatory` components: - -- Load Balancer -- AWS Identity & Access Management (IAM) -- AWS Key Management System (KMS) -- Amazon Aurora MySQL -- Amazon VPC -- Amazon S3 -- Amazon Route53 -- Amazon Certificate Manager (ACM) -- Amazon Elastic Load Balancing (ALB) -- Amazon Secrets Manager - -Other deployment options can also include the following optional components: - -- Elastic Cache for Redis -- SQS - -## Pre-requisite permissions - -The account that runs Terraform needs to be able to create all components described in the Introduction and permission to create **IAM Policies** and **IAM Roles** and assign roles to resources. 
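
If you opt for the S3 remote backend mentioned earlier on this page, a minimal sketch of the backend block looks like the following. The bucket and DynamoDB table names are placeholders for state-storage resources you create separately, and the region is only an example:

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state-bucket"  # placeholder: pre-existing S3 bucket
    key            = "wandb/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "my-tf-locks"         # placeholder: optional table for state locking
    encrypt        = true
  }
}
```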
- -## General steps - -The steps on this topic are common for any deployment option covered by this documentation. - -1. Prepare the development environment. - - Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) - - W&B recommend creating a Git repository for version control. -2. Create the `terraform.tfvars` file. - - The `tvfars` file content can be customized according to the installation type, but the minimum recommended will look like the example below. - - ```bash - namespace = "wandb" - license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz" - subdomain = "wandb-aws" - domain_name = "wandb.ml" - zone_id = "xxxxxxxxxxxxxxxx" - allowed_inbound_cidr = ["0.0.0.0/0"] - allowed_inbound_ipv6_cidr = ["::/0"] - eks_cluster_version = "1.29" - ``` - - Ensure to define variables in your `tvfars` file before you deploy because the `namespace` variable is a string that prefixes all resources created by Terraform. - - - The combination of `subdomain` and `domain` will form the FQDN that W&B will be configured. In the example above, the W&B FQDN will be `wandb-aws.wandb.ml` and the DNS `zone_id` where the FQDN record will be created. - - Both `allowed_inbound_cidr` and `allowed_inbound_ipv6_cidr` also require setting. In the module, this is a mandatory input. The proceeding example permits access from any source to the W&B installation. - -3. Create the file `versions.tf` - - This file will contain the Terraform and Terraform provider versions required to deploy W&B in AWS - - ```bash - provider "aws" { - region = "eu-central-1" - - default_tags { - tags = { - GithubRepo = "terraform-aws-wandb" - GithubOrg = "wandb" - Enviroment = "Example" - Example = "PublicDnsExternal" - } - } - } - ``` - - Refer to the [Terraform Official Documentation](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#provider-configuration) to configure the AWS provider. - - Optionally, but highly recommended, add the [remote backend configuration](https://developer.hashicorp.com/terraform/language/backend) mentioned at the beginning of this documentation. - -4. Create the file `variables.tf` - - For every option configured in the `terraform.tfvars` Terraform requires a correspondent variable declaration. - - ``` - variable "namespace" { - type = string - description = "Name prefix used for resources" - } - - variable "domain_name" { - type = string - description = "Domain name used to access instance." - } - - variable "subdomain" { - type = string - default = null - description = "Subdomain for accessing the Weights & Biases UI." - } - - variable "license" { - type = string - } - - variable "zone_id" { - type = string - description = "Domain for creating the Weights & Biases subdomain on." - } - - variable "allowed_inbound_cidr" { - description = "CIDRs allowed to access wandb-server." - nullable = false - type = list(string) - } - - variable "allowed_inbound_ipv6_cidr" { - description = "CIDRs allowed to access wandb-server." - nullable = false - type = list(string) - } - - variable "eks_cluster_version" { - description = "EKS cluster kubernetes version" - nullable = false - type = string - } - ``` - -## Recommended deployment option - -This is the most straightforward deployment option configuration that creates all `Mandatory` components and installs in the `Kubernetes Cluster` the latest version of `W&B`. - -1. 
Create the `main.tf` - - In the same directory where you created the files in the `General Steps`, create a file `main.tf` with the following content: - - ``` - module "wandb_infra" { - source = "wandb/wandb/aws" - version = "~>7.0" - - namespace = var.namespace - domain_name = var.domain_name - subdomain = var.subdomain - zone_id = var.zone_id - - allowed_inbound_cidr = var.allowed_inbound_cidr - allowed_inbound_ipv6_cidr = var.allowed_inbound_ipv6_cidr - - public_access = true - external_dns = true - kubernetes_public_access = true - kubernetes_public_access_cidrs = ["0.0.0.0/0"] - eks_cluster_version = var.eks_cluster_version - } - - data "aws_eks_cluster" "eks_cluster_id" { - name = module.wandb_infra.cluster_name - } - - data "aws_eks_cluster_auth" "eks_cluster_auth" { - name = module.wandb_infra.cluster_name - } - - provider "kubernetes" { - host = data.aws_eks_cluster.eks_cluster_id.endpoint - cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster_id.certificate_authority.0.data) - token = data.aws_eks_cluster_auth.eks_cluster_auth.token - } - - - provider "helm" { - kubernetes { - host = data.aws_eks_cluster.eks_cluster_id.endpoint - cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster_id.certificate_authority.0.data) - token = data.aws_eks_cluster_auth.eks_cluster_auth.token - } - } - - output "url" { - value = module.wandb_infra.url - } - - output "bucket" { - value = module.wandb_infra.bucket_name - } - ``` - -2. Deploy W&B - - To deploy W&B, execute the following commands: - - ``` - terraform init - terraform apply -var-file=terraform.tfvars - ``` - -## Enable REDIS - -Another deployment option uses `Redis` to cache the SQL queries and speed up the application response when loading the metrics for the experiments. - -You need to add the option `create_elasticache_subnet = true` to the same `main.tf` file described in the [Recommended deployment](#recommended-deployment-option) section to enable the cache. - -``` -module "wandb_infra" { - source = "wandb/wandb/aws" - version = "~>7.0" - - namespace = var.namespace - domain_name = var.domain_name - subdomain = var.subdomain - zone_id = var.zone_id - **create_elasticache_subnet = true** -} -[...] -``` - -## Enable message broker (queue) - -Deployment option 3 consists of enabling the external `message broker`. This is optional because the W&B brings embedded a broker. This option doesn't bring a performance improvement. - -The AWS resource that provides the message broker is the `SQS`, and to enable it, you will need to add the option `use_internal_queue = false` to the same `main.tf` described in the [Recommended deployment](#recommended-deployment-option) section. - -``` -module "wandb_infra" { - source = "wandb/wandb/aws" - version = "~>7.0" - - namespace = var.namespace - domain_name = var.domain_name - subdomain = var.subdomain - zone_id = var.zone_id - **use_internal_queue = false** - -[...] -} -``` - -## Other deployment options - -You can combine all three deployment options adding all configurations to the same file. 
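
For example, the following sketch shows a single module block that layers both optional components on top of the recommended configuration; every variable is one already declared in `variables.tf` earlier on this page:

```hcl
module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "~>7.0"

  namespace   = var.namespace
  domain_name = var.domain_name
  subdomain   = var.subdomain
  zone_id     = var.zone_id

  allowed_inbound_cidr      = var.allowed_inbound_cidr
  allowed_inbound_ipv6_cidr = var.allowed_inbound_ipv6_cidr

  public_access                  = true
  external_dns                   = true
  kubernetes_public_access       = true
  kubernetes_public_access_cidrs = ["0.0.0.0/0"]
  eks_cluster_version            = var.eks_cluster_version

  # Optional components described above
  create_elasticache_subnet = true  # Redis query cache
  use_internal_queue        = false # external SQS message broker
}
```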
-The [Terraform Module](https://github.com/wandb/terraform-aws-wandb) provides several options that can be combined along with the standard options and the minimal configuration found in `Deployment - Recommended` - -## Manual configuration - -To use an Amazon S3 bucket as a file storage backend for W&B, you will need to: - -* [Create an Amazon S3 Bucket and Bucket Notifications](#create-an-s3-bucket-and-bucket-notifications) -* [Create SQS Queue](#create-an-sqs-queue) -* [Grant Permissions to Node Running W&B](#grant-permissions-to-node-that-runs-wb) - - - you'll need to create a bucket, along with an SQS queue configured to receive object creation notifications from that bucket. Your instance will need permissions to read from this queue. - -### Create an S3 Bucket and Bucket Notifications - -Follow the procedure bellow to create an Amazon S3 bucket and enable bucket notifications. - -1. Navigate to Amazon S3 in the AWS Console. -2. Select **Create bucket**. -3. Within the **Advanced settings**, select **Add notification** within the **Events** section. -4. Configure all object creation events to be sent to the SQS Queue you configured earlier. - - - Enterprise file storage settings - - -Enable CORS access. Your CORS configuration should look like the following: - -```markup - - - - http://YOUR-W&B-SERVER-IP - GET - PUT - * - - -``` - -### Create an SQS Queue - -Follow the procedure below to create an SQS Queue: - -1. Navigate to Amazon SQS in the AWS Console. -2. Select **Create queue**. -3. From the **Details** section, select a **Standard** queue type. -4. Within the Access policy section, add permission to the following principals: -* `SendMessage` -* `ReceiveMessage` -* `ChangeMessageVisibility` -* `DeleteMessage` -* `GetQueueUrl` - -Optionally add an advanced access policy in the **Access Policy** section. For example, the policy for accessing Amazon SQS with a statement is as follows: - -```json -{ - "Version" : "2012-10-17", - "Statement" : [ - { - "Effect" : "Allow", - "Principal" : "*", - "Action" : ["sqs:SendMessage"], - "Resource" : "", - "Condition" : { - "ArnEquals" : { "aws:SourceArn" : "" } - } - } - ] -} -``` - -### Grant permissions to node that runs W&B - -The node where W&B server is running must be configured to permit access to Amazon S3 and Amazon SQS. Depending on the type of server deployment you have opted for, you may need to add the following policy statements to your node role: - -```json -{ - "Statement":[ - { - "Sid":"", - "Effect":"Allow", - "Action":"s3:*", - "Resource":"arn:aws:s3:::" - }, - { - "Sid":"", - "Effect":"Allow", - "Action":[ - "sqs:*" - ], - "Resource":"arn:aws:sqs:::" - } - ] -} -``` - -### Configure W&B server -Finally, configure your W&B Server. - -1. Navigate to the W&B settings page at `http(s)://YOUR-W&B-SERVER-HOST/system-admin`. -2. Enable the ***Use an external file storage backend* option -3. Provide information about your Amazon S3 bucket, region, and Amazon SQS queue in the following format: -* **File Storage Bucket**: `s3://` -* **File Storage Region (AWS only)**: `` -* **Notification Subscription**: `sqs://` - - - AWS file storage configuration - - -4. Select **Update settings** to apply the new settings. - -## Upgrade your W&B version - -Follow the steps outlined here to update W&B: - -1. Add `wandb_version` to your configuration in your `wandb_app` module. Provide the version of W&B you want to upgrade to. 
For example, the following line specifies W&B version `0.48.1`: - - ``` - module "wandb_app" { - source = "wandb/wandb/kubernetes" - version = "~>1.0" - - license = var.license - wandb_version = "0.48.1" - ``` - - -Alternatively, you can add the `wandb_version` to the `terraform.tfvars` and create a variable with the same name and instead of using the literal value, use the `var.wandb_version` - - -2. After you update your configuration, complete the steps described in the [Recommended deployment section](#recommended-deployment-option). - -## Migrate to operator-based AWS Terraform modules - -This section details the steps required to upgrade from _pre-operator_ to _post-operator_ environments using the [terraform-aws-wandb](https://registry.terraform.io/modules/wandb/wandb/aws/latest) module. - - -The transition to a Kubernetes [operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) pattern is necessary for the W&B architecture. See the [architecture shift explanation](/platform/hosting/operator/#reasons-for-the-architecture-shift) for a detailed explanation. - - - -### Before and after architecture - -Previously, the W&B architecture used: - -```hcl -module "wandb_infra" { - source = "wandb/wandb/aws" - version = "1.16.10" - ... -} -``` - -to control the infrastructure: - - - pre-operator-infra - - -and this module to deploy the W&B Server: - -```hcl -module "wandb_app" { - source = "wandb/wandb/kubernetes" - version = "1.12.0" -} -``` - - - pre-operator-k8s - - -Post-transition, the architecture uses: - -```hcl -module "wandb_infra" { - source = "wandb/wandb/aws" - version = "4.7.2" - ... -} -``` - -to manage both the installation of infrastructure and the W&B Server to the Kubernetes cluster, thus eliminating the need for the `module "wandb_app"` in `post-operator.tf`. - - - post-operator-k8s - - -This architectural shift enables additional features (like OpenTelemetry, Prometheus, HPAs, Kafka, and image updates) without requiring manual Terraform operations by SRE/Infrastructure teams. - -To commence with a base installation of the W&B Pre-Operator, ensure that `post-operator.tf` has a `.disabled` file extension and `pre-operator.tf` is active (that does not have a `.disabled` extension). Those files can be found [here](https://github.com/wandb/terraform-aws-wandb/tree/main/docs/operator-migration). - -### Prerequisites - -Before initiating the migration process, ensure the following prerequisites are met: - -- **Egress**: The deployment can't be airgapped. It needs access to [deploy.wandb.ai](https://deploy.wandb.ai) to get the latest spec for the **_Release Channel_**. -- **AWS Credentials**: Proper AWS credentials configured to interact with your AWS resources. -- **Terraform Installed**: The latest version of Terraform should be installed on your system. -- **Route53 Hosted Zone**: An existing Route53 hosted zone corresponding to the domain under which the application will be served. -- **Pre-Operator Terraform Files**: Ensure `pre-operator.tf` and associated variable files like `pre-operator.tfvars` are correctly set up. 
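
Before running Terraform, a few optional sanity checks can confirm these prerequisites are in place. This sketch uses the example domain from this walkthrough; replace it with your own hosted zone:

```bash
terraform version
aws sts get-caller-identity
aws route53 list-hosted-zones-by-name --dns-name "sandbox-aws.wandb.ml" --max-items 1
# Any HTTP response code confirms egress to the release channel endpoint
curl -sS -o /dev/null -w '%{http_code}\n' https://deploy.wandb.ai
```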
- -### Pre-Operator set up - -Execute the following Terraform commands to initialize and apply the configuration for the Pre-Operator setup: - -```bash -terraform init -upgrade -terraform apply -var-file=./pre-operator.tfvars -``` - -`pre-operator.tf` should look something like this: - -```ini -namespace = "operator-upgrade" -domain_name = "sandbox-aws.wandb.ml" -zone_id = "Z032246913CW32RVRY0WU" -subdomain = "operator-upgrade" -wandb_license = "ey..." -wandb_version = "0.51.2" -``` - -The `pre-operator.tf` configuration calls two modules: - -```hcl -module "wandb_infra" { - source = "wandb/wandb/aws" - version = "1.16.10" - ... -} -``` - -This module spins up the infrastructure. - -```hcl -module "wandb_app" { - source = "wandb/wandb/kubernetes" - version = "1.12.0" -} -``` - -This module deploys the application. - -### Post-Operator Setup - -Make sure that `pre-operator.tf` has a `.disabled` extension, and `post-operator.tf` is active. - -The `post-operator.tfvars` includes additional variables: - -```ini -... -# wandb_version = "0.51.2" is now managed via the Release Channel or set in the User Spec. - -# Required Operator Variables for Upgrade: -size = "small" -enable_dummy_dns = true -enable_operator_alb = true -custom_domain_filter = "sandbox-aws.wandb.ml" -``` - -Run the following commands to initialize and apply the Post-Operator configuration: - -```bash -terraform init -upgrade -terraform apply -var-file=./post-operator.tfvars -``` - -The plan and apply steps will update the following resources: - -```yaml -actions: - create: - - aws_efs_backup_policy.storage_class - - aws_efs_file_system.storage_class - - aws_efs_mount_target.storage_class["0"] - - aws_efs_mount_target.storage_class["1"] - - aws_eks_addon.efs - - aws_iam_openid_connect_provider.eks - - aws_iam_policy.secrets_manager - - aws_iam_role_policy_attachment.ebs_csi - - aws_iam_role_policy_attachment.eks_efs - - aws_iam_role_policy_attachment.node_secrets_manager - - aws_security_group.storage_class_nfs - - aws_security_group_rule.nfs_ingress - - random_pet.efs - - aws_s3_bucket_acl.file_storage - - aws_s3_bucket_cors_configuration.file_storage - - aws_s3_bucket_ownership_controls.file_storage - - aws_s3_bucket_server_side_encryption_configuration.file_storage - - helm_release.operator - - helm_release.wandb - - aws_cloudwatch_log_group.this[0] - - aws_iam_policy.default - - aws_iam_role.default - - aws_iam_role_policy_attachment.default - - helm_release.external_dns - - aws_default_network_acl.this[0] - - aws_default_route_table.default[0] - - aws_iam_policy.default - - aws_iam_role.default - - aws_iam_role_policy_attachment.default - - helm_release.aws_load_balancer_controller - - update_in_place: - - aws_iam_policy.node_IMDSv2 - - aws_iam_policy.node_cloudwatch - - aws_iam_policy.node_kms - - aws_iam_policy.node_s3 - - aws_iam_policy.node_sqs - - aws_eks_cluster.this[0] - - aws_elasticache_replication_group.default - - aws_rds_cluster.this[0] - - aws_rds_cluster_instance.this["1"] - - aws_default_security_group.this[0] - - aws_subnet.private[0] - - aws_subnet.private[1] - - aws_subnet.public[0] - - aws_subnet.public[1] - - aws_launch_template.workers["primary"] - - destroy: - - kubernetes_config_map.config_map - - kubernetes_deployment.wandb - - kubernetes_priority_class.priority - - kubernetes_secret.secret - - kubernetes_service.prometheus - - kubernetes_service.service - - random_id.snapshot_identifier[0] - - replace: - - aws_autoscaling_attachment.autoscaling_attachment["primary"] - - aws_route53_record.alb - - 
aws_eks_node_group.workers["primary"] -``` - -You should see something like this: - - - post-operator-apply - - -Note that in `post-operator.tf`, there is a single: - -```hcl -module "wandb_infra" { - source = "wandb/wandb/aws" - version = "4.7.2" - ... -} -``` - -#### Changes in the post-operator configuration: - -1. **Update Required Providers**: Change `required_providers.aws.version` from `3.6` to `4.0` for provider compatibility. -2. **DNS and Load Balancer Configuration**: Integrate `enable_dummy_dns` and `enable_operator_alb` to manage DNS records and AWS Load Balancer setup through an Ingress. -3. **License and Size Configuration**: Transfer the `license` and `size` parameters directly to the `wandb_infra` module to match new operational requirements. -4. **Custom Domain Handling**: If necessary, use `custom_domain_filter` to troubleshoot DNS issues by checking the External DNS pod logs within the `kube-system` namespace. -5. **Helm Provider Configuration**: Enable and configure the Helm provider to manage Kubernetes resources effectively: - -```hcl -provider "helm" { - kubernetes { - host = data.aws_eks_cluster.app_cluster.endpoint - cluster_ca_certificate = base64decode(data.aws_eks_cluster.app_cluster.certificate_authority[0].data) - token = data.aws_eks_cluster_auth.app_cluster.token - exec { - api_version = "client.authentication.k8s.io/v1beta1" - args = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.app_cluster.name] - command = "aws" - } - } -} -``` - -This comprehensive setup ensures a smooth transition from the Pre-Operator to the Post-Operator configuration, leveraging new efficiencies and capabilities enabled by the operator model. \ No newline at end of file diff --git a/platform/hosting/self-managed/azure-tf.mdx b/platform/hosting/self-managed/azure-tf.mdx deleted file mode 100644 index 8c82574169..0000000000 --- a/platform/hosting/self-managed/azure-tf.mdx +++ /dev/null @@ -1,223 +0,0 @@ ---- -description: Hosting W&B Server on Azure. -title: Deploy W&B Platform on Azure ---- - - -W&B recommends fully managed deployment options such as [W&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. - - - -If you've determined to Self-Managed W&B Server, W&B recommends using the [W&B Server Azure Terraform Module](https://registry.terraform.io/modules/wandb/wandb/azurerm/latest) to deploy the platform on Azure. - -The module documentation is extensive and contains all available options that can be used. We will cover some deployment options in this document. - -Before you start, we recommend you choose one of the [remote backends](https://developer.hashicorp.com/terraform/language/backend) available for Terraform to store the [State File](https://developer.hashicorp.com/terraform/language/state). - -The State File is the necessary resource to roll out upgrades or make changes in your deployment without recreating all components. 
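As an illustration, one way to provision such a remote backend in Azure is a storage account with a blob container that holds the State File. This is only a sketch with placeholder resource names, and it assumes the Azure CLI is installed and authenticated:

```bash
# Create a resource group, storage account, and container to hold Terraform state (placeholder names)
az group create --name tfstate-rg --location westeurope
az storage account create --name wandbtfstate --resource-group tfstate-rg --sku Standard_LRS
az storage container create --name tfstate --account-name wandbtfstate
```

Point a `backend "azurerm"` block at these resources so that Terraform stores its state remotely.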
- -The Terraform Module will deploy the following `mandatory` components: - -- Azure Resource Group -- Azure Virtual Network (VPC) -- Azure MySQL Fliexible Server -- Azure Storage Account & Blob Storage -- Azure Kubernetes Service -- Azure Application Gateway - -Other deployment options can also include the following optional components: - -- Azure Cache for Redis -- Azure Event Grid - -## **Pre-requisite permissions** - -The simplest way to get the AzureRM provider configured is via [Azure CLI](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/azure_cli) but the incase of automation using [Azure Service Principal](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/service_principal_client_secret) can also be useful. -Regardless the authentication method used, the account that will run the Terraform needs to be able to create all components described in the Introduction. - -## General steps -The steps on this topic are common for any deployment option covered by this documentation. - -1. Prepare the development environment. - * Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) - * We recommend creating a Git repository with the code that will be used, but you can keep your files locally. - -2. **Create the `terraform.tfvars` file** The `tvfars` file content can be customized according to the installation type, but the minimum recommended will look like the example below. - - ```bash - namespace = "wandb" - wandb_license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz" - subdomain = "wandb-aws" - domain_name = "wandb.ml" - location = "westeurope" - ``` - - The variables defined here need to be decided before the deployment because. The `namespace` variable will be a string that will prefix all resources created by Terraform. - - The combination of `subdomain` and `domain` will form the FQDN that W&B will be configured. In the example above, the W&B FQDN will be `wandb-aws.wandb.ml` and the DNS `zone_id` where the FQDN record will be created. - -3. **Create the file `versions.tf`** - - This file will contain the Terraform and Terraform provider versions required to deploy W&B in AWS - ```bash - terraform { - required_version = "~> 1.3" - - required_providers { - azurerm = { - source = "hashicorp/azurerm" - version = "~> 3.17" - } - } - } - ``` - - Refer to the [Terraform Official Documentation](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#provider-configuration) to configure the AWS provider. - - Optionally, **but highly recommended**, you can add the [remote backend configuration](https://developer.hashicorp.com/terraform/language/backend) mentioned at the beginning of this documentation. - -4. **Create the file** `variables.tf`. For every option configured in the `terraform.tfvars` Terraform requires a correspondent variable declaration. - - ```bash - variable "namespace" { - type = string - description = "String used for prefix resources." - } - - variable "location" { - type = string - description = "Azure Resource Group location" - } - - variable "domain_name" { - type = string - description = "Domain for accessing the Weights & Biases UI." - } - - variable "subdomain" { - type = string - default = null - description = "Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route." 
- } - - variable "license" { - type = string - description = "Your wandb/local license" - } - ``` - -## Recommended deployment - -This is the most straightforward deployment option configuration that will create all `Mandatory` components and install in the `Kubernetes Cluster` the latest version of `W&B`. - -1. **Create the `main.tf`** - - In the same directory where you created the files in the `General Steps`, create a file `main.tf` with the following content: - - ```bash - provider "azurerm" { - features {} - } - - provider "kubernetes" { - host = module.wandb.cluster_host - cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate) - client_key = base64decode(module.wandb.cluster_client_key) - client_certificate = base64decode(module.wandb.cluster_client_certificate) - } - - provider "helm" { - kubernetes { - host = module.wandb.cluster_host - cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate) - client_key = base64decode(module.wandb.cluster_client_key) - client_certificate = base64decode(module.wandb.cluster_client_certificate) - } - } - - # Spin up all required services - module "wandb" { - source = "wandb/wandb/azurerm" - version = "~> 1.2" - - namespace = var.namespace - location = var.location - license = var.license - domain_name = var.domain_name - subdomain = var.subdomain - - deletion_protection = false - - tags = { - "Example" : "PublicDns" - } - } - - output "address" { - value = module.wandb.address - } - - output "url" { - value = module.wandb.url - } - ``` - -2. **Deploy to W&B** - - To deploy W&B, execute the following commands: - - ``` - terraform init - terraform apply -var-file=terraform.tfvars - ``` - -## Deployment with REDIS Cache - -Another deployment option uses `Redis` to cache the SQL queries and speed up the application response when loading the metrics for the experiments. - -You must add the option `create_redis = true` to the same `main.tf` file that you used in [recommended deployment](#recommended-deployment) to enable the cache. - -```bash -# Spin up all required services -module "wandb" { - source = "wandb/wandb/azurerm" - version = "~> 1.2" - - - namespace = var.namespace - location = var.location - license = var.license - domain_name = var.domain_name - subdomain = var.subdomain - - create_redis = true # Create Redis - [...] -``` - -## Deployment with External Queue - -Deployment option 3 consists of enabling the external `message broker`. This is optional because the W&B brings embedded a broker. This option doesn't bring a performance improvement. - -The Azure resource that provides the message broker is the `Azure Event Grid`, and to enable it, you must add the option `use_internal_queue = false` to the same `main.tf` that you used in the [recommended deployment](#recommended-deployment) -```bash -# Spin up all required services -module "wandb" { - source = "wandb/wandb/azurerm" - version = "~> 1.2" - - - namespace = var.namespace - location = var.location - license = var.license - domain_name = var.domain_name - subdomain = var.subdomain - - use_internal_queue = false # Enable Azure Event Grid - [...] -} -``` - -## Other deployment options - -You can combine all three deployment options adding all configurations to the same file. 
-The [Terraform Module](https://github.com/wandb/terraform-azure-wandb) provides several options that you can combine along with the standard options and the minimal configuration found in [recommended deployment](#recommended-deployment) \ No newline at end of file diff --git a/platform/hosting/self-managed/bare-metal.mdx b/platform/hosting/self-managed/bare-metal.mdx deleted file mode 100644 index 72c46370a6..0000000000 --- a/platform/hosting/self-managed/bare-metal.mdx +++ /dev/null @@ -1,169 +0,0 @@ ---- -description: Hosting W&B Server on on-premises infrastructure -title: Deploy W&B Platform On-premises ---- - -import SelfManagedVersionRequirements from "/snippets/en/_includes/self-managed-version-requirements.mdx"; -import SelfManagedNetworkingRequirements from "/snippets/en/_includes/self-managed-networking-requirements.mdx"; -import SelfManagedSslTlsRequirements from "/snippets/en/_includes/self-managed-ssl-tls-requirements.mdx"; -import SelfManagedMysqlRequirements from "/snippets/en/_includes/self-managed-mysql-requirements.mdx"; -import SelfManagedMysqlDatabaseCreation from "/snippets/en/_includes/self-managed-mysql-database-creation.mdx"; -import SelfManagedRedisRequirements from "/snippets/en/_includes/self-managed-redis-requirements.mdx"; -import SelfManagedObjectStorageRequirements from "/snippets/en/_includes/self-managed-object-storage-requirements.mdx"; -import SelfManagedHardwareRequirements from "/snippets/en/_includes/self-managed-hardware-requirements.mdx"; -import SelfManagedVerifyInstallation from "/snippets/en/_includes/self-managed-verify-installation.mdx"; - - -W&B recommends fully managed deployment options such as [W&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. - - - -Reach out to the W&B Sales Team for related question: [contact@wandb.com](mailto:contact@wandb.com). - -## Infrastructure guidelines - -Before you start deploying W&B, refer to the [reference architecture](/platform/hosting/self-managed/ref-arch/#infrastructure-requirements) for complete infrastructure requirements, including hardware sizing recommendations. - -### Version Requirements - - -### Hardware Requirements - - - -## MySQL database - - - -For MySQL version requirements, see the [Version Requirements](#version-requirements) section above. - -For MySQL configuration parameters for self-managed instances, see the [reference architecture MySQL configuration section](/platform/hosting/self-managed/ref-arch/#mysql-configuration-parameters). - -### Database creation - - - - -For SSL/TLS certificate requirements, see the [SSL/TLS section](#ssl-tls) below. - - -## Redis - - - -## Object storage - - - -### Self-hosted object storage setup - -The object store can be externally hosted on any Amazon S3 compatible object store that has support for signed URLs. Run the [following script](https://gist.github.com/vanpelt/2e018f7313dabf7cca15ad66c2dd9c5b) to check if your object store supports signed URLs. See the [MinIO setup section](#minio-setup) below for important information about MinIO Open Source status. - -Additionally, the following CORS policy needs to be applied to the object store. 
- -``` xml - - - - http://YOUR-W&B-SERVER-IP - GET - PUT - HEAD - * - - -``` - -Configure object storage through the [System Console](/platform/hosting/iam/sso#system-console) or directly in the W&B Custom Resource (CR) specification. - -#### AWS S3 configuration - -For AWS S3 buckets, configure the following in your W&B CR: - -```yaml -bucket: - kmsKey: # Optional KMS key for encryption - name: # Example: wandb - path: "" # Keep as empty string - provider: s3 - region: # Example: us-east-1 -``` - -TLS is enabled by default for AWS S3 connections. - -#### S3-compatible storage configuration - -For S3-compatible storage (such as MinIO), use the following configuration: - -```yaml -bucket: - kmsKey: null - name: # Example: s3.example.com:9000 - path: # Example: wandb - provider: s3 - region: # Example: us-east-1 -``` - -To use TLS for S3-compatible storage, append `?tls=true` to the bucket path: - -```yaml -bucket: - kmsKey: null - name: "s3.example.com:9000" - path: "wandb?tls=true" - provider: "s3" - region: "us-east-1" -``` - - -For certificate requirements, see the [SSL/TLS section](#ssl-tls) section below. The certificate must be trusted. Self-signed certificates require additional configuration. - - -The most important things to consider when running your own object store are: - -1. **Storage capacity and performance**. It's fine to use magnetic disks, but you should be monitoring the capacity of these disks. Average W&B usage results in 10's to 100's of Gigabytes. Heavy usage could result in Petabytes of storage consumption. -2. **Fault tolerance.** At a minimum, the physical disk storing the objects should be on a RAID array. If you use S3-compatible storage, consider using a distributed or highly available configuration. -3. **Availability.** Monitoring should be configured to ensure the storage is available. - -There are many enterprise alternatives to running your own object storage service such as: - -1. [Amazon S3 on Outposts](https://aws.amazon.com/s3/outposts/) -2. [NetApp StorageGRID](https://www.netapp.com/data-storage/storagegrid/) - -### MinIO setup - - -MinIO Open Source is in [maintenance mode](https://github.com/minio/minio) with no active development. Pre-compiled binaries are no longer provided, and only critical security fixes are considered case-by-case. For production deployments, W&B recommends using managed object storage services or [MinIO Enterprise (AIStor)](https://min.io/product/aistor). - - -If you are using an existing MinIO deployment or MinIO Enterprise, you can create a bucket using the MinIO client: - -```bash -mc config host add local http://$MINIO_HOST:$MINIO_PORT "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" --api s3v4 -mc mb --region=us-east1 local/local-files -``` - -For new deployments, consider enterprise alternatives listed above or managed cloud object storage services. - -## Deploy W&B Server application to Kubernetes - -The recommended installation method is using the **W&B Kubernetes Operator**, deployed via Helm. - -For complete installation instructions, see [Run W&B Server on Kubernetes (Operator)](/platform/hosting/operator/), which covers: -- Helm CLI deployment -- Helm Terraform Module deployment -- W&B Cloud Terraform modules - -The sections below highlight considerations specific to on-premises/datacenter deployments. - -### OpenShift - -W&B supports deployment on OpenShift Kubernetes clusters in on-premises environments. 
See the [reference architecture](/platform/hosting/self-managed/ref-arch/#kubernetes) for more details and the [Operator guide OpenShift section](/platform/hosting/operator/#openshift-kubernetes-clusters) for specific configuration instructions that you can adapt for your on-premises OpenShift deployment. - -## Networking - -For networking requirements, load balancer options, and configuration examples (including nginx), see the [reference architecture networking sections](/platform/hosting/self-managed/ref-arch/#networking). - -## Verify your installation - - diff --git a/platform/hosting/self-managed/gcp-tf.mdx b/platform/hosting/self-managed/gcp-tf.mdx deleted file mode 100644 index c9262f72c1..0000000000 --- a/platform/hosting/self-managed/gcp-tf.mdx +++ /dev/null @@ -1,312 +0,0 @@ ---- -description: Hosting W&B Server on Google Cloud. -title: Deploy W&B Platform on Google Cloud ---- - - -W&B recommends fully managed deployment options such as [W&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [W&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud/) deployment types. W&B fully managed services are simple and secure to use, with minimum to no configuration required. - - -If you've determined to Self-Managed W&B Server, W&B recommends using the [W&B Server Google Cloud Terraform Module](https://registry.terraform.io/modules/wandb/wandb/google/latest) to deploy the platform on Google Cloud. - -The module documentation is extensive and contains all available options that can be used. - -Before you start, W&B recommends that you choose one of the [remote backends](https://developer.hashicorp.com/terraform/language/backend/remote) available for Terraform to store the [State File](https://developer.hashicorp.com/terraform/language/state). - -The State File is the necessary resource to roll out upgrades or make changes in your deployment without recreating all components. - -The Terraform Module will deploy the following `mandatory` components: - -- VPC -- Cloud SQL for MySQL -- Cloud Storage Bucket -- Google Kubernetes Engine -- Memorystore for Redis -- KMS Crypto Key -- Load Balancer - -Other deployment options can also include the following optional components: - -- Memory store for Redis -- Pub/Sub messages system - -## Prerequisite permissions - -The account that will run the terraform need to have the role `roles/owner` in the Google Cloud project used. - -## General steps - -The steps on this topic are common for any deployment option covered by this documentation. - -1. Prepare the development environment. - - Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli). - - We recommend creating a Git repository with the code that will be used, but you can keep your files locally. - - Create a project in [Google Cloud Console](https://console.cloud.google.com/). - - Authenticate with Google Cloud (make sure to [install gcloud](https://cloud.google.com/sdk/docs/install) before) using `gcloud auth application-default login`. - -2. Create the `terraform.tfvars` file. - - The `tvfars` file content can be customized according to the installation type, but the minimum recommended will look like the example below. 
- - ```bash - project_id = "wandb-project" - region = "europe-west2" - zone = "europe-west2-a" - namespace = "wandb" - license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz" - subdomain = "wandb-gcp" - domain_name = "wandb.ml" - ``` - - The variables defined here need to be decided before the deployment because. The `namespace` variable will be a string that will prefix all resources created by Terraform. - - The combination of `subdomain` and `domain` will form the FQDN that W&B will be configured. In the example above, the W&B FQDN will be `wandb-gcp.wandb.ml`. - -3. Create the file `variables.tf`. - - For every option configured in the `terraform.tfvars` Terraform requires a correspondent variable declaration. - - ```hcl - variable "project_id" { - type = string - description = "Project ID" - } - - variable "region" { - type = string - description = "Google region" - } - - variable "zone" { - type = string - description = "Google zone" - } - - variable "namespace" { - type = string - description = "Namespace prefix used for resources" - } - - variable "domain_name" { - type = string - description = "Domain name for accessing the Weights & Biases UI." - } - - variable "subdomain" { - type = string - description = "Subdomain for access the Weights & Biases UI." - } - - variable "license" { - type = string - description = "W&B License" - } - ``` - -## Deployment - Recommended (~20 mins) - -This is the most straightforward deployment option configuration that will create all `Mandatory` components and install in the `Kubernetes Cluster` the latest version of `W&B`. - -1. Create the `main.tf` file. - - In the same directory where you created the files in the [General Steps](#general-steps), create a file `main.tf` with the following content: - - ```hcl - provider "google" { - project = var.project_id - region = var.region - zone = var.zone - } - - provider "google-beta" { - project = var.project_id - region = var.region - zone = var.zone - } - - data "google_client_config" "current" {} - - provider "kubernetes" { - host = "https://${module.wandb.cluster_endpoint}" - cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate) - token = data.google_client_config.current.access_token - } - - # Spin up all required services - module "wandb" { - source = "wandb/wandb/google" - version = "~> 10.0" - - namespace = var.namespace - license = var.license - domain_name = var.domain_name - subdomain = var.subdomain - } - - # You'll want to update your DNS with the provisioned IP address - output "url" { - value = module.wandb.url - } - - output "address" { - value = module.wandb.address - } - - output "bucket_name" { - value = module.wandb.bucket_name - } - ``` - -2. Deploy W&B. - - To deploy W&B, execute the following commands: - - ```bash - terraform init - terraform apply -var-file=terraform.tfvars - ``` - -## Other deployment options - -You can combine all three deployment options adding all configurations to the same file. -The [Terraform Module](https://github.com/wandb/terraform-google-wandb) provides several options that can be combined along with the standard options and the minimal configuration found in `Deployment - Recommended`. 
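Before applying a combined configuration, it can help to validate it and review the plan. A minimal sketch, run from the directory that contains the files created in the steps above:

```bash
# Check formatting and syntax, then preview the combined configuration
terraform fmt -check
terraform validate
terraform plan -var-file=terraform.tfvars -out=tfplan

# Apply the reviewed plan
terraform apply tfplan
```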
- -{/* ## Upgrades (coming soon) */} - -## Manual configuration - -To use a Google Cloud Storage bucket as a file storage backend for W&B, you will need to create a: - -- [PubSub Topic and Subscription](#create-pubsub-topic-and-subscription) -- [Storage Bucket](#create-storage-bucket) -- [PubSub Notification](#create-pubsub-notification) - -### Create PubSub Topic and Subscription - -Follow the procedure below to create a PubSub topic and subscription: - -1. Navigate to the Pub/Sub service within the Google Cloud Console. -2. Select **Create Topic** and provide a name for your topic. -3. At the bottom of the page, select **Create subscription**. Ensure **Delivery Type** is set to **Pull**. -4. Click **Create**. - -Make sure the service account or account that your instance is running has the `pubsub.admin` role on this subscription. For details, see the [Google Cloud Pub/Sub access control documentation](https://cloud.google.com/pubsub/docs/access-control#console). - -### Create Storage Bucket - -1. Navigate to the **Cloud Storage Buckets** page. - -2. Select **Create bucket** and provide a name for your bucket. Ensure you choose a **Standard** [storage class](https://cloud.google.com/storage/docs/storage-classes). - - Ensure that the service account or account that your instance is running has both: - - - access to the bucket you created in the previous step. - - `storage.objectAdmin` role on this bucket. For details, see the [Google Cloud Storage IAM permissions documentation](https://cloud.google.com/storage/docs/access-control/using-iam-permissions#bucket-add). - - - Your instance also needs the `iam.serviceAccounts.signBlob` permission in Google Cloud to create signed file URLs. Add `Service Account Token Creator` role to the service account or IAM member that your instance is running as to enable permission. - - -3. Enable CORS access. This can only be done using the command line. First, create a JSON file with the following CORS configuration. - - ```yaml - cors: - - maxAgeSeconds: 3600 - method: - - GET - - PUT - origin: - - '' - responseHeader: - - Content-Type - ``` - - Note that the scheme, host, and port of the values for the origin must match exactly. - -4. Make sure you have `gcloud` installed, and logged into the correct Google Cloud project. - -5. Next, run the following: - - ```bash - gcloud storage buckets update gs:// --cors-file= - ``` - -### Create PubSub Notification - -Follow the procedure below in your command line to create a notification stream from the Storage Bucket to the Pub/Sub topic. - - -You must use the CLI to create a notification stream. Ensure you have `gcloud` installed. - - -1. Log into your Google Cloud project. - -2. Run the following in your terminal: - - ```bash - gcloud pubsub topics list # list names of topics for reference - gcloud storage ls # list names of buckets for reference - - # create bucket notification - gcloud storage buckets notifications create gs:// --topic= - ``` - -[Further reference is available on the Cloud Storage website.](https://cloud.google.com/storage/docs/reporting-changes) - -### Configure W&B server - -1. Finally, navigate to the W&B `System Connections` page at `http(s)://YOUR-W&B-SERVER-HOST/console/settings/system`. - -2. Select the provider `Google Cloud Storage (gcs)`. - -3. Provide the name of the GCS bucket. - - Google Cloud file storage configuration - -4. Press **Update settings** to apply the new settings. - -## Upgrade W&B Server - -The Operator upgrades W&B automatically by the W&B Operator. 
To turn this off, you can update the user spec to override the images from the system console. - -To pin the images to a specific version: - -1. Access the ActiveSpec in the system console `https:///console/settings/advanced/spec/active`. - -2. Copy the components image configuration, which looks similar to: - - ```yaml - api: - image: - tag: 0.75.2 - initContainers: - init-db: - image: - tag: 0.75.2 - ``` - - W&B System Console - ActiveSpec - -3. Paste the component configuration replacing the image tag in the `UserSpec`. - - ```yaml - chart: {} - values: - api: - image: - tag: 0.76.3 - initContainers: - init-db: - image: - tag: 0.76.3 - ``` - - W&B System Console - UserSpec - -4. Click Save. - -5. Access the tab `Operator` and click `Trigger reapply`. - - W&B System Console - Operator - -The W&B Operator reconciles the configuration and pins the image to the specified version. diff --git a/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx b/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx index 652f525827..2c4e504dfc 100644 --- a/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx +++ b/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped.mdx @@ -701,7 +701,6 @@ spec: ``` - Contact [W&B Support](mailto:support@wandb.com) or your assigned W&B support engineer for comprehensive OpenShift configuration examples tailored to your security requirements. diff --git a/platform/hosting/self-managed/operator-airgapped.mdx b/platform/hosting/self-managed/operator-airgapped.mdx deleted file mode 100644 index b3f310a26b..0000000000 --- a/platform/hosting/self-managed/operator-airgapped.mdx +++ /dev/null @@ -1,306 +0,0 @@ ---- -description: Deploy W&B Platform with Kubernetes Operator (Airgapped) -title: Kubernetes operator for air-gapped instances ---- - -import SelfManagedVersionRequirements from "/snippets/en/_includes/self-managed-version-requirements.mdx"; -import SelfManagedSslTlsRequirements from "/snippets/en/_includes/self-managed-ssl-tls-requirements.mdx"; -import SelfManagedMysqlRequirements from "/snippets/en/_includes/self-managed-mysql-requirements.mdx"; -import SelfManagedMysqlDatabaseCreation from "/snippets/en/_includes/self-managed-mysql-database-creation.mdx"; -import SelfManagedRedisRequirements from "/snippets/en/_includes/self-managed-redis-requirements.mdx"; -import SelfManagedObjectStorageRequirements from "/snippets/en/_includes/self-managed-object-storage-requirements.mdx"; -import SelfManagedHardwareRequirements from "/snippets/en/_includes/self-managed-hardware-requirements.mdx"; -import SelfManagedVerifyInstallation from "/snippets/en/_includes/self-managed-verify-installation.mdx"; - -## Introduction - -This guide provides step-by-step instructions to deploy the W&B Platform in air-gapped customer-managed environments. - -Use an internal repository or registry to host the Helm charts and container images. Run all commands in a shell console with proper access to the Kubernetes cluster. - -You could utilize similar commands in any continuous delivery tooling that you use to deploy Kubernetes applications. 
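Before starting, it can be useful to confirm that the shell session has the required tooling and cluster access. The following is a minimal sketch; it assumes `kubectl` and `helm` are installed and uses `wandb` as an example namespace:

```bash
# Confirm client tooling is available
kubectl version --client
helm version

# Confirm access to the target cluster and namespace (namespace name is an example)
kubectl cluster-info
kubectl get namespace wandb || kubectl create namespace wandb
kubectl auth can-i create deployments --namespace wandb
```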
- -## Step 1: Prerequisites - -Before starting, make sure your environment meets the following requirements: - -### Version Requirements - - -### SSL/TLS Requirements - - -### Hardware Requirements - - - -### MySQL Database - - - -### Redis - - - -### Object storage - - - -### Additional Requirements -- Access to an internal container registry with the required W&B images -- Access to an internal Helm repository for W&B Helm charts - -For complete infrastructure requirements, including networking and load balancer configuration, see the [reference architecture](/platform/hosting/self-managed/ref-arch/#infrastructure-requirements). - - -W&B can be deployed on air-gapped OpenShift Kubernetes clusters. See the [reference architecture](/platform/hosting/self-managed/ref-arch/#kubernetes) for details, and review the [OpenShift section](/platform/hosting/operator/#openshift-kubernetes-clusters) of the Operator guide for specific configuration instructions that you can adapt for your air-gapped OpenShift deployment. - - -## Step 2: Prepare internal container registry - -For a successful air-gapped deployment, the following container images must be available in your air-gapped container registry. - -You are responsible for tracking the W&B Operator's requirements and maintaining your container registry with updated images regularly. For the most current list of required container images and versions, refer to the Helm chart, or contact [support](mailto:support@wandb.com) or your AISE. - -### Core W&B component containers -* [`docker.io/wandb/controller`](https://hub.docker.com/r/wandb/controller) -* [`docker.io/wandb/local`](https://hub.docker.com/r/wandb/local) -* [`docker.io/wandb/console`](https://hub.docker.com/r/wandb/console) -* [`docker.io/wandb/megabinary`](https://hub.docker.com/r/wandb/megabinary) - -### Dependencies - -* [`docker.io/bitnamilegacy/redis`](https://hub.docker.com/r/bitnamilegacy/redis): Required for local Redis deployment during testing and development. To use the local Redis deployment, ensure that this image is available in your container registry. For production Redis requirements, see the [Redis section](#redis) in Prerequisites. -* [`docker.io/otel/opentelemetry-collector-contrib`](https://hub.docker.com/r/otel/opentelemetry-collector-contrib): W&B depends on the OpenTelemetry agent to collect metrics and logs from resources at the Kubernetes layer for display in W&B. -* [`quay.io/prometheus/prometheus`](https://quay.io/repository/prometheus/prometheus): W&B depends on Prometheus to capture metrics from various components for display in W&B. -* [`quay.io/prometheus-operator/prometheus-config-reloader`](https://quay.io/repository/prometheus-operator/prometheus-config-reloader): A required dependency of Prometheus. - - -### Get required images - -To extract the complete list of required images and versions from the Helm chart values: - -1. Download the W&B Operator and Platform Helm charts from the [W&B Helm charts repository](https://github.com/wandb/helm-charts). - -2. Inspect the `values.yaml` files to identify all container images and their versions: - - ```bash - # Extract image references from the Helm chart - helm show values wandb/operator \ - | awk -F': *' '/^[[:space:]]*repository:/{print $2}' \ - | grep -E '^wandb/' \ - | sort -u - ``` - - The list might look similar to the following. Image versions may vary. 
- - ```text - wandb/anaconda2 - wandb/console - wandb/frontend-nginx - wandb/local - wandb/megabinary - wandb/weave-python - wandb/weave-trace - ``` - - -## Step 3: Prepare internal Helm chart repository - -Along with the container images, you also must ensure that the following Helm charts are available in your internal Helm Chart repository. Download them from: - -- [W&B Operator](https://github.com/wandb/helm-charts/tree/main/charts/operator) -- [W&B Platform](https://github.com/wandb/helm-charts/tree/main/charts/operator-wandb) - - -The `operator` chart is used to deploy the W&B Operator, which is also referred to as the Controller Manager. The `platform` chart is used to deploy the W&B Platform using the values configured in the custom resource definition (CRD). - -## Step 4: Set up Helm repository - -Now, configure the Helm repository to pull the W&B Helm charts from your internal repository. Run the following commands to add and update the Helm repository: - -```bash -helm repo add local-repo https://charts.yourdomain.com -helm repo update -``` - -## Step 5: Install the Kubernetes operator - -The W&B Kubernetes operator, also known as the controller manager, is responsible for managing the W&B platform components. To install it in an air-gapped environment, -you must configure it to use your internal container registry. - -To do so, you must override the default image settings to use your internal container registry and set the key `airgapped: true` to indicate the expected deployment type. Update the `values.yaml` file as shown below: - -```yaml -image: - repository: registry.yourdomain.com/library/controller - tag: 1.13.3 -airgapped: true -``` - -Replace the tag with the version that is available in your internal registry. - -Install the operator and the CRD: -```bash -helm upgrade --install operator local-repo/operator -n wandb --create-namespace -f values.yaml -``` - -For full details about the supported values, refer to the [Kubernetes operator GitHub repository](https://github.com/wandb/helm-charts/blob/main/charts/operator/values.yaml). - -## Step 6: Set up MySQL database - -Before configuring the W&B Custom Resource, you must set up an external MySQL database. For production deployments, W&B strongly recommends using managed database services. However, if you are running your own MySQL instance, create the database and user: - - - -For MySQL configuration parameters, see the [reference architecture MySQL configuration section](/platform/hosting/self-managed/ref-arch/#mysql-configuration-parameters). - -## Step 7: Configure W&B Custom Resource - -After installing the W&B Kubernetes operator, you must configure the Custom Resource (CR) to point to your internal Helm repository and container registry. - -This configuration ensures that the Kubernetes operators uses your internal registry and repository are when it deploys the required components of the W&B platform. - -Copy this example CR to a new file named `wandb.yaml`. 
- -```yaml -apiVersion: apps.wandb.com/v1 -kind: WeightsAndBiases -metadata: - labels: - app.kubernetes.io/instance: wandb - app.kubernetes.io/name: weightsandbiases - name: wandb - namespace: default - -spec: - chart: - url: http://charts.yourdomain.com - name: operator-wandb - version: 0.18.0 - - values: - global: - host: https://wandb.yourdomain.com - license: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx - bucket: - accessKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx - secretKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx - name: s3.yourdomain.com:port #Ex.: s3.yourdomain.com:9000 - path: bucket_name - provider: s3 - region: us-east-1 - mysql: - database: wandb - host: mysql.home.lab - password: password - port: 3306 - user: wandb - redis: - host: redis.yourdomain.com - port: 6379 - password: password - api: - enabled: true - glue: - enabled: true - executor: - enabled: true - extraEnv: - ENABLE_REGISTRY_UI: 'true' - - app: - image: - repository: registry.yourdomain.com/local - tag: 0.59.2 - - console: - image: - repository: registry.yourdomain.com/console - tag: 2.12.2 - - api: - image: - repository: registry.yourdomain.com/megabinary - tag: 0.59.2 - - executor: - image: - repository: registry.yourdomain.com/megabinary - tag: 0.59.2 - - glue: - image: - repository: registry.yourdomain.com/megabinary - tag: 0.59.2 - - parquet: - image: - repository: registry.yourdomain.com/megabinary - tag: 0.59.2 - - weave: - image: - repository: registry.yourdomain.com/weave-python - tag: 0.59.2 - - otel: - image: - repository: registry.yourdomain.com/otel/opentelemetry-collector-contrib - tag: 0.97.0 - - prometheus: - server: - image: - repository: registry.yourdomain.com/prometheus/prometheus - tag: v2.47.0 - configmapReload: - prometheus: - image: - repository: registry.yourdomain.com/prometheus-operator/prometheus-config-reloader - tag: v0.67.0 - - ingress: - annotations: - nginx.ingress.kubernetes.io/proxy-body-size: 0 - class: nginx - - -``` - -To deploy the W&B platform, the Kubernetes Operator uses the values from your CR to configure the `operator-wandb` Helm chart from your internal repository. - -Replace all tags and versions with the versions available in your internal registry. The example above shows the most commonly used components. Depending on your deployment needs, you may also need to configure image repositories for additional components such as `settingsMigrationJob`, `weave-trace`, `filestream`, and others. Refer to the [W&B Helm repository values file](https://github.com/wandb/helm-charts/blob/main/charts/operator-wandb/values.yaml) for the complete list of configurable components. - - -## Step 8: Deploy the W&B platform - -Now that the Kubernetes operator and the CR are configured, apply the `wandb.yaml` configuration to deploy the W&B platform: - -```bash -kubectl apply -f wandb.yaml -``` - -## Step 9: Verify your installation - - - -## FAQ - -Refer to the below frequently asked questions (FAQs) and troubleshooting tips during the deployment process: - -### There is another ingress class. Can that class be used? -Yes, you can configure your ingress class by modifying the ingress settings in `values.yaml`. - -### The certificate bundle has more than one certificate. Would that work? -You must split the certificates into multiple entries in the `customCACerts` section of `values.yaml`. - -### How do you prevent the Kubernetes operator from applying unattended updates. Is that possible? -You can turn off auto-updates from the W&B console. 
Reach out to your W&B team for any questions on the supported versions. W&B supports a major W&B Server release for 12 months from its initial release date. Customers with **Self-Managed** instances are responsible for upgrading in time to maintain support. Avoid staying on an unsupported version. Refer to [Release policies and processes](/release-notes/release-policies). - - -W&B strongly recommends customers with **Self-Managed** instances to update their deployments with the latest release at minimum once per quarter to maintain support and receive the latest features, performance improvements, and fixes. - - -### Does the deployment work if the environment has no connection to public repositories? -If your configuration sets `airgapped` to `true`, the Kubernetes operator uses only your internal resources and does not attempt to connect to public repositories. From b9840935524dcdaa0ae2005d70d013071c31eea0 Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Tue, 3 Feb 2026 15:35:53 -0800 Subject: [PATCH 7/7] Fix spurious redirects --- lychee.toml | 3 ++- .../secure-storage-connector.mdx | 20 +++++++++---------- .../cloud-deployments/terraform.mdx | 4 ++-- .../on-premises-deployments/kubernetes.mdx | 2 +- 4 files changed, 15 insertions(+), 14 deletions(-) diff --git a/lychee.toml b/lychee.toml index 348e2285da..e5299e5a0c 100644 --- a/lychee.toml +++ b/lychee.toml @@ -20,8 +20,9 @@ retry_wait_time = 2 # Accept these HTTP status codes as valid # 200 = OK +# 403 = Forbidden (can be a false positive - treat as success) # 429 = Too Many Requests (rate limit - treat as success) -accept = [200, 429] +accept = [200, 403, 429] # Only check HTTP/HTTPS URLs scheme = [ diff --git a/platform/hosting/data-security/secure-storage-connector.mdx b/platform/hosting/data-security/secure-storage-connector.mdx index a9cb5b3558..fec8709966 100644 --- a/platform/hosting/data-security/secure-storage-connector.mdx +++ b/platform/hosting/data-security/secure-storage-connector.mdx @@ -52,7 +52,7 @@ W&B can connect to the following storage providers: - [CoreWeave AI Object Storage](https://docs.coreweave.com/docs/products/storage/object-storage): High-performance, S3-compatible object storage service optimized for AI workloads. - [Amazon S3](https://aws.amazon.com/s3/): Object storage service offering industry-leading scalability, data availability, security, and performance. - [Google Cloud Storage](https://cloud.google.com/storage): Managed service for storing unstructured data at scale. -- [Azure Blob Storage](https://azure.microsoft.com/products/storage/blobs): Cloud-based object storage solution for storing massive amounts of unstructured data like text, binary data, images, videos, and logs. +- [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs): Cloud-based object storage solution for storing massive amounts of unstructured data like text, binary data, images, videos, and logs. - S3-compatible storage such as [MinIO Enterprise (AIStor)](https://min.io/product/aistor) or other enterprise-grade solutions hosted in your cloud or on-premises infrastructure. The following table shows the availability of BYOB at each scope for each W&B deployment type. @@ -80,10 +80,10 @@ After [verifying availability](#availability-matrix), you are ready to provision - A CoreWeave account with AI Object Storage enabled and with permission to create buckets, API access keys, and secret keys. - Your W&B instance must be able to connect to CoreWeave network endpoints. 
-For details, see [Create a CoreWeave AI Object Storage bucket](https://docs.coreweave.com/docs/products/storage/object-storage/how-to/create-bucket) in the CoreWeave documentation. +For details, see [Create a CoreWeave AI Object Storage bucket](https://docs.coreweave.com/docs/products/storage/object-storage/buckets/create-bucket) in the CoreWeave documentation. 1. **Multi-tenant Cloud**: Obtain your organization ID, which is required for your bucket policy. - 1. Log in to the [W&B App](https://wandb.ai/). + 1. Log in to the [W&B App](https://wandb.ai/site). 1. In the left navigation, click **Create a new team**. 1. In the drawer that opens, copy the W&B organization ID, which is located above **Invite team members**. 1. Leave this page open. You will use it to [configure W&B](#configure-byob). @@ -117,7 +117,7 @@ For details, see [Create a CoreWeave AI Object Storage bucket](https://docs.core ``` CoreWeave storage is S3-compatible. For details about CORS, refer to [Configuring cross-origin resource sharing (CORS)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enabling-cors-examples.html) in the AWS documentation. -1. Configure a bucket policy that grants the required permissions for your W&B deployment to access the bucket and generate [pre-signed URLs](./presigned-urls) that AI workloads in your cloud infrastructure or user browsers utilize to access the bucket. Refer to [Bucket Policy Reference](https://docs.coreweave.com/docs/products/storage/object-storage/reference/bucket-policy) in the CoreWeave documentation. +1. Configure a bucket policy that grants the required permissions for your W&B deployment to access the bucket and generate [pre-signed URLs](./presigned-urls) that AI workloads in your cloud infrastructure or user browsers utilize to access the bucket. Refer to [Bucket Policy Reference](https://docs.coreweave.com/docs/products/storage/object-storage/auth-access/bucket-access/bucket-policies) in the CoreWeave documentation. ```json { @@ -219,10 +219,10 @@ For details, see [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/late Replace `` and `` accordingly. - If you are using [Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) or [Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud), replace `` with the corresponding value: + If you are using [Multi-tenant Cloud](/platform/hosting/hosting-options#w%26b-multi-tenant-cloud) or [Dedicated Cloud](/platform/hosting/hosting-options#w%26b-dedicated-cloud), replace `` with the corresponding value: - * For [Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud): `arn:aws:iam::725579432336:role/WandbIntegration` - * For [Dedicated Cloud](/platform/hosting/hosting-options/dedicated_cloud): `arn:aws:iam::830241207209:root` + * **Multi-tenant Cloud**: `arn:aws:iam::725579432336:role/WandbIntegration` + * **Dedicated Cloud**: `arn:aws:iam::830241207209:root` This policy grants your AWS account full access to the key and also assigns the required permissions to the AWS account hosting the W&B Platform. Keep a record of the KMS Key ARN. @@ -301,7 +301,7 @@ For details, see [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/late For more details, see the [AWS Self-Managed hosting guide](/platform/hosting/hosting-options/). -For details, see [Create a bucket](https://cloud.google.com/storage/docs/creating-buckets) in the Google Cloud documentation. 
+For details, see [Create a bucket](https://docs.cloud.google.com/storage/docs/creating-buckets) in the Google Cloud documentation. 1. Provision the GCS bucket. Follow these steps to provision the GCS bucket in your Google Cloud project: @@ -442,7 +442,7 @@ s3://:@/?region=&t In the address, the `region` parameter is mandatory. -This section is for S3-compatible storage buckets that are not hosted in S3, such as [MinIO Enterprise (AIStor)](https://min.io/product/aistor) or other enterprise-grade S3-compatible solutions hosted on your premises. For storage buckets hosted in AWS S3, see the **AWS** tab instead. +This section is for S3-compatible storage buckets that are not hosted in S3, such as [MinIO Enterprise (AIStor)](https://www.min.io/product/aistor) or other enterprise-grade S3-compatible solutions hosted on your premises. For storage buckets hosted in AWS S3, see the **AWS** tab instead. MinIO Open Source is in [maintenance mode](https://github.com/minio/minio) with no active development or pre-compiled binaries. For production deployments, use enterprise-grade S3-compatible solutions. @@ -569,4 +569,4 @@ This section helps troubleshoot problems connecting to CoreWeave AI Object Stora - Connecting to LOTA endpoints from W&B is not yet supported. To express interest, [contact support](mailto:support@wandb.com). - **Access key and permission errors** - Verify that your CoreWeave API Access Key is not expired. - - Verify that your CoreWeave API Access Key and Secret Key have sufficient permissions `GetObject`, `PutObject`, `DeleteObject`, `ListBucket`. The examples in this page meet this requirement. Refer to [Create and Manage Access Keys](https://docs.coreweave.com/docs/products/storage/object-storage/how-to/manage-access-keys) in the CoreWeave documentation. + - Verify that your CoreWeave API Access Key and Secret Key have sufficient permissions `GetObject`, `PutObject`, `DeleteObject`, `ListBucket`. The examples in this page meet this requirement. Refer to [Create and Manage Access Keys](https://docs.coreweave.com/docs/products/storage/object-storage/auth-access/manage-access-keys/about) in the CoreWeave documentation. diff --git a/platform/hosting/self-managed/cloud-deployments/terraform.mdx b/platform/hosting/self-managed/cloud-deployments/terraform.mdx index 766f33678d..deb1ccb88f 100644 --- a/platform/hosting/self-managed/cloud-deployments/terraform.mdx +++ b/platform/hosting/self-managed/cloud-deployments/terraform.mdx @@ -289,8 +289,8 @@ The steps in this section are common for any deployment option. 1. Prepare the development environment. - Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) - W&B recommends creating a Git repository with the code that will be used, but you can keep your files locally. - - Create a project in [Google Cloud Console](https://console.cloud.google.com/) - - Authenticate with Google Cloud (make sure to [install gcloud](https://cloud.google.com/sdk/docs/install) before): + - Create a project in [Google Cloud Console](https://console.cloud.google.com/welcome/new) + - Authenticate with Google Cloud (make sure to [install gcloud](https://docs.cloud.google.com/sdk/docs/install-sdk) before): `gcloud auth application-default login` 2. Create the `terraform.tfvars` file. 
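The project creation and authentication in step 1 can also be scripted. A minimal sketch with a placeholder project ID, assuming `gcloud` is installed:

```bash
# Create and select a Google Cloud project (placeholder ID), then authenticate for Terraform
gcloud projects create wandb-project --name="wandb"
gcloud config set project wandb-project
gcloud auth application-default login
```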
diff --git a/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx b/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx index cc47ce6c6e..3869bc4cee 100644 --- a/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx +++ b/platform/hosting/self-managed/on-premises-deployments/kubernetes.mdx @@ -160,7 +160,7 @@ Enterprise alternatives for on-premises object storage include: - [Amazon S3 on Outposts](https://aws.amazon.com/s3/outposts/) - [NetApp StorageGRID](https://www.netapp.com/data-storage/storagegrid/) - MinIO Enterprise (AIStor) -- [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) +- [Dell ObjectScale](https://www.dell.com/en-us/shop/cty/sf/objectscale) If you are using an existing MinIO deployment or MinIO Enterprise, you can create a bucket using the MinIO client:
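For reference, a minimal MinIO client invocation for this step might look like the following sketch, with a placeholder endpoint, credentials supplied through environment variables, and an example bucket name:

```bash
# Register the MinIO endpoint and create a bucket for W&B (placeholder endpoint and bucket)
mc config host add local http://minio.example.com:9000 "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" --api s3v4
mc mb --region=us-east1 local/wandb
```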