added gpu-process-exporter chart, changed logos #111
Merged
8 commits
de0cc56 added gpu-process-exporter chart, changed logos (tsipo)
91ad96b Merge branch 'master' into feat/gpu-process-exporter (tsipo)
a7992d7 PR review comments (tsipo)
a1175fe PR review comments (tsipo)
4cdf330 PR review comments (tsipo)
d181241 PR comments (tsipo)
f7852ec PR comments (tsipo)
49b7dd7 Merge branch 'master' into feat/gpu-process-exporter (tsipo)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| apiVersion: v2 | ||
| name: gpu-process-exporter | ||
| description: A Helm chart for gpu-process-exporter | ||
| type: application | ||
| icon: https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg | ||
| version: 1.0.0 | ||
| keywords: | ||
|   - kubex | ||
|   - gpu | ||
|   - process | ||
|   - exporter | ||
| maintainers: | ||
|   - email: support@kubex.ai | ||
|     name: support | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| # Kubex GPU Process Exporter Helm Chart | ||
|
|
||
| <picture> | ||
| <source media="(prefers-color-scheme: dark)" srcset="https://kubex.ai/wp-content/uploads/kubex-logo-reverse-landscape.svg"> | ||
| <source media="(prefers-color-scheme: light)" srcset="https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg"> | ||
| <img src="https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg" width="300"> | ||
| </picture> | ||
|
|
||
| ## Purpose | ||
|
|
||
| This chart deploys the Kubex GPU process exporter. This exporter addresses the limitations of Nvidia's [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) in providing container-level metrics. | ||
|
|
||
| ## Motivation | ||
|
|
||
| The DCGM exporter collects metrics per GPU device, but falls short of associating the utilization metrics with the specific container that actually uses the GPU. To make this association, the DCGM exporter relies on the [Nvidia device plugin](https://github.com/NVIDIA/k8s-device-plugin). This association has the following issues: | ||
|
|
||
| * A basic assumption of the DCGM exporter is that ALL metrics of the device can (and should) be mapped to a **single** container using it. This assumption breaks when the GPU is shared by multiple containers; it is also the wrong approach for some (non-utilization) metrics, which should not be mapped to containers at all. | ||
| * The DCGM exporter cannot deal with "soft" (software-based) GPU sharing techniques, such as time-slicing or MPS. For each datapoint, the exporter randomly reports one of the containers that are using the GPU simultaneously and attributes all of the utilization to that container. | ||
| * The DCGM exporter also cannot deal with the [KAI scheduler](https://github.com/kai-scheduler/KAI-Scheduler), which creates "reservation containers" to reserve the GPU and schedules the actual workloads to use it. | ||
|
|
||
| The Kubex GPU process exporter addresses these limitations. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| * A k8s cluster with at least one Nvidia GPU | ||
| * All nodes with Nvidia GPUs must be labeled `nvidia.com/gpu.present=true` (typically done by the [Nvidia GPU Operator](https://github.com/nvidia/gpu-operator)) | ||
|
|
||
| ## Details | ||
|
|
||
| Deploys a DaemonSet with the following requirements: | ||
|
|
||
| * RBAC: `Pods - get, list, watch` | ||
| * access to `hostPID` | ||
| * security context: `privileged` container (runs as root) | ||
| * read-only access to the node's `/` filesystem | ||
| * read-only access to the node's `/proc` filesystem | ||
|
|
||
| ## Configuration | ||
|
|
||
| The following table lists configuration parameters in values.yaml and their default values. | ||
|
|
||
| | Parameter | Mandatory | Description | Default | | ||
| | --- | --- | --- | --- | | ||
| | `image.repository` | | Exporter image repository. | `densify/gpu-process-exporter` | | ||
| | `image.tag` | :white_check_mark: | Exporter image tag. | | | ||
| | `image.pullPolicy` | | Exporter image pull policy. | `Always` | | ||
| | `serviceAccount.create` | | Create a service account for the exporter. | | | ||
| | `serviceAccount.name` | Required when `serviceAccount.create` is `false`. | Service account name to use. | `gpu-process-exporter` | | ||
| | `rbac.create` | | Create RBAC resources for Pod read access. | | | ||
| | `rbac.clusterRoleName` | | Name of the ClusterRole to create or bind. | `gpu-exporter-role` | | ||
| | `rbac.clusterRoleBindingName` | | Name of the ClusterRoleBinding to create. | `gpu-exporter-binding` | | ||
| | `prometheusScrape.annotate` | | Add Prometheus scrape annotations to the Service (typically used by [prometheus-community/prometheus helm chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus)). | | | ||
| | `prometheusScrape.interval` | | Scrape interval - should match the actual scrape interval of Prometheus (global or explicit) for this exporter. Passed to the exporter as `SCRAPE_INTERVAL` environment variable. | `20s` | | ||
| | `port` | | Container and Service metrics port. | `9494` | | ||
| | `service.type` | | Kubernetes Service type. | `ClusterIP` | | ||
| | `service.annotations` | | Additional annotations to add to the Service. | | | ||
| | `hostProcMount` | | Host path mounted into the exporter as `/host/proc`. See [here](#kind-clusters-and-proprietary-driver). | `/proc` | | ||
| | `nvmlSearchPath` | | Override path used by the exporter to find NVML shared libraries. See [here](#non-standard-nvml-so-files-location). | | | ||
|
|
||
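| As a starting point, a minimal values override might look like the sketch below. The keys are the parameters from the table above; `<image-tag>` is a placeholder (the chart defines no default tag), and the annotation key `example.com/team` and its value are purely illustrative. | ||
| ```yaml | ||
| # values-override.yaml (hypothetical file name) | ||
| image: | ||
|   tag: "<image-tag>"   # mandatory; replace with a real exporter image tag | ||
| prometheusScrape: | ||
|   annotate: true       # adds the prometheus.io/* annotations to the Service | ||
|   interval: 20s        # keep in sync with the actual Prometheus scrape interval | ||
| service: | ||
|   annotations: | ||
|     example.com/team: gpu-observability   # illustrative extra annotation | ||
| ``` | ||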
| ### Kind clusters and proprietary driver | ||
|
|
||
| The `hostProcMount` parameter is **only** required for a [kind](https://kind.sigs.k8s.io/) k8s cluster running on a host with a **proprietary** Nvidia driver (e.g. the `linux-modules-nvidia-<version>-server-generic` series on Ubuntu). The reason is that `kind` nodes are Docker containers, and the proprietary Nvidia driver is blocked from calling the GPL-only kernel functions it would need to understand Linux PID namespaces, so it only has access to the host PIDs (which do not match the node's PIDs). | ||
|
|
||
| This parameter is **NOT** required if the cluster is NOT a `kind` cluster, or if the Nvidia driver uses the newer **Open GPU Kernel Modules** architecture (e.g. the series of `linux-modules-nvidia-<version>-server-open-generic` on Ubuntu), which is permitted access to these GPL-only functions and Linux PID namespaces. | ||
|
|
||
| If you have a `kind` cluster and a **proprietary** Nvidia driver, you need to deploy your cluster as follows: | ||
|
|
||
| ```yaml | ||
| kind: Cluster | ||
| apiVersion: kind.x-k8s.io/v1alpha4 | ||
| nodes: | ||
|   - role: control-plane | ||
|   - role: worker | ||
|     extraMounts: | ||
|       - hostPath: /dev/null | ||
|         containerPath: /var/run/nvidia-container-devices/all | ||
|       - hostPath: /proc | ||
|         containerPath: /physical-host-proc | ||
|         readOnly: true | ||
| ``` | ||
|
|
||
| Then set `hostProcMount: /physical-host-proc` in the values. This ensures that the exporter has access to the **host's** `/proc` filesystem. | ||
|
|
||
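| Combined with the kind configuration above, the values override for this case could be as small as the following sketch (any other values you already set stay unchanged): | ||
| ```yaml | ||
| # Only for kind clusters on a host with the proprietary Nvidia driver. | ||
| # /physical-host-proc is the containerPath of the /proc extraMounts entry above. | ||
| hostProcMount: /physical-host-proc | ||
| ``` | ||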
| ### Non-standard NVML .so files location | ||
|
|
||
| The exporter must load the NVML .so files from the **node's filesystem**. This ensures that the NVML version it loads matches the driver. | ||
|
|
||
| By default, the exporter searches the following well-known standard locations for the NVML .so files: | ||
|
|
||
| (`${DEBIAN_LIB_ARCH}` is one of `x86_64-linux-gnu` or `aarch64-linux-gnu`). | ||
|
|
||
| | Location | CSP / OS / Installation | | ||
| | --- | --- | | ||
| | `/home/kubernetes/bin/nvidia/lib64` | GKE COS / GKE GPU Operator with Google driver installer | | ||
| | `/opt/nvidia/lib64` | GKE Ubuntu Google driver installer | | ||
| | `/usr/local/nvidia/lib64` | NVIDIA container runtime / GKE exposed driver path / kind and nvkind | | ||
| | `/run/nvidia/driver/usr/lib64` | NVIDIA GPU Operator driver container, RPM-style | | ||
| | `/run/nvidia/driver/usr/lib/${DEBIAN_LIB_ARCH}` | NVIDIA GPU Operator driver container, Debian-style | | ||
| | `/usr/lib/${DEBIAN_LIB_ARCH}` | Ubuntu/Debian (GKE Ubuntu, AKS Ubuntu, OKE Ubuntu, kind) | | ||
| | `/usr/lib64` | EKS Amazon Linux, AKS Azure Linux, OKE Oracle Linux, Bottlerocket | | ||
| | `/lib/${DEBIAN_LIB_ARCH}` | Debian/Ubuntu merged-/usr compatibility | | ||
| | `/lib64` | RPM-style compatibility | | ||
|
|
||
| If your k8s cluster nodes keep the NVML .so files in a non-standard location, the `nvmlSearchPath` parameter is required and should be set to this location, as it is mounted under `/host/root/...`. In this case the standard locations are not searched. | ||
|
|
||
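| For illustration, an override for a non-standard location might look like the sketch below. The directory `/opt/custom-nvidia/lib64` is hypothetical, and the `/host/root` prefix assumes that the value is given as the path seen inside the exporter container (the node's `/` is mounted read-only at `/host/root`); verify this against your exporter version before relying on it. | ||
| ```yaml | ||
| # Hypothetical node directory /opt/custom-nvidia/lib64, addressed through | ||
| # the read-only /host/root mount of the node's root filesystem (assumption). | ||
| nvmlSearchPath: /host/root/opt/custom-nvidia/lib64 | ||
| ``` | ||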
| --- | ||
|
|
||
| ## Limitations | ||
|
|
||
| * Supported architectures: amd64 (x64), arm64 | ||
|
|
||
| ## Documentation | ||
|
|
||
| * [Kubex](https://docs.kubex.ai) | ||
|
|
||
| ## License | ||
|
|
||
| Apache 2 Licensed. See [LICENSE](LICENSE) for full details. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| {{/* | ||
| Create the image repository to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.imageRepository" -}} | ||
| {{- default "densify/gpu-process-exporter" .Values.image.repository }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the image tag to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.imageTag" -}} | ||
| {{- required "image.tag is required" .Values.image.tag }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the image pull policy to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.imagePullPolicy" -}} | ||
| {{- default "Always" .Values.image.pullPolicy }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the name of the service account to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.serviceAccountName" -}} | ||
| {{- $serviceAccount := default dict .Values.serviceAccount -}} | ||
| {{- if and (hasKey $serviceAccount "create") (eq $serviceAccount.create false) -}} | ||
| {{- required "serviceAccount.name is required when serviceAccount.create is false" $serviceAccount.name }} | ||
| {{- else -}} | ||
| {{- default "gpu-process-exporter" $serviceAccount.name }} | ||
| {{- end }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the cluster role name to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.clusterRoleName" -}} | ||
| {{- default "gpu-exporter-role" .Values.rbac.clusterRoleName }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the cluster role binding name to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.clusterRoleBindingName" -}} | ||
| {{- default "gpu-exporter-binding" .Values.rbac.clusterRoleBindingName }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the Prometheus scrape interval to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.prometheusScrapeInterval" -}} | ||
| {{- default "20s" .Values.prometheusScrape.interval }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the port to use by the container and service | ||
| */}} | ||
| {{- define "gpu-process-exporter.port" -}} | ||
| {{- default 9494 .Values.port }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the service type to use | ||
| */}} | ||
| {{- define "gpu-process-exporter.serviceType" -}} | ||
| {{- default "ClusterIP" .Values.service.type }} | ||
| {{- end }} | ||
| {{/* | ||
| Create the host proc mount to use by the container | ||
| */}} | ||
| {{- define "gpu-process-exporter.hostProcMount" -}} | ||
| {{- default "/proc" .Values.hostProcMount }} | ||
| {{- end }} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| apiVersion: apps/v1 | ||
| kind: DaemonSet | ||
| metadata: | ||
|   name: {{ .Chart.Name }} | ||
|   namespace: {{ .Release.Namespace }} | ||
|   labels: | ||
|     app.kubernetes.io/name: {{ .Chart.Name }} | ||
|     app.kubernetes.io/instance: {{ .Release.Name }} | ||
| spec: | ||
|   selector: | ||
|     matchLabels: | ||
|       app.kubernetes.io/name: {{ .Chart.Name }} | ||
|       app.kubernetes.io/instance: {{ .Release.Name }} | ||
|   template: | ||
|     metadata: | ||
|       labels: | ||
|         app.kubernetes.io/name: {{ .Chart.Name }} | ||
|         app.kubernetes.io/instance: {{ .Release.Name }} | ||
|     spec: | ||
|       hostPID: true | ||
|       nodeSelector: | ||
|         nvidia.com/gpu.present: "true" | ||
|       tolerations: | ||
|         - key: nvidia.com/gpu | ||
|           operator: Exists | ||
|           effect: NoSchedule | ||
|       serviceAccountName: {{ include "gpu-process-exporter.serviceAccountName" . }} | ||
|       containers: | ||
|         - name: exporter | ||
|           image: "{{ include "gpu-process-exporter.imageRepository" . }}:{{ include "gpu-process-exporter.imageTag" . }}" | ||
|           imagePullPolicy: {{ include "gpu-process-exporter.imagePullPolicy" . }} | ||
|           env: | ||
|             - name: NVIDIA_VISIBLE_DEVICES | ||
|               value: "all" | ||
|             - name: NVIDIA_DRIVER_CAPABILITIES | ||
|               value: "utility" | ||
|             - name: NODE_NAME | ||
|               valueFrom: | ||
|                 fieldRef: | ||
|                   fieldPath: spec.nodeName | ||
|             - name: EXPORTER_PORT | ||
|               value: {{ include "gpu-process-exporter.port" . | quote }} | ||
|             {{- if .Values.nvmlSearchPath }} | ||
|             - name: NVML_SEARCH_PATH | ||
|               value: {{ .Values.nvmlSearchPath | quote }} | ||
|             {{- end }} | ||
|             - name: SCRAPE_INTERVAL | ||
|               value: {{ include "gpu-process-exporter.prometheusScrapeInterval" . | quote }} | ||
|           ports: | ||
|             - containerPort: {{ include "gpu-process-exporter.port" . }} | ||
|               name: metrics | ||
|           securityContext: | ||
|             privileged: true | ||
|             runAsUser: 0 | ||
|           {{- if .Values.resources }} | ||
|           resources: | ||
|             {{- toYaml .Values.resources | nindent 12 }} | ||
|           {{- end }} | ||
|           volumeMounts: | ||
|             - name: host-root | ||
|               mountPath: /host/root | ||
|               readOnly: true | ||
|             - name: host-proc | ||
|               mountPath: /host/proc | ||
|               readOnly: true | ||
|       volumes: | ||
|         - name: host-root | ||
|           hostPath: | ||
|             path: / | ||
|         - name: host-proc | ||
|           hostPath: | ||
|             path: {{ include "gpu-process-exporter.hostProcMount" . }} | ||
| --- | ||
| apiVersion: v1 | ||
| kind: Service | ||
| metadata: | ||
|   name: {{ .Chart.Name }} | ||
|   namespace: {{ .Release.Namespace }} | ||
|   labels: | ||
|     app.kubernetes.io/name: {{ .Chart.Name }} | ||
|     app.kubernetes.io/instance: {{ .Release.Name }} | ||
|   {{- $hasAnnotations := or .Values.prometheusScrape.annotate (and .Values.service.annotations (gt (len .Values.service.annotations) 0)) }} | ||
|   {{- if $hasAnnotations }} | ||
|   annotations: | ||
|     {{- if .Values.prometheusScrape.annotate }} | ||
|     prometheus.io/scrape: {{ .Values.prometheusScrape.annotate | quote }} | ||
|     prometheus.io/port: {{ include "gpu-process-exporter.port" . | quote }} | ||
|     prometheus.io/path: "/metrics" | ||
|     {{- end }} | ||
|     {{- if .Values.service.annotations }} | ||
|     {{- toYaml .Values.service.annotations | nindent 4 }} | ||
|     {{- end }} | ||
|   {{- end }} | ||
| spec: | ||
|   selector: | ||
|     app.kubernetes.io/name: {{ .Chart.Name }} | ||
|     app.kubernetes.io/instance: {{ .Release.Name }} | ||
|   ports: | ||
|     - name: metrics | ||
|       protocol: TCP | ||
|       port: {{ include "gpu-process-exporter.port" . }} | ||
|       targetPort: {{ include "gpu-process-exporter.port" . }} | ||
|   type: {{ include "gpu-process-exporter.serviceType" . }} | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| {{- if .Values.serviceAccount.create }} | ||
| apiVersion: v1 | ||
| kind: ServiceAccount | ||
| metadata: | ||
|   name: {{ include "gpu-process-exporter.serviceAccountName" . }} | ||
|   namespace: {{ .Release.Namespace }} | ||
|   labels: | ||
|     app.kubernetes.io/name: {{ .Chart.Name }} | ||
|     app.kubernetes.io/instance: {{ .Release.Name }} | ||
| --- | ||
| {{- end }} | ||
| {{- if .Values.rbac.create }} | ||
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: ClusterRole | ||
| metadata: | ||
|   name: {{ include "gpu-process-exporter.clusterRoleName" . }} | ||
|   labels: | ||
|     app.kubernetes.io/name: {{ .Chart.Name }} | ||
|     app.kubernetes.io/instance: {{ .Release.Name }} | ||
| rules: | ||
|   - apiGroups: [""] | ||
|     resources: ["pods"] | ||
|     verbs: ["get", "list", "watch"] | ||
| --- | ||
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: ClusterRoleBinding | ||
| metadata: | ||
|   name: {{ include "gpu-process-exporter.clusterRoleBindingName" . }} | ||
|   labels: | ||
|     app.kubernetes.io/name: {{ .Chart.Name }} | ||
|     app.kubernetes.io/instance: {{ .Release.Name }} | ||
| subjects: | ||
|   - kind: ServiceAccount | ||
|     name: {{ include "gpu-process-exporter.serviceAccountName" . }} | ||
|     namespace: {{ .Release.Namespace }} | ||
| roleRef: | ||
|   kind: ClusterRole | ||
|   name: {{ include "gpu-process-exporter.clusterRoleName" . }} | ||
|   apiGroup: rbac.authorization.k8s.io | ||
| {{- end }} | ||