This repository was archived by the owner on Jan 11, 2023. It is now read-only.

Hybrid clusters with GPU & CPU do not work #3190

Description

@rolanddb

Is this a request for help?:
NO


Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
0.18.1

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
Tried to deploy a cluster with two agent pools: one with regular VMs (CPU only) and one with GPUs.
Similar to this blog post: https://www.microsoft.com/developerblog/2017/11/21/autoscaling-deep-learning-training-kubernetes/

The ARM deployment failed with a VMExtensionProvisioningError.

The master nodes start correctly. Also, the nodes with GPU start correctly. The error is in the CPU nodes.

Logging into a CPU-only node, we see that the hyperkube-extract service won't start:

systemd[1]: hyperkube-extract.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped kubectl and kubelet extraction.
systemd[1]: Starting kubectl and kubelet extraction...
docker[38944]: v1.10.3: Pulling from hyperkube-amd64
docker[38944]: Digest: sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a
docker[38944]: Status: Image is up to date for k8s-gcrio.azureedge.net/hyperkube-amd64:v1.10.3
docker[38959]: /usr/bin/docker: Error response from daemon: shim error: fork/exec /usr/bin/nvidia-container-runtime: no such file or directory.
systemd[1]: hyperkube-extract.service: Control process exited, code=exited status=127
systemd[1]: Failed to start kubectl and kubelet extraction.
systemd[1]: hyperkube-extract.service: Unit entered failed state.
systemd[1]: hyperkube-extract.service: Failed with result 'exit-code'.

We observe that the CPU-only nodes get the same Docker daemon.json as the GPU ones:

$ cat /etc/docker/daemon.json
{
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts":  {
     "max-size": "50m",
     "max-file": "5"
  }
  ,"default-runtime": "nvidia",
  "runtimes": {
     "nvidia": {
         "path": "/usr/bin/nvidia-container-runtime",
         "runtimeArgs": []
    }
  }
}

This fails because the regular CPU-only VMs do not have the NVIDIA container runtime installed, so the default runtime binary at /usr/bin/nvidia-container-runtime does not exist on them.
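
For anyone who wants to confirm the same condition on their own nodes, here is a minimal sketch of the check we did by hand: parse the daemon.json and see whether the configured default runtime points at a binary that exists. The function name `missing_default_runtime` and the `path_exists` parameter are hypothetical helpers for illustration, not part of acs-engine or Docker.

```python
import json
import os

# The daemon.json contents observed on the CPU-only node (copied from above).
DAEMON_JSON = """
{
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts": {"max-size": "50m", "max-file": "5"},
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}
  }
}
"""


def missing_default_runtime(config_text, path_exists=os.path.exists):
    """Return the default runtime's binary path if it is configured but absent.

    Returns None when no default runtime is set, when the default runtime
    has no entry under "runtimes", or when the binary is present.
    """
    config = json.loads(config_text)
    default = config.get("default-runtime")
    if default is None:
        return None  # no explicit default runtime configured
    runtime = config.get("runtimes", {}).get(default)
    if runtime is None:
        return None  # name not listed under "runtimes"; nothing to check here
    path = runtime.get("path")
    if path and not path_exists(path):
        return path
    return None
```

On a node, passing the contents of /etc/docker/daemon.json to this function should return the NVIDIA runtime path on the broken CPU-only nodes and None on the GPU nodes, assuming the GPU nodes have the runtime installed at that path.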

What you expected to happen:
A working cluster, with a different Docker daemon.json depending on the VM type: only the GPU nodes should set the NVIDIA runtime as default.

How to reproduce it (as minimally and precisely as possible):
Follow the instructions in the blog post above, deploying a cluster with two agent pools: one GPU and one CPU.

Anything else we need to know:
We tried an older version of acs-engine, 0.9.4. We picked this version by correlating the date of the blog post (Nov 2017) with the acs-engine releases. This older acs-engine works correctly.

The previous version of acs-engine (0.17.1) also works correctly, so this must be a regression in the most recent release (0.18.1).

We think the refactoring to NVIDIA device-plugins may be related (#2545)

cc @johnhofman
