Is this a request for help?:
NO
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
0.18.1
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes
What happened:
Tried to deploy a cluster with two agent pools: one with regular VMs (CPU only) and one with GPUs.
Similar to this blog post: https://www.microsoft.com/developerblog/2017/11/21/autoscaling-deep-learning-training-kubernetes/
The ARM deployment failed with a VMExtensionProvisioningError.
The master nodes start correctly. Also, the nodes with GPU start correctly. The error is in the CPU nodes.
Logging into one of the CPU nodes, we see that the hyperkube-extract service won't start:
systemd[1]: hyperkube-extract.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped kubectl and kubelet extraction.
systemd[1]: Starting kubectl and kubelet extraction...
docker[38944]: v1.10.3: Pulling from hyperkube-amd64
docker[38944]: Digest: sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a
docker[38944]: Status: Image is up to date for k8s-gcrio.azureedge.net/hyperkube-amd64:v1.10.3
docker[38959]: /usr/bin/docker: Error response from daemon: shim error: fork/exec /usr/bin/nvidia-container-runtime: no such file or directory.
systemd[1]: hyperkube-extract.service: Control process exited, code=exited status=127
systemd[1]: Failed to start kubectl and kubelet extraction.
systemd[1]: hyperkube-extract.service: Unit entered failed state.
systemd[1]: hyperkube-extract.service: Failed with result 'exit-code'.
We observe that the CPU-only nodes get the same Docker daemon.json as the GPU ones:
$ cat /etc/docker/daemon.json
{
"live-restore": true,
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
,"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
This fails because the regular CPU-only VMs do not have the NVIDIA container runtime installed, yet daemon.json sets it as the default runtime.
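To confirm the mismatch on a node, one could check whether every runtime listed in daemon.json actually exists on disk. The following is a minimal diagnostic sketch, not part of acs-engine; the function names `missing_runtimes` and `load_daemon_config` are hypothetical and introduced only for illustration.

```python
import json
import os


def load_daemon_config(path="/etc/docker/daemon.json"):
    """Read and parse the Docker daemon configuration file."""
    with open(path) as fh:
        return json.load(fh)


def missing_runtimes(config, exists=os.path.exists):
    """Return the names of configured runtimes whose binary is absent.

    `exists` is injectable so the check can be exercised without
    touching the real filesystem.
    """
    runtimes = config.get("runtimes", {})
    return sorted(
        name
        for name, spec in runtimes.items()
        if not exists(spec.get("path", ""))
    )
```

On an affected CPU node, `missing_runtimes(load_daemon_config())` would be expected to report the `nvidia` entry, since `/usr/bin/nvidia-container-runtime` is not present there.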
What you expected to happen:
A working cluster, with a different Docker daemon.json depending on the VM type: only the GPU nodes should reference the NVIDIA runtime.
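The expected behavior can be sketched as follows. This is only an illustration of the idea, assuming a per-pool flag (here the hypothetical `has_gpu`) is available when the config is generated; it is not how acs-engine actually renders the file.

```python
import json


def build_daemon_config(has_gpu):
    """Build a Docker daemon config; add the NVIDIA runtime only for GPU pools."""
    config = {
        "live-restore": True,
        "log-driver": "json-file",
        "log-opts": {"max-size": "50m", "max-file": "5"},
    }
    if has_gpu:
        # Only GPU nodes have /usr/bin/nvidia-container-runtime installed.
        config["default-runtime"] = "nvidia"
        config["runtimes"] = {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": [],
            }
        }
    return config


def render(has_gpu):
    """Serialize the config as it would be written to daemon.json."""
    return json.dumps(build_daemon_config(has_gpu), indent=2)
```

With this shape, a CPU-only pool gets a daemon.json with no `default-runtime` or `runtimes` keys, so Docker falls back to its built-in default runtime.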
How to reproduce it (as minimally and precisely as possible):
Follow the instructions in the blog post above, deploying a cluster with two agent pools: one GPU and one CPU-only.
Anything else we need to know:
We tried an older version of acs-engine, 0.9.4, chosen by correlating the date of the blog post (Nov 2017) with acs-engine releases. This older version works correctly.
The previous version of acs-engine (0.17.1) also works correctly, so this must be a regression in the most recent release of acs-engine.
We think the refactoring to NVIDIA device plugins may be related (#2545).
cc @johnhofman