Hybrid clusters with GPU & CPU do not work

**Is this a request for help?**: 
NO

---

**Is this an ISSUE or FEATURE REQUEST?** (choose one):
ISSUE

---

**What version of acs-engine?**:
0.18.1

---

**Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)**
Kubernetes

**What happened**:
Tried to deploy a cluster with two agent pools: one with regular VMs (CPU only) and one with GPUs.
Similar to this blog post: https://www.microsoft.com/developerblog/2017/11/21/autoscaling-deep-learning-training-kubernetes/

The ARM deployment failed with a VMExtensionProvisioningError. 

The master nodes and the GPU nodes start correctly. The error occurs only on the CPU nodes.

Logging into one of the CPU nodes, we see that the hyperkube-extract service fails to start:

```
systemd[1]: hyperkube-extract.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped kubectl and kubelet extraction.
systemd[1]: Starting kubectl and kubelet extraction...
docker[38944]: v1.10.3: Pulling from hyperkube-amd64
docker[38944]: Digest: sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a
docker[38944]: Status: Image is up to date for k8s-gcrio.azureedge.net/hyperkube-amd64:v1.10.3
docker[38959]: /usr/bin/docker: Error response from daemon: shim error: fork/exec /usr/bin/nvidia-container-runtime: no such file or directory.
systemd[1]: hyperkube-extract.service: Control process exited, code=exited status=127
systemd[1]: Failed to start kubectl and kubelet extraction.
systemd[1]: hyperkube-extract.service: Unit entered failed state.
systemd[1]: hyperkube-extract.service: Failed with result 'exit-code'.
```


We observe that the CPU-only nodes get the same Docker `daemon.json` as the GPU nodes:

```
$ cat /etc/docker/daemon.json
{
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts":  {
     "max-size": "50m",
     "max-file": "5"
  }
  ,"default-runtime": "nvidia",
  "runtimes": {
     "nvidia": {
         "path": "/usr/bin/nvidia-container-runtime",
         "runtimeArgs": []
    }
  }
}
```
This fails because the regular CPU-only VMs do not have the NVIDIA container runtime installed, yet `daemon.json` sets it as the default runtime.
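
The mismatch can be checked for directly on a node. Below is a minimal diagnostic sketch in Python; it is not part of acs-engine, and the function names `missing_runtimes` and `check_daemon_json` are hypothetical. It parses a `daemon.json` and reports any configured runtime whose binary does not exist on disk.

```python
import json
import os


def missing_runtimes(daemon_config, exists=os.path.exists):
    """Return {runtime_name: path} for configured runtimes whose binary is absent.

    `exists` is injectable so the check can be exercised without touching disk.
    """
    runtimes = daemon_config.get("runtimes", {})
    return {
        name: spec.get("path")
        for name, spec in runtimes.items()
        if not spec.get("path") or not exists(spec["path"])
    }


def check_daemon_json(path="/etc/docker/daemon.json"):
    """Load a Docker daemon.json and summarize runtime problems (hypothetical helper)."""
    with open(path) as handle:
        config = json.load(handle)
    missing = missing_runtimes(config)
    default = config.get("default-runtime")
    return {
        "default_runtime": default,
        "missing": missing,
        # True when the default runtime points at a binary that is not installed.
        "default_runtime_missing": default in missing,
    }
```

On one of the affected CPU nodes, we would expect `check_daemon_json()` to list `nvidia` under `missing` and to report `default_runtime_missing` as true, which matches the `fork/exec /usr/bin/nvidia-container-runtime: no such file or directory` error in the log.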

**What you expected to happen**:
A working cluster, with a different Docker daemon configuration depending on the VM type.
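
To illustrate the expected per-pool difference, the sketch below derives a CPU-only configuration from the GPU one by dropping the `nvidia` runtime and the `default-runtime` entry that points at it. This is only our assumption about what a correct CPU-node `daemon.json` would contain, not how acs-engine generates the file; `cpu_only_config` is a hypothetical helper.

```python
import copy


def cpu_only_config(daemon_config, gpu_runtime="nvidia"):
    """Return a copy of a Docker daemon config with the GPU runtime removed."""
    config = copy.deepcopy(daemon_config)  # leave the caller's dict untouched

    runtimes = config.get("runtimes", {})
    runtimes.pop(gpu_runtime, None)
    if not runtimes:
        # No runtimes left: drop the now-empty section entirely.
        config.pop("runtimes", None)

    # Fall back to Docker's built-in default runtime on CPU nodes.
    if config.get("default-runtime") == gpu_runtime:
        config.pop("default-runtime")

    return config
```

Applied to the `daemon.json` shown above, this leaves only the `live-restore`, `log-driver`, and `log-opts` settings.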

**How to reproduce it** (as minimally and precisely as possible):
Follow the instructions in the blog post above, deploying a cluster with two agent pools: one with GPU VMs and one with CPU-only VMs.

**Anything else we need to know**:
We tried an older version of acs-engine, 0.9.4, which we chose by matching the date of the blog post (November 2017) against the acs-engine releases. This older version works correctly.

The previous version of acs-engine (0.17.1) also works correctly, so this must be a regression in the most recent release.

We think the refactoring to NVIDIA device plugins (#2545) may be related.

cc @johnhofman

Hybrid clusters with GPU & CPU do not work #3190