Windows nodes may go into NotReady state after reboot

**Is this an ISSUE or FEATURE REQUEST?** (choose one): Issue

---

**What version of acs-engine?**: (master as of 10/11/2018)

---

**Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)**

Kubernetes v1.12.0

**What happened**:

After installing a Windows patch and rebooting, the node became unavailable (NotReady).


**What you expected to happen**:

The node should come back up after the reboot.

**How to reproduce it** (as minimally and precisely as possible):


Issue 1: kubeproxy was in a restart loop, which slowed the system down dramatically. I had to stop kubelet/kubeproxy just to get the system to respond well enough for me to grep through logs.

- There is nothing in `c:\k\kubeproxy.err.log` after 3:03:08 PM. I don’t think the service actually started after the reboot.
- `C:\k\kubeproxy.log` is in a loop with “Waiting for Network [azure] to be created . . .”, which comes from this wait loop:

      $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
      while (!$hnsNetwork)
      {
          Write-Host "Waiting for Network [azure] to be created . . ."
          Start-Sleep 10
          $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
      }

Issue 2: kubelet wasn’t starting because docker wasn’t starting. It looks like the kubelet startup script intentionally deletes the HNSNetwork, but I don’t see any step to create it? 

@daschott, can you find out whether this is still needed, or whether we can remove it to clean things up?

    # Find if network created by CNI exists, if yes, remove it
    # This is required to keep the network non-persistent behavior
    # Going forward, this would be done by HNS automatically during restart of the node
    $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
    if ($hnsNetwork)
    {
        # Cleanup all containers
        docker ps -q | foreach {docker rm $_ -f}

        Write-Host "Cleaning up old HNS network found"
        Remove-HnsNetwork $hnsNetwork
    }

Issue 3: The Docker service timed out while starting, after 30 seconds. The system was last booted 10/11 3:03:30 PM, and the service start timed out at 3:04:07 PM. There were no further attempts to start it after that, even though failure actions were configured.

Last boot time:

    (gcim Win32_OperatingSystem).LastBootUpTime

    Thursday, October 11, 2018 3:03:30 PM

Service Control Manager events mentioning Docker:

    $scevt | ?{ $_.Message -Like "*docker*" } | format-list


    Index              : 1829
    EntryType          : Error
    InstanceId         : 3221232472
    Message            : The Docker service failed to start due to the following error:
                         %%1053
    Category           : (0)
    CategoryNumber     : 0
    ReplacementStrings : {Docker, %%1053}
    Source             : Service Control Manager
    TimeGenerated      : 10/11/2018 3:04:07 PM
    TimeWritten        : 10/11/2018 3:04:07 PM
    UserName           :

    Index              : 1828
    EntryType          : Error
    InstanceId         : 3221232481
    Message            : A timeout was reached (30000 milliseconds) while waiting for the Docker service to connect.
    Category           : (0)
    CategoryNumber     : 0
    ReplacementStrings : {30000, Docker}
    Source             : Service Control Manager
    TimeGenerated      : 10/11/2018 3:04:07 PM
    TimeWritten        : 10/11/2018 3:04:07 PM
    UserName           :

Failure actions configured for the Docker service:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Docker]
    "FailureActions"=hex:84,03,00,00,00,00,00,00,00,00,00,00,03,00,00,00,14,00,00,\
      00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00
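To make the `FailureActions` value above easier to read, here is a minimal decoding sketch. It assumes the blob follows the commonly documented `SERVICE_FAILURE_ACTIONS` layout with 32-bit little-endian fields (reset period in seconds, two pointer fields, action count, actions offset, then one type/delay pair per action). The function name `ConvertFrom-FailureActions` is hypothetical, not part of any module.

```powershell
# Hypothetical helper: decode a FailureActions registry blob.
# Assumed layout (little-endian DWORDs): reset period (s), reboot message ptr,
# command ptr, action count, actions offset, then <count> pairs of (type, delay ms).
function ConvertFrom-FailureActions {
    param([Parameter(Mandatory)][byte[]]$Bytes)

    # SC_ACTION_TYPE values
    $typeNames = @{ 0 = 'None'; 1 = 'Restart'; 2 = 'Reboot'; 3 = 'RunCommand' }

    $resetSeconds = [BitConverter]::ToUInt32($Bytes, 0)
    $count        = [BitConverter]::ToUInt32($Bytes, 12)

    $actions = for ($i = 0; $i -lt $count; $i++) {
        $offset = 20 + 8 * $i   # actions start after the 20-byte header
        [pscustomobject]@{
            Action  = $typeNames[[int][BitConverter]::ToUInt32($Bytes, $offset)]
            DelayMs = [BitConverter]::ToUInt32($Bytes, $offset + 4)
        }
    }

    [pscustomobject]@{
        ResetPeriodSeconds = $resetSeconds
        Actions            = @($actions)
    }
}

# Bytes copied from the registry export in this issue
$hex = '84,03,00,00,00,00,00,00,00,00,00,00,03,00,00,00,14,00,00,00,' +
       '01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00'
$bytes = [byte[]]($hex -split ',' | ForEach-Object { [Convert]::ToByte($_, 16) })

$decoded = ConvertFrom-FailureActions -Bytes $bytes
$decoded.ResetPeriodSeconds
$decoded.Actions | Format-Table
```

Under that assumed layout, the blob decodes to a 900-second reset period and three restart actions with a 60000 ms delay each. So restarts do appear to be configured, which is consistent with the observation that the settings exist but were not applied after the start timeout.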


Possible fix 1 – review the failure settings again and see if we can make Docker restart after timeout
Possible fix 2 – make kubelet depend on docker. Maybe that will make it try to start the service again
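A rough sketch of what the two possible fixes could look like with `sc.exe`. The service names `docker` and `kubelet` are assumptions based on the names used in this issue; confirm them with `Get-Service` first. Whether either setting makes the Service Control Manager retry after a start timeout (error 1053) still needs to be verified on an affected node.

```powershell
# Assumed service names; check with: Get-Service docker, kubelet

# Possible fix 1: restart Docker 60 s after each failure and reset the
# failure counter after 900 s (the same values the current blob appears to hold)
sc.exe failure docker reset= 900 actions= restart/60000/restart/60000/restart/60000

# Also apply the recovery actions when the service stops with an error,
# not only when the process terminates unexpectedly
sc.exe failureflag docker 1

# Possible fix 2: declare Docker as a dependency of kubelet.
# Note: depend= replaces any existing dependency list for the service.
sc.exe config kubelet depend= docker

# Inspect the resulting configuration
sc.exe qfailure docker
sc.exe qc kubelet
```

`sc.exe` is spelled out because `sc` is an alias for `Set-Content` in Windows PowerShell.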


**Anything else we need to know**:
/label windows
cc @adelina-t @daschott

