This repository was archived by the owner on Jan 11, 2023. It is now read-only.

Windows nodes may go into NotReady state after reboot #4001

Description

@PatrickLang

Is this an ISSUE or FEATURE REQUEST? (choose one): Issue


What version of acs-engine?: (master as of 10/11/2018)


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

Kubernetes v1.12.0

What happened:

After installing a Windows patch and rebooting, the node became unavailable (NotReady).

What you expected to happen:

Node should come back up

How to reproduce it (as minimally and precisely as possible):

Issue 1: kubeproxy was in a restart loop which slowed down the system dramatically. I had to stop kubelet/kubeproxy just to get the system to respond well enough for me to grep through logs.

  • There is nothing in c:\k\kubeproxy.err.log after 3:03:08 PM. I don’t think the service actually started after the reboot
  • C:\k\kubeproxy.log is in a loop with “Waiting for Network [azure] to be created . . .”
      $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
      while (!$hnsNetwork)
      {
          Write-Host "Waiting for Network [azure] to be created . . ."
          Start-Sleep 10
          $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
      }
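
The wait loop above has no exit condition other than the network appearing, so if nothing recreates the network the script waits forever. A minimal sketch of a bounded variant follows, assuming it is acceptable to fail after a deadline. `Wait-ForCondition`, its parameters, and the 300-second default are hypothetical and not part of acs-engine.

```powershell
# Hypothetical helper (not part of acs-engine): polls a probe script block
# until it returns a truthy value, or throws once the timeout has elapsed.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $Probe,
        [int] $TimeoutSeconds = 300,     # assumed deadline, tune as needed
        [int] $IntervalSeconds = 10,     # same interval as the original loop
        [string] $Description = "condition"
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        $result = & $Probe
        if ($result) { return $result }

        # Give up with an explicit error instead of looping forever.
        if ((Get-Date) -ge $deadline) {
            throw "Timed out after $TimeoutSeconds seconds waiting for $Description"
        }

        Write-Host "Waiting for $Description . . ."
        Start-Sleep -Seconds $IntervalSeconds
    }
}
```

With this in place, the original loop could be written as `$hnsNetwork = Wait-ForCondition -Description "Network [azure] to be created" -Probe { Get-HnsNetwork | ? Name -EQ azure }`, and a timeout would surface as an error in the log rather than as an endless series of "Waiting" lines.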

Issue 2: kubelet wasn’t starting because docker wasn’t starting. It looks like the kubelet startup script intentionally deletes the HNSNetwork, but I don’t see any step to create it?

@daschott – can you find out whether this is still needed, or whether we can remove it?

    # Find if network created by CNI exists, if yes, remove it
    # This is required to keep the network non-persistent behavior
    # Going forward, this would be done by HNS automatically during restart of the node
    $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
    if ($hnsNetwork)
    {
        # Cleanup all containers
        docker ps -q | foreach {docker rm $_ -f}

        Write-Host "Cleaning up old HNS network found"
        Remove-HnsNetwork $hnsNetwork
    }

Issue 3: The Docker service hit the 30-second start timeout. The system was last booted 10/11 at 3:03:30 PM, and the service timed out at 3:04:07 PM. There were no further attempts to start it after that, even though the failure actions were set.

(gcim Win32_OperatingSystem).LastBootUpTime

Thursday, October 11, 2018 3:03:30 PM

$scevt | ?{ $_.Message -Like "*docker*" } | format-list

Index : 1829
EntryType : Error
InstanceId : 3221232472
Message : The Docker service failed to start due to the following error:
%%1053
Category : (0)
CategoryNumber : 0
ReplacementStrings : {Docker, %%1053}
Source : Service Control Manager
TimeGenerated : 10/11/2018 3:04:07 PM
TimeWritten : 10/11/2018 3:04:07 PM
UserName :

Index : 1828
EntryType : Error
InstanceId : 3221232481
Message : A timeout was reached (30000 milliseconds) while waiting for the Docker service to connect.
Category : (0)
CategoryNumber : 0
ReplacementStrings : {30000, Docker}
Source : Service Control Manager
TimeGenerated : 10/11/2018 3:04:07 PM
TimeWritten : 10/11/2018 3:04:07 PM
UserName :

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Docker]
"FailureActions"=hex:84,03,00,00,00,00,00,00,00,00,00,00,03,00,00,00,14,00,00,\
  00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00
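
The `FailureActions` value is a serialized SERVICE_FAILURE_ACTIONS header followed by its SC_ACTION array. The sketch below decodes the bytes above, assuming the 44-byte layout seen here: five 4-byte header fields, then one (type, delay) pair per action. `ConvertFrom-FailureActions` is a hypothetical helper name. For this blob it gives a reset period of 900 seconds and three "Restart" actions with a 60000 ms delay each, so restarts are configured. The SCM treats a service as failed when it terminates without reporting SERVICE_STOPPED; a start timeout at boot may not count as such a failure, which would be consistent with the lack of retries.

```powershell
# Hypothetical helper: decodes a FailureActions registry blob.
# Assumes little-endian DWORDs (true on Windows) and the layout
# [resetPeriod][rebootMsg][command][actionCount][actionsOffset] + actions.
function ConvertFrom-FailureActions {
    param([Parameter(Mandatory = $true)] [byte[]] $Bytes)

    # SC_ACTION_TYPE values from the Windows service API
    $actionNames = @{ 0 = "None"; 1 = "Restart"; 2 = "Reboot"; 3 = "RunCommand" }

    $resetPeriod = [BitConverter]::ToUInt32($Bytes, 0)
    $count       = [BitConverter]::ToUInt32($Bytes, 12)

    $actions = for ($i = 0; $i -lt $count; $i++) {
        $offset = 20 + (8 * $i)          # actions start after the 20-byte header
        $type = [BitConverter]::ToUInt32($Bytes, $offset)
        [pscustomobject]@{
            Action  = $actionNames[[int]$type]
            DelayMs = [BitConverter]::ToUInt32($Bytes, $offset + 4)
        }
    }

    [pscustomobject]@{
        ResetPeriodSeconds = $resetPeriod
        Actions            = @($actions)
    }
}

# The hex bytes copied from the registry export above
$hex = "84,03,00,00,00,00,00,00,00,00,00,00,03,00,00,00,14,00,00," +
       "00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00"
$bytes = [byte[]]($hex -split "," | ForEach-Object { [Convert]::ToByte($_, 16) })

$decoded = ConvertFrom-FailureActions -Bytes $bytes
$decoded.Actions | Format-Table
```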

Possible fix 1 – review the failure settings again and see if we can make Docker restart after timeout
Possible fix 2 – make kubelet depend on docker. Maybe that will make it try to start the service again
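
Possible fix 2 could be tried with `sc.exe`, roughly as sketched here. This assumes the services are registered under the names `kubelet` and `docker` (the Docker name matches the registry key above; `kubelet` is an assumption). Note that `depend=` replaces any existing dependency list rather than appending to it. Whether the dependency makes the Service Control Manager retry Docker after a start timeout is not verified here.

```powershell
# Assumed service names: "kubelet" and "docker".

# Inspect the current configuration and recovery settings first.
sc.exe qc kubelet
sc.exe qfailure docker

# Possible fix 2: declare docker as a dependency of kubelet.
# This overwrites kubelet's existing dependency list, if any.
sc.exe config kubelet depend= docker
```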

Anything else we need to know:
/label windows
cc @adelina-t @daschott
