Is this an ISSUE or FEATURE REQUEST? (choose one): Issue
What version of acs-engine?: (master as of 10/11/2018)
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes v1.12.0
What happened:
After installing a Windows patch and rebooting, the node became unavailable.
What you expected to happen:
The node should come back up after the reboot.
How to reproduce it (as minimally and precisely as possible):
Issue 1: kubeproxy was in a restart loop which slowed down the system dramatically. I had to stop kubelet/kubeproxy just to get the system to respond well enough for me to grep through logs.
- There is nothing in c:\k\kubeproxy.err.log after 3:03:08 PM. I don’t think the service actually started after the reboot
- C:\k\kubeproxy.log is in a loop with “Waiting for Network [azure] to be created . . .”
    $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
    while (!$hnsNetwork)
    {
        Write-Host "Waiting for Network [azure] to be created . . ."
        Start-Sleep 10
        $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
    }
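As an illustration only, the wait loop above could be bounded so that a network that never appears surfaces as an error instead of an endless wait. This is a sketch, not something taken from the acs-engine scripts: `Wait-ForCondition` and its parameters are hypothetical names, and the timeout values are arbitrary.

```powershell
# Hypothetical helper (not part of acs-engine): poll a condition until it
# returns something truthy, or throw once the timeout has elapsed.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSeconds = 300,
        [int] $IntervalSeconds = 10
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        # Evaluate the caller's check and hand back its result on success.
        $result = & $Condition
        if ($result) { return $result }

        if ((Get-Date) -ge $deadline) {
            throw "Timed out after $TimeoutSeconds seconds waiting for condition."
        }
        Start-Sleep -Seconds $IntervalSeconds
    }
}
```

In the kubeproxy script this might be called as `$hnsNetwork = Wait-ForCondition -Condition { Get-HnsNetwork | ? Name -EQ azure }`, assuming `Get-HnsNetwork` is already available there as it is in the loop above.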
Issue 2: kubelet wasn’t starting because Docker wasn’t starting. It looks like the kubelet startup script intentionally deletes the HNS network, but I don’t see any step that creates it.
@daschott – can you find out if this is still needed, or if we can remove it to clean it up?
    # Find if network created by CNI exists, if yes, remove it
    # This is required to keep the network non-persistent behavior
    # Going forward, this would be done by HNS automatically during restart of the node
    $hnsNetwork = Get-HnsNetwork | ? Name -EQ azure
    if ($hnsNetwork)
    {
        # Cleanup all containers
        docker ps -q | foreach {docker rm $_ -f}

        Write-Host "Cleaning up old HNS network found"
        Remove-HnsNetwork $hnsNetwork
    }
Issue 3: The Docker service timed out after 30 seconds while starting. The system was last booted 10/11 3:03:30 PM, and the service timed out at 3:04:07 PM. There were no other attempts to start it after that, even though the failure actions were set.
(gcim Win32_OperatingSystem).LastBootUpTime
Thursday, October 11, 2018 3:03:30 PM
$scevt | ?{ $_.Message -Like "*docker*" } | format-list
Index : 1829
EntryType : Error
InstanceId : 3221232472
Message : The Docker service failed to start due to the following error:
%%1053
Category : (0)
CategoryNumber : 0
ReplacementStrings : {Docker, %%1053}
Source : Service Control Manager
TimeGenerated : 10/11/2018 3:04:07 PM
TimeWritten : 10/11/2018 3:04:07 PM
UserName :
Index : 1828
EntryType : Error
InstanceId : 3221232481
Message : A timeout was reached (30000 milliseconds) while waiting for the Docker service to connect.
Category : (0)
CategoryNumber : 0
ReplacementStrings : {30000, Docker}
Source : Service Control Manager
TimeGenerated : 10/11/2018 3:04:07 PM
TimeWritten : 10/11/2018 3:04:07 PM
UserName :
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Docker]
"FailureActions"=hex:84,03,00,00,00,00,00,00,00,00,00,00,03,00,00,00,14,00,00,
00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00
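For reference, the FailureActions value can be decoded to check what was actually configured. A minimal sketch, assuming the blob mirrors the SERVICE_FAILURE_ACTIONS layout with 32-bit little-endian fields (a 20-byte header followed by 8-byte type/delay pairs); the variable names are illustrative.

```powershell
# FailureActions bytes copied from the registry export above (44 bytes).
$hex = '84,03,00,00,00,00,00,00,00,00,00,00,03,00,00,00,14,00,00,00,' +
       '01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00,01,00,00,00,60,ea,00,00'
[byte[]]$bytes = $hex.Split(',') | ForEach-Object { [Convert]::ToByte($_, 16) }

# Assumed header layout: reset period, reboot message, command,
# action count, actions pointer/offset (0x14 = 20 in this value).
$headerSize     = 20
$resetPeriodSec = [int][BitConverter]::ToUInt32($bytes, 0)
$actionCount    = [int][BitConverter]::ToUInt32($bytes, 12)

# SC_ACTION type codes as defined by the Windows service API.
$actionNames = @{ 0 = 'None'; 1 = 'Restart'; 2 = 'Reboot'; 3 = 'RunCommand' }

# Each action is a (type, delay in milliseconds) pair of DWORDs.
$actions = for ($i = 0; $i -lt $actionCount; $i++) {
    $offset = $headerSize + (8 * $i)
    [pscustomobject]@{
        Action  = $actionNames[[int][BitConverter]::ToUInt32($bytes, $offset)]
        DelayMs = [BitConverter]::ToUInt32($bytes, $offset + 4)
    }
}

"Reset period: $resetPeriodSec seconds"
$actions | Format-Table
```

Under that assumption, the value decodes to a 900 second reset period and three Restart actions, each with a 60000 ms delay, so restart-on-failure was configured for the Docker service.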
Possible fix 1 – review the failure settings again and see if we can make Docker restart after timeout
Possible fix 2 – make kubelet depend on Docker. Maybe that will make it try to start the service again
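A rough sketch of how the two possible fixes could be inspected and tried from an elevated PowerShell prompt. The service names `kubelet` and `Docker` are assumptions about how the services are registered on the node, and whether a dependency actually causes Docker to be started again after a timeout is untested here.

```powershell
# Use sc.exe explicitly: in Windows PowerShell, "sc" is an alias for Set-Content.

# Possible fix 1: show the recovery (failure) settings currently stored for Docker.
sc.exe qfailure Docker

# Possible fix 2: declare Docker as a dependency of kubelet.
# "depend=" followed by a space is the documented sc.exe syntax.
# This replaces any dependency list kubelet already has.
sc.exe config kubelet depend= Docker

# Confirm that the dependency was recorded.
sc.exe qc kubelet
```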
Anything else we need to know:
/label windows
cc @adelina-t @daschott