Description of problem:
When docker fails, atomic-openshift-node still sends a Ready event to the master.

Version-Release number of selected component (if applicable):
3.3

How reproducible:
100%

Steps to Reproduce:
1. Set node-monitor-grace-period=20s on the masters:

    kubernetesMasterConfig:
      admissionConfig:
        pluginConfig: {}
      apiServerArguments:
      controllerArguments:
        node-monitor-grace-period:
        - "10s"

2. Stop docker.

Actual results:
Node continues to show as Ready.

Expected results:
Node shows as NotReady after 20 seconds.

Additional info:

    # date; ssh node-1 "systemctl stop docker"
    Wed Feb 1 16:34:14 EST 2017

    [root@master-2 ~]# date; oc get nodes
    Wed Feb 1 16:37:15 EST 2017
    NAME                   STATUS     AGE
    master-1.example.com   Ready      103d
    master-2.example.com   Ready      103d
    node-1.example.com     Ready      5d

    # date; oc get nodes
    Wed Feb 1 16:40:26 EST 2017
    NAME                   STATUS     AGE
    master-1.example.com   Ready      103d
    master-2.example.com   Ready      103d
    node-1.example.com     NotReady   5d
See related issue: https://github.com/kubernetes/kubernetes/issues/30534

The kubelet used to have a fixed constant for how long it would tolerate the docker daemon being down before reporting the node as not ready. That constant was set to 5 minutes, so it could take up to 5 minutes after docker went down for the kubelet to report that the node was no longer ready.

This was fixed upstream in k8s 1.6 via PR: https://github.com/kubernetes/kubernetes/pull/38527

The new behavior is that the kubelet waits 30s for the container runtime to be down before reporting the node as NotReady.

Origin PR: https://github.com/openshift/origin/pull/12776
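The threshold behavior described above can be illustrated with a small shell sketch. This is not the kubelet's code. The function name `runtime_status` and its arguments are hypothetical, and it assumes the node is reported NotReady only once the time since the last successful runtime check exceeds the threshold.

```shell
#!/bin/sh
# Hypothetical illustration only; this is NOT the kubelet implementation.
# runtime_status LAST_OK NOW [THRESHOLD]
#   LAST_OK    time (seconds) of the last successful runtime check
#   NOW        current time (seconds)
#   THRESHOLD  tolerated downtime in seconds (default 30, the new value)
# Prints "NotReady" once the runtime has been down longer than THRESHOLD,
# otherwise "Ready".
runtime_status() {
    last_ok=$1
    now=$2
    threshold=${3:-30}
    if [ $((now - last_ok)) -gt "$threshold" ]; then
        echo NotReady
    else
        echo Ready
    fi
}

runtime_status 0 20       # down 20s, 30s threshold
runtime_status 0 45       # down 45s, 30s threshold
runtime_status 0 45 300   # down 45s, old 5-minute threshold
```

With the old 5-minute constant, the same 45 seconds of docker downtime still yields Ready, which matches the delay seen in the original report.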
This has been merged into OCP and is in OCP v3.5.0.16 or newer.
Verified on v3.5.0.16+a26133a: after docker is stopped, the node reports NotReady within about 30s.

Steps:

    [root@ip-172-18-11-215 ~]# openshift version
    openshift v3.5.0.16+a26133a
    kubernetes v1.5.2+43a9be4
    etcd 3.1.0

    [root@ip-172-18-11-215 ~]# oc get node
    NAME                            STATUS     AGE
    ip-172-18-11-215.ec2.internal   Ready      15m
    ip-172-18-7-97.ec2.internal     Ready      15m

    [root@ip-172-18-11-215 ~]# systemctl stop docker
    [root@ip-172-18-11-215 ~]# date
    Fri Feb 3 22:28:23 EST 2017

    [root@ip-172-18-11-215 ~]# oc get node
    NAME                            STATUS     AGE
    ip-172-18-11-215.ec2.internal   Ready      16m
    ip-172-18-7-97.ec2.internal     Ready      16m

    [root@ip-172-18-11-215 ~]# oc get node
    NAME                            STATUS     AGE
    ip-172-18-11-215.ec2.internal   NotReady   16m
    ip-172-18-7-97.ec2.internal     Ready      16m

    [root@ip-172-18-11-215 ~]# date
    Fri Feb 3 22:28:49 EST 2017
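The manual check above could be scripted so the delay is measured rather than estimated from two `date` calls. This is only a sketch: the helper names `node_status` and `wait_not_ready` are hypothetical, and it assumes `oc get nodes` prints NAME and STATUS as the first two whitespace-separated columns, as in the output shown in this report.

```shell
#!/bin/sh
# node_status NODE
# Reads `oc get nodes` output on stdin and prints the STATUS column for NODE.
node_status() {
    awk -v node="$1" '$1 == node { print $2 }'
}

# wait_not_ready NODE [TIMEOUT]
# Polls `oc get nodes` about once per second until NODE reports NotReady,
# then prints the elapsed seconds. Gives up after TIMEOUT seconds (default 120).
wait_not_ready() {
    node=$1
    timeout=${2:-120}
    start=$(date +%s)
    while :; do
        status=$(oc get nodes | node_status "$node")
        elapsed=$(( $(date +%s) - start ))
        case $status in
            NotReady*)
                echo "$node NotReady after ${elapsed}s"
                return 0
                ;;
        esac
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "$node still '$status' after ${elapsed}s" >&2
            return 1
        fi
        sleep 1
    done
}
```

A possible use, run on the node right after stopping docker: `systemctl stop docker; wait_not_ready ip-172-18-11-215.ec2.internal`. The reported time includes polling latency, so it is an upper bound on the actual delay.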
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884