Bug 1418461 - Node shows as ready after failure of docker after node-monitor-grace-period expired
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: 1770059
 
Reported: 2017-02-01 21:42 UTC by Ryan Howe
Modified: 2020-03-11 15:42 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The kubelet had a fixed constant for how long it would tolerate the docker daemon being down before reporting the node as not ready. That constant was set to 5 minutes. Consequence: When the docker daemon went down, the node could continue to report Ready for up to 5 minutes. Fix: The kubelet now waits 30s for the container runtime to be down before reporting the node as NotReady. Result: The node reports NotReady faster when the docker daemon is down.
Clone Of:
: 1770059 (view as bug list)
Environment:
Last Closed: 2017-04-12 19:11:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Ryan Howe 2017-02-01 21:42:41 UTC
Description of problem:

When docker fails, the atomic-openshift-node service still sends a Ready event to the master.

Version-Release number of selected component (if applicable):
3.3

How reproducible:
100%

Steps to Reproduce:
1. Set node-monitor-grace-period=20s on the masters 

kubernetesMasterConfig:
  admissionConfig:
    pluginConfig:
      {}
  apiServerArguments: 
  controllerArguments:
    node-monitor-grace-period:
      - "20s"
2. Stop docker 
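
The master-side check configured above can be sketched as follows (a minimal illustration with made-up names, not OpenShift API calls): the node controller only marks a node NotReady once kubelet status updates go stale past node-monitor-grace-period, so it never fires as long as the kubelet itself keeps heartbeating Ready while docker is down.

```python
from datetime import datetime, timedelta

# Illustrative model of node-monitor-grace-period: the node controller
# times a node out only when kubelet heartbeats stop arriving.
def node_ready(last_heartbeat: datetime, now: datetime,
               grace_period: timedelta = timedelta(seconds=20)) -> bool:
    """True while the most recent kubelet heartbeat is within the grace period."""
    return now - last_heartbeat <= grace_period

t0 = datetime(2017, 2, 1, 16, 34, 14)
assert node_ready(t0, t0 + timedelta(seconds=10))      # fresh heartbeat: Ready
assert not node_ready(t0, t0 + timedelta(seconds=30))  # stale heartbeat: NotReady
```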


Actual results:
Node continues to show as Ready

Expected results:
After 20 seconds, the node shows as NotReady

Additional info:

# date; ssh node-1 "systemctl stop docker"
Wed Feb  1 16:34:14 EST 2017

[root@master-2 ~]# date;  oc get nodes
Wed Feb  1 16:37:15 EST 2017
NAME                   STATUS    AGE
master-1.example.com   Ready     103d
master-2.example.com   Ready     103d
node-1.example.com     Ready     5d

# date;  oc get nodes
Wed Feb  1 16:40:26 EST 2017
NAME                   STATUS     AGE
master-1.example.com   Ready      103d
master-2.example.com   Ready      103d
node-1.example.com     NotReady   5d
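
The timestamps above line up with the 5-minute runtime-down tolerance described in comment 1. A quick arithmetic check:

```python
from datetime import datetime

stop     = datetime(2017, 2, 1, 16, 34, 14)  # docker stopped on node-1
still_ok = datetime(2017, 2, 1, 16, 37, 15)  # node-1 still Ready
notready = datetime(2017, 2, 1, 16, 40, 26)  # node-1 NotReady

# ~3 minutes after stopping docker the node was still Ready, and it had
# flipped to NotReady by ~6 minutes: consistent with a fixed 5-minute
# runtime-down tolerance in the kubelet rather than the configured 20s.
assert (still_ok - stop).total_seconds() < 5 * 60
assert (notready - stop).total_seconds() > 5 * 60
```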

Comment 1 Derek Carr 2017-02-02 18:37:47 UTC
See related issue:
https://github.com/kubernetes/kubernetes/issues/30534

The kubelet used to have a fixed constant for how long it would tolerate the docker daemon being down before reporting the node as not ready.  That was previously set to 5 minutes, which meant that it could take up to 5 minutes for the kubelet to report it was no longer ready.

This was fixed via PR upstream in k8s 1.6:
https://github.com/kubernetes/kubernetes/pull/38527

The new behavior is that the kubelet will wait 30s for the container runtime to be down before reporting the node as NotReady.
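
As a rough sketch (illustrative Python, not the actual kubelet source), the change amounts to shrinking the tolerance the kubelet applies to a down container runtime before flipping its own readiness:

```python
from datetime import datetime, timedelta

# Illustrative model of the behavior change: the kubelet compares the time
# of the last successful container-runtime health check against a tolerance.
OLD_RUNTIME_TOLERANCE = timedelta(minutes=5)   # fixed constant before the fix
NEW_RUNTIME_TOLERANCE = timedelta(seconds=30)  # behavior after the k8s 1.6 fix

def kubelet_ready(last_runtime_up: datetime, now: datetime,
                  tolerance: timedelta) -> bool:
    """The kubelet reports Ready only while the runtime was healthy recently."""
    return now - last_runtime_up <= tolerance

down = datetime(2017, 2, 3, 22, 28, 23)
one_min_later = down + timedelta(minutes=1)
assert kubelet_ready(down, one_min_later, OLD_RUNTIME_TOLERANCE)      # old: still Ready
assert not kubelet_ready(down, one_min_later, NEW_RUNTIME_TOLERANCE)  # new: NotReady
```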

Origin PR:
https://github.com/openshift/origin/pull/12776

Comment 2 Troy Dawson 2017-02-03 22:35:41 UTC
This has been merged into OCP and is in OCP v3.5.0.16 or newer.

Comment 3 DeShuai Ma 2017-02-04 03:30:42 UTC
Verified on v3.5.0.16+a26133a: when docker is stopped, the node reports NotReady in about 30s.

Steps
[root@ip-172-18-11-215 ~]# openshift version
openshift v3.5.0.16+a26133a
kubernetes v1.5.2+43a9be4
etcd 3.1.0
[root@ip-172-18-11-215 ~]# oc get node
NAME                            STATUS    AGE
ip-172-18-11-215.ec2.internal   Ready     15m
ip-172-18-7-97.ec2.internal     Ready     15m
[root@ip-172-18-11-215 ~]# systemctl stop docker
[root@ip-172-18-11-215 ~]# date
Fri Feb  3 22:28:23 EST 2017
[root@ip-172-18-11-215 ~]# oc get node
NAME                            STATUS    AGE
ip-172-18-11-215.ec2.internal   Ready     16m
ip-172-18-7-97.ec2.internal     Ready     16m
[root@ip-172-18-11-215 ~]# oc get node
NAME                            STATUS     AGE
ip-172-18-11-215.ec2.internal   NotReady   16m
ip-172-18-7-97.ec2.internal     Ready      16m
[root@ip-172-18-11-215 ~]# date
Fri Feb  3 22:28:49 EST 2017
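
The two timestamps in the transcript confirm the fix: the node flipped to NotReady within about 26 seconds, inside the new 30s tolerance.

```python
from datetime import datetime

stopped  = datetime(2017, 2, 3, 22, 28, 23)  # shortly after `systemctl stop docker`
notready = datetime(2017, 2, 3, 22, 28, 49)  # node observed NotReady

# NotReady was observed within ~26s of stopping docker,
# matching the new 30s runtime-down tolerance.
assert (notready - stopped).total_seconds() <= 30
```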

Comment 5 errata-xmlrpc 2017-04-12 19:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

