Bug 1418461 - Node shows as ready after failure of docker after node-monitor-grace-period expired
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: 1770059
 
Reported: 2017-02-01 21:42 UTC by Ryan Howe
Modified: 2020-03-11 15:42 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The kubelet had a fixed constant for how long it would tolerate the docker daemon being down before reporting the node as not ready. That constant was set to 5 minutes. Consequence: When the docker daemon went down, the node could continue to report Ready for up to 5 minutes. Fix: The kubelet now waits 30s for the container runtime to be down before reporting the node as NotReady. Result: The node reports NotReady faster when the docker daemon is down.
Clone Of:
: 1770059 (view as bug list)
Environment:
Last Closed: 2017-04-12 19:11:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Ryan Howe 2017-02-01 21:42:41 UTC
Description of problem:

When docker fails, the atomic-openshift-node service still sends a Ready event to the master.

Version-Release number of selected component (if applicable):
3.3

How reproducible:
100%

Steps to Reproduce:
1. Set node-monitor-grace-period=20s on the masters 

kubernetesMasterConfig:
  admissionConfig:
    pluginConfig:
      {}
  apiServerArguments: 
  controllerArguments:
    node-monitor-grace-period:
      - "20s"
2. Stop docker 
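
The master-side check configured above can be sketched as follows (a minimal illustration with made-up names, not OpenShift API calls): the node controller only marks a node NotReady once kubelet status updates go stale past node-monitor-grace-period, so it never fires as long as the kubelet itself keeps heartbeating Ready while docker is down.

```python
from datetime import datetime, timedelta

# Illustrative model of node-monitor-grace-period: the node controller
# times a node out only when kubelet heartbeats stop arriving.
def node_ready(last_heartbeat: datetime, now: datetime,
               grace_period: timedelta = timedelta(seconds=20)) -> bool:
    """True while the most recent kubelet heartbeat is within the grace period."""
    return now - last_heartbeat <= grace_period

t0 = datetime(2017, 2, 1, 16, 34, 14)
assert node_ready(t0, t0 + timedelta(seconds=10))      # fresh heartbeat: Ready
assert not node_ready(t0, t0 + timedelta(seconds=30))  # stale heartbeat: NotReady
```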


Actual results:
Node continues to show as Ready

Expected results:
After 20 seconds, the node shows as NotReady

Additional info:

# date; ssh node-1 "systemctl stop docker"
Wed Feb  1 16:34:14 EST 2017

[root@master-2 ~]# date;  oc get nodes
Wed Feb  1 16:37:15 EST 2017
NAME                   STATUS    AGE
master-1.example.com   Ready     103d
master-2.example.com   Ready     103d
node-1.example.com     Ready     5d

# date;  oc get nodes
Wed Feb  1 16:40:26 EST 2017
NAME                   STATUS     AGE
master-1.example.com   Ready      103d
master-2.example.com   Ready      103d
node-1.example.com     NotReady   5d
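
The timestamps above line up with the 5-minute runtime-down tolerance described in comment 1. A quick arithmetic check:

```python
from datetime import datetime

stop     = datetime(2017, 2, 1, 16, 34, 14)  # docker stopped on node-1
still_ok = datetime(2017, 2, 1, 16, 37, 15)  # node-1 still Ready
notready = datetime(2017, 2, 1, 16, 40, 26)  # node-1 NotReady

# ~3 minutes after stopping docker the node was still Ready, and it had
# flipped to NotReady by ~6 minutes: consistent with a fixed 5-minute
# runtime-down tolerance in the kubelet rather than the configured 20s.
assert (still_ok - stop).total_seconds() < 5 * 60
assert (notready - stop).total_seconds() > 5 * 60
```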

Comment 1 Derek Carr 2017-02-02 18:37:47 UTC
See related issue:
https://github.com/kubernetes/kubernetes/issues/30534

The kubelet used to have a fixed constant for how long it would tolerate the docker daemon being down before reporting the node as not ready.  That was previously set to 5 minutes, which meant that it could take up to 5 minutes for the kubelet to report it was no longer ready.

This was fixed via PR upstream in k8s 1.6:
https://github.com/kubernetes/kubernetes/pull/38527

The new behavior is that the kubelet will wait 30s for the container runtime to be down before reporting the node as NotReady.
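
As a rough sketch (illustrative Python, not the actual kubelet source), the change amounts to shrinking the tolerance the kubelet applies to a down container runtime before flipping its own readiness:

```python
from datetime import datetime, timedelta

# Illustrative model of the behavior change: the kubelet compares the time
# of the last successful container-runtime health check against a tolerance.
OLD_RUNTIME_TOLERANCE = timedelta(minutes=5)   # fixed constant before the fix
NEW_RUNTIME_TOLERANCE = timedelta(seconds=30)  # behavior after the k8s 1.6 fix

def kubelet_ready(last_runtime_up: datetime, now: datetime,
                  tolerance: timedelta) -> bool:
    """The kubelet reports Ready only while the runtime was healthy recently."""
    return now - last_runtime_up <= tolerance

down = datetime(2017, 2, 3, 22, 28, 23)
one_min_later = down + timedelta(minutes=1)
assert kubelet_ready(down, one_min_later, OLD_RUNTIME_TOLERANCE)      # old: still Ready
assert not kubelet_ready(down, one_min_later, NEW_RUNTIME_TOLERANCE)  # new: NotReady
```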

Origin PR:
https://github.com/openshift/origin/pull/12776

Comment 2 Troy Dawson 2017-02-03 22:35:41 UTC
This has been merged into OCP and is in OCP v3.5.0.16 or newer.

Comment 3 DeShuai Ma 2017-02-04 03:30:42 UTC
Verified on v3.5.0.16+a26133a: when docker is stopped, the node reports NotReady in about 30s.

Steps
[root@ip-172-18-11-215 ~]# openshift version
openshift v3.5.0.16+a26133a
kubernetes v1.5.2+43a9be4
etcd 3.1.0
[root@ip-172-18-11-215 ~]# oc get node
NAME                            STATUS    AGE
ip-172-18-11-215.ec2.internal   Ready     15m
ip-172-18-7-97.ec2.internal     Ready     15m
[root@ip-172-18-11-215 ~]# systemctl stop docker
[root@ip-172-18-11-215 ~]# date
Fri Feb  3 22:28:23 EST 2017
[root@ip-172-18-11-215 ~]# oc get node
NAME                            STATUS    AGE
ip-172-18-11-215.ec2.internal   Ready     16m
ip-172-18-7-97.ec2.internal     Ready     16m
[root@ip-172-18-11-215 ~]# oc get node
NAME                            STATUS     AGE
ip-172-18-11-215.ec2.internal   NotReady   16m
ip-172-18-7-97.ec2.internal     Ready      16m
[root@ip-172-18-11-215 ~]# date
Fri Feb  3 22:28:49 EST 2017
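
The two timestamps in the transcript confirm the fix: the node flipped to NotReady within about 26 seconds, inside the new 30s tolerance.

```python
from datetime import datetime

stopped  = datetime(2017, 2, 3, 22, 28, 23)  # shortly after `systemctl stop docker`
notready = datetime(2017, 2, 3, 22, 28, 49)  # node observed NotReady

# NotReady was observed within ~26s of stopping docker,
# matching the new 30s runtime-down tolerance.
assert (notready - stopped).total_seconds() <= 30
```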

Comment 5 errata-xmlrpc 2017-04-12 19:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

