1907939 – Nodes goes into NotReady state (VMware)

Summary: Nodes goes into NotReady state (VMware)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.z
Assignee:	Lukasz Szaszkiewicz
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Depends On:	1907938
Blocks:

Reported:	2020-12-15 14:45 UTC by OpenShift BugZilla Robot
Modified:	2024-06-13 23:44 UTC
CC List:	51 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Sets the TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) socket option, which controls how long transmitted data may remain unacknowledged before the connection is forcibly closed. Without that option, detecting a broken network connection can take up to 15 minutes. During that time the platform might be unavailable.
Clone Of:
Environment:
Last Closed:	2021-02-03 10:11:43 UTC
Target Upstream Version:
Embargoed:
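
The Doc Text above can be illustrated with a minimal sketch. It is not the code from the fix; it assumes Linux defaults (tcp_retries2 = 15, a minimum retransmission timeout of 200 ms and a maximum of 120 s) to show roughly where "up to 15 minutes" comes from, and then sets TCP_USER_TIMEOUT on a plain socket. The function names and the 30-second value are hypothetical and chosen only for illustration.

```python
import socket

# Linux value of TCP_USER_TIMEOUT from <linux/tcp.h>; used only as a fallback
# if this Python build does not export the constant (assumption: Linux host).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def default_retransmit_window(retries=15, rto_min=0.2, rto_max=120.0):
    """Seconds spent waiting over the first send plus `retries` retransmissions.

    Each wait doubles (exponential backoff) until it is capped at rto_max.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


def set_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT (milliseconds) and return the value the kernel reports."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    # Roughly 15 minutes with the assumed defaults.
    print(f"default window: {default_retransmit_window():.1f} s")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # 30_000 ms is an illustrative value, not the one used by the fix.
        print("TCP_USER_TIMEOUT now", set_user_timeout(s, 30_000), "ms")
```

With the option set, the kernel gives up on unacknowledged data after the configured interval instead of waiting for the full backoff sequence.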

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25791	0	None	closed	Bug 1907939: Nodes goes into NotReady state (VMware)	2021-02-21 19:53:48 UTC
Red Hat Product Errata	RHSA-2021:0281	0	None	None	None	2021-02-03 10:12:28 UTC

Comment 1 Xingxing Xia 2021-01-13 12:55:50 UTC

Per Lukasz's Slack message, we "could try to verify/repo" this bug "in the same vein" as bug 1905194's verification steps. In this case "the client is kubelet not an api server".

Comment 2 Lukasz Szaszkiewicz 2021-01-15 09:58:04 UTC

PR in the merge queue.

Comment 4 Xingxing Xia 2021-01-21 11:58:04 UTC

Launched an older 4.4.32 cluster, which does not have the fix, and tried to reproduce using the steps from bug 1905194:
[root@ip-10-0-155-170 ~]# ps -eF | grep kubelet
root        1361       1  5 364934 162984 1 07:01 ?        00:15:04 kubelet ...
[root@ip-10-0-155-170 ~]# nsenter -t 1361 -n /bin/bash
[root@ip-10-0-155-170 ~]# export PS1='[\u@\h \D{%F %T %Z} \W]\$ '
[root@ip-10-0-155-170 2021-01-21 11:47:55 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet
tcp        0      0 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         keepalive (17.14/0/0)
tcp        0      0 127.0.0.1:10248         127.0.0.1:56062         ESTABLISHED 1361/kubelet         keepalive (3.24/0/0)
tcp6       0      0 10.0.155.170:10250      10.0.175.250:38820      ESTABLISHED 1361/kubelet         keepalive (2.02/0/0)
tcp6       0      0 10.0.155.170:10250      10.128.2.9:43442        ESTABLISHED 1361/kubelet         keepalive (9.18/0/0)
tcp6       0      0 10.0.155.170:10250      10.0.175.250:36658      ESTABLISHED 1361/kubelet         keepalive (6.62/0/0)
tcp6       0      0 10.0.155.170:10250      10.128.2.9:43250        ESTABLISHED 1361/kubelet         keepalive (14.36/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:16 UTC ~]# iptables -I INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 53336 --sport 6443 -j DROP
[root@ip-10-0-155-170 2021-01-21 11:49:32 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0      0 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         keepalive (13.09/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:36 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0    572 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (5.14/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:47 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1141 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (2.53/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:49 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (0.21/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:52 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (8.92/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:56 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (7.25/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:58 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0      0 10.0.155.170:43620      10.0.163.182:6443       ESTABLISHED 1361/kubelet         keepalive (29.49/0/0)

In another terminal, watch the node:
$ oc get no ip-10-0-155-170.us-east-2.compute.internal --no-headers -w

But I could not reproduce the issue; the node never went NotReady. I will instead use the steps in the 4.5 clone, bug 1907938#c6, to verify.
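
The netstat samples above can also be read programmatically. The following is a rough sketch, not part of the original verification steps. It assumes the timer column has the form "name (countdown/retransmissions/keepalive-probes)" and that a connection with the retransmission timer "on", a non-empty Send-Q and at least one retransmission is stalled. The helper names are hypothetical.

```python
import re

# Trailing timer column printed by `netstat --timer`,
# e.g. "keepalive (17.14/0/0)" or "on (5.14/5/0)".
TIMER_RE = re.compile(r"(\w+) \((\d+\.\d+)/(\d+)/(\d+)\)\s*$")


def parse_line(line):
    """Return a dict describing one netstat TCP line, or None if it does not match."""
    fields = line.split()
    match = TIMER_RE.search(line)
    if len(fields) < 6 or match is None:
        return None
    return {
        "send_q": int(fields[2]),            # bytes not yet acknowledged by the peer
        "local": fields[3],
        "remote": fields[4],
        "state": fields[5],
        "timer": match.group(1),             # "on" is the retransmission timer
        "retransmits": int(match.group(3)),  # assumed meaning of the second number
    }


def looks_stalled(conn):
    """Heuristic: data is queued and is being retransmitted without an ACK."""
    return conn["timer"] == "on" and conn["send_q"] > 0 and conn["retransmits"] > 0


if __name__ == "__main__":
    sample = (
        "tcp        0    572 10.0.155.170:53336      10.0.134.252:6443       "
        "ESTABLISHED 1361/kubelet         on (5.14/5/0)"
    )
    conn = parse_line(sample)
    print(conn["remote"], "stalled" if looks_stalled(conn) else "healthy")
```

Applied to the transcript, the lines sampled before the iptables rule parse as healthy keepalive connections, while the lines with a growing Send-Q are flagged as stalled.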

Comment 5 Xingxing Xia 2021-01-22 10:17:52 UTC

Per Node QE's steps in the higher-version clone, bug 1901208#c5 ("Discussed ... if the cluster runs for 5-8 hours without failures, the fix is good to go"), I launched a vSphere environment with 4.4.0-0.nightly-2021-01-21-172857, which includes the fix. The cluster status (pods, COs, nodes, etc.) stayed healthy, so I am moving this bug to VERIFIED:
$ oc get no
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h55m   v1.17.1+f06151f
compute-1         Ready    worker   7h54m   v1.17.1+f06151f
control-plane-0   Ready    master   8h      v1.17.1+f06151f
control-plane-1   Ready    master   8h      v1.17.1+f06151f
control-plane-2   Ready    master   8h      v1.17.1+f06151f
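
A soak check like the one above could be scripted by parsing the `oc get no` table. This is only a sketch under the assumption that the output keeps the NAME and STATUS columns shown here; the function name is hypothetical and the script does not call `oc` itself.

```python
def not_ready_nodes(table):
    """Return the names of nodes whose STATUS column does not include Ready."""
    rows = [line.split() for line in table.strip().splitlines()]
    header, body = rows[0], rows[1:]
    name_i, status_i = header.index("NAME"), header.index("STATUS")
    # STATUS may be a comma-separated list such as "Ready,SchedulingDisabled".
    return [r[name_i] for r in body if "Ready" not in r[status_i].split(",")]


if __name__ == "__main__":
    output = """\
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h55m   v1.17.1+f06151f
compute-1         Ready    worker   7h54m   v1.17.1+f06151f
control-plane-0   Ready    master   8h      v1.17.1+f06151f
"""
    print(not_ready_nodes(output))
```

Running it periodically during the 5-8 hour window and alerting on a non-empty result would approximate the manual check.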

Comment 8 errata-xmlrpc 2021-02-03 10:11:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.4.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0281
