Bug 1907939 - Nodes goes into NotReady state (VMware)
Summary: Nodes goes into NotReady state (VMware)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.z
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On: 1907938
Blocks:
Reported: 2020-12-15 14:45 UTC by OpenShift BugZilla Robot
Modified: 2022-12-08 11:08 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Sets the TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) socket option, which controls how long transmitted data may remain unacknowledged before the connection is forcefully closed. Without this option, detecting a broken network connection can take up to 15 minutes, and the platform might be unavailable during that time.
Clone Of:
Environment:
Last Closed: 2021-02-03 10:11:43 UTC
Target Upstream Version:
Embargoed:
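
To illustrate the Doc Text above, here is a minimal Python sketch that sets TCP_USER_TIMEOUT on a socket. It is not the code from the linked pull request; the helper name set_tcp_user_timeout and the 30000 ms value mentioned below are assumptions made for this example, and the option is Linux-specific.

```python
import socket

# Python exposes socket.TCP_USER_TIMEOUT on Linux. The fallback 18 is the
# value of TCP_USER_TIMEOUT in <linux/tcp.h>; it only makes sense on Linux.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_tcp_user_timeout(sock: socket.socket, timeout_ms: int) -> int:
    """Set TCP_USER_TIMEOUT (in milliseconds) and return the value read back.

    Hypothetical helper for illustration only. Once transmitted data has
    stayed unacknowledged for longer than timeout_ms, the kernel closes the
    connection instead of continuing to retransmit.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
```

A caller would apply this to a TCP socket before or after connecting, for example set_tcp_user_timeout(sock, 30000) for an illustrative 30-second limit.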



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25791 | 0 | None | closed | Bug 1907939: Nodes goes into NotReady state (VMware) | 2021-02-21 19:53:48 UTC
Red Hat Product Errata | RHSA-2021:0281 | 0 | None | None | None | 2021-02-03 10:12:28 UTC

Comment 1 Xingxing Xia 2021-01-13 12:55:50 UTC
Per Lukasz's Slack message, we "could try to verify/repro" this bug "in the same vein" as the verification steps for bug 1905194. In this case, "the client is kubelet not an api server".

Comment 2 Lukasz Szaszkiewicz 2021-01-15 09:58:04 UTC
PR in the merge queue.

Comment 4 Xingxing Xia 2021-01-21 11:58:04 UTC
Launched an old 4.4.32 cluster, which does not have the fix, and tried to reproduce using the steps from bug 1905194:
[root@ip-10-0-155-170 ~]# ps -eF | grep kubelet
root        1361       1  5 364934 162984 1 07:01 ?        00:15:04 kubelet ...
[root@ip-10-0-155-170 ~]# nsenter -t 1361 -n /bin/bash
[root@ip-10-0-155-170 ~]# export PS1='[\u@\h \D{%F %T %Z} \W]\$ '
[root@ip-10-0-155-170 2021-01-21 11:47:55 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet
tcp        0      0 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         keepalive (17.14/0/0)
tcp        0      0 127.0.0.1:10248         127.0.0.1:56062         ESTABLISHED 1361/kubelet         keepalive (3.24/0/0)
tcp6       0      0 10.0.155.170:10250      10.0.175.250:38820      ESTABLISHED 1361/kubelet         keepalive (2.02/0/0)
tcp6       0      0 10.0.155.170:10250      10.128.2.9:43442        ESTABLISHED 1361/kubelet         keepalive (9.18/0/0)
tcp6       0      0 10.0.155.170:10250      10.0.175.250:36658      ESTABLISHED 1361/kubelet         keepalive (6.62/0/0)
tcp6       0      0 10.0.155.170:10250      10.128.2.9:43250        ESTABLISHED 1361/kubelet         keepalive (14.36/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:16 UTC ~]# iptables -I INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 53336 --sport 6443 -j DROP
[root@ip-10-0-155-170 2021-01-21 11:49:32 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0      0 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         keepalive (13.09/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:36 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0    572 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (5.14/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:47 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1141 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (2.53/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:49 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (0.21/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:52 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (8.92/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:56 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (7.25/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:58 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0      0 10.0.155.170:43620      10.0.163.182:6443       ESTABLISHED 1361/kubelet         keepalive (29.49/0/0)

In another terminal, watch the node:
$ oc get no ip-10-0-155-170.us-east-2.compute.internal --no-headers -w

But I could not reproduce the issue; the node never went NotReady. I will instead use the steps in the 4.5 clone, bug 1907938#c6, to verify.
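
As an aid to reading the netstat output in this comment, the sketch below parses the --timer column, which has the form "name (a/b/c)". It assumes the first number is the time left on the timer in seconds and the second is the retransmission count; the third is carried along uninterpreted. The names Timer, parse_timer, and is_retransmitting are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Timer(NamedTuple):
    state: str          # e.g. "keepalive", "on", "off"
    remaining_s: float  # time left on the timer, in seconds
    retransmits: int    # retransmission count (assumed meaning)
    probes: int         # third counter, kept uninterpreted


# Matches a trailing timer field such as "on (8.92/6/0)".
_TIMER_RE = re.compile(r"(\w+) \((\d+(?:\.\d+)?)/(\d+)/(\d+)\)\s*$")


def parse_timer(line: str) -> Optional[Timer]:
    """Extract the timer field from one netstat --timer output line."""
    m = _TIMER_RE.search(line)
    if m is None:
        return None
    return Timer(m.group(1), float(m.group(2)), int(m.group(3)), int(m.group(4)))


def is_retransmitting(line: str) -> bool:
    """True when the retransmission timer is running with a non-zero count."""
    t = parse_timer(line)
    return t is not None and t.state == "on" and t.retransmits > 0
```

In the transcript above, the lines showing "on (.../5/0)" and "on (.../6/0)" with a growing Send-Q would be flagged by is_retransmitting, while the "keepalive" lines would not.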

Comment 5 Xingxing Xia 2021-01-22 10:17:52 UTC
Per Node QE's steps in the higher-version clone, bug 1901208#c5 ("Discussed ... if the cluster runs for 5-8 hours without failures, the fix is good to go"), I launched a vSphere environment with 4.4.0-0.nightly-2021-01-21-172857, which includes the fix. The cluster status (pods, COs, nodes, etc.) has stayed healthy, so I am moving this bug to VERIFIED:
$ oc get no
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h55m   v1.17.1+f06151f
compute-1         Ready    worker   7h54m   v1.17.1+f06151f
control-plane-0   Ready    master   8h      v1.17.1+f06151f
control-plane-1   Ready    master   8h      v1.17.1+f06151f
control-plane-2   Ready    master   8h      v1.17.1+f06151f
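
The check above was done by eye. For a multi-hour soak test, a small script could scan captured `oc get no` output for nodes that are not Ready. This is only a sketch: the function name not_ready_nodes is hypothetical, and it assumes the default table layout with NAME and STATUS columns.

```python
from typing import List


def not_ready_nodes(oc_output: str) -> List[str]:
    """Return the names of nodes whose STATUS column does not include Ready."""
    lines = [ln for ln in oc_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_i = header.index("NAME")
    status_i = header.index("STATUS")
    bad = []
    for ln in lines[1:]:
        cols = ln.split()
        # STATUS can be comma-joined, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in cols[status_i].split(","):
            bad.append(cols[name_i])
    return bad
```

An empty result for every sample taken during the run would correspond to the "no failures" criterion quoted above.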

Comment 8 errata-xmlrpc 2021-02-03 10:11:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.4.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0281

