Bug 1907939

Summary: Nodes go into NotReady state (VMware)
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: kube-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.5
CC: aarapov, abhinkum, achakrat, amsingh, andcosta, aos-bugs, aprajapa, asheth, asm, aygarg, bleanhar, ccoleman, dbenoit, deparker, dphillip, emachado, jmalde, jokerman, jparrill, kechung, kewang, lszaszki, mchebbi, mdhanve, mdunnett, mfojtik, mnunes, mrobson, nagrawal, naygupta, openshift-bugs-escalate, pchavan, pkhaire, rbryant, rdave, rheinzma, rpalathi, rphillips, sagopina, schoudha, scuppett, smaudet, sople, sparpate, srengan, ssonigra, tschelle, tstellar, tsweeney, weinliu, xxia
Target Milestone: ---
Keywords: Bugfix, UpcomingSprint
Target Release: 4.4.z
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Sets the TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) socket option, which controls how long transmitted data may remain unacknowledged before the connection is forcefully closed. Without that option, detecting a broken network connection can take up to 15 minutes, during which time the platform might be unavailable.
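
The socket option the fix sets can be illustrated in isolation. A minimal Python sketch (Linux-only; `socket.TCP_USER_TIMEOUT` is available since Python 3.6, and the 30-second value here is an arbitrary illustration, not necessarily the value the actual fix uses):

```python
import socket

# TCP_USER_TIMEOUT (Linux >= 2.6.37): maximum time, in milliseconds, that
# transmitted data may remain unacknowledged before the kernel forcibly
# closes the connection instead of retransmitting for ~15 minutes.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 30_000)  # 30 s

# Read the value back to confirm the kernel accepted it.
timeout_ms = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)
```

With this option set, a connection to a dead peer errors out within the configured timeout rather than hanging until the kernel exhausts its retransmission budget.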
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-03 10:11:43 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1907938    
Bug Blocks:    

Comment 1 Xingxing Xia 2021-01-13 12:55:50 UTC
Per Lukasz's Slack message, we "could try to verify/repro" this bug "in the same vein" as bug 1905194's verification steps. In this case "the client is kubelet not an api server".

Comment 2 Lukasz Szaszkiewicz 2021-01-15 09:58:04 UTC
PR in the merge queue.

Comment 4 Xingxing Xia 2021-01-21 11:58:04 UTC
Tried to launch an old 4.4.32 cluster, which does not have the fix, and tried to reproduce with bug 1905194's steps:
[root@ip-10-0-155-170 ~]# ps -eF | grep kubelet
root        1361       1  5 364934 162984 1 07:01 ?        00:15:04 kubelet ...
[root@ip-10-0-155-170 ~]# nsenter -t 1361 -n /bin/bash
[root@ip-10-0-155-170 ~]# export PS1='[\u@\h \D{%F %T %Z} \W]\$ '
[root@ip-10-0-155-170 2021-01-21 11:47:55 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet
tcp        0      0 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         keepalive (17.14/0/0)
tcp        0      0 127.0.0.1:10248         127.0.0.1:56062         ESTABLISHED 1361/kubelet         keepalive (3.24/0/0)
tcp6       0      0 10.0.155.170:10250      10.0.175.250:38820      ESTABLISHED 1361/kubelet         keepalive (2.02/0/0)
tcp6       0      0 10.0.155.170:10250      10.128.2.9:43442        ESTABLISHED 1361/kubelet         keepalive (9.18/0/0)
tcp6       0      0 10.0.155.170:10250      10.0.175.250:36658      ESTABLISHED 1361/kubelet         keepalive (6.62/0/0)
tcp6       0      0 10.0.155.170:10250      10.128.2.9:43250        ESTABLISHED 1361/kubelet         keepalive (14.36/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:16 UTC ~]# iptables -I INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 53336 --sport 6443 -j DROP
[root@ip-10-0-155-170 2021-01-21 11:49:32 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0      0 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         keepalive (13.09/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:36 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0    572 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (5.14/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:47 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1141 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (2.53/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:49 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (0.21/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:52 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (8.92/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:56 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0   1263 10.0.155.170:53336      10.0.134.252:6443       ESTABLISHED 1361/kubelet         on (7.25/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:58 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp        0      0 10.0.155.170:43620      10.0.163.182:6443       ESTABLISHED 1361/kubelet         keepalive (29.49/0/0)

In another terminal, watch:
$ oc get no ip-10-0-155-170.us-east-2.compute.internal --no-headers -w

But it could not be reproduced; the node never went NotReady. Will instead use the steps in the 4.5 clone bug 1907938#c6 to verify.
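
For context on where the "up to 15 minutes" figure in the doc text comes from: once the iptables rule drops the ACKs, the `on (…/5/0)` timer in the netstat output above shows the kernel retransmitting with exponential backoff. With Linux defaults (tcp_retries2 = 15 retransmissions, initial RTO of roughly 200 ms, backoff capped at TCP_RTO_MAX = 120 s), the connection is only declared dead after roughly 15 minutes. A quick sketch of that arithmetic:

```python
# Approximate time before an unacknowledged TCP connection is dropped,
# using Linux defaults: tcp_retries2 = 15 retransmissions, an initial
# RTO of ~0.2 s, and exponential backoff capped at TCP_RTO_MAX = 120 s.
rto, total = 0.2, 0.0
for _ in range(16):          # initial send + 15 retransmissions
    total += min(rto, 120)   # wait for an ACK before the next retransmit
    rto *= 2                 # exponential backoff
# total comes out to roughly 924.6 s, i.e. about 15.4 minutes
```

This is an approximation (the actual RTO starts from the measured RTT), but it matches the order of magnitude of the outage window that TCP_USER_TIMEOUT is meant to shorten.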

Comment 5 Xingxing Xia 2021-01-22 10:17:52 UTC
Per the higher-version clone bug 1901208#c5, Node QE's steps say: "Discussed ... if the cluster runs for 5-8 hours without failures, the fix is good to go". I launched a vSphere env of 4.4.0-0.nightly-2021-01-21-172857, which includes the fix; the cluster status (pods, COs, nodes, etc.) stays good, thus moving to VERIFIED:
$ oc get no
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h55m   v1.17.1+f06151f
compute-1         Ready    worker   7h54m   v1.17.1+f06151f
control-plane-0   Ready    master   8h      v1.17.1+f06151f
control-plane-1   Ready    master   8h      v1.17.1+f06151f
control-plane-2   Ready    master   8h      v1.17.1+f06151f

Comment 8 errata-xmlrpc 2021-02-03 10:11:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.4.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0281