Bug 1907939
Summary: | Nodes goes into NotReady state (VMware) | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot> |
Component: | kube-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki> |
Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.5 | CC: | aarapov, abhinkum, achakrat, amsingh, andcosta, aos-bugs, aprajapa, asheth, asm, aygarg, bleanhar, ccoleman, dbenoit, deparker, dphillip, emachado, jmalde, jokerman, jparrill, kechung, kewang, lszaszki, mchebbi, mdhanve, mdunnett, mfojtik, mnunes, mrobson, nagrawal, naygupta, openshift-bugs-escalate, pchavan, pkhaire, rbryant, rdave, rheinzma, rpalathi, rphillips, sagopina, schoudha, scuppett, smaudet, sople, sparpate, srengan, ssonigra, tschelle, tstellar, tsweeney, weinliu, xxia |
Target Milestone: | --- | Keywords: | Bugfix, UpcomingSprint |
Target Release: | 4.4.z | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Sets the TCP_USER_TIMEOUT socket option (https://man7.org/linux/man-pages/man7/tcp.7.html), which controls how long transmitted data may remain unacknowledged before the connection is forcibly closed.
Without this option, detecting a broken network connection can take up to 15 minutes. During that time the platform might be unavailable.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-02-03 10:11:43 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1907938 | ||
Bug Blocks: |
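As an illustration of the socket option named in the Doc Text above, the following is a minimal sketch of setting TCP_USER_TIMEOUT on a Linux TCP socket. It is not the actual fix, and the helper name set_tcp_user_timeout and the 30-second value are assumptions chosen for the example; this report does not state which value the fix uses.

```python
import socket

# Linux value of TCP_USER_TIMEOUT from <linux/tcp.h>, used as a fallback
# in case this Python build does not export the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_tcp_user_timeout(sock: socket.socket, timeout_ms: int) -> int:
    """Set TCP_USER_TIMEOUT (in milliseconds) and return the value read back.

    Hypothetical helper for illustration only; Linux-specific.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # 30 seconds is an assumed example value, not taken from the fix.
        print(set_tcp_user_timeout(s, 30_000))
```

With the option set, the kernel closes the connection once transmitted data has stayed unacknowledged for longer than the given time, instead of waiting for the full retransmission backoff to run out.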
Comment 1
Xingxing Xia
2021-01-13 12:55:50 UTC
PR in the merge queue. Tried to launch old 4.4.32, which does not have the fix, and tried to reproduce with the steps from bug 1905194:

[root@ip-10-0-155-170 ~]# ps -eF | grep kubelet
root 1361 1 5 364934 162984 1 07:01 ? 00:15:04 kubelet ...
[root@ip-10-0-155-170 ~]# nsenter -t 1361 -n /bin/bash
[root@ip-10-0-155-170 ~]# export PS1='[\u@\h \D{%F %T %Z} \W]\$ '
[root@ip-10-0-155-170 2021-01-21 11:47:55 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet
tcp 0 0 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet keepalive (17.14/0/0)
tcp 0 0 127.0.0.1:10248 127.0.0.1:56062 ESTABLISHED 1361/kubelet keepalive (3.24/0/0)
tcp6 0 0 10.0.155.170:10250 10.0.175.250:38820 ESTABLISHED 1361/kubelet keepalive (2.02/0/0)
tcp6 0 0 10.0.155.170:10250 10.128.2.9:43442 ESTABLISHED 1361/kubelet keepalive (9.18/0/0)
tcp6 0 0 10.0.155.170:10250 10.0.175.250:36658 ESTABLISHED 1361/kubelet keepalive (6.62/0/0)
tcp6 0 0 10.0.155.170:10250 10.128.2.9:43250 ESTABLISHED 1361/kubelet keepalive (14.36/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:16 UTC ~]# iptables -I INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 53336 --sport 6443 -j DROP
[root@ip-10-0-155-170 2021-01-21 11:49:32 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 0 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet keepalive (13.09/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:36 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 572 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet on (5.14/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:47 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 1141 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet on (2.53/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:49 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 1263 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet on (0.21/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:52 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 1263 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet on (8.92/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:56 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 1263 10.0.155.170:53336 10.0.134.252:6443 ESTABLISHED 1361/kubelet on (7.25/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:58 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp 0 0 10.0.155.170:43620 10.0.163.182:6443 ESTABLISHED 1361/kubelet keepalive (29.49/0/0)

In another terminal, watch the node:

$ oc get no ip-10-0-155-170.us-east-2.compute.internal --no-headers -w

But I could not reproduce the problem; the node never went NotReady. Will switch to the steps in the 4.5 clone, bug 1907938#c6, to verify.

Per the Node QE steps in the higher-version clone, bug 1901208#c5 ("Discussed ... if the cluster runs for 5-8 hours without failures, the fix is good to go"), I launched a vSphere env of 4.4.0-0.nightly-2021-01-21-172857, which includes the fix. The cluster status (pods, COs, nodes, etc.) has stayed healthy, thus moving to VERIFIED:

$ oc get no
NAME STATUS ROLES AGE VERSION
compute-0 Ready worker 7h55m v1.17.1+f06151f
compute-1 Ready worker 7h54m v1.17.1+f06151f
control-plane-0 Ready master 8h v1.17.1+f06151f
control-plane-1 Ready master 8h v1.17.1+f06151f
control-plane-2 Ready master 8h v1.17.1+f06151f

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.4.33 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0281
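In the netstat output in Comment 1, the middle number of the timer field, for example "on (5.14/5/0)", is the retransmission count, which climbs once the iptables rule drops the return traffic. As a rough worked example of where the "up to 15 minutes" figure in the Doc Text can come from, the sketch below sums the exponential retransmission backoff. It assumes the usual Linux defaults, none of which are stated in this report: tcp_retries2 = 15, a minimum retransmission timeout of 200 ms, and a cap of 120 s. Real connections derive the timeout from the measured round-trip time, so actual durations differ.

```python
def retransmission_timeout_seconds(retries: int = 15,
                                   initial_rto: float = 0.2,
                                   max_rto: float = 120.0) -> float:
    """Sum the waits of an exponential backoff capped at max_rto.

    There is one wait before the first retransmission and one after each
    of the `retries` retransmissions, so retries + 1 intervals in total.
    The default values are assumed Linux defaults, not taken from the bug.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):
        total += min(rto, max_rto)
        rto *= 2  # back off: double the timeout after every attempt
    return total


if __name__ == "__main__":
    seconds = retransmission_timeout_seconds()
    print(f"{seconds:.1f} s, about {seconds / 60:.1f} minutes")
```

Under these assumptions the total comes to roughly 924.6 seconds, a little over 15 minutes, which is the window that TCP_USER_TIMEOUT is meant to shorten.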