Per Lukasz's Slack message, we "could try to verify/repro" this bug "in the same vein" as bug 1905194's verification steps; in this case, "the client is kubelet not an api server".
PR in the merge queue.
Launched an old 4.4.32 cluster, which does not have the fix, and tried to reproduce with bug 1905194's steps:

[root@ip-10-0-155-170 ~]# ps -eF | grep kubelet
root      1361     1  5 364934 162984  1 07:01 ?     00:15:04 kubelet ...
[root@ip-10-0-155-170 ~]# nsenter -t 1361 -n /bin/bash
[root@ip-10-0-155-170 ~]# export PS1='[\u@\h \D{%F %T %Z} \W]\$ '
[root@ip-10-0-155-170 2021-01-21 11:47:55 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet
tcp    0    0 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  keepalive (17.14/0/0)
tcp    0    0 127.0.0.1:10248      127.0.0.1:56062      ESTABLISHED 1361/kubelet  keepalive (3.24/0/0)
tcp6   0    0 10.0.155.170:10250   10.0.175.250:38820   ESTABLISHED 1361/kubelet  keepalive (2.02/0/0)
tcp6   0    0 10.0.155.170:10250   10.128.2.9:43442     ESTABLISHED 1361/kubelet  keepalive (9.18/0/0)
tcp6   0    0 10.0.155.170:10250   10.0.175.250:36658   ESTABLISHED 1361/kubelet  keepalive (6.62/0/0)
tcp6   0    0 10.0.155.170:10250   10.128.2.9:43250     ESTABLISHED 1361/kubelet  keepalive (14.36/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:16 UTC ~]# iptables -I INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 53336 --sport 6443 -j DROP
[root@ip-10-0-155-170 2021-01-21 11:49:32 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0    0 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  keepalive (13.09/0/0)
[root@ip-10-0-155-170 2021-01-21 11:49:36 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0  572 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  on (5.14/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:47 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0 1141 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  on (2.53/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:49 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0 1263 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  on (0.21/5/0)
[root@ip-10-0-155-170 2021-01-21 11:49:52 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0 1263 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  on (8.92/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:56 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0 1263 10.0.155.170:53336   10.0.134.252:6443    ESTABLISHED 1361/kubelet  on (7.25/6/0)
[root@ip-10-0-155-170 2021-01-21 11:49:58 UTC ~]# netstat --tcp --numeric --program --timer --wide | grep kubelet | grep 6443
tcp    0    0 10.0.155.170:43620   10.0.163.182:6443    ESTABLISHED 1361/kubelet  keepalive (29.49/0/0)

In another terminal, watch the node status:

$ oc get no ip-10-0-155-170.us-east-2.compute.internal --no-headers -w

But I could not reproduce the issue: once the retransmission timers expired, kubelet simply opened a new connection to another apiserver (10.0.155.170:43620 -> 10.0.163.182:6443), and the node never went NotReady. Will switch to the steps in the 4.5 clone bug 1907938#c6 to verify.
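For reference, a minimal sketch that bundles the manual steps above into a single script. This is not part of the original verification; the PID lookup via pgrep, the single-:6443-connection assumption in the awk filter, the 120s wait, and the cleanup of the DROP rule are all my own additions:

#!/bin/bash
# Sketch of the reproduction steps above. Assumes root on a 4.4.x node;
# the ephemeral local port (53336 in the transcript) differs per run.
set -euo pipefail

# Find the kubelet PID and run the netstat/iptables calls in its netns,
# mirroring the nsenter usage in the transcript.
KUBELET_PID=$(pgrep -o -x kubelet)

# Show the kubelet -> apiserver (:6443) connection and its keepalive timer.
nsenter -t "$KUBELET_PID" -n netstat --tcp --numeric --program --timer --wide | grep 6443

# Read the ephemeral local port of that connection
# (assumption: exactly one established :6443 connection).
LPORT=$(nsenter -t "$KUBELET_PID" -n netstat -tn \
    | awk '/:6443 +ESTABLISHED/ {split($4, a, ":"); print a[2]; exit}')

# Drop inbound apiserver traffic for this connection so keepalives go unanswered.
nsenter -t "$KUBELET_PID" -n iptables -I INPUT -m state --state ESTABLISHED,RELATED \
    -p tcp --dport "$LPORT" --sport 6443 -j DROP

# Let retransmissions build up ("on (x/y/0)" in the timer column) until
# kubelet gives up and opens a new connection, then remove the rule.
sleep 120
nsenter -t "$KUBELET_PID" -n netstat --tcp --numeric --program --timer --wide | grep 6443
nsenter -t "$KUBELET_PID" -n iptables -D INPUT -m state --state ESTABLISHED,RELATED \
    -p tcp --dport "$LPORT" --sport 6443 -j DROP

While the script runs, watch the node from another terminal with the same oc get no ... -w command as above.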
Per the Node QE steps in the higher-version clone bug 1901208#c5 ("Discussed ... if the cluster runs for 5-8 hours without failures, the fix is good to go"), I launched a vSphere env of 4.4.0-0.nightly-2021-01-21-172857, which includes the fix. The cluster status (pods, COs, nodes, etc.) stayed healthy throughout, so moving to VERIFIED:

$ oc get no
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h55m   v1.17.1+f06151f
compute-1         Ready    worker   7h54m   v1.17.1+f06151f
control-plane-0   Ready    master   8h      v1.17.1+f06151f
control-plane-1   Ready    master   8h      v1.17.1+f06151f
control-plane-2   Ready    master   8h      v1.17.1+f06151f
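For the soak window, a loop along these lines can keep an eye on nodes, cluster operators, and pods. The 10-minute interval and the specific health conditions are my own choices rather than part of the quoted QE steps, and the awk filter assumes the usual AVAILABLE/PROGRESSING/DEGRADED column layout of oc get co:

#!/bin/bash
# Poll cluster health every 10 minutes during the soak and flag
# anything NotReady, degraded, or not Running/Completed.
while true; do
    date -u
    oc get nodes --no-headers | grep -v ' Ready ' || echo "all nodes Ready"
    oc get co --no-headers \
        | awk '$3 != "True" || $4 != "False" || $5 != "False" {print "CO not healthy:", $0}'
    oc get pods -A --no-headers | grep -Ev 'Running|Completed' || echo "all pods Running/Completed"
    sleep 600
done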
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.4.33 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0281