Description of problem:
One rollout of kube-apiserver hits about 7 minutes of intermittent "The connection to the server api... was refused - did you specify the right host or port?" disruption. A follow-up of https://issues.redhat.com/browse/OCPQE-808

Version-Release number of selected component (if applicable):
4.5.1

How reproducible:
Always

Steps to Reproduce:
1. Set up an Azure env. The matrix is: UPI on Azure OVN with RHCOS workers removed and RHEL workers scaled up (QE installation template: aos-4_5/upi-on-azure/versioned-installer-ovn-remove_rhcos_worker-ci). I don't yet know whether this is a common issue among different platforms, so I'm just writing down what my env is.
2. Check the nodes, pods, COs; all are good.
3. Roll out kube-apiserver:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
Meanwhile, watch oc commands with:
$ scripts/watch-during-kube-apiserver-rollout.sh |& tee watch-during-kube-apiserver-rollout.log

Actual results:
3. The watch shows about 7 minutes of intermittent disruption, as below:
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:50Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:53Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:55Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:58Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:17:01Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:08Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:11Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:14Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:16Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:19Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:22Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:25Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:27Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:51Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:52Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:52Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:53Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:55Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:55Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:56Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:59Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:01Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:04Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:07Z for oc get cm

Expected results:
3. Should have no 7-minute intermittent disruption.

Additional info:
$ cat scripts/watch-during-kube-apiserver-rollout.sh
while true
do
  oc get --raw "/" > /dev/null
  if [ "$?" != "0" ]; then
    echo "hit failure at $(date -u "+%Y-%m-%dT%H:%M:%SZ") for oc get --raw"
  fi
  oc get cm -A -o yaml > /dev/null
  if [ "$?" != "0" ]; then
    echo "hit failure at $(date -u "+%Y-%m-%dT%H:%M:%SZ") for oc get cm"
  fi
done
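The "about 7 mins" figure can be double-checked from the first and last "hit failure" timestamps in the watch log. A minimal sketch, assuming GNU date; window_seconds is a hypothetical helper, not part of the watch script:

```shell
# Hypothetical helper (assumes GNU date): seconds between two RFC3339
# timestamps, e.g. the first and last "hit failure" entries above.
window_seconds() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# First failure 2020-07-14T09:16:50Z, last failure 2020-07-14T09:24:07Z
window_seconds 2020-07-14T09:16:50Z 2020-07-14T09:24:07Z   # 437 seconds, ~7.3 min
```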
Doing more checks:
$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                       READY   STATUS    RESTARTS   AGE    LABELS
kube-apiserver-xxia0714-1-dmj5j-master-0   4/4     Running   0          126m   apiserver=true,app=openshift-kube-apiserver,revision=10
kube-apiserver-xxia0714-1-dmj5j-master-1   4/4     Running   0          129m   apiserver=true,app=openshift-kube-apiserver,revision=10
kube-apiserver-xxia0714-1-dmj5j-master-2   4/4     Running   0          132m   apiserver=true,app=openshift-kube-apiserver,revision=10

$ for i in kube-apiserver-xxia0714-1-dmj5j-master-{0..2}; do oc logs -n openshift-kube-apiserver -c kube-apiserver $i > logs/$i; done
$ grep -l "TLS handshake error.*168.63.129.16.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-*
logs/kube-apiserver-xxia0714-1-dmj5j-master-0
logs/kube-apiserver-xxia0714-1-dmj5j-master-1
logs/kube-apiserver-xxia0714-1-dmj5j-master-2
^ All 3 masters have 168.63.129.16 EOF entries, but https://bugzilla.redhat.com/show_bug.cgi?id=1836016#c6 said that issue was VERIFIED, so this bug 1856729 should exist as a new tracker. Meanwhile, there are also 10.0.1.5 EOF and 127.0.0.1 EOF entries; I think this is not entirely the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852396#c0 .
$ grep "TLS handshake error.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-* | grep -v "168.63.129.16"
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:12:37.665218 1 log.go:172] http: TLS handshake error from 10.0.1.5:40288: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:19:41.433882 1 log.go:172] http: TLS handshake error from 10.0.1.5:40726: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 11:40:32.153563 1 log.go:172] http: TLS handshake error from 10.0.1.5:45628: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.418709 1 log.go:172] http: TLS handshake error from 10.0.1.5:38516: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.462869 1 log.go:172] http: TLS handshake error from 10.0.1.5:38522: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:52:28.564260 1 log.go:172] http: TLS handshake error from 10.0.1.5:39074: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:05:04.065307 1 log.go:172] http: TLS handshake error from 10.0.1.5:39830: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:30:47.003264 1 log.go:172] http: TLS handshake error from 10.0.1.5:41420: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:53:17.972259 1 log.go:172] http: TLS handshake error from 10.0.1.5:42770: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 11:37:00.004179 1 log.go:172] http: TLS handshake error from 10.0.1.5:45404: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.912598 1 log.go:172] http: TLS handshake error from 127.0.0.1:49778: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926085 1 log.go:172] http: TLS handshake error from 127.0.0.1:49780: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926110 1 log.go:172] http: TLS handshake error from 127.0.0.1:49782: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926126 1 log.go:172] http: TLS handshake error from 127.0.0.1:49776: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:33:00.016383 1 log.go:172] http: TLS handshake error from 10.0.1.5:37890: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:43:34.493902 1 log.go:172] http: TLS handshake error from 10.0.1.5:38532: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:01:42.754045 1 log.go:172] http: TLS handshake error from 10.0.1.5:39630: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:19:01.077289 1 log.go:172] http: TLS handshake error from 10.0.1.5:40680: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:28:35.734614 1 log.go:172] http: TLS handshake error from 10.0.1.5:41280: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:31:47.377325 1 log.go:172] http: TLS handshake error from 10.0.1.5:41466: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 11:07:35.112080 1 log.go:172] http: TLS handshake error from 10.0.1.5:43608: EOF

10.0.1.5 is not an IP of any node:
$ oc get no -o wide
NAME                        STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                  CONTAINER-RUNTIME
xxia0714-1-dmj5j-master-0   Ready    master   9h    v1.18.3+6025c28   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-master-1   Ready    master   9h    v1.18.3+6025c28   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-master-2   Ready    master   9h    v1.18.3+6025c28   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-rhel-0     Ready    worker   8h    v1.18.3+b74c5ed   10.0.1.6      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.13.1.el7.x86_64     cri-o://1.18.2-18.rhaos4.5.git754d46b.el7
xxia0714-1-dmj5j-rhel-1     Ready    worker   8h    v1.18.3+b74c5ed   10.0.1.7      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.13.1.el7.x86_64     cri-o://1.18.2-18.rhaos4.5.git754d46b.el7
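To see how the EOF errors distribute across client IPs, the saved logs can be summarized with a small sketch (the function name is hypothetical; it only assumes the kube-apiserver log format shown above):

```shell
# Hypothetical helper: reads kube-apiserver log lines on stdin and counts
# "TLS handshake error from <ip>:<port>: EOF" occurrences per client IP.
count_tls_eof_by_ip() {
  sed -n 's/.*TLS handshake error from \([0-9.]*\):[0-9]*: EOF$/\1/p' \
    | sort | uniq -c | sort -rn
}

# Usage against the logs saved above:
#   cat logs/kube-apiserver-xxia0714-1-dmj5j-master-* | count_tls_eof_by_ip
```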
In QE automated CI failure issue https://issues.redhat.com/browse/OCPQE-1272 , another cluster "Disconnected UPI on Azure with RHCOS & RHEL7.8 & FIPS on & http_proxy & Etcd Encryption on" running 4.5.0-0.nightly-2020-07-13-224910 also hits the ~7-minute intermittent disruption. CI considers such a cluster abnormal, so the QE automated test runs exit prematurely, making the case pass ratio low. From this point of view, it is a TestBlocker, thus adding the keyword.
Please provide must-gather output.
So far I tried 4.6 IPI on GCP, UPI on GCP, and IPI on Azure envs; these envs didn't reproduce it. Per the occurrences in comment 0, comment 4, and comment 6, it seems only UPI Azure envs reproduce it.
We fixed the health check for Azure UPI in https://github.com/openshift/installer/pull/3720 https://bugzilla.redhat.com/show_bug.cgi?id=1836016 , which was verified by QE. It seems like a 4.6 fix, so please make sure you are using the appropriate UPI templates to get the fix.

*** This bug has been marked as a duplicate of bug 1836016 ***