Bug 1856729

Summary: [UPI Azure] One rollout of kube-apiserver hits "connection to the server api was refused - did you specify the right host or port" about 7 mins intermittent disruption
Product: OpenShift Container Platform Reporter: Xingxing Xia <xxia>
Component: InstallerAssignee: aos-install
Installer sub component: openshift-installer QA Contact: Gaoyun Pei <gpei>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: medium CC: adahiya, aos-bugs, esimard, mfojtik, yanyang
Version: 4.5Keywords: Reopened, TestBlocker
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-24 16:50:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Xingxing Xia 2020-07-14 10:21:54 UTC
Description of problem:
One rollout of kube-apiserver hits "connection to the server api was refused - did you specify the right host or port" about 7 mins intermittent disruption.
A follow up of https://issues.redhat.com/browse/OCPQE-808

Version-Release number of selected component (if applicable):
4.5.1

How reproducible:
Always

Steps to Reproduce:
1. Set up an Azure env, the matrix is: UPI on Azure OVN with RHCOS workers removed and RHEL workers scaled up (QE installation template: aos-4_5/upi-on-azure/versioned-installer-ovn-remove_rhcos_worker-ci).
I don't yet know whether this is common issue among different platforms, thus just write down what my env is.

2. Check the nodes, pods, COs, all are good.
3. Rollout kube-apiserver
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

And meantime watch oc commands with: 
$ scripts/watch-during-kube-apiserver-rollout.sh |& tee watch-during-kube-apiserver-rollout.log

Actual results:
3. The watch shows as below, about 7 mins intermittent disruption:
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:50Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:53Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:55Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:58Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:17:01Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:08Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:11Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:14Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:16Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:19Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:22Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:25Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:27Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:51Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:52Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:52Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:53Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:55Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:55Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:56Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:59Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:01Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:04Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:07Z for oc get cm

Expected results:
3. Should have no 7 mins intermittent disruption

Additional info:
$ cat scripts/watch-during-kube-apiserver-rollout.sh 
while true
do
  oc get --raw "/" > /dev/null
  if [ "$?" != "0" ]; then
    echo "hit failure at $(date -u "+%Y-%m-%dT%H:%M:%SZ") for oc get --raw"
  fi
  oc get cm -A -o yaml > /dev/null
  if [ "$?" != "0" ]; then
    echo "hit failure at $(date -u "+%Y-%m-%dT%H:%M:%SZ") for oc get cm"
  fi
done

Comment 2 Xingxing Xia 2020-07-14 11:57:58 UTC
Doing more check:
$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                       READY   STATUS    RESTARTS   AGE    LABELS
kube-apiserver-xxia0714-1-dmj5j-master-0   4/4     Running   0          126m   apiserver=true,app=openshift-kube-apiserver,revision=10
kube-apiserver-xxia0714-1-dmj5j-master-1   4/4     Running   0          129m   apiserver=true,app=openshift-kube-apiserver,revision=10
kube-apiserver-xxia0714-1-dmj5j-master-2   4/4     Running   0          132m   apiserver=true,app=openshift-kube-apiserver,revision=10
$ for i in kube-apiserver-xxia0714-1-dmj5j-master-{0..2}; do oc logs -n openshift-kube-apiserver -c kube-apiserver $i > logs/$i; done
$ grep -l "TLS handshake error.*168.63.129.16.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-*
logs/kube-apiserver-xxia0714-1-dmj5j-master-0
logs/kube-apiserver-xxia0714-1-dmj5j-master-1
logs/kube-apiserver-xxia0714-1-dmj5j-master-2

^ all 3 masters have 168.63.129.16 EOF, but https://bugzilla.redhat.com/show_bug.cgi?id=1836016#c6 said that issue was VERIFIED. So need this bug 1856729 should exist as new tracker.

Meantime, there are 10.0.1.5 EOF and 127.0.0.1 EOF, I think this is not totally same as https://bugzilla.redhat.com/show_bug.cgi?id=1852396#c0 .
$ grep "TLS handshake error.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-* | grep -v "168.63.129.16"
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:12:37.665218       1 log.go:172] http: TLS handshake error from 10.0.1.5:40288: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:19:41.433882       1 log.go:172] http: TLS handshake error from 10.0.1.5:40726: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 11:40:32.153563       1 log.go:172] http: TLS handshake error from 10.0.1.5:45628: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.418709       1 log.go:172] http: TLS handshake error from 10.0.1.5:38516: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.462869       1 log.go:172] http: TLS handshake error from 10.0.1.5:38522: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:52:28.564260       1 log.go:172] http: TLS handshake error from 10.0.1.5:39074: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:05:04.065307       1 log.go:172] http: TLS handshake error from 10.0.1.5:39830: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:30:47.003264       1 log.go:172] http: TLS handshake error from 10.0.1.5:41420: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:53:17.972259       1 log.go:172] http: TLS handshake error from 10.0.1.5:42770: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 11:37:00.004179       1 log.go:172] http: TLS handshake error from 10.0.1.5:45404: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.912598       1 log.go:172] http: TLS handshake error from 127.0.0.1:49778: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926085       1 log.go:172] http: TLS handshake error from 127.0.0.1:49780: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926110       1 log.go:172] http: TLS handshake error from 127.0.0.1:49782: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926126       1 log.go:172] http: TLS handshake error from 127.0.0.1:49776: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:33:00.016383       1 log.go:172] http: TLS handshake error from 10.0.1.5:37890: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:43:34.493902       1 log.go:172] http: TLS handshake error from 10.0.1.5:38532: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:01:42.754045       1 log.go:172] http: TLS handshake error from 10.0.1.5:39630: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:19:01.077289       1 log.go:172] http: TLS handshake error from 10.0.1.5:40680: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:28:35.734614       1 log.go:172] http: TLS handshake error from 10.0.1.5:41280: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:31:47.377325       1 log.go:172] http: TLS handshake error from 10.0.1.5:41466: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 11:07:35.112080       1 log.go:172] http: TLS handshake error from 10.0.1.5:43608: EOF

10.0.1.5 is not an IP of any node:
$ oc get no -o wide
NAME                        STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
xxia0714-1-dmj5j-master-0   Ready    master   9h    v1.18.3+6025c28   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-master-1   Ready    master   9h    v1.18.3+6025c28   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-master-2   Ready    master   9h    v1.18.3+6025c28   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-rhel-0     Ready    worker   8h    v1.18.3+b74c5ed   10.0.1.6      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.13.1.el7.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el7
xxia0714-1-dmj5j-rhel-1     Ready    worker   8h    v1.18.3+b74c5ed   10.0.1.7      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.13.1.el7.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el7

Comment 4 Xingxing Xia 2020-07-16 12:18:11 UTC
In QE auto CI failure issue https://issues.redhat.com/browse/OCPQE-1272 , another cluster "Disconnected UPI on Azure with RHCOS & RHEL7.8 & FIPS on & http_proxy & Etcd Encryption on" running 4.5.0-0.nightly-2020-07-13-224910 also hits 7 mins intermittent disruption. This is considered abnormal cluster by CI and thus QE auto test runs prematurely exit, making the cases pass ratio is not high. From this point of view, it is TestBlocker, thus adding the keyword.

Comment 5 Stefan Schimanski 2020-08-03 12:27:43 UTC
Please provide must-gather output.

Comment 7 Xingxing Xia 2020-08-20 11:34:19 UTC
So far tried 4.6 IPI on GCP, UPI on GCP, IPI on Azure envs, these envs didn't reproduce. Per occurrences in comment 0, comment 4 and comment 6, seemed only UPI Azure envs reproduced.

Comment 11 Abhinav Dahiya 2020-08-24 16:50:47 UTC
We fixed the health check for azure UPI in https://github.com/openshift/installer/pull/3720 https://bugzilla.redhat.com/show_bug.cgi?id=1836016 which are verified by the QE. seems like a 4.6 fix.
So please make sure you are using the appropriate UPI templates to get the fix.

*** This bug has been marked as a duplicate of bug 1836016 ***