Bug 1856729 - [UPI Azure] One rollout of kube-apiserver hits "connection to the server api was refused - did you specify the right host or port" about 7 mins intermittent disruption
Keywords:
Status: CLOSED DUPLICATE of bug 1836016
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-14 10:21 UTC by Xingxing Xia
Modified: 2020-08-24 16:50 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-24 16:50:47 UTC
Target Upstream Version:
Embargoed:



Description Xingxing Xia 2020-07-14 10:21:54 UTC
Description of problem:
During one rollout of kube-apiserver, clients hit "connection to the server api ... was refused - did you specify the right host or port" errors, about 7 minutes of intermittent disruption in total.
A follow-up of https://issues.redhat.com/browse/OCPQE-808

Version-Release number of selected component (if applicable):
4.5.1

How reproducible:
Always

Steps to Reproduce:
1. Set up an Azure env; the matrix is UPI on Azure with OVN, RHCOS workers removed, and RHEL workers scaled up (QE installation template: aos-4_5/upi-on-azure/versioned-installer-ovn-remove_rhcos_worker-ci).
I don't yet know whether this is a common issue across platforms, so I'm just recording what my env is.

2. Check the nodes, pods and COs; all are good.
3. Roll out kube-apiserver:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

Meanwhile, watch oc commands with the script shown in Additional info below:
$ scripts/watch-during-kube-apiserver-rollout.sh |& tee watch-during-kube-apiserver-rollout.log
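
To confirm the rollout is actually progressing while the watch runs, the kube-apiserver pod revision labels can be checked (the same command is used in comment 2 below):
$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver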

Actual results:
3. The watch output shows about 7 minutes of intermittent disruption:
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:50Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:53Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:55Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:16:58Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:17:01Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:08Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:11Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:14Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:16Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:19Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:22Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:25Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:20:27Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:51Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:52Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:52Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:53Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:55Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:55Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:56Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:23:59Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:01Z for oc get cm
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:04Z for oc get --raw
The connection to the server api.xxx.xxx.azure.xxx.xxx.com:6443 was refused - did you specify the right host or port?
hit failure at 2020-07-14T09:24:07Z for oc get cm
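
For reference, the failure timestamps above can be grouped into disruption windows with a small sketch (assuming GNU date and gawk; gaps longer than 60s between failures are treated as recovery):
$ awk '/hit failure at/ {print $4}' watch-during-kube-apiserver-rollout.log \
    | while read -r ts; do date -u -d "$ts" +%s; done \
    | awk 'NR==1 {start=prev=$1; next}
           $1-prev>60 {print strftime("%FT%TZ",start,1), "-", strftime("%FT%TZ",prev,1); start=$1}
           {prev=$1}
           END {print strftime("%FT%TZ",start,1), "-", strftime("%FT%TZ",prev,1)}'

On the log above this yields three bursts between 09:16:50Z and 09:24:07Z, i.e. the roughly 7-minute window.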

Expected results:
3. A kube-apiserver rollout should not cause about 7 minutes of intermittent disruption.

Additional info:
$ cat scripts/watch-during-kube-apiserver-rollout.sh 
#!/bin/bash
# Probe the API server in a tight loop and log a UTC timestamp for every
# failed request: once via the discovery root, once via a configmap list.
while true
do
  if ! oc get --raw "/" > /dev/null; then
    echo "hit failure at $(date -u "+%Y-%m-%dT%H:%M:%SZ") for oc get --raw"
  fi
  if ! oc get cm -A -o yaml > /dev/null; then
    echo "hit failure at $(date -u "+%Y-%m-%dT%H:%M:%SZ") for oc get cm"
  fi
done
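
A variant that logs only up/down transitions makes the disruption windows easier to read off directly (a minimal sketch, not the script used above; it adds a 1-second pause between probes and checks only the discovery root):
#!/bin/bash
# Track the last known state and print a line only when it flips.
state=up
while true
do
  if oc get --raw "/" > /dev/null 2>&1; then
    if [ "$state" = down ]; then
      echo "recovered at $(date -u "+%Y-%m-%dT%H:%M:%SZ")"
    fi
    state=up
  else
    if [ "$state" = up ]; then
      echo "went down at $(date -u "+%Y-%m-%dT%H:%M:%SZ")"
    fi
    state=down
  fi
  sleep 1
done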

Comment 2 Xingxing Xia 2020-07-14 11:57:58 UTC
Doing more checks:
$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                       READY   STATUS    RESTARTS   AGE    LABELS
kube-apiserver-xxia0714-1-dmj5j-master-0   4/4     Running   0          126m   apiserver=true,app=openshift-kube-apiserver,revision=10
kube-apiserver-xxia0714-1-dmj5j-master-1   4/4     Running   0          129m   apiserver=true,app=openshift-kube-apiserver,revision=10
kube-apiserver-xxia0714-1-dmj5j-master-2   4/4     Running   0          132m   apiserver=true,app=openshift-kube-apiserver,revision=10
$ for i in kube-apiserver-xxia0714-1-dmj5j-master-{0..2}; do oc logs -n openshift-kube-apiserver -c kube-apiserver $i > logs/$i; done
$ grep -l "TLS handshake error.*168.63.129.16.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-*
logs/kube-apiserver-xxia0714-1-dmj5j-master-0
logs/kube-apiserver-xxia0714-1-dmj5j-master-1
logs/kube-apiserver-xxia0714-1-dmj5j-master-2

^ All 3 masters have 168.63.129.16 EOF errors, but https://bugzilla.redhat.com/show_bug.cgi?id=1836016#c6 said that issue was VERIFIED, so this bug 1856729 should exist as a new tracker.

Meanwhile, there are also 10.0.1.5 and 127.0.0.1 EOF errors; I think this is not quite the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852396#c0 .
$ grep "TLS handshake error.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-* | grep -v "168.63.129.16"
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:12:37.665218       1 log.go:172] http: TLS handshake error from 10.0.1.5:40288: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:19:41.433882       1 log.go:172] http: TLS handshake error from 10.0.1.5:40726: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 11:40:32.153563       1 log.go:172] http: TLS handshake error from 10.0.1.5:45628: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.418709       1 log.go:172] http: TLS handshake error from 10.0.1.5:38516: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.462869       1 log.go:172] http: TLS handshake error from 10.0.1.5:38522: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:52:28.564260       1 log.go:172] http: TLS handshake error from 10.0.1.5:39074: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:05:04.065307       1 log.go:172] http: TLS handshake error from 10.0.1.5:39830: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:30:47.003264       1 log.go:172] http: TLS handshake error from 10.0.1.5:41420: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:53:17.972259       1 log.go:172] http: TLS handshake error from 10.0.1.5:42770: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 11:37:00.004179       1 log.go:172] http: TLS handshake error from 10.0.1.5:45404: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.912598       1 log.go:172] http: TLS handshake error from 127.0.0.1:49778: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926085       1 log.go:172] http: TLS handshake error from 127.0.0.1:49780: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926110       1 log.go:172] http: TLS handshake error from 127.0.0.1:49782: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926126       1 log.go:172] http: TLS handshake error from 127.0.0.1:49776: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:33:00.016383       1 log.go:172] http: TLS handshake error from 10.0.1.5:37890: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:43:34.493902       1 log.go:172] http: TLS handshake error from 10.0.1.5:38532: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:01:42.754045       1 log.go:172] http: TLS handshake error from 10.0.1.5:39630: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:19:01.077289       1 log.go:172] http: TLS handshake error from 10.0.1.5:40680: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:28:35.734614       1 log.go:172] http: TLS handshake error from 10.0.1.5:41280: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:31:47.377325       1 log.go:172] http: TLS handshake error from 10.0.1.5:41466: EOF
logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 11:07:35.112080       1 log.go:172] http: TLS handshake error from 10.0.1.5:43608: EOF

10.0.1.5 is not the IP of any node:
$ oc get no -o wide
NAME                        STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
xxia0714-1-dmj5j-master-0   Ready    master   9h    v1.18.3+6025c28   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-master-1   Ready    master   9h    v1.18.3+6025c28   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-master-2   Ready    master   9h    v1.18.3+6025c28   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
xxia0714-1-dmj5j-rhel-0     Ready    worker   8h    v1.18.3+b74c5ed   10.0.1.6      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.13.1.el7.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el7
xxia0714-1-dmj5j-rhel-1     Ready    worker   8h    v1.18.3+b74c5ed   10.0.1.7      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.13.1.el7.x86_64    cri-o://1.18.2-18.rhaos4.5.git754d46b.el7
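
Since 10.0.1.5 is not a node IP, it could be the frontend IP of the Azure internal load balancer in front of the API, which would make these EOFs health-probe traffic. A hedged way to check, assuming az CLI access (the resource group and LB name below are placeholders, not values from this cluster):
$ az network lb frontend-ip list -g "$RESOURCE_GROUP" --lb-name "$INTERNAL_LB_NAME" -o table
If the listed private IP is 10.0.1.5, the handshake EOFs are load-balancer probes rather than client traffic.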

Comment 4 Xingxing Xia 2020-07-16 12:18:11 UTC
In QE automated CI failure issue https://issues.redhat.com/browse/OCPQE-1272 , another cluster "Disconnected UPI on Azure with RHCOS & RHEL7.8 & FIPS on & http_proxy & Etcd Encryption on" running 4.5.0-0.nightly-2020-07-13-224910 also hit about 7 minutes of intermittent disruption. CI considers such a cluster abnormal, so the QE automated test runs exit prematurely, which lowers the test-case pass ratio. From this point of view, it is a TestBlocker, so I am adding the keyword.

Comment 5 Stefan Schimanski 2020-08-03 12:27:43 UTC
Please provide must-gather output.

Comment 7 Xingxing Xia 2020-08-20 11:34:19 UTC
So far I have tried 4.6 IPI on GCP, UPI on GCP, and IPI on Azure envs; none of them reproduced the issue. Per the occurrences in comment 0, comment 4 and comment 6, it seems only UPI Azure envs reproduce it.

Comment 11 Abhinav Dahiya 2020-08-24 16:50:47 UTC
We fixed the health check for Azure UPI in https://github.com/openshift/installer/pull/3720 (https://bugzilla.redhat.com/show_bug.cgi?id=1836016), which was verified by QE; it seems to be a 4.6 fix.
So please make sure you are using the appropriate UPI templates to get the fix.
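
For anyone verifying that their templates picked up the health-check fix, the probes currently configured on the cluster's load balancers can be listed with a hedged sketch like this (resource group and LB name are placeholders):
$ az network lb probe list -g "$RESOURCE_GROUP" --lb-name "$LB_NAME" -o table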

*** This bug has been marked as a duplicate of bug 1836016 ***

