Bug 1856729
Summary: | [UPI Azure] One rollout of kube-apiserver hits "connection to the server api was refused - did you specify the right host or port" about 7 mins intermittent disruption | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
Component: | Installer | Assignee: | aos-install |
Installer sub component: | openshift-installer | QA Contact: | Gaoyun Pei <gpei> |
Status: | CLOSED DUPLICATE | Docs Contact: | |
Severity: | high | ||
Priority: | medium | CC: | adahiya, aos-bugs, esimard, mfojtik, yanyang |
Version: | 4.5 | Keywords: | Reopened, TestBlocker |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-08-24 16:50:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Xingxing Xia
2020-07-14 10:21:54 UTC
Doing more check: $ oc get po -n openshift-kube-apiserver --show-labels -l apiserver NAME READY STATUS RESTARTS AGE LABELS kube-apiserver-xxia0714-1-dmj5j-master-0 4/4 Running 0 126m apiserver=true,app=openshift-kube-apiserver,revision=10 kube-apiserver-xxia0714-1-dmj5j-master-1 4/4 Running 0 129m apiserver=true,app=openshift-kube-apiserver,revision=10 kube-apiserver-xxia0714-1-dmj5j-master-2 4/4 Running 0 132m apiserver=true,app=openshift-kube-apiserver,revision=10 $ for i in kube-apiserver-xxia0714-1-dmj5j-master-{0..2}; do oc logs -n openshift-kube-apiserver -c kube-apiserver $i > logs/$i; done $ grep -l "TLS handshake error.*168.63.129.16.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-* logs/kube-apiserver-xxia0714-1-dmj5j-master-0 logs/kube-apiserver-xxia0714-1-dmj5j-master-1 logs/kube-apiserver-xxia0714-1-dmj5j-master-2 ^ all 3 masters have 168.63.129.16 EOF, but https://bugzilla.redhat.com/show_bug.cgi?id=1836016#c6 said that issue was VERIFIED. So need this bug 1856729 should exist as new tracker. Meantime, there are 10.0.1.5 EOF and 127.0.0.1 EOF, I think this is not totally same as https://bugzilla.redhat.com/show_bug.cgi?id=1852396#c0 . $ grep "TLS handshake error.*EOF" logs/kube-apiserver-xxia0714-1-dmj5j-master-* | grep -v "168.63.129.16" logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:12:37.665218 1 log.go:172] http: TLS handshake error from 10.0.1.5:40288: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 10:19:41.433882 1 log.go:172] http: TLS handshake error from 10.0.1.5:40726: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-0:I0714 11:40:32.153563 1 log.go:172] http: TLS handshake error from 10.0.1.5:45628: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.418709 1 log.go:172] http: TLS handshake error from 10.0.1.5:38516: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:43:24.462869 1 log.go:172] http: TLS handshake error from 10.0.1.5:38522: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 09:52:28.564260 1 log.go:172] http: TLS handshake error from 10.0.1.5:39074: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:05:04.065307 1 log.go:172] http: TLS handshake error from 10.0.1.5:39830: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:30:47.003264 1 log.go:172] http: TLS handshake error from 10.0.1.5:41420: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 10:53:17.972259 1 log.go:172] http: TLS handshake error from 10.0.1.5:42770: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-1:I0714 11:37:00.004179 1 log.go:172] http: TLS handshake error from 10.0.1.5:45404: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.912598 1 log.go:172] http: TLS handshake error from 127.0.0.1:49778: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926085 1 log.go:172] http: TLS handshake error from 127.0.0.1:49780: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926110 1 log.go:172] http: TLS handshake error from 127.0.0.1:49782: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:30:39.926126 1 log.go:172] http: TLS handshake error from 127.0.0.1:49776: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:33:00.016383 1 log.go:172] http: TLS handshake error from 10.0.1.5:37890: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 09:43:34.493902 1 log.go:172] http: TLS handshake error from 10.0.1.5:38532: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:01:42.754045 1 log.go:172] http: TLS handshake error from 10.0.1.5:39630: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:19:01.077289 1 log.go:172] http: TLS handshake error from 10.0.1.5:40680: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:28:35.734614 1 log.go:172] http: TLS handshake error from 10.0.1.5:41280: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 10:31:47.377325 1 log.go:172] http: TLS handshake error from 10.0.1.5:41466: EOF logs/kube-apiserver-xxia0714-1-dmj5j-master-2:I0714 11:07:35.112080 1 log.go:172] http: TLS handshake error from 10.0.1.5:43608: EOF 10.0.1.5 is not an IP of any node: $ oc get no -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME xxia0714-1-dmj5j-master-0 Ready master 9h v1.18.3+6025c28 10.0.0.7 <none> Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa) 4.18.0-193.12.1.el8_2.x86_64 cri-o://1.18.2-18.rhaos4.5.git754d46b.el8 xxia0714-1-dmj5j-master-1 Ready master 9h v1.18.3+6025c28 10.0.0.6 <none> Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa) 4.18.0-193.12.1.el8_2.x86_64 cri-o://1.18.2-18.rhaos4.5.git754d46b.el8 xxia0714-1-dmj5j-master-2 Ready master 9h v1.18.3+6025c28 10.0.0.8 <none> Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa) 4.18.0-193.12.1.el8_2.x86_64 cri-o://1.18.2-18.rhaos4.5.git754d46b.el8 xxia0714-1-dmj5j-rhel-0 Ready worker 8h v1.18.3+b74c5ed 10.0.1.6 <none> Red Hat Enterprise Linux Server 7.8 (Maipo) 3.10.0-1127.13.1.el7.x86_64 cri-o://1.18.2-18.rhaos4.5.git754d46b.el7 xxia0714-1-dmj5j-rhel-1 Ready worker 8h v1.18.3+b74c5ed 10.0.1.7 <none> Red Hat Enterprise Linux Server 7.8 (Maipo) 3.10.0-1127.13.1.el7.x86_64 cri-o://1.18.2-18.rhaos4.5.git754d46b.el7 In QE auto CI failure issue https://issues.redhat.com/browse/OCPQE-1272 , another cluster "Disconnected UPI on Azure with RHCOS & RHEL7.8 & FIPS on & http_proxy & Etcd Encryption on" running 4.5.0-0.nightly-2020-07-13-224910 also hits 7 mins intermittent disruption. This is considered abnormal cluster by CI and thus QE auto test runs prematurely exit, making the cases pass ratio is not high. From this point of view, it is TestBlocker, thus adding the keyword. Please provide must-gather output. So far tried 4.6 IPI on GCP, UPI on GCP, IPI on Azure envs, these envs didn't reproduce. Per occurrences in comment 0, comment 4 and comment 6, seemed only UPI Azure envs reproduced. We fixed the health check for azure UPI in https://github.com/openshift/installer/pull/3720 https://bugzilla.redhat.com/show_bug.cgi?id=1836016 which are verified by the QE. seems like a 4.6 fix. So please make sure you are using the appropriate UPI templates to get the fix. *** This bug has been marked as a duplicate of bug 1836016 *** |