Bug 1876167
Summary: | [RFE] need to be able to enable and disable openshift-apiserver connectivity checks | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Luis Sanchez <sanchezl> |
Component: | openshift-apiserver | Assignee: | Luis Sanchez <sanchezl> |
Status: | CLOSED WONTFIX | QA Contact: | Xingxing Xia <xxia> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.6 | CC: | aos-bugs, kewang, mfojtik, sttts, xxia |
Target Milestone: | --- | ||
Target Release: | 4.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1876166 | Environment: | |
Last Closed: | 2021-02-10 20:43:09 UTC | Type: | --- |
Description
Luis Sanchez
2020-09-06 00:24:37 UTC
I think this is the same issue I am seeing on recent nightlies on Azure (4.6.0-0.nightly-2020-09-13-023938)

[m@localhost 46-azure-install]$ oc get events -n openshift-apiserver |less
57m Warning ConnectivityOutageDetected deployment/apiserver Connectivity outage detected: load-balancer-api-external: failed to establish a TCP connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 1.025622661s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 1.995560051s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 4.846401561s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 2.004979486s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 1.00476912s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 996.230724ms: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m Normal ConnectivityRestored deployment/apiserver Connectivity restored after 4.934150936s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
57m Warning ConnectivityOutageDetected deployment/apiserver Connectivity outage detected: load-balancer-api-internal: failed to establish a TCP connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout

These show up constantly on the web console as well. The connec

(In reply to Mike Gahagan from comment #1)
> I think this is the same issue I am seeing on recent nightlies on Azure
> (4.6.0-0.nightly-2020-09-13-023938)
>
> These show up constantly on the web console as well.

This was fixed by bug 1878794.

Tested in 4.6.0-0.nightly-2020-10-09-224055: the connectivity check is disabled by default. However, it cannot be enabled, because doing so triggers the CrashLoopBackOff problem below.

$ oc edit openshiftapiserver/cluster
...
spec:
  ...
  unsupportedConfigOverrides:
    operator:
      enableConnectivityCheckController: "True"

Then watch: both the KAS and OAS pods stay in CrashLoopBackOff on the check-endpoints container:

$ oc get po -n openshift-apiserver --show-labels -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
apiserver-657f5c5c87-x9z94 0/2 Init:0/1 0 5s <none> ip-10-0-62-115.ap-northeast-1.compute.internal <none> <none> apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
...
apiserver-657f5c5c87-8r4kd 1/2 CrashLoopBackOff 6 9m46s 10.128.0.24 ip-10-0-76-22.ap-northeast-1.compute.internal <none> <none> apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-9lncz 1/2 CrashLoopBackOff 6 10m 10.129.0.42 ip-10-0-51-70.ap-northeast-1.compute.internal <none> <none> apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-x9z94 1/2 CrashLoopBackOff 6 10m 10.130.0.45 ip-10-0-62-115.ap-northeast-1.compute.internal <none> <none> apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4

$ oc logs -c openshift-apiserver-check-endpoints apiserver-657f5c5c87-8r4kd -n openshift-apiserver
...
I1010 10:10:41.609467 1 base_controller.go:109] Starting #1 worker of check-endpoints controller ...
I1010 10:10:41.677077 1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME READY STATUS RESTARTS AGE LABELS
kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal 4/5 CrashLoopBackOff 26 5h50m apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-62-115.ap-northeast-1.compute.internal 4/5 CrashLoopBackOff 26 5h46m apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-76-22.ap-northeast-1.compute.internal 4/5 CrashLoopBackOff 26 5h41m apiserver=true,app=openshift-kube-apiserver,revision=9

$ oc logs -c kube-apiserver-check-endpoints kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:13:46.304353 1 base_controller.go:166] Shutting down CheckEndpointsTimeToStart ...
I1010 10:13:46.304377 1 base_controller.go:113] Shutting down worker of CheckEndpointsTimeToStart controller ...
I1010 10:13:46.304875 1 base_controller.go:103] All CheckEndpointsTimeToStart workers have been terminated
...
I1010 10:13:46.496032 1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

Checking the kube-apiserver logs shows the following lines repeating:

$ oc logs -c kube-apiserver kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:21:59.685378 18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1010 10:22:01.278489 18 aggregator.go:229] Finished OpenAPI spec generation after 1.593081531s
I1010 10:22:01.933031 18 controller.go:189] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io changed
...
I1010 10:22:02.054075 18 store.go:1378] Monitoring podnetworkconnectivitychecks.controlplane.operator.openshift.io count at <storage-prefix>//controlplane.operator.openshift.io/podnetworkconnectivitychecks
I1010 10:22:02.056595 18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
I1010 10:22:02.886290 18 controller.go:172] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io was removed
...
I1010 10:22:03.081078 18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1010 10:22:03.081123 18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
...
W1010 10:22:03.963110 18 controller.go:142] slow openapi aggregation of "podnetworkconnectivitychecks.controlplane.operator.openshift.io": 1.076830326s
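
For anyone trying to reproduce the override step without an interactive `oc edit` session, here is a minimal sketch that renders the same setting as a non-interactive `oc patch` merge patch. The field path and the string value "True" are taken from the report above. The function names are hypothetical, and the "False" value for turning the check off is an assumption, since the report only shows the enabled case. On the build tested above, applying the enabled patch reproduces the CrashLoopBackOff.

```python
import json
import shlex


def connectivity_check_patch(enabled: bool) -> dict:
    """Build a merge patch for the override shown in the report."""
    # The report sets the value as the string "True". Using "False" to
    # disable is an assumption; the report does not show that case.
    value = "True" if enabled else "False"
    return {
        "spec": {
            "unsupportedConfigOverrides": {
                "operator": {"enableConnectivityCheckController": value}
            }
        }
    }


def oc_patch_command(enabled: bool) -> str:
    """Render a shell-safe `oc patch` command line for openshiftapiserver/cluster."""
    patch = json.dumps(connectivity_check_patch(enabled), separators=(",", ":"))
    args = ["oc", "patch", "openshiftapiserver/cluster", "--type=merge", "-p", patch]
    # shlex.quote leaves plain arguments untouched and single-quotes the JSON.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(oc_patch_command(True))
```

The script only prints the command and does not contact a cluster, so the output can be reviewed before it is run by hand.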