Description of problem: The openshift-apiserver pods perform connectivity checks to detect and report network outages. When debugging certain types of issues, the events and log traffic generated by these checks can produce so much data that it becomes harder to pinpoint the root cause. We need a way to temporarily disable the connectivity checks.
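For illustration only, a minimal sketch of what temporarily disabling the checks could look like using the operator's unsupportedConfigOverrides field (the enableConnectivityCheckController key is the one exercised later in this report; treat the exact key and value format as an assumption that may vary by release):

# Hypothetical example: turn the connectivity check controller off via an
# unsupported override on the openshift-apiserver operator config.
$ oc patch openshiftapiserver/cluster --type=merge \
    -p '{"spec":{"unsupportedConfigOverrides":{"operator":{"enableConnectivityCheckController":"False"}}}}'
# Re-enable later by setting the value back to "True" or removing the override.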
I think this is the same issue I am seeing on recent nightlys on Azure (4.6.0-0.nightly-2020-09-13-023938):

[m@localhost 46-azure-install]$ oc get events -n openshift-apiserver |less
57m  Warning  ConnectivityOutageDetected  deployment/apiserver  Connectivity outage detected: load-balancer-api-external: failed to establish a TCP connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 1.025622661s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 1.995560051s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 4.846401561s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 2.004979486s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 1.00476912s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 996.230724ms: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m  Normal   ConnectivityRestored        deployment/apiserver  Connectivity restored after 4.934150936s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
57m  Warning  ConnectivityOutageDetected  deployment/apiserver  Connectivity outage detected: load-balancer-api-internal: failed to establish a TCP connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout

These show up constantly on the web console as well.
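If the goal is just to cut through the event noise while debugging, a standard field selector can narrow the listing to the connectivity-check reasons (plain oc/kubectl usage, nothing specific to this bug):

$ oc get events -n openshift-apiserver --field-selector reason=ConnectivityOutageDetected
$ oc get events -n openshift-apiserver --field-selector reason=ConnectivityRestored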
(In reply to Mike Gahagan from comment #1)
> I think this is the same issue I am seeing on recent nightlys on Azure
> (4.6.0-0.nightly-2020-09-13-023938)
>
> These show up constantly on the web console as well.

This was fixed by bug 1878794.
Tested in 4.6.0-0.nightly-2020-10-09-224055: by default the connectivity checks are disabled. But they cannot be enabled because of the CrashLoopBackOff problem below.

$ oc edit openshiftapiserver/cluster
...
spec:
  ...
  unsupportedConfigOverrides:
    operator:
      enableConnectivityCheckController: "True"

Then watch; both KAS and OAS pods keep CrashLoopBackOff on the check-endpoints container:

$ oc get po -n openshift-apiserver --show-labels -o wide -w
NAME                         READY   STATUS             RESTARTS   AGE     IP            NODE                                             NOMINATED NODE   READINESS GATES   LABELS
apiserver-657f5c5c87-x9z94   0/2     Init:0/1           0          5s      <none>        ip-10-0-62-115.ap-northeast-1.compute.internal   <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
...
apiserver-657f5c5c87-8r4kd   1/2     CrashLoopBackOff   6          9m46s   10.128.0.24   ip-10-0-76-22.ap-northeast-1.compute.internal    <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-9lncz   1/2     CrashLoopBackOff   6          10m     10.129.0.42   ip-10-0-51-70.ap-northeast-1.compute.internal    <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-x9z94   1/2     CrashLoopBackOff   6          10m     10.130.0.45   ip-10-0-62-115.ap-northeast-1.compute.internal   <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4

$ oc logs -c openshift-apiserver-check-endpoints apiserver-657f5c5c87-8r4kd -n openshift-apiserver
...
I1010 10:10:41.609467       1 base_controller.go:109] Starting #1 worker of check-endpoints controller ...
I1010 10:10:41.677077       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                            READY   STATUS             RESTARTS   AGE     LABELS
kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal    4/5     CrashLoopBackOff   26         5h50m   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-62-115.ap-northeast-1.compute.internal   4/5     CrashLoopBackOff   26         5h46m   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-76-22.ap-northeast-1.compute.internal    4/5     CrashLoopBackOff   26         5h41m   apiserver=true,app=openshift-kube-apiserver,revision=9

$ oc logs -c kube-apiserver-check-endpoints kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:13:46.304353       1 base_controller.go:166] Shutting down CheckEndpointsTimeToStart ...
I1010 10:13:46.304377       1 base_controller.go:113] Shutting down worker of CheckEndpointsTimeToStart controller ...
I1010 10:13:46.304875       1 base_controller.go:103] All CheckEndpointsTimeToStart workers have been terminated ...
I1010 10:13:46.496032       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

Checking the kube-apiserver logs, the lines below are repeated:

$ oc logs -c kube-apiserver kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:21:59.685378      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1010 10:22:01.278489      18 aggregator.go:229] Finished OpenAPI spec generation after 1.593081531s
I1010 10:22:01.933031      18 controller.go:189] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io changed
...
I1010 10:22:02.054075      18 store.go:1378] Monitoring podnetworkconnectivitychecks.controlplane.operator.openshift.io count at <storage-prefix>//controlplane.operator.openshift.io/podnetworkconnectivitychecks
I1010 10:22:02.056595      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
I1010 10:22:02.886290      18 controller.go:172] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io was removed
...
I1010 10:22:03.081078      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1010 10:22:03.081123      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
...
W1010 10:22:03.963110      18 controller.go:142] slow openapi aggregation of "podnetworkconnectivitychecks.controlplane.operator.openshift.io": 1.076830326s
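The check-endpoints containers and the kube-apiserver log above both show podnetworkconnectivitychecks.controlplane.operator.openshift.io being added and then removed in a loop. As a quick, illustrative way to confirm whether that resource type currently exists on the server (ordinary oc commands, nothing bug-specific assumed):

$ oc api-resources --api-group=controlplane.operator.openshift.io
$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io -w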