Bug 1876167

Summary: [RFE] need to be able to enable and disable openshift-apiserver connectivity checks
Product: OpenShift Container Platform
Reporter: Luis Sanchez <sanchezl>
Component: openshift-apiserver
Assignee: Luis Sanchez <sanchezl>
Status: CLOSED WONTFIX
QA Contact: Xingxing Xia <xxia>
Severity: high
Docs Contact:
Priority: high
Version: 4.6
CC: aos-bugs, kewang, mfojtik, sttts, xxia
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1876166
Environment:
Last Closed: 2021-02-10 20:43:09 UTC
Type: ---
Regression: ---
Documentation: ---

Description Luis Sanchez 2020-09-06 00:24:37 UTC
Description of problem:

The openshift-apiserver pod performs connectivity checks to report on network outages. When debugging certain types of issues, however, the activity generated by these checks produces so much data that it becomes harder to pinpoint the root cause. We need a way to temporarily disable the connectivity checks.
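
A quick way to gauge how much data the checks generate, assuming the PodNetworkConnectivityCheck CRD (podnetworkconnectivitychecks.controlplane.operator.openshift.io, seen in the logs below) is registered on the cluster:

$ oc get podnetworkconnectivitycheck -n openshift-apiserver
$ oc get events -n openshift-apiserver | grep -c Connectivity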

Comment 1 Mike Gahagan 2020-09-15 19:31:26 UTC
I think this is the same issue I am seeing on recent nightlies on Azure (4.6.0-0.nightly-2020-09-13-023938)

[m@localhost 46-azure-install]$ oc get events -n openshift-apiserver |less
57m         Warning   ConnectivityOutageDetected   deployment/apiserver              Connectivity outage detected: load-balancer-api-external: failed to establish a TCP connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 1.025622661s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 1.995560051s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 4.846401561s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 2.004979486s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 1.00476912s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 996.230724ms: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 4.934150936s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
57m         Warning   ConnectivityOutageDetected   deployment/apiserver              Connectivity outage detected: load-balancer-api-internal: failed to establish a TCP connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout

These show up constantly on the web console as well.
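
A standard field selector narrows the listing to just the outage events, which helps when triaging this kind of noise:

$ oc get events -n openshift-apiserver --field-selector reason=ConnectivityOutageDetected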

Comment 2 Luis Sanchez 2020-10-01 13:41:19 UTC
(In reply to Mike Gahagan from comment #1)
> I think this is the same issue I am seeing on recent nightlys on Azure
> (4.6.0-0.nightly-2020-09-13-023938)
> 
> These show up constantly on the web console as well.

This was fixed by bug 1878794.

Comment 5 Xingxing Xia 2020-10-10 10:32:25 UTC
Tested in 4.6.0-0.nightly-2020-10-09-224055; by default the connectivity checks are disabled. But they cannot be enabled, because enabling them runs into the CrashLoopBackOff problem below.
$ oc edit openshiftapiserver/cluster
...
spec:
...
  unsupportedConfigOverrides:
    operator:
      enableConnectivityCheckController: "True"
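
The same override can also be applied non-interactively; the equivalent merge patch, using the key above, would be something like:

$ oc patch openshiftapiserver/cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"operator":{"enableConnectivityCheckController":"True"}}}}'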

Then, watching, both KAS (kube-apiserver) and OAS (openshift-apiserver) pods keep going into CrashLoopBackOff on the check-endpoints container:
$ oc get po -n openshift-apiserver --show-labels -o wide -w
NAME                         READY   STATUS     RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES   LABELS
apiserver-657f5c5c87-x9z94   0/2     Init:0/1   0          5s    <none>        ip-10-0-62-115.ap-northeast-1.compute.internal   <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
...
apiserver-657f5c5c87-8r4kd   1/2     CrashLoopBackOff   6          9m46s   10.128.0.24   ip-10-0-76-22.ap-northeast-1.compute.internal    <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-9lncz   1/2     CrashLoopBackOff   6          10m     10.129.0.42   ip-10-0-51-70.ap-northeast-1.compute.internal    <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-x9z94   1/2     CrashLoopBackOff   6          10m     10.130.0.45   ip-10-0-62-115.ap-northeast-1.compute.internal   <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4

$ oc logs -c openshift-apiserver-check-endpoints apiserver-657f5c5c87-8r4kd -n openshift-apiserver
...
I1010 10:10:41.609467       1 base_controller.go:109] Starting #1 worker of check-endpoints controller ...
I1010 10:10:41.677077       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".
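
That message suggests the podnetworkconnectivitychecks CRD is not (or not stably) registered; a quick way to check is with the standard discovery commands, e.g.:

$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io
$ oc api-resources --api-group=controlplane.operator.openshift.io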

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                            READY   STATUS             RESTARTS   AGE     LABELS
kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal    4/5     CrashLoopBackOff   26         5h50m   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-62-115.ap-northeast-1.compute.internal   4/5     CrashLoopBackOff   26         5h46m   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-76-22.ap-northeast-1.compute.internal    4/5     CrashLoopBackOff   26         5h41m   apiserver=true,app=openshift-kube-apiserver,revision=9
$ oc logs -c kube-apiserver-check-endpoints kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:13:46.304353       1 base_controller.go:166] Shutting down CheckEndpointsTimeToStart ...
I1010 10:13:46.304377       1 base_controller.go:113] Shutting down worker of CheckEndpointsTimeToStart controller ...
I1010 10:13:46.304875       1 base_controller.go:103] All CheckEndpointsTimeToStart workers have been terminated
...
I1010 10:13:46.496032       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

Checking the kube-apiserver logs, the following messages are repeated:
$ oc logs -c kube-apiserver kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:21:59.685378      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1010 10:22:01.278489      18 aggregator.go:229] Finished OpenAPI spec generation after 1.593081531s
I1010 10:22:01.933031      18 controller.go:189] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io changed
...
I1010 10:22:02.054075      18 store.go:1378] Monitoring podnetworkconnectivitychecks.controlplane.operator.openshift.io count at <storage-prefix>//controlplane.operator.openshift.io/podnetworkconnectivitychecks
I1010 10:22:02.056595      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
I1010 10:22:02.886290      18 controller.go:172] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io was removed
...
I1010 10:22:03.081078      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1010 10:22:03.081123      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
...
W1010 10:22:03.963110      18 controller.go:142] slow openapi aggregation of "podnetworkconnectivitychecks.controlplane.operator.openshift.io": 1.076830326s
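
Given the repeating "changed"/"removed" pair above, the CRD appears to be created and deleted in a loop, which would explain the check-endpoints containers crashing; watching the CRD directly should confirm whether it is flapping:

$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io -w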