Bug 1876167 - [RFE] need to be able to enable and disable openshift-apiserver connectivity checks
Summary: [RFE] need to be able to enable and disable openshift-apiserver connectivity checks
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Luis Sanchez
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-06 00:24 UTC by Luis Sanchez
Modified: 2021-02-10 20:43 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1876166
Environment:
Last Closed: 2021-02-10 20:43:09 UTC
Target Upstream Version:


Attachments


Links
Github openshift cluster-openshift-apiserver-operator pull 388 (closed): Bug 1876167: disable openshift-apiserver connectivity checks (last updated 2021-02-10 20:41:41 UTC)

Description Luis Sanchez 2020-09-06 00:24:37 UTC
Description of problem:

openshift-apiserver pod performs connectivity checks to report on network outages. Sometimes when debugging certain types of issues, the activity from the connectivity checks results in too much data, making it more difficult to pinpoint the root cause. We need to be able to temporarily disable the connectivity checks.
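
For a sense of how much data the checks produce, both the events and the PodNetworkConnectivityCheck resources they maintain can be inspected (a rough sketch; assumes the resources are namespaced under openshift-apiserver, with the full CRD name taken from the logs later in this bug):

$ oc get events -n openshift-apiserver | grep -c Connectivity
$ oc get podnetworkconnectivitychecks.controlplane.operator.openshift.io -n openshift-apiserver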

Comment 1 Mike Gahagan 2020-09-15 19:31:26 UTC
I think this is the same issue I am seeing on recent nightlies on Azure (4.6.0-0.nightly-2020-09-13-023938)

[m@localhost 46-azure-install]$ oc get events -n openshift-apiserver |less
57m         Warning   ConnectivityOutageDetected   deployment/apiserver              Connectivity outage detected: load-balancer-api-external: failed to establish a TCP connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 1.025622661s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 1.995560051s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 4.846401561s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 2.004979486s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 1.00476912s: load-balancer-api-internal: tcp connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 996.230724ms: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
61m         Normal    ConnectivityRestored         deployment/apiserver              Connectivity restored after 4.934150936s: load-balancer-api-external: tcp connection to api.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443 succeeded
57m         Warning   ConnectivityOutageDetected   deployment/apiserver              Connectivity outage detected: load-balancer-api-internal: failed to establish a TCP connection to api-int.mgahagan-1411509.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout

These show up constantly on the web console as well.
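
To see just the outage events without the ConnectivityRestored noise, a field selector should work (a sketch; assumes the standard Event field selectors):

$ oc get events -n openshift-apiserver --field-selector reason=ConnectivityOutageDetected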

Comment 2 Luis Sanchez 2020-10-01 13:41:19 UTC
(In reply to Mike Gahagan from comment #1)
> I think this is the same issue I am seeing on recent nightlies on Azure
> (4.6.0-0.nightly-2020-09-13-023938)
> 
> These show up constantly on the web console as well.

This was fixed by bug 1878794.

Comment 5 Xingxing Xia 2020-10-10 10:32:25 UTC
Tested in 4.6.0-0.nightly-2020-10-09-224055: by default the connectivity checks are disabled, but they cannot be enabled because doing so hits the CrashLoopBackOff problem below.
$ oc edit openshiftapiserver/cluster
...
spec:
...
  unsupportedConfigOverrides:
    operator:
      enableConnectivityCheckController: "True"
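
The same override should also be settable non-interactively with a merge patch (a sketch built from the exact override path shown above):

$ oc patch openshiftapiserver/cluster --type=merge \
    -p '{"spec":{"unsupportedConfigOverrides":{"operator":{"enableConnectivityCheckController":"True"}}}}'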

Then, watching the pods, both the KAS and OAS pods stay in CrashLoopBackOff on the check-endpoints container:
$ oc get po -n openshift-apiserver --show-labels -o wide -w
NAME                         READY   STATUS     RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES   LABELS
apiserver-657f5c5c87-x9z94   0/2     Init:0/1   0          5s    <none>        ip-10-0-62-115.ap-northeast-1.compute.internal   <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
...
apiserver-657f5c5c87-8r4kd   1/2     CrashLoopBackOff   6          9m46s   10.128.0.24   ip-10-0-76-22.ap-northeast-1.compute.internal    <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-9lncz   1/2     CrashLoopBackOff   6          10m     10.129.0.42   ip-10-0-51-70.ap-northeast-1.compute.internal    <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4
apiserver-657f5c5c87-x9z94   1/2     CrashLoopBackOff   6          10m     10.130.0.45   ip-10-0-62-115.ap-northeast-1.compute.internal   <none>           <none>            apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=657f5c5c87,revision=4

$ oc logs -c openshift-apiserver-check-endpoints apiserver-657f5c5c87-8r4kd -n openshift-apiserver
...
I1010 10:10:41.609467       1 base_controller.go:109] Starting #1 worker of check-endpoints controller ...
I1010 10:10:41.677077       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".
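
That log line indicates the check-endpoints controller cannot find the PodNetworkConnectivityCheck resource type; whether the CRD is actually registered can be checked directly (a sketch):

$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io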

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                            READY   STATUS             RESTARTS   AGE     LABELS
kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal    4/5     CrashLoopBackOff   26         5h50m   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-62-115.ap-northeast-1.compute.internal   4/5     CrashLoopBackOff   26         5h46m   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-ip-10-0-76-22.ap-northeast-1.compute.internal    4/5     CrashLoopBackOff   26         5h41m   apiserver=true,app=openshift-kube-apiserver,revision=9
$ oc logs -c kube-apiserver-check-endpoints kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:13:46.304353       1 base_controller.go:166] Shutting down CheckEndpointsTimeToStart ...
I1010 10:13:46.304377       1 base_controller.go:113] Shutting down worker of CheckEndpointsTimeToStart controller ...
I1010 10:13:46.304875       1 base_controller.go:103] All CheckEndpointsTimeToStart workers have been terminated
...
I1010 10:13:46.496032       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

Checking the kube-apiserver logs, the lines below repeat:
$ oc logs -c kube-apiserver kube-apiserver-ip-10-0-51-70.ap-northeast-1.compute.internal -n openshift-kube-apiserver
...
I1010 10:21:59.685378      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1010 10:22:01.278489      18 aggregator.go:229] Finished OpenAPI spec generation after 1.593081531s
I1010 10:22:01.933031      18 controller.go:189] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io changed
...
I1010 10:22:02.054075      18 store.go:1378] Monitoring podnetworkconnectivitychecks.controlplane.operator.openshift.io count at <storage-prefix>//controlplane.operator.openshift.io/podnetworkconnectivitychecks
I1010 10:22:02.056595      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
I1010 10:22:02.886290      18 controller.go:172] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io was removed
...
I1010 10:22:03.081078      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1010 10:22:03.081123      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
...
W1010 10:22:03.963110      18 controller.go:142] slow openapi aggregation of "podnetworkconnectivitychecks.controlplane.operator.openshift.io": 1.076830326s
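
The alternating "Updating CRD OpenAPI spec because ... changed" / "... was removed" lines suggest the podnetworkconnectivitychecks CRD is being created and deleted repeatedly; watching the CRD directly should confirm that (a sketch):

$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io -w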

