Description of problem
======================

After multiple network disconnections between active zones in an OCP cluster, it's no longer possible to reach the OCP Console, because it goes into an endless cycle during authentication, utilizing over 100% of a CPU core. This makes the OCP Console inaccessible. Previously, I observed a similar problem during the disruption itself, but then the OCP Console was able to recover (see BZ 1945572). This time, the OCP Console can't recover while the rest of the cluster seems to be up. Moreover, the failure mode is not the same.

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-07-26-182557
LSO 4.8.0-202107231447
OCS 4.8.0-456.ci

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes.
2. Pick one master node and label it as an arbiter, e.g.:

   ```
   $ oc label node $node topology.kubernetes.io/zone=arbiter
   ```

3. Label one remaining master node as zone "data-1", the other as zone "data-2".
4. Label half of the worker nodes as zone "data-1", and the other half as zone "data-2".

   Note: at this point, you have nodes labeled like this:

   ```
   $ oc get nodes -L topology.kubernetes.io/zone
   NAME              STATUS   ROLES    AGE   VERSION           ZONE
   compute-0         Ready    worker   31h   v1.21.1+051ac4f   data-1
   compute-1         Ready    worker   31h   v1.21.1+051ac4f   data-2
   compute-2         Ready    worker   31h   v1.21.1+051ac4f   data-1
   compute-3         Ready    worker   31h   v1.21.1+051ac4f   data-2
   compute-4         Ready    worker   31h   v1.21.1+051ac4f   data-1
   compute-5         Ready    worker   31h   v1.21.1+051ac4f   data-2
   control-plane-0   Ready    master   31h   v1.21.1+051ac4f   arbiter
   control-plane-1   Ready    master   31h   v1.21.1+051ac4f   data-1
   control-plane-2   Ready    master   31h   v1.21.1+051ac4f   data-2
   ```

5. Open the OCP Console and log in as kubeadmin.
6. Schedule several network disruptions between zones in a row.
7. Check the OCP Console and try to use it after all the disruptions have passed and the OCP cluster has had enough time to recover.
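The labeling in steps 2-4 can be scripted. A minimal dry-run sketch that prints the `oc label` commands for the layout above (node names are assumed from the `oc get nodes` listing; pipe the output to `sh` to actually apply it):

```shell
# Dry-run sketch: emit the `oc label` commands for the zone layout used in
# this report. Node names are assumed from the listing above; adjust as needed.
echo oc label node control-plane-0 topology.kubernetes.io/zone=arbiter
echo oc label node control-plane-1 topology.kubernetes.io/zone=data-1
echo oc label node control-plane-2 topology.kubernetes.io/zone=data-2

# Alternate the worker nodes between the two data zones.
zone=data-1
for node in compute-0 compute-1 compute-2 compute-3 compute-4 compute-5; do
  echo oc label node "$node" topology.kubernetes.io/zone="$zone"
  if [ "$zone" = data-1 ]; then zone=data-2; else zone=data-1; fi
done
```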
Actual results
==============

OCP Console loads a login page, but then an attempt to log in gets stuck:

- the page shown in the browser is empty, with a loading animation
- looking into the network log, the login process seems to be stuck in an endless cycle
- CPU utilization of the browser goes slightly over 100%

Expected results
================

OCP Console is able to recover after the network problems are resolved, so that it's reachable again.

Additional info
===============

See additional evidence referenced in comments below, such as must-gather, video screencast and Firefox network debug log (HAR file).

Network split configuration scheduled when I observed this problem for the first time (timestamps below are in CEST):

```
ocp-network-split-sched -t 2021-07-28T02:00 --split-len 2 ab-bc
ocp-network-split-sched -t 2021-07-28T02:05 --split-len 1 ab-bc
ocp-network-split-sched -t 2021-07-28T02:10 --split-len 5 ab-bc
ocp-network-split-sched -t 2021-07-28T02:30 --split-len 30 ab-bc
ocp-network-split-sched -t 2021-07-28T04:00 --split-len 30 ab-ac
ocp-network-split-sched -t 2021-07-28T05:00 --split-len 30 ab-bc
ocp-network-split-sched -t 2021-07-28T06:00 --split-len 30 ab-ac
```

In other words, I scheduled a batch of 2 types of network disruptions (ab-bc and ab-ac) overnight, and checked the result in the morning. See also https://mbukatov.gitlab.io/ocp-network-split/usage.html
Created attachment 1806856 [details] screenshot #1: endless loading page shown after a login attempt
After further investigation and a chat with Standa Laznicka, we found out that there has been an issue with certificate rotation for the kube-apiserver, which was caused by the networking disruptions. Assigning to the oauth-apiserver team per discussion with Standa.
The reason the console is failing to load properly is the following:

1. The certificate used by the kube-apiserver to authenticate user requests was expiring; the authentication operator managed to refresh it more or less in time (1h23m before its expiry). From the authentication operator logs:

```
2021-07-28T07:41:27.593140593Z I0728 07:41:27.591722 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateRotationStarted' The current client certificate for OpenShiftAuthenticatorCertRequester expires in 1h23m15s. Start certificate rotation
2021-07-28T07:41:27.606162541Z I0728 07:41:27.606096 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRCreated' A csr "system:openshift:openshift-authenticator" is created for OpenShiftAuthenticatorCertRequester
2021-07-28T07:41:27.608072817Z I0728 07:41:27.606532 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRApproval' The CSR "system:openshift:openshift-authenticator" has been approved
2021-07-28T07:41:27.634065548Z I0728 07:41:27.632873 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ClientCertificateCreated' A new client certificate for OpenShiftAuthenticatorCertRequester is available
2021-07-28T07:41:27.650657399Z I0728 07:41:27.650586 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/webhook-authentication-integrated-oauth -n openshift-config because it changed
```

2. The new configuration was picked up by the kube-apiserver-operator:

```
2021-07-28T07:41:27.693678732Z I0728 07:41:27.693569 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"0c094fd0-04ff-4310-b148-7c163bc72c31", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/webhook-authenticator -n openshift-kube-apiserver because it changed
2021-07-28T07:41:27.699933877Z I0728 07:41:27.697699 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"0c094fd0-04ff-4310-b148-7c163bc72c31", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 11 triggered by "secret/webhook-authenticator has changed"
```

3. However, the kube-apiserver pods have been failing for a tremendously long time: there are 2322 (!!!) installer pod retries in the kube-apiserver NS. Fun fact: this directory crashes GNOME File Roller 3.40.0.

This leads me to believe that either the networking issues were not resolved properly or there were more issues than that. I'm going to move this to the kube-apiserver component; I believe they might be quite interested in the high number of pods kept in the openshift-kube-apiserver NS.
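The rotation above fired with 1h23m15s of validity left on the client certificate. Once a certificate has been extracted from the relevant secret, `openssl x509 -checkend` is a quick way to check how much validity remains. A minimal runnable sketch, using a throwaway self-signed certificate standing in for the real client cert:

```shell
# Sketch: succeed (exit 0) if the given PEM certificate is still valid
# one hour from now.
check_cert_expiry() {
  openssl x509 -noout -checkend 3600 -in "$1"
}

# Throwaway self-signed cert (valid 1 day), standing in for the real
# authenticator client certificate.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/throwaway-key.pem \
  -out /tmp/throwaway-cert.pem -days 1 -subj "/CN=throwaway" 2>/dev/null
check_cert_expiry /tmp/throwaway-cert.pem && echo "still valid in an hour"
```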
Created bug for the installer backoff issue. Moving back to apiserver-auth until the root cause has been found.
The root cause for the installer retries was a crash-looping kube-apiserver pod due to "etcd request timed out".
The kube-apiserver operator reports crash-looping pods from at least 2021-07-28T03:35:58.857796673Z. On the etcd side:

```
2021-07-28T04:26:04.096369302Z {"level":"warn","ts":"2021-07-28T04:26:04.096Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"1.000048303s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/namespaces/default\" keys_only:true ","response":"","error":"context deadline exceeded"}
2021-07-28T04:26:04.096369302Z {"level":"info","ts":"2021-07-28T04:26:04.096Z","caller":"traceutil/trace.go:145","msg":"trace[721358143] range","detail":"{range_begin:/kubernetes.io/namespaces/default; range_end:; }","duration":"1.000148406s","start":"2021-07-28T04:26:03.096Z","end":"2021-07-28T04:26:04.096Z","steps":["trace[721358143] 'agreement among raft nodes before linearized reading' (duration: 1.000006269s)"]}
2021-07-28T04:26:04.891877937Z {"level":"warn","ts":"2021-07-28T04:26:04.891Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c238e77e379579c9","rtt":"985.835µs","error":"dial tcp 10.1.160.124:2380: i/o timeout"}
2021-07-28T04:26:04.921454332Z {"level":"warn","ts":"2021-07-28T04:26:04.921Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"91d4d361ae240b7e","rtt":"1.112162ms","error":"dial tcp 10.1.160.105:2380: i/o timeout"}
2021-07-28T04:26:04.951220504Z {"level":"warn","ts":"2021-07-28T04:26:04.951Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c238e77e379579c9","rtt":"8.446719ms","error":"dial tcp 10.1.160.124:2380: i/o timeout"}
2021-07-28T04:26:04.992775193Z {"level":"warn","ts":"2021-07-28T04:26:04.992Z","caller":"etcdhttp/metrics.go:111","msg":"serving /health false; no leader"}
2021-07-28T04:26:04.992775193Z {"level":"warn","ts":"2021-07-28T04:26:04.992Z","caller":"etcdhttp/metrics.go:60","msg":"/health error","output":"{\"health\":\"false\"}","status-code":503}
2021-07-28T04:26:05.016683053Z {"level":"warn","ts":"2021-07-28T04:26:05.016Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"10.000070286s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/openshift.io/health\" ","response":"","error":"context deadline exceeded"}
2021-07-28T04:26:05.016683053Z {"level":"info","ts":"2021-07-28T04:26:05.016Z","caller":"traceutil/trace.go:145","msg":"trace[1017250282] range","detail":"{range_begin:/openshift.io/health; range_end:; }","duration":"10.000165828s","start":"2021-07-28T04:25:55.016Z","end":"2021-07-28T04:26:05.016Z","steps":["trace[1017250282] 'agreement among raft nodes before linearized reading' (duration: 10.000029203s)"]}
2021-07-28T04:26:05.095099686Z {"level":"warn","ts":"2021-07-28T04:26:05.094Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"91d4d361ae240b7e","rtt":"9.700365ms","error":"dial tcp 10.1.160.105:2380: i/o timeout"}
2021-07-28T04:26:05.272508435Z {"level":"info","ts":"2021-07-28T04:26:05.272Z","caller":"raft/raft.go:923","msg":"9d8b6fda0edcc708 is starting a new election at term 5144"}
2021-07-28T04:26:05.272508435Z {"level":"info","ts":"2021-07-28T04:26:05.272Z","caller":"raft/raft.go:713","msg":"9d8b6fda0edcc708 became candidate at term 5145"}
2021-07-28T04:26:05.272508435Z {"level":"info","ts":"2021-07-28T04:26:05.272Z","caller":"raft/raft.go:824","msg":"9d8b6fda0edcc708 received MsgVoteResp from 9d8b6fda0edcc708 at term 5145"}
```

I.e. leader elections after network i/o timeouts. This is likely network related.
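To map the disruption windows against these messages, the prober's i/o-timeout warnings can be bucketed per hour over a saved etcd log. A small sketch; the input format is assumed to match the excerpts above (RFC3339 timestamp first on each line):

```shell
# Sketch: count etcd "i/o timeout" warnings per hour from a saved log file.
# The first 13 characters of each line form the YYYY-MM-DDTHH bucket.
count_timeouts_per_hour() {
  grep 'i/o timeout' "$1" | cut -c1-13 | sort | uniq -c
}
```

Run it as `count_timeouts_per_hour etcd.log` against a log extracted from must-gather; spikes should line up with the ocp-network-split-sched windows listed in the description.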
kube-apiserver installer pods cannot be scheduled because of "preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources" ???

```
Jul 28 08:05:51.893716 control-plane-0 hyperkube[1489]: I0728 08:05:51.893661 1489 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/installer-11-retry-132-control-plane-0]
Jul 28 08:05:51.894218 control-plane-0 hyperkube[1489]: I0728 08:05:51.893820 1489 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-kube-apiserver"/"installer-sa-dockercfg-lwq2t"
Jul 28 08:05:51.894218 control-plane-0 hyperkube[1489]: I0728 08:05:51.894099 1489 reflector.go:219] Starting reflector *v1.ConfigMap (0s) from object-"openshift-kube-apiserver"/"kube-root-ca.crt"
Jul 28 08:05:51.894430 control-plane-0 hyperkube[1489]: I0728 08:05:51.894407 1489 topology_manager.go:187] "Topology Admit Handler"
Jul 28 08:05:51.894827 control-plane-0 hyperkube[1489]: I0728 08:05:51.894803 1489 predicate.go:113] "Failed to admit pod, unexpected error while attempting to recover from admission failure" pod="openshift-kube-apiserver/installer-11-retry-132-control-plane-0" err="preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: cpu, q: 120), ]"
Jul 28 08:05:51.944236 control-plane-0 hyperkube[1489]: I0728 08:05:51.944183 1489 reflector.go:225] Stopping reflector *v1.Secret (0s) from object-"openshift-kube-apiserver"/"installer-sa-dockercfg-lwq2t"
Jul 28 08:05:51.944497 control-plane-0 hyperkube[1489]: I0728 08:05:51.944264 1489 reflector.go:225] Stopping reflector *v1.ConfigMap (0s) from object-"openshift-kube-apiserver"/"kube-root-ca.crt"
Jul 28 08:05:57.634185 control-plane-0 hyperkube[1489]: I0728 08:05:57.634131 1489 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/6604eea2-5dd9-4e41-9db4-52806485ec97-kube-api-access\") pod \"installer-8-retry-138-control-plane-0\" (UID: \"6604eea2-5dd9-4e41-9db4-52806485ec97\") "
Jul 28 08:05:57.634606 control-plane-0 hyperkube[1489]: E0728 08:05:57.634334 1489 projected.go:293] Couldn't get configMap openshift-kube-apiserver/kube-root-ca.crt: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:05:57.634606 control-plane-0 hyperkube[1489]: E0728 08:05:57.634356 1489 projected.go:199] Error preparing data for projected volume kube-api-access for pod openshift-kube-apiserver/installer-8-retry-138-control-plane-0: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:05:57.634606 control-plane-0 hyperkube[1489]: E0728 08:05:57.634448 1489 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/6604eea2-5dd9-4e41-9db4-52806485ec97-kube-api-access podName:6604eea2-5dd9-4e41-9db4-52806485ec97 nodeName:}" failed. No retries permitted until 2021-07-28 08:07:59.634422515 +0000 UTC m=+81247.416074029 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/6604eea2-5dd9-4e41-9db4-52806485ec97-kube-api-access\") pod \"installer-8-retry-138-control-plane-0\" (UID: \"6604eea2-5dd9-4e41-9db4-52806485ec97\") : object \"openshift-kube-apiserver\"/\"kube-root-ca.crt\" not registered"
Jul 28 08:06:04.699746 control-plane-0 hyperkube[1489]: I0728 08:06:04.699680 1489 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/9074e95e-afd2-4e22-8257-d95e26b7bbd9-kube-api-access\") pod \"installer-8-retry-139-control-plane-0\" (UID: \"9074e95e-afd2-4e22-8257-d95e26b7bbd9\") "
Jul 28 08:06:04.701116 control-plane-0 hyperkube[1489]: E0728 08:06:04.700221 1489 projected.go:293] Couldn't get configMap openshift-kube-apiserver/kube-root-ca.crt: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:06:04.701116 control-plane-0 hyperkube[1489]: E0728 08:06:04.700248 1489 projected.go:199] Error preparing data for projected volume kube-api-access for pod openshift-kube-apiserver/installer-8-retry-139-control-plane-0: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:06:04.701116 control-plane-0 hyperkube[1489]: E0728 08:06:04.700322 1489 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/9074e95e-afd2-4e22-8257-d95e26b7bbd9-kube-api-access podName:9074e95e-afd2-4e22-8257-d95e26b7bbd9 nodeName:}" failed. No retries permitted until 2021-07-28 08:08:06.700297709 +0000 UTC m=+81254.481949225 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/9074e95e-afd2-4e22-8257-d95e26b7bbd9-kube-api-access\") pod \"installer-8-retry-139-control-plane-0\" (UID: \"9074e95e-afd2-4e22-8257-d95e26b7bbd9\") : object \"openshift-kube-apiserver\"/\"kube-root-ca.crt\" not registered"
Jul 28 08:06:05.710273 control-plane-0 hyperkube[1489]: I0728 08:06:05.710222 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver-control-plane-0" status=Running
Jul 28 08:06:05.710273 control-plane-0 hyperkube[1489]: I0728 08:06:05.710273 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube-controller-manager-control-plane-0" status=Running
Jul 28 08:06:05.710836 control-plane-0 hyperkube[1489]: I0728 08:06:05.710404 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-control-plane-0" status=Running
Jul 28 08:06:05.710836 control-plane-0 hyperkube[1489]: I0728 08:06:05.710417 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube-scheduler-control-plane-0" status=Running
Jul 28 08:06:06.491948 control-plane-0 hyperkube[1489]: I0728 08:06:06.491900 1489 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/installer-11-retry-133-control-plane-0]
Jul 28 08:06:06.492162 control-plane-0 hyperkube[1489]: I0728 08:06:06.492077 1489 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-kube-apiserver"/"installer-sa-dockercfg-lwq2t"
Jul 28 08:06:06.492162 control-plane-0 hyperkube[1489]: I0728 08:06:06.492138 1489 reflector.go:219] Starting reflector *v1.ConfigMap (0s) from object-"openshift-kube-apiserver"/"kube-root-ca.crt"
Jul 28 08:06:06.492260 control-plane-0 hyperkube[1489]: I0728 08:06:06.492247 1489 topology_manager.go:187] "Topology Admit Handler"
Jul 28 08:06:06.492679 control-plane-0 hyperkube[1489]: I0728 08:06:06.492659 1489 predicate.go:113] "Failed to admit pod, unexpected error while attempting to recover from admission failure" pod="openshift-kube-apiserver/installer-11-retry-133-control-plane-0" err="preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: cpu, q: 120), ]"
```
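The 2322 installer pod retries mentioned earlier can be recounted from a saved pod listing (e.g. `oc get pods -n openshift-kube-apiserver --no-headers > pods.txt`, or the pod directory in must-gather). A minimal sketch; the pod naming pattern is taken from the kubelet logs above:

```shell
# Sketch: count installer retry pods from a saved `oc get pods` listing.
# Pod names are assumed to follow installer-<revision>-retry-<n>-<node>,
# as seen in the kubelet logs above.
count_installer_retries() {
  grep -c '^installer-[0-9]*-retry-' "$1"
}
```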
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.