Description of problem
======================

After multiple network disconnections between active zones in an OCP cluster, it's no longer possible to reach the OCP Console, because it goes into an endless cycle during authentication, utilizing over 100% of a CPU core. This makes the OCP Console inaccessible. Previously, I observed a similar problem during the disruption itself, but then the OCP Console was able to recover (see BZ 1945572). This time, the OCP Console can't recover while the rest of the cluster seems to be up. Moreover, the failure mode is not the same.

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-07-26-182557
LSO 4.8.0-202107231447
OCS 4.8.0-456.ci

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes.
2. Pick one master node and label it as an arbiter, e.g.:

   ```
   $ oc label node $node topology.kubernetes.io/zone=arbiter
   ```

3. Label one remaining master node as zone "data-1", the other as zone "data-2".
4. Label half of the worker nodes as zone "data-1", and the other half as zone "data-2".

   Note: at this point, you have nodes labeled like this:

   ```
   $ oc get nodes -L topology.kubernetes.io/zone
   NAME              STATUS   ROLES    AGE   VERSION           ZONE
   compute-0         Ready    worker   31h   v1.21.1+051ac4f   data-1
   compute-1         Ready    worker   31h   v1.21.1+051ac4f   data-2
   compute-2         Ready    worker   31h   v1.21.1+051ac4f   data-1
   compute-3         Ready    worker   31h   v1.21.1+051ac4f   data-2
   compute-4         Ready    worker   31h   v1.21.1+051ac4f   data-1
   compute-5         Ready    worker   31h   v1.21.1+051ac4f   data-2
   control-plane-0   Ready    master   31h   v1.21.1+051ac4f   arbiter
   control-plane-1   Ready    master   31h   v1.21.1+051ac4f   data-1
   control-plane-2   Ready    master   31h   v1.21.1+051ac4f   data-2
   ```

5. Open the OCP Console and log in as kubeadmin.
6. Schedule several network disruptions between zones in a row.
7. Check the OCP Console and try to use it after all the disruptions have passed and the OCP cluster has had enough time to recover.
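The labeling in steps 2-4 can be scripted. A minimal dry-run sketch that prints the `oc label` commands for the layout above (node names are assumed from the `oc get nodes` listing; pipe the output to `sh` to actually apply it):

```shell
# Dry-run sketch: emit the `oc label` commands for the zone layout used in
# this report. Node names are assumed from the listing above; adjust as needed.
echo oc label node control-plane-0 topology.kubernetes.io/zone=arbiter
echo oc label node control-plane-1 topology.kubernetes.io/zone=data-1
echo oc label node control-plane-2 topology.kubernetes.io/zone=data-2

# Alternate the worker nodes between the two data zones.
zone=data-1
for node in compute-0 compute-1 compute-2 compute-3 compute-4 compute-5; do
  echo oc label node "$node" topology.kubernetes.io/zone="$zone"
  if [ "$zone" = data-1 ]; then zone=data-2; else zone=data-1; fi
done
```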
Actual results
==============

OCP Console loads a login page, but then an attempt to log in gets stuck:

- the page shown in the browser is empty, with a loading animation
- looking into the network log, the login process seems to be stuck in an endless cycle
- CPU utilization of the browser goes slightly over 100%

Expected results
================

OCP Console is able to recover after the network problems are resolved, so that it's reachable again.

Additional info
===============

See additional evidence referenced in comments below, such as must-gather, video screencast and Firefox network debug log (HAR file).

Network split configuration scheduled when I observed this problem for the first time (timestamps below are in CEST):

```
ocp-network-split-sched -t 2021-07-28T02:00 --split-len 2 ab-bc
ocp-network-split-sched -t 2021-07-28T02:05 --split-len 1 ab-bc
ocp-network-split-sched -t 2021-07-28T02:10 --split-len 5 ab-bc
ocp-network-split-sched -t 2021-07-28T02:30 --split-len 30 ab-bc
ocp-network-split-sched -t 2021-07-28T04:00 --split-len 30 ab-ac
ocp-network-split-sched -t 2021-07-28T05:00 --split-len 30 ab-bc
ocp-network-split-sched -t 2021-07-28T06:00 --split-len 30 ab-ac
```

In other words, I scheduled a batch of 2 types of network disruptions (ab-bc and ab-ac) overnight, and checked the result in the morning. See also https://mbukatov.gitlab.io/ocp-network-split/usage.html
Created attachment 1806856 [details] screenshot #1: endless loading page shown after a login attempt
After further investigation and a chat with Standa Laznicka, we found out that there has been an issue with certificate rotation for the kube-apiserver, which was caused by the networking disruptions. Assigning to the oauth-apiserver team per discussion with Standa.
The reason the console is failing to load properly is the following:

1. The certificate used by the kube-apiserver to authenticate user requests was expiring; the authentication operator managed to refresh it more or less in time (1h23m before its expiry). From the authentication operator logs:

```
2021-07-28T07:41:27.593140593Z I0728 07:41:27.591722 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateRotationStarted' The current client certificate for OpenShiftAuthenticatorCertRequester expires in 1h23m15s. Start certificate rotation
2021-07-28T07:41:27.606162541Z I0728 07:41:27.606096 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRCreated' A csr "system:openshift:openshift-authenticator" is created for OpenShiftAuthenticatorCertRequester
2021-07-28T07:41:27.608072817Z I0728 07:41:27.606532 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRApproval' The CSR "system:openshift:openshift-authenticator" has been approved
2021-07-28T07:41:27.634065548Z I0728 07:41:27.632873 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ClientCertificateCreated' A new client certificate for OpenShiftAuthenticatorCertRequester is available
2021-07-28T07:41:27.650657399Z I0728 07:41:27.650586 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"0742657a-78a1-473f-ad4b-d74c09986b62", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/webhook-authentication-integrated-oauth -n openshift-config because it changed
```

2. The new configuration was picked up by the kube-apiserver-operator:

```
2021-07-28T07:41:27.693678732Z I0728 07:41:27.693569 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"0c094fd0-04ff-4310-b148-7c163bc72c31", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/webhook-authenticator -n openshift-kube-apiserver because it changed
2021-07-28T07:41:27.699933877Z I0728 07:41:27.697699 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"0c094fd0-04ff-4310-b148-7c163bc72c31", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 11 triggered by "secret/webhook-authenticator has changed"
```

3. However, the kube-apiserver pods have been failing for a tremendously long time: there are 2322 (!!!) installer pod retries in the kube-apiserver NS. Fun fact: this directory crashes GNOME File Roller 3.40.0.

This leads me to believe that either the networking issues were not resolved properly or there were more issues than that. I'm going to move this to the kube-apiserver component; I believe they might be quite interested in the high number of pods kept in the openshift-kube-apiserver NS.
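The rotation above fired with 1h23m15s of validity left on the client certificate. Once a certificate has been extracted from the relevant secret, `openssl x509 -checkend` is a quick way to check how much validity remains. A minimal runnable sketch, using a throwaway self-signed certificate standing in for the real client cert:

```shell
# Sketch: succeed (exit 0) if the given PEM certificate is still valid
# one hour from now.
check_cert_expiry() {
  openssl x509 -noout -checkend 3600 -in "$1"
}

# Throwaway self-signed cert (valid 1 day), standing in for the real
# authenticator client certificate.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/throwaway-key.pem \
  -out /tmp/throwaway-cert.pem -days 1 -subj "/CN=throwaway" 2>/dev/null
check_cert_expiry /tmp/throwaway-cert.pem && echo "still valid in an hour"
```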
Created bug for the installer backoff issue. Moving back to apiserver-auth until the root cause has been found.
The root cause for the installer retries was a crash-looping kube-apiserver pod due to "etcd request timed out".
The kube-apiserver operator reports crash-looping pods from at least 2021-07-28T03:35:58.857796673Z. On the etcd side:

```
2021-07-28T04:26:04.096369302Z {"level":"warn","ts":"2021-07-28T04:26:04.096Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"1.000048303s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/namespaces/default\" keys_only:true ","response":"","error":"context deadline exceeded"}
2021-07-28T04:26:04.096369302Z {"level":"info","ts":"2021-07-28T04:26:04.096Z","caller":"traceutil/trace.go:145","msg":"trace[721358143] range","detail":"{range_begin:/kubernetes.io/namespaces/default; range_end:; }","duration":"1.000148406s","start":"2021-07-28T04:26:03.096Z","end":"2021-07-28T04:26:04.096Z","steps":["trace[721358143] 'agreement among raft nodes before linearized reading' (duration: 1.000006269s)"]}
2021-07-28T04:26:04.891877937Z {"level":"warn","ts":"2021-07-28T04:26:04.891Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c238e77e379579c9","rtt":"985.835µs","error":"dial tcp 10.1.160.124:2380: i/o timeout"}
2021-07-28T04:26:04.921454332Z {"level":"warn","ts":"2021-07-28T04:26:04.921Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"91d4d361ae240b7e","rtt":"1.112162ms","error":"dial tcp 10.1.160.105:2380: i/o timeout"}
2021-07-28T04:26:04.951220504Z {"level":"warn","ts":"2021-07-28T04:26:04.951Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c238e77e379579c9","rtt":"8.446719ms","error":"dial tcp 10.1.160.124:2380: i/o timeout"}
2021-07-28T04:26:04.992775193Z {"level":"warn","ts":"2021-07-28T04:26:04.992Z","caller":"etcdhttp/metrics.go:111","msg":"serving /health false; no leader"}
2021-07-28T04:26:04.992775193Z {"level":"warn","ts":"2021-07-28T04:26:04.992Z","caller":"etcdhttp/metrics.go:60","msg":"/health error","output":"{\"health\":\"false\"}","status-code":503}
2021-07-28T04:26:05.016683053Z {"level":"warn","ts":"2021-07-28T04:26:05.016Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"10.000070286s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/openshift.io/health\" ","response":"","error":"context deadline exceeded"}
2021-07-28T04:26:05.016683053Z {"level":"info","ts":"2021-07-28T04:26:05.016Z","caller":"traceutil/trace.go:145","msg":"trace[1017250282] range","detail":"{range_begin:/openshift.io/health; range_end:; }","duration":"10.000165828s","start":"2021-07-28T04:25:55.016Z","end":"2021-07-28T04:26:05.016Z","steps":["trace[1017250282] 'agreement among raft nodes before linearized reading' (duration: 10.000029203s)"]}
2021-07-28T04:26:05.095099686Z {"level":"warn","ts":"2021-07-28T04:26:05.094Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"91d4d361ae240b7e","rtt":"9.700365ms","error":"dial tcp 10.1.160.105:2380: i/o timeout"}
2021-07-28T04:26:05.272508435Z {"level":"info","ts":"2021-07-28T04:26:05.272Z","caller":"raft/raft.go:923","msg":"9d8b6fda0edcc708 is starting a new election at term 5144"}
2021-07-28T04:26:05.272508435Z {"level":"info","ts":"2021-07-28T04:26:05.272Z","caller":"raft/raft.go:713","msg":"9d8b6fda0edcc708 became candidate at term 5145"}
2021-07-28T04:26:05.272508435Z {"level":"info","ts":"2021-07-28T04:26:05.272Z","caller":"raft/raft.go:824","msg":"9d8b6fda0edcc708 received MsgVoteResp from 9d8b6fda0edcc708 at term 5145"}
```

I.e. leader elections after network i/o timeouts. This is likely network related.
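To map the disruption windows against these messages, the prober's i/o-timeout warnings can be bucketed per hour over a saved etcd log. A small sketch; the input format is assumed to match the excerpts above (RFC3339 timestamp first on each line):

```shell
# Sketch: count etcd "i/o timeout" warnings per hour from a saved log file.
# The first 13 characters of each line form the YYYY-MM-DDTHH bucket.
count_timeouts_per_hour() {
  grep 'i/o timeout' "$1" | cut -c1-13 | sort | uniq -c
}
```

Run it as `count_timeouts_per_hour etcd.log` against a log extracted from must-gather; spikes should line up with the ocp-network-split-sched windows listed in the description.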
kube-apiserver installer pods cannot be scheduled because of "preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources" ???

```
Jul 28 08:05:51.893716 control-plane-0 hyperkube[1489]: I0728 08:05:51.893661 1489 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/installer-11-retry-132-control-plane-0]
Jul 28 08:05:51.894218 control-plane-0 hyperkube[1489]: I0728 08:05:51.893820 1489 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-kube-apiserver"/"installer-sa-dockercfg-lwq2t"
Jul 28 08:05:51.894218 control-plane-0 hyperkube[1489]: I0728 08:05:51.894099 1489 reflector.go:219] Starting reflector *v1.ConfigMap (0s) from object-"openshift-kube-apiserver"/"kube-root-ca.crt"
Jul 28 08:05:51.894430 control-plane-0 hyperkube[1489]: I0728 08:05:51.894407 1489 topology_manager.go:187] "Topology Admit Handler"
Jul 28 08:05:51.894827 control-plane-0 hyperkube[1489]: I0728 08:05:51.894803 1489 predicate.go:113] "Failed to admit pod, unexpected error while attempting to recover from admission failure" pod="openshift-kube-apiserver/installer-11-retry-132-control-plane-0" err="preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: cpu, q: 120), ]"
Jul 28 08:05:51.944236 control-plane-0 hyperkube[1489]: I0728 08:05:51.944183 1489 reflector.go:225] Stopping reflector *v1.Secret (0s) from object-"openshift-kube-apiserver"/"installer-sa-dockercfg-lwq2t"
Jul 28 08:05:51.944497 control-plane-0 hyperkube[1489]: I0728 08:05:51.944264 1489 reflector.go:225] Stopping reflector *v1.ConfigMap (0s) from object-"openshift-kube-apiserver"/"kube-root-ca.crt"
Jul 28 08:05:57.634185 control-plane-0 hyperkube[1489]: I0728 08:05:57.634131 1489 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/6604eea2-5dd9-4e41-9db4-52806485ec97-kube-api-access\") pod \"installer-8-retry-138-control-plane-0\" (UID: \"6604eea2-5dd9-4e41-9db4-52806485ec97\") "
Jul 28 08:05:57.634606 control-plane-0 hyperkube[1489]: E0728 08:05:57.634334 1489 projected.go:293] Couldn't get configMap openshift-kube-apiserver/kube-root-ca.crt: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:05:57.634606 control-plane-0 hyperkube[1489]: E0728 08:05:57.634356 1489 projected.go:199] Error preparing data for projected volume kube-api-access for pod openshift-kube-apiserver/installer-8-retry-138-control-plane-0: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:05:57.634606 control-plane-0 hyperkube[1489]: E0728 08:05:57.634448 1489 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/6604eea2-5dd9-4e41-9db4-52806485ec97-kube-api-access podName:6604eea2-5dd9-4e41-9db4-52806485ec97 nodeName:}" failed. No retries permitted until 2021-07-28 08:07:59.634422515 +0000 UTC m=+81247.416074029 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/6604eea2-5dd9-4e41-9db4-52806485ec97-kube-api-access\") pod \"installer-8-retry-138-control-plane-0\" (UID: \"6604eea2-5dd9-4e41-9db4-52806485ec97\") : object \"openshift-kube-apiserver\"/\"kube-root-ca.crt\" not registered"
Jul 28 08:06:04.699746 control-plane-0 hyperkube[1489]: I0728 08:06:04.699680 1489 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/9074e95e-afd2-4e22-8257-d95e26b7bbd9-kube-api-access\") pod \"installer-8-retry-139-control-plane-0\" (UID: \"9074e95e-afd2-4e22-8257-d95e26b7bbd9\") "
Jul 28 08:06:04.701116 control-plane-0 hyperkube[1489]: E0728 08:06:04.700221 1489 projected.go:293] Couldn't get configMap openshift-kube-apiserver/kube-root-ca.crt: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:06:04.701116 control-plane-0 hyperkube[1489]: E0728 08:06:04.700248 1489 projected.go:199] Error preparing data for projected volume kube-api-access for pod openshift-kube-apiserver/installer-8-retry-139-control-plane-0: object "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
Jul 28 08:06:04.701116 control-plane-0 hyperkube[1489]: E0728 08:06:04.700322 1489 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/9074e95e-afd2-4e22-8257-d95e26b7bbd9-kube-api-access podName:9074e95e-afd2-4e22-8257-d95e26b7bbd9 nodeName:}" failed. No retries permitted until 2021-07-28 08:08:06.700297709 +0000 UTC m=+81254.481949225 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/9074e95e-afd2-4e22-8257-d95e26b7bbd9-kube-api-access\") pod \"installer-8-retry-139-control-plane-0\" (UID: \"9074e95e-afd2-4e22-8257-d95e26b7bbd9\") : object \"openshift-kube-apiserver\"/\"kube-root-ca.crt\" not registered"
Jul 28 08:06:05.710273 control-plane-0 hyperkube[1489]: I0728 08:06:05.710222 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver-control-plane-0" status=Running
Jul 28 08:06:05.710273 control-plane-0 hyperkube[1489]: I0728 08:06:05.710273 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube-controller-manager-control-plane-0" status=Running
Jul 28 08:06:05.710836 control-plane-0 hyperkube[1489]: I0728 08:06:05.710404 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-control-plane-0" status=Running
Jul 28 08:06:05.710836 control-plane-0 hyperkube[1489]: I0728 08:06:05.710417 1489 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube-scheduler-control-plane-0" status=Running
Jul 28 08:06:06.491948 control-plane-0 hyperkube[1489]: I0728 08:06:06.491900 1489 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/installer-11-retry-133-control-plane-0]
Jul 28 08:06:06.492162 control-plane-0 hyperkube[1489]: I0728 08:06:06.492077 1489 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"openshift-kube-apiserver"/"installer-sa-dockercfg-lwq2t"
Jul 28 08:06:06.492162 control-plane-0 hyperkube[1489]: I0728 08:06:06.492138 1489 reflector.go:219] Starting reflector *v1.ConfigMap (0s) from object-"openshift-kube-apiserver"/"kube-root-ca.crt"
Jul 28 08:06:06.492260 control-plane-0 hyperkube[1489]: I0728 08:06:06.492247 1489 topology_manager.go:187] "Topology Admit Handler"
Jul 28 08:06:06.492679 control-plane-0 hyperkube[1489]: I0728 08:06:06.492659 1489 predicate.go:113] "Failed to admit pod, unexpected error while attempting to recover from admission failure" pod="openshift-kube-apiserver/installer-11-retry-133-control-plane-0" err="preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: cpu, q: 120), ]"
```
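The 2322 installer pod retries mentioned earlier can be recounted from a saved pod listing (e.g. `oc get pods -n openshift-kube-apiserver --no-headers > pods.txt`, or the pod directory in must-gather). A minimal sketch; the pod naming pattern is taken from the kubelet logs above:

```shell
# Sketch: count installer retry pods from a saved `oc get pods` listing.
# Pod names are assumed to follow installer-<revision>-retry-<n>-<node>,
# as seen in the kubelet logs above.
count_installer_retries() {
  grep -c '^installer-[0-9]*-retry-' "$1"
}
```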
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.