Description of problem:

We've come across a pretty rare bug when doing mass installations of SNO clusters that prevents the installation from completing on its own. The symptoms are as follows:

- The mirror pod for kube-controller-manager is stuck with 3/4 containers ready
- The container that's down is kube-controller-manager
- `oc logs ...` for that particular container leads to this error:

>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager --previous
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available

Noting the cri-o://... ID of this container from the mirror pod's status, then SSHing into the node and running `sudo crictl logs ...` with that ID, we see the culprit:

>E0505 20:36:46.587883 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>E0505 20:36:47.526268 1 namespaced_resources_deleter.go:161] unable to get all supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>F0505 20:36:47.526298 1 namespaced_resources_deleter.go:164] Unable to get any supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>goroutine 413 [running]:
>k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
>...

Here fc00:1001::8de is the IPv6 address of the single node (it's an IPv6 OVNKubernetes offline cluster, although I don't think that's a crucial detail). The fatal log is coming from these lines [1].

The "connection refused" part is not very surprising, since this is an SNO and its API server fluctuates between being up and down, especially during the initial stages of the cluster.

Version-Release number of selected component (if applicable):
4.10.13

How reproducible:
Around 0.087% of all installations (usually 1-2 out of ~2300) fail due to this bug

Steps to Reproduce:
1. Run 1000 SNO installations
2. One of them should suffer from this

Actual results:
- kube-controller-manager crashes
- kube-controller-manager doesn't come back up
- The logs for kube-controller-manager cannot be retrieved via `oc logs ...`
- The cluster cannot complete installation because kube-controller-manager is down

Expected results:
- kube-controller-manager, like all other controllers, should have better tolerance for API downtime. There was an effort across OpenShift (with library-go) to avoid this kind of crash, but it looks like this condition was missed
- The container should have been restarted by the kubelet
- The logs should still be available via `oc logs ...`
- The cluster should complete installation

Additional info:
If you need a live cluster with this bug reproduced, please reach out; it can be arranged

[1] https://github.com/openshift/kubernetes/blob/fe7796f337ea0d35bc3e6b5428d63685d1833cb5/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L159-L165
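To make the code path in [1] concrete, here is a minimal Go sketch of the failure pattern (illustrative only, not the actual source; discoverNamespacedResources is a made-up stand-in for the discovery call): when discovery against the API server fails completely, the error is treated as fatal, so a transient "connection refused" takes down the whole kube-controller-manager process.

// Illustrative sketch only -- not the code referenced in [1].
package main

import (
    "errors"
    "fmt"
    "os"
)

// discoverNamespacedResources stands in for the discovery call against the
// API server; on an SNO during installation it can return "connection
// refused" while the (single) API server restarts.
func discoverNamespacedResources() ([]string, error) {
    return nil, errors.New(`Get "https://api-int...:6443/api": dial tcp: connect: connection refused`)
}

func main() {
    resources, err := discoverNamespacedResources()
    if err != nil {
        if len(resources) == 0 {
            // Equivalent of klog.Fatalf: exit instead of retrying, so a brief
            // API outage becomes a crash of the whole process ("Unable to get
            // any supported resources from server" in the log above).
            fmt.Fprintf(os.Stderr, "Unable to get any supported resources from server: %v\n", err)
            os.Exit(255)
        }
        // A partial discovery failure is only logged ("unable to get all
        // supported resources from server" in the log above).
        fmt.Fprintf(os.Stderr, "unable to get all supported resources from server: %v\n", err)
    }
    fmt.Printf("discovered %d resource types\n", len(resources))
}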
Created attachment 1877632
Logs for the crashed kube-controller-manager pod
We know from the installer pod that it successfully created the static pod manifest for revision 4 (installer pod phase: Succeeded), but we can only see the pod for revision 2 (which, as mentioned above, fails later and is not restarted).

We can also observe the missing static pod via the following event:

15:46:13 (x39) openshift-kube-controller-manager-operator kube-controller-manager-operator-missingstaticpodcontroller kube-controller-manager-operator MissingStaticPod static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 4 on node: "sno01271" didn't show up, waited: 2m30s

This seems to be an issue with the kubelet not working properly. Could these logs be relevant?

May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.684790 31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"openshift-kube-scheduler-sno01271\" already exists" pod="openshift-kube-scheduler/openshift-kube-scheduler-sno01271"
May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.685663 31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"etcd-sno01271\" already exists" pod="openshift-etcd/etcd-sno01271"
May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.687341 31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"kube-controller-manager-sno01271\" already exists" pod="openshift-kube-controller-manager/kube-controller-manager-sno01271"

I am seeing a lot of network errors as well (even later, not just in the initial stage), so that might also be a cause.

Can somebody from the kubelet team give this a look?
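Purely as an illustration of the step those "Failed creating a mirror pod" lines refer to (a hedged sketch using client-go with a fake clientset, not the kubelet's actual code; ensureMirrorPod is a made-up name): the kubelet creates an API-side mirror pod for each static pod, and a persistent "already exists" error suggests a stale mirror pod left over from the previous static pod instance. If that stale pod is never cleaned up, the API-side view (what `oc logs` reads) stays out of sync with the containers actually running on the node, which matches the symptoms above.

// Illustrative sketch only -- not kubelet source.
package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/fake"
)

// ensureMirrorPod mimics the step failing in the journal: create the
// API-side mirror pod for a static pod and handle "already exists".
func ensureMirrorPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
    _, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
    if err == nil || !apierrors.IsAlreadyExists(err) {
        return err
    }
    // A stale mirror pod is still registered in the API. Unless it is deleted
    // and recreated, `oc logs` keeps pointing at a pod object that no longer
    // matches the containers cri-o is actually running.
    existing, getErr := client.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
    if getErr != nil {
        return getErr
    }
    fmt.Printf("mirror pod %s/%s already exists (uid %s); stale copy needs cleanup before recreation\n",
        existing.Namespace, existing.Name, existing.UID)
    return nil
}

func main() {
    pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
        Name:      "kube-controller-manager-sno01271",
        Namespace: "openshift-kube-controller-manager",
    }}
    // A fake clientset pre-loaded with a stale copy of the pod reproduces the
    // "already exists" path without a live cluster.
    client := fake.NewSimpleClientset(pod.DeepCopy())
    if err := ensureMirrorPod(context.Background(), client, pod); err != nil {
        fmt.Println("error:", err)
    }
}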
We observe this bug at a similar frequency in the 4.9 mass installations we run as well.
Hey folks, we are still seeing this issue in large-scale installs of SNOs:

* OCP 4.9.36
* 2296 SNO installations attempted
* 2224 successful, 72 failures
* 3 of those failures were due to this bug

I can provide access to those SNOs for further analysis, or grab something from them to assist in solving the bug.
*** Bug 2088289 has been marked as a duplicate of this bug. ***
Another occurrence in https://bugzilla.redhat.com/show_bug.cgi?id=2088289
Looks like the same problem as:

https://github.com/kubernetes/kubernetes/issues/111174
https://github.com/kubernetes/kubernetes/pull/111175
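For context on the fix direction, here is a minimal, purely illustrative Go sketch of the behaviour requested in the "Expected results" above: retry discovery with backoff instead of exiting. It is not taken from the linked upstream issue or PR.

// Illustrative sketch only.
package main

import (
    "errors"
    "fmt"
    "time"
)

// discoverResources stands in for the discovery call that fails with
// "connection refused" while the SNO's API server restarts.
func discoverResources() ([]string, error) {
    return nil, errors.New("connection refused")
}

func main() {
    backoff := time.Second
    for attempt := 1; attempt <= 5; attempt++ {
        resources, err := discoverResources()
        if err == nil {
            fmt.Printf("discovered %d resource types\n", len(resources))
            return
        }
        // Log and retry with exponential backoff rather than calling Fatalf;
        // on a single-node cluster the API server usually comes back shortly.
        fmt.Printf("attempt %d: %v; retrying in %s\n", attempt, err, backoff)
        time.Sleep(backoff)
        backoff *= 2
    }
    fmt.Println("discovery still failing after retries; will try again on the next sync instead of crashing")
}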
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9256