Bug 2082628

Summary: new revision of the KCM static pod is not created while the old revision is broken / has the wrong status
Product: OpenShift Container Platform
Component: Node
Sub Component: Kubelet
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DEFERRED
Severity: medium
Priority: unspecified
Reporter: Omer Tuchfeld <otuchfel>
Assignee: Joel Smith <joelsmith>
QA Contact: Sunil Choudhary <schoudha>
Docs Contact:
CC: akrzos, bzvonar, fkrepins, imiller, mbargenq, mfojtik, qiwan, rphillips
Target Milestone: ---
Target Release: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-09 01:18:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Omer Tuchfeld 2022-05-06 15:47:38 UTC
Description of problem:

We've come across a fairly rare bug during mass installations of single-node OpenShift (SNO) clusters that prevents an installation from completing on its own.

The symptoms are as follows:
- The mirror pod for kube-controller-manager is stuck with 3/4 containers ready
- The container that's down is kube-controller-manager
- oc logs ... for that particular container leads to this error:

>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager --previous
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available

Noting the cri-o://... ID of this container from the mirror pod's status, then SSHing into the node and running sudo crictl logs ... with that ID, we see the culprit:

>E0505 20:36:46.587883       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>E0505 20:36:47.526268       1 namespaced_resources_deleter.go:161] unable to get all supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>F0505 20:36:47.526298       1 namespaced_resources_deleter.go:164] Unable to get any supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>goroutine 413 [running]:
>k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
>...

Here fc00:1001::8de is the IPv6 address of the single node (it's an IPv6 OVNKubernetes offline cluster, although I don't think that's a crucial detail). The fatal log comes from these [1] lines.
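
For reference, a minimal scripted sketch of the procedure above (node name and namespace as in this report; the jsonpath filter and the "core" SSH user are my assumptions, not taken from the report):

# Grab the cri-o://... containerID of the crashed container from the mirror pod's status
POD=kube-controller-manager-sno01271
CID=$(oc get pod -n openshift-kube-controller-manager "$POD" \
  -o jsonpath='{.status.containerStatuses[?(@.name=="kube-controller-manager")].containerID}')

# crictl expects the bare ID, so strip the cri-o:// scheme and read the logs on the node
ssh core@sno01271 "sudo crictl logs ${CID#cri-o://}"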

The connection refused part is not very surprising, since this is a SNO and its API server fluctuates between being up and down, especially during the initial stages of the cluster.

Version-Release number of selected component (if applicable):
4.10.13

How reproducible:
Around 0.087% (usually 1-2 out of 2300) of all installations fail due to this bug

Steps to Reproduce:
1. Run 1000 SNO installations
2. One of them should suffer from this

Actual results:
- kube-controller-manager crashes

- kube-controller-manager doesn't get back up

- The logs for kube-controller-manager cannot be retrieved via `oc logs...`

- The cluster cannot complete installation because kube-controller-manager is down

Expected results:
- kube-controller-manager, like all other controllers, should tolerate API downtime better. There was an effort across OpenShift (via library-go) to avoid this kind of crash, but it looks like this condition was missed

- The container should have been restarted by the kubelet (a quick check for this is sketched after this list)

- The logs should still be available via oc logs.

- The cluster should complete installation
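
As a sanity check for the restart expectation above, a sketch of one way to see whether the kubelet recorded any restarts or a last state for the crashed container (pod name and namespace as above; the jsonpath is illustrative):

# Print the restart count and last state of the kube-controller-manager container
oc get pod -n openshift-kube-controller-manager kube-controller-manager-sno01271 \
  -o jsonpath='{range .status.containerStatuses[?(@.name=="kube-controller-manager")]}restarts={.restartCount} lastState={.lastState}{"\n"}{end}'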

Additional info:
If you need a live cluster with this bug reproduced, please reach out; it can be arranged.

[1] https://github.com/openshift/kubernetes/blob/fe7796f337ea0d35bc3e6b5428d63685d1833cb5/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L159-L165

Comment 1 Omer Tuchfeld 2022-05-06 15:55:39 UTC
Created attachment 1877632 [details]
Logs for the crashed kube-controller-manager pod

Comment 5 Filip Krepinsky 2022-05-09 22:19:18 UTC
We know from our installer pod that it successfully created the static pod manifest for revision 4 (installer pod phase: Succeeded), but we can only see the pod at revision 2 (which, as mentioned above, fails later and is not restarted).

We can also observe the missing static pod via the following event:

15:46:13 (x39)	openshift-kube-controller-manager-operator	kube-controller-manager-operator-missingstaticpodcontroller	kube-controller-manager-operator	 MissingStaticPod static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 4 on node: "sno01271" didn't show up, waited: 2m30s
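
For reference, a sketch of how these can be surfaced on a live cluster (namespaces as shown in this report; the greps are just a convenience):

# The event emitted by the operator's missing-static-pod controller
oc get events -n openshift-kube-controller-manager-operator | grep MissingStaticPod

# The revision-4 installer pod reports phase Succeeded even though the revision-4 static pod never shows up
oc get pods -n openshift-kube-controller-manager | grep installer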

This seems to be an issue with kubelet not working properly. Could these logs be relevant?

May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.684790   31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"openshift-kube-scheduler-sno01271\" already exists" pod="openshift-kube-scheduler/openshift-kube-scheduler-sno01271"
May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.685663   31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"etcd-sno01271\" already exists" pod="openshift-etcd/etcd-sno01271"
May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.687341   31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"kube-controller-manager-sno01271\" already exists" pod="openshift-kube-controller-manager/kube-controller-manager-sno01271"
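
A sketch of one way to pull these from the node, assuming the kubelet runs as the kubelet.service systemd unit (as on RHCOS) and using the timestamp from the lines above:

# On the node: mirror-pod related errors in the kubelet journal around the time of the crash
sudo journalctl -u kubelet --since "2022-05-05 20:30" | grep -i "mirror pod"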

I am seeing a lot of network errors as well (even later, not just in the initial stage), so that might also be the cause.

Can somebody from the kubelet team give this a look?

Comment 6 Omer Tuchfeld 2022-05-11 19:55:53 UTC
We observe this bug at a similar frequency in the 4.9 mass installations we run as well.

Comment 7 Alex Krzos 2022-06-02 14:21:40 UTC
Hey folks, we are still seeing this issue in large-scale installs of SNOs:
* OCP 4.9.36
* 2296 SNO installations attempted
* 2224 successful, 72 failures
* 3 of those failures were due to this bug

I can provide access to those SNOs for further analysis or grab something from it to assist in solving the bug.

Comment 8 Filip Krepinsky 2022-06-14 19:44:35 UTC
*** Bug 2088289 has been marked as a duplicate of this bug. ***

Comment 9 Filip Krepinsky 2022-06-14 19:46:08 UTC
Another occurrence in https://bugzilla.redhat.com/show_bug.cgi?id=2088289

Comment 11 Shiftzilla 2023-03-09 01:18:46 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9256