Description of problem:

We've come across a pretty rare bug when doing mass installations of SNO clusters that prevents the installation from completing on its own. The symptoms are as follows:

- The mirror pod for kube-controller-manager is stuck with 3/4 containers ready
- The container that's down is kube-controller-manager
- `oc logs ...` for that particular container leads to this error:

>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager --previous
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available

Noting the cri-o://... ID of this container from the mirror pod's status, then SSHing into the node and running `sudo crictl logs ...` with that ID, we see the culprit:

>E0505 20:36:46.587883 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>E0505 20:36:47.526268 1 namespaced_resources_deleter.go:161] unable to get all supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>F0505 20:36:47.526298 1 namespaced_resources_deleter.go:164] Unable to get any supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>goroutine 413 [running]:
>k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
>...

Here fc00:1001::8de is the IPv6 address of the single node (it's an IPv6 OVNKubernetes offline cluster, although I don't think that's a crucial detail). The fatal log is coming from these lines [1].

The "connection refused" part is not very surprising, since this is an SNO and its API server fluctuates between being up and down, especially during the initial stages of the cluster.

Version-Release number of selected component (if applicable):
4.10.13

How reproducible:
Around 0.087% of all installations (usually 1-2 out of ~2300) fail due to this bug

Steps to Reproduce:
1. Run 1000 SNO installations
2. One of them should suffer from this

Actual results:
- kube-controller-manager crashes
- kube-controller-manager doesn't come back up
- The logs for kube-controller-manager cannot be retrieved via `oc logs ...`
- The cluster cannot complete installation because kube-controller-manager is down

Expected results:
- kube-controller-manager, like all other controllers, should have better tolerance for API downtime. There was an effort across OpenShift (with library-go) to avoid this kind of crash, but it looks like this condition was missed
- The container should have been restarted by the kubelet
- The logs should still be available via `oc logs ...`
- The cluster should complete installation

Additional info:
If you need a live cluster with this bug reproduced, please reach out; it can be arranged

[1] https://github.com/openshift/kubernetes/blob/fe7796f337ea0d35bc3e6b5428d63685d1833cb5/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L159-L165
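To make the code path in [1] concrete, here is a minimal Go sketch of the failure pattern (illustrative only, not the actual source; discoverNamespacedResources is a made-up stand-in for the discovery call): when discovery against the API server fails completely, the error is treated as fatal, so a transient "connection refused" takes down the whole kube-controller-manager process.

// Illustrative sketch only -- not the code referenced in [1].
package main

import (
    "errors"
    "fmt"
    "os"
)

// discoverNamespacedResources stands in for the discovery call against the
// API server; on an SNO during installation it can return "connection
// refused" while the (single) API server restarts.
func discoverNamespacedResources() ([]string, error) {
    return nil, errors.New(`Get "https://api-int...:6443/api": dial tcp: connect: connection refused`)
}

func main() {
    resources, err := discoverNamespacedResources()
    if err != nil {
        if len(resources) == 0 {
            // Equivalent of klog.Fatalf: exit instead of retrying, so a brief
            // API outage becomes a crash of the whole process ("Unable to get
            // any supported resources from server" in the log above).
            fmt.Fprintf(os.Stderr, "Unable to get any supported resources from server: %v\n", err)
            os.Exit(255)
        }
        // A partial discovery failure is only logged ("unable to get all
        // supported resources from server" in the log above).
        fmt.Fprintf(os.Stderr, "unable to get all supported resources from server: %v\n", err)
    }
    fmt.Printf("discovered %d resource types\n", len(resources))
}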
Created attachment 1877632
Logs for the crashed kube-controller-manager pod
We know from the installer pod that it successfully created the static pod manifest for revision 4 (installer pod phase: Succeeded), but we can only see the pod for revision 2 (which, as mentioned above, fails later and is not restarted).

We can also observe the missing static pod via the following event:

15:46:13 (x39) openshift-kube-controller-manager-operator kube-controller-manager-operator-missingstaticpodcontroller kube-controller-manager-operator MissingStaticPod static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 4 on node: "sno01271" didn't show up, waited: 2m30s

This seems to be an issue with the kubelet not working properly. Could these logs be relevant?

May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.684790 31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"openshift-kube-scheduler-sno01271\" already exists" pod="openshift-kube-scheduler/openshift-kube-scheduler-sno01271"
May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.685663 31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"etcd-sno01271\" already exists" pod="openshift-etcd/etcd-sno01271"
May 05 20:37:01 sno01271 bash[31866]: E0505 20:37:01.687341 31866 kubelet.go:1755] "Failed creating a mirror pod for" err="pods \"kube-controller-manager-sno01271\" already exists" pod="openshift-kube-controller-manager/kube-controller-manager-sno01271"

I am seeing a lot of network errors as well (even later, not just in the initial stage), so that might also be a cause.

Can somebody from the kubelet team give this a look?
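Purely as an illustration of the step those "Failed creating a mirror pod" lines refer to (a hedged sketch using client-go with a fake clientset, not the kubelet's actual code; ensureMirrorPod is a made-up name): the kubelet creates an API-side mirror pod for each static pod, and a persistent "already exists" error suggests a stale mirror pod left over from the previous static pod instance. If that stale pod is never cleaned up, the API-side view (what `oc logs` reads) stays out of sync with the containers actually running on the node, which matches the symptoms above.

// Illustrative sketch only -- not kubelet source.
package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/fake"
)

// ensureMirrorPod mimics the step failing in the journal: create the
// API-side mirror pod for a static pod and handle "already exists".
func ensureMirrorPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
    _, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
    if err == nil || !apierrors.IsAlreadyExists(err) {
        return err
    }
    // A stale mirror pod is still registered in the API. Unless it is deleted
    // and recreated, `oc logs` keeps pointing at a pod object that no longer
    // matches the containers cri-o is actually running.
    existing, getErr := client.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
    if getErr != nil {
        return getErr
    }
    fmt.Printf("mirror pod %s/%s already exists (uid %s); stale copy needs cleanup before recreation\n",
        existing.Namespace, existing.Name, existing.UID)
    return nil
}

func main() {
    pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
        Name:      "kube-controller-manager-sno01271",
        Namespace: "openshift-kube-controller-manager",
    }}
    // A fake clientset pre-loaded with a stale copy of the pod reproduces the
    // "already exists" path without a live cluster.
    client := fake.NewSimpleClientset(pod.DeepCopy())
    if err := ensureMirrorPod(context.Background(), client, pod); err != nil {
        fmt.Println("error:", err)
    }
}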
We observe this bug at a similar frequency in the 4.9 mass installations we run as well.
Hey folks, we are still seeing this issue in large-scale installs of SNOs:

* OCP 4.9.36
* 2296 SNO installations attempted
* 2224 successful, 72 failures
* 3 of those failures were due to this bug

I can provide access to those SNOs for further analysis, or grab something from them to assist in solving the bug.
*** Bug 2088289 has been marked as a duplicate of this bug. ***
Another occurrence in https://bugzilla.redhat.com/show_bug.cgi?id=2088289
Looks like the same problem as:

https://github.com/kubernetes/kubernetes/issues/111174
https://github.com/kubernetes/kubernetes/pull/111175
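For context on the fix direction, here is a minimal, purely illustrative Go sketch of the behaviour requested in the "Expected results" above: retry discovery with backoff instead of exiting. It is not taken from the linked upstream issue or PR.

// Illustrative sketch only.
package main

import (
    "errors"
    "fmt"
    "time"
)

// discoverResources stands in for the discovery call that fails with
// "connection refused" while the SNO's API server restarts.
func discoverResources() ([]string, error) {
    return nil, errors.New("connection refused")
}

func main() {
    backoff := time.Second
    for attempt := 1; attempt <= 5; attempt++ {
        resources, err := discoverResources()
        if err == nil {
            fmt.Printf("discovered %d resource types\n", len(resources))
            return
        }
        // Log and retry with exponential backoff rather than calling Fatalf;
        // on a single-node cluster the API server usually comes back shortly.
        fmt.Printf("attempt %d: %v; retrying in %s\n", attempt, err, backoff)
        time.Sleep(backoff)
        backoff *= 2
    }
    fmt.Println("discovery still failing after retries; will try again on the next sync instead of crashing")
}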
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9256