Bug 1958371 - Pod stuck in pending state after getting scheduled
Summary: Pod stuck in pending state after getting scheduled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Elana Hashman
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1915085
Blocks:
 
Reported: 2021-05-07 18:31 UTC by Anshul Verma
Modified: 2021-06-22 08:30 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-22 08:29:44 UTC
Target Upstream Version:
Embargoed:
ehashman: needinfo-


Attachments


Links
Github openshift/kubernetes pull 779 (closed): Bug 1958371: UPSTREAM: 98424: register all pending pod deletions and check for kill (last updated 2021-06-03 21:51:59 UTC)
Red Hat Product Errata RHBA-2021:2410 (last updated 2021-06-22 08:30:32 UTC)

Description Anshul Verma 2021-05-07 18:31:33 UTC
Description of problem:

The newly created pods are stuck in the `Pending` state indefinitely.

This appears to be a node-level issue; only some nodes are affected at any given time.
~~~
Every 2.0s: oc get pods -A  -o wide | grep -i -e pending -e termina                                                                                                                               KDC1-L-F28SW5J: Wed May  5 14:43:21 2021
mk-dev-01                                          palantir-1-w5f74                                          0/1     Terminating                  0          87m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmca-asstmrg-1620201000-9tcrf                       0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-bocs-1620201000-v4s7l                          0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-crestmap-1620201000-kj6bv                      0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-trtaledmap-1620201000-pgcg7                    0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
~~~

There are no related errors in the kubelet logs even at log level 8. The only related entries seen in the kubelet journal are -
~~~
Apr 28 07:08:49 ip-10-10-5-75 hyperkube[1916527]: I0428 07:08:49.508955 1916527 config.go:383] Receiving a new pod "mq-test-pod5_mk-dev-10(0b40b900-619a-4533-94ba-bd9f1eca06cb)"
Apr 28 07:10:08 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:08.771221 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(030edc20-0666-4658-b711-5f5758729e12)"
Apr 28 07:10:38 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:38.523560 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(5ced5147-8c23-4489-99a8-8e2594a17920)"
~~~
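
For reference, a minimal sketch of how entries like these can be pulled from the kubelet journal on an affected node; the debug-pod approach and the grep pattern are assumptions for illustration, not taken from the case notes -
~~~
# Read the kubelet journal on the affected node via a debug pod and filter
# for one of the stuck pods (pod name used purely as an example pattern).
oc debug node/ip-10-10-11-20 -- chroot /host \
  journalctl -u kubelet --no-pager | grep -i "cron-tlmrp-bocs-1620201000-v4s7l"
~~~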

Restarting crio and the kubelet resolves the issue for some time, but it comes back again.
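
For completeness, a hedged sketch of that temporary workaround on an affected node; the SSH access method and node name are assumptions for illustration -
~~~
# Restart crio and the kubelet on the affected node (temporary workaround only;
# the issue is expected to recur until the underlying bug is fixed).
ssh core@ip-10-10-11-20 "sudo systemctl restart crio kubelet"
~~~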

The pods do get scheduled, as seen in the scheduler logs -
~~~
ansverma@supportshell ~/02925020
$ oc logs openshift-kube-scheduler-ip-10-10-11-249 -c kube-scheduler | egrep "cron-tlmrp-bocs-1620201000-v4s7l|cron-tlmrp-crestmap-1620201000-kj6bv|cron-tlmrp-trtaledmap-1620201000-pgcg7"
2021-05-05T07:50:04.306019974Z I0505 07:50:04.304690       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-bocs-1620201000-v4s7l" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.392086680Z I0505 07:50:04.392049       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-crestmap-1620201000-kj6bv" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.495647885Z I0505 07:50:04.495609       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-trtaledmap-1620201000-pgcg7" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
~~~

Comment 4 Stefan Schimanski 2021-05-17 18:07:48 UTC
I don't see a reason why this should be an API issue, especially given

> Restart of crio and kubelet resolves the issue for sometime but the issue comes back again.

The apiserver does not clean up pods; the kubelet and/or kube-controller-manager do.

Comment 12 Anshul Verma 2021-05-31 11:04:40 UTC
Elana, 

Any update?
Please prioritize this; it is pretty important.

Comment 23 Elana Hashman 2021-06-11 20:26:21 UTC
@Sunil, the patches have landed on the 4.6 branch; can you please verify them? https://github.com/openshift/kubernetes/pull/779 is merged and addresses the issues detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1915085, the dependent bug.
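
For anyone wanting to confirm which openshift/kubernetes level a given payload carries, a rough sketch; <release-pullspec> is a placeholder, not a value from this bug -
~~~
# List the source commits baked into a release payload and check the
# openshift/kubernetes repository level.
oc adm release info <release-pullspec> --commits | grep openshift/kubernetes
~~~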

Comment 27 Sunil Choudhary 2021-06-17 10:50:46 UTC
Checked on 4.6.0-0.nightly-2021-06-11-193006.

Ran the tests below multiple times on a couple of clusters (a rough sketch of the steps follows the list).
1. Create a pod and delete it within seconds of its creation.
2. Create a pod and delete its project within seconds of pod creation.
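
A rough sketch of those two tests; the project name, pod names, and image below are placeholders rather than the exact commands used -
~~~
# Test 1: create a pod and delete it within seconds of creation.
oc new-project verify-1958371
oc run test-pod --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 300
sleep 5
oc delete pod test-pod

# Test 2: create a pod and delete its project within seconds of the pod's creation.
oc run test-pod-2 --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 300
sleep 5
oc delete project verify-1958371
~~~
In both cases the pods should be removed cleanly rather than lingering in Pending or Terminating on the node.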

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-06-11-193006   True        False         156m    Cluster version is 4.6.0-0.nightly-2021-06-11-193006

Comment 29 errata-xmlrpc 2021-06-22 08:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2410

