Bug 1958371 - Pod stuck in pending state after getting scheduled
Summary: Pod stuck in pending state after getting scheduled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Elana Hashman
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1915085
Blocks:
 
Reported: 2021-05-07 18:31 UTC by Anshul Verma
Modified: 2021-06-22 08:30 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-22 08:29:44 UTC
Target Upstream Version:
Embargoed:
ehashman: needinfo-


Attachments


Links
Github openshift/kubernetes pull 779 (closed): Bug 1958371: UPSTREAM: 98424: register all pending pod deletions and check for kill (last updated 2021-06-03 21:51:59 UTC)
Red Hat Product Errata RHBA-2021:2410 (last updated 2021-06-22 08:30:32 UTC)

Description Anshul Verma 2021-05-07 18:31:33 UTC
Description of problem:

The newly created pods are stuck in the `Pending` state indefinitely.

This appears to be a node-level issue; only some nodes are affected at any given time.
~~~
Every 2.0s: oc get pods -A  -o wide | grep -i -e pending -e termina                                                                                                                               KDC1-L-F28SW5J: Wed May  5 14:43:21 2021
mk-dev-01                                          palantir-1-w5f74                                          0/1     Terminating                  0          87m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmca-asstmrg-1620201000-9tcrf                       0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-bocs-1620201000-v4s7l                          0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-crestmap-1620201000-kj6bv                      0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-trtaledmap-1620201000-pgcg7                    0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
~~~

There are no related errors in the kubelet logs even at log level 8. The only related entries seen in the kubelet journal are -
~~~
Apr 28 07:08:49 ip-10-10-5-75 hyperkube[1916527]: I0428 07:08:49.508955 1916527 config.go:383] Receiving a new pod "mq-test-pod5_mk-dev-10(0b40b900-619a-4533-94ba-bd9f1eca06cb)"
Apr 28 07:10:08 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:08.771221 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(030edc20-0666-4658-b711-5f5758729e12)"
Apr 28 07:10:38 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:38.523560 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(5ced5147-8c23-4489-99a8-8e2594a17920)"
~~~
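
For reference, a minimal sketch of how entries like these can be pulled from the kubelet journal on an affected node; the debug-pod approach and the grep pattern are assumptions for illustration, not taken from the case notes -
~~~
# Read the kubelet journal on the affected node via a debug pod and filter
# for one of the stuck pods (pod name used purely as an example pattern).
oc debug node/ip-10-10-11-20 -- chroot /host \
  journalctl -u kubelet --no-pager | grep -i "cron-tlmrp-bocs-1620201000-v4s7l"
~~~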

Restarting crio and the kubelet resolves the issue for some time, but it comes back again.
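
For completeness, a hedged sketch of that temporary workaround on an affected node; the SSH access method and node name are assumptions for illustration -
~~~
# Restart crio and the kubelet on the affected node (temporary workaround only;
# the issue is expected to recur until the underlying bug is fixed).
ssh core@ip-10-10-11-20 "sudo systemctl restart crio kubelet"
~~~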

The pods do get scheduled, as seen in the scheduler logs -
~~~
ansverma@supportshell ~/02925020
$ oc logs openshift-kube-scheduler-ip-10-10-11-249 -c kube-scheduler | egrep "cron-tlmrp-bocs-1620201000-v4s7l|cron-tlmrp-crestmap-1620201000-kj6bv|cron-tlmrp-trtaledmap-1620201000-pgcg7"
2021-05-05T07:50:04.306019974Z I0505 07:50:04.304690       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-bocs-1620201000-v4s7l" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.392086680Z I0505 07:50:04.392049       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-crestmap-1620201000-kj6bv" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.495647885Z I0505 07:50:04.495609       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-trtaledmap-1620201000-pgcg7" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
~~~

Comment 4 Stefan Schimanski 2021-05-17 18:07:48 UTC
I don't see a reason why this should be an API issue, especially given

> Restart of crio and kubelet resolves the issue for sometime but the issue comes back again.

The apiserver does not clean up pods; the kubelet and/or kube-controller-manager do.

Comment 12 Anshul Verma 2021-05-31 11:04:40 UTC
Elana, 

Any update?
Please prioritize this; it is pretty important.

Comment 23 Elana Hashman 2021-06-11 20:26:21 UTC
@Sunil, the patches have landed on the 4.6 branch; can you please verify them? https://github.com/openshift/kubernetes/pull/779 is merged and addresses the issues detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1915085, the dependent bug.
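
For anyone wanting to confirm which openshift/kubernetes level a given payload carries, a rough sketch; <release-pullspec> is a placeholder, not a value from this bug -
~~~
# List the source commits baked into a release payload and check the
# openshift/kubernetes repository level.
oc adm release info <release-pullspec> --commits | grep openshift/kubernetes
~~~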

Comment 27 Sunil Choudhary 2021-06-17 10:50:46 UTC
Checked on 4.6.0-0.nightly-2021-06-11-193006.

Ran the tests below multiple times on a couple of clusters (a rough sketch of the steps follows the list).
1. Create a pod and delete it within seconds of its creation.
2. Create a pod and delete its project within seconds of pod creation.
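
A rough sketch of those two tests; the project name, pod names, and image below are placeholders rather than the exact commands used -
~~~
# Test 1: create a pod and delete it within seconds of creation.
oc new-project verify-1958371
oc run test-pod --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 300
sleep 5
oc delete pod test-pod

# Test 2: create a pod and delete its project within seconds of the pod's creation.
oc run test-pod-2 --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 300
sleep 5
oc delete project verify-1958371
~~~
In both cases the pods should be removed cleanly rather than lingering in Pending or Terminating on the node.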

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-06-11-193006   True        False         156m    Cluster version is 4.6.0-0.nightly-2021-06-11-193006

Comment 29 errata-xmlrpc 2021-06-22 08:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2410

