Bug 1958371

Summary: Pod stuck in pending state after getting scheduled
Product: OpenShift Container Platform
Component: Node
Sub component: Kubelet
Version: 4.6.z
Target Release: 4.6.z
Reporter: Anshul Verma <ansverma>
Assignee: Elana Hashman <ehashman>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aos-bugs, ehashman, mfojtik, openshift-bugs-escalate, rkant, rphillips, rupatel, sttts, wking, xxia
Flags: ehashman: needinfo-
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-06-22 08:29:44 UTC
Type: Bug
Bug Depends On: 1915085    

Description Anshul Verma 2021-05-07 18:31:33 UTC
Description of problem:

Newly created pods get stuck in the `Pending` state indefinitely.

This is a node-level issue; at any given time it affects only some nodes.
~~~
Every 2.0s: oc get pods -A  -o wide | grep -i -e pending -e termina                                                                                                                               KDC1-L-F28SW5J: Wed May  5 14:43:21 2021
mk-dev-01                                          palantir-1-w5f74                                          0/1     Terminating                  0          87m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmca-asstmrg-1620201000-9tcrf                       0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-bocs-1620201000-v4s7l                          0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-crestmap-1620201000-kj6bv                      0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
mk-test-10                                         cron-tlmrp-trtaledmap-1620201000-pgcg7                    0/1     Pending                      0          83m     <none>          ip-10-10-11-20    <none>           <none>
~~~

There are no related entries in the kubelet logs, even at log level 8. The following are the only related messages seen in the kubelet logs:
~~~
Apr 28 07:08:49 ip-10-10-5-75 hyperkube[1916527]: I0428 07:08:49.508955 1916527 config.go:383] Receiving a new pod "mq-test-pod5_mk-dev-10(0b40b900-619a-4533-94ba-bd9f1eca06cb)"
Apr 28 07:10:08 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:08.771221 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(030edc20-0666-4658-b711-5f5758729e12)"
Apr 28 07:10:38 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:38.523560 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(5ced5147-8c23-4489-99a8-8e2594a17920)"
~~~
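
For reference, the same kubelet journal entries can be pulled from the node without SSH via `oc adm node-logs`; the node name below is taken from the log lines above:
~~~
# Pull the kubelet journal from the affected node and filter for pod admission messages
oc adm node-logs ip-10-10-5-75 -u kubelet | grep "Receiving a new pod"
~~~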

Restarting CRI-O and the kubelet resolves the issue for some time, but it comes back again.
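
The workaround was applied on the affected node roughly as follows (run as root on the node, for example from an `oc debug node/...` session; exact invocation may have differed):
~~~
# Approximate workaround applied on the affected node
systemctl restart crio
systemctl restart kubelet
~~~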

The pods do get scheduled, as seen in the scheduler logs:
~~~
ansverma@supportshell ~/02925020
$ oc logs openshift-kube-scheduler-ip-10-10-11-249 -c kube-scheduler | egrep "cron-tlmrp-bocs-1620201000-v4s7l|cron-tlmrp-crestmap-1620201000-kj6bv|cron-tlmrp-trtaledmap-1620201000-pgcg7"
2021-05-05T07:50:04.306019974Z I0505 07:50:04.304690       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-bocs-1620201000-v4s7l" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.392086680Z I0505 07:50:04.392049       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-crestmap-1620201000-kj6bv" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.495647885Z I0505 07:50:04.495609       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-trtaledmap-1620201000-pgcg7" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
~~~
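
For anyone reproducing this, the stuck pods can also be inspected directly to confirm they are bound to the node but never progress past `Pending`; the namespace and pod name below are taken from the output above:
~~~
oc describe pod -n mk-test-10 cron-tlmrp-bocs-1620201000-v4s7l
oc get events -n mk-test-10 --sort-by=.lastTimestamp
~~~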

Comment 4 Stefan Schimanski 2021-05-17 18:07:48 UTC
I don't see a reason why this should be an API issue, especially given

> Restart of crio and kubelet resolves the issue for sometime but the issue comes back again.

The apiserver does not clean up pods; the kubelet and/or kube-controller-manager do.

Comment 12 Anshul Verma 2021-05-31 11:04:40 UTC
Elana, 

Any update?
Please treat this as a priority; it is quite important.

Comment 23 Elana Hashman 2021-06-11 20:26:21 UTC
@Sunil, the patches have landed on the 4.6 branch; can you please verify them? https://github.com/openshift/kubernetes/pull/779 is merged and addresses the issues detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1915085, the dependent bug.

Comment 27 Sunil Choudhary 2021-06-17 10:50:46 UTC
Checked on 4.6.0-0.nightly-2021-06-11-193006.

Ran the tests below multiple times on a couple of clusters; a rough CLI sketch follows the list.
1. Create a pod and delete it within seconds of its creation.
2. Create a pod and delete its project within seconds of pod creation.
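
A minimal sketch of how those two tests can be driven from the CLI (pod, image, and project names here are placeholders, not the exact ones used during verification):
~~~
# Test 1: create a pod and delete it within seconds of its creation
oc run test-pod --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 3600
sleep 5
oc delete pod test-pod --wait=false

# Test 2: create a pod and delete its project within seconds of pod creation
oc new-project bz1958371-test
oc run test-pod --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 3600
sleep 5
oc delete project bz1958371-test
~~~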

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-06-11-193006   True        False         156m    Cluster version is 4.6.0-0.nightly-2021-06-11-193006

Comment 29 errata-xmlrpc 2021-06-22 08:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2410