Description of problem:

Newly created pods get stuck in the `Pending` state indefinitely. This is a node-level issue: it affects some nodes at a time.

~~~
Every 2.0s: oc get pods -A -o wide | grep -i -e pending -e termina        KDC1-L-F28SW5J: Wed May  5 14:43:21 2021

mk-dev-01    palantir-1-w5f74                         0/1   Terminating   0   87m   <none>   ip-10-10-11-20   <none>   <none>
mk-test-10   cron-tlmca-asstmrg-1620201000-9tcrf      0/1   Pending       0   83m   <none>   ip-10-10-11-20   <none>   <none>
mk-test-10   cron-tlmrp-bocs-1620201000-v4s7l         0/1   Pending       0   83m   <none>   ip-10-10-11-20   <none>   <none>
mk-test-10   cron-tlmrp-crestmap-1620201000-kj6bv     0/1   Pending       0   83m   <none>   ip-10-10-11-20   <none>   <none>
mk-test-10   cron-tlmrp-trtaledmap-1620201000-pgcg7   0/1   Pending       0   83m   <none>   ip-10-10-11-20   <none>   <none>
~~~

There are no related entries in the kubelet logs, even at log level 8. The following are the only related lines seen in the kubelet logs:

~~~
Apr 28 07:08:49 ip-10-10-5-75 hyperkube[1916527]: I0428 07:08:49.508955 1916527 config.go:383] Receiving a new pod "mq-test-pod5_mk-dev-10(0b40b900-619a-4533-94ba-bd9f1eca06cb)"
Apr 28 07:10:08 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:08.771221 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(030edc20-0666-4658-b711-5f5758729e12)"
Apr 28 07:10:38 ip-10-10-5-75 hyperkube[1916527]: I0428 07:10:38.523560 1916527 config.go:383] Receiving a new pod "mq-test-pod_mk-dev-10(5ced5147-8c23-4489-99a8-8e2594a17920)"
~~~

Restarting crio and the kubelet resolves the issue for some time, but it comes back again. The pods do get scheduled, as seen in the scheduler logs:

~~~
$ oc logs openshift-kube-scheduler-ip-10-10-11-249 -c kube-scheduler | egrep "cron-tlmrp-bocs-1620201000-v4s7l|cron-tlmrp-crestmap-1620201000-kj6bv|cron-tlmrp-trtaledmap-1620201000-pgcg7"
2021-05-05T07:50:04.306019974Z I0505 07:50:04.304690       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-bocs-1620201000-v4s7l" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.392086680Z I0505 07:50:04.392049       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-crestmap-1620201000-kj6bv" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
2021-05-05T07:50:04.495647885Z I0505 07:50:04.495609       1 scheduler.go:597] "Successfully bound pod to node" pod="mk-test-10/cron-tlmrp-trtaledmap-1620201000-pgcg7" node="ip-10-10-11-20" evaluatedNodes=61 feasibleNodes=21
~~~
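For triage on an affected node, a minimal sketch of the checks used above (the node and scheduler pod names are the ones from this report; substitute your own):

~~~
# List pods stuck in Pending or Terminating, with the node they were bound to
oc get pods -A -o wide | grep -i -e pending -e terminating

# Confirm the scheduler bound the pod to the node
oc -n openshift-kube-scheduler logs openshift-kube-scheduler-ip-10-10-11-249 \
  -c kube-scheduler | grep "Successfully bound pod to node"

# Check whether the kubelet on the node ever received the pod
oc adm node-logs ip-10-10-11-20 -u kubelet | grep "Receiving a new pod"

# Temporary workaround observed in this report: restart crio and kubelet
oc debug node/ip-10-10-11-20 -- chroot /host systemctl restart crio kubelet
~~~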
I don't see a reason why this should be an apiserver issue, especially with

> Restart of crio and kubelet resolves the issue for some time, but the issue comes back again.

The apiserver does not clean up pods; the kubelet and/or kube-controller-manager do.
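As an illustration of that division of responsibility, a Terminating pod already has `deletionTimestamp` set by the apiserver, but the object lingers until the kubelet finishes cleanup. A hedged check (pod and namespace names taken from the report above):

~~~
# If deletionTimestamp is set but the pod still exists, the apiserver has done
# its part; the kubelet on the node has not yet confirmed container cleanup.
oc -n mk-dev-01 get pod palantir-1-w5f74 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
~~~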
Elana, any update? Please check this on priority; this is pretty important.
@Sunil, the patches have landed on the 4.6 branch; can you please verify them? https://github.com/openshift/kubernetes/pull/779 is merged and addresses the issues detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1915085, the dependent bug.
Checked on 4.6.0-0.nightly-2021-06-11-193006. Ran the tests below multiple times on a couple of clusters:

1. Create a pod and delete it within seconds of its creation.
2. Create a pod and delete its project within seconds of pod creation.

~~~
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-06-11-193006   True        False         156m    Cluster version is 4.6.0-0.nightly-2021-06-11-193006
~~~
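A minimal sketch of the two verification tests (the pod name, project name, and image are illustrative assumptions, not from the original report):

~~~
# Test 1: create a pod and delete it within seconds of creation
oc run test-pod --image=registry.access.redhat.com/ubi8/ubi -- sleep 3600
sleep 2
oc delete pod test-pod

# Test 2: create a pod, then delete its project within seconds
oc new-project stuck-pod-test
oc run test-pod --image=registry.access.redhat.com/ubi8/ubi -- sleep 3600
sleep 2
oc delete project stuck-pod-test

# Afterwards, verify no pods remain stuck in Pending or Terminating
oc get pods -A | grep -i -e pending -e terminating
~~~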
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.35 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2410