Bug 2011513
| Summary: | Kubelet rejects pods that use resources that should be freed by completed pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jiří Mencák <jmencak> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Weinan Liu <weinliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aos-bugs, ehashman, jiwei, mifiedle, rphillips, wking, xtian, yliu1 |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2011815 (view as bug list) | Environment: | |
| Last Closed: | 2022-03-10 16:17:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2011815, 2011956 | | |
| Attachments: | | | |
Description
Jiří Mencák
2021-10-06 17:41:04 UTC
I have been able to reproduce this on an upstream single-node local-up-cluster.sh cluster and filed https://github.com/kubernetes/kubernetes/issues/105523

Upstream fix PR: https://github.com/kubernetes/kubernetes/pull/105527
E2E test that verifies the behaviour is broken on HEAD: https://github.com/kubernetes/kubernetes/pull/105552
Cherry-pick to verify the behaviour is working on 1.21: https://github.com/kubernetes/kubernetes/pull/105553

The test name is "[sig-node] Restart [Serial] [Slow] [Disruptive] Kubelet should correctly account for terminated pods after restart" and it is part of the node serial suite.

We are waiting to get final CI results back before LGTM/approval. Only the first PR should merge, and it will then be backported to 1.22. I created a "/test pull-kubernetes-node-kubelet-serial-122" job for testing this against the 1.22 branch.

The issue is not fixed on 4.10.0-0.nightly-2021-10-08-050801, which was still in the Ready state (not Accepted) at 2021-10-08T05:08:01Z. It may not have the PR included. Waiting for the next build to check.

    oc get po
    NAME        READY   STATUS      RESTARTS   AGE
    complete1   0/1     Completed   0          3m56s
    complete2   0/1     Completed   0          3m51s
    complete3   0/1     Completed   0          3m46s
    complete4   0/1     Completed   0          3m40s
    complete5   0/1     Completed   0          3m37s
    complete6   0/1     Completed   0          3m32s
    complete7   0/1     Completed   0          3m26s
    complete8   0/1     Completed   0          3m21s
    running1    1/1     Running     1          3m16s
    running2    0/1     OutOfcpu    0          3m15s
    running3    0/1     OutOfcpu    0          3m14s
    running4    0/1     OutOfcpu    0          3m13s
    running5    0/1     OutOfcpu    0          3m12s
    running6    0/1     OutOfcpu    0          3m11s

Hey Weinan, the fix is not in 4.10.0-0.nightly-2021-10-08-050801 yet, thanks for checking! One way to check is to look at the git log of openshift/kubernetes and compare it with the kubelet version. You need a commit equal to or newer than 931224322c58da67eb8b3e9d4d3ff0e7dbf81cf2. You can get the kubelet version by checking the output of:

    $ oc get no
    NAME                                                    STATUS   ROLES    AGE     VERSION
    jmencak-fxfd2-master-0.c.openshift-gce-devel.internal   Ready    master   5m27s   v1.22.1+4d7e196
    jmencak-fxfd2-master-1.c.openshift-gce-devel.internal   Ready    master   5m39s   v1.22.1+4d7e196
    jmencak-fxfd2-master-2.c.openshift-gce-devel.internal   Ready    master   5m40s   v1.22.1+4d7e196

Here 4d7e196 indicates a kubelet build that does not have the fix.

    NAME                                         STATUS   ROLES    AGE   VERSION
    ip-10-0-128-201.us-east-2.compute.internal   Ready    worker   48m   v1.22.1+4d7e196
    ip-10-0-142-115.us-east-2.compute.internal   Ready    master   57m   v1.22.1+4d7e196
    ip-10-0-165-183.us-east-2.compute.internal   Ready    master   58m   v1.22.1+4d7e196
    ip-10-0-206-28.us-east-2.compute.internal    Ready    master   57m   v1.22.1+4d7e196

    oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.10.0-0.nightly-2021-10-08-090421   True        False         35m     Cluster version is 4.10.0-0.nightly-2021-10-08-090421

@Jiří, thanks!

4.10.0-0.nightly-2021-10-08-090421 does not have the fix yet.

*** Bug 2005647 has been marked as a duplicate of this bug. ***

*** Bug 2009092 has been marked as a duplicate of this bug. ***
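As a rough illustration of the kubelet version check described above, the sketch below resolves each node's kubelet build commit (the suffix after the "+", e.g. 4d7e196) and asks git whether it already contains the fix commit. The path of the local openshift/kubernetes clone is an assumption, and short hashes only resolve if the relevant release branches have been fetched into that clone.

```bash
#!/usr/bin/env bash
# Sketch only: report whether each node's kubelet build contains the fix commit.
# Assumes a local clone of openshift/kubernetes at the path below with the
# relevant release branches already fetched.
set -euo pipefail

FIX_COMMIT=931224322c58da67eb8b3e9d4d3ff0e7dbf81cf2
CLONE_DIR=~/src/openshift/kubernetes   # assumed clone location

cd "$CLONE_DIR"

for ver in $(oc get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'); do
  build=${ver#*+}   # e.g. v1.22.1+4d7e196 -> 4d7e196
  if git merge-base --is-ancestor "$FIX_COMMIT" "$build" 2>/dev/null; then
    echo "$ver: contains the fix"
  else
    echo "$ver: does not contain the fix (or $build could not be resolved locally)"
  fi
done
```

Against the outputs above, this should flag v1.22.1+4d7e196 as missing the fix and v1.22.1+9312243 as containing it.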
Verified to be fixed.

    $ ./complete.sh
    pod/complete1 created
    "Pending" "Pending" "Succeeded"
    pod/complete2 created
    "Pending" "Pending" "Succeeded"
    pod/complete3 created
    "Pending" "Pending" "Succeeded"
    pod/complete4 created
    "Pending" "Pending" "Pending" "Succeeded"
    pod/complete5 created
    "Pending" "Pending" "Succeeded"
    pod/complete6 created
    "Pending" "Pending" "Succeeded"
    pod/complete7 created
    "Pending" "Pending" "Succeeded"
    pod/complete8 created
    "Pending" "Pending" "Succeeded"
    pod/running1 created
    pod/running2 created
    pod/running3 created
    pod/running4 created
    pod/running5 created
    pod/running6 created

    $ oc get po
    NAME        READY   STATUS      RESTARTS   AGE
    complete1   0/1     Completed   0          6m29s
    complete2   0/1     Completed   0          6m23s
    complete3   0/1     Completed   0          6m17s
    complete4   0/1     Completed   0          6m11s
    complete5   0/1     Completed   0          6m4s
    complete6   0/1     Completed   0          5m58s
    complete7   0/1     Completed   0          5m52s
    complete8   0/1     Completed   0          5m46s
    running1    1/1     Running     0          5m40s
    running2    1/1     Running     0          5m39s
    running3    0/1     Pending     0          5m38s
    running4    0/1     Pending     0          5m37s
    running5    0/1     Pending     0          5m36s
    running6    0/1     Pending     0          5m35s

    $ oc get no
    NAME                                 STATUS   ROLES           AGE   VERSION
    ci-ln-xxnd56k-f76d1-dx979-master-0   Ready    master,worker   26m   v1.22.1+9312243

    [weinliu@rhel8 verification-tests]$ oc debug node/ci-ln-xxnd56k-f76d1-dx979-master-0
    Starting pod/ci-ln-xxnd56k-f76d1-dx979-master-0-debug ...
    To use host binaries, run `chroot /host`
    chroot /host
    Pod IP: 10.0.0.3
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-4.4# reboot

    Removing debug pod ...
    error: unable to delete the debug pod "ci-ln-xxnd56k-f76d1-dx979-master-0-debug": Delete https://api.ci-ln-xxnd56k-f76d1.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/default/pods/ci-ln-xxnd56k-f76d1-dx979-master-0-debug: unexpected EOF

    [weinliu@rhel8 verification-tests]$ oc get no
    NAME                                 STATUS   ROLES           AGE   VERSION
    ci-ln-xxnd56k-f76d1-dx979-master-0   Ready    master,worker   29m   v1.22.1+9312243

    $ oc get po
    NAME                                       READY   STATUS              RESTARTS   AGE
    ci-ln-xxnd56k-f76d1-dx979-master-0-debug   0/1     Completed           1          4m17s
    complete1                                  0/1     Completed           0          11m
    complete2                                  0/1     Completed           0          11m
    complete3                                  0/1     Completed           0          10m
    complete4                                  0/1     Completed           0          10m
    complete5                                  0/1     Completed           0          10m
    complete6                                  0/1     Completed           0          10m
    complete7                                  0/1     Completed           0          10m
    complete8                                  0/1     Completed           0          10m
    running1                                   0/1     ContainerCreating   1          10m
    running2                                   0/1     ContainerCreating   1          10m
    running3                                   0/1     Pending             0          10m
    running4                                   0/1     Pending             0          10m
    running5                                   0/1     Pending             0          10m
    running6                                   0/1     Pending             0          10m

    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.10.0-0.nightly-2021-10-16-173656   True        False         14m     Cluster version is 4.10.0-0.nightly-2021-10-16-173656

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
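For completeness: the complete.sh reproducer run during verification is not attached to this bug, so the following is only a hypothetical sketch of a script along the same lines. The image, CPU request size, sleep durations, and polling interval are assumptions and would need tuning so that all fourteen pods fit on the node only if the completed pods' requests are released; on a multi-node cluster the pods would also need to be pinned to a single node (for example via spec.nodeName), which is omitted here.

```bash
#!/usr/bin/env bash
# Hypothetical reproducer in the spirit of complete.sh (not the original script).
# It creates eight pods that finish quickly, then six pods that run indefinitely.
# Without the fix, the kubelet keeps charging the completed pods' CPU requests
# against the node and rejects the later pods with OutOfcpu.
set -euo pipefail

make_pod() {
  local name=$1 cmd=$2
  cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: registry.access.redhat.com/ubi8/ubi-minimal   # assumed image
    command: ["/bin/sh", "-c", "${cmd}"]
    resources:
      requests:
        cpu: 500m   # assumed request size
EOF
}

# Pods that terminate: their CPU requests should be released once they complete.
for i in $(seq 1 8); do
  make_pod "complete${i}" "sleep 5"
  # Wait until the pod reaches the Succeeded phase (the original script appears
  # to print the phase while polling, hence the "Pending ... Succeeded" output).
  until [ "$(oc get pod "complete${i}" -o jsonpath='{.status.phase}')" = "Succeeded" ]; do
    sleep 2
  done
done

# Long-running pods: with the fix these reach Running (or stay Pending if the
# node is genuinely full) instead of being rejected with OutOfcpu.
for i in $(seq 1 6); do
  make_pod "running${i}" "sleep infinity"
done
```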