Bug 1998193
Summary: test pod in OutOfCpu error after SNO reboot
Product: OpenShift Container Platform | Reporter: yliu1
Component: Node | Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet | QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE | Severity: urgent
Priority: high | Keywords: Reopened
Version: 4.9 | CC: alukiano, aos-bugs, ehashman, imiller, rphillips
Hardware: x86_64 | OS: Linux
Last Closed: 2021-10-01 18:28:29 UTC | Type: Bug
Description
yliu1
2021-08-26 14:52:30 UTC
There were some CPU manager critical fixes (by Artyom) included as part of the k8s 1.22.1 rebase: https://github.com/openshift/kubernetes/pull/916. I am not familiar with them yet, but it is possible they address this issue too. Yang, were you able to reproduce this on a smaller (CPU-wise) scale as well? It could also be an additional side effect of bug https://bugzilla.redhat.com/show_bug.cgi?id=1997657.

I was unable to reproduce this on a small scale (4 CPUs) with the oldest 4.9 nightly I could get my hands on, 4.9.0-0.nightly-2021-08-28-082738, which predates the openshift/kubernetes/pull/920 merge. Yang, could you try with the latest 4.9 nightly, and if you still see the problem, a minimal reproducer would help a lot.

It is only reproducible when maxing out the isolated CPUs. My cluster has 4 reserved CPUs for management and 76 isolated CPUs for workloads. I had two big guaranteed pods: the one with 55 CPUs recovered properly, and the other one with 20 CPUs failed with an OutOfcpu error. I will also try it on 4.9.0-fc.1, which was built from the 08-30 nightly.

The issue is reproducible with the 4.9.0-fc.1 build. The key is to max out the node capacity before the reboot. I used pods with fewer CPUs this time (4-5 CPUs for each stress-ng pod). Most were recovered except the last one.

[yliu1@yliu1 ~]$ oc get pods -n ran-test
NAME              READY   STATUS     RESTARTS   AGE
oslat-58x25       1/1     Running    1          9m56s
oslat-qvhg2       1/1     Running    1          9m56s
stress-ng-9hjrf   1/1     Running    1          9m47s
stress-ng-jhrc4   1/1     Running    1          9m47s
stress-ng-nlk7v   1/1     Running    1          9m47s
stress-ng-r276d   0/1     OutOfcpu   0          9m46s
stress-ng-tnp5q   1/1     Running    1          9m47s
stress-ng-tszdz   1/1     Running    1          9m47s

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 5000, used: 73380, capacity: 76000'
  phase: Failed
  reason: OutOfcpu

As mentioned in the description, the "used" CPU value in the error message above seems incorrect, assuming it is supposed to show usage of the unreserved CPUs only (a sketch of one way to check what is being counted follows this set of comments).

Another data point: I tried ~25 burstable pods with 1 CPU requested each, with the total requested CPUs roughly maxing out the node capacity. I got the following error on 3 of the burstable stress-ng pods, which shows more "used" than "capacity":

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 1000, used: 77230, capacity: 76000'
  phase: Failed
  reason: OutOfcpu

Yang and I are unable to reproduce this on a smaller-scale 8-CPU SNO cluster without workload partitioning and PAO. Next step: debugging on a larger cluster with workload partitioning and PAO.

I have reproduced the described behavior on Yang's cluster with two idle/lightweight busybox containers (no need to torture the cluster with stress-ng). Both were "qosClass: Guaranteed", the first one with 55 CPUs, the other one with 20 CPUs. The cluster has 80 CPUs in total, 4 of which were reservedSystemCPUs, and workload partitioning[1] was used. Next step: investigate whether this is related to workload partitioning.

[1] https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md

The issue is reproducible in 4.9.0-fc.1, but not in 4.9.0-rc.0 and 4.9.0-rc.1. Further investigation showed that the culprit really was BZ1997657, as suggested by Artyom (thank you). The issue is reproducible without https://github.com/openshift/kubernetes/pull/920; however, I have failed to reproduce it with PR 920 included. Closing as a duplicate of BZ1997657. If you can still reproduce this bug with 4.9.0-rc.0 and above, please re-open.

*** This bug has been marked as a duplicate of bug 1997657 ***

Same test passed with the 4.9.0-rc.0 build. :)
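The "used" values in the OutOfcpu messages above suggest that CPU requests from pods that are no longer running are being counted during admission. As a rough way to check this, the commands below dump every pod scheduled to the node together with its phase and per-container CPU requests; the node-name placeholder and the jq filter are illustrative and not part of the original report:

# Sketch: list every pod on the SNO node with its phase and CPU requests,
# to compare against the "used" value in the OutOfcpu message.
# Replace <node-name> with the actual node (e.g. "sno"); the jq expression
# may need adjusting for pods without explicit CPU requests.
oc get pods -A --field-selector spec.nodeName=<node-name> -o json \
  | jq -r '.items[] | [.metadata.namespace, .metadata.name, .status.phase,
      ([.spec.containers[].resources.requests.cpu // "0"] | join(","))] | @tsv'

# Pods left in the Failed phase after the reboot still carry CPU requests;
# if admission counts them, "used" can exceed the unreserved capacity.
oc get pods -A --field-selector spec.nodeName=<node-name>,status.phase=Failed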
This issue now reappears on 4.9.0-rc.3.

[yliu1@yliu1 ~]$ oc version
Client Version: 4.9.0-rc.0
Server Version: 4.9.0-rc.3
Kubernetes Version: v1.22.0-rc.0+af080cb

[yliu1@yliu1 ~]$ oc get pods -n ran-test | grep -v Running
NAME              READY   STATUS     RESTARTS   AGE
stress-ng-5vttj   0/1     OutOfcpu   0          23m
stress-ng-lqkl7   0/1     OutOfcpu   0          23m
stress-ng-z64wn   0/1     OutOfcpu   0          23m

oc describe pod -n ran-test stress-ng-z64wn
  Normal   AddedInterface  15m    multus   Add eth0 [10.128.0.201/23] from ovn-kubernetes
  Normal   Pulled          14m    kubelet  Container image "quay.io/imiller/stress-ng:2.0" already present on machine
  Normal   Created         14m    kubelet  Created container stress-ng
  Normal   Started         14m    kubelet  Started container stress-ng
  Warning  OutOfcpu        7m31s  kubelet  Node didn't have enough resource: cpu, requested: 1000, used: 91380, capacity: 92000

I can confirm this can still be reproduced with 4.9.0-rc.3. After the reboot, one of three things usually happens:
- The pod is Running (no issues).
- The pod is stuck in OutOfcpu.
- The pod is stuck in ContainerCreating.

This is what I noticed in the kubelet logs when the pod was stuck in ContainerCreating after the reboot:

Sep 29 12:41:20 sno kubelet[2954]: E0929 12:41:20.753390 2954 kubelet_pods.go:1427] "Pod attempted illegal phase transition" pod="default/guaranteed2" originalStatusPhase=Failed apiStatusPhase=Pending apiStatus="&PodStatus{Phase:Pending,Conditions:[]PodCondition{},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:guaranteed2,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:quay.io/jmencak/busybox,ImageID:,ContainerID:,Started:nil,},},QOSClass:Guaranteed,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"

This is likely not CPU Manager's fault, from what I can understand at the moment. /cc Ryan for extra eyeballs.

This is not a CPU manager bug. Changing the component to Kubelet. See https://bugzilla.redhat.com/show_bug.cgi?id=1997657#c34. This was already fixed by https://github.com/openshift/kubernetes/pull/920 (e.g. in 4.9.0-rc.0 and 4.9.0-rc.1), but broken again later on. The problem is that "Completed" pods are once again filling up the node's capacity. Closing again as a duplicate of 1997657 and moving to the Kubelet sub-component. Ryan, happy to share a simple reproducer (a sketch of the pattern follows at the end of this report).

*** This bug has been marked as a duplicate of bug 1997657 ***

*** This bug has been marked as a duplicate of bug 1997478 ***

Discussed with Ryan, and he pointed out that the PR for 1997478 should also fix this issue.
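Since a simple reproducer is mentioned above, here is a minimal sketch of the pattern described in the comments, assuming a SNO node with the static CPU manager and an isolated CPU pool to fill. It is not the exact reproducer offered above: the pod count, CPU sizes and pod names are illustrative and should be sized to the node's isolated CPUs; the namespace and busybox image are taken from the report.

# Sketch: fill the isolated CPUs with guaranteed-QoS pods (integer CPU
# request == limit), reboot the SNO node, then look for OutOfcpu pods.
for i in $(seq 1 15); do
  oc apply -n ran-test -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-$i
spec:
  containers:
  - name: busy
    image: quay.io/jmencak/busybox
    command: ["sleep", "999999999"]
    resources:
      requests:
        cpu: "5"
        memory: 256Mi
      limits:
        cpu: "5"
        memory: 256Mi
EOF
done

# Reboot the node, then after it comes back:
oc get pods -n ran-test -o wide | grep -v Running

With requests equal to limits and whole-number CPU counts, every pod lands in the Guaranteed QoS class and gets exclusive cores from the static CPU manager, which matches the setup where the failure was observed.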