Bug 1998193
Summary: test pod in OutOfCpu error after SNO reboot
Product: OpenShift Container Platform | Reporter: yliu1
Component: Node | Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet | QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE | Severity: urgent
Priority: high | Keywords: Reopened
Version: 4.9 | CC: alukiano, aos-bugs, ehashman, imiller, rphillips
Hardware: x86_64 | OS: Linux
Last Closed: 2021-10-01 18:28:29 UTC | Type: Bug
Description
yliu1
2021-08-26 14:52:30 UTC
There were some CPU manager critical fixes (by Artyom) included as part of the k8s 1.22.1 rebase: https://github.com/openshift/kubernetes/pull/916. I am not familiar with them yet, but it is possible they address this issue too. Yang, were you able to reproduce this on a smaller (CPU-wise) scale as well? It could also be an additional side effect of bug https://bugzilla.redhat.com/show_bug.cgi?id=1997657.

I was unable to reproduce this on a small scale (4 CPUs) with the oldest 4.9 nightly I could get my hands on, 4.9.0-0.nightly-2021-08-28-082738, which predates the openshift/kubernetes/pull/920 merge. Yang, could you try with the latest 4.9 nightly, and if you still see the problem, a minimal reproducer would help a lot.

It is only reproducible when maxing out the isolated CPUs. My cluster has 4 reserved CPUs for management and 76 isolated CPUs for workloads. I had two big guaranteed pods: the one with 55 CPUs recovered properly, and the other one with 20 CPUs failed with an OutOfcpu error. I will also try it on 4.9.0-fc.1, which was built from the 08-30 nightly.

The issue is reproducible with the 4.9.0-fc.1 build. The key is to max out the node capacity before the reboot. I used pods with fewer CPUs this time (4-5 CPUs for each stress-ng pod). Most were recovered except the last one.

[yliu1@yliu1 ~]$ oc get pods -n ran-test
NAME              READY   STATUS     RESTARTS   AGE
oslat-58x25       1/1     Running    1          9m56s
oslat-qvhg2       1/1     Running    1          9m56s
stress-ng-9hjrf   1/1     Running    1          9m47s
stress-ng-jhrc4   1/1     Running    1          9m47s
stress-ng-nlk7v   1/1     Running    1          9m47s
stress-ng-r276d   0/1     OutOfcpu   0          9m46s
stress-ng-tnp5q   1/1     Running    1          9m47s
stress-ng-tszdz   1/1     Running    1          9m47s

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 5000, used: 73380, capacity: 76000'
  phase: Failed
  reason: OutOfcpu

As mentioned in the description, the "used" CPU value in the error message above seems incorrect, assuming it is supposed to show usage of the unreserved CPUs only (a sketch of one way to check what is being counted follows this set of comments).

Another data point: I tried ~25 burstable pods with 1 CPU requested each, with the total requested CPUs roughly maxing out the node capacity. I got the following error on 3 of the burstable stress-ng pods, which shows more "used" than "capacity":

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 1000, used: 77230, capacity: 76000'
  phase: Failed
  reason: OutOfcpu

Yang and I are unable to reproduce this on a smaller-scale 8-CPU SNO cluster without workload partitioning and PAO. Next step: debugging on a larger cluster with workload partitioning and PAO.

I have reproduced the described behavior on Yang's cluster with two idle/lightweight busybox containers (no need to torture the cluster with stress-ng). Both were "qosClass: Guaranteed", the first one with 55 CPUs, the other one with 20 CPUs. The cluster has 80 CPUs in total, 4 of which were reservedSystemCPUs, and workload partitioning[1] was used. Next step: investigate whether this is related to workload partitioning.

[1] https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md

The issue is reproducible in 4.9.0-fc.1, but not in 4.9.0-rc.0 and 4.9.0-rc.1. Further investigation showed that the culprit really was BZ1997657, as suggested by Artyom (thank you). The issue is reproducible without https://github.com/openshift/kubernetes/pull/920; however, I have failed to reproduce it with PR 920 included. Closing as a duplicate of BZ1997657. If you can still reproduce this bug with 4.9.0-rc.0 and above, please re-open.

*** This bug has been marked as a duplicate of bug 1997657 ***

Same test passed with the 4.9.0-rc.0 build. :)
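The "used" values in the OutOfcpu messages above suggest that CPU requests from pods that are no longer running are being counted during admission. As a rough way to check this, the commands below dump every pod scheduled to the node together with its phase and per-container CPU requests; the node-name placeholder and the jq filter are illustrative and not part of the original report:

# Sketch: list every pod on the SNO node with its phase and CPU requests,
# to compare against the "used" value in the OutOfcpu message.
# Replace <node-name> with the actual node (e.g. "sno"); the jq expression
# may need adjusting for pods without explicit CPU requests.
oc get pods -A --field-selector spec.nodeName=<node-name> -o json \
  | jq -r '.items[] | [.metadata.namespace, .metadata.name, .status.phase,
      ([.spec.containers[].resources.requests.cpu // "0"] | join(","))] | @tsv'

# Pods left in the Failed phase after the reboot still carry CPU requests;
# if admission counts them, "used" can exceed the unreserved capacity.
oc get pods -A --field-selector spec.nodeName=<node-name>,status.phase=Failed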
This issue now reappears on 4.9.0-rc.3.

[yliu1@yliu1 ~]$ oc version
Client Version: 4.9.0-rc.0
Server Version: 4.9.0-rc.3
Kubernetes Version: v1.22.0-rc.0+af080cb

[yliu1@yliu1 ~]$ oc get pods -n ran-test | grep -v Running
NAME              READY   STATUS     RESTARTS   AGE
stress-ng-5vttj   0/1     OutOfcpu   0          23m
stress-ng-lqkl7   0/1     OutOfcpu   0          23m
stress-ng-z64wn   0/1     OutOfcpu   0          23m

oc describe pod -n ran-test stress-ng-z64wn
  Normal   AddedInterface  15m    multus   Add eth0 [10.128.0.201/23] from ovn-kubernetes
  Normal   Pulled          14m    kubelet  Container image "quay.io/imiller/stress-ng:2.0" already present on machine
  Normal   Created         14m    kubelet  Created container stress-ng
  Normal   Started         14m    kubelet  Started container stress-ng
  Warning  OutOfcpu        7m31s  kubelet  Node didn't have enough resource: cpu, requested: 1000, used: 91380, capacity: 92000

I can confirm this can still be reproduced with 4.9.0-rc.3. After the reboot, one of three things usually happens:
- The pod is Running (no issues).
- The pod is stuck in OutOfcpu.
- The pod is stuck in ContainerCreating.

This is what I noticed in the kubelet logs when the pod was stuck in ContainerCreating after the reboot:

Sep 29 12:41:20 sno kubelet[2954]: E0929 12:41:20.753390 2954 kubelet_pods.go:1427] "Pod attempted illegal phase transition" pod="default/guaranteed2" originalStatusPhase=Failed apiStatusPhase=Pending apiStatus="&PodStatus{Phase:Pending,Conditions:[]PodCondition{},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:guaranteed2,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:quay.io/jmencak/busybox,ImageID:,ContainerID:,Started:nil,},},QOSClass:Guaranteed,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"

This is likely not CPU Manager's fault, from what I can understand at the moment. /cc Ryan for extra eyeballs.

This is not a CPU manager bug. Changing the component to Kubelet. See https://bugzilla.redhat.com/show_bug.cgi?id=1997657#c34. This was already fixed by https://github.com/openshift/kubernetes/pull/920 (e.g. in 4.9.0-rc.0 and 4.9.0-rc.1), but broken again later on. The problem is that "Completed" pods are once again filling up the node's capacity. Closing again as a duplicate of 1997657 and moving to the Kubelet sub-component. Ryan, happy to share a simple reproducer (a sketch of the pattern follows at the end of this report).

*** This bug has been marked as a duplicate of bug 1997657 ***

*** This bug has been marked as a duplicate of bug 1997478 ***

Discussed with Ryan, and he pointed out that the PR for 1997478 should also fix this issue.
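Since a simple reproducer is mentioned above, here is a minimal sketch of the pattern described in the comments, assuming a SNO node with the static CPU manager and an isolated CPU pool to fill. It is not the exact reproducer offered above: the pod count, CPU sizes and pod names are illustrative and should be sized to the node's isolated CPUs; the namespace and busybox image are taken from the report.

# Sketch: fill the isolated CPUs with guaranteed-QoS pods (integer CPU
# request == limit), reboot the SNO node, then look for OutOfcpu pods.
for i in $(seq 1 15); do
  oc apply -n ran-test -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-$i
spec:
  containers:
  - name: busy
    image: quay.io/jmencak/busybox
    command: ["sleep", "999999999"]
    resources:
      requests:
        cpu: "5"
        memory: 256Mi
      limits:
        cpu: "5"
        memory: 256Mi
EOF
done

# Reboot the node, then after it comes back:
oc get pods -n ran-test -o wide | grep -v Running

With requests equal to limits and whole-number CPU counts, every pod lands in the Guaranteed QoS class and gets exclusive cores from the static CPU manager, which matches the setup where the failure was observed.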