Bug 2022507
| Summary: | Pods stuck in OutOfpods state after running cluster-density | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mohit Sheth <msheth> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aojeagar, harpatil, jchaloup, llopezmo, mfojtik, pmuller, rphillips, rsevilla, rzaleski, smalleni, wlewis, xxia |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Fixed a regression that could incorrectly reject pods with OutOfCpu errors if they were rapidly scheduled after other pods were reported as complete in the API. The Kubelet now waits to report the phase of a pod as terminal in the API until all running containers are guaranteed to have stopped and no new containers can be started. Short-lived pods may take slightly longer (~1s) to report Succeeded or Failed after this change. | Story Points: | --- |
| Clone Of: | | | |
| | 2065774 (view as bug list) | Environment: | |
| Last Closed: | 2022-08-10 10:39:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2065774 | | |
Hello Mohit,

> Pods stuck in OutOfpods state

The number of pods, CPU, and memory are primary resources, and a node must have enough of each available to run a pod. Given that, this is a hard requirement the kube-scheduler checks before a pod gets scheduled to a node [1].

> Warning OutOfpods 71m kubelet Node didn't have enough resource: pods, requested: 1, used: 251, capacity: 250

The event we see here is reported by the kubelet [2]. The incident can be caused either by the kube-scheduler's cache not being updated quickly enough or by setting each pod's .spec.nodeName field explicitly (bypassing the kube-scheduler).

> When running cluster density at 25 node scale with 500 projects ...

I am not familiar with the cluster density test. Can you please elaborate on how the test is run? Are all the pods in the test created with an empty .spec.nodeName field? Is it possible some pods have the .spec.nodeName field set when they are created?

Is the issue reproducible all the time? Can you share the kube-scheduler logs/must-gather from the time of the incident?

[1] https://github.com/openshift/kubernetes/blob/release-4.9/pkg/scheduler/framework/plugins/noderesources/fit.go#L260-L268
[2] https://github.com/openshift/kubernetes/blob/release-4.9/pkg/kubelet/lifecycle/predicate.go#L141-L144

(In reply to Jan Chaloupka from comment #5)
> I am not familiar with the cluster density test. Can you please elaborate
> on how the test is run? Are all the pods in the test created with an empty
> .spec.nodeName field? Is it possible some pods have the .spec.nodeName
> field set when they are created?

Regarding the kube-burner cluster-density workload [1][2]: we do no static mapping of pods to nodes for this workload. Additionally, we have run this workload since early 4.x releases and have never hit this scheduling issue, so this is a regression.

[1] https://github.com/cloud-bulldozer/benchmark-operator/blob/master/docs/kube-burner.md#supported-workloads
[2] https://github.com/cloud-bulldozer/benchmark-operator/blob/master/roles/kube-burner/templates/cluster-density.yml.j2

I will need more time and information to figure out the root cause of the incident.

I added a static pod (while 250 pods were already running),
Dec 10 08:57:23.934906 ip-10-0-177-126 hyperkube[1368]: E1210 08:57:23.934865 1368 file.go:109] "Unable to process watch event" err="can't process config file \"/etc/kubernetes/manifests/static-web.yaml\": /etc/kubernetes/manifests/static-web.yaml: couldn't parse as pod(Object 'Kind' is missing in 'null'), please check config file"
Dec 10 08:57:23.943338 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.943310 1368 kubelet.go:2077] "SyncLoop ADD" source="file" pods=[default/static-web-ip-10-0-177-126.us-west-1.compute.internal]
Dec 10 08:57:23.943569 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.943551 1368 topology_manager.go:200] "Topology Admit Handler"
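For context, a static pod is created by placing a manifest under /etc/kubernetes/manifests on the node, and the kubelet treats static pods as critical, which matches the Preempting event below. The actual static-web.yaml is not included in this report, so the manifest below is only an illustrative sketch (name, image, and command are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: static-web        # illustrative; the real static-web.yaml contents are not shown in this bug
  namespace: default
spec:
  containers:
  - name: web
    image: quay.io/prometheus/busybox
    command: ["sleep", "3600"]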
That triggers the eviction of pod busybox-22,
Warning Preempting 72m kubelet Preempted in order to admit critical pod
Normal Killing 72m kubelet Stopping container busybox-22
Dec 10 08:57:23.944882 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.944848 1368 kuberuntime_container.go:720] "Killing container with a grace period" pod="default/busybox-22" podUID=77ae2076-4934-41cc-8300-74b2ad5b9ffc containerName="busybox-22" containerID="cri-o://53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6" gracePeriod=30
Dec 10 08:57:23.945179 ip-10-0-177-126 crio[1337]: time="2021-12-10 08:57:23.945133626Z" level=info msg="Stopping container: 53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6 (timeout: 30s)" id=a3715aa7-d7db-4371-ba76-93979fc19392 name=/runtime.v1alpha2.RuntimeService/StopContainer
Dec 10 08:57:54.589320 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.589288 1368 preemption.go:117] "Pod evicted successfully" pod="default/busybox-22"
Finally, the kubelet emits ContainerDied events for the pod busybox-22,
Dec 10 08:57:54.713832 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713790 1368 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="default/busybox-22" event=&{ID:77ae2076-4934-41cc-8300-74b2ad5b9ffc Type:ContainerDied Data:53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6}
Dec 10 08:57:54.713832 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713832 1368 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="default/busybox-22" event=&{ID:77ae2076-4934-41cc-8300-74b2ad5b9ffc Type:ContainerDied Data:65a5165e4b7e03c0d60f93e28d3c8fa411804494be0aea3a442cbd35e5995abf}
However, this pod was evicted to make room for the static pod. The scheduler, probably upon seeing this event, tries to schedule a new pod that was in the Pending state,
Dec 10 08:57:54.714026 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713865 1368 kubelet.go:2077] "SyncLoop ADD" source="api" pods=[default/busybox-245]
Dec 10 08:57:54.714076 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.714048 1368 topology_manager.go:200] "Topology Admit Handler"
and it fails because the single slot that was freed has already been taken by the static pod,
Dec 10 08:57:54.714924 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.714893 1368 predicate.go:144] "Predicate failed on Pod" pod="default/busybox-245" err="Node didn't have enough resource: pods, requested: 1, used: 250, capacity: 250"
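A couple of API-side checks can confirm this picture; the commands below are illustrative additions rather than part of the original analysis (the node name is taken from the journal hostname above):

# List the kubelet admission rejections reported as events.
$ oc get events -A --field-selector reason=OutOfpods

# Compare the node's pod capacity with the number of pods currently assigned to it.
$ oc get node ip-10-0-177-126.us-west-1.compute.internal -o jsonpath='{.status.capacity.pods}{"\n"}'
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-177-126.us-west-1.compute.internal --no-headers | wc -l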
The fix is being tracked here: https://github.com/kubernetes/kubernetes/pull/108366

Verified on 4.11.0-0.nightly-2022-03-20-160505
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-164.us-east-2.compute.internal Ready worker 5h34m v1.23.3+02aefbf
ip-10-0-132-3.us-east-2.compute.internal Ready master 5h39m v1.23.3+02aefbf
ip-10-0-173-182.us-east-2.compute.internal Ready master 5h40m v1.23.3+02aefbf
ip-10-0-189-25.us-east-2.compute.internal Ready worker 5h34m v1.23.3+02aefbf
ip-10-0-218-101.us-east-2.compute.internal Ready worker 5h31m v1.23.3+02aefbf
ip-10-0-221-24.us-east-2.compute.internal Ready master 5h39m v1.23.3+02aefbf
$ oc describe node ip-10-0-128-164.us-east-2.compute.internal
...
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 2
ephemeral-storage: 125293548Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8006184Ki
pods: 50
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 1500m
ephemeral-storage: 115470533646
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 6855208Ki
pods: 50
...
$ oc adm cordon ip-10-0-189-25.us-east-2.compute.internal
node/ip-10-0-189-25.us-east-2.compute.internal cordoned
$ oc adm cordon ip-10-0-218-101.us-east-2.compute.internal
node/ip-10-0-218-101.us-east-2.compute.internal cordoned
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-164.us-east-2.compute.internal Ready worker 5h35m v1.23.3+02aefbf
ip-10-0-132-3.us-east-2.compute.internal Ready master 5h40m v1.23.3+02aefbf
ip-10-0-173-182.us-east-2.compute.internal Ready master 5h41m v1.23.3+02aefbf
ip-10-0-189-25.us-east-2.compute.internal Ready,SchedulingDisabled worker 5h35m v1.23.3+02aefbf
ip-10-0-218-101.us-east-2.compute.internal Ready,SchedulingDisabled worker 5h32m v1.23.3+02aefbf
ip-10-0-221-24.us-east-2.compute.internal Ready master 5h40m v1.23.3+02aefbf
$ cat job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: test1
spec:
parallelism: 50
completions: 300
template:
spec:
containers:
- name: busybox
image: quay.io/prometheus/busybox
command: ["echo", "ok"]
restartPolicy: Never
$ oc create -f job.yaml
job.batch/test1 created
$ oc get job
NAME COMPLETIONS DURATION AGE
test1 300/300 3m18s 4m18s
$ oc describe job test1
Name: test1
Namespace: app
Selector: controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
Labels: controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
job-name=test1
Annotations: batch.kubernetes.io/job-tracking:
Parallelism: 50
Completions: 300
Completion Mode: NonIndexed
Start Time: Wed, 23 Mar 2022 16:44:15 +0530
Completed At: Wed, 23 Mar 2022 16:47:33 +0530
Duration: 3m18s
Pods Statuses: 0 Active / 300 Succeeded / 2 Failed
Pod Template:
Labels: controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
job-name=test1
Containers:
busybox:
Image: quay.io/prometheus/busybox
Port: <none>
Host Port: <none>
Command:
echo
ok
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-ddfvd
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-kp78z
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-mmm87
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-8vsgd
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-cv2tt
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-4csjw
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-dsws9
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-cwkm8
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-fffw5
Normal SuccessfulCreate 4m22s (x16 over 4m22s) job-controller (combined from similar events): Created pod: test1-pgvnp
$ oc debug node/ip-10-0-128-164.us-east-2.compute.internal
Starting pod/ip-10-0-128-164us-east-2computeinternal-debug ...
...
sh-4.4# journalctl -u kubelet | grep -i OutOfPod
sh-4.4#
sh-4.4# journalctl -u kubelet | grep -i "Node didn't have enough resource"
sh-4.4#
sh-4.4# journalctl -u kubelet | grep -i "Predicate failed on Pod"
sh-4.4#
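As an additional illustrative check (not part of the original verification), the same result can be confirmed from the API side; neither command below should list any OutOfpods events or pods for the test namespace:

$ oc get events -n app --field-selector reason=OutOfpods
$ oc get pods -n app -o jsonpath='{range .items[?(@.status.reason=="OutOfpods")]}{.metadata.name}{"\n"}{end}'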
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.
Description of problem:

When running cluster-density at 25 node scale with 500 projects, we see a lot of pods in the OutOfpods state:

>> Warning OutOfpods 71m kubelet Node didn't have enough resource: pods, requested: 1, used: 251, capacity: 250

Version-Release number of selected component (if applicable):
4.9 and 4.10

How reproducible:
Always

Steps to Reproduce:
1. Run the cluster-density test and observe the pods

Actual results:
Pods stuck in OutOfpods state

Expected results:
Test completes without any failures

Additional info:
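For reference, the cluster-density workload here is launched through benchmark-operator, which drives kube-burner (see the links earlier in this bug). The Benchmark CR below is only a rough sketch: the field values are assumptions based on the linked documentation, not the exact resource used in this test.

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: kube-burner-cluster-density   # illustrative values throughout; check the linked kube-burner docs
  namespace: benchmark-operator
spec:
  workload:
    name: kube-burner
    args:
      workload: cluster-density
      job_iterations: 500
      qps: 20
      burst: 20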