Description of problem:

When running the cluster-density test at 25 node scale with 500 projects, we see a lot of pods in the OutOfpods state:

  Warning  OutOfpods  71m  kubelet  Node didn't have enough resource: pods, requested: 1, used: 251, capacity: 250

Version-Release number of selected component (if applicable):
4.9 and 4.10

How reproducible:
Always

Steps to Reproduce:
1. Run the cluster-density test and observe the pods

Actual results:
Pods stuck in the OutOfpods state

Expected results:
Test completes without any failures

Additional info:
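As a side note (not part of the original report), one quick way to spot the affected pods is to list Failed pods and filter on the OutOfpods status, for example:

$ oc get pods -A --field-selector=status.phase=Failed --no-headers | grep OutOfpods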
Hello Mohit,

> Pods stuck in OutOfpods state

Number of pods, CPU and memory are primary resources that each node has to have enough of to be able to run a pod. Given that, it is a hard requirement checked by the kube-scheduler before a pod gets scheduled to a node [1].

> Warning OutOfpods 71m kubelet Node didn't have enough resource: pods, requested: 1, used: 251, capacity: 250

The event we see here is reported by the kubelet [2]. The incident can be caused either by the kube-scheduler's cache not getting updated quickly enough, or by setting each pod's .spec.nodeName field explicitly (short-cutting the kube-scheduler).

> When running cluster density at 25 node scale with 500 projects ...

I am not familiar with the cluster density test. Can you please elaborate on how the test is run? Are all the pods in the test created with an empty .spec.nodeName field? Is it possible some pods have the .spec.nodeName field set when they are created?

Is the issue reproducible all the time? Can you share the kube-scheduler logs/must-gather from the time of the incident?

[1] https://github.com/openshift/kubernetes/blob/release-4.9/pkg/scheduler/framework/plugins/noderesources/fit.go#L260-L268
[2] https://github.com/openshift/kubernetes/blob/release-4.9/pkg/kubelet/lifecycle/predicate.go#L141-L144
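To illustrate the second possibility (this example is mine, not from the original comment), a pod created with .spec.nodeName already set bypasses the kube-scheduler completely, so only the kubelet admission check in [2] can reject it; the node name below is just a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: nodename-preassigned
  namespace: default
spec:
  nodeName: ip-10-0-177-126.us-west-1.compute.internal  # placeholder; pre-assigning a node skips scheduling
  containers:
  - name: busybox
    image: quay.io/prometheus/busybox
    command: ["sleep", "3600"]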
(In reply to Jan Chaloupka from comment #5)

> I am not familiar with the cluster density test. Can you please elaborate
> on how the test is run? Are all the pods in the test created with an empty
> .spec.nodeName field? Is it possible some pods have the .spec.nodeName
> field set when they are created?

This is the kube-burner cluster-density workload [1][2]. We do no static mapping of pods to nodes for this workload, so pods are created without .spec.nodeName set and go through the kube-scheduler. Also worth noting: we have run this workload since early 4.x and have never hit this scheduling issue before, so this is a regression.

[1] https://github.com/cloud-bulldozer/benchmark-operator/blob/master/docs/kube-burner.md#supported-workloads
[2] https://github.com/cloud-bulldozer/benchmark-operator/blob/master/roles/kube-burner/templates/cluster-density.yml.j2
Will need more time and information to figure out the root cause of the incident.
I added a static pod (when we already had 250 pods running on the node):

Dec 10 08:57:23.934906 ip-10-0-177-126 hyperkube[1368]: E1210 08:57:23.934865 1368 file.go:109] "Unable to process watch event" err="can't process config file \"/etc/kubernetes/manifests/static-web.yaml\": /etc/kubernetes/manifests/static-web.yaml: couldn't parse as pod(Object 'Kind' is missing in 'null'), please check config file"
Dec 10 08:57:23.943338 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.943310 1368 kubelet.go:2077] "SyncLoop ADD" source="file" pods=[default/static-web-ip-10-0-177-126.us-west-1.compute.internal]
Dec 10 08:57:23.943569 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.943551 1368 topology_manager.go:200] "Topology Admit Handler"

That triggers the eviction of pod busybox-22:

  Warning  Preempting  72m  kubelet  Preempted in order to admit critical pod
  Normal   Killing     72m  kubelet  Stopping container busybox-22

Dec 10 08:57:23.944882 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.944848 1368 kuberuntime_container.go:720] "Killing container with a grace period" pod="default/busybox-22" podUID=77ae2076-4934-41cc-8300-74b2ad5b9ffc containerName="busybox-22" containerID="cri-o://53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6" gracePeriod=30
Dec 10 08:57:23.945179 ip-10-0-177-126 crio[1337]: time="2021-12-10 08:57:23.945133626Z" level=info msg="Stopping container: 53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6 (timeout: 30s)" id=a3715aa7-d7db-4371-ba76-93979fc19392 name=/runtime.v1alpha2.RuntimeService/StopContainer
Dec 10 08:57:54.589320 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.589288 1368 preemption.go:117] "Pod evicted successfully" pod="default/busybox-22"

Finally, the kubelet emits a ContainerDied event for pod busybox-22:

Dec 10 08:57:54.713832 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713790 1368 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="default/busybox-22" event=&{ID:77ae2076-4934-41cc-8300-74b2ad5b9ffc Type:ContainerDied Data:53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6}
Dec 10 08:57:54.713832 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713832 1368 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="default/busybox-22" event=&{ID:77ae2076-4934-41cc-8300-74b2ad5b9ffc Type:ContainerDied Data:65a5165e4b7e03c0d60f93e28d3c8fa411804494be0aea3a442cbd35e5995abf}

This pod was evicted to make room for the static pod, but the scheduler, presumably on seeing this event, tries to schedule a new pod that was in the Pending state:

Dec 10 08:57:54.714026 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713865 1368 kubelet.go:2077] "SyncLoop ADD" source="api" pods=[default/busybox-245]
Dec 10 08:57:54.714076 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.714048 1368 topology_manager.go:200] "Topology Admit Handler"

and it fails because the single slot that was freed is already taken by the static pod:

Dec 10 08:57:54.714924 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.714893 1368 predicate.go:144] "Predicate failed on Pod" pod="default/busybox-245" err="Node didn't have enough resource: pods, requested: 1, used: 250, capacity: 250"
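For anyone who wants to reproduce this by hand, a minimal static pod manifest along these lines (a sketch; name and image are placeholders) dropped into /etc/kubernetes/manifests on a node that is already at its pod capacity should trigger the same critical-pod preemption path, since the kubelet treats static pods as critical:

apiVersion: v1
kind: Pod
metadata:
  name: static-web
  namespace: default
spec:
  containers:
  - name: web
    image: quay.io/prometheus/busybox
    command: ["sleep", "3600"]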
Fix is being tracked here: https://github.com/kubernetes/kubernetes/pull/108366
Verified on 4.11.0-0.nightly-2022-03-20-160505

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-164.us-east-2.compute.internal   Ready    worker   5h34m   v1.23.3+02aefbf
ip-10-0-132-3.us-east-2.compute.internal     Ready    master   5h39m   v1.23.3+02aefbf
ip-10-0-173-182.us-east-2.compute.internal   Ready    master   5h40m   v1.23.3+02aefbf
ip-10-0-189-25.us-east-2.compute.internal    Ready    worker   5h34m   v1.23.3+02aefbf
ip-10-0-218-101.us-east-2.compute.internal   Ready    worker   5h31m   v1.23.3+02aefbf
ip-10-0-221-24.us-east-2.compute.internal    Ready    master   5h39m   v1.23.3+02aefbf

$ oc describe node ip-10-0-128-164.us-east-2.compute.internal
...
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         2
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      8006184Ki
  pods:                        50
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         1500m
  ephemeral-storage:           115470533646
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      6855208Ki
  pods:                        50
...

$ oc adm cordon ip-10-0-189-25.us-east-2.compute.internal
node/ip-10-0-189-25.us-east-2.compute.internal cordoned

$ oc adm cordon ip-10-0-218-101.us-east-2.compute.internal
node/ip-10-0-218-101.us-east-2.compute.internal cordoned

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-128-164.us-east-2.compute.internal   Ready                      worker   5h35m   v1.23.3+02aefbf
ip-10-0-132-3.us-east-2.compute.internal     Ready                      master   5h40m   v1.23.3+02aefbf
ip-10-0-173-182.us-east-2.compute.internal   Ready                      master   5h41m   v1.23.3+02aefbf
ip-10-0-189-25.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   5h35m   v1.23.3+02aefbf
ip-10-0-218-101.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   5h32m   v1.23.3+02aefbf
ip-10-0-221-24.us-east-2.compute.internal    Ready                      master   5h40m   v1.23.3+02aefbf

$ cat job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test1
spec:
  parallelism: 50
  completions: 300
  template:
    spec:
      containers:
      - name: busybox
        image: quay.io/prometheus/busybox
        command: ["echo", "ok"]
      restartPolicy: Never

$ oc create -f job.yaml
job.batch/test1 created

$ oc get job
NAME    COMPLETIONS   DURATION   AGE
test1   300/300       3m18s      4m18s

$ oc describe job test1
Name:             test1
Namespace:        app
Selector:         controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
Labels:           controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
                  job-name=test1
Annotations:      batch.kubernetes.io/job-tracking:
Parallelism:      50
Completions:      300
Completion Mode:  NonIndexed
Start Time:       Wed, 23 Mar 2022 16:44:15 +0530
Completed At:     Wed, 23 Mar 2022 16:47:33 +0530
Duration:         3m18s
Pods Statuses:    0 Active / 300 Succeeded / 2 Failed
Pod Template:
  Labels:  controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
           job-name=test1
  Containers:
   busybox:
    Image:        quay.io/prometheus/busybox
    Port:         <none>
    Host Port:    <none>
    Command:      echo ok
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age                     From            Message
  ----    ------            ----                    ----            -------
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-ddfvd
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-kp78z
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-mmm87
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-8vsgd
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-cv2tt
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-4csjw
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-dsws9
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-cwkm8
  Normal  SuccessfulCreate  4m22s                   job-controller  Created pod: test1-fffw5
  Normal  SuccessfulCreate  4m22s (x16 over 4m22s)  job-controller  (combined from similar events): Created pod: test1-pgvnp

$ oc debug node/ip-10-0-128-164.us-east-2.compute.internal
Starting pod/ip-10-0-128-164us-east-2computeinternal-debug ...
...
sh-4.4# journalctl -u kubelet | grep -i OutOfPod
sh-4.4#
sh-4.4# journalctl -u kubelet | grep -i "Node didn't have enough resource"
sh-4.4#
sh-4.4# journalctl -u kubelet | grep -i "Predicate failed on Pod"
sh-4.4#
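Besides grepping the kubelet journal on a single node, a cluster-wide check (an extra step, not part of the original verification) could query the API for the rejection events directly; on a fixed cluster this should come back empty:

$ oc get events -A --field-selector reason=OutOfpods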
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days