Bug 2022507
| Summary: | Pods stuck in OutOfpods state after running cluster-density | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mohit Sheth <msheth> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aojeagar, harpatil, jchaloup, llopezmo, mfojtik, pmuller, rphillips, rsevilla, rzaleski, smalleni, wlewis, xxia |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Fixed a regression that could incorrectly reject pods with OutOfCpu errors if they were rapidly scheduled after other pods were reported as complete in the API. The Kubelet now waits to report the phase of a pod as terminal in the API until all running containers are guaranteed to have stopped and no new containers can be started. Short-lived pods may take slightly longer (~1s) to report Succeeded or Failed after this change. | Story Points: | --- |
| Clone Of: | | | |
| | 2065774 (view as bug list) | Environment: | |
| Last Closed: | 2022-08-10 10:39:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2065774 | | |
Hello Mohit,

> Pods stuck in OutOfpods state

The number of pods, CPU, and memory are primary resources, and a node must have enough of each available to run a pod. Given that, this is a hard requirement the kube-scheduler checks before a pod gets scheduled to a node [1].

> Warning OutOfpods 71m kubelet Node didn't have enough resource: pods, requested: 1, used: 251, capacity: 250

The event we see here is reported by the kubelet [2]. The incident can be caused either by the kube-scheduler's cache not being updated quickly enough or by setting each pod's .spec.nodeName field explicitly (bypassing the kube-scheduler).

> When running cluster density at 25 node scale with 500 projects ...

I am not familiar with the cluster density test. Can you please elaborate on how the test is run? Are all the pods in the test created with an empty .spec.nodeName field? Is it possible some pods have the .spec.nodeName field set when they are created?

Is the issue reproducible all the time? Can you share the kube-scheduler logs/must-gather from the time of the incident?

[1] https://github.com/openshift/kubernetes/blob/release-4.9/pkg/scheduler/framework/plugins/noderesources/fit.go#L260-L268
[2] https://github.com/openshift/kubernetes/blob/release-4.9/pkg/kubelet/lifecycle/predicate.go#L141-L144

(In reply to Jan Chaloupka from comment #5)
> I am not familiar with the cluster density test. Can you please elaborate
> on how the test is run? Are all the pods in the test created with an empty
> .spec.nodeName field? Is it possible some pods have the .spec.nodeName
> field set when they are created?

Regarding the kube-burner cluster-density workload [1][2]: we do no static mapping of pods to nodes for this workload. Additionally, we have run this workload since early 4.x releases and have never hit this scheduling issue, so this is a regression.

[1] https://github.com/cloud-bulldozer/benchmark-operator/blob/master/docs/kube-burner.md#supported-workloads
[2] https://github.com/cloud-bulldozer/benchmark-operator/blob/master/roles/kube-burner/templates/cluster-density.yml.j2

I will need more time and information to figure out the root cause of the incident.

I added a static pod (while 250 pods were already running),
Dec 10 08:57:23.934906 ip-10-0-177-126 hyperkube[1368]: E1210 08:57:23.934865 1368 file.go:109] "Unable to process watch event" err="can't process config file \"/etc/kubernetes/manifests/static-web.yaml\": /etc/kubernetes/manifests/static-web.yaml: couldn't parse as pod(Object 'Kind' is missing in 'null'), please check config file"
Dec 10 08:57:23.943338 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.943310 1368 kubelet.go:2077] "SyncLoop ADD" source="file" pods=[default/static-web-ip-10-0-177-126.us-west-1.compute.internal]
Dec 10 08:57:23.943569 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.943551 1368 topology_manager.go:200] "Topology Admit Handler"
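For context, a static pod is created by placing a manifest under /etc/kubernetes/manifests on the node, and the kubelet treats static pods as critical, which matches the Preempting event below. The actual static-web.yaml is not included in this report, so the manifest below is only an illustrative sketch (name, image, and command are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: static-web        # illustrative; the real static-web.yaml contents are not shown in this bug
  namespace: default
spec:
  containers:
  - name: web
    image: quay.io/prometheus/busybox
    command: ["sleep", "3600"]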
That triggers the eviction of pod busybox-22,
Warning Preempting 72m kubelet Preempted in order to admit critical pod
Normal Killing 72m kubelet Stopping container busybox-22
Dec 10 08:57:23.944882 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:23.944848 1368 kuberuntime_container.go:720] "Killing container with a grace period" pod="default/busybox-22" podUID=77ae2076-4934-41cc-8300-74b2ad5b9ffc containerName="busybox-22" containerID="cri-o://53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6" gracePeriod=30
Dec 10 08:57:23.945179 ip-10-0-177-126 crio[1337]: time="2021-12-10 08:57:23.945133626Z" level=info msg="Stopping container: 53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6 (timeout: 30s)" id=a3715aa7-d7db-4371-ba76-93979fc19392 name=/runtime.v1alpha2.RuntimeService/StopContainer
Dec 10 08:57:54.589320 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.589288 1368 preemption.go:117] "Pod evicted successfully" pod="default/busybox-22"
Finally, the kubelet emits ContainerDied events for the pod busybox-22,
Dec 10 08:57:54.713832 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713790 1368 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="default/busybox-22" event=&{ID:77ae2076-4934-41cc-8300-74b2ad5b9ffc Type:ContainerDied Data:53243766100e71d3609dd982138bbb83a77272c4fb3dccbb9eeaf623a24463c6}
Dec 10 08:57:54.713832 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713832 1368 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="default/busybox-22" event=&{ID:77ae2076-4934-41cc-8300-74b2ad5b9ffc Type:ContainerDied Data:65a5165e4b7e03c0d60f93e28d3c8fa411804494be0aea3a442cbd35e5995abf}
However, this pod was evicted to make room for the static pod. The scheduler, probably upon seeing this event, tries to schedule a new pod that was in the Pending state,
Dec 10 08:57:54.714026 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.713865 1368 kubelet.go:2077] "SyncLoop ADD" source="api" pods=[default/busybox-245]
Dec 10 08:57:54.714076 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.714048 1368 topology_manager.go:200] "Topology Admit Handler"
and it fails because the single slot that was freed has already been taken by the static pod,
Dec 10 08:57:54.714924 ip-10-0-177-126 hyperkube[1368]: I1210 08:57:54.714893 1368 predicate.go:144] "Predicate failed on Pod" pod="default/busybox-245" err="Node didn't have enough resource: pods, requested: 1, used: 250, capacity: 250"
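A couple of API-side checks can confirm this picture; the commands below are illustrative additions rather than part of the original analysis (the node name is taken from the journal hostname above):

# List the kubelet admission rejections reported as events.
$ oc get events -A --field-selector reason=OutOfpods

# Compare the node's pod capacity with the number of pods currently assigned to it.
$ oc get node ip-10-0-177-126.us-west-1.compute.internal -o jsonpath='{.status.capacity.pods}{"\n"}'
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-177-126.us-west-1.compute.internal --no-headers | wc -l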
The fix is being tracked here: https://github.com/kubernetes/kubernetes/pull/108366

Verified on 4.11.0-0.nightly-2022-03-20-160505
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-164.us-east-2.compute.internal Ready worker 5h34m v1.23.3+02aefbf
ip-10-0-132-3.us-east-2.compute.internal Ready master 5h39m v1.23.3+02aefbf
ip-10-0-173-182.us-east-2.compute.internal Ready master 5h40m v1.23.3+02aefbf
ip-10-0-189-25.us-east-2.compute.internal Ready worker 5h34m v1.23.3+02aefbf
ip-10-0-218-101.us-east-2.compute.internal Ready worker 5h31m v1.23.3+02aefbf
ip-10-0-221-24.us-east-2.compute.internal Ready master 5h39m v1.23.3+02aefbf
$ oc describe node ip-10-0-128-164.us-east-2.compute.internal
...
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 2
ephemeral-storage: 125293548Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8006184Ki
pods: 50
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 1500m
ephemeral-storage: 115470533646
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 6855208Ki
pods: 50
...
$ oc adm cordon ip-10-0-189-25.us-east-2.compute.internal
node/ip-10-0-189-25.us-east-2.compute.internal cordoned
$ oc adm cordon ip-10-0-218-101.us-east-2.compute.internal
node/ip-10-0-218-101.us-east-2.compute.internal cordoned
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-164.us-east-2.compute.internal Ready worker 5h35m v1.23.3+02aefbf
ip-10-0-132-3.us-east-2.compute.internal Ready master 5h40m v1.23.3+02aefbf
ip-10-0-173-182.us-east-2.compute.internal Ready master 5h41m v1.23.3+02aefbf
ip-10-0-189-25.us-east-2.compute.internal Ready,SchedulingDisabled worker 5h35m v1.23.3+02aefbf
ip-10-0-218-101.us-east-2.compute.internal Ready,SchedulingDisabled worker 5h32m v1.23.3+02aefbf
ip-10-0-221-24.us-east-2.compute.internal Ready master 5h40m v1.23.3+02aefbf
$ cat job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: test1
spec:
parallelism: 50
completions: 300
template:
spec:
containers:
- name: busybox
image: quay.io/prometheus/busybox
command: ["echo", "ok"]
restartPolicy: Never
$ oc create -f job.yaml
job.batch/test1 created
$ oc get job
NAME COMPLETIONS DURATION AGE
test1 300/300 3m18s 4m18s
$ oc describe job test1
Name: test1
Namespace: app
Selector: controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
Labels: controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
job-name=test1
Annotations: batch.kubernetes.io/job-tracking:
Parallelism: 50
Completions: 300
Completion Mode: NonIndexed
Start Time: Wed, 23 Mar 2022 16:44:15 +0530
Completed At: Wed, 23 Mar 2022 16:47:33 +0530
Duration: 3m18s
Pods Statuses: 0 Active / 300 Succeeded / 2 Failed
Pod Template:
Labels: controller-uid=239ac8d7-c8ab-4c9c-87ad-d6d3752d1416
job-name=test1
Containers:
busybox:
Image: quay.io/prometheus/busybox
Port: <none>
Host Port: <none>
Command:
echo
ok
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-ddfvd
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-kp78z
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-mmm87
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-8vsgd
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-cv2tt
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-4csjw
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-dsws9
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-cwkm8
Normal SuccessfulCreate 4m22s job-controller Created pod: test1-fffw5
Normal SuccessfulCreate 4m22s (x16 over 4m22s) job-controller (combined from similar events): Created pod: test1-pgvnp
$ oc debug node/ip-10-0-128-164.us-east-2.compute.internal
Starting pod/ip-10-0-128-164us-east-2computeinternal-debug ...
...
sh-4.4# journalctl -u kubelet | grep -i OutOfPod
sh-4.4#
sh-4.4# journalctl -u kubelet | grep -i "Node didn't have enough resource"
sh-4.4#
sh-4.4# journalctl -u kubelet | grep -i "Predicate failed on Pod"
sh-4.4#
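As an additional illustrative check (not part of the original verification), the same result can be confirmed from the API side; neither command below should list any OutOfpods events or pods for the test namespace:

$ oc get events -n app --field-selector reason=OutOfpods
$ oc get pods -n app -o jsonpath='{range .items[?(@.status.reason=="OutOfpods")]}{.metadata.name}{"\n"}{end}'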
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.
Description of problem:

When running cluster-density at 25 node scale with 500 projects, we see a lot of pods in the OutOfpods state:

>> Warning OutOfpods 71m kubelet Node didn't have enough resource: pods, requested: 1, used: 251, capacity: 250

Version-Release number of selected component (if applicable):
4.9 and 4.10

How reproducible:
Always

Steps to Reproduce:
1. Run the cluster-density test and observe the pods

Actual results:
Pods stuck in OutOfpods state

Expected results:
Test completes without any failures

Additional info:
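For reference, the cluster-density workload here is launched through benchmark-operator, which drives kube-burner (see the links earlier in this bug). The Benchmark CR below is only a rough sketch: the field values are assumptions based on the linked documentation, not the exact resource used in this test.

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: kube-burner-cluster-density   # illustrative values throughout; check the linked kube-burner docs
  namespace: benchmark-operator
spec:
  workload:
    name: kube-burner
    args:
      workload: cluster-density
      job_iterations: 500
      qps: 20
      burst: 20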