>1. Is this issue specific to ARO, or is there a similar test run on other cloud providers that runs successfully?

I have reproduced the same behavior on OCP 4.3.27 in Azure (not ARO). I was not able to reproduce it on AWS, although the AWS cluster I used had 3 rather than 2 worker nodes. Worker nodes in both cloud providers have 8GiB of RAM.

>2. The container is growing an array in a tight loop; is this a realistic scenario? I don't know how quickly the operating system on ARO can detect and kill a process for OOM. My concern is that, depending on capacity and other load on the node, this script may leave very little CPU for other activities.

On my OCP-on-Azure cluster I am able to see oom-kills kill the python process, but I also saw a number of hung task warnings from the kernel. I believe this is impacting ARO customers running more realistic workloads, but Jim can speak to that better than I can.
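For reference, the reproducer under discussion is just an unbounded allocation loop; the same script appears verbatim in the pod spec used for verification later in this bug:

    # Grows a list by 1MiB strings until the kernel OOM-kills the process.
    x = []
    while True:
        x.append("x" * 1048576)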
For both my test clusters:

    system-reserved:
      cpu: 500m
      memory: 1Gi
      ephemeral-storage: 1Gi

I don't see kube-reserved defined anywhere on either cluster but I'll keep looking.
One difference I'm seeing on the Azure cluster during the oom-kill events is several rcu_sched stall warnings, which might point to why the nodes are going NotReady (possibly due to i/o performance differences):

[ 7315.479017] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 7315.541774] Memory cgroup stats for
[ 7315.541655] rcu: 0-....: (144 ticks this GP) idle=7f2/1/0x4000000000000000 softirq=1385795/1385810 fqs=14822
[ 7315.541655] rcu: (detected by 1, t=60002 jiffies, g=2694213, q=38010)
[ 7315.541655] Sending NMI from CPU 1 to CPUs 0:
[ 7315.542014] NMI backtrace for cpu 0
[ 7315.542014] CPU: 0 PID: 2195 Comm: ovs-vswitchd Not tainted 4.18.0-147.13.2.el8_1.x86_64 #1
[ 7315.542014] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 7315.542014] RIP: 0010:cpuacct_account_field+0x27/0x50
[ 7315.542014] Code: 00 00 00 0f 1f 44 00 00 48 8b 87 08 0d 00 00 48 8b 48 10 48 81 f9 20 93 25 91 74 2a 48 63 f6 48 c1 e6 03 48 8b 81 38 01 00 00 <65> 48 03 05 61 97 f0 6f 48 01 14 30 48 8b 89 28 01 00 00 48 81 f9
[ 7315.542014] RSP: 0018:ffff961cf7a03ea8 EFLAGS: 00000083
[ 7315.542014] RAX: 0000326e88221278 RBX: ffff961cb699af80 RCX: ffff961cf614e600
[ 7315.542014] RDX: 000000000003b9bd RSI: 0000000000000010 RDI: ffff961cb699af80
[ 7315.542014] RBP: 000000000003b9bd R08: 0000000000000002 R09: 011d94d851f61f2c
[ 7315.542014] R10: 00000f47e7c5b1a8 R11: 0000000000000000 R12: 0000000000000002
[ 7315.542014] R13: ffff961cf7a1cf80 R14: ffffffff90146710 R15: ffff961cf7a1d0b8
[ 7315.542014] FS:  00007f197bbb7d00(0000) GS:ffff961cf7a00000(0000) knlGS:0000000000000000
[ 7315.542014] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7315.542014] CR2: 00000000004bad80 CR3: 0000000276e0c003 CR4: 00000000003606f0
[ 7315.542014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7315.542014] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7315.542014] Call Trace:
[ 7315.542014]  <IRQ>
[ 7315.542014]  account_system_index_time+0x63/0x90
[ 7315.542014]  update_process_times+0x1c/0x60
[ 7315.542014]  tick_sched_handle+0x22/0x60
[ 7315.542014]  tick_sched_timer+0x37/0x70
[ 7315.542014]  __hrtimer_run_queues+0x100/0x280
[ 7315.542014]  hrtimer_interrupt+0x100/0x220
[ 7315.542014]  ? sched_clock+0x5/0x10
[ 7315.542014]  hv_stimer0_isr+0x20/0x30 [hv_vmbus]
[ 7315.542014]  hv_stimer0_vector_handler+0x3b/0x70
[ 7315.542014]  hv_stimer0_callback_vector+0xf/0x20
[ 7315.542014]  </IRQ>
[ 7315.542014] RIP: 0010:vprintk_emit+0x3a4/0x450
[ 7315.542014] Code: 90 84 d2 74 6d 0f b6 15 da 38 92 01 48 c7 c0 20 a7 a3 91 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 31 0b 00 00 48 89 df 57 9d <0f> 1f 44 00 00 e8 f2 e3 ff ff e9 28 fe ff ff 80 3d aa e9 2e 01 00
This is also happening on 4.5.2 on Azure:

mgahagan-cr9wt-worker-northcentralus-c57nw   NotReady   worker   99m   v1.18.3+b74c5ed   10.0.32.4   <none>   Red Hat Enterprise Linux CoreOS 45.82.202007141718-0 (Ootpa)   4.18.0-193.13.2.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
I also tried a possible workaround of setting system-reserved.memory to 1250Gi on my 4.5.2 cluster; it didn't help.
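For anyone else experimenting with that workaround: on OCP 4.x the usual way to adjust system-reserved is a KubeletConfig custom resource targeting the worker machine config pool, rather than editing the node directly. A rough sketch follows; the resource name, the pool selector label, and the values are illustrative only, not a recommendation:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: set-system-reserved          # illustrative name
    spec:
      machineConfigPoolSelector:
        matchLabels:
          # assumes the default label on the worker MachineConfigPool
          pools.operator.machineconfiguration.openshift.io/worker: ""
      kubeletConfig:
        systemReserved:
          cpu: 500m
          memory: 1Gi                    # placeholder value
          ephemeral-storage: 1Gi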
Moving to RHCOS to take a closer look at the dmesg output.
OOM handling is an OpenShift-wide topic that impacts multiple teams, among them RHCOS and Node. At present you must configure pod limits: https://docs.openshift.com/container-platform/4.5/nodes/clusters/nodes-cluster-resource-configure.html Some clusters may want a mutating admission webhook to enforce this. As I understand things, if you're not applying limits, then none of the system-reserved bits come into effect. If this bug stays against RHCOS, we'll close it, since it's not actionable there.
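For illustration, per the linked documentation, giving the container a memory limit is what bounds its cgroup so the kernel kills it at the limit instead of driving the whole node into memory pressure. A minimal sketch of what that would look like for the reproducer here (pod name and limit values are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: badmem-limited
    spec:
      containers:
      - name: badmem
        image: registry.redhat.io/rhel7:latest
        args:
        - python
        - -c
        - |
          x = []
          while True:
              x.append("x" * 1048576)
        resources:
          requests:
            memory: 256Mi
          limits:
            memory: 512Mi   # container is OOM-killed at ~512Mi instead of exhausting the node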
Cross-referencing the bug for adding an alert when we go over the system memory reservation: https://bugzilla.redhat.com/show_bug.cgi?id=1881208
Summarizing, because someone was confused: no amount of manual tuning can prevent this problem from happening in all cases. Identifying and eliminating the underlying hang is the key outcome, now that we have an alert (the minimal and acceptable short-term workaround) to identify the condition.
*** Bug 1877059 has been marked as a duplicate of this bug. ***
*** Bug 1889734 has been marked as a duplicate of this bug. ***
*** Bug 1890684 has been marked as a duplicate of this bug. ***
*** Bug 1910086 has been marked as a duplicate of this bug. ***
*** Bug 1892909 has been marked as a duplicate of this bug. ***
Verified on 4.7.0-0.nightly-2021-01-04-215816. Tested this on 2 nodes. Cordoned the node, created the RC, and could see the pod being evicted due to System OOM and the node then trying to reclaim memory without going into NotReady state.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-04-215816   True        False         45m     Cluster version is 4.7.0-0.nightly-2021-01-04-215816

$ oc get nodes -o wide
NAME                                         STATUS                     ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-139-84.us-east-2.compute.internal    Ready                      master   68m   v1.20.0+87544c5   10.0.139.84    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-146-230.us-east-2.compute.internal   Ready                      worker   63m   v1.20.0+87544c5   10.0.146.230   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-164-104.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   62m   v1.20.0+87544c5   10.0.164.104   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-169-245.us-east-2.compute.internal   Ready                      master   68m   v1.20.0+87544c5   10.0.169.245   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-192-137.us-east-2.compute.internal   Ready                      master   68m   v1.20.0+87544c5   10.0.192.137   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-218-188.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   62m   v1.20.0+87544c5   10.0.218.188   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

$ oc create -f rc.yaml
replicationcontroller/badmem created

$ oc get rc
NAME     DESIRED   CURRENT   READY   AGE
badmem   1         1         0       3s

$ oc get pods
NAME           READY   STATUS              RESTARTS   AGE
badmem-tjhjd   0/1     ContainerCreating   0          6s

$ oc get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
badmem-tjhjd   1/1     Running   0          20s   10.131.0.42   ip-10-0-146-230.us-east-2.compute.internal   <none>           <none>

$ oc get pods -o wide
NAME           READY   STATUS             RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
badmem-fk9x5   0/1     CrashLoopBackOff   7          22m   10.131.0.51   ip-10-0-146-230.us-east-2.compute.internal   <none>           <none>
badmem-tjhjd   0/1     Evicted            0          23m   <none>        ip-10-0-146-230.us-east-2.compute.internal   <none>           <none>

$ oc describe pod badmem-fk9x5
Name:         badmem-fk9x5
Namespace:    app
Priority:     0
Node:         ip-10-0-146-230.us-east-2.compute.internal/10.0.146.230
Start Time:   Tue, 05 Jan 2021 13:09:05 +0530
Labels:       app=badmem
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{ "name": "", "interface": "eth0", "ips": [ "10.131.0.51" ], "default": true, "dns": {} }]
              k8s.v1.cni.cncf.io/networks-status:
                [{ "name": "", "interface": "eth0", "ips": [ "10.131.0.51" ], "default": true, "dns": {} }]
              openshift.io/scc: restricted
Status:       Running
IP:           10.131.0.51
IPs:
  IP:           10.131.0.51
Controlled By:  ReplicationController/badmem
Containers:
  badmem:
    Container ID:  cri-o://6474e9157e9ee59730590413eebbcf2316fae85d3de6237ebd5221f54e77bd33
    Image:         registry.redhat.io/rhel7:latest
    Image ID:      registry.redhat.io/rhel7@sha256:110e61d28c1bfa1aad79e0413b98a70679a070baafb70e122fda4d105651599e
    Port:          <none>
    Host Port:     <none>
    Args:
      python
      -c
      x = []
      while True:
        x.append("x" * 1048576)
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 05 Jan 2021 13:21:26 +0530
      Finished:     Tue, 05 Jan 2021 13:21:34 +0530
    Ready:          False
    Restart Count:  7
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gz2d7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-gz2d7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gz2d7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From                                                 Message
  ----     ------            ----                  ----                                                 -------
  Warning  FailedScheduling  <unknown>             0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>             0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         <unknown>             Successfully assigned app/badmem-fk9x5 to ip-10-0-146-230.us-east-2.compute.internal
  Normal   AddedInterface    17m                   multus                                               Add eth0 [10.131.0.51/23]
  Normal   Pulled            17m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 889.354606ms
  Normal   Pulled            17m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 935.541782ms
  Normal   Pulled            16m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 878.485939ms
  Normal   Created           16m (x4 over 17m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Created container badmem
  Normal   Started           16m (x4 over 17m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Started container badmem
  Normal   Pulled            16m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 819.24038ms
  Normal   Pulling           15m (x5 over 17m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Pulling image "registry.redhat.io/rhel7:latest"
  Normal   Pulled            15m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 1.221594963s
  Warning  BackOff           2m25s (x61 over 17m)  kubelet, ip-10-0-146-230.us-east-2.compute.internal  Back-off restarting failed container

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-139-84.us-east-2.compute.internal    Ready                      master   94m   v1.20.0+87544c5
ip-10-0-146-230.us-east-2.compute.internal   Ready                      worker   88m   v1.20.0+87544c5
ip-10-0-164-104.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   88m   v1.20.0+87544c5
ip-10-0-169-245.us-east-2.compute.internal   Ready                      master   94m   v1.20.0+87544c5
ip-10-0-192-137.us-east-2.compute.internal   Ready                      master   94m   v1.20.0+87544c5
ip-10-0-218-188.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   88m   v1.20.0+87544c5

$ oc describe node ip-10-0-146-230.us-east-2.compute.internal
Name:               ip-10-0-146-230.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-146-230
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.large
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0b76f16895b9950f4"}
                    machine.openshift.io/machine: openshift-machine-api/sunilc0501-q8g7n-worker-us-east-2a-s8sh9
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 05 Jan 2021 11:58:58 +0530
Taints:             node.kubernetes.io/memory-pressure:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-146-230.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 05 Jan 2021 13:28:00 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                        Message
  ----             ------  -----------------                 ------------------                ------                        -------
  MemoryPressure   True    Tue, 05 Jan 2021 13:26:56 +0530   Tue, 05 Jan 2021 13:26:56 +0530   KubeletHasInsufficientMemory  kubelet has insufficient memory available
  DiskPressure     False   Tue, 05 Jan 2021 13:26:56 +0530   Tue, 05 Jan 2021 11:58:58 +0530   KubeletHasNoDiskPressure      kubelet has no disk pressure
  PIDPressure      False   Tue, 05 Jan 2021 13:26:56 +0530   Tue, 05 Jan 2021 11:58:58 +0530   KubeletHasSufficientPID       kubelet has sufficient PID available
  Ready            True    Tue, 05 Jan 2021 13:26:56 +0530   Tue, 05 Jan 2021 11:59:48 +0530   KubeletReady                  kubelet is posting ready status
Addresses:
  InternalIP:   10.0.146.230
  Hostname:     ip-10-0-146-230.us-east-2.compute.internal
  InternalDNS:  ip-10-0-146-230.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7934684Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1500m
  ephemeral-storage:           114396791822
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      6783708Ki
  pods:                        250
System Info:
  Machine ID:                 ec29e18d242aa4cd9260b6285abe896e
  System UUID:                ec29e18d-242a-a4cd-9260-b6285abe896e
  Boot ID:                    793835d0-a758-4fba-9c1f-9a82685497f1
  Kernel Version:             4.18.0-240.10.1.el8_3.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
  Kubelet Version:            v1.20.0+87544c5
  Kube-Proxy Version:         v1.20.0+87544c5
ProviderID:                   aws:///us-east-2a/i-0b76f16895b9950f4
Non-terminated Pods:          (28 in total)
  Namespace                                 Name                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                 ----                                      ------------  ----------  ---------------  -------------  ---
  app                                       badmem-fk9x5                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         24m
  openshift-cluster-csi-drivers             aws-ebs-csi-driver-node-8cvqs             30m (2%)      0 (0%)      150Mi (2%)       0 (0%)         84m
  openshift-cluster-node-tuning-operator    tuned-vchjb                               10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         84m
  openshift-console                         downloads-6d7bb8f56d-zw8fl                10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         68m
  openshift-dns                             dns-default-ts8jw                         65m (4%)      0 (0%)      110Mi (1%)       0 (0%)         88m
  openshift-image-registry                  image-registry-59b74c4947-ld2ql           100m (6%)     0 (0%)      256Mi (3%)       0 (0%)         68m
  openshift-image-registry                  node-ca-7n5rh                             10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         84m
  openshift-ingress-canary                  ingress-canary-rl4rx                      10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         84m
  openshift-ingress                         router-default-7854b58d84-p64n9           100m (6%)     0 (0%)      256Mi (3%)       0 (0%)         68m
  openshift-kube-storage-version-migrator   migrator-777f85c94f-spws6                 100m (6%)     0 (0%)      200Mi (3%)       0 (0%)         68m
  openshift-machine-config-operator         machine-config-daemon-54jtc               40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         89m
  openshift-marketplace                     certified-operators-9l674                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         68m
  openshift-marketplace                     community-operators-77rzq                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         80s
  openshift-marketplace                     qe-app-registry-55vfw                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         68m
  openshift-marketplace                     redhat-marketplace-qm5g4                  10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         14m
  openshift-marketplace                     redhat-operators-7bb2s                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         68m
  openshift-monitoring                      alertmanager-main-0                       8m (0%)       0 (0%)      270Mi (4%)       0 (0%)         68m
  openshift-monitoring                      grafana-56f75b4dfd-8l9k9                  5m (0%)       0 (0%)      120Mi (1%)       0 (0%)         68m
  openshift-monitoring                      node-exporter-wqb52                       9m (0%)       0 (0%)      210Mi (3%)       0 (0%)         84m
  openshift-monitoring                      openshift-state-metrics-8dcd45497-6x7zq   3m (0%)       0 (0%)      190Mi (2%)       0 (0%)         68m
  openshift-monitoring                      prometheus-adapter-8649fb987f-k9jt4       1m (0%)       0 (0%)      25Mi (0%)        0 (0%)         68m
  openshift-monitoring                      prometheus-k8s-1                          76m (5%)      0 (0%)      1204Mi (18%)     0 (0%)         68m
  openshift-monitoring                      thanos-querier-89cbbf9b8-6s987            9m (0%)       0 (0%)      92Mi (1%)        0 (0%)         68m
  openshift-multus                          multus-dld6h                              10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         89m
  openshift-multus                          network-metrics-daemon-bjd6m              20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         88m
  openshift-network-diagnostics             network-check-target-hfqmw                10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         84m
  openshift-sdn                             ovs-4m9ng                                 100m (6%)     0 (0%)      400Mi (6%)       0 (0%)         88m
  openshift-sdn                             sdn-vrt5v                                 110m (7%)     0 (0%)      220Mi (3%)       0 (0%)         88m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         886m (59%)    0 (0%)
  memory                      4603Mi (69%)  0 (0%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type     Reason                     Age                   From                                                 Message
  ----     ------                     ----                  ----                                                 -------
  Normal   NodeHasNoDiskPressure      89m (x7 over 89m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID       89m (x7 over 89m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeReady                  88m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeReady
  Normal   NodeNotSchedulable         69m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeNotSchedulable
  Normal   NodeSchedulable            68m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeSchedulable
  Warning  SystemOOM                  25m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 153197
  Warning  SystemOOM                  25m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: opm, pid: 47565
  Warning  SystemOOM                  25m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: opm, pid: 47871
  Warning  SystemOOM                  25m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: opm, pid: 49284
  Warning  SystemOOM                  24m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 154084
  Normal   NodeHasInsufficientMemory  24m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasInsufficientMemory
  Warning  SystemOOM                  24m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: opm, pid: 47671
  Warning  SystemOOM                  24m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: opm, pid: 155463
  Warning  SystemOOM                  24m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: opm, pid: 154420
  Normal   NodeHasSufficientMemory    19m (x8 over 89m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Warning  SystemOOM                  18m                   kubelet, ip-10-0-146-230.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 171167
  Warning  SystemOOM                  6m34s (x15 over 18m)  kubelet, ip-10-0-146-230.us-east-2.compute.internal  (combined from similar events): System OOM encountered, victim process: python, pid: 200649
  Warning  EvictionThresholdMet       80s (x5 over 25m)     kubelet, ip-10-0-146-230.us-east-2.compute.internal  Attempting to reclaim memory

$ oc get pods
NAME           READY   STATUS             RESTARTS   AGE
badmem-fk9x5   0/1     CrashLoopBackOff   14         55m
badmem-tjhjd   0/1     Evicted            0          56m

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
badmem-r2rfx   0/1     OOMKilled   1          29s   10.129.2.33   ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-tjhjd   0/1     Evicted     0          57m   <none>        ip-10-0-146-230.us-east-2.compute.internal   <none>           <none>

$ oc get nodes -o wide
NAME                                         STATUS                     ROLES    AGE    VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-139-84.us-east-2.compute.internal    Ready                      master   126m   v1.20.0+87544c5   10.0.139.84    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-146-230.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   121m   v1.20.0+87544c5   10.0.146.230   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-164-104.us-east-2.compute.internal   Ready                      worker   120m   v1.20.0+87544c5   10.0.164.104   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-169-245.us-east-2.compute.internal   Ready                      master   126m   v1.20.0+87544c5   10.0.169.245   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-192-137.us-east-2.compute.internal   Ready                      master   126m   v1.20.0+87544c5   10.0.192.137   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-218-188.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   120m   v1.20.0+87544c5   10.0.218.188   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

$ oc get pods -o wide
NAME           READY   STATUS             RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
badmem-4c9jt   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-65pk6   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-8g5vk   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-b8zvb   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-cq7hh   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-d8mcg   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-gmkdr   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-hx9k7   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-j2lqx   0/1     CrashLoopBackOff   21         91m    10.129.2.37   ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-llqjc   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-r2rfx   0/1     Evicted            0          94m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-tjhjd   0/1     Evicted            0          151m   <none>        ip-10-0-146-230.us-east-2.compute.internal   <none>           <none>
badmem-wkbj4   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-wtmxh   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-x95vs   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>
badmem-z7hkj   0/1     Evicted            0          91m    <none>        ip-10-0-164-104.us-east-2.compute.internal   <none>           <none>

$ oc describe pod badmem-j2lqx
Name:         badmem-j2lqx
Namespace:    app
Priority:     0
Node:         ip-10-0-164-104.us-east-2.compute.internal/10.0.164.104
Start Time:   Tue, 05 Jan 2021 14:07:14 +0530
Labels:       app=badmem
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{ "name": "", "interface": "eth0", "ips": [ "10.129.2.37" ], "default": true, "dns": {} }]
              k8s.v1.cni.cncf.io/networks-status:
                [{ "name": "", "interface": "eth0", "ips": [ "10.129.2.37" ], "default": true, "dns": {} }]
              openshift.io/scc: restricted
Status:       Running
IP:           10.129.2.37
IPs:
  IP:           10.129.2.37
Controlled By:  ReplicationController/badmem
Containers:
  badmem:
    Container ID:  cri-o://4c435138458dc16988352f95fe9653bf0315c263cf709266b601a667fc21c832
    Image:         registry.redhat.io/rhel7:latest
    Image ID:      registry.redhat.io/rhel7@sha256:110e61d28c1bfa1aad79e0413b98a70679a070baafb70e122fda4d105651599e
    Port:          <none>
    Host Port:     <none>
    Args:
      python
      -c
      x = []
      while True:
        x.append("x" * 1048576)
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 05 Jan 2021 15:31:56 +0530
      Finished:     Tue, 05 Jan 2021 15:32:02 +0530
    Ready:          False
    Restart Count:  21
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gz2d7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-gz2d7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gz2d7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From                                                 Message
  ----     ------            ----                  ----                                                 -------
  Warning  FailedScheduling  <unknown>             0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>             0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         <unknown>             Successfully assigned app/badmem-j2lqx to ip-10-0-164-104.us-east-2.compute.internal
  Normal   AddedInterface    86m                   multus                                               Add eth0 [10.129.2.37/23]
  Normal   Pulled            86m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 964.725482ms
  Normal   Pulled            86m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 1.759608201s
  Normal   Pulled            86m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 939.182997ms
  Normal   Pulled            85m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 919.173303ms
  Normal   Started           85m (x4 over 86m)     kubelet, ip-10-0-164-104.us-east-2.compute.internal  Started container badmem
  Normal   Pulled            85m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/rhel7:latest" in 934.860053ms
  Normal   Pulling           85m (x5 over 86m)     kubelet, ip-10-0-164-104.us-east-2.compute.internal  Pulling image "registry.redhat.io/rhel7:latest"
  Normal   Created           85m (x5 over 86m)     kubelet, ip-10-0-164-104.us-east-2.compute.internal  Created container badmem
  Warning  BackOff           100s (x382 over 86m)  kubelet, ip-10-0-164-104.us-east-2.compute.internal  Back-off restarting failed container

$ oc describe node ip-10-0-164-104.us-east-2.compute.internal
Name:               ip-10-0-164-104.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-164-104
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.large
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2b
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-048ee2a82b51c1319"}
                    machine.openshift.io/machine: openshift-machine-api/sunilc0501-q8g7n-worker-us-east-2b-l6skb
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 05 Jan 2021 11:59:42 +0530
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-164-104.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 05 Jan 2021 15:34:20 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----             ------  -----------------                 ------------------                ------                      -------
  MemoryPressure   False   Tue, 05 Jan 2021 15:33:28 +0530   Tue, 05 Jan 2021 15:00:33 +0530   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure     False   Tue, 05 Jan 2021 15:33:28 +0530   Tue, 05 Jan 2021 11:59:42 +0530   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure      False   Tue, 05 Jan 2021 15:33:28 +0530   Tue, 05 Jan 2021 11:59:42 +0530   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready            True    Tue, 05 Jan 2021 15:33:28 +0530   Tue, 05 Jan 2021 12:00:53 +0530   KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP:   10.0.164.104
  Hostname:     ip-10-0-164-104.us-east-2.compute.internal
  InternalDNS:  ip-10-0-164-104.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7934700Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1500m
  ephemeral-storage:           114396791822
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      6783724Ki
  pods:                        250
System Info:
  Machine ID:                 ec216bdfbcb53e43cf9f0bacd7069f16
  System UUID:                ec216bdf-bcb5-3e43-cf9f-0bacd7069f16
  Boot ID:                    2006871b-5342-405e-b92d-23edb79081b6
  Kernel Version:             4.18.0-240.10.1.el8_3.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
  Kubelet Version:            v1.20.0+87544c5
  Kube-Proxy Version:         v1.20.0+87544c5
ProviderID:                   aws:///us-east-2b/i-048ee2a82b51c1319
Non-terminated Pods:          (16 in total)
  Namespace                                 Name                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                 ----                           ------------  ----------  ---------------  -------------  ---
  app                                       badmem-j2lqx                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         92m
  openshift-cluster-csi-drivers             aws-ebs-csi-driver-node-r6msv  30m (2%)      0 (0%)      150Mi (2%)       0 (0%)         3h30m
  openshift-cluster-node-tuning-operator    tuned-kqtmn                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h30m
  openshift-dns                             dns-default-5bkv9              65m (4%)      0 (0%)      110Mi (1%)       0 (0%)         3h34m
  openshift-image-registry                  node-ca-2wgqb                  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h30m
  openshift-ingress-canary                  ingress-canary-7kwz6           10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         3h30m
  openshift-machine-config-operator         machine-config-daemon-2vxmt    40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         3h34m
  openshift-marketplace                     certified-operators-b9dvl      10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         34m
  openshift-marketplace                     qe-app-registry-wfmf7          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         89m
  openshift-marketplace                     redhat-marketplace-tnx8p       10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         31m
  openshift-monitoring                      node-exporter-hq6ff            9m (0%)       0 (0%)      210Mi (3%)       0 (0%)         3h30m
  openshift-multus                          multus-pccck                   10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         3h34m
  openshift-multus                          network-metrics-daemon-wgvrg   20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         3h34m
  openshift-network-diagnostics             network-check-target-nzq88     10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         3h30m
  openshift-sdn                             ovs-stf64                      100m (6%)     0 (0%)      400Mi (6%)       0 (0%)         3h34m
  openshift-sdn                             sdn-z7pvs                      110m (7%)     0 (0%)      220Mi (3%)       0 (0%)         3h34m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         454m (30%)    0 (0%)
  memory                      1840Mi (27%)  0 (0%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type     Reason                     Age                   From                                                 Message
  ----     ------                     ----                  ----                                                 -------
  Normal   NodeNotSchedulable         152m (x2 over 3h13m)  kubelet, ip-10-0-164-104.us-east-2.compute.internal  Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeNotSchedulable
  Normal   NodeSchedulable            95m (x2 over 3h12m)   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeSchedulable
  Warning  SystemOOM                  94m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 93966
  Warning  SystemOOM                  94m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 94071
  Warning  SystemOOM                  94m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: kube-rbac-proxy, pid: 3462
  Warning  SystemOOM                  93m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 94343
  Warning  SystemOOM                  93m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: kube-rbac-proxy, pid: 94208
  Warning  SystemOOM                  93m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 94776
  Warning  SystemOOM                  93m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: kube-rbac-proxy, pid: 94594
  Normal   NodeHasInsufficientMemory  92m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeHasInsufficientMemory
  Warning  SystemOOM                  92m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: python, pid: 95543
  Warning  SystemOOM                  92m                   kubelet, ip-10-0-164-104.us-east-2.compute.internal  System OOM encountered, victim process: kube-rbac-proxy, pid: 95146
  Warning  EvictionThresholdMet       38m (x2 over 92m)     kubelet, ip-10-0-164-104.us-east-2.compute.internal  Attempting to reclaim memory
  Normal   NodeHasSufficientMemory    33m (x11 over 3h35m)  kubelet, ip-10-0-164-104.us-east-2.compute.internal  Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Warning  SystemOOM                  2m22s (x49 over 87m)  kubelet, ip-10-0-164-104.us-east-2.compute.internal  (combined from similar events): System OOM encountered, victim process: python, pid: 198213
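For reference, the rc.yaml used above is not attached here; reconstructed from the pod spec shown in the describe output (image, args, and the app=badmem label are from that output; the structure around them is a sketch), it would look roughly like this:

    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: badmem
    spec:
      replicas: 1
      selector:
        app: badmem
      template:
        metadata:
          labels:
            app: badmem
        spec:
          containers:
          - name: badmem
            image: registry.redhat.io/rhel7:latest
            # No resources set, so the pod runs as BestEffort, matching the describe output.
            args:
            - python
            - -c
            - |
              x = []
              while True:
                  x.append("x" * 1048576)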
*** Bug 1873816 has been marked as a duplicate of this bug. ***
*** Bug 1910801 has been marked as a duplicate of this bug. ***
*** Bug 1915023 has been marked as a duplicate of this bug. ***
Hello,

Are there any plans to backport this fix to older 4.x releases?

Thanks,
Neil Girard
*** Bug 1904051 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1931467 has been marked as a duplicate of this bug. ***
The following KCS articles about this bug have been written: https://access.redhat.com/solutions/5853471, which links to https://access.redhat.com/solutions/5843241.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.