Bug 1857446
| Summary: | ARO/Azure: excessive pod memory allocation causes node lockup | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jim Minter <jminter> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aaleman, abodhe, akamra, akaris, ambrown, aos-bugs, asalkeld, awestbro, bbreard, ccoleman, deads, decarr, dmoessne, dornelas, dyocum, fdeutsch, ffranz, guchen, harpatil, hongkliu, imcleod, jeder, jhunter, jligon, jlucky, jokerman, llong, llopezmo, markmc, mchebbi, mguetta, mharri, nagrawal, nelluri, ngirard, nmalik, nstielau, pkliczew, rphillips, smilner, steven.barre, travi, vuberti, wking, xtian |
| Version: | 4.3.z | Keywords: | NeedsTestCase |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1860031 (view as bug list) | Environment: | |
| Last Closed: | 2021-02-24 15:13:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1896327 | | |
| Bug Blocks: | 1860031, 1873114, 1882116, 1904051, 1908661, 1909062 | | |
Comment 4
Mike Gahagan
2020-07-16 18:18:08 UTC
For both my test clusters:

system-reserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi

I don't see kube-reserved defined anywhere on either cluster but I'll keep looking.

One difference I'm seeing on the Azure cluster during the oom-kill events is several rcu_sched stall warnings, which might point to why the nodes are going NotReady (possibly due to I/O performance differences):

[ 7315.479017] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 7315.541774] Memory cgroup stats for
[ 7315.541655] rcu: 0-....: (144 ticks this GP) idle=7f2/1/0x4000000000000000 softirq=1385795/1385810 fqs=14822
[ 7315.541655] rcu: (detected by 1, t=60002 jiffies, g=2694213, q=38010)
[ 7315.541655] Sending NMI from CPU 1 to CPUs 0:
[ 7315.542014] NMI backtrace for cpu 0
[ 7315.542014] CPU: 0 PID: 2195 Comm: ovs-vswitchd Not tainted 4.18.0-147.13.2.el8_1.x86_64 #1
[ 7315.542014] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 7315.542014] RIP: 0010:cpuacct_account_field+0x27/0x50
[ 7315.542014] Code: 00 00 00 0f 1f 44 00 00 48 8b 87 08 0d 00 00 48 8b 48 10 48 81 f9 20 93 25 91 74 2a 48 63 f6 48 c1 e6 03 48 8b 81 38 01 00 00 <65>48 03 05 61 97 f0 6f 48 01 14 30 48 8b 89 28 01 00 00 48 81 f9
[ 7315.542014] RSP: 0018:ffff961cf7a03ea8 EFLAGS: 00000083
[ 7315.542014] RAX: 0000326e88221278 RBX: ffff961cb699af80 RCX: ffff961cf614e600
[ 7315.542014] RDX: 000000000003b9bd RSI: 0000000000000010 RDI: ffff961cb699af80
[ 7315.542014] RBP: 000000000003b9bd R08: 0000000000000002 R09: 011d94d851f61f2c
[ 7315.542014] R10: 00000f47e7c5b1a8 R11: 0000000000000000 R12: 0000000000000002
[ 7315.542014] R13: ffff961cf7a1cf80 R14: ffffffff90146710 R15: ffff961cf7a1d0b8
[ 7315.542014] FS: 00007f197bbb7d00(0000) GS:ffff961cf7a00000(0000) knlGS:0000000000000000
[ 7315.542014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7315.542014] CR2: 00000000004bad80 CR3: 0000000276e0c003 CR4: 00000000003606f0
[ 7315.542014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7315.542014] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7315.542014] Call Trace:
[ 7315.542014]  <IRQ>
[ 7315.542014]  account_system_index_time+0x63/0x90
[ 7315.542014]  update_process_times+0x1c/0x60
[ 7315.542014]  tick_sched_handle+0x22/0x60
[ 7315.542014]  tick_sched_timer+0x37/0x70
[ 7315.542014]  __hrtimer_run_queues+0x100/0x280
[ 7315.542014]  hrtimer_interrupt+0x100/0x220
[ 7315.542014]  ? sched_clock+0x5/0x10
[ 7315.542014]  hv_stimer0_isr+0x20/0x30 [hv_vmbus]
[ 7315.542014]  hv_stimer0_vector_handler+0x3b/0x70
[ 7315.542014]  hv_stimer0_callback_vector+0xf/0x20
[ 7315.542014]  </IRQ>
[ 7315.542014] RIP: 0010:vprintk_emit+0x3a4/0x450
[ 7315.542014] Code: 90 84 d2 74 6d 0f b6 15 da 38 92 01 48 c7 c0 20 a7 a3 91 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 31 0b 00 00 48 89 df 57 9d <0f>1f 44 00 00 e8 f2 e3 ff ff e9 28 fe ff ff 80 3d aa e9 2e 01 00

This is also happening on 4.5.2 on Azure:

mgahagan-cr9wt-worker-northcentralus-c57nw NotReady worker 99m v1.18.3+b74c5ed 10.0.32.4 <none> Red Hat Enterprise Linux CoreOS 45.82.202007141718-0 (Ootpa) 4.18.0-193.13.2.el8_2.x86_64 cri-o://1.18.2-18.rhaos4.5.git754d46b.el8

Also tried a possible workaround of setting system-reserved.memory to 1250Gi on my 4.5.2 cluster and it didn't help. Moving to RHCOS to take a closer look at the dmesg output.

OOM is an OpenShift-wide topic that impacts multiple teams, among them RHCOS and Node.
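For reference, the system-reserved values quoted above are normally set through a KubeletConfig resource. A minimal sketch follows; the resource name and the pool selector label are assumptions and may differ per cluster.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-system-reserved            # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # assumed worker pool label
  kubeletConfig:
    systemReserved:                    # values quoted in the comment above
      cpu: 500m
      memory: 1Gi
      ephemeral-storage: 1Gi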
At the present time you must configure pod limits: https://docs.openshift.com/container-platform/4.5/nodes/clusters/nodes-cluster-resource-configure.html

Some clusters may want a mutating admission webhook to enforce this. As I understand things, if you're not applying limits, then none of the system-reserved bits come into effect. If this bug stays against RHCOS, since it's not actionable we'll close it.

xref: bug for adding an alert when we go over the system memory reservation: https://bugzilla.redhat.com/show_bug.cgi?id=1881208

Summarizing because someone was confused: no amount of manual tuning can prevent this problem from happening in all cases. Identifying and eliminating the underlying hang is the key outcome, now that we have an alert to identify the minimal and acceptable short-term workaround.

*** Bug 1877059 has been marked as a duplicate of this bug. ***
*** Bug 1889734 has been marked as a duplicate of this bug. ***
*** Bug 1890684 has been marked as a duplicate of this bug. ***
*** Bug 1910086 has been marked as a duplicate of this bug. ***
*** Bug 1892909 has been marked as a duplicate of this bug. ***

Verified on 4.7.0-0.nightly-2021-01-04-215816.
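For context on the pod-limits workaround mentioned earlier in this thread, a minimal sketch of per-container requests and limits; the pod name and values here are illustrative assumptions, not taken from this bug.

apiVersion: v1
kind: Pod
metadata:
  name: limited-example               # hypothetical name
spec:
  containers:
  - name: app
    image: registry.redhat.io/rhel7:latest
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"               # container is OOM-killed in its own cgroup above this

With a memory limit in place, a runaway container is killed inside its own cgroup rather than pushing the whole node into memory pressure.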
Tested this on 2 nodes. Cordoned the node, created the RC, and could see the pod being evicted due to System OOM and the node then attempting to reclaim memory without going into a NotReady state.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-01-04-215816 True False 45m Cluster version is 4.7.0-0.nightly-2021-01-04-215816
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-139-84.us-east-2.compute.internal Ready master 68m v1.20.0+87544c5 10.0.139.84 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-146-230.us-east-2.compute.internal Ready worker 63m v1.20.0+87544c5 10.0.146.230 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-164-104.us-east-2.compute.internal Ready,SchedulingDisabled worker 62m v1.20.0+87544c5 10.0.164.104 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-169-245.us-east-2.compute.internal Ready master 68m v1.20.0+87544c5 10.0.169.245 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-192-137.us-east-2.compute.internal Ready master 68m v1.20.0+87544c5 10.0.192.137 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-218-188.us-east-2.compute.internal Ready,SchedulingDisabled worker 62m v1.20.0+87544c5 10.0.218.188 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
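The rc.yaml used below was not attached to the bug; the following is a sketch reconstructed from the pod spec shown in the describe output further down (the name, namespace, image, label, and args match that output; the rest of the manifest is assumed).

apiVersion: v1
kind: ReplicationController
metadata:
  name: badmem
  namespace: app
spec:
  replicas: 1
  selector:
    app: badmem
  template:
    metadata:
      labels:
        app: badmem
    spec:
      containers:
      - name: badmem
        image: registry.redhat.io/rhel7:latest
        # No resource requests or limits, so the pod runs as BestEffort and
        # keeps allocating memory until the node's eviction threshold is hit.
        args:
        - python
        - -c
        - |
          x = []
          while True:
              x.append("x" * 1048576)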
$ oc create -f rc.yaml
replicationcontroller/badmem created
$ oc get rc
NAME DESIRED CURRENT READY AGE
badmem 1 1 0 3s
$ oc get pods
NAME READY STATUS RESTARTS AGE
badmem-tjhjd 0/1 ContainerCreating 0 6s
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
badmem-tjhjd 1/1 Running 0 20s 10.131.0.42 ip-10-0-146-230.us-east-2.compute.internal <none> <none>
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
badmem-fk9x5 0/1 CrashLoopBackOff 7 22m 10.131.0.51 ip-10-0-146-230.us-east-2.compute.internal <none> <none>
badmem-tjhjd 0/1 Evicted 0 23m <none> ip-10-0-146-230.us-east-2.compute.internal <none> <none>
$ oc describe pod badmem-fk9x5
Name: badmem-fk9x5
Namespace: app
Priority: 0
Node: ip-10-0-146-230.us-east-2.compute.internal/10.0.146.230
Start Time: Tue, 05 Jan 2021 13:09:05 +0530
Labels: app=badmem
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.131.0.51"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.131.0.51"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Running
IP: 10.131.0.51
IPs:
IP: 10.131.0.51
Controlled By: ReplicationController/badmem
Containers:
badmem:
Container ID: cri-o://6474e9157e9ee59730590413eebbcf2316fae85d3de6237ebd5221f54e77bd33
Image: registry.redhat.io/rhel7:latest
Image ID: registry.redhat.io/rhel7@sha256:110e61d28c1bfa1aad79e0413b98a70679a070baafb70e122fda4d105651599e
Port: <none>
Host Port: <none>
Args:
python
-c
x = []
while True:
x.append("x" * 1048576)
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 05 Jan 2021 13:21:26 +0530
Finished: Tue, 05 Jan 2021 13:21:34 +0530
Ready: False
Restart Count: 7
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gz2d7 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-gz2d7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gz2d7
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> 0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Warning FailedScheduling <unknown> 0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal Scheduled <unknown> Successfully assigned app/badmem-fk9x5 to ip-10-0-146-230.us-east-2.compute.internal
Normal AddedInterface 17m multus Add eth0 [10.131.0.51/23]
Normal Pulled 17m kubelet, ip-10-0-146-230.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 889.354606ms
Normal Pulled 17m kubelet, ip-10-0-146-230.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 935.541782ms
Normal Pulled 16m kubelet, ip-10-0-146-230.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 878.485939ms
Normal Created 16m (x4 over 17m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Created container badmem
Normal Started 16m (x4 over 17m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Started container badmem
Normal Pulled 16m kubelet, ip-10-0-146-230.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 819.24038ms
Normal Pulling 15m (x5 over 17m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Pulling image "registry.redhat.io/rhel7:latest"
Normal Pulled 15m kubelet, ip-10-0-146-230.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 1.221594963s
Warning BackOff 2m25s (x61 over 17m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Back-off restarting failed container
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-139-84.us-east-2.compute.internal Ready master 94m v1.20.0+87544c5
ip-10-0-146-230.us-east-2.compute.internal Ready worker 88m v1.20.0+87544c5
ip-10-0-164-104.us-east-2.compute.internal Ready,SchedulingDisabled worker 88m v1.20.0+87544c5
ip-10-0-169-245.us-east-2.compute.internal Ready master 94m v1.20.0+87544c5
ip-10-0-192-137.us-east-2.compute.internal Ready master 94m v1.20.0+87544c5
ip-10-0-218-188.us-east-2.compute.internal Ready,SchedulingDisabled worker 88m v1.20.0+87544c5
$ oc describe node ip-10-0-146-230.us-east-2.compute.internal
Name: ip-10-0-146-230.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-146-230
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m5.large
node.openshift.io/os_id=rhcos
topology.ebs.csi.aws.com/zone=us-east-2a
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2a
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0b76f16895b9950f4"}
machine.openshift.io/machine: openshift-machine-api/sunilc0501-q8g7n-worker-us-east-2a-s8sh9
machineconfiguration.openshift.io/currentConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
machineconfiguration.openshift.io/desiredConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 05 Jan 2021 11:58:58 +0530
Taints: node.kubernetes.io/memory-pressure:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-10-0-146-230.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Tue, 05 Jan 2021 13:28:00 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure True Tue, 05 Jan 2021 13:26:56 +0530 Tue, 05 Jan 2021 13:26:56 +0530 KubeletHasInsufficientMemory kubelet has insufficient memory available
DiskPressure False Tue, 05 Jan 2021 13:26:56 +0530 Tue, 05 Jan 2021 11:58:58 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 05 Jan 2021 13:26:56 +0530 Tue, 05 Jan 2021 11:58:58 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 05 Jan 2021 13:26:56 +0530 Tue, 05 Jan 2021 11:59:48 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.146.230
Hostname: ip-10-0-146-230.us-east-2.compute.internal
InternalDNS: ip-10-0-146-230.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 125293548Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7934684Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1500m
ephemeral-storage: 114396791822
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 6783708Ki
pods: 250
System Info:
Machine ID: ec29e18d242aa4cd9260b6285abe896e
System UUID: ec29e18d-242a-a4cd-9260-b6285abe896e
Boot ID: 793835d0-a758-4fba-9c1f-9a82685497f1
Kernel Version: 4.18.0-240.10.1.el8_3.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
Kubelet Version: v1.20.0+87544c5
Kube-Proxy Version: v1.20.0+87544c5
ProviderID: aws:///us-east-2a/i-0b76f16895b9950f4
Non-terminated Pods: (28 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
app badmem-fk9x5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 24m
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-8cvqs 30m (2%) 0 (0%) 150Mi (2%) 0 (0%) 84m
openshift-cluster-node-tuning-operator tuned-vchjb 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 84m
openshift-console downloads-6d7bb8f56d-zw8fl 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 68m
openshift-dns dns-default-ts8jw 65m (4%) 0 (0%) 110Mi (1%) 0 (0%) 88m
openshift-image-registry image-registry-59b74c4947-ld2ql 100m (6%) 0 (0%) 256Mi (3%) 0 (0%) 68m
openshift-image-registry node-ca-7n5rh 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 84m
openshift-ingress-canary ingress-canary-rl4rx 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 84m
openshift-ingress router-default-7854b58d84-p64n9 100m (6%) 0 (0%) 256Mi (3%) 0 (0%) 68m
openshift-kube-storage-version-migrator migrator-777f85c94f-spws6 100m (6%) 0 (0%) 200Mi (3%) 0 (0%) 68m
openshift-machine-config-operator machine-config-daemon-54jtc 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 89m
openshift-marketplace certified-operators-9l674 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 68m
openshift-marketplace community-operators-77rzq 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 80s
openshift-marketplace qe-app-registry-55vfw 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 68m
openshift-marketplace redhat-marketplace-qm5g4 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 14m
openshift-marketplace redhat-operators-7bb2s 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 68m
openshift-monitoring alertmanager-main-0 8m (0%) 0 (0%) 270Mi (4%) 0 (0%) 68m
openshift-monitoring grafana-56f75b4dfd-8l9k9 5m (0%) 0 (0%) 120Mi (1%) 0 (0%) 68m
openshift-monitoring node-exporter-wqb52 9m (0%) 0 (0%) 210Mi (3%) 0 (0%) 84m
openshift-monitoring openshift-state-metrics-8dcd45497-6x7zq 3m (0%) 0 (0%) 190Mi (2%) 0 (0%) 68m
openshift-monitoring prometheus-adapter-8649fb987f-k9jt4 1m (0%) 0 (0%) 25Mi (0%) 0 (0%) 68m
openshift-monitoring prometheus-k8s-1 76m (5%) 0 (0%) 1204Mi (18%) 0 (0%) 68m
openshift-monitoring thanos-querier-89cbbf9b8-6s987 9m (0%) 0 (0%) 92Mi (1%) 0 (0%) 68m
openshift-multus multus-dld6h 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 89m
openshift-multus network-metrics-daemon-bjd6m 20m (1%) 0 (0%) 120Mi (1%) 0 (0%) 88m
openshift-network-diagnostics network-check-target-hfqmw 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 84m
openshift-sdn ovs-4m9ng 100m (6%) 0 (0%) 400Mi (6%) 0 (0%) 88m
openshift-sdn sdn-vrt5v 110m (7%) 0 (0%) 220Mi (3%) 0 (0%) 88m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 886m (59%) 0 (0%)
memory 4603Mi (69%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeHasNoDiskPressure 89m (x7 over 89m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 89m (x7 over 89m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeReady 88m kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeReady
Normal NodeNotSchedulable 69m kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeNotSchedulable
Normal NodeSchedulable 68m kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeSchedulable
Warning SystemOOM 25m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 153197
Warning SystemOOM 25m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 47565
Warning SystemOOM 25m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 47871
Warning SystemOOM 25m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 49284
Warning SystemOOM 24m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 154084
Normal NodeHasInsufficientMemory 24m kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasInsufficientMemory
Warning SystemOOM 24m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 47671
Warning SystemOOM 24m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 155463
Warning SystemOOM 24m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 154420
Normal NodeHasSufficientMemory 19m (x8 over 89m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Node ip-10-0-146-230.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Warning SystemOOM 18m kubelet, ip-10-0-146-230.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 171167
Warning SystemOOM 6m34s (x15 over 18m) kubelet, ip-10-0-146-230.us-east-2.compute.internal (combined from similar events): System OOM encountered, victim process: python, pid: 200649
Warning EvictionThresholdMet 80s (x5 over 25m) kubelet, ip-10-0-146-230.us-east-2.compute.internal Attempting to reclaim memory
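The SystemOOM and EvictionThresholdMet events above come from the kubelet's eviction logic. If the defaults ever need tuning, the thresholds can be adjusted with a KubeletConfig along these lines; this is a sketch only, and the values shown are illustrative assumptions rather than recommendations from this bug.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-eviction-thresholds    # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # assumed worker pool label
  kubeletConfig:
    evictionHard:
      memory.available: "500Mi"       # evict pods before the node itself runs out of memory
    evictionSoft:
      memory.available: "1Gi"
    evictionSoftGracePeriod:
      memory.available: "90s"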
$ oc get pods
NAME READY STATUS RESTARTS AGE
badmem-fk9x5 0/1 CrashLoopBackOff 14 55m
badmem-tjhjd 0/1 Evicted 0 56m
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
badmem-r2rfx 0/1 OOMKilled 1 29s 10.129.2.33 ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-tjhjd 0/1 Evicted 0 57m <none> ip-10-0-146-230.us-east-2.compute.internal <none> <none>
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-139-84.us-east-2.compute.internal Ready master 126m v1.20.0+87544c5 10.0.139.84 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-146-230.us-east-2.compute.internal Ready,SchedulingDisabled worker 121m v1.20.0+87544c5 10.0.146.230 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-164-104.us-east-2.compute.internal Ready worker 120m v1.20.0+87544c5 10.0.164.104 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-169-245.us-east-2.compute.internal Ready master 126m v1.20.0+87544c5 10.0.169.245 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-192-137.us-east-2.compute.internal Ready master 126m v1.20.0+87544c5 10.0.192.137 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-218-188.us-east-2.compute.internal Ready,SchedulingDisabled worker 120m v1.20.0+87544c5 10.0.218.188 <none> Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
badmem-4c9jt 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-65pk6 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-8g5vk 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-b8zvb 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-cq7hh 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-d8mcg 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-gmkdr 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-hx9k7 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-j2lqx 0/1 CrashLoopBackOff 21 91m 10.129.2.37 ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-llqjc 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-r2rfx 0/1 Evicted 0 94m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-tjhjd 0/1 Evicted 0 151m <none> ip-10-0-146-230.us-east-2.compute.internal <none> <none>
badmem-wkbj4 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-wtmxh 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-x95vs 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
badmem-z7hkj 0/1 Evicted 0 91m <none> ip-10-0-164-104.us-east-2.compute.internal <none> <none>
$ oc describe pod badmem-j2lqx
Name: badmem-j2lqx
Namespace: app
Priority: 0
Node: ip-10-0-164-104.us-east-2.compute.internal/10.0.164.104
Start Time: Tue, 05 Jan 2021 14:07:14 +0530
Labels: app=badmem
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.129.2.37"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.129.2.37"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Running
IP: 10.129.2.37
IPs:
IP: 10.129.2.37
Controlled By: ReplicationController/badmem
Containers:
badmem:
Container ID: cri-o://4c435138458dc16988352f95fe9653bf0315c263cf709266b601a667fc21c832
Image: registry.redhat.io/rhel7:latest
Image ID: registry.redhat.io/rhel7@sha256:110e61d28c1bfa1aad79e0413b98a70679a070baafb70e122fda4d105651599e
Port: <none>
Host Port: <none>
Args:
python
-c
x = []
while True:
x.append("x" * 1048576)
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 05 Jan 2021 15:31:56 +0530
Finished: Tue, 05 Jan 2021 15:32:02 +0530
Ready: False
Restart Count: 21
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gz2d7 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-gz2d7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gz2d7
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> 0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Warning FailedScheduling <unknown> 0/6 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal Scheduled <unknown> Successfully assigned app/badmem-j2lqx to ip-10-0-164-104.us-east-2.compute.internal
Normal AddedInterface 86m multus Add eth0 [10.129.2.37/23]
Normal Pulled 86m kubelet, ip-10-0-164-104.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 964.725482ms
Normal Pulled 86m kubelet, ip-10-0-164-104.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 1.759608201s
Normal Pulled 86m kubelet, ip-10-0-164-104.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 939.182997ms
Normal Pulled 85m kubelet, ip-10-0-164-104.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 919.173303ms
Normal Started 85m (x4 over 86m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Started container badmem
Normal Pulled 85m kubelet, ip-10-0-164-104.us-east-2.compute.internal Successfully pulled image "registry.redhat.io/rhel7:latest" in 934.860053ms
Normal Pulling 85m (x5 over 86m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Pulling image "registry.redhat.io/rhel7:latest"
Normal Created 85m (x5 over 86m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Created container badmem
Warning BackOff 100s (x382 over 86m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Back-off restarting failed container
$ oc describe node ip-10-0-164-104.us-east-2.compute.internal
Name: ip-10-0-164-104.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-164-104
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m5.large
node.openshift.io/os_id=rhcos
topology.ebs.csi.aws.com/zone=us-east-2b
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2b
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-048ee2a82b51c1319"}
machine.openshift.io/machine: openshift-machine-api/sunilc0501-q8g7n-worker-us-east-2b-l6skb
machineconfiguration.openshift.io/currentConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
machineconfiguration.openshift.io/desiredConfig: rendered-worker-3b5bd44448e8d9aa6de4000b0f64c1d7
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 05 Jan 2021 11:59:42 +0530
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-10-0-164-104.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Tue, 05 Jan 2021 15:34:20 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 05 Jan 2021 15:33:28 +0530 Tue, 05 Jan 2021 15:00:33 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 05 Jan 2021 15:33:28 +0530 Tue, 05 Jan 2021 11:59:42 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 05 Jan 2021 15:33:28 +0530 Tue, 05 Jan 2021 11:59:42 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 05 Jan 2021 15:33:28 +0530 Tue, 05 Jan 2021 12:00:53 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.164.104
Hostname: ip-10-0-164-104.us-east-2.compute.internal
InternalDNS: ip-10-0-164-104.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 125293548Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7934700Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1500m
ephemeral-storage: 114396791822
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 6783724Ki
pods: 250
System Info:
Machine ID: ec216bdfbcb53e43cf9f0bacd7069f16
System UUID: ec216bdf-bcb5-3e43-cf9f-0bacd7069f16
Boot ID: 2006871b-5342-405e-b92d-23edb79081b6
Kernel Version: 4.18.0-240.10.1.el8_3.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 47.83.202101041743-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
Kubelet Version: v1.20.0+87544c5
Kube-Proxy Version: v1.20.0+87544c5
ProviderID: aws:///us-east-2b/i-048ee2a82b51c1319
Non-terminated Pods: (16 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
app badmem-j2lqx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 92m
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-r6msv 30m (2%) 0 (0%) 150Mi (2%) 0 (0%) 3h30m
openshift-cluster-node-tuning-operator tuned-kqtmn 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 3h30m
openshift-dns dns-default-5bkv9 65m (4%) 0 (0%) 110Mi (1%) 0 (0%) 3h34m
openshift-image-registry node-ca-2wgqb 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 3h30m
openshift-ingress-canary ingress-canary-7kwz6 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 3h30m
openshift-machine-config-operator machine-config-daemon-2vxmt 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 3h34m
openshift-marketplace certified-operators-b9dvl 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 34m
openshift-marketplace qe-app-registry-wfmf7 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 89m
openshift-marketplace redhat-marketplace-tnx8p 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 31m
openshift-monitoring node-exporter-hq6ff 9m (0%) 0 (0%) 210Mi (3%) 0 (0%) 3h30m
openshift-multus multus-pccck 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 3h34m
openshift-multus network-metrics-daemon-wgvrg 20m (1%) 0 (0%) 120Mi (1%) 0 (0%) 3h34m
openshift-network-diagnostics network-check-target-nzq88 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 3h30m
openshift-sdn ovs-stf64 100m (6%) 0 (0%) 400Mi (6%) 0 (0%) 3h34m
openshift-sdn sdn-z7pvs 110m (7%) 0 (0%) 220Mi (3%) 0 (0%) 3h34m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 454m (30%) 0 (0%)
memory 1840Mi (27%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeNotSchedulable 152m (x2 over 3h13m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeNotSchedulable
Normal NodeSchedulable 95m (x2 over 3h12m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeSchedulable
Warning SystemOOM 94m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 93966
Warning SystemOOM 94m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 94071
Warning SystemOOM 94m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: kube-rbac-proxy, pid: 3462
Warning SystemOOM 93m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 94343
Warning SystemOOM 93m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: kube-rbac-proxy, pid: 94208
Warning SystemOOM 93m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 94776
Warning SystemOOM 93m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: kube-rbac-proxy, pid: 94594
Normal NodeHasInsufficientMemory 92m kubelet, ip-10-0-164-104.us-east-2.compute.internal Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeHasInsufficientMemory
Warning SystemOOM 92m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: python, pid: 95543
Warning SystemOOM 92m kubelet, ip-10-0-164-104.us-east-2.compute.internal System OOM encountered, victim process: kube-rbac-proxy, pid: 95146
Warning EvictionThresholdMet 38m (x2 over 92m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Attempting to reclaim memory
Normal NodeHasSufficientMemory 33m (x11 over 3h35m) kubelet, ip-10-0-164-104.us-east-2.compute.internal Node ip-10-0-164-104.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Warning SystemOOM 2m22s (x49 over 87m) kubelet, ip-10-0-164-104.us-east-2.compute.internal (combined from similar events): System OOM encountered, victim process: python, pid: 198213
*** Bug 1873816 has been marked as a duplicate of this bug. ***
*** Bug 1910801 has been marked as a duplicate of this bug. ***
*** Bug 1915023 has been marked as a duplicate of this bug. ***

Hello,

Are there any plans to backport this fix to older 4.x releases?

Thanks,
Neil Girard

*** Bug 1904051 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

*** Bug 1931467 has been marked as a duplicate of this bug. ***

The following KCS articles have been written about this bug: https://access.redhat.com/solutions/5853471, which links to https://access.redhat.com/solutions/5843241.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.