Bug 1800319
| Summary: | [4.5] A pod that gradually leaks memory causes node to become unreachable for 10 minutes |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | Clayton Coleman <ccoleman> |
| Component: | Node |
| Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED ERRATA |
| QA Contact: | Weinan Liu <weinliu> |
| Severity: | high |
| Docs Contact: | |
| Priority: | unspecified |
| Version: | 4.4 |
| CC: | aos-bugs, cfillekes, dfeddema, hannsj_uhl, Holger.Wolf, jerzhang, jokerman, jshepherd, minmli, rphillips, tdale, weinliu, xxia |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | Cause: The kubepods.slice memory cgroup was not being set to the maximum limit minus the reservations. Consequence: Nodes became overloaded with OOM kills instead of evicting workloads. Fix: Set the kubepods.slice memory reservation correctly. |
| Story Points: | --- |
| Clone Of: | |
| Clones: | 1801824, 1802687, 1806786 (view as bug list) |
| Environment: | |
| Last Closed: | 2020-05-13 21:56:57 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1765215, 1766237, 1801826, 1801829, 1802687, 1805398, 1806786, 1808444 |
| Attachments: | |
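As a rough sanity check of the relationship the Doc Text fix enforces (illustrative arithmetic only; the split between system-reserved, kube-reserved, and eviction thresholds is not shown in this report), the memory figures from the node description later in this report give:

```shell
# Illustrative arithmetic only: with the fix, kubepods.slice should be capped
# near node capacity minus reservations (i.e. close to Allocatable).
# Figures in Ki are copied from the node description in this report.
capacity_ki=8161840      # memory Capacity
allocatable_ki=7010864   # memory Allocatable
reserved_ki=$((capacity_ki - allocatable_ki))
echo "total reserved (incl. any eviction threshold): ${reserved_ki}Ki"
```

Before the fix, no such cap was applied to kubepods.slice, so pods could consume into the reserved headroom and starve the kubelet itself.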
Once this is fixed we need to test against 4.3 and 4.2 and backport if it reproduces there - this can DoS a node.

The OOMKiller on Z is so active in the cases we tested on z/VM clusters that it fills the 4 GB z/VM console message spool with OOMKiller messages, which requires manual intervention at the z/VM x3270 console and sometimes a complete re-installation. It might be worth testing https://bugzilla.redhat.com/show_bug.cgi?id=1800319 with long-running forked multiprocess memory hogs such as `stress-ng --mmap 64 &`. On Z, such a memory hog not only completely disables the node and the node's monitoring and dns/routing cluster operators, but does so over the better part of a day, only falling into complete catatonia late at night. It appears that eviction starts, but the OOMKiller is killing so many processes at such a rate that it even kills the service conducting the eviction. We are trying to convince the owners of OCP 4.3 test clusters on Power architecture at IBM to try these workloads too.

Going to reopen while we engage the kernel team; we believe this is a fundamental issue with OOM kill that would manifest regardless of our default reservation if the kubelet had enough usage. The kernel issue is https://bugzilla.redhat.com/show_bug.cgi?id=1803217

*** Bug 1803239 has been marked as a duplicate of this bug. ***

On version 4.4.0-0.nightly-2020-02-19-213909, this bug is reproduced:
Time starts from creating the memory-hog pod: 3 minutes later, the node enters Unknown status and is tainted unreachable;
8 minutes later, the pod enters Terminating status;
15 minutes later, the pod disappears (OOM killed) and the node returns to Ready status.
So there are at least 12 minutes during which the node stays Unknown and the heartbeat is stopped.
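The timeline above was observed manually; a hedged sketch of capturing the same transitions (assuming cluster access, and using the node and pod names from this reproduction) is a simple bounded watch loop:

```shell
# Hypothetical observation loop - records node readiness and pod phase once a
# minute for 20 minutes while the memory-hog pod runs. Names are taken from
# this reproduction, not a general recipe.
for i in $(seq 1 20); do
  date -u
  oc get node ip-10-0-153-107.us-east-2.compute.internal --no-headers
  oc get pod memory-hog-pod -n minmli --no-headers 2>/dev/null
  sleep 60
done
```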
After the node recovers, the events are delivered:
[lyman@localhost env]$ oc describe node ip-10-0-153-107.us-east-2.compute.internal
Name: ip-10-0-153-107.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-153-107
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m4.large
node.openshift.io/os_id=rhcos
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2b
Annotations: machine.openshift.io/machine: openshift-machine-api/minmli-0220-zgsmn-worker-us-east-2b-zwwcf
machineconfiguration.openshift.io/currentConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
machineconfiguration.openshift.io/desiredConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 20 Feb 2020 11:19:31 +0800
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Fri, 21 Feb 2020 14:40:41 +0800 Fri, 21 Feb 2020 14:43:31 +0800 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Fri, 21 Feb 2020 14:40:41 +0800 Fri, 21 Feb 2020 14:43:31 +0800 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Fri, 21 Feb 2020 14:40:41 +0800 Fri, 21 Feb 2020 14:43:31 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Fri, 21 Feb 2020 14:40:41 +0800 Fri, 21 Feb 2020 14:43:31 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.0.153.107
Hostname: ip-10-0-153-107.us-east-2.compute.internal
InternalDNS: ip-10-0-153-107.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 2
ephemeral-storage: 125277164Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8161840Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 1500m
ephemeral-storage: 114381692328
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7010864Ki
pods: 250
System Info:
Machine ID: 8efca36c2ad941a7a0e84909f882c4ed
System UUID: ec2d4765-d316-147c-37a8-a4fc04bd9239
Boot ID: 029d4d02-8598-4bcd-aaa0-f23bb12e88c5
Kernel Version: 4.18.0-147.5.1.el8_1.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 44.81.202002191330-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.17.0-4.dev.rhaos4.4.gitc3436cc.el8
Kubelet Version: v1.17.1
Kube-Proxy Version: v1.17.1
ProviderID: aws:///us-east-2b/i-042d34751c5d914e5
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
minmli memory-hog-pod 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13m
openshift-cluster-node-tuning-operator tuned-b7jdj 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 27h
openshift-dns dns-default-b7pxh 110m (7%) 0 (0%) 70Mi (1%) 512Mi (7%) 27h
openshift-image-registry node-ca-bvcfz 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 27h
openshift-ingress router-default-5cd6c75986-vchdp 100m (6%) 0 (0%) 256Mi (3%) 0 (0%) 27h
openshift-machine-config-operator machine-config-daemon-mlp2c 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 27h
openshift-monitoring alertmanager-main-0 110m (7%) 100m (6%) 245Mi (3%) 25Mi (0%) 27h
openshift-monitoring grafana-755b7df4f9-tph2m 110m (7%) 0 (0%) 120Mi (1%) 0 (0%) 27h
openshift-monitoring node-exporter-khpq5 112m (7%) 0 (0%) 200Mi (2%) 0 (0%) 27h
openshift-monitoring prometheus-adapter-d64c8db56-c2ww8 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 3h49m
openshift-monitoring prometheus-k8s-1 480m (32%) 200m (13%) 1234Mi (18%) 50Mi (0%) 27h
openshift-multus multus-bfhxb 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 27h
openshift-sdn ovs-m8zhh 200m (13%) 0 (0%) 400Mi (5%) 0 (0%) 27h
openshift-sdn sdn-pxlfz 100m (6%) 0 (0%) 200Mi (2%) 0 (0%) 27h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1402m (93%) 300m (20%)
memory 3055Mi (44%) 587Mi (8%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events: <none>
After the node recovers:
[lyman@localhost env]$ oc describe node ip-10-0-153-107.us-east-2.compute.internal
Name: ip-10-0-153-107.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-153-107
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m4.large
node.openshift.io/os_id=rhcos
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2b
Annotations: machine.openshift.io/machine: openshift-machine-api/minmli-0220-zgsmn-worker-us-east-2b-zwwcf
machineconfiguration.openshift.io/currentConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
machineconfiguration.openshift.io/desiredConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 20 Feb 2020 11:19:31 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 21 Feb 2020 14:56:47 +0800 Fri, 21 Feb 2020 14:56:37 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Feb 2020 14:56:47 +0800 Fri, 21 Feb 2020 14:56:37 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Feb 2020 14:56:47 +0800 Fri, 21 Feb 2020 14:56:37 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 21 Feb 2020 14:56:47 +0800 Fri, 21 Feb 2020 14:56:47 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.153.107
Hostname: ip-10-0-153-107.us-east-2.compute.internal
InternalDNS: ip-10-0-153-107.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 2
ephemeral-storage: 125277164Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8161840Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 1500m
ephemeral-storage: 114381692328
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7010864Ki
pods: 250
System Info:
Machine ID: 8efca36c2ad941a7a0e84909f882c4ed
System UUID: ec2d4765-d316-147c-37a8-a4fc04bd9239
Boot ID: 029d4d02-8598-4bcd-aaa0-f23bb12e88c5
Kernel Version: 4.18.0-147.5.1.el8_1.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 44.81.202002191330-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.17.0-4.dev.rhaos4.4.gitc3436cc.el8
Kubelet Version: v1.17.1
Kube-Proxy Version: v1.17.1
ProviderID: aws:///us-east-2b/i-042d34751c5d914e5
Non-terminated Pods: (10 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
openshift-cluster-node-tuning-operator tuned-b7jdj 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 27h
openshift-dns dns-default-b7pxh 110m (7%) 0 (0%) 70Mi (1%) 512Mi (7%) 27h
openshift-image-registry node-ca-bvcfz 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 27h
openshift-machine-config-operator machine-config-daemon-mlp2c 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 27h
openshift-monitoring alertmanager-main-0 110m (7%) 100m (6%) 245Mi (3%) 25Mi (0%) 2m21s
openshift-monitoring node-exporter-khpq5 112m (7%) 0 (0%) 200Mi (2%) 0 (0%) 27h
openshift-monitoring prometheus-k8s-1 480m (32%) 200m (13%) 1234Mi (18%) 50Mi (0%) 2m21s
openshift-multus multus-bfhxb 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 27h
openshift-sdn ovs-m8zhh 200m (13%) 0 (0%) 400Mi (5%) 0 (0%) 27h
openshift-sdn sdn-pxlfz 100m (6%) 0 (0%) 200Mi (2%) 0 (0%) 27h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1182m (78%) 300m (20%)
memory 2659Mi (38%) 587Mi (8%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning SystemOOM 2m44s kubelet, ip-10-0-153-107.us-east-2.compute.internal System OOM encountered, victim process: stress, pid: 3170956
Normal NodeHasSufficientMemory 2m43s (x9 over 27h) kubelet, ip-10-0-153-107.us-east-2.compute.internal Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m43s (x9 over 27h) kubelet, ip-10-0-153-107.us-east-2.compute.internal Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m43s (x9 over 27h) kubelet, ip-10-0-153-107.us-east-2.compute.internal Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeNotReady 2m43s kubelet, ip-10-0-153-107.us-east-2.compute.internal Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeNotReady
Normal NodeReady 2m33s (x2 over 27h) kubelet, ip-10-0-153-107.us-east-2.compute.internal Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeReady
*** Bug 1801771 has been marked as a duplicate of this bug. ***

QE: This patch is in 4.5. I'm not sure of a great way of testing it because the kubelet gets injected into RHCOS.

*** Bug 1802944 has been marked as a duplicate of this bug. ***

*** Bug 1766237 has been marked as a duplicate of this bug. ***

*** Bug 1767284 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
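One possible QE spot-check (an assumption on my part, not a verified procedure): read the kubepods.slice memory cap from a debug pod on a worker and compare it against capacity minus the configured reservations. On cgroup v1, which RHCOS uses at this version, that would look roughly like:

```shell
# Hypothetical check - reads the kubepods.slice memory cap on a node.
# Replace <node-name> with a real worker; the path assumes cgroup v1.
oc debug node/<node-name> -- chroot /host \
  cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
```

With the fix applied, this value should be well below total node memory rather than effectively unlimited.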
Created attachment 1661546 [details]: causes an OOMkill / eviction for memory

Creating a memory-hog pod (which should be evicted or OOM killed rather than taking down the node) causes the node to become unreachable for more than 10 minutes. On the node, the kubelet appears to be running but cannot heartbeat the apiserver. The node also appears to think that the apiserver deleted all the pods (DELETE("api") in the logs), which is not correct - no pods except the OOM-killed one should be evicted or deleted.

Recreate:
1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml)
2. Wait 2-3 minutes while memory fills up on the worker

Expected:
1. The memory-hog pod is OOM killed and/or evicted (either would be acceptable)
2. The node remains Ready

Actual:
1. The node is tainted as unreachable, heartbeats stop, and it takes more than 10 minutes to recover
2. After recovery, the events are delivered

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this (and add eviction tests, because this does not seem to evict anything).
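The attached kill-node.yaml is not reproduced in this report; a hypothetical memory-hog pod of the same shape (the pod name matches the reproduction above, but the image and command are illustrative guesses, not the attachment's contents) could look like:

```shell
# Hypothetical reproduction pod - NOT the actual kill-node.yaml attachment.
# The stress-ng invocation follows the suggestion made earlier in this bug.
oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-pod
spec:
  restartPolicy: Never
  containers:
  - name: memory-hog
    image: quay.io/example/stress-ng:latest   # illustrative image reference
    command: ["stress-ng", "--mmap", "64"]
EOF
```

No memory limit is set on purpose: the point of the reproduction is that an unbounded leaker should be handled by node-level eviction, not crash the node.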