Bug 1800319

Summary: [4.5] A pod that gradually leaks memory causes node to become unreachable for 10 minutes
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED ERRATA
QA Contact: Weinan Liu <weinliu>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.4
CC: aos-bugs, cfillekes, dfeddema, hannsj_uhl, Holger.Wolf, jerzhang, jokerman, jshepherd, minmli, rphillips, tdale, weinliu, xxia
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The kubepods.slice memory cgroup limit was not being set to the maximum limit minus the reservations.
Consequence: Nodes became overloaded with OOM kills and did not evict workloads.
Fix: The kubepods.slice memory reservation is now set correctly.
Result:
Story Points: ---
Clone Of:
Clones: 1801824, 1802687, 1806786 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:56:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1765215, 1766237, 1801826, 1801829, 1802687, 1805398, 1806786, 1808444    
Attachments:
  Description                                Flags
  causes an OOMkill / eviction for memory    none

Description Clayton Coleman 2020-02-06 21:01:31 UTC
Created attachment 1661546 [details]
causes an OOMkill / eviction for memory

Creating a memory-hogger pod that should be OOM-killed or evicted, and otherwise handled safely by the node, instead causes the node to become unreachable for more than 10 minutes.  On the node, the kubelet appears to be running but cannot heartbeat to the apiserver.  The node also appears to believe that the apiserver deleted all of its pods (DELETE("api") in the logs), which is not correct - no pods other than the OOM-killed one should be evicted or deleted.

Recreate

1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml); a hypothetical sketch of a similar manifest is shown after these steps
2. Wait 2-3 minutes while memory fills up on the worker
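
The kill-node.yaml attachment is not reproduced here. As a rough, hypothetical sketch of the kind of manifest involved (the image name and stress arguments below are illustrative assumptions, not the attachment's actual contents), a pod with no memory limit that fills up the worker's memory could look like:

# Hypothetical stand-in for the kill-node.yaml attachment: a pod with no
# memory requests/limits that allocates memory until the node is exhausted.
oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-pod
spec:
  restartPolicy: Never
  containers:
  - name: memory-hog
    image: quay.io/example/stress:latest  # hypothetical image providing the `stress` tool
    # Allocate several GiB across workers and hold it, so memory fills over a few minutes.
    command: ["stress", "--vm", "4", "--vm-bytes", "4G", "--vm-hang", "0"]
    resources: {}                         # deliberately no requests or limits
EOF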

Expected:

1. memory-hog pod is oomkilled and/or evicted (either would be acceptable)
2. the node remains ready

Actual:

1. Node is tainted as unreachable, heartbeats stop, and it takes >10m for it to recover
2. After recovery, events are delivered

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this scenario (and add eviction tests, because this does not seem to evict anything).

Comment 1 Clayton Coleman 2020-02-06 21:14:33 UTC
Once this is fixed we need to test against 4.3 and 4.2 and backport if the issue occurs there - this can DoS a node.

Comment 2 Cheryl A Fillekes 2020-02-11 18:54:44 UTC
The OOMKiller on Z is so active in the cases we tested on z/VM clusters that it fills the 4 GB z/VM console message spool with OOMKiller messages, which requires manual intervention at the z/VM x3270 console and sometimes a complete re-installation.  It might be worth testing https://bugzilla.redhat.com/show_bug.cgi?id=1800319 with long-running, forked, multiprocess memory hogs such as `stress-ng --mmap 64 &` (see the sketch below).  On Z, this memory hog not only completely disables the node and the node's monitoring cluster operator and dns/routing operator, but does so over the better part of a day, only falling into complete catatonia late at night.  It appears that the eviction starts, but the OOMKiller kills so many things at such a rate that it even kills the service conducting the eviction process.  We're trying to convince the owners of OCP 4.3 test clusters on Power architecture at IBM to try these workloads, too.
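
For reference, one way to run the suggested workload inside a pod (a minimal sketch; the image name is a hypothetical placeholder for any image that ships stress-ng):

# Run a long-lived, forked, multiprocess memory hog on the cluster.
# quay.io/example/stress-ng:latest is a placeholder image assumed to contain stress-ng.
oc run stress-ng-hog --image=quay.io/example/stress-ng:latest --restart=Never \
  -- stress-ng --mmap 64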

Comment 5 Clayton Coleman 2020-02-14 18:53:17 UTC
Going to reopen while we engage the kernel team; we believe this is a fundamental issue with OOM kill that would manifest regardless of our default reservation if the kubelet had enough usage.

Comment 6 Clayton Coleman 2020-02-14 18:53:33 UTC
Kernel issue is https://bugzilla.redhat.com/show_bug.cgi?id=1803217

Comment 7 Ryan Phillips 2020-02-14 18:58:27 UTC
*** Bug 1803239 has been marked as a duplicate of this bug. ***

Comment 9 MinLi 2020-02-21 08:25:20 UTC
On version 4.4.0-0.nightly-2020-02-19-213909, this bug is reproduced.

Timing starts from the creation of the memory-hog pod: 3 minutes later, the node goes to Unknown status and is tainted unreachable;
8 minutes later, the pod enters Terminating status;
15 minutes later, the pod disappears (OOM-killed) and the node returns to Ready status.
So the node stays in Unknown status, with heartbeats stopped, for at least 12 minutes.
After the node recovers, the events are delivered.

[lyman@localhost env]$ oc describe node ip-10-0-153-107.us-east-2.compute.internal
Name:               ip-10-0-153-107.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-153-107
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.large
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        machine.openshift.io/machine: openshift-machine-api/minmli-0220-zgsmn-worker-us-east-2b-zwwcf
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 20 Feb 2020 11:19:31 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Fri, 21 Feb 2020 14:40:41 +0800   Fri, 21 Feb 2020 14:43:31 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 21 Feb 2020 14:40:41 +0800   Fri, 21 Feb 2020 14:43:31 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 21 Feb 2020 14:40:41 +0800   Fri, 21 Feb 2020 14:43:31 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Fri, 21 Feb 2020 14:40:41 +0800   Fri, 21 Feb 2020 14:43:31 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.153.107
  Hostname:     ip-10-0-153-107.us-east-2.compute.internal
  InternalDNS:  ip-10-0-153-107.us-east-2.compute.internal
Capacity:
 attachable-volumes-aws-ebs:  39
 cpu:                         2
 ephemeral-storage:           125277164Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      8161840Ki
 pods:                        250
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         1500m
 ephemeral-storage:           114381692328
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      7010864Ki
 pods:                        250
System Info:
 Machine ID:                              8efca36c2ad941a7a0e84909f882c4ed
 System UUID:                             ec2d4765-d316-147c-37a8-a4fc04bd9239
 Boot ID:                                 029d4d02-8598-4bcd-aaa0-f23bb12e88c5
 Kernel Version:                          4.18.0-147.5.1.el8_1.x86_64
 OS Image:                                Red Hat Enterprise Linux CoreOS 44.81.202002191330-0 (Ootpa)
 Operating System:                        linux
 Architecture:                            amd64
 Container Runtime Version:               cri-o://1.17.0-4.dev.rhaos4.4.gitc3436cc.el8
 Kubelet Version:                         v1.17.1
 Kube-Proxy Version:                      v1.17.1
ProviderID:                               aws:///us-east-2b/i-042d34751c5d914e5
Non-terminated Pods:                      (14 in total)
  Namespace                               Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                  ------------  ----------  ---------------  -------------  ---
  minmli                                  memory-hog-pod                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         13m
  openshift-cluster-node-tuning-operator  tuned-b7jdj                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         27h
  openshift-dns                           dns-default-b7pxh                     110m (7%)     0 (0%)      70Mi (1%)        512Mi (7%)     27h
  openshift-image-registry                node-ca-bvcfz                         10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         27h
  openshift-ingress                       router-default-5cd6c75986-vchdp       100m (6%)     0 (0%)      256Mi (3%)       0 (0%)         27h
  openshift-machine-config-operator       machine-config-daemon-mlp2c           40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         27h
  openshift-monitoring                    alertmanager-main-0                   110m (7%)     100m (6%)   245Mi (3%)       25Mi (0%)      27h
  openshift-monitoring                    grafana-755b7df4f9-tph2m              110m (7%)     0 (0%)      120Mi (1%)       0 (0%)         27h
  openshift-monitoring                    node-exporter-khpq5                   112m (7%)     0 (0%)      200Mi (2%)       0 (0%)         27h
  openshift-monitoring                    prometheus-adapter-d64c8db56-c2ww8    10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         3h49m
  openshift-monitoring                    prometheus-k8s-1                      480m (32%)    200m (13%)  1234Mi (18%)     50Mi (0%)      27h
  openshift-multus                        multus-bfhxb                          10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         27h
  openshift-sdn                           ovs-m8zhh                             200m (13%)    0 (0%)      400Mi (5%)       0 (0%)         27h
  openshift-sdn                           sdn-pxlfz                             100m (6%)     0 (0%)      200Mi (2%)       0 (0%)         27h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1402m (93%)   300m (20%)
  memory                      3055Mi (44%)  587Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>


After the node recovers: 
[lyman@localhost env]$ oc describe node ip-10-0-153-107.us-east-2.compute.internal
Name:               ip-10-0-153-107.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-153-107
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.large
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        machine.openshift.io/machine: openshift-machine-api/minmli-0220-zgsmn-worker-us-east-2b-zwwcf
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-7438f0d51b46b0f81add0bf8ec2fbe1a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 20 Feb 2020 11:19:31 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 21 Feb 2020 14:56:47 +0800   Fri, 21 Feb 2020 14:56:37 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 21 Feb 2020 14:56:47 +0800   Fri, 21 Feb 2020 14:56:37 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 21 Feb 2020 14:56:47 +0800   Fri, 21 Feb 2020 14:56:37 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 21 Feb 2020 14:56:47 +0800   Fri, 21 Feb 2020 14:56:47 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.153.107
  Hostname:     ip-10-0-153-107.us-east-2.compute.internal
  InternalDNS:  ip-10-0-153-107.us-east-2.compute.internal
Capacity:
 attachable-volumes-aws-ebs:  39
 cpu:                         2
 ephemeral-storage:           125277164Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      8161840Ki
 pods:                        250
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         1500m
 ephemeral-storage:           114381692328
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      7010864Ki
 pods:                        250
System Info:
 Machine ID:                              8efca36c2ad941a7a0e84909f882c4ed
 System UUID:                             ec2d4765-d316-147c-37a8-a4fc04bd9239
 Boot ID:                                 029d4d02-8598-4bcd-aaa0-f23bb12e88c5
 Kernel Version:                          4.18.0-147.5.1.el8_1.x86_64
 OS Image:                                Red Hat Enterprise Linux CoreOS 44.81.202002191330-0 (Ootpa)
 Operating System:                        linux
 Architecture:                            amd64
 Container Runtime Version:               cri-o://1.17.0-4.dev.rhaos4.4.gitc3436cc.el8
 Kubelet Version:                         v1.17.1
 Kube-Proxy Version:                      v1.17.1
ProviderID:                               aws:///us-east-2b/i-042d34751c5d914e5
Non-terminated Pods:                      (10 in total)
  Namespace                               Name                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                           ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-b7jdj                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         27h
  openshift-dns                           dns-default-b7pxh              110m (7%)     0 (0%)      70Mi (1%)        512Mi (7%)     27h
  openshift-image-registry                node-ca-bvcfz                  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         27h
  openshift-machine-config-operator       machine-config-daemon-mlp2c    40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         27h
  openshift-monitoring                    alertmanager-main-0            110m (7%)     100m (6%)   245Mi (3%)       25Mi (0%)      2m21s
  openshift-monitoring                    node-exporter-khpq5            112m (7%)     0 (0%)      200Mi (2%)       0 (0%)         27h
  openshift-monitoring                    prometheus-k8s-1               480m (32%)    200m (13%)  1234Mi (18%)     50Mi (0%)      2m21s
  openshift-multus                        multus-bfhxb                   10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         27h
  openshift-sdn                           ovs-m8zhh                      200m (13%)    0 (0%)      400Mi (5%)       0 (0%)         27h
  openshift-sdn                           sdn-pxlfz                      100m (6%)     0 (0%)      200Mi (2%)       0 (0%)         27h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1182m (78%)   300m (20%)
  memory                      2659Mi (38%)  587Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type     Reason                   Age                  From                                                 Message
  ----     ------                   ----                 ----                                                 -------
  Warning  SystemOOM                2m44s                kubelet, ip-10-0-153-107.us-east-2.compute.internal  System OOM encountered, victim process: stress, pid: 3170956
  Normal   NodeHasSufficientMemory  2m43s (x9 over 27h)  kubelet, ip-10-0-153-107.us-east-2.compute.internal  Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m43s (x9 over 27h)  kubelet, ip-10-0-153-107.us-east-2.compute.internal  Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m43s (x9 over 27h)  kubelet, ip-10-0-153-107.us-east-2.compute.internal  Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeNotReady             2m43s                kubelet, ip-10-0-153-107.us-east-2.compute.internal  Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeNotReady
  Normal   NodeReady                2m33s (x2 over 27h)  kubelet, ip-10-0-153-107.us-east-2.compute.internal  Node ip-10-0-153-107.us-east-2.compute.internal status is now: NodeReady

Comment 10 Ryan Phillips 2020-02-24 19:54:58 UTC
*** Bug 1801771 has been marked as a duplicate of this bug. ***

Comment 11 Ryan Phillips 2020-02-25 02:58:16 UTC
QE: This patch is in 4.5. I'm not sure of a great way of testing because the kubelet gets injected into RHCOS.
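
One possible way to check the effect of the patch on a node, without rebuilding RHCOS, is to compare the kubepods.slice memory limit against the node's total memory and reported allocatable. This is only a sketch: it assumes cgroup v1 paths (as used by RHCOS here), and the exact expected value depends on the cluster's system/kube reservations.

# Read the kubepods.slice memory cap on the node (cgroup v1 path).
oc debug node/ip-10-0-153-107.us-east-2.compute.internal -- chroot /host \
  cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes

# Compare against total node memory and the kubelet's reported values.
oc debug node/ip-10-0-153-107.us-east-2.compute.internal -- chroot /host \
  grep MemTotal /proc/meminfo
oc get node ip-10-0-153-107.us-east-2.compute.internal \
  -o jsonpath='{.status.capacity.memory}{"\n"}{.status.allocatable.memory}{"\n"}'

# With the fix, memory.limit_in_bytes should be roughly total memory minus the
# reservations rather than effectively unlimited.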

Comment 12 Ryan Phillips 2020-02-25 17:35:25 UTC
*** Bug 1802944 has been marked as a duplicate of this bug. ***

Comment 16 Ryan Phillips 2020-03-06 17:54:48 UTC
*** Bug 1766237 has been marked as a duplicate of this bug. ***

Comment 17 Ryan Phillips 2020-03-26 13:42:59 UTC
*** Bug 1767284 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2020-05-13 21:56:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581