Bug 1763046

Summary: cluster-monitoring-operator is killed by oom-killer
Product: OpenShift Container Platform
Component: Monitoring
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Daein Park <dapark>
Assignee: Pawel Krupa <pkrupa>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania, tkimura
Type: Bug
Last Closed: 2019-11-18 14:52:10 UTC

Description Daein Park 2019-10-18 06:40:00 UTC
Description of problem:

"cluster-monitoring-operator" is killed by oom-killer when reaching memory limit size as follows. 
It seems like the default "resources.limits.memory: 50Mi" is not enough on some systems, we need to consider expanding the size to more reasonable one on various systems by default.
This oom-killer issue has reported by customer, and they are used to happen occasionally. But keeping the issue until now.
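
For reference, a minimal sketch of how to check the configured limit and the actual usage on a 3.11 cluster. The container name, the app=cluster-monitoring-operator label, and the use of `oc adm top` (which needs cluster metrics) are assumptions, not taken from this report:

~~~
# Show the memory limit currently set on the operator container (container name assumed)
oc -n openshift-monitoring get deployment cluster-monitoring-operator \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="cluster-monitoring-operator")].resources.limits.memory}'

# Compare against actual usage (label assumed; requires cluster metrics)
oc adm top pod -n openshift-monitoring -l app=cluster-monitoring-operator
~~~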

~~~
Oct 18 12:12:52 infra.ocp.example.com kernel: operator invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=-998
Oct 18 12:12:52 infra.ocp.example.com kernel: operator cpuset=docker-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.scope mems_allowed=0
Oct 18 12:12:52 infra.ocp.example.com kernel: CPU: 4 PID: 24268 Comm: operator Tainted: G               ------------ T 3.10.0-957.10.1.el7.x86_64 #1
Oct 18 12:12:52 infra.ocp.example.com kernel: Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
Oct 18 12:12:52 infra.ocp.example.com kernel: Call Trace:
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff84762e41>] dump_stack+0x19/0x1b
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8475d86a>] dump_header+0x90/0x229
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff841ba156>] ? find_lock_task_mm+0x56/0xc0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff841ba604>] oom_kill_process+0x254/0x3d0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff84235406>] mem_cgroup_oom_synchronize+0x546/0x570
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff84234880>] ? mem_cgroup_charge_common+0xc0/0xc0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff841bae94>] pagefault_out_of_memory+0x14/0x90
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8475bd72>] mm_fault_error+0x6a/0x157
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff847707a8>] __do_page_fault+0x3c8/0x500
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff847709c6>] trace_do_page_fault+0x56/0x150
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8476ff42>] do_async_page_fault+0x22/0xf0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8476c788>] async_page_fault+0x28/0x30
Oct 18 12:12:52 infra.ocp.example.com kernel: Task in /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/docker-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.scope killed as a result of limit of /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice
Oct 18 12:12:52 infra.ocp.example.com kernel: memory: usage 51200kB, limit 51200kB, failcnt 15555886
Oct 18 12:12:52 infra.ocp.example.com kernel: memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
Oct 18 12:12:52 infra.ocp.example.com kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Oct 18 12:12:54 infra.ocp.example.com kernel: Memory cgroup stats for /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Oct 18 12:12:56 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:23:56.558344   10245 kubelet_node_status.go:453] Using node IP: "172.30.141.16"
Oct 18 12:12:58 infra.ocp.example.com kernel: Memory cgroup stats for /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/docker-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.scope: cache:0KB rss:516KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:516KB inactive_file:0KB active_file:0KB unevictable:0KB
Oct 10 16:24:13 infra.ocp.example.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Oct 10 16:24:13 infra.ocp.example.com kernel: [16868]  1001 16868      685      187       7        0          -998 pod
Oct 10 16:24:13 infra.ocp.example.com kernel: [12012] 1000080000 12012   305055    10301      89        0          -998 operator
Oct 10 16:24:13 infra.ocp.example.com kernel: Memory cgroup out of memory: Kill process 11155 (pod) score 0 or sacrifice child
Oct 10 16:24:13 infra.ocp.example.com kernel: Killed process 16868 (pod) total-vm:2740kB, anon-rss:404kB, file-rss:344kB, shmem-rss:0kB
Oct 10 16:24:20 infra.ocp.example.com oci-systemd-hook[62670]: systemdhook <debug>: e6f4beb7625b: Skipping as container command is /usr/bin/pod, not init or systemd
Oct 10 16:24:20 infra.ocp.example.com oci-umount[62672]: umounthook <debug>: e6f4beb7625b: only runs in prestart stage, ignoring
Oct 10 16:24:20 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:20.261220731+09:00" level=warning msg="zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz cleanup: failed to unmount secrets: invalid argument"
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: W1010 16:24:20.370627   10245 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.383931   10245 kubelet.go:1865] SyncLoop (PLEG): "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring(0b77e7c0-e894-11e9-bc9a-fa163e36387b)", event: &pleg.PodLifecycleEvent{ID:"0b77e7c0-e894-11e9-bc9a-fa163e36387b", Type:"ContainerDied", Data:"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"}
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: W1010 16:24:20.384058   10245 pod_container_deletor.go:75] Container "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" not found in pod's containers
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.684754   10245 kuberuntime_manager.go:403] No ready sandbox for pod "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring(0b77e7c0-e894-11e9-bc9a-fa163e36387b)" can be found. Need to start a new one
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.684884   10245 kuberuntime_container.go:553] Killing container "docker://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" with 30 second grace period
Oct 10 16:24:21 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:21.853939   10245 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "infra.ocp.example.com"
Oct 10 16:24:23 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:23.181066   10245 cloud_request_manager.go:108] Node addresses from cloud provider for node "infra.ocp.example.com" collected
Oct 10 16:24:25 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:25.192897274+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:26 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:26.203429815+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:26 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:26.628142   10245 kubelet_node_status.go:453] Using node IP: "172.30.141.16"
Oct 10 16:24:27 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:27.239180977+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:28 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:28.127616183+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:29 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:29.163885628+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
~~~
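
The cgroup limit in the log above (51200kB) matches the 50Mi default. As a hedged sketch, the limit could be raised by hand along these lines; the 150Mi value is only illustrative, and a manual patch may be reverted the next time the monitoring playbooks or an upgrade reapply the deployment:

~~~
# Illustrative only: raise the operator's memory limit
# (strategic merge patch, matched by the assumed container name)
oc -n openshift-monitoring patch deployment cluster-monitoring-operator \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"cluster-monitoring-operator","resources":{"limits":{"memory":"150Mi"}}}]}}}}'
~~~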


Version-Release number of selected component (if applicable):

OpenShift v3.11.135

How reproducible:

This issue depends on resource usage, so I cannot provide exact reproduction steps, but it has been reported multiple times by the same customer.

Steps to Reproduce:
1.
2.
3.

Actual results:

cluster-monitoring-operator is killed by the oom-killer when its memory usage reaches the limit.

Expected results:

cluster-monitoring-operator keeps running stably without out-of-memory issues.

Additional info:

Comment 4 errata-xmlrpc 2019-11-18 14:52:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3817