Description of problem: "cluster-monitoring-operator" is killed by the oom-killer when it reaches its memory limit, as shown below. The default "resources.limits.memory: 50Mi" appears to be insufficient on some systems, so we should consider raising the default to a more reasonable value. This oom-killer issue was reported by a customer; it happens only occasionally, but it has persisted until now.

~~~
Oct 18 12:12:52 infra.ocp.example.com kernel: operator invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=-998
Oct 18 12:12:52 infra.ocp.example.com kernel: operator cpuset=docker-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.scope mems_allowed=0
Oct 18 12:12:52 infra.ocp.example.com kernel: CPU: 4 PID: 24268 Comm: operator Tainted: G ------------ T 3.10.0-957.10.1.el7.x86_64 #1
Oct 18 12:12:52 infra.ocp.example.com kernel: Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
Oct 18 12:12:52 infra.ocp.example.com kernel: Call Trace:
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff84762e41>] dump_stack+0x19/0x1b
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff8475d86a>] dump_header+0x90/0x229
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff841ba156>] ? find_lock_task_mm+0x56/0xc0
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff841ba604>] oom_kill_process+0x254/0x3d0
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff84235406>] mem_cgroup_oom_synchronize+0x546/0x570
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff84234880>] ? mem_cgroup_charge_common+0xc0/0xc0
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff841bae94>] pagefault_out_of_memory+0x14/0x90
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff8475bd72>] mm_fault_error+0x6a/0x157
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff847707a8>] __do_page_fault+0x3c8/0x500
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff847709c6>] trace_do_page_fault+0x56/0x150
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff8476ff42>] do_async_page_fault+0x22/0xf0
Oct 18 12:12:52 infra.ocp.example.com kernel: [<ffffffff8476c788>] async_page_fault+0x28/0x30
Oct 18 12:12:52 infra.ocp.example.com kernel: Task in /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/docker-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.scope killed as a result of limit of /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice
Oct 18 12:12:52 infra.ocp.example.com kernel: memory: usage 51200kB, limit 51200kB, failcnt 15555886
Oct 18 12:12:52 infra.ocp.example.com kernel: memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
Oct 18 12:12:52 infra.ocp.example.com kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Oct 18 12:12:54 infra.ocp.example.com kernel: Memory cgroup stats for /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Oct 18 12:12:56 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:23:56.558344 10245 kubelet_node_status.go:453] Using node IP: "172.30.141.16"
Oct 18 12:12:58 infra.ocp.example.com kernel: Memory cgroup stats for /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/docker-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.scope: cache:0KB rss:516KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:516KB inactive_file:0KB active_file:0KB unevictable:0KB
Oct 10 16:24:13 infra.ocp.example.com kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Oct 10 16:24:13 infra.ocp.example.com kernel: [16868] 1001 16868 685 187 7 0 -998 pod
Oct 10 16:24:13 infra.ocp.example.com kernel: [12012] 1000080000 12012 305055 10301 89 0 -998 operator
Oct 10 16:24:13 infra.ocp.example.com kernel: Memory cgroup out of memory: Kill process 11155 (pod) score 0 or sacrifice child
Oct 10 16:24:13 infra.ocp.example.com kernel: Killed process 16868 (pod) total-vm:2740kB, anon-rss:404kB, file-rss:344kB, shmem-rss:0kB
Oct 10 16:24:20 infra.ocp.example.com oci-systemd-hook[62670]: systemdhook <debug>: e6f4beb7625b: Skipping as container command is /usr/bin/pod, not init or systemd
Oct 10 16:24:20 infra.ocp.example.com oci-umount[62672]: umounthook <debug>: e6f4beb7625b: only runs in prestart stage, ignoring
Oct 10 16:24:20 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:20.261220731+09:00" level=warning msg="zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz cleanup: failed to unmount secrets: invalid argument"
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: W1010 16:24:20.370627 10245 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.383931 10245 kubelet.go:1865] SyncLoop (PLEG): "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring(0b77e7c0-e894-11e9-bc9a-fa163e36387b)", event: &pleg.PodLifecycleEvent{ID:"0b77e7c0-e894-11e9-bc9a-fa163e36387b", Type:"ContainerDied", Data:"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"}
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: W1010 16:24:20.384058 10245 pod_container_deletor.go:75] Container "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" not found in pod's containers
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.684754 10245 kuberuntime_manager.go:403] No ready sandbox for pod "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring(0b77e7c0-e894-11e9-bc9a-fa163e36387b)" can be found. Need to start a new one
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.684884 10245 kuberuntime_container.go:553] Killing container "docker://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" with 30 second grace period
Oct 10 16:24:21 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:21.853939 10245 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "infra.ocp.example.com"
Oct 10 16:24:23 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:23.181066 10245 cloud_request_manager.go:108] Node addresses from cloud provider for node "infra.ocp.example.com" collected
Oct 10 16:24:25 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:25.192897274+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:26 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:26.203429815+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:26 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:26.628142 10245 kubelet_node_status.go:453] Using node IP: "172.30.141.16"
Oct 10 16:24:27 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:27.239180977+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:28 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:28.127616183+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:29 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:29.163885628+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
~~~

Version-Release number of selected component (if applicable):
OpenShift v3.11.135

How reproducible:
The issue depends on resource usage, so I cannot provide deterministic reproduction steps, but it has been reported multiple times by the same customer.

Steps to Reproduce:
1.
2.
3.

Actual results:
cluster-monitoring-operator is killed by the oom-killer when it reaches its memory limit.

Expected results:
cluster-monitoring-operator keeps running stably, without out-of-memory issues.

Additional info:
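For reference, the limit in question lives in the resources stanza of the cluster-monitoring-operator container spec. The sketch below is illustrative only: the 50Mi limit matches the "limit 51200kB" (50MiB) shown in the oom-killer log, while the raised value is a hypothetical example and not necessarily the value chosen by the eventual fix.

~~~
# Illustrative resources stanza for the cluster-monitoring-operator
# container. 50Mi is the default limit that triggered the OOM kill
# (51200kB in the log); 150Mi is a hypothetical raised limit, not the
# value shipped in the errata.
resources:
  requests:
    memory: 50Mi
  limits:
    memory: 150Mi
~~~

Note that in OCP 3.x this Deployment is managed by the installer/operator machinery, so a manually patched limit may be reverted; the lasting fix has to change the shipped default.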
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3817