Bug 1763046 - cluster-monitoring-operator is killed by oom-killer
Summary: cluster-monitoring-operator is killed by oom-killer
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-18 06:40 UTC by Daein Park
Modified: 2019-12-03 05:44 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-18 14:52:10 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 515 0 'None' closed Bug 1763046: manifests: remove setting limits on CMO 2021-01-19 01:56:49 UTC
Github openshift openshift-ansible pull 11964 0 'None' closed Bug 1763046: roles/openshift_cluster_monitoring_operator: remove limits 2021-01-19 01:56:49 UTC
Red Hat Product Errata RHBA-2019:3817 0 None None None 2019-11-18 14:52:21 UTC

Description Daein Park 2019-10-18 06:40:00 UTC
Description of problem:

"cluster-monitoring-operator" is killed by the oom-killer when it reaches its memory limit, as shown below.
It seems the default "resources.limits.memory: 50Mi" is not enough on some systems; we should consider raising it to a more reasonable default that covers a variety of systems.
This oom-killer issue was reported by a customer. The kills happen only occasionally, but they have persisted until now.
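For reference, the limit in question comes from the operator's container spec in the CMO deployment; a minimal sketch (only the 50Mi memory limit is taken from this report, the surrounding fields are illustrative):

```yaml
# Relevant fragment of the cluster-monitoring-operator deployment.
# The 50Mi ceiling below is the limit the oom-killer enforces in the log.
containers:
- name: cluster-monitoring-operator
  resources:
    limits:
      memory: 50Mi   # exceeded limit that triggers the cgroup OOM kill
```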

~~~
Oct 18 12:12:52 infra.ocp.example.com kernel: operator invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=-998
Oct 18 12:12:52 infra.ocp.example.com kernel: operator cpuset=docker-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.scope mems_allowed=0
Oct 18 12:12:52 infra.ocp.example.com kernel: CPU: 4 PID: 24268 Comm: operator Tainted: G               ------------ T 3.10.0-957.10.1.el7.x86_64 #1
Oct 18 12:12:52 infra.ocp.example.com kernel: Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
Oct 18 12:12:52 infra.ocp.example.com kernel: Call Trace:
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff84762e41>] dump_stack+0x19/0x1b
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8475d86a>] dump_header+0x90/0x229
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff841ba156>] ? find_lock_task_mm+0x56/0xc0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff841ba604>] oom_kill_process+0x254/0x3d0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff84235406>] mem_cgroup_oom_synchronize+0x546/0x570
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff84234880>] ? mem_cgroup_charge_common+0xc0/0xc0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff841bae94>] pagefault_out_of_memory+0x14/0x90
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8475bd72>] mm_fault_error+0x6a/0x157
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff847707a8>] __do_page_fault+0x3c8/0x500
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff847709c6>] trace_do_page_fault+0x56/0x150
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8476ff42>] do_async_page_fault+0x22/0xf0
Oct 18 12:12:52 infra.ocp.example.com kernel:  [<ffffffff8476c788>] async_page_fault+0x28/0x30
Oct 18 12:12:52 infra.ocp.example.com kernel: Task in /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/docker-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.scope killed as a result of limit of /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice
Oct 18 12:12:52 infra.ocp.example.com kernel: memory: usage 51200kB, limit 51200kB, failcnt 15555886
Oct 18 12:12:52 infra.ocp.example.com kernel: memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
Oct 18 12:12:52 infra.ocp.example.com kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Oct 18 12:12:54 infra.ocp.example.com kernel: Memory cgroup stats for /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Oct 18 12:12:56 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:23:56.558344   10245 kubelet_node_status.go:453] Using node IP: "172.30.141.16"
Oct 18 12:12:58 infra.ocp.example.com kernel: Memory cgroup stats for /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/docker-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.scope: cache:0KB rss:516KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:516KB inactive_file:0KB active_file:0KB unevictable:0KB
Oct 10 16:24:13 infra.ocp.example.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Oct 10 16:24:13 infra.ocp.example.com kernel: [16868]  1001 16868      685      187       7        0          -998 pod
Oct 10 16:24:13 infra.ocp.example.com kernel: [12012] 1000080000 12012   305055    10301      89        0          -998 operator
Oct 10 16:24:13 infra.ocp.example.com kernel: Memory cgroup out of memory: Kill process 11155 (pod) score 0 or sacrifice child
Oct 10 16:24:13 infra.ocp.example.com kernel: Killed process 16868 (pod) total-vm:2740kB, anon-rss:404kB, file-rss:344kB, shmem-rss:0kB
Oct 10 16:24:20 infra.ocp.example.com oci-systemd-hook[62670]: systemdhook <debug>: e6f4beb7625b: Skipping as container command is /usr/bin/pod, not init or systemd
Oct 10 16:24:20 infra.ocp.example.com oci-umount[62672]: umounthook <debug>: e6f4beb7625b: only runs in prestart stage, ignoring
Oct 10 16:24:20 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:20.261220731+09:00" level=warning msg="zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz cleanup: failed to unmount secrets: invalid argument"
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: W1010 16:24:20.370627   10245 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.383931   10245 kubelet.go:1865] SyncLoop (PLEG): "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring(0b77e7c0-e894-11e9-bc9a-fa163e36387b)", event: &pleg.PodLifecycleEvent{ID:"0b77e7c0-e894-11e9-bc9a-fa163e36387b", Type:"ContainerDied", Data:"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"}
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: W1010 16:24:20.384058   10245 pod_container_deletor.go:75] Container "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" not found in pod's containers
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.684754   10245 kuberuntime_manager.go:403] No ready sandbox for pod "cluster-monitoring-operator-6c574f8bfd-dk7g9_openshift-monitoring(0b77e7c0-e894-11e9-bc9a-fa163e36387b)" can be found. Need to start a new one
Oct 10 16:24:20 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:20.684884   10245 kuberuntime_container.go:553] Killing container "docker://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" with 30 second grace period
Oct 10 16:24:21 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:21.853939   10245 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "infra.ocp.example.com"
Oct 10 16:24:23 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:23.181066   10245 cloud_request_manager.go:108] Node addresses from cloud provider for node "infra.ocp.example.com" collected
Oct 10 16:24:25 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:25.192897274+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:26 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:26.203429815+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:26 infra.ocp.example.com atomic-openshift-node[10245]: I1010 16:24:26.628142   10245 kubelet_node_status.go:453] Using node IP: "172.30.141.16"
Oct 10 16:24:27 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:27.239180977+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:28 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:28.127616183+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
Oct 10 16:24:29 infra.ocp.example.com dockerd-current[5752]: time="2019-10-10T16:24:29.163885628+09:00" level=error msg="collecting stats for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: sandbox a725104675ac6e959f67d661a70e1ffd39be240ed64a7424ad9e8708d320b2ff not found"
~~~
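The cgroup figures in the log line up exactly with the configured limit; a quick sanity check of the unit conversion (the kernel's cgroup OOM report uses KiB, printed as "kB"):

```python
# 50Mi (mebibytes) converted to KiB, the unit the kernel's cgroup
# OOM report prints; this matches "memory: usage 51200kB, limit 51200kB".
limit_mib = 50
limit_kib = limit_mib * 1024
print(limit_kib)  # 51200
```

The failcnt of 15555886 on the same line shows the container had been bouncing off this ceiling for a long time before the kill.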


Version-Release number of selected component (if applicable):

OpenShift v3.11.135

How reproducible:

This issue depends on resource usage, so I cannot give exact reproduction steps, but it has been reported multiple times by the same customer.

Steps to Reproduce:
1.
2.
3.

Actual results:

cluster-monitoring-operator is killed by the oom-killer when its memory usage reaches the limit.

Expected results:

cluster-monitoring-operator keeps running stably, without out-of-memory kills.

Additional info:
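Per the linked pull request (openshift cluster-monitoring-operator pull 515, "manifests: remove setting limits on CMO"), the fix removes the limits block from the operator's container spec rather than enlarging it; a rough sketch (only the removed 50Mi value is from this report, everything else is illustrative):

```yaml
# Before the fix:
#   resources:
#     limits:
#       memory: 50Mi
# After the fix the container carries no memory limit, so the kubelet
# no longer places the operator in a 50Mi memory cgroup:
resources: {}
```

The companion openshift-ansible pull 11964 applies the same change to the role that stamps out the deployment during installation.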

Comment 4 errata-xmlrpc 2019-11-18 14:52:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3817

