Bug 1481347
| Summary: | Mux is periodically OOM-killed under load. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Noriko Hosoi <nhosoi> |
| Component: | Logging | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED DUPLICATE | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.6.0 | CC: | aos-bugs, jcantril, nhosoi, pportant, rmeggins |
| Target Milestone: | --- | ||
| Target Release: | 3.6.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-21 18:01:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I'm interested if https://github.com/openshift/openshift-ansible/pull/5119 might resolve this issue. Peter noted ES pods being killed by the kubelet because the requests and limit settings were different, which made them eligible to be OOM-killed.

(In reply to Jeff Cantrill from comment #1)
> I'm interested if https://github.com/openshift/openshift-ansible/pull/5119
> might resolve this issue. Peter noted ES pods being killed by the kubelet
> because the requests and limit settings were different which made them
> eligible to be OOMKilled

Thank you, @Jeff! I've applied the fix and restarted mux in the RHV test env. I'll update this bug with the results.

@Peter - this is the bz about the previous problems with mux and the oom-killer.

@richm, @pportante, I searched the web and found that some Ruby users tune performance with GC parameters and an alternative malloc library, e.g., tcmalloc (or jemalloc). This blog post is not new (2013) and targets Ruby 2.0, but it reports that a high GC limit plus tcmalloc gives some performance gain on Ruby 2.0:
https://meta.discourse.org/t/tuning-ruby-and-rails-for-discourse/4126/1

The mention of tcmalloc rings a bell. Red Hat Directory Server is currently linked with tcmalloc to mitigate the constant memory growth caused by the fragmentation that the glibc malloc generates. The fluentd retag plugin is a regular-expression engine that allocates small chunks of memory of various sizes for string operations, so it should be constantly generating memory fragmentation...

Recently, this bug was filed against the Directory Server to stop using tcmalloc; its target version is RHEL 8:
Bug 1496872 - 389-ds-base should stop using tcmalloc

And this article announces that a per-thread malloc cache is available starting with glibc 2.26:
https://www.phoronix.com/scan.php?page=news_item&px=glibc-malloc-thread-cache
This should be fun for testing and benchmarking. This per-thread malloc cache will be present in the upcoming Glibc 2.26 release.

That is, until glibc 2.26 is available, we may be impacted by glibc memory fragmentation...

*** This bug has been marked as a duplicate of bug 1502764 ***
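Purely as an illustrative sketch of the GC/malloc tuning direction discussed above (not a fix applied in this bug): the usual knobs are the glibc MALLOC_ARENA_MAX variable, the Ruby 2.1+ RUBY_GC_* environment variables, and an LD_PRELOAD of tcmalloc if the image ships it. The specific values, the `dc/logging-mux` object name, and the tcmalloc library path below are assumptions for illustration only.

```sh
# Illustrative only: glibc/Ruby memory tuning knobs one could experiment with
# on the mux DeploymentConfig. Values are assumptions, not tested settings.

# Cap the number of glibc malloc arenas to reduce fragmentation from
# many small per-thread allocations.
oc set env dc/logging-mux MALLOC_ARENA_MAX=2

# Standard Ruby (2.1+) GC environment variables: grow the heap more slowly
# and raise the malloc trigger so GC runs less often on small allocations.
oc set env dc/logging-mux \
  RUBY_GC_HEAP_GROWTH_FACTOR=1.1 \
  RUBY_GC_MALLOC_LIMIT=90000000

# Only if gperftools' tcmalloc is present in the image (path is an assumption):
# oc set env dc/logging-mux LD_PRELOAD=/usr/lib64/libtcmalloc.so.4

# Each env change on the dc triggers a new rollout of the mux pods.
oc rollout status dc/logging-mux
```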
Description of problem:

In the RHV test environment, MUX pods are OOM-killed and restarted periodically.

```
# oc logs logging-mux-10-64h1c | egrep "died"
2017-08-11 02:02:06 +0300 [error]: fluentd main process died unexpectedly. restarting.
2017-08-11 02:35:03 +0300 [error]: fluentd main process died unexpectedly. restarting.
2017-08-11 03:09:48 +0300 [error]: fluentd main process died unexpectedly. restarting.
....
2017-08-13 19:38:04 +0300 [error]: fluentd main process died unexpectedly. restarting.
2017-08-13 20:05:19 +0300 [error]: fluentd main process died unexpectedly. restarting.
2017-08-13 20:41:41 +0300 [error]: fluentd main process died unexpectedly. restarting.

# egrep fluentd /var/log/messages | egrep --color -i oom
Aug 11 02:02:06 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 11 02:18:07 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 11 02:35:03 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 11 03:09:47 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
.....
Aug 13 19:38:03 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 13 19:47:44 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 13 20:05:18 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 13 20:16:25 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
Aug 13 20:41:41 TEST_HOSTNAME kernel: fluentd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
```

Other than fluentd, ruby-timer-thr is also targeted by the oom-killer.

```
Aug 13 06:02:07 TEST_HOSTNAME kernel: ruby-timer-thr invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=872
```

Sample load:

```
green open project.ovirt-metrics-engine.2e79fd1e-7b77-11e7-817b-001a4a23128a.2017.08.13 1 0 3314305 0 573.6mb 573.6mb
```

==> 3314305 logs/day == 38.36 logs/sec

We may need to increase the memory size of MUX.

```
# rpm -qa | egrep "openshift|origin"
openshift-ansible-callback-plugins-3.6.173.0.3-1.el7.noarch
origin-sdn-ovs-3.6.0-1.0.c4dd4cf.x86_64
tuned-profiles-origin-node-3.6.0-1.0.c4dd4cf.x86_64
openshift-ansible-docs-3.6.173.0.3-1.el7.noarch
openshift-ansible-lookup-plugins-3.6.173.0.3-1.el7.noarch
docker-forward-journald-1.9.1-25.1.origin.el7.x86_64
origin-clients-3.6.0-1.0.c4dd4cf.x86_64
origin-node-3.6.0-1.0.c4dd4cf.x86_64
openshift-ansible-3.6.173.0.3-1.el7.noarch
openshift-ansible-roles-3.6.173.0.3-1.el7.noarch
origin-docker-excluder-3.6.0-1.0.c4dd4cf.noarch
origin-3.6.0-1.0.c4dd4cf.x86_64
openshift-ansible-playbooks-3.6.173.0.3-1.el7.noarch
origin-excluder-3.6.0-1.0.c4dd4cf.noarch
origin-master-3.6.0-1.0.c4dd4cf.x86_64
openshift-ansible-filter-plugins-3.6.173.0.3-1.el7.noarch
```
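As a hedged follow-up sketch (these are not commands from the original report): the ingest rate above checks out arithmetically, and raising the mux memory request toward its limit, along the lines of the change discussed for ES in the openshift-ansible PR referenced in the comments, lowers the kubelet-assigned oom_score_adj and makes the pod a less likely OOM-kill target. The 512Mi value and the `dc/logging-mux` object name are illustrative assumptions, not settings confirmed by this bug.

```sh
# Sanity-check the reported ingest rate: 3,314,305 logs/day over 86,400 s/day.
echo "scale=2; 3314305 / 86400" | bc    # => 38.36 logs/sec

# Illustrative sketch (512Mi is an assumed value, not a recommendation from
# this bug): set the mux memory request equal to its limit. A higher memory
# request lowers the kubelet-assigned oom_score_adj (the 872 seen above);
# matching CPU requests/limits as well would put the pod in the Guaranteed
# QoS class.
oc set resources dc/logging-mux \
  --requests=memory=512Mi \
  --limits=memory=512Mi

# Confirm the resulting container resources on the DeploymentConfig.
oc get dc/logging-mux -o yaml | grep -A6 'resources:'
```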