Bug 1502764
Summary: Slow memory leak in logging-mux (fluentd secure_forward listener) during steady-state logging.
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Logging
Assignee: Rich Megginson <rmeggins>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Fiedler <mifiedle>
Severity: medium
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, jcantril, mifiedle, nhosoi, pportant, rmeggins, rromerom, smunilla, wsun
Target Milestone: ---
Target Release: 3.9.0
Hardware: x86_64
OS: Linux
Whiteboard: aos-scalability-37
Fixed In Version: openshift3/logging-fluentd:v3.7.16-3
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd's memory usage pattern, especially that of the secure_forward listener, does not fit well with the default glibc memory allocator, causing excessive memory fragmentation.
Consequence: The fluentd process uses more and more memory until it is OOM killed.
Fix: Fluentd is now built with the jemalloc memory allocator library. This memory allocator works much better for the fluentd memory usage pattern.
Result: Fluentd does not excessively fragment memory, and does not experience unbounded memory growth.
Note: Fluentd will still show memory growth as it fills its buffers, caches, and queues, but the memory growth should eventually level off, and fluentd should not be OOM killed due to this issue.
Story Points: ---
Last Closed: 2018-06-18 17:39:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1531157, 1536651
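The Doc Text above attributes the fix to building fluentd with jemalloc. Purely as a hedged illustration of what that usually means in practice (this is not the exact wiring of the openshift3/logging-fluentd image, and the library path is an assumption), preloading the allocator before starting fluentd looks roughly like:

    # Hedged sketch, not the shipped image's start script.
    # The libjemalloc path is an assumption; it varies by build and distro.
    export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
    exec fluentd -c /etc/fluent/fluent.conf

With jemalloc preloaded, the Ruby process allocates through jemalloc instead of glibc malloc, which is what limits the fragmentation-driven RSS growth described above.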
Description
Mike Fiedler
2017-10-16 15:53:09 UTC
Created attachment 1339318 [details]
logging-mux rss
Created attachment 1339319 [details]
logging-mux vsz
Created attachment 1339320 [details]
logging-fluentd rss
Created attachment 1339321 [details]
logging-fluentd vsz
You are correct - it will eventually be oom-killed - I think this may be a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1481347

Updated my logging deployment to use the tuning values from https://github.com/openshift/openshift-ansible/pull/5748

logging-mux now has:

memory limit = 512Mi
FILE_BUFFER_LIMIT=512Mi
BUFFER_QUEUE_LIMIT=32
BUFFER_SIZE_LIMIT=8m

With these values set, after 12 hours of running traffic at the rate in the description (1000 messages per second per fluentd), one logging-mux has:

2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="output_tag"
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `write'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `<<'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:200:in `block in emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:193:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:467:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:42:in `next'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/out_copy.rb:78:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/in_secure_forward.rb:293:in `on_message'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:124:in `on_read'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `feed_each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `block in start'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `loop'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `start'
2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="kubernetes.var.log.containers.centos-logtest-dkr5g_svt18_centos-logtest-95016b9211bff50c95238d15704b1084302d26d7cb712261c5440f1ecd706cab.log.mux"
2017-10-17 11:24:44 +0000 [warn]: suppressed same stacktrace
2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="kubernetes.var.log.containers.centos-logtest-dkr5g_svt18_centos-logtest-95016b9211bff50c95238d15704b1084302d26d7cb712261c5440f1ecd706cab.log"
2017-10-17 11:24:44 +0000 [warn]: suppressed same stacktrace
2017-10-17 11:24:44 +0000 [warn]: unexpected error in in_secure_forward from 172.21.0.230:48422 error_class=Errno::ENOMEM error=#<Errno::ENOMEM: Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log>
2017-10-17 11:24:47 +0000 [warn]: Elasticsearch rejected 1 documents due to invalid field arguments
2017-10-17 11:24:48 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 11:24:49 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:24:49 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

The second logging-mux has:

2017-10-17 07:48:04 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 07:48:05 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 07:48:05 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 09:52:26 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 09:52:27 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 09:52:27 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:11:15 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 11:11:16 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:11:16 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

Comment 7 traffic rate should be 1000 messages per second per logging-mux. That is the aggregate message rate for each of the 2 logging-mux pods.

*** Bug 1481347 has been marked as a duplicate of this bug. ***

Fluentd now includes jemalloc.

koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636331
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.9-rhel-7-docker-candidate-89109-20180103231254
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0-0.16.0.1
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0.20180103.170133
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9

koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636358
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.8-rhel-7-docker-candidate-92791-20180103235159
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25-3
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25.20180102.220346
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8

koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636394
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.7-rhel-7-docker-candidate-18263-20180104013419
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16-3
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16.20171221.103556

Tested on the 3.7.23 logging images and the situation is much better than originally reported. In the original issue, logging-mux vsz showed very noticeable linear growth. In the new run (graphs will be attached), logging-mux rss and vsz basically look stable. There is some growth early in the run and then things stabilize. Unfortunately there is a potential anomaly at the end of the run that is going to force a re-run for a longer period of time.
On the logging-fluentd side, both rss and vsz show slight but noticeable growth throughout the course of the run (graphs will be attached). Functionally, logging is stable on these images. The next step is a longer run, hopefully on 3.9 once bug 1531157 is fixed.

Created attachment 1386086 [details]
3.7.23 logging-mux rss
Created attachment 1386087 [details]
3.7.23 logging-mux vsz
Created attachment 1386088 [details]
3.7.23 logging-fluentd rss
Created attachment 1386089 [details]
3.7.23 logging-fluentd vsz
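The rss/vsz graphs attached above came from the reporter's own measurement tooling (pbench is mentioned in the verification note below). Purely as a hypothetical illustration of the kind of sampling involved, a loop like the following would record the same two numbers over time; the process match, interval, and output file are assumptions:

    # Hypothetical sampler, not the method used for the attached graphs:
    # log fluentd RSS and VSZ (in KiB) once per minute.
    while sleep 60; do
      pid=$(pgrep -f fluentd | head -1)
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(ps -o rss=,vsz= -p "$pid")"
    done >> fluentd-mem.log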
Hi Mike, could you help verify this bug? Thanks!

As seen in comment 14, both vsz and rss are stable for fluentd in logging 3.9.2. Ran 15 million messages and captured pbench data (linked in a later comment). No restarts were seen and memory usage was stable.
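As a hedged aside on verification: besides watching memory over a long run as above, a quick way to confirm that the jemalloc change is actually active in a running fluentd or mux pod is to check whether libjemalloc is mapped into the process. The namespace and pod name below are placeholders:

    # Placeholder namespace/pod; adjust to the actual deployment.
    oc exec -n logging logging-fluentd-abc12 -- \
      sh -c 'grep -c jemalloc /proc/$(pgrep -f fluentd | head -1)/maps'
    # A non-zero count means the allocator is preloaded into the process.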