Description of problem:

Configuration (see below for the logging install inventory):
  1 Elasticsearch node
  2 logging-mux pods
  4 logging-fluentd nodes
  200 pods logging on those 4 nodes at an aggregate rate of 500 1K messages/sec/node, which is an aggregate throughput of 1000 messages/second per logging-mux.

Running 100 million messages through this configuration at this rate, all messages are indexed successfully and the logging-fluentd and logging-mux logs show 0 errors.

During the run, logging-fluentd resident set size and virtual size show no growth - rss shows a gc pattern and vsz is stable (see attached graphs). logging-mux shows a stable resident set size but slow linear growth in the virtual size. Eventually I believe the pod will be OOM killed (I will test that with a lower memory limit).

So, logging-fluentd seems stable and logging-mux appears to have a leak. The next action is an even longer run to verify the trend continues.

Version-Release number of selected component (if applicable):
logging-fluentd version is build 3.7.0-0.146.0.1.

How reproducible:
Always (2 runs so far)

Steps to Reproduce:
1. 1 master / 2 infra / 4 computes. Single ES, 2 logging-mux, 4 logging-fluentd.
2. Deploy logging with the inventory below.
3. Run 50 pods on each compute, each logging 10 1K messages/second. See above for the aggregate logging rate.
4. Monitor memory usage with pbench (https://github.com/distributed-system-analysis/pbench); a minimal polling sketch follows the inventory at the end of this comment.

Actual results:
Slow/steady growth in logging-mux pod virtual size. Stable logging-mux rss.

Expected results:
No growth for a constant workload.

Additional info:

[OSEv3:children]
masters
etcd

[masters]
ec2-54-200-230-39.us-west-2.compute.amazonaws.com

[etcd]
ec2-54-200-230-39.us-west-2.compute.amazonaws.com

[OSEv3:vars]
deployment_type=openshift-enterprise
openshift_deployment_type=openshift-enterprise
openshift_release=v3.7
openshift_image_tag=v3.7.0
openshift_logging_install_logging=true
openshift_logging_master_url=https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443
openshift_logging_master_public_url=https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443
openshift_logging_kibana_hostname=kibana.0918-s73.qe.rhcloud.com
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.7
openshift_logging_es_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=50Gi
openshift_logging_fluentd_read_from_head=false
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal
openshift_logging_use_ops=false
openshift_logging_fluentd_cpu_limit=1000m
openshift_logging_mux_cpu_limit=1000m
openshift_logging_kibana_cpu_limit=200m
openshift_logging_kibana_proxy_cpu_limit=100m
openshift_logging_es_memory_limit=9Gi
openshift_logging_es_cpu_limit=2500m
openshift_logging_fluentd_memory_limit=1Gi
openshift_logging_mux_memory_limit=2Gi
openshift_logging_kibana_memory_limit=1Gi
openshift_logging_kibana_proxy_memory_limit=256Mi
openshift_logging_mux_file_buffer_storage_type=pvc
openshift_logging_mux_file_buffer_pvc_name=logging-mux-pvc
openshift_logging_mux_file_buffer_pvc_size=30Gi
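For reference, a minimal sketch of the kind of rss/vsz polling that pbench automates for the attached graphs. The "fluentd" process-name match and the 10-second interval are my assumptions for illustration, not part of the pbench configuration actually used.

# Minimal rss/vsz poller for fluentd processes (pbench automates this kind of
# collection); the "fluentd" cmdline match and 10-second interval are assumptions.
import os
import time

def fluentd_pids():
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/cmdline' % entry, 'rb') as f:
                if b'fluentd' in f.read():
                    pids.append(entry)
        except IOError:
            pass
    return pids

def rss_vsz_kb(pid):
    # VmRSS and VmSize are reported in kB in /proc/<pid>/status.
    rss = vsz = None
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                rss = int(line.split()[1])
            elif line.startswith('VmSize:'):
                vsz = int(line.split()[1])
    return rss, vsz

while True:
    for pid in fluentd_pids():
        try:
            rss, vsz = rss_vsz_kb(pid)
            print('%s pid=%s rss_kb=%s vsz_kb=%s' % (time.strftime('%H:%M:%S'), pid, rss, vsz))
        except IOError:
            pass  # process exited between listing and reading
    time.sleep(10)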
Created attachment 1339318 [details] logging-mux rss
Created attachment 1339319 [details] logging-mux vsz
Created attachment 1339320 [details] logging-fluentd rss
Created attachment 1339321 [details] logging-fluentd vsz
You are correct - it will eventually be OOM-killed. I think this may be a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1481347
Updated my logging deployment to use the tuning values from https://github.com/openshift/openshift-ansible/pull/5748

logging-mux now has:
memory limit = 512Mi
FILE_BUFFER_LIMIT=512Mi
BUFFER_QUEUE_LIMIT=32
BUFFER_SIZE_LIMIT=8m

With these values set, after 12 hours of running traffic at the rate in the description (1000 messages per second per fluentd), one logging-mux has:

2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="output_tag"
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `write'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `<<'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:200:in `block in emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:193:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:467:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:42:in `next'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/out_copy.rb:78:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/in_secure_forward.rb:293:in `on_message'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:124:in `on_read'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `feed_each'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `block in start'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `loop'
2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `start'
2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="kubernetes.var.log.containers.centos-logtest-dkr5g_svt18_centos-logtest-95016b9211bff50c95238d15704b1084302d26d7cb712261c5440f1ecd706cab.log.mux"
2017-10-17 11:24:44 +0000 [warn]: suppressed same stacktrace
2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="kubernetes.var.log.containers.centos-logtest-dkr5g_svt18_centos-logtest-95016b9211bff50c95238d15704b1084302d26d7cb712261c5440f1ecd706cab.log"
2017-10-17 11:24:44 +0000 [warn]: suppressed same stacktrace
2017-10-17 11:24:44 +0000 [warn]: unexpected error in in_secure_forward from 172.21.0.230:48422 error_class=Errno::ENOMEM error=#<Errno::ENOMEM: Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log>
2017-10-17 11:24:47 +0000 [warn]: Elasticsearch rejected 1 documents due to invalid field arguments
2017-10-17 11:24:48 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 11:24:49 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:24:49 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

The second logging-mux has:

2017-10-17 07:48:04 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 07:48:05 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 07:48:05 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 09:52:26 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 09:52:27 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 09:52:27 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:11:15 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 11:11:16 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:11:16 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
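For context on how the tuning values above interact, here is a rough sketch of the buffer arithmetic. It assumes the usual fluentd 0.12 model in which BUFFER_SIZE_LIMIT maps to buffer_chunk_limit and BUFFER_QUEUE_LIMIT maps to buffer_queue_limit, and it assumes two output destinations per mux; none of those mappings are stated in this bug, so treat the numbers as illustrative only.

# Illustrative buffer arithmetic only; the mapping of the environment variables to
# fluentd 0.12 buffer parameters and the two-output assumption are not from this bug.
chunk_limit_mib = 8          # BUFFER_SIZE_LIMIT=8m -> assumed buffer_chunk_limit
queue_limit = 32             # BUFFER_QUEUE_LIMIT=32 -> assumed buffer_queue_limit
file_buffer_limit_mib = 512  # FILE_BUFFER_LIMIT=512Mi

per_output_mib = chunk_limit_mib * queue_limit               # 256 MiB of queued chunks per output
outputs_that_fit = file_buffer_limit_mib // per_output_mib   # 2 outputs fit in FILE_BUFFER_LIMIT
print('max queued per output: %d MiB, outputs that fit: %d' % (per_output_mib, outputs_that_fit))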
The traffic rate in comment 7 should be 1000 messages per second per logging-mux; that is the aggregate message rate for each of the 2 logging-mux pods.
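For reference, the arithmetic behind that figure, using the per-node rate from the description; the even split of traffic across the 2 mux pods is an assumption.

# Aggregate-rate arithmetic; assumes the 4 compute nodes' traffic is split evenly
# across the 2 logging-mux pods.
nodes = 4
msgs_per_sec_per_node = 500
mux_pods = 2
aggregate = nodes * msgs_per_sec_per_node   # 2000 messages/sec total
per_mux = aggregate // mux_pods             # 1000 messages/sec per logging-mux
print(aggregate, per_mux)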
*** Bug 1481347 has been marked as a duplicate of this bug. ***
Fluentd now includes jemalloc.

koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636331
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.9-rhel-7-docker-candidate-89109-20180103231254
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0-0.16.0.1
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0.20180103.170133
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9

koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636358
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.8-rhel-7-docker-candidate-92791-20180103235159
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25-3
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25.20180102.220346
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25

koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636394
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.7-rhel-7-docker-candidate-18263-20180104013419
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16-3
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16.20171221.103556
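A quick way to confirm jemalloc is actually in use in a running collector is to look for the library in the fluentd process's memory maps. This is only a sketch: the assumption that the image loads a library whose file name contains "jemalloc" (for example via LD_PRELOAD) is mine, not something stated in the build info above. Run it inside the logging-fluentd or logging-mux pod.

# Sketch: report whether a jemalloc library is mapped into any fluentd process.
# Assumes the library path contains the string "jemalloc"; run inside the pod.
import os

def has_jemalloc(pid):
    try:
        with open('/proc/%s/maps' % pid) as f:
            return any('jemalloc' in line for line in f)
    except IOError:
        return False

for entry in os.listdir('/proc'):
    if not entry.isdigit():
        continue
    try:
        with open('/proc/%s/cmdline' % entry, 'rb') as f:
            cmdline = f.read()
    except IOError:
        continue
    if b'fluentd' in cmdline:
        print('pid %s jemalloc mapped: %s' % (entry, has_jemalloc(entry)))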
Tested on the 3.7.23 logging images; the situation is much better than originally reported. In the original issue, logging-mux vsz showed very noticeable linear growth. In the new run (graphs will be attached), logging-mux rss and vsz look basically stable: there is some growth early in the run and then things stabilize. Unfortunately there is a potential anomaly at the end of the run that is going to force a re-run for a longer period of time.

On the logging-fluentd side, both rss and vsz show slight but noticeable growth throughout the course of the run (graphs will be attached).

Functionally, logging is stable on these images. The next step is a longer run, hopefully on 3.9 once bug 1531157 is fixed.
Created attachment 1386086 [details] 3.7.23 logging-mux rss
Created attachment 1386087 [details] 3.7.23 logging-mux vsz
Created attachment 1386088 [details] 3.7.23 logging-fluentd rss
Created attachment 1386089 [details] 3.7.23 logging-fluentd vsz
Hi Mike, Could you help verify this bug? Thanks!
As seen in comment 14, both vsz and rss are stable for fluentd in logging 3.9.2. Ran 15 million messages and captured pbench data (linked in a later comment). No restarts were seen and memory usage was stable.