Bug 1502764 - slow memory leak in logging-mux (fluentd secure_forward listener) during steady-state logging.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Rich Megginson
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-37
Duplicates: 1481347 (view as bug list)
Depends On: 1531157 1536651
Blocks:
 
Reported: 2017-10-16 15:53 UTC by Mike Fiedler
Modified: 2021-09-09 12:43 UTC
CC List: 9 users

Fixed In Version: openshift3/logging-fluentd:v3.7.16-3
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd's memory usage pattern, especially the secure_forward listener, does not fit well with the default glibc memory allocator, causing excessive memory fragmentation.
Consequence: The fluentd process uses more and more memory until it is OOM killed.
Fix: Fluentd is now built with the jemalloc memory allocator library. This memory allocator works much better for the fluentd memory usage pattern.
Result: Fluentd does not excessively fragment memory, and does not experience unbounded memory growth.
Note: Fluentd will still show memory growth as it fills its buffers, caches, and queues, but the memory growth should eventually level off, and fluentd should not be OOM killed due to this issue.
Clone Of:
Environment:
Last Closed: 2018-06-18 17:39:26 UTC
Target Upstream Version:
Embargoed:


Attachments
logging-mux rss (33.44 KB, image/png), 2017-10-16 15:53 UTC, Mike Fiedler
logging-mux vsz (46.54 KB, image/png), 2017-10-16 15:54 UTC, Mike Fiedler
logging-fluentd rss (58.36 KB, image/png), 2017-10-16 15:55 UTC, Mike Fiedler
logging-fluentd vsz (44.12 KB, image/png), 2017-10-16 15:55 UTC, Mike Fiedler
3.7.23 logging-mux rss (53.06 KB, image/png), 2018-01-25 13:48 UTC, Mike Fiedler
3.7.23 logging-mux vsz (49.65 KB, image/png), 2018-01-25 13:48 UTC, Mike Fiedler
3.7.23 logging-fluentd rss (59.27 KB, image/png), 2018-01-25 13:50 UTC, Mike Fiedler
3.7.23 logging-fluentd vsz (49.65 KB, image/png), 2018-01-25 13:50 UTC, Mike Fiedler


Links
Github openshift/origin-aggregated-logging pull 788 (last updated 2017-11-30 20:48:06 UTC)

Description Mike Fiedler 2017-10-16 15:53:09 UTC
Description of problem:

Configuration (see below for logging install inventory)

1 elastic search node
2 logging-mux
4 logging-fluentd nodes
200 pods logging on those 4 nodes at an aggregate rate of 500 1K messages/sec/node, which is an aggregate throughput of 1000 messages/second/fluentd (see the rate note below and the correction in comment 8)
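Rate note: spelled out, 50 pods/node x 10 1K msgs/sec = 500 msgs/sec per node (i.e. per node-level fluentd); 4 nodes x 500 msgs/sec = 2000 msgs/sec aggregate, which split across the 2 logging-mux pods is roughly 1000 msgs/sec per logging-mux (the figure comment 8 later clarifies).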

Running 100 million messages through this configuration at this rate.   All messages are indexed successfully and the logging-fluentd and logging-mux logs show 0 errors.

During the run, logging-fluentd resident set size and virtual size show no growth: rss shows a GC pattern and vsz is stable (see attached graphs).

Logging-mux shows a stable resident set size but slow linear growth in the virtual size. Eventually I believe the pod will be OOM killed (will test that with a lower memory limit).

So, logging-fluentd seems stable and logging-mux appears to have a leak.

Next action is to do an even longer run to verify the trend continues.

Version-Release number of selected component (if applicable):  logging-fluentd version is build 3.7.0-0.146.0.1.


How reproducible:   Always (2 runs so far)


Steps to Reproduce:
1.  1 master/2 infra/4 computes.  Single ES, 2 logging-mux, 4 logging-fluentd
2.  Logging deployed with the inventory below.
3.  50 pods running on each compute node, each logging 10 1K messages/second.  See above for the aggregate logging rate.
4.  Monitor memory usage with pbench (https://github.com/distributed-system-analysis/pbench)
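A minimal manual alternative for spot-checking the same memory trend (a sketch only; it assumes the ps utility is available inside the image, and the pod name placeholder must be filled in by hand):

    # sample rss and vsz (KiB) of the fluentd process in the logging-mux pod once a minute
    while true; do
      oc exec -n logging <logging-mux-pod> -- ps -C fluentd -o rss=,vsz= | head -1
      sleep 60
    done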

Actual results:

Slow/steady growth in logging-mux pod virtual size.  Stable logging-mux rss

Expected results:

No growth for a constant workload

Additional info:

[OSEv3:children]                  
masters                           
etcd                              

[masters]                         
ec2-54-200-230-39.us-west-2.compute.amazonaws.com                   

[etcd]                            
ec2-54-200-230-39.us-west-2.compute.amazonaws.com                   
[OSEv3:vars]                      
deployment_type=openshift-enterprise                                

openshift_deployment_type=openshift-enterprise                      
openshift_release=v3.7            
openshift_image_tag=v3.7.0        



openshift_logging_install_logging=true                              
openshift_logging_master_url=https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443                                             
openshift_logging_master_public_url=https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443                                      
openshift_logging_kibana_hostname=kibana.0918-s73.qe.rhcloud.com    
openshift_logging_namespace=logging                                 
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/                                                                   
openshift_logging_image_version=v3.7                                
openshift_logging_es_cluster_size=1                                 
openshift_logging_es_pvc_dynamic=true                               
openshift_logging_es_pvc_size=50Gi                                  
openshift_logging_fluentd_read_from_head=false                      
openshift_logging_use_mux=true    
openshift_logging_mux_client_mode=maximal                           
openshift_logging_use_ops=false   

openshift_logging_fluentd_cpu_limit=1000m                           
openshift_logging_mux_cpu_limit=1000m                               
openshift_logging_kibana_cpu_limit=200m                             
openshift_logging_kibana_proxy_cpu_limit=100m                       
openshift_logging_es_memory_limit=9Gi 
openshift_logging_es_cpu_limit=2500m
openshift_logging_fluentd_memory_limit=1Gi                          
openshift_logging_mux_memory_limit=2Gi                              
openshift_logging_kibana_memory_limit=1Gi                           
openshift_logging_kibana_proxy_memory_limit=256Mi                   

openshift_logging_mux_file_buffer_storage_type=pvc                  
openshift_logging_mux_file_buffer_pvc_name=logging-mux-pvc          
openshift_logging_mux_file_buffer_pvc_size=30Gi
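For reference, an inventory like this is typically applied with the openshift-ansible logging playbook, roughly as follows (the playbook path shown is the 3.7-era location and is an assumption, not something recorded in this report):

    ansible-playbook -i <inventory-file> \
      /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml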

Comment 1 Mike Fiedler 2017-10-16 15:53:44 UTC
Created attachment 1339318 [details]
logging-mux rss

Comment 2 Mike Fiedler 2017-10-16 15:54:51 UTC
Created attachment 1339319 [details]
logging-mux vsz

Comment 3 Mike Fiedler 2017-10-16 15:55:19 UTC
Created attachment 1339320 [details]
logging-fluentd rss

Comment 4 Mike Fiedler 2017-10-16 15:55:44 UTC
Created attachment 1339321 [details]
logging-fluentd vsz

Comment 6 Rich Megginson 2017-10-16 16:03:43 UTC
You are correct - it will eventually be oom-killed - I think this may be a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1481347

Comment 7 Mike Fiedler 2017-10-17 12:57:25 UTC
Updated my logging deployment to use the tuning values from https://github.com/openshift/openshift-ansible/pull/5748

logging-mux now has:

memory limit = 512Mi
FILE_BUFFER_LIMIT=512Mi
BUFFER_QUEUE_LIMIT=32
BUFFER_SIZE_LIMIT=8m
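For reference, a sketch of how values like these could be applied to an existing deployment, assuming the variables map directly onto environment variables and resource limits on the logging-mux DeploymentConfig (illustrative commands, not necessarily the exact procedure used here):

    oc -n logging set resources dc/logging-mux --limits=memory=512Mi
    oc -n logging set env dc/logging-mux FILE_BUFFER_LIMIT=512Mi BUFFER_QUEUE_LIMIT=32 BUFFER_SIZE_LIMIT=8m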

With these values set, after 12 hours of running traffic at the rate in the description (1000 messages per second per fluentd), one logging-mux has:

2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="output_tag"
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `write'            
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `<<'               
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:200:in `block in emit'            
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'                                         
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:193:in `emit'                     
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:467:in `emit'                     
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:42:in `next'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/out_copy.rb:78:in `emit'             
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'         
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'                
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'             
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'               
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'         
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'                
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'             
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'                      
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'               
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'         
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/in_secure_forward.rb:293:in `on_message'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:124:in `on_read'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `feed_each'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `block in start'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `loop'
  2017-10-17 11:24:44 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `start'
2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="kubernetes.var.log.containers.centos-logtest-dkr5g_svt18_centos-logtest-95016b9211bff50c95238d15704b1084302d26d7cb712261c5440f1ecd706cab.log.mux"
  2017-10-17 11:24:44 +0000 [warn]: suppressed same stacktrace  
2017-10-17 11:24:44 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log" tag="kubernetes.var.log.containers.centos-logtest-dkr5g_svt18_centos-logtest-95016b9211bff50c95238d15704b1084302d26d7cb712261c5440f1ecd706cab.log"
  2017-10-17 11:24:44 +0000 [warn]: suppressed same stacktrace  
2017-10-17 11:24:44 +0000 [warn]: unexpected error in in_secure_forward from 172.21.0.230:48422 error_class=Errno::ENOMEM error=#<Errno::ENOMEM: Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55bbc5eb509f7340.log>
2017-10-17 11:24:47 +0000 [warn]: Elasticsearch rejected 1 documents due to invalid field arguments
2017-10-17 11:24:48 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-10-17 11:24:49 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:24:49 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not



The second logging-mux has:

2017-10-17 07:48:04 +0000 [error]: fluentd main process died unexpectedly. restarting.                                          
2017-10-17 07:48:05 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 07:48:05 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 09:52:26 +0000 [error]: fluentd main process died unexpectedly. restarting.                                          
2017-10-17 09:52:27 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 09:52:27 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:11:15 +0000 [error]: fluentd main process died unexpectedly. restarting.                                          
2017-10-17 11:11:16 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-10-17 11:11:16 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
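For context, the "'block' action" warnings above come from fluentd's buffer_queue_full_action setting on a file-buffered output. A minimal illustrative fragment of that style of configuration (fluentd 0.12 parameter names; the values are examples tied to the env vars above, not the exact mux output config):

    <match **>
      @type elasticsearch
      buffer_type file
      buffer_path /var/lib/fluentd/buffer-output-es-config
      buffer_chunk_limit 8m              # BUFFER_SIZE_LIMIT
      buffer_queue_limit 32              # BUFFER_QUEUE_LIMIT
      buffer_queue_full_action block     # emits the "'block' action stops input process" warning when the queue is full
    </match>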

Comment 8 Mike Fiedler 2017-10-17 12:58:43 UTC
The comment 7 traffic rate should be 1000 messages per second per logging-mux. That is the aggregate message rate for each of the 2 logging-mux pods.

Comment 9 Rich Megginson 2017-11-21 18:01:59 UTC
*** Bug 1481347 has been marked as a duplicate of this bug. ***

Comment 11 Rich Megginson 2018-01-04 03:03:33 UTC
Fluentd now includes jemalloc
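One way to confirm the allocator is actually in use in a running logging-fluentd (or mux) pod, assuming jemalloc is pulled in via LD_PRELOAD and fluentd runs as PID 1 in the container (a sketch, not a documented procedure; the pod name placeholder must be filled in):

    oc exec <logging-fluentd-pod> -- sh -c 'tr "\0" "\n" < /proc/1/environ | grep -i ld_preload'
    oc exec <logging-fluentd-pod> -- sh -c 'grep -i jemalloc /proc/1/maps | head -1'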

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636331
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.9-rhel-7-docker-candidate-89109-20180103231254
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0-0.16.0.1
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9.0.20180103.170133
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.9

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636358
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.8-rhel-7-docker-candidate-92791-20180103235159
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25-3
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25.20180102.220346
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.8.25

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636394
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.7-rhel-7-docker-candidate-18263-20180104013419
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16-3
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.16.20171221.103556

Comment 14 Mike Fiedler 2018-01-25 13:45:35 UTC
Tested on the 3.7.23 logging images and the situation is much better than originally reported. In the original issue, logging-mux vsz showed very noticeable linear growth.

In the new run (graphs will be attached), logging-mux rss and vsz basically look stable.   There is some growth early in the run and then things stabilize.  Unfortunately there is a potential anomaly at the end of the run that is going to force a re-run for a longer period of time.

On the logging-fluentd side, both rss and vsz show slight but very noticeable growth throughout the course of the run (graphs will be attached).


Functionally, logging is stable on these images.  Next step is a longer run, hopefully on 3.9 when bug 1531157 is fixed.

Comment 15 Mike Fiedler 2018-01-25 13:48:16 UTC
Created attachment 1386086 [details]
3.7.23 logging-mux rss

Comment 16 Mike Fiedler 2018-01-25 13:48:51 UTC
Created attachment 1386087 [details]
3.7.23 logging-mux vsz

Comment 17 Mike Fiedler 2018-01-25 13:50:22 UTC
Created attachment 1386088 [details]
3.7.23 logging-fluentd rss

Comment 18 Mike Fiedler 2018-01-25 13:50:50 UTC
Created attachment 1386089 [details]
3.7.23 logging-fluentd vsz

Comment 19 Wei Sun 2018-03-05 07:03:14 UTC
Hi Mike,

Could you help verify this bug?

Thanks!

Comment 20 Mike Fiedler 2018-03-06 14:56:51 UTC
As seen in comment 14, both vsz and rss are stable for fluentd in logging 3.9.2. Ran 15 million messages and captured pbench data (linked in a later comment). No restarts were seen and memory usage was stable.

