Created attachment 1197219 [details]
Graphs showing thread count and RSS memory

Description of problem:
The ceilometer-polling process grows in its number of threads both when under load and when idle. Comparing two clouds:

Cloud under load (16 hours), booting 20 instances every 20 minutes:
- Controllers (3 controllers) grew from 7 threads to ~1k threads
- Computes (7 computes) grew from 7 threads to ~1k threads
This results in a total of roughly 10k sleeping threads across the 10 machines.

Cloud not under load, thread counts over about 13 hours:
- Controllers (3 controllers): 15 to 63 threads
- Compute (1 compute): 16 to 95 threads

Version-Release number of selected component (if applicable):
OSP10 deployed from OSPd: builds 2016-08-29.1 and 2016-08-30
openstack-ceilometer-polling-7.0.0-0.20160818153837.4ce3339.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Attached are graphs showing the thread count of the ceilometer-polling daemon over time, as well as RSS memory graphs of the same processes. One concern is that each leaked thread consumes a small amount of memory that is never released; as the graphs show, RSS memory grows in both the loaded and idle cases. This does not directly confirm a memory leak, though. Further investigation shows the threads are in the sleeping state when viewed with top and ps.

[root@overcloud-controller-0 ~]# ps afx | grep ceilometer-polling
24013 pts/0    S+     0:00  \_ grep --color=auto ceilometer-polling
 9697 ?        Ss     4:38 /usr/bin/python2 /usr/bin/ceilometer-polling --polling-namespaces central --logfile /var/log/ceilometer/central.log
 9917 ?        Sl    21:29  \_ ceilometer-polling - AgentManager(0)

[root@overcloud-controller-0 ~]# cat /proc/9917/status | grep -i threads
Threads: 1005

[root@overcloud-controller-0 ~]# ps -T -p 9917 -o pid,lwp,state,rss,pcpu,cmd
  PID   LWP S   RSS %CPU CMD
 9917  9917 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917  9918 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917  9945 S 201200  0.2 ceilometer-polling - AgentManager(0)
 9917  9946 S 201200  0.8 ceilometer-polling - AgentManager(0)
 9917  9948 S 201200  0.1 ceilometer-polling - AgentManager(0)
 9917 17550 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 21764 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 25958 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 29950 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 33829 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 37656 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 41507 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 45586 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917   844 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917  5158 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917  9237 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 13068 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 16867 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 21349 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 25498 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 29287 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 33148 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 36955 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 40808 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 44784 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917 48765 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917  4192 S 201200  0.0 ceilometer-polling - AgentManager(0)
 9917  8277 S 201200  0.0 ceilometer-polling - AgentManager(0)
....
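For reference, the numbers above were taken from /proc/<pid>/status and ps -T. A small Python sketch along the following lines can log the same counters over time (this is an assumption of mine for illustration, not the tooling actually used to produce the attached graphs; the PID is the AgentManager(0) worker from the ps output above):

import time

def read_status(pid):
    # Parse thread count and resident memory from /proc/<pid>/status.
    threads = rss_kb = None
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Threads:"):
                threads = int(line.split()[1])
            elif line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])
    return threads, rss_kb

if __name__ == "__main__":
    pid = 9917  # AgentManager(0) worker PID from the output above
    while True:
        threads, rss_kb = read_status(pid)
        print("%s threads=%s rss_kb=%s" % (time.ctime(), threads, rss_kb))
        time.sleep(60)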
The bug has been fixed upstream. It is not critical: the thread pool used by Ceilometer was capped at 1000, so the value was safely bounded (just a bit too big :p). The upstream fix limits the number of threads to the exact number of pollsters. When eventlet was used the pool size was also 1000, but greenthreads did not show up in ps, and the eventlet pool recycles workers differently from the concurrent.futures one, so memory usage was not as high.
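To illustrate the idea of the fix, here is a minimal sketch (not the actual Ceilometer patch; it assumes the polling agent runs its pollsters on a concurrent.futures thread pool, and the pollster callables are hypothetical stand-ins):

from concurrent import futures

def run_pollsters(pollsters):
    # Before the fix: a fixed cap such as 1000 workers meant the pool could
    # create and keep alive up to 1000 OS threads.
    # pool = futures.ThreadPoolExecutor(max_workers=1000)

    # After the fix (conceptually): never create more worker threads than
    # there are pollsters to run.
    pool = futures.ThreadPoolExecutor(max_workers=max(1, len(pollsters)))
    try:
        return list(pool.map(lambda p: p(), pollsters))
    finally:
        pool.shutdown(wait=True)

if __name__ == "__main__":
    # Hypothetical pollsters standing in for the real Ceilometer ones.
    sample = [lambda i=i: "sample-%d" % i for i in range(5)]
    print(run_pollsters(sample))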
This is part of the Ceilometer 7.0.0.0rc1 upstream release.
After two hours, I still have 7 threads; everything looks OK.

[heat-admin@overcloud-controller-1 ~]$ uptime
 13:02:02 up  2:37,  1 user,  load average: 6.74, 6.35, 7.16
[heat-admin@overcloud-controller-1 ~]$ ps -T -p 13028 -o pid,lwp,state,rss,pcpu,cmd
  PID   LWP S   RSS %CPU CMD
13028 13028 S 52892  0.0 ceilometer-polling - AgentManager(0)
13028 13032 S 52892  0.0 ceilometer-polling - AgentManager(0)
13028 13090 R 52892  0.1 ceilometer-polling - AgentManager(0)
13028 13092 S 52892  0.0 ceilometer-polling - AgentManager(0)
13028 13093 S 52892  0.0 ceilometer-polling - AgentManager(0)
13028 13213 S 52892  0.0 ceilometer-polling - AgentManager(0)
13028  9295 S 52892  0.0 ceilometer-polling - AgentManager(0)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html