Description of problem: I set a 2 hour long workload of continuous provisioning from Cloud Forms by ordering an item off the service catalog. During this workload, I measure system resources and Generic and Priority Workers are being recycled (Exceeding memory) most probably due to the fact they are spawning threads that do not appear to be cleaned up. (See attach graph on number of threads). This occurs when attached to both VMware and RHEVM providers. Version-Release number of selected component (if applicable): 5.5.3.2 How reproducible: 100% Reproducible during the provisioning workload. Steps to Reproduce: 1. Attach CFME to VMware or RHEVM Provider 2. Create Service Catalog Item and Order it 3. Monitor number of threads used by each worker as well as amount of memory Actual results: Generic and Priority Workers continuously grow in number of threads Expected results: For the workers to "level" off in amount of memory used and clean up threads that are no longer used. Additional info: When the workload completes and I disable automate, those workers no longer grow in memory however the number of threads they have spawned remains: [root@dhcp23-64 log]# rake evm:status (in /var/www/miq/vmdb) Checking EVM status... ... MiqGenericWorker | started | 61 | 20222 | 20485 | generic | 2016-03-29T08:40:27Z | 2016-03-29T12:07:39Z MiqGenericWorker | started | 60 | 12616 | 12841 | generic | 2016-03-29T08:37:53Z | 2016-03-29T12:07:39Z MiqPriorityWorker | started | 38 | 48809 | 49051 | generic | 2016-03-29T07:46:00Z | 2016-03-29T12:07:40Z MiqPriorityWorker | started | 41 | 541 | 904 | generic | 2016-03-29T07:51:39Z | 2016-03-29T12:07:40Z ... [root@dhcp23-64 log]# cat /proc/541/status | grep "threads" -i Threads: 145 [root@dhcp23-64 log]# cat /proc/48809/status | grep "threads" -i Threads: 182 [root@dhcp23-64 log]# cat /proc/12616/status | grep "threads" -i Threads: 269 [root@dhcp23-64 log]# cat /proc/20222/status | grep "threads" -i Threads: 2
Created attachment 1141221 [details] CPU Graph during 2 hour workload
Created attachment 1141222 [details] Memory Usage during 2 hour workload
Created attachment 1141226 [details] Generic/Priority Workers Threads during 2 hour workload
Created attachment 1141228 [details] Generic/Priority RSS Memory Usage during 2 hour workload
*** Bug 1322394 has been marked as a duplicate of this bug. ***
Question in automate gitter room: Is this DRb.start_service[1] and this one[2] run in the same process? one is in the RUBY_METHOD_PREAMBLE and one is in the setup_drb_for_ruby_method two start_service calls on the same URI with a one or more stop_service calls, leaks threads, or so it appears irb(main):004:0> require 'drb'; 10.times { puts Thread.list.length; DRb.start_service("druby://127.0.0.1:0"); DRb.start_service("druby://127.0.0.1:0"); DRb.stop_service; } 1 2 3 4 5 6 7 8 9 10 [1] https://github.com/ManageIQ/manageiq/blob/3f5a48410bf8023774584ba4a2616264a0511936/lib/miq_automation_engine/engine/miq_ae_method.rb#L230 [2] https://github.com/ManageIQ/manageiq/blob/3f5a48410bf8023774584ba4a2616264a0511936/lib/miq_automation_engine/engine/miq_ae_method.rb#L129
This relates to Automate methods and how we use Drb provided caching. For caching objects sent over by the DrbServer we use a Ruby implementation called TimerIdConv. This caches all the objects based on a passed in timeout, we pass in an hour. For that time period there is a thread that holds on to the cached objects. When the method ends, this thread has not been terminated. When the next work is picked up by the Automate worker and it dispatches a method it creates another TimerIdConv thread. These threads would all terminate after 1 hour. https://github.com/ruby/ruby/blob/32674b167bddc0d737c38f84722986b0f228b44b/lib/drb/timeridconv.rb#L78 One of the solutions we worked on was to kill this thread when the method ends.
https://github.com/ManageIQ/manageiq/pull/8399
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/1428c1e9a855e6ca22cd1187dff6bbf7fa4c93cc commit 1428c1e9a855e6ca22cd1187dff6bbf7fa4c93cc Author: Madhu Kanoor <mkanoor> AuthorDate: Mon May 2 17:59:22 2016 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Tue May 3 12:10:20 2016 -0400 DRb TimerIdConv cleanup https://bugzilla.redhat.com/show_bug.cgi?id=1321934 The TimerIdConv thread was not being terminated after a method ends. The TimerIdConv thread sleeps for a hour before it GC's the objects lib/miq_automation_engine/engine/miq_ae_method.rb | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
https://github.com/ManageIQ/manageiq/pull/8473
New commit detected on cfme/5.5.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=e3112414d3feecd65d823970049a292565f88a60 commit e3112414d3feecd65d823970049a292565f88a60 Author: Madhu Kanoor <mkanoor> AuthorDate: Mon May 2 17:59:22 2016 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Fri May 13 15:59:30 2016 -0400 DRb TimerIdConv cleanup https://bugzilla.redhat.com/show_bug.cgi?id=1321934 The TimerIdConv thread was not being terminated after a method ends. The TimerIdConv thread sleeps for a hour before it GC's the objects lib/miq_automation_engine/engine/miq_ae_method.rb | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
New commit detected on cfme/5.5.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=418e1a6628339842597a045911269da2f860308b commit 418e1a6628339842597a045911269da2f860308b Merge: 1b6e399 aac6e54 Author: Greg McCullough <gmccullo> AuthorDate: Wed May 25 16:16:31 2016 -0400 Commit: Greg McCullough <gmccullo> CommitDate: Wed May 25 16:16:31 2016 -0400 Merge branch 'bugzilla_1321934_5_5_z' into '5.5.z' Thread leakage due to Drb's TimerIdConv thread not being terminated To prevent Drb recycled error we delegated the caching of referenced objects to DRb's builtin TimerIdConv, which uses a thread and that thread never gets terminated, causing a thread leakage. https://bugzilla.redhat.com/show_bug.cgi?id=1321934 Original PRs: https://github.com/ManageIQ/manageiq/pull/8399 (Clean Cherry Pick) https://github.com/ManageIQ/manageiq/pull/8419 (2 commits clean cherry pick, 1 commit had rspec 3 changes) See merge request !939 lib/miq_automation_engine/engine/miq_ae_method.rb | 13 +- .../miq_automation_engine/miq_ae_method_spec.rb | 240 +++++++++++---------- 2 files changed, 138 insertions(+), 115 deletions(-)
Resolved by https://github.com/ManageIQ/manageiq/pull/8399
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1348