1321934 – GenericWorkers and PriorityWorkers are spawning threads that are not cleaned up during provisioning in 5.5.3.2

Bug 1321934 - GenericWorkers and PriorityWorkers are spawning threads that are not cleaned up during provisioning in 5.5.3.2

Summary: GenericWorkers and PriorityWorkers are spawning threads that are not cleaned ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Performance
Sub Component:
Version:	5.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.6.0
Assignee:	dmetzger
QA Contact:	Alex Krzos
Docs Contact:
URL:
Whiteboard:	perf:worker
Duplicates (1):	1322394 (view as bug list)
Depends On:
Blocks:	1342278 1342280
TreeView+	depends on / blocked

Reported:	2016-03-29 12:09 UTC by Alex Krzos
Modified:	2019-08-06 20:05 UTC (History)
CC List:	9 users (show)
Fixed In Version:	5.6.0.5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1342278 1342280 (view as bug list)
Environment:
Last Closed:	2016-06-29 15:45:32 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
CPU Graph during 2 hour workload (146.45 KB, image/png) 2016-03-29 12:10 UTC, Alex Krzos	no flags	Details
Memory Usage during 2 hour workload (71.98 KB, image/png) 2016-03-29 12:10 UTC, Alex Krzos	no flags	Details
Generic/Priority Workers Threads during 2 hour workload (83.65 KB, image/png) 2016-03-29 12:11 UTC, Alex Krzos	no flags	Details
Generic/Priority RSS Memory Usage during 2 hour workload (86.93 KB, image/png) 2016-03-29 12:11 UTC, Alex Krzos	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:1348	0	normal	SHIPPED_LIVE	CFME 5.6.0 bug fixes and enhancement update	2016-06-29 18:50:04 UTC

Description Alex Krzos 2016-03-29 12:09:47 UTC

Description of problem:
I set a 2 hour long workload of continuous provisioning from Cloud Forms by ordering an item off the service catalog.  During this workload, I measure system resources and Generic and Priority Workers are being recycled (Exceeding memory) most probably due to the fact they are spawning threads that do not appear to be cleaned up.  (See attach graph on number of threads).  This occurs when attached to both VMware and RHEVM providers.

Version-Release number of selected component (if applicable):
5.5.3.2

How reproducible:
100% Reproducible during the provisioning workload.  

Steps to Reproduce:
1. Attach CFME to VMware or RHEVM Provider
2. Create Service Catalog Item and Order it
3. Monitor number of threads used by each worker as well as amount of memory

Actual results:
Generic and Priority Workers continuously grow in number of threads

Expected results:
For the workers to "level" off in amount of memory used and clean up threads that are no longer used.

Additional info:
When the workload completes and I disable automate, those workers no longer grow in memory however the number of threads they have spawned remains:
[root@dhcp23-64 log]# rake evm:status
(in /var/www/miq/vmdb)
Checking EVM status...
...
 MiqGenericWorker                                         | started | 61 | 20222 | 20485 | generic                 | 2016-03-29T08:40:27Z | 2016-03-29T12:07:39Z
 MiqGenericWorker                                         | started | 60 | 12616 | 12841 | generic                 | 2016-03-29T08:37:53Z | 2016-03-29T12:07:39Z
 MiqPriorityWorker                                        | started | 38 | 48809 | 49051 | generic                 | 2016-03-29T07:46:00Z | 2016-03-29T12:07:40Z
 MiqPriorityWorker                                        | started | 41 |   541 | 904   | generic                 | 2016-03-29T07:51:39Z | 2016-03-29T12:07:40Z
...

[root@dhcp23-64 log]# cat /proc/541/status | grep "threads" -i
Threads:	145
[root@dhcp23-64 log]# cat /proc/48809/status | grep "threads" -i
Threads:	182
[root@dhcp23-64 log]# cat /proc/12616/status | grep "threads" -i
Threads:	269
[root@dhcp23-64 log]# cat /proc/20222/status | grep "threads" -i
Threads:	2

Comment 2 Alex Krzos 2016-03-29 12:10:23 UTC

Created attachment 1141221 [details]
CPU Graph during 2 hour workload

Comment 3 Alex Krzos 2016-03-29 12:10:44 UTC

Created attachment 1141222 [details]
Memory Usage during 2 hour workload

Comment 4 Alex Krzos 2016-03-29 12:11:13 UTC

Created attachment 1141226 [details]
Generic/Priority Workers Threads during 2 hour workload

Comment 5 Alex Krzos 2016-03-29 12:11:46 UTC

Created attachment 1141228 [details]
Generic/Priority RSS Memory Usage during 2 hour workload

Comment 7 Dave Johnson 2016-03-31 15:13:57 UTC

*** Bug 1322394 has been marked as a duplicate of this bug. ***

Comment 9 Joe Rafaniello 2016-04-29 18:34:13 UTC

Question in automate gitter room:

Is this DRb.start_service[1] and this one[2] run in the same process?
one is in the RUBY_METHOD_PREAMBLE and one is in the setup_drb_for_ruby_method
two start_service calls on the same URI with a one or more stop_service calls, leaks threads, or so it appears


 irb(main):004:0> require 'drb'; 10.times { puts Thread.list.length; DRb.start_service("druby://127.0.0.1:0"); DRb.start_service("druby://127.0.0.1:0"); DRb.stop_service; }
1
2
3
4
5
6
7
8
9
10

[1]
https://github.com/ManageIQ/manageiq/blob/3f5a48410bf8023774584ba4a2616264a0511936/lib/miq_automation_engine/engine/miq_ae_method.rb#L230

[2] https://github.com/ManageIQ/manageiq/blob/3f5a48410bf8023774584ba4a2616264a0511936/lib/miq_automation_engine/engine/miq_ae_method.rb#L129

Comment 10 mkanoor 2016-05-02 21:38:43 UTC

This relates to Automate methods and how we use Drb provided caching.

For caching objects sent over by the DrbServer we use a Ruby implementation called TimerIdConv. This caches all the objects based on a passed in timeout, we pass in an hour. For that time period there is a thread that holds on to the cached objects. When the method ends, this thread has not been terminated. When the next work is picked up by the Automate worker and it dispatches a method it creates another TimerIdConv thread. These threads would all terminate after 1 hour.

https://github.com/ruby/ruby/blob/32674b167bddc0d737c38f84722986b0f228b44b/lib/drb/timeridconv.rb#L78

One of the solutions we worked on was to kill this thread when the method ends.

Comment 11 CFME Bot 2016-05-02 22:40:52 UTC

https://github.com/ManageIQ/manageiq/pull/8399

Comment 12 CFME Bot 2016-05-03 17:56:00 UTC

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/1428c1e9a855e6ca22cd1187dff6bbf7fa4c93cc

commit 1428c1e9a855e6ca22cd1187dff6bbf7fa4c93cc
Author:     Madhu Kanoor <mkanoor>
AuthorDate: Mon May 2 17:59:22 2016 -0400
Commit:     Madhu Kanoor <mkanoor>
CommitDate: Tue May 3 12:10:20 2016 -0400

    DRb TimerIdConv cleanup
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1321934
    
    The TimerIdConv thread was not being terminated after a method ends.
    The TimerIdConv thread sleeps for a hour before it GC's the objects

 lib/miq_automation_engine/engine/miq_ae_method.rb | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comment 16 CFME Bot 2016-05-17 17:23:38 UTC

https://github.com/ManageIQ/manageiq/pull/8473

Comment 17 CFME Bot 2016-05-25 20:21:20 UTC

New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=e3112414d3feecd65d823970049a292565f88a60

commit e3112414d3feecd65d823970049a292565f88a60
Author:     Madhu Kanoor <mkanoor>
AuthorDate: Mon May 2 17:59:22 2016 -0400
Commit:     Madhu Kanoor <mkanoor>
CommitDate: Fri May 13 15:59:30 2016 -0400

    DRb TimerIdConv cleanup
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1321934
    
    The TimerIdConv thread was not being terminated after a method ends.
    The TimerIdConv thread sleeps for a hour before it GC's the objects

 lib/miq_automation_engine/engine/miq_ae_method.rb | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comment 18 CFME Bot 2016-05-25 20:21:25 UTC

New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=418e1a6628339842597a045911269da2f860308b

commit 418e1a6628339842597a045911269da2f860308b
Merge: 1b6e399 aac6e54
Author:     Greg McCullough <gmccullo>
AuthorDate: Wed May 25 16:16:31 2016 -0400
Commit:     Greg McCullough <gmccullo>
CommitDate: Wed May 25 16:16:31 2016 -0400

    Merge branch 'bugzilla_1321934_5_5_z' into '5.5.z'
    
    Thread leakage due to Drb's TimerIdConv thread not being terminated
    
    To prevent Drb recycled error we delegated the caching of referenced objects to DRb's builtin TimerIdConv, which uses a thread and that thread never gets terminated, causing a thread leakage.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1321934
    
    Original PRs:
    https://github.com/ManageIQ/manageiq/pull/8399 (Clean Cherry Pick)
    https://github.com/ManageIQ/manageiq/pull/8419 (2 commits clean cherry pick, 1 commit had rspec 3 changes)
    
    
    
    See merge request !939

 lib/miq_automation_engine/engine/miq_ae_method.rb  |  13 +-
 .../miq_automation_engine/miq_ae_method_spec.rb    | 240 +++++++++++----------
 2 files changed, 138 insertions(+), 115 deletions(-)

Comment 19 dmetzger 2016-06-02 13:05:10 UTC

Resolved by https://github.com/ManageIQ/manageiq/pull/8399

Comment 23 errata-xmlrpc 2016-06-29 15:45:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

Note You need to log in before you can comment on or make changes to this bug.