Bug 1648412
| Summary: | Clicking on templates or Hosts causes puma memory to grow until it crashes | | |
|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Ryan Spagnola <rspagnol> |
| Component: | Performance | Assignee: | Nick LaMuro <nlamuro> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jaroslav Henner <jhenner> |
| Severity: | high | Docs Contact: | Red Hat CloudForms Documentation <cloudforms-docs> |
| Priority: | high | | |
| Version: | 5.8.3 | CC: | abellott, dmetzger, gekis, jdeubel, jprause, jrafanie, kbrock, mshriver, nlamuro, obarenbo, rspagnol, simaishi, tuado |
| Target Milestone: | GA | Keywords: | Reopened, TestOnly, ZStream |
| Target Release: | 5.11.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | 5.11.0.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1686043 1690610 (view as bug list) | Environment: | |
| Last Closed: | 2019-12-13 15:00:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | Bug |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | CFME Core | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1679295, 1686043, 1690610 | | |
| 
 
        
Description  Ryan Spagnola  2018-11-09 16:34:45 UTC

The customer has pointed out the following: "I have noticed that templates now have a relationship to every VM that has been provisioned from them. I think this is our issue when navigating to Compute -> Infrastructure -> Virtual Machines, then clicking on Templates and searching for a template name. This action also causes puma memory to spike past 5 GB until the process dies. This hasn't been reproducible every time, unlike the issue when clicking on hosts."

The offending controller and action is EmsClusterController#show:
appliance 02 logs/production.log:[----] I, [2018-11-07T11:10:57.997723 #18268:f4b0f0]  INFO -- : Started GET "/ems_cluster/show/12300000000115?display=hosts" for [REDACTED] at 2018-11-07 11:10:57 -0500
appliance 02 logs/production.log:[----] I, [2018-11-07T11:10:58.001356 #18268:f4b0f0]  INFO -- : Processing by EmsClusterController#show as HTML
appliance 02 logs/production.log:[----] I, [2018-11-07T11:10:58.001492 #18268:f4b0f0]  INFO -- :   Parameters: {"display"=>"hosts", "id"=>"12300000000115"}
appliance 02 logs/production.log:[----] I, [2018-11-07T12:19:34.912987 #34255:f1e154]  INFO -- : Started GET "/ems_cluster/show/12300000000117?display=hosts" for [REDACTED] at 2018-11-07 12:19:34 -0500
appliance 02 logs/production.log:[----] I, [2018-11-07T12:19:34.916055 #34255:f1e154]  INFO -- : Processing by EmsClusterController#show as HTML
appliance 02 logs/production.log:[----] I, [2018-11-07T12:19:34.916188 #34255:f1e154]  INFO -- :   Parameters: {"display"=>"hosts", "id"=>"1230000000117"}
These don't complete before the worker exceeds 3.2 GB and gets killed.
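
For context on how this class of view can exhaust a worker, here is a minimal sketch, assuming the growth comes from materializing an entire association in the puma process instead of letting the database page it; the model and method names are illustrative only, not the actual CFME controller code.

```ruby
# Illustrative sketch: assumes ActiveRecord models shaped roughly like the CFME
# schema (a template with a has_many :vms association). Not the code from
# EmsClusterController#show.

# Memory-hungry pattern: materialize every related record (plus whatever the
# view touches on each one) inside the worker before rendering. With tens of
# thousands of VMs this alone can push a puma worker past its memory threshold.
def related_vms_unbounded(template)
  template.vms.to_a
end

# Bounded alternative: let PostgreSQL do the paging so the worker only ever
# holds one page of rows at a time.
def related_vms_paged(template, page: 1, per_page: 20)
  template.vms.order(:id).offset((page - 1) * per_page).limit(per_page)
end
```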
Hello Ryan,

Has the customer gotten back on whether upgrading from 5.8.3 to 5.8.5 fixes this issue? We have been finding minor issues, but they will only give modest gains. If the problem is not solved, I want to start looking for larger issues with this report.

Thanks,
Keenan

Just an FYI, I started taking a look at this on Friday, and will be continuing that again today.

https://github.com/ManageIQ/manageiq-ui-classic/pull/5283 has been merged and should address this issue.

*** Bug 1679276 has been marked as a duplicate of this bug. ***

I don't know how to reproduce this. The seed script fails:

[root@dhcp-8-198-68 vmdb]# bin/rails r bz_1648412_replication_database_seed.rb
Creating Storage records...
EMS records created!
Creating Cluster records...
Creating Tag Records...
Creating Host records...
Traceback (most recent call last):
        12: from bin/rails:4:in `<main>'
        11: from bin/rails:4:in `require'
        10: from /opt/rh/cfme-gemset/gems/railties-5.1.7/lib/rails/commands.rb:16:in `<top (required)>'
         9: from /opt/rh/cfme-gemset/gems/railties-5.1.7/lib/rails/command.rb:44:in `invoke'
         8: from /opt/rh/cfme-gemset/gems/railties-5.1.7/lib/rails/command/base.rb:63:in `perform'
         7: from /opt/rh/cfme-gemset/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
         6: from /opt/rh/cfme-gemset/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
         5: from /opt/rh/cfme-gemset/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
         4: from /opt/rh/cfme-gemset/gems/railties-5.1.7/lib/rails/commands/runner/runner_command.rb:34:in `perform'
         3: from /opt/rh/cfme-gemset/gems/railties-5.1.7/lib/rails/commands/runner/runner_command.rb:34:in `load'
         2: from bz_1648412_replication_database_seed.rb:119:in `<top (required)>'
         1: from bz_1648412_replication_database_seed.rb:119:in `times'
bz_1648412_replication_database_seed.rb:127:in `block in <top (required)>': undefined method `sub' for nil:NilClass (NoMethodError)

Jaroslav,

Thanks for the update (and sorry for the delay). I was able to reproduce your issue and just updated the replication script: https://gist.github.com/NickLaMuro/f36576f9fa73c2bcc0d1ef7b7d0adafb There were a few oversights, but it should now be working.

-Nick

With Nick's benchmark tool, which I first had to convince to talk to localhost over https:
cfme-5.11.0.16-1.el8cf.x86_64:
...
Started POST "/vm_infra/report_data" for 127.0.0.1 at 2019-07-25 08:38:24 -0400
Processing by VmInfraController#report_data as HTML
  Parameters: {"page"=>"1", "ppsetting"=>"20"}
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
Completed 200 OK in 3603ms (Views: 8.3ms | ActiveRecord: 324.7ms | Rows: 2028)
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
Started POST "/vm_infra/report_data" for 127.0.0.1 at 2019-07-25 08:38:28 -0400
Processing by VmInfraController#report_data as HTML
  Parameters: {"page"=>"2", "ppsetting"=>"20"}
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
Completed 200 OK in 1098ms (Views: 6.7ms | ActiveRecord: 253.6ms | Rows: 2027)
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
cfme-5.10.1.2-2.el7cf.x86_64:
Started POST "/vm_infra/report_data" for 127.0.0.1 at 2019-07-25 08:31:30 -0400
Processing by VmInfraController#report_data as HTML
  Parameters: {"page"=>"1", "ppsetting"=>"20"}
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
Completed 200 OK in 11385ms (Views: 8.8ms | ActiveRecord: 1945.0ms | Rows: 35328)
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
Started POST "/vm_infra/report_data" for 127.0.0.1 at 2019-07-25 08:31:42 -0400
Processing by VmInfraController#report_data as HTML
  Parameters: {"page"=>"2", "ppsetting"=>"20"}
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
Completed 200 OK in 45464ms (Views: 38.4ms | ActiveRecord: 5898.7ms | Rows: 125327)
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
[root@dhcp-8-197-130 vmdb]# rpm -q cfme
... I see something like a 45x better response time for the page-2 POST request. I think I can call this VERIFIED.
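
As a quick sanity check on that figure (not part of the original comment), the completion times quoted in the two log excerpts above work out to roughly a 3x improvement for page 1 and a 41x improvement for page 2:

```ruby
# Speedups implied by the "Completed 200 OK in ..." lines above (times in ms).
page1_510, page1_511 = 11_385.0, 3_603.0
page2_510, page2_511 = 45_464.0, 1_098.0

puts format("page 1: %.1fx faster", page1_510 / page1_511)  # => ~3.2x
puts format("page 2: %.1fx faster", page2_510 / page2_511)  # => ~41.4x
```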