Created attachment 1310601 [details]
Generic worker growth (yellow legend) w.r.t. metrics collector queue

+++ This bug was initially created as a clone of Bug #1479339 +++

Description of problem:
Observed memory leaks in the Generic Worker with a 10k-VM VMware infra provider connected. The run lasted over 3 days, but the leaks appeared within the first day.

Version-Release number of selected component (if applicable):
CFME 5.8.0.17

How reproducible:
I had increased the memory thresholds / counts for specific worker processes on all appliances, just enough to accommodate that many VMs on a 6-appliance setup.

Steps to Reproduce:
1. Set up 6 appliances (1 DB, 5 workers).
2. Turn on C&U on all worker appliances (and cluster-wide C&U collection settings in config) and keep server roles to a minimum on the DB appliance.
3. Connect to a 10k-VM VMware infra provider and let it run for 2-3 days while keeping an eye on C&U data collector worker memory usage.

For reference (worker count, memory threshold):

# DB appliance
- Generic - 2, 500 MB
- Priority - 2, 600 MB

# Worker appliances
- Generic - 4, 500 MB
- Priority - 2, 800 MB
- C&U Data Collectors - 6, 600 MB
- C&U Data Processors - 4, 800 MB
- Refresh - 2 GB

Actual results:
Generic Worker memory grew from about 1.5 GB to 2.8 GB.

Expected results:
Little or no memory growth after the initial C&U/refresh period.

Additional info:
Attaching a screenshot for reference.

Original comment (from the BZ about the MetricsCollector worker leak with multiple providers): https://bugzilla.redhat.com/show_bug.cgi?id=1456775#c29
Also, this occurs at the same time as the metrics processor and collector leak: https://bugzilla.redhat.com/show_bug.cgi?id=1479339
Created attachment 1310615 [details]
Generic worker leakage (blue legend) w.r.t. stable processor queue and powered-on/off VMs
Created attachment 1311306 [details]
PSS & RSS utilization - 4+ day test run

Chart showing PSS & RSS memory utilization on a CFME 5.8.0.17-1 appliance.

Worker Config: Single Generic Worker, 1.5 GB Memory Threshold

Provider:
- Clusters: 10
- Hosts: 50
- Datastores: 61
- VMs: 1,000
- Type: VMware VC 5.5.0
Data from a 4+ day run on my test CFME 5.8.0.17-1 appliance (see comment #4 above) with C&U enabled does not show the leak. Can we get a full set of logs to review?
Created attachment 1311826 [details]
PSS Utilization (Short Test)

Adding two charts to show the following observation: CPU and PSS memory usage spike on the hour. CPU utilization decreases over the hour, while the PSS size is not reduced. Data collected on a CFME 5.8.0.17-1 appliance configured with a single instance of each worker type and no provider added. The appliance was booted at 09:24.

NOTE: The long-duration test (4+ days) shows PSS utilization leveled off after about 12 hours.
Created attachment 1311827 [details]
CPU Utilization (Short Test)

Adding two charts to show the following observation: CPU and PSS memory usage spike on the hour. CPU utilization decreases over the hour, while the PSS size is not reduced. Data collected on a CFME 5.8.0.17-1 appliance configured with a single instance of each worker type and no provider added. The appliance was booted at 09:24.

NOTE: The long-duration test (4+ days) shows PSS utilization leveled off after about 12 hours.
Comment on attachment 1311826 [details]
PSS Utilization (Short Test)

Note: the previous comment associated with the short-test PSS utilization chart incorrectly stated that no provider was added.
Comment on attachment 1311827 [details]
CPU Utilization (Short Test)

Note: the previous comment associated with the short-test CPU utilization chart incorrectly stated that no provider was added.
Does anyone have both the historical and current sets of evm.log and top_output.log files? We need to track the RSS of the server process (the one that forks workers) and what it was doing, and the same for the forked worker exhibiting the problem. I've looked at the logs provided, and they include only the evm.log.
If top_output.log shows the server process is > 1 GB and the forked child worker starts > 1 GB RSS, it's likely related to bug 1425217
Relative to comment https://bugzilla.redhat.com/show_bug.cgi?id=1479356#c11

1. Did the failure mode begin recently, or has this been an ongoing issue that has been addressed thus far by increasing the threshold value over time?
2. Is the failure mode happening on all appliances in the environment?
3. Do we have an ETA on the full log set(s)?
Accidentally cleared the need info flag, resetting it.
Created attachment 1349013 [details] austx01-generic_workers_pss.pdf: Generic workers PSS usage
Created attachment 1349014 [details] austx01-max_worker_pss.pdf: Shows maximum PSS used per worker type
Created attachment 1349015 [details] austx01-mem_threshold_exceptions_counts_by_worker.pdf: # of times the memory threshold was exceeded, by worker
Created attachment 1349016 [details] austx01-free.pdf: Appliance memory usage - OS perspective
https://github.com/ManageIQ/manageiq-gems-pending/pull/313
https://github.com/ManageIQ/manageiq-gems-pending/pull/314
https://github.com/ManageIQ/manageiq-gems-pending/pull/312
https://github.com/ManageIQ/manageiq/pull/16480
Archit: Do you have an environment that I can access that has this leak? Or can you reproduce it again?

Thanks,
Keenan
There could possibly
Premature save-changes click. Basically, there might have been a leak in the generic worker, but based on comment 25 by Dennis, the fundamental problem is that a leaking miq server process, growing in size, causes new generic workers (and all workers) to be created with a very large amount of shared memory inherited from the server process. This leads to a very high PSS and causes the worker to be killed. PSS is the wrong metric for deciding whether a worker is leaking, because we can wrongly conclude a worker is leaking when it is merely inheriting a large amount of shared memory (and therefore a large PSS). We're changing this to use USS.
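To make the difference concrete, here is a minimal Ruby sketch (not ManageIQ code; it assumes a Linux /proc filesystem) that inflates a parent "server" process and then forks a child "worker". The freshly forked child shows a large PSS purely from its share of the inherited copy-on-write pages, while its USS stays near zero:

#!/usr/bin/env ruby
# Sums the Pss and Private_Clean/Private_Dirty fields of /proc/<pid>/smaps
# to approximate the PSS and USS of a process, in kB.
def smaps_kb(pid)
  pss = uss = 0
  File.foreach("/proc/#{pid}/smaps") do |line|
    case line
    when /\APss:\s+(\d+)\s+kB/                     then pss += $1.to_i
    when /\APrivate_(?:Clean|Dirty):\s+(\d+)\s+kB/ then uss += $1.to_i
    end
  end
  {:pss_kb => pss, :uss_kb => uss}
end

# Inflate the parent ("server") process, then fork a child ("worker").
junk = Array.new(2_000_000) { "leak" }  # keep a reference to ~100+ MB of private memory

child = fork do
  sleep 2  # the child does no work of its own; its pages are still shared COW
  # Large PSS (a share of the inherited pages), near-zero USS.
  puts "worker: #{smaps_kb(Process.pid).inspect}"
end

# Large PSS *and* large USS: the parent really owns this memory.
puts "server: #{smaps_kb(Process.pid).inspect}"
Process.wait(child)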
The master and gaprindashvili branches also change the worker validation to use unique set size, but there it is done through a schema change that adds this column to the miq_workers and miq_servers tables:

https://github.com/ManageIQ/manageiq-schema/pull/139
https://github.com/ManageIQ/manageiq-gems-pending/pull/317
https://github.com/ManageIQ/manageiq/pull/16569
https://github.com/ManageIQ/manageiq/pull/16570
Note, the changes to use unique set size (USS) instead of proportional set size (PSS) don't fix the server leak; they mean we no longer penalize workers for the large amount of memory they inherit from a large miq server process when they're forked. top will still show high memory usage (RSS) for workers that inherited memory from a large server; to see the USS you need a tool such as smem (e.g., smem -P MIQ). bin/rake evm:status now shows the USS value as well. We will continue to track down and fix the server memory growth, but at least we are no longer prematurely killing workers.
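For reference, here is a rough Ruby sketch of the kind of per-process totals that smem -P MIQ reports (a hedged approximation, not the smem implementation; it assumes a Linux /proc filesystem and enough privilege to read other processes' smaps files):

#!/usr/bin/env ruby
# For every process whose command line matches a pattern, total up
# RSS, PSS, and USS from /proc/<pid>/smaps.
pattern = Regexp.new(ARGV[0] || "MIQ")

Dir.glob("/proc/[0-9]*").each do |dir|
  begin
    cmdline = File.read("#{dir}/cmdline").tr("\0", " ").strip
    next if cmdline.empty? || cmdline !~ pattern

    rss = pss = uss = 0
    File.foreach("#{dir}/smaps") do |line|
      case line
      when /\ARss:\s+(\d+)/                     then rss += $1.to_i
      when /\APss:\s+(\d+)/                     then pss += $1.to_i
      when /\APrivate_(?:Clean|Dirty):\s+(\d+)/ then uss += $1.to_i
      end
    end

    printf("%6s  rss=%8d kB  pss=%8d kB  uss=%8d kB  %s\n",
           File.basename(dir), rss, pss, uss, cmdline[0, 60])
  rescue Errno::ENOENT, Errno::EACCES
    next # process exited or is not readable; skip it
  end
end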
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/06030f3826af407ef18ee54efcd0e0c5b48b8044

commit 06030f3826af407ef18ee54efcd0e0c5b48b8044
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Nov 13 16:26:26 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 11:18:36 2017 -0500

    Store unique set size (USS) in the PSS column

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1526474

    Unique set size is a better way to detect workers that are growing
    unbounded since any memory/reference leaks would be shown in their uss.
    If the server process is large when forking, new workers would inherit
    a big pss immediately.

    We should really rename the column/hash key to uss.

 gems/pending/util/miq-process.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
*** Bug 1506737 has been marked as a duplicate of this bug. ***
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/2d758778b55484a96e423317bbf832c5293eb0ed

commit 2d758778b55484a96e423317bbf832c5293eb0ed
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Nov 30 15:27:11 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 16:55:02 2017 -0500

    Also log the server/worker unique set size (USS)

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

 app/models/miq_server/status_management.rb | 2 +-
 app/models/miq_worker.rb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/eb028e0c47df57b19dc830175ade75b140146646

commit eb028e0c47df57b19dc830175ade75b140146646
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Nov 30 15:29:17 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 16:55:02 2017 -0500

    Show USS instead of PSS in rake evm:status

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

 lib/tasks/evm_application.rb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/ef84a880817385e5f0edc6cbb4410819140999a4

commit ef84a880817385e5f0edc6cbb4410819140999a4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Nov 30 15:29:26 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 16:55:32 2017 -0500

    Change worker validation to check USS not PSS

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1479356

    Why? USS is a more reliable mechanism for tracking workers with runaway
    memory growth. PSS is great, until the server process that forks new
    processes grows large. As each new worker is forked, it inherits a share
    of the large amount of the parent process' memory and therefore starts
    with a large PSS, possibly exceeding our limits before doing any work.

    USS only measures a process' private memory and is a better indicator
    when a process is responsible for allocating too much memory without
    freeing it.

 app/models/miq_server/worker_management/monitor/validation.rb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
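For illustration, the shape of that validation change is roughly the following (a hedged Ruby sketch; worker_memory_exceeded?, unique_set_size, and proportional_set_size are hypothetical names, not the actual MiqWorker API in validation.rb):

# Hypothetical sketch of a USS-based worker memory check.
def worker_memory_exceeded?(worker, threshold_bytes)
  # Before: a freshly forked worker could trip the limit immediately,
  # purely from memory inherited (shared) from a large server process:
  #   worker.proportional_set_size > threshold_bytes

  # After: only the worker's private (unshared) memory counts against it.
  worker.unique_set_size > threshold_bytes
end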
New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/d51285ca19c96304c7e3b521ae16713d75cfcee1

commit d51285ca19c96304c7e3b521ae16713d75cfcee1
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Nov 13 16:26:26 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 11:20:13 2017 -0500

    Store unique set size (USS) in the PSS column

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1526473

    Unique set size is a better way to detect workers that are growing
    unbounded since any memory/reference leaks would be shown in their uss.
    If the server process is large when forking, new workers would inherit
    a big pss immediately.

    We should really rename the column/hash key to uss.

 lib/gems/pending/util/miq-process.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)