1479339 – Memory leak in MetricsProcessor Worker

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Performance
Sub Component:
Version:	5.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	cfme-future
Assignee:	Nick LaMuro
QA Contact:	Tasos Papaioannou
Docs Contact:
URL:
Whiteboard:	c&u:worker:perf
Duplicates (1):	1511897
Depends On:	1456775
Blocks:	1479356

Reported:	2017-08-08 11:50 UTC by Archit Sharma
Modified:	2018-07-12 17:44 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1456775
Clones:	1479356
Environment:
Last Closed:	2018-07-12 17:44:26 UTC
Category:	---
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:

Attachments
Priority worker leakage on all appliances (161.22 KB, image/png) 2017-08-08 11:50 UTC, Archit Sharma
Generic worker leakage w.r.t stable processor queue and powered on/off vms (185.01 KB, image/png) 2017-08-08 11:51 UTC, Archit Sharma
all worker types memory usage comparison (132.18 KB, image/png) 2017-08-08 11:52 UTC, Archit Sharma
PSS & RSS utilization - 4+ day test run (475.80 KB, image/png) 2017-08-09 18:42 UTC, dmetzger

Description Archit Sharma 2017-08-08 11:50:24 UTC

Created attachment 1310596
Priority worker leakage on all appliances

Description of problem:

Observed memory leaks in the MetricsProcessor Worker and GenericWorker with a 10k-VM VMware infrastructure provider connected.

The test ran for over 3 days, but the leaks appeared within the first day.

Version-Release number of selected component (if applicable):
CFME 5.8.0.17

How reproducible:


Steps to Reproduce:
1. Set up 6 appliances: 1 DB and 5 workers.
2. Turn on C&U on all worker appliances (and the cluster-wide C&U collection settings in the configuration), and keep server roles to a minimum on the DB appliance.
3. Connect a VMware infra provider with 10k VMs and let it run for 2-3 days while keeping an eye on C&U data collector worker memory usage.
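
One way to keep an eye on worker memory in step 3 is to sample RSS and PSS from /proc/<pid>/smaps at intervals. The following is a minimal Python sketch, assuming a Linux appliance with a readable /proc; the function names are illustrative and are not part of CFME, and this is not necessarily how the attached graphs were produced.

```python
import re

# Matches only the per-mapping "Rss:" and "Pss:" lines of smaps,
# not variants such as "Pss_Anon:" or "SwapPss:".
_FIELD = re.compile(r"^(Rss|Pss):\s+(\d+)\s+kB")


def parse_smaps(text):
    """Sum Rss and Pss (in kB) over every mapping in smaps-formatted text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


def read_worker_memory(pid):
    """Read /proc/<pid>/smaps for a worker process (hypothetical helper)."""
    with open(f"/proc/{pid}/smaps") as handle:
        return parse_smaps(handle.read())
```

Calling read_worker_memory(pid) on a schedule and logging the result with a timestamp gives a series that can be plotted per worker. PSS is usually the more useful figure for forked workers, since it divides shared pages among the processes that map them.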

Actual results:
MetricsProcessorWorker memory grew from about 1.5G to 2.8G. 

Expected results:
Little or no memory growth after the initial C&U/refresh period.
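
To put a number on "little or no growth", one rough approach is to fit a least-squares line to the memory samples taken after the initial warm-up and look at its slope. This is only a sketch: the function name and the warm-up cutoff are assumptions chosen for illustration, not something defined in this report.

```python
def growth_rate_mb_per_hour(samples, warmup_hours=0.0):
    """Least-squares slope of memory (MB) against time (hours).

    samples: iterable of (hours_since_start, memory_mb) pairs.
    warmup_hours: samples earlier than this are ignored (assumed cutoff).
    """
    points = [(t, m) for t, m in samples if t >= warmup_hours]
    if len(points) < 2:
        raise ValueError("need at least two samples after warm-up")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_m = sum(m for _, m in points) / n
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in points)
    return covariance / variance
```

A slope that stays near zero over several days fits the expected result; a steadily positive slope after warm-up points to a leak. Comparing the slope from a patched appliance against an unpatched one running the same workload is one way to frame an A/B check.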

Additional info:
Attaching a screenshot for reference.

Original comment (from the BZ about the MetricsCollector worker leak for multiple providers): https://bugzilla.redhat.com/show_bug.cgi?id=1456775#c29

Comment 2 Archit Sharma 2017-08-08 11:51:59 UTC

Created attachment 1310597
Generic worker leakage w.r.t stable processor queue and powered on/off vms

Comment 3 Archit Sharma 2017-08-08 11:52:51 UTC

Created attachment 1310598
all worker types memory usage comparison

Comment 4 Archit Sharma 2017-08-08 12:01:20 UTC

To add to the 'Steps to Reproduce' in the description: I had increased the memory thresholds and worker counts for specific worker processes on all appliances, just enough to accommodate that many VMs on a 6-appliance setup.

For reference (worker type - count, memory threshold):

----
# DB

- Generic - 2, 500 MB
- Priority - 2, 600 MB

----
# Worker appliances

- Generic - 4, 500 MB
- Priority - 2, 800 MB
- C&U Data Collectors - 6, 600 MB
- C&U Data Processors - 4, 800 MB
- Refresh - 2 GB
----

The refresh worker's (leaked?) memory grew by a few MB. Its RSS memory growth is included in the attachment https://bugzilla.redhat.com/attachment.cgi?id=1310598

Comment 5 dmetzger 2017-08-09 18:42:27 UTC

Created attachment 1311340
PSS & RSS utilization - 4+ day test run

Worker Config:
    Single Metrics Processor Worker
    1.5 GB Memory Threshold

Provider:
    Clusters:      10
    Hosts:         50
    Datastores:    61
    VMs:        1,000
    Type:       VMware VC 5.5.0

Comment 6 Nick LaMuro 2017-09-13 13:59:31 UTC

Based on some talks with Dennis about similar tickets, I think enabling metrics collection is the root cause of some of the "leaks" that we are seeing.

Most of my commenting will probably be done on:

https://bugzilla.redhat.com/show_bug.cgi?id=1458392

I will update here when I have more to share.

Comment 10 Nick LaMuro 2018-01-18 16:07:32 UTC

A possible fix has been proposed in this related BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1535720

That fix is targeted at the MiqServer, and we have high confidence that it will fix the leak there. Updates will probably happen there more regularly until we determine whether there is a different leak in the MetricsProcessor Worker; there is a high probability that this was one leak common to all workers.

Comment 11 Nick LaMuro 2018-01-19 23:53:51 UTC

The fix above has been backported to 5.8:

https://bugzilla.redhat.com/show_bug.cgi?id=1536672

As well as for future releases here:

https://bugzilla.redhat.com/show_bug.cgi?id=1535720

We are going to do some testing ourselves to see if this is fixing the issue with the MetricsProcessor as well, and will update with those results.

Comment 12 Nick LaMuro 2018-02-01 22:46:24 UTC

Update:

We are relatively sure that this leak will be resolved with the patch provided in https://bugzilla.redhat.com/show_bug.cgi?id=1535720 (or the respective backported version), so this might already be fixed.

That said, we are doing some final long-term comparisons in our test environments to confirm that the systems that had the patch applied and showed no leak will start leaking once the patch is removed. We are confident this patch fixes the leak in MiqServer, but we want to be equally confident that the same holds for the other workers, and that there isn't another leak at play here.

Next update will be roughly in a week's time.

Comment 13 Nick LaMuro 2018-02-08 17:39:48 UTC

After testing on a pair of appliances for about a week, we are fairly confident that this fix has a substantial impact on the memory footprint of all the workers, including the MetricsProcessorWorker discussed here.

Please retest with the changes in place, and if the issue persists, feel free to kick the ticket back so we can look into it further.

Comment 14 Satoe Imaishi 2018-02-09 14:12:10 UTC

*** Bug 1511897 has been marked as a duplicate of this bug. ***

Comment 17 Tasos Papaioannou 2018-06-12 17:44:34 UTC

Verified.
