Created attachment 1283390 [details]
collector memory graph for a week

Description of problem:
Over the last 4-5 days, the OCP environment was kept constant, but we had to increase the MetricsCollectorWorker memory threshold first to 500M, then 600M, and eventually 700M right now. (The data collectors are just keeping up with the rate of messages coming in.)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Single DB appliance with 5 worker appliances
2. Turn on C&U on all worker appliances and keep server roles to a minimum on the DB appliance
3. Connect to ~1900 pods in the OCP env and let it run for 4-5 days while keeping an eye on C&U data collector worker memory usage

Actual results:
MetricsCollectorWorker memory grows over time, even after raising the memory threshold a couple of times.

Expected results:
After increasing the memory threshold a couple of times, the queued message pipeline should have stabilized.

Additional info:
Attaching screenshot for reference.
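As context for step 3, memory growth like this can be tracked outside the appliance's own reporting by periodically sampling a worker PID's RSS from /proc. This is a minimal sketch, assuming a Linux appliance; the function names are illustrative and not part of CFME tooling:

```python
import os
import re
import time

def rss_kb(pid):
    """Read a process's resident set size (VmRSS) in kB from /proc (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(re.search(r"(\d+)", line).group(1))
    return None

def sample_rss(pid, interval_s=60, samples=5):
    """Collect periodic RSS readings; a steadily rising series suggests a leak."""
    readings = []
    for _ in range(samples):
        readings.append(rss_kb(pid))
        time.sleep(interval_s)
    return readings

if __name__ == "__main__":
    # Sample our own PID as a demo; point at a worker PID in practice.
    print(sample_rss(os.getpid(), interval_s=1, samples=3))
```

Plotting such samples over several days gives the kind of trend line shown in the attached graph.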
GreggT, this looks like good feedback from the Perf QE team about a possible memory leak in the C&U worker. Wondering if it's a general problem, or if there's a specific problem with the way that the OpenShift Metrics Collector works. Sending over to platform to see if you have any ideas off hand or have seen this in a more general setting.
(In reply to Greg Blomquist from comment #3) > Sending over to platform to see if you have any ideas off hand or have seen > this in a more general setting. Bronagh, is there any additional information on what was discovered, so we can assign it to CM?
On today's bug triage call the Performance team shared that they have long-running 5.7.2.1 and 5.8.0.11 appliances pointing to VC, and there are no memory leaks. Pradeep - can the QE performance team run performance tests against other providers to determine whether this is a CM problem or a general problem?
Created attachment 1308698 [details] Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for memory
(In reply to Archit Sharma from comment #23) > Created attachment 1308698 [details] > Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for > memory Thanks Himanshu, the ask was to have multiple providers to compare. The graph you provided is great, but we should see the trend for the other types of providers as well (OpenShift, RHV, OpenStack).
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set it to Low/Low.
Observed memory leaks not just in MetricsCollector, but also in MetricsProcessor, GenericWorker and PriorityWorker on a CFME 5.8.0.17 appliance setup (1 DB, 5 workers) connected to a 10k-VM VMware provider infrastructure. This ran for over 3 days, but the leaks occurred within the first day itself. So we now have verifiable data about the Metrics Collector leak, at least, from the following providers:
- VMware
- OpenShift
- RHVM
Attaching graphs showing the increase in RSS memory and its correlation to stable ems_metrics_processor/collector queues vs. total powered on/off VMs vs. DB size.
Created attachment 1309976 [details] cfme_individual_memleak_correlation_graphs over 3 days (RSS usage)
Created attachment 1309977 [details] overall look at memory leaks in workers The individual graphs in this screenshot have been attached separately in a previous attachment [1]. [1] - https://bugzilla.redhat.com/attachment.cgi?id=1309976
Work on the memory leak in the metrics collector worker is currently in progress under https://bugzilla.redhat.com/show_bug.cgi?id=1458392. Closing this BZ as a dupe of that ticket, given that work is in progress under that BZ. If there is data showing memory leaks in other workers as mentioned in comment #29, please open tickets for those workers with the associated data. *** This bug has been marked as a duplicate of bug 1458392 ***
Created attachment 1311339 [details]
PSS & RSS utilization - 4+ day test run

Worker Config: Single Generic Worker, 1.5GB Memory Threshold

Provider:
  Type: VMware VC 5.5.0
  Clusters: 10
  Hosts: 50
  Datastores: 61
  VMs: 1,000
Correction to comment #35: the worker configuration was a single Metrics Collector Worker.
A possible fix has been proposed in this related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1535720 That fix targets the MiqServer, and we have high confidence that it will fix the leak there. Updates will probably happen there more regularly until we determine whether there is a separate leak in the MetricsCollectorWorker, though there is a high probability this was a leak across all workers.
The fix above has been backported to 5.8: https://bugzilla.redhat.com/show_bug.cgi?id=1536672 As well as for future releases here: https://bugzilla.redhat.com/show_bug.cgi?id=1535720 We are going to do some testing ourselves to see whether this fixes the issue with the MetricsCollectorWorker as well, and will update with those results.
Update: We are relatively sure that this leak will be resolved by the patch provided in https://bugzilla.redhat.com/show_bug.cgi?id=1535720 (or the respective backported version), so this might already be fixed. That said, we are doing some final long-term comparisons with our test environments to confirm that the systems that had the patch applied and displayed no leak will start leaking once the patch is removed. We are confident this patch fixes the leak in MiqServer, but want to be confident in saying the same holds for the other workers as well, and that there isn't another leak at play here. The next update will be in roughly a week's time.
After testing on a pair of appliances for about a week, we are fairly confident that this change has a substantial impact on the memory footprint of all the workers, including the MetricsCollectorWorker, as mentioned here. Please retest with the changes in place, and if the issue persists, feel free to kick the ticket back so we can look into it further.
Retested with long-running metrics collection. No leaks seen. Verifying. However, if the issue still persists, we can recheck/retest as required.
Memory consumption with 5.9 is lower than in older releases. Haven't noticed any significant leaks. This can be closed.