Bug 1456775

Summary: Possible memory leak in MetricsCollectorWorker
Product: Red Hat CloudForms Management Engine
Reporter: Archit Sharma <arcsharm>
Component: Performance
Assignee: Nick LaMuro <nlamuro>
Status: CLOSED CURRENTRELEASE
QA Contact: Einat Pacifici <epacific>
Severity: high
Priority: high
Version: 5.8.0
CC: abellott, bsorota, dajohnso, dmetzger, fsimonce, hroy, jhardy, mburman, obarenbo, pmcgowan, psuriset, simaishi, yzamir
Target Milestone: GA
Keywords: Reopened, TestOnly
Target Release: cfme-future
Hardware: Unspecified
OS: Unspecified
Whiteboard: c&u:worker:perf
Fixed In Version: 5.9.0.18
Last Closed: 2019-01-24 14:32:13 UTC
Type: Bug
Cloudforms Team: CFME Core
Bug Blocks: 1479339, 1479356
Attachments:
- Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for memory
- cfme_individual_memleak_correlation_graphs over 3 days (RSS usage)
- overall look at memory leaks in workers
- PSS & RSS utilization - 4+ day test run

Description Archit Sharma 2017-05-30 10:43:15 UTC
Created attachment 1283390 [details]
collector memory graph for a week

Description of problem:
Over the last 4-5 days the OCP environment was kept constant, but we had to increase the MetricsCollectorWorker memory threshold first to 500M, then to 600M, and eventually to 700M (where it is right now). The data collectors are only just keeping up with the rate of messages coming in.
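For context, the threshold being raised here lives in the appliance's advanced settings. Below is a minimal, hedged sketch of checking the current value from a Rails console on the appliance; the settings keys are an assumption about the advanced-settings layout and may differ between releases.

  # Hedged sketch: the settings path below is an assumption and may vary by release.
  # Run inside `rails console` on the appliance (vmdb).
  threshold = Settings.workers.worker_base.queue_worker_base
                      .ems_metrics_collector_worker.defaults.memory_threshold
  puts "ems_metrics_collector_worker memory_threshold: #{threshold.inspect}"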

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Set up a single DB appliance with 5 worker appliances.
2. Turn on C&U on all worker appliances and keep server roles to a minimum on the DB appliance.
3. Connect to ~1900 pods in the OCP environment and let it run for 4-5 days while keeping an eye on the C&U data collector worker memory usage (see the monitoring sketch below).
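Not part of the original report, but roughly the kind of monitoring used in step 3 can be done with a small Ruby poller like the hedged sketch below; the process-title match is an assumption about how the appliance names its worker processes.

  #!/usr/bin/env ruby
  # Hedged sketch: poll RSS of the metrics collector worker processes
  # every few minutes and emit CSV rows (timestamp, pid, RSS in kB).
  require "time"

  INTERVAL = 300 # seconds between samples

  loop do
    Dir.glob("/proc/[0-9]*/cmdline").each do |path|
      pid = path.split("/")[2]
      begin
        title = File.read(path).tr("\0", " ")
        next unless title.include?("MetricsCollectorWorker")
        rss_kb = File.read("/proc/#{pid}/status")[/^VmRSS:\s+(\d+)\s+kB/, 1]
      rescue SystemCallError
        next # process exited between glob and read
      end
      puts [Time.now.utc.iso8601, pid, rss_kb].join(",")
    end
    sleep INTERVAL
  end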

Actual results:
MetricsCollectorWorker memory grows over time even after raising memory thresholds a couple of times.

Expected results:
After increasing the memory threshold a couple of times, the queued message pipeline should have stabilized.
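One hedged way to watch whether the collection backlog is actually stabilizing is to count pending queue messages from a Rails console; the queue name and state value below are assumptions.

  # Hedged sketch: queue name and state value are assumptions.
  # Run from `rails console` on the appliance to watch the backlog.
  ready = MiqQueue.where(:queue_name => "ems_metrics_collector", :state => "ready").count
  puts "ems_metrics_collector messages waiting: #{ready}"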

Additional info:
attaching screenshot for reference

Comment 3 Greg Blomquist 2017-05-30 13:27:16 UTC
GreggT, this looks like good feedback from the Perf QE team about a possible memory leak in the C&U worker.  Wondering if it's a general problem, or if there's a specific problem with the way that the OpenShift Metrics Collector works.

Sending over to platform to see if you have any ideas off hand or have seen this in a more general setting.

Comment 5 Federico Simoncelli 2017-06-01 16:16:29 UTC
(In reply to Greg Blomquist from comment #3)
> Sending over to platform to see if you have any ideas off hand or have seen
> this in a more general setting.

Bronagh, is there any additional information on what was discovered that led to assigning this to CM?

Comment 6 Bronagh Sorota 2017-06-01 16:39:32 UTC
On today's bug triage call the Performance team shared that they have long-running 5.7.2.1 and 5.8.0.11 appliances pointing to VC and there are no memory leaks.

Pradeep - can the QE performance team run performance tests against other providers to determine whether this is a CM problem or a general problem?

Comment 23 Archit Sharma 2017-08-03 11:53:34 UTC
Created attachment 1308698 [details]
Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for memory

Comment 24 Federico Simoncelli 2017-08-03 12:21:14 UTC
(In reply to Archit Sharma from comment #23)
> Created attachment 1308698 [details]
> Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for
> memory

Thanks Himanshu, the ask was to have multiple providers to compare.

The graph you provided is great, but we should see the trend for other types of providers as well (OpenShift, RHV, OpenStack).

Comment 28 Dave Johnson 2017-08-03 23:50:26 UTC
Please assess the impact of this issue and update the severity accordingly.  Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition.

If it's something like a tracker bug where it doesn't matter, please set it to Low/Low.

Comment 29 Archit Sharma 2017-08-07 09:22:47 UTC
Observed memory leaks not just in the MetricsCollector, but also in the MetricsProcessor, GenericWorker and PriorityWorker, on a CFME 5.8.0.17 appliance setup (1 DB, 5 workers) connected to a 10k-VM VMware provider infrastructure.

This ran for over 3 days, but the leaks appeared within the first day.

So we now have verifiable data, at least for the MetricsCollector leak, from the following providers:
- VMware
- OpenShift
- RHV (RHEVM)

Attaching graphs showing the increase in RSS memory and its correlation with the stable ems_metrics_processor/collector queues, the total powered on/off VMs, and the DB size.
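A point-in-time comparison across worker types can also be pulled from the appliance itself; the sketch below is hedged, since the MiqWorker column names are assumptions about the CFME schema.

  # Hedged sketch: column names are assumptions about the miq_workers table.
  # Snapshot current memory usage per worker, for comparison across
  # collector, processor, generic and priority workers.
  MiqWorker.order(:type).each do |w|
    puts format("%-70s pid=%-7s memory=%s", w.type, w.pid, w.memory_usage)
  end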

Comment 30 Archit Sharma 2017-08-07 09:23:52 UTC
Created attachment 1309976 [details]
cfme_individual_memleak_correlation_graphs over 3 days (RSS usage)

Comment 31 Archit Sharma 2017-08-07 09:25:23 UTC
Created attachment 1309977 [details]
overall look at memory leaks in workers

individual graphs in this screenshot have been attached separately in a previous attachment. [1] 

[1] - https://bugzilla.redhat.com/attachment.cgi?id=1309976

Comment 32 dmetzger 2017-08-07 17:45:17 UTC
Work is currently in progress on the metrics collector worker memory leak, tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1458392.

Closing this BZ as a duplicate of that ticket, given that work is already in progress under that BZ.

If there is data showing memory leaks in other workers as mentioned in comment #29, please open tickets for these workers with the associated data.

*** This bug has been marked as a duplicate of bug 1458392 ***

Comment 35 dmetzger 2017-08-09 18:32:00 UTC
Created attachment 1311339 [details]
PSS & RSS utilization - 4+ day test run

Worker Config:
    Single Generic Worker
    1.5 GB Memory Threshold

Provider:
    Clusters:      10
    Hosts:         50
    Datastores:    61
    VMs:        1,000
    Type:       VMware VC 5.5.0
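For readers comparing the two curves: PSS divides shared pages among the processes mapping them, while RSS counts them in full for each process. A hedged sketch of sampling both for a single worker PID by summing the per-mapping fields in /proc/<pid>/smaps:

  # Hedged sketch: sums the per-mapping Rss/Pss fields from /proc/<pid>/smaps
  # for one worker PID given on the command line.
  pid = ARGV.fetch(0) { abort "usage: #{$0} PID" }
  rss_kb = pss_kb = 0
  File.foreach("/proc/#{pid}/smaps") do |line|
    rss_kb += line[/\ARss:\s+(\d+)/, 1].to_i if line.start_with?("Rss:")
    pss_kb += line[/\APss:\s+(\d+)/, 1].to_i if line.start_with?("Pss:")
  end
  puts format("PID %s: RSS=%d kB PSS=%d kB", pid, rss_kb, pss_kb)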

Comment 36 dmetzger 2017-08-09 18:44:24 UTC
Correction to comment #35: the worker configuration was a single Metrics Collector Worker.

Comment 41 Nick LaMuro 2018-01-18 16:06:56 UTC
A possible fix has been proposed in this related BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1535720


That fix is targeted at the MiqServer, and we have high confidence that it will fix the leak there. Updates will probably happen more regularly on that BZ until we determine whether there is a separate leak in the MetricsCollectorWorker; there is a high probability this was a leak across all workers.

Comment 42 Nick LaMuro 2018-01-19 23:53:04 UTC
The fix above has been backported to 5.8:

https://bugzilla.redhat.com/show_bug.cgi?id=1536672

As well as for future releases here:

https://bugzilla.redhat.com/show_bug.cgi?id=1535720

We are going to do some testing ourselves to see whether this fixes the issue with the MetricsCollectorWorker as well, and will update with those results.

Comment 43 Nick LaMuro 2018-02-01 22:45:43 UTC
Update:

We are relatively sure that this leak will be resolved with the patch provided in https://bugzilla.redhat.com/show_bug.cgi?id=1535720 (or the respective backported version), so this might already be fixed.

That said, we are doing some final long-term comparisons in our test environments to confirm that systems which had the patch applied and showed no leak start leaking again once the patch is removed. We are confident this patch fixes the leak in MiqServer, but we want to be equally confident that the same holds for the other workers, and that there isn't another leak at play here.

Next update will be roughly in a week's time.

Comment 44 Nick LaMuro 2018-02-08 17:38:43 UTC
After testing on a pair of appliances for about a week, we are fairly confident that this fix has a substantial impact on the memory footprint of all the workers, including the MetricsCollectorWorker, as mentioned here.

Please retest with the changes in place, and if the issue persists, feel free to kick the ticket back so we can look into it further.

Comment 45 Einat Pacifici 2018-02-19 07:56:45 UTC
Retested with long-running metrics collection. No leaks seen. Verifying. However, if the issue still persists, we can recheck/retest as required.

Comment 46 Pradeep Kumar Surisetty 2018-04-16 08:28:40 UTC
Memory consumption with 5.9 is lower than in older releases.
Haven't noticed significant leaks. This can be closed.