Created attachment 1283390 [details]
collector memory graph for a week

Description of problem:
Over the last 4-5 days, the OCP environment was kept constant, but we had to increase the MetricsCollectorWorker memory threshold first to 500M, then 600M, and eventually 700M right now. (The data collectors are just keeping up with the rate of messages coming in.)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Single DB appliance with 5 worker appliances
2. Turn on C&U on all worker appliances and keep server roles to a minimum on the DB appliance
3. Connect to ~1900 pods in the OCP env and let it run for 4-5 days while keeping an eye on C&U data collector worker memory usage

Actual results:
MetricsCollectorWorker memory grows over time, even after raising the memory threshold a couple of times.

Expected results:
After increasing the memory threshold a couple of times, the queued message pipeline should have stabilized.

Additional info:
Attaching screenshot for reference.
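As context for step 3, memory growth like this can be tracked outside the appliance's own reporting by periodically sampling a worker PID's RSS from /proc. This is a minimal sketch, assuming a Linux appliance; the function names are illustrative and not part of CFME tooling:

```python
import os
import re
import time

def rss_kb(pid):
    """Read a process's resident set size (VmRSS) in kB from /proc (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(re.search(r"(\d+)", line).group(1))
    return None

def sample_rss(pid, interval_s=60, samples=5):
    """Collect periodic RSS readings; a steadily rising series suggests a leak."""
    readings = []
    for _ in range(samples):
        readings.append(rss_kb(pid))
        time.sleep(interval_s)
    return readings

if __name__ == "__main__":
    # Sample our own PID as a demo; point at a worker PID in practice.
    print(sample_rss(os.getpid(), interval_s=1, samples=3))
```

Plotting such samples over several days gives the kind of trend line shown in the attached graph.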
GreggT, this looks like good feedback from the Perf QE team about a possible memory leak in the C&U worker. Wondering if it's a general problem, or if there's a specific problem with the way that the OpenShift Metrics Collector works. Sending over to platform to see if you have any ideas off hand or have seen this in a more general setting.
(In reply to Greg Blomquist from comment #3) > Sending over to platform to see if you have any ideas off hand or have seen > this in a more general setting. Bronagh, is there any additional information on what was discovered, so we can assign it to CM?
On today's bug triage call the Performance team shared that they have long-running 5.7.2.1 and 5.8.0.11 appliances pointing to VC, and there are no memory leaks. Pradeep - can the QE performance team run performance tests against other providers to determine whether this is a CM problem or a general problem?
Created attachment 1308698 [details] Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for memory
(In reply to Archit Sharma from comment #23) > Created attachment 1308698 [details] > Metrics Collector Memory Usage over 2 days (RHEVM) - corrected units for > memory Thanks Himanshu, the ask was to have multiple providers to compare. The graph you provided is great, but we should see the trend for the other types of providers as well (OpenShift, RHV, OpenStack).
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set it to Low/Low.
Observed memory leaks not just in MetricsCollector, but also in MetricsProcessor, GenericWorker and PriorityWorker on a CFME 5.8.0.17 appliance setup (1 DB, 5 workers) connected to a 10k-VM VMware provider infrastructure. This ran for over 3 days, but the leaks occurred within the first day itself. So we now have verifiable data about the Metrics Collector leak, at least, from the following providers:
- VMware
- OpenShift
- RHVM
Attaching graphs showing the increase in RSS memory and its correlation to stable ems_metrics_processor/collector queues vs. total powered on/off VMs vs. DB size.
Created attachment 1309976 [details] cfme_individual_memleak_correlation_graphs over 3 days (RSS usage)
Created attachment 1309977 [details] overall look at memory leaks in workers The individual graphs in this screenshot have been attached separately in a previous attachment [1]. [1] - https://bugzilla.redhat.com/attachment.cgi?id=1309976
Work on the memory leak in the metrics collector worker is currently in progress under https://bugzilla.redhat.com/show_bug.cgi?id=1458392. Closing this BZ as a dupe of that ticket, given that work is in progress under that BZ. If there is data showing memory leaks in other workers as mentioned in comment #29, please open tickets for those workers with the associated data. *** This bug has been marked as a duplicate of bug 1458392 ***
Created attachment 1311339 [details]
PSS & RSS utilization - 4+ day test run

Worker Config: Single Generic Worker, 1.5GB Memory Threshold

Provider:
  Type: VMware VC 5.5.0
  Clusters: 10
  Hosts: 50
  Datastores: 61
  VMs: 1,000
Correction to comment #35: the worker configuration was a single Metrics Collector Worker.
A possible fix has been proposed in this related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1535720 That fix targets the MiqServer, and we have high confidence that it will fix the leak there. Updates will probably happen there more regularly until we determine whether there is a separate leak in the MetricsCollectorWorker, though there is a high probability this was a leak across all workers.
The fix above has been backported to 5.8: https://bugzilla.redhat.com/show_bug.cgi?id=1536672 As well as for future releases here: https://bugzilla.redhat.com/show_bug.cgi?id=1535720 We are going to do some testing ourselves to see whether this fixes the issue with the MetricsCollectorWorker as well, and will update with those results.
Update: We are relatively sure that this leak will be resolved by the patch provided in https://bugzilla.redhat.com/show_bug.cgi?id=1535720 (or the respective backported version), so this might already be fixed. That said, we are doing some final long-term comparisons with our test environments to confirm that the systems that had the patch applied and displayed no leak will start leaking once the patch is removed. We are confident this patch fixes the leak in MiqServer, but want to be confident in saying the same holds for the other workers as well, and that there isn't another leak at play here. The next update will be in roughly a week's time.
After testing on a pair of appliances for about a week, we are fairly confident that this change has a substantial impact on the memory footprint of all the workers, including the MetricsCollectorWorker, as mentioned here. Please retest with the changes in place, and if the issue persists, feel free to kick the ticket back so we can look into it further.
Retested with long-running metrics collection. No leaks seen. Verifying. However, if the issue still persists, we can recheck/retest as required.
Memory consumption with 5.9 is lower than in older releases. Haven't noticed any significant leaks. This can be closed.