Bug 1267697

Summary: Much higher memory usage in 5.5
Product: Red Hat CloudForms Management Engine
Component: Performance
Reporter: Alex Krzos <akrzos>
Assignee: Keenan Brock <kbrock>
QA Contact: Alex Krzos <akrzos>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 5.5.0
Target Milestone: Beta 2
Target Release: 5.5.0
Fixed In Version: 5.5.0.8
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-12-08 13:33:53 UTC
CC: apatters, cpelland, dajohnso, dmetzger, jhardy, jocarter, kbrock, mfeifer, nachandr, obarenbo, perfbz, simaishi

Description Alex Krzos 2015-09-30 17:24:59 UTC
Description of problem:
Running benchmarks that report timings and RSS memory usage for various refresh scenarios with CFME, and currently seeing 2-4x the memory usage for refresh benchmarks in 5.5 compared to 5.4.

Version-Release number of selected component (if applicable):
5.5.0.1

How reproducible:
Currently reproduced against RHEVM providers; more data will arrive today as the benchmarks continue to run.

Steps to Reproduce:
1. Measure RSS memory during a provider refresh on both 5.4 and 5.5 appliances and compare.

Using the Rails console, this can be demonstrated:

# Record RSS and GC count before the refresh
mrss_start = MiqProcess.processInfo()[:memory_usage]
gc_start = GC.count

# Time a full EMS refresh of the provider
e = ExtManagementSystem.find_by_name('rhevm-small')
timing = Benchmark.realtime { EmsRefresh.refresh e }

# Record RSS and GC count after the refresh and report the deltas
mrss_end = MiqProcess.processInfo()[:memory_usage]
gc_end = GC.count
mrss_change = mrss_end - mrss_start
gc_change = gc_end - gc_start
puts "#{mrss_start}, #{mrss_end}, #{mrss_change}"
puts "#{gc_start}, #{gc_end}, #{gc_change}"
puts Process.pid
timing

Actual results:

Currently available data on RHEVM providers shows:

5.5 Initial Refresh:
RHEVM small provider - Between 196MiB to 201MiB
RHEVM medium provider - Between 531MiB to 577MiB

5.4 Initial Refresh:
RHEVM small provider - Between 48MiB to 54MiB
RHEVM medium provider - Between 184MiB to 195MiB

Expected results:
I would not expect memory usage to increase by 2-4x between releases.

Additional info:
VMware providers should finish being benchmarked later tonight/tomorrow morning
QE suspects this affects all workers and features

Comment 2 Dave Johnson 2015-09-30 17:28:57 UTC
This is impacting QE test automation runs. Dennis, can we get someone to look into this soon please?  TIA

Comment 3 Alex Krzos 2015-09-30 20:12:20 UTC
* The memory measurement in this comment and in comment #0 is the difference in RSS from immediately after the Rails console starts to when the benchmark completes (mrss_change in the benchmark code).  This is less than the total memory used, since it excludes the overhead of spawning the Rails console itself.
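
(For reference, the bookkeeping above can be wrapped in a small console helper. This is only a sketch; measure_rss is a hypothetical name, and the MiqProcess/GC calls are the same ones used in the reproduction steps in comment #0.)

# Hypothetical helper mirroring the measurement from comment #0: it returns the
# block's wall-clock time and prints the RSS and GC-count deltas for this process.
def measure_rss
  mrss_start = MiqProcess.processInfo()[:memory_usage]
  gc_start   = GC.count
  timing     = Benchmark.realtime { yield }
  mrss_end   = MiqProcess.processInfo()[:memory_usage]
  gc_end     = GC.count
  puts "RSS: #{mrss_start} -> #{mrss_end} (change #{mrss_end - mrss_start})"
  puts "GC runs: #{gc_end - gc_start}, pid: #{Process.pid}"
  timing
end

# Same refresh benchmark as comment #0, expressed with the helper
e = ExtManagementSystem.find_by_name('rhevm-small')
measure_rss { EmsRefresh.refresh e }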

Additional tests show that larger-scale RHEVM providers and VMware providers are also affected.

5.5 Initial Refresh:
RHEVM large provider - Between 1095MiB to 1216MiB

VMware small provider - Between 93MiB to 94MiB
VMware medium provider - Between 376MiB to 395MiB
VMware large provider - ~1248MiB (Only one sample at this time)

5.4 Initial Refresh:
RHEVM large provider - Between 606MiB to 610MiB

VMware small provider - Between 58MiB to 63MiB
VMware medium provider - Between 227MiB to 233MiB
VMware large provider - Between 551MiB to 607MiB


* In addition to the EmsRefresh memory bloat, VMware providers have a VIMBroker Worker which might also bloat in memory usage.

Comment 4 Alex Krzos 2015-10-01 12:06:03 UTC
Capacity and Utilization Benchmarks show that RSS Memory Utilization is also higher for VMware VM/Host perf_captures and RHEVM/VMware perf_capture_timer.  That means this memory growth will affect more than the refresh worker/feature.  (Adjusting the BZ title to match)
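
(For reference, a rough console equivalent of these benchmarks using the helper sketched in comment 3. The VM name, the perf_capture('realtime') call, and the Metric::Capture.perf_capture_timer entry point are assumptions about how the capture code can be driven from the console, not the exact benchmark harness.)

# Single-VM capture sample (illustrative names; assumes realtime capture can be invoked directly)
vm = Vm.find_by_name('some-vm')
measure_rss { vm.perf_capture('realtime') }

# Region-wide capture scheduling, as exercised by the perf_capture_timer benchmark
measure_rss { Metric::Capture.perf_capture_timer }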

VM.perf_capture

5.5 99%ile of 4 samples:
RHEVM small provider - 9MiB
RHEVM medium provider - 11MiB
RHEVM large provider - 19MiB

VMware small provider - 72MiB
VMware medium provider - 78MiB
VMware large provider - 72MiB

5.4 99%ile of 4 samples:
RHEVM small provider - 10MiB
RHEVM medium provider - 14MiB
RHEVM large provider - 19MiB

VMware small provider - 35MiB
VMware medium provider - 36MiB
VMware large provider - 44MiB


Host.perf_capture

5.5 99%ile of 4 samples:
RHEVM small provider - 9.2MiB
RHEVM medium provider - 9.8MiB
RHEVM large provider - 9.8MiB

VMware small provider - 81.2MiB
VMware medium provider - 87.3MiB
VMware large provider - 83.1MiB

5.4 99%ile of 4 samples:
RHEVM small provider - 10.0MiB
RHEVM medium provider - 10.4MiB
RHEVM large provider - 10.5MiB

VMware small provider - 40.4MiB
VMware medium provider - 35.0MiB
VMware large provider - 39.7MiB


* While some of the RHEVM perf_capture tests show memory growth, it is much harder to measure timing/memory values with RHEVM perf_captures due to https://bugzilla.redhat.com/show_bug.cgi?id=1085988. During these measurements the simulators had dwhd stopped, so there should not have been any data to collect from RHEVM; however, in some cases there is still apparent memory growth.


perf_capture_timer

5.5 99%ile of 4 samples:
RHEVM small provider - 21.5MiB
RHEVM medium provider - 43.5MiB
RHEVM large provider - 95.1MiB

VMware small provider - 34.8MiB
VMware medium provider - 110.3MiB
VMware large provider - 143.3MiB

5.4 99%ile of 4 samples:
RHEVM small provider - 21.8MiB
RHEVM medium provider - 120.3MiB
RHEVM large provider - 237.2MiB

VMware small provider - 23.7MiB
VMware medium provider - 41.3MiB
VMware large provider - 84.2MiB

Comment 5 Alex Krzos 2015-10-01 18:42:39 UTC
Correction to Comment 4:

The RHEVM provider memory utilization figures were reversed between 5.4 and 5.5.  Below is the correct data for the RHEVM perf_capture_timer benchmarks.  RHEVM providers have higher memory utilization during this benchmark in the 5.5 alpha.


perf_capture_timer

5.5 99%ile of 4 samples:
RHEVM small provider - 21.8MiB
RHEVM medium provider - 120.3MiB
RHEVM large provider - 237.2MiB


5.4 99%ile of 4 samples:
RHEVM small provider - 21.5MiB
RHEVM medium provider - 43.5MiB
RHEVM large provider - 95.1MiB

Comment 6 Dave Johnson 2015-10-05 14:41:54 UTC
*** Bug 1267695 has been marked as a duplicate of this bug. ***

Comment 7 Alex Krzos 2015-10-05 14:58:39 UTC
In order to characterize this issue against CFME 5.4, I deployed a 5.4.3.0 appliance and a 5.5.0.3 appliance, managed the same provider with each, and captured worker RSS/virtual memory usage over 20 minutes:

The environment managed was a medium-sized VMware environment consisting of 1000 VMs, 50 hosts, and 61 datastores.

The only applied workload was to add the provider and allow CFME to inventory it.  evmserverd was then restarted and the memory utilization was tracked for 20 minutes.
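
(For reference, a console loop along the following lines can take the per-worker samples. This is only a sketch: MiqProcess.processInfo is assumed to accept a worker pid, :memory_size is assumed to be the virtual size, and the one-minute interval is illustrative.)

# Illustrative sampling loop: print RSS/virtual size for each started worker, once a minute for 20 minutes
20.times do
  MiqWorker.where(:status => "started").each do |w|
    info = MiqProcess.processInfo(w.pid)
    puts "#{Time.now.utc} #{w.type} pid=#{w.pid} rss=#{info[:memory_usage]} virt=#{info[:memory_size]}"
  end
  sleep 60
end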

The additional memory used by the 5.5 appliance, by worker:

115MiB More for Refresh Worker
52MiB More for MiqVimBrokerWorker
102MiB More for MiqEmsRefreshCoreWorker
~80MiB More for MiqGenericWorker (2x)
~50MiB More for MiqPriorityWorker (2x)
43MiB More for MiqScheduleWorker
48MiB More for MiqUiWorker
46MiB More for MiqWebServiceWorker
~48MiB More for MiqReportingWorker (2x)
44MiB More for MiqEventHandler
73MiB More for Event Catcher Worker

+ 169MiB for MiqAutomateWorker (2x)

This totals an additional 1217MiB to manage the same-sized provider in 5.5.

Comment 9 Alex Krzos 2015-10-05 19:42:33 UTC
Performing the same sequence, only this time with C&U collection turned on for the entire region, results in even greater memory usage over 5.4:

Significantly changed workers:

95MiB More for MiqVimBrokerWorker
~46MiB More for MiqGenericWorker (2x)
~93MiB More for MiqPriorityWorker (2x)
79MiB More for MiqUiWorker
~148MiB More for Collector Worker (2x)
~107MiB More for MiqEmsMetricsProcessorWorker (2x)


112MiB More for Refresh Worker
93MiB More for MiqEmsRefreshCoreWorker
50.9MiB More for MiqScheduleWorker
48MiB More for MiqWebServiceWorker
~47MiB More for MiqReportingWorker (2x)
43MiB More for MiqEventHandler
72MiB More for Event Catcher Worker
+ 169MiB for MiqAutomateWorker (2x)


This totals an additional 1813MiB to manage and collect metrics on the same-sized provider in 5.5.

Comment 10 dmetzger 2015-11-02 18:17:00 UTC
There were numerous commits over the past couple of weeks, all relating to reduction of appliance memory. Current memory utilization supports Small/Medium environments with the desired 6GB memory configuration. Therefore, this ticket is being closed; however, development will continue to monitor and evaluate the application memory footprint closely.

Comment 11 Keenan Brock 2015-11-05 15:35:34 UTC
This has been addressed and merged.

Please open BZs for any specific memory issues for the rest of this release.

Comment 12 Alex Krzos 2015-12-04 19:28:03 UTC
Fixed in 5.5.0.12.  

As stated by Keenan, individual BZs addressing memory usage will be opened as needed based on further analysis.  Many fixes were applied to reduce the memory footprint of 5.5 or to accommodate it.

These include:
Removal of Automate Workers
GC Tuning to reduce and cap rate of memory growth
Reduced Vim Broker Worker memory usage
Default memory of appliance was raised to 8GiB

With the above fixes, I can now manage Small (100 total VMs, 50 online) and Medium (1000 total VMs, 500 online) VMware environments with a default appliance configuration and with Capacity and Utilization turned on.
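
(As a rough sanity check of the GC tuning item above, the tunables exported to a worker and the current heap statistics can be inspected from the Rails console. The variable names below are the standard MRI GC tunables; which ones the appliance actually sets is not listed in this BZ, and GC.stat key names vary slightly across Ruby versions.)

# Show which standard MRI GC tunables are set, plus a few heap statistics
%w[RUBY_GC_HEAP_GROWTH_FACTOR RUBY_GC_HEAP_GROWTH_MAX_SLOTS
   RUBY_GC_MALLOC_LIMIT_MAX RUBY_GC_OLDMALLOC_LIMIT_MAX].each do |var|
  puts "#{var}=#{ENV[var] || '(unset)'}"
end
puts "GC count=#{GC.stat[:count]} live_slots=#{GC.stat[:heap_live_slots]} total_allocated=#{GC.stat[:total_allocated_objects]}"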

Comment 14 errata-xmlrpc 2015-12-08 13:33:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551