Bug 1281855

Summary: MiqEmsRefreshCoreWorker exits on xlarge vmware provider on 5.5
Product: Red Hat CloudForms Management Engine Reporter: Alex Krzos <akrzos>
Component: Performance Assignee: Adam Grare <agrare>
Status: CLOSED NOTABUG QA Contact: Pradeep Kumar Surisetty <psuriset>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.5.0 CC: arcsharm, cpelland, dmetzger, hroy, jhardy, obarenbo
Target Milestone: GA   
Target Release: 5.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: perf
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-16 20:50:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
5.4.3.1 - MiqEmsRefreshCoreWorker Memory usage none
5.5.0.9 where MiqEmsRefreshCoreWorker only exceeds threshold during refresh none
console output showing MiqEmsRefreshCoreWorker exit storm on 5.5.0.10 none
5.8 VimBroker and RefreshCore Worker Memory Usage none
Raw data of VimBroker and RefreshCore Worker Memory Usage none

Description Alex Krzos 2015-11-13 15:40:47 UTC
Description of problem:
Connected CFME to an xlarge vmware provider, and the MiqEmsRefreshCoreWorker exits because it exceeds its default memory threshold of 400MiB RSS.  A 5.4.3.1 appliance does not exhibit this behavior when connected to the same sized provider.  On a 5.4.3.1 appliance the MiqEmsRefreshCoreWorker appears to use ~20MiB less memory than on the 5.5.0.9/10 appliances.  Additionally, the constant recycling of this worker consumes an excessive amount of CPU.

xlarge vmware provider is:
10,000 VMs, 5,000 online VMs
200 hosts
221 data stores
20 clusters
200 resource pools

Version-Release number of selected component (if applicable):
5.5.0.9
5.5.0.10

How reproducible:
Inconsistent between runs.  

I have witnessed three different behaviors:
1. MiqEmsRefreshCoreWorker exceeds its memory threshold during refresh and consistently thereafter, resulting in wasted CPU resources.
2. MiqEmsRefreshCoreWorker exceeds its memory threshold during a refresh and therefore exits repeatedly, but once the refresh is complete the last started worker stabilizes a hair below the memory threshold, at ~398MiB RSS.
3. MiqEmsRefreshCoreWorker does not exceed its memory threshold even during a refresh operation.

Scenario 1 is the worst case, witnessed on 5.5.0.10.  Scenario 3 is the ideal situation and is what occurs on 5.4.3.1, and has been witnessed on 5.5.0.10 once.
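
The three behaviors above can be told apart mechanically from memory samples. The following is only a rough sketch of that classification, not anything taken from the appliance: the function name `classify_scenario` is hypothetical, and the 430MiB "during refresh" sample is a made-up value for illustration. The 400MiB threshold and the 416, 398 and 378MiB figures are the ones quoted in this report.

```python
THRESHOLD_MIB = 400  # default MiqEmsRefreshCoreWorker threshold quoted in this report


def classify_scenario(during_refresh_mib, after_refresh_mib, threshold_mib=THRESHOLD_MIB):
    """Map RSS samples (MiB) onto the scenario numbers 1-3 described in this report.

    Hypothetical helper for illustration only.
    """
    exceeded_during = max(during_refresh_mib) > threshold_mib
    exceeded_after = max(after_refresh_mib) > threshold_mib
    if exceeded_after:
        # Worker keeps breaching the threshold after refresh: the "exit storm" case.
        return 1
    if exceeded_during:
        # Breaches only while the refresh runs, then settles below the threshold.
        return 2
    # Never breaches the threshold.
    return 3


# 430 is an invented "during refresh" sample; the other values come from this report.
print(classify_scenario([430], [416]))  # settles at 416MiB, above the threshold
print(classify_scenario([430], [398]))  # settles at ~398MiB, just below the threshold
print(classify_scenario([378], [378]))  # peaks at 378MiB, never breaches
```

Note that scenarios 2 and 3 differ by only a few MiB in this environment, which is consistent with the behavior being inconsistent between runs.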


Steps to Reproduce:
1.
2.
3.

Actual results:
With large scale vmware providers MiqEmsRefreshCoreWorker will sometimes exceed its 400MiB RSS memory threshold, causing a "worker exit storm" that consumes a tremendous amount of CPU on an appliance.  There is no notification to the end user or administrator that this is occurring; it can only be detected by viewing the logs or by noticing sawtooth memory usage or high CPU usage on the appliance itself.

Expected results:
MiqEmsRefreshCoreWorker should not strain the appliance's resources.  Possible solutions to this issue are:
1. Release note or instructions to raise the MiqEmsRefreshCoreWorker default memory threshold when connecting to xlarge vmware sized providers.
2. Reduce the memory footprint of MiqEmsRefreshCoreWorker.
3. Raise the default memory threshold of MiqEmsRefreshCoreWorker in the configuration.  (There are other workers that exceed their memory defaults when connected to this sized environment, so this is the least desirable solution IMO.)

Additional info:
In the instance where a MiqEmsRefreshCoreWorker was consistently exiting and restarting even after an initial refresh, I tuned the worker memory threshold to 500MiB.  The worker ended up stabilizing at 416MiB RSS memory.  This is odd, as in other memory baseline tests I have seen the worker stabilize at ~395-398MiB RSS and only breach the threshold during the initial refresh operation.  The memory utilization appears to be inconsistent and right at the border of its memory threshold with this sized environment.

Comment 2 Alex Krzos 2015-11-13 15:43:41 UTC
Created attachment 1093711 [details]
5.4.3.1 - MiqEmsRefreshCoreWorker Memory usage

Ideal situation where memory usage of this worker never exceeds its defined threshold.  RSS Memory usage peaks at 378MiB which is below the appliance's default of 400MiB.

Comment 3 Alex Krzos 2015-11-13 15:46:54 UTC
Created attachment 1093713 [details]
5.5.0.9 where MiqEmsRefreshCoreWorker only exceeds threshold during refresh

This memory graph shows MiqEmsRefreshCoreWorker exiting during refresh, and then stabilizing at 395MiB RSS Memory usage after refresh has completed.

Comment 4 Alex Krzos 2015-11-13 15:54:07 UTC
Created attachment 1093716 [details]
console output showing MiqEmsRefreshCoreWorker exit storm on 5.5.0.10

This is the worst-case scenario, where even after refresh the worker does not settle below 400MiB RSS memory usage.  Note the high user CPU usage at 50-70% due to 2-3 MiqEmsRefreshCoreWorkers starting/exiting concurrently.

Comment 5 Adam Grare 2017-02-16 20:45:14 UTC
With the worker steady state memory around 380-390MiB on 5.4 and increasing to 416MiB on 5.5 I'd argue that an increase in baseline worker memory usage between 5.4 and 5.5 pushed this worker just over the top.

In 5.6 we started using PSS for the memory threshold (https://github.com/ManageIQ/manageiq/commit/6583411f3d4634f54db0e404318e0ea594726ce5) and from what I'm seeing this worker is consistently under 400MiB of PSS.  With 10240VMs I averaged 300.39MiB PSS over 3 runs for the MiqEmsRefreshCoreWorker.

I'll attach the csv and graph of memory usage for the broker and the refresh core workers with 512, 1024, 2048, 4096, 8192, and 10240 VMs.
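
For anyone comparing the two metrics: RSS counts every resident page a process maps, while PSS divides each shared page among the processes sharing it, so PSS is never larger than RSS. As a minimal sketch (not the code from the commit linked above, and not how the numbers in the attachments were collected), both values can be summed from a Linux `/proc/<pid>/smaps` file. The function names here are hypothetical.

```python
import re

# Matches the per-mapping "Rss:" and "Pss:" lines of /proc/<pid>/smaps,
# but not variants such as "Pss_Anon:" that newer kernels also print.
_FIELD_RE = re.compile(r"^(Rss|Pss):\s+(\d+)\s+kB")


def sum_smaps_kib(smaps_text):
    """Sum the Rss and Pss fields (in kiB) over all mappings in smaps text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        match = _FIELD_RE.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


def read_process_memory_mib(pid="self"):
    """Return {'Rss': ..., 'Pss': ...} in MiB for a process (Linux only)."""
    with open(f"/proc/{pid}/smaps") as smaps_file:
        totals = sum_smaps_kib(smaps_file.read())
    return {name: kib / 1024 for name, kib in totals.items()}


if __name__ == "__main__":
    # Reading another process's smaps may require elevated privileges.
    print(read_process_memory_mib())
```

Workers that share pages with other processes will report a PSS below their RSS, which would explain a worker sitting above 400MiB RSS while staying under 400MiB PSS.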

Comment 6 Adam Grare 2017-02-16 20:46:28 UTC
Created attachment 1250943 [details]
5.8 VimBroker and RefreshCore Worker Memory Usage

Graph of VimBroker and RefreshCore Workers RSS&PSS memory usage from 512VMs to 10240VMs

Comment 7 Adam Grare 2017-02-16 20:47:40 UTC
Created attachment 1250945 [details]
Raw data of VimBroker and RefreshCore Worker Memory Usage

Here is the raw data (CSV) from the refresh results used to create the graph.

Comment 8 Adam Grare 2017-02-16 20:50:59 UTC
Seeing as worker memory usage with an "XL VC" (more than 10,000 VMs) is safely under the threshold as of 5.6, I'm going to mark this as not a bug as of the 5.6 release, because the threshold is now checked against PSS rather than RSS memory.