Red Hat Bugzilla – Bug 1281855
MiqEmsRefreshCoreWorker exits on xlarge vmware provider on 5.5
Last modified: 2017-12-05 10:57:20 EST
Description of problem:
Connected CFME to an xlarge vmware provider and the MiqEmsRefreshCoreWorker exits due to exceeding its default memory threshold of 400MiB RSS. A 188.8.131.52 appliance does not exhibit this behavior when connected to the same sized provider. A 184.108.40.206 appliance appears to use ~20MiB less memory for the MiqEmsRefreshCoreWorker than the 220.127.116.11/10 appliances. Additionally, the constant recycling of this worker consumes an excessive amount of CPU.
xlarge vmware provider is:
10,000 VMs, 5,000 online VMs
221 data stores
200 resource pools
Version-Release number of selected component (if applicable):
How reproducible:
Inconsistent between runs.
I have witnessed three different behaviors:
1. MiqEmsRefreshCoreWorker exceeds its memory threshold during refresh and consistently thereafter, resulting in wasted CPU resources.
2. MiqEmsRefreshCoreWorker exceeds its memory threshold during a refresh and exits repeatedly, but once the refresh is complete the last started worker stabilizes a hair below the memory threshold, at ~398MiB RSS.
3. MiqEmsRefreshCoreWorker does not exceed its memory threshold even during a refresh operation.
Scenario 1 is the worst case, witnessed on 18.104.22.168. Scenario 3 is the ideal situation and is what occurs on 22.214.171.124, and has been witnessed on 126.96.36.199 once.
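The three behaviors above can be told apart mechanically from memory samples. The following is a minimal sketch, not anything CFME ships: the function name `classify_run` and the idea of splitting the samples into a "during refresh" list and an "after refresh" list are assumptions made for illustration. Only the 400MiB default threshold comes from this report.

```python
# Hypothetical helper: classify a run into scenario 1, 2, or 3 as listed above.
THRESHOLD_MIB = 400  # default RSS threshold cited in this report


def classify_run(refresh_rss_mib, steady_rss_mib, threshold_mib=THRESHOLD_MIB):
    """Return 1, 2, or 3 to match the scenarios in this report.

    refresh_rss_mib: RSS samples (MiB) taken while the refresh was running.
    steady_rss_mib:  RSS samples (MiB) taken after the refresh completed.
    """
    breached_during = any(s > threshold_mib for s in refresh_rss_mib)
    breached_after = any(s > threshold_mib for s in steady_rss_mib)
    if breached_after:
        return 1  # still exceeds the threshold after refresh
    if breached_during:
        return 2  # exceeds only during refresh, then settles below
    return 3      # never exceeds the threshold


if __name__ == "__main__":
    # Sample values are made up for illustration only.
    print(classify_run([390, 410], [395, 398]))
```

Samples would need to be collected per worker role rather than per PID, since a restarted worker comes back with a new PID.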
Steps to Reproduce:
With large scale vmware providers, MiqEmsRefreshCoreWorker will sometimes exceed its 400MiB RSS memory threshold, causing a "worker exit storm" that consumes a tremendous amount of CPU on the appliance. Nothing notifies the end user or administrator that this is occurring; it is only visible by viewing the logs or by noticing sawtooth memory usage or high CPU usage on the appliance itself.
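Since the only visible symptom is a sawtooth in the memory graph, one way to flag an exit storm is to count sharp drops that immediately follow a threshold breach. This is a rough sketch under stated assumptions: `count_suspected_exits` and its `drop_fraction` parameter are hypothetical, and the input is assumed to be a time-ordered series of RSS samples (MiB) for the worker role, spanning restarts.

```python
# Hypothetical detector for the sawtooth pattern described above.
def count_suspected_exits(rss_mib, threshold_mib=400, drop_fraction=0.5):
    """Count sample pairs where RSS was over the threshold and then
    fell below drop_fraction of its previous value, which suggests the
    worker exited and a fresh one started."""
    exits = 0
    for prev, cur in zip(rss_mib, rss_mib[1:]):
        if prev > threshold_mib and cur < prev * drop_fraction:
            exits += 1
    return exits


if __name__ == "__main__":
    # Made-up samples: two breaches, each followed by a sharp drop.
    samples = [120, 300, 405, 90, 280, 410, 100, 395]
    print(count_suspected_exits(samples))
```

A count that keeps growing after the initial refresh has finished would correspond to the worst case described in this report.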
MiqEmsRefreshCoreWorker should not strain the appliance's resources. Possible solutions to this issue are:
1. Release note or instructions to raise the MiqEmsRefreshCoreWorker default memory threshold when connecting to xlarge vmware sized providers.
2. Reduce the memory footprint of MiqEmsRefreshCoreWorker.
3. Raise the default memory threshold of MiqEmsRefreshCoreWorker in the configuration. (There are other workers that exceed their memory defaults when connected to this sized environment, so this is the least desirable solution IMO.)
In the instance where a MiqEmsRefreshCoreWorker was consistently exiting and restarting even after an initial refresh, I tuned the worker memory threshold to 500MiB. The worker ended up stabilizing at 416MiB RSS. This is odd, as in other memory baseline tests I have seen the worker stabilize at ~395-398MiB RSS, breaching the threshold only during the initial refresh operation. The memory utilization appears to be inconsistent and right at the border of its memory threshold with this sized environment.
Created attachment 1093711
188.8.131.52 - MiqEmsRefreshCoreWorker Memory usage
Ideal situation where memory usage of this worker never exceeds its defined threshold. RSS Memory usage peaks at 378MiB which is below the appliance's default of 400MiB.
Created attachment 1093713
184.108.40.206 where MiqEmsRefreshCoreWorker only exceeds threshold during refresh
This memory graph shows MiqEmsRefreshCoreWorker exiting during refresh, and then stabilizing at 395MiB RSS Memory usage after refresh has completed.
Created attachment 1093716
console output showing MiqEmsRefreshCoreWorker exit storm on 220.127.116.11
This is the worst-case scenario, where even after refresh the worker does not settle below 400MiB RSS memory usage. Note the high user CPU usage of 50-70% due to 2-3 MiqEmsRefreshCoreWorkers starting and exiting concurrently.
With the worker's steady-state memory around 380-390MiB on 5.4 and increasing to 416MiB on 5.5, I'd argue that an increase in baseline worker memory usage between 5.4 and 5.5 pushed this worker just over the top.
In 5.6 we started using PSS for the memory threshold (https://github.com/ManageIQ/manageiq/commit/6583411f3d4634f54db0e404318e0ea594726ce5) and from what I'm seeing this worker is consistently under 400MiB of PSS. With 10240VMs I averaged 300.39MiB PSS over 3 runs for the MiqEmsRefreshCoreWorker.
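For anyone comparing the two metrics: RSS counts every resident page a process maps, while PSS divides each shared page among the processes sharing it, so PSS is never larger than RSS for the same process. On Linux both figures are exposed per mapping in /proc/<pid>/smaps. The sketch below sums them for one process; it is an illustration only and not how the linked commit implements the check. The function names `sum_smaps_kib` and `read_memory_mib` are made up for this example.

```python
# Illustrative only: sum the Rss: and Pss: fields from /proc/<pid>/smaps.
def sum_smaps_kib(smaps_text):
    """Sum the 'Rss:' and 'Pss:' fields (reported in kB) across all
    mappings in the given smaps text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        key, sep, rest = line.partition(":")
        # Only exact 'Rss' and 'Pss' keys are counted; other fields and
        # mapping header lines are skipped.
        if sep and key in totals:
            totals[key] += int(rest.split()[0])
    return totals


def read_memory_mib(pid):
    """Return {'Rss': ..., 'Pss': ...} in MiB for a live process (Linux only)."""
    with open(f"/proc/{pid}/smaps") as f:
        totals = sum_smaps_kib(f.read())
    return {name: kib / 1024 for name, kib in totals.items()}
```

Calling `read_memory_mib` with the worker's PID on the appliance would show how far apart the two values are for a given worker; reading another user's smaps file typically requires root.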
I'll attach the csv and graph of memory usage for the broker and the refresh core workers with 512, 1024, 2048, 4096, 8192, and 10240 VMs.
Created attachment 1250943
5.8 VimBroker and RefreshCore Worker Memory Usage
Graph of VimBroker and RefreshCore worker RSS and PSS memory usage from 512 VMs to 10240 VMs
Created attachment 1250945
Raw data of VimBroker and RefreshCore Worker Memory Usage
Here is the raw data (CSV) from the refresh results used to create the graph.
Seeing as the worker memory usage with an "XL VC" (more than 10,000 VMs) is safely under the threshold as of 5.6, I'm going to mark this as not a bug as of the 5.6 release, because the threshold is now checked against PSS rather than RSS memory.