1281855 – MiqEmsRefreshCoreWorker exits on xlarge vmware provider on 5.5

Summary: MiqEmsRefreshCoreWorker exits on xlarge vmware provider on 5.5

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Performance
Sub Component:
Version:	5.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	GA
Target Release:	5.8.0
Assignee:	Adam Grare
QA Contact:	Pradeep Kumar Surisetty
Docs Contact:
URL:
Whiteboard:	perf
Depends On:
Blocks:

Reported:	2015-11-13 15:40 UTC by Alex Krzos
Modified:	2017-12-05 15:57 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-02-16 20:50:59 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments
5.4.3.1 - MiqEmsRefreshCoreWorker Memory usage (61.27 KB, image/png) 2015-11-13 15:43 UTC, Alex Krzos
5.5.0.9 where MiqEmsRefreshCoreWorker only exceeds threshold during refresh (318.00 KB, image/png) 2015-11-13 15:46 UTC, Alex Krzos
console output showing MiqEmsRefreshCoreWorker exit storm on 5.5.0.10 (8.00 KB, text/plain) 2015-11-13 15:54 UTC, Alex Krzos
5.8 VimBroker and RefreshCore Worker Memory Usage (39.24 KB, image/jpeg) 2017-02-16 20:46 UTC, Adam Grare
Raw data of VimBroker and RefreshCore Worker Memory Usage (869 bytes, text/plain) 2017-02-16 20:47 UTC, Adam Grare

Description Alex Krzos 2015-11-13 15:40:47 UTC

Description of problem:
Connected CFME to an xlarge vmware provider, and the MiqEmsRefreshCoreWorker exits because it exceeds its default memory threshold of 400MiB RSS. A 5.4.3.1 appliance does not exhibit this behavior when connected to the same sized provider. On a 5.4.3.1 appliance the MiqEmsRefreshCoreWorker appears to use ~20MiB less memory than on the 5.5.0.9/10 appliances. Additionally, the constant recycling of this worker consumes an excessive amount of CPU.

xlarge vmware provider is:
10,000 VMs, 5,000 online VMs
200 hosts
221 data stores
20 clusters
200 resource pools

Version-Release number of selected component (if applicable):
5.5.0.9
5.5.0.10

How reproducible:
Inconsistent between runs.

I have witnessed three different behaviors:
1. MiqEmsRefreshCoreWorker exceeds its memory threshold during refresh and consistently thereafter, resulting in wasted CPU resources.
2. MiqEmsRefreshCoreWorker exceeds its memory threshold during a refresh and therefore exits repeatedly, but once the refresh is complete the last started worker stabilizes a hair below the memory threshold, at ~398MiB RSS.
3. MiqEmsRefreshCoreWorker does not exceed its memory threshold even during a refresh operation.

Scenario 1 is the worst case, witnessed on 5.5.0.10. Scenario 3 is the ideal situation and is what occurs on 5.4.3.1, and has been witnessed on 5.5.0.10 once.
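For anyone comparing memory graphs across runs, the three behaviors above can be told apart mechanically. The following is only a minimal sketch, not appliance code: the function name `classify_run`, the sample lists, and the strict "greater than" comparison are all assumptions made for illustration. The 400MiB default is the figure quoted in this report.

```python
# Hypothetical helper for sorting a test run into one of the three
# behaviors described in this report. Not part of CFME.

THRESHOLD_MIB = 400  # default worker memory threshold quoted in this report


def classify_run(during_refresh_mib, after_refresh_mib, threshold_mib=THRESHOLD_MIB):
    """Return 1, 2, or 3 to match the scenarios listed above.

    during_refresh_mib: memory samples (MiB) taken while a refresh is running
    after_refresh_mib:  memory samples (MiB) taken after the refresh completes
    """
    # Assumption: a worker is considered over the limit only when a sample
    # is strictly greater than the threshold.
    exceeds_during = any(s > threshold_mib for s in during_refresh_mib)
    exceeds_after = any(s > threshold_mib for s in after_refresh_mib)

    if exceeds_after:
        # The worker keeps breaching after refresh: the exit storm case.
        return 1
    if exceeds_during:
        # Breaches only while refreshing, then settles under the limit.
        return 2
    # Never breaches.
    return 3
```

With the figures from this report, samples of 416MiB after refresh fall into scenario 1, a run that settles at 398MiB after breaching during refresh falls into scenario 2, and a run peaking at 378MiB falls into scenario 3.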

Steps to Reproduce:
1.
2.
3.

Actual results:
With large scale vmware providers, MiqEmsRefreshCoreWorker will sometimes exceed its 400MiB RSS Memory threshold, causing a "worker exit storm" that consumes a tremendous amount of the appliance's CPU. Nothing notifies the end user or administrator that this is occurring; it is only visible by viewing the logs or by noticing sawtooth memory usage or high CPU usage on the appliance itself.

Expected results:
MiqEmsRefreshCoreWorker should not strain the appliance's resources. Possible solutions to this issue are:
1. Add a release note or instructions to raise the MiqEmsRefreshCoreWorker default memory threshold when connecting to xlarge vmware sized providers.
2. Reduce the memory footprint of MiqEmsRefreshCoreWorker.
3. Raise the default memory threshold of MiqEmsRefreshCoreWorker in the configuration. (There are other workers that exceed their memory defaults when connected to this sized environment, so this is the least desirable solution IMO.)

Additional info:
In the instance where a MiqEmsRefreshCoreWorker was consistently exiting and restarting even after an initial refresh, I tuned the worker memory threshold to 500MiB. The worker ended up stabilizing at 416MiB RSS Memory. This is odd, as in other Memory Baseline tests I have seen the worker stabilize at ~395-398MiB RSS, only breaching the threshold during the initial refresh operation. The memory utilization appears to be inconsistent and right at the border of its memory threshold with this sized environment.

Comment 2 Alex Krzos 2015-11-13 15:43:41 UTC

Created attachment 1093711
5.4.3.1 - MiqEmsRefreshCoreWorker Memory usage

Ideal situation where memory usage of this worker never exceeds its defined threshold.  RSS Memory usage peaks at 378MiB which is below the appliance's default of 400MiB.

Comment 3 Alex Krzos 2015-11-13 15:46:54 UTC

Created attachment 1093713
5.5.0.9 where MiqEmsRefreshCoreWorker only exceeds threshold during refresh

This memory graph shows MiqEmsRefreshCoreWorker exiting during refresh, and then stabilizing at 395MiB RSS Memory usage after refresh has completed.

Comment 4 Alex Krzos 2015-11-13 15:54:07 UTC

Created attachment 1093716
console output showing MiqEmsRefreshCoreWorker exit storm on 5.5.0.10

This is the worst case scenario, where even after refresh the worker does not settle below 400MiB RSS Memory usage.  Note the high user cpu usage at 50-70% due to 2-3 MiqEmsRefreshCoreWorkers starting/exiting concurrently.

Comment 5 Adam Grare 2017-02-16 20:45:14 UTC

With the worker steady state memory around 380-390MiB on 5.4 and increasing to 416MiB on 5.5, I'd argue that an increase in baseline worker memory usage between 5.4 and 5.5 pushed this worker just over the top.

In 5.6 we started using PSS for the memory threshold (https://github.com/ManageIQ/manageiq/commit/6583411f3d4634f54db0e404318e0ea594726ce5) and from what I'm seeing this worker is consistently under 400MiB of PSS.  With 10240VMs I averaged 300.39MiB PSS over 3 runs for the MiqEmsRefreshCoreWorker.

I'll attach the csv and graph of memory usage for the broker and the refresh core workers with 512, 1024, 2048, 4096, 8192, and 10240 VMs.
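To make the RSS versus PSS distinction above concrete: RSS counts every resident page of a process in full, while PSS divides each shared page by the number of processes sharing it, so a forked worker that shares pages with its parent reports a lower PSS than RSS. The following is only a rough sketch of how the two totals can be read on Linux, assuming the standard /proc/<pid>/smaps layout. It is not the code from the linked commit, and the names `sum_smaps_kib`, `read_smaps`, and `SAMPLE` are hypothetical.

```python
# Hypothetical illustration: total RSS and PSS for a process from smaps text.
# Assumes Linux /proc/<pid>/smaps lines of the form "Rss:   40 kB".


def sum_smaps_kib(smaps_text):
    """Return (rss_kib, pss_kib) summed over every mapping in smaps_text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        key, _, rest = line.partition(":")
        # Only exact "Rss" and "Pss" keys are counted; mapping header lines
        # and other fields such as "Size" do not match and are skipped.
        if key in totals:
            totals[key] += int(rest.split()[0])
    return totals["Rss"], totals["Pss"]


def read_smaps(pid):
    """Read the raw smaps text for a pid (Linux only, needs permission)."""
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()


# Made-up sample: one mapping shared by four processes (40 kB resident,
# 10 kB proportional) and one private mapping (100 kB for both).
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/ruby
Size:                 44 kB
Rss:                  40 kB
Pss:                  10 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                 100 kB
Pss:                 100 kB
"""
```

On the made-up sample, the function returns 140 kB RSS but only 110 kB PSS, which shows how a worker can sit over a limit when measured by RSS and under the same limit when measured by PSS.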

Comment 6 Adam Grare 2017-02-16 20:46:28 UTC

Created attachment 1250943
5.8 VimBroker and RefreshCore Worker Memory Usage

Graph of VimBroker and RefreshCore Workers RSS&PSS memory usage from 512VMs to 10240VMs

Comment 7 Adam Grare 2017-02-16 20:47:40 UTC

Created attachment 1250945
Raw data of VimBroker and RefreshCore Worker Memory Usage

Here is the raw data (CSV) from the refresh results used to create the graph.

Comment 8 Adam Grare 2017-02-16 20:50:59 UTC

Seeing as with an "XL VC" (more than 10,000 VMs) the worker memory usage is safely under the threshold as of 5.6, I'm going to mark this not a bug as of the 5.6 release, because the threshold is now checked against PSS rather than RSS memory.
