Bug 1281921 - Worker exceeding memory causes more workers to spawn before worker exceeding threshold is exited
Status: ASSIGNED
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: GA
Target Release: cfme-future
Assigned To: Gregg Tanzillo
QA Contact: Pradeep Kumar Surisetty
Whiteboard: perf
Keywords: ZStream
Depends On:
Blocks: 1348203
 
Reported: 2015-11-13 14:59 EST by Alex Krzos
Modified: 2017-08-07 01:31 EDT (History)
CC: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1348203
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
EVM Log file showing 4 reporting workers existing at the same time. (1.45 MB, text/plain)
2015-11-13 14:59 EST, Alex Krzos
Appliance CPU graph displaying 100% cpu usage during which 4 reporting workers are running (90.52 KB, image/png)
2015-11-13 15:02 EST, Alex Krzos
Appliance memory graph displaying additional memory usage when 4 reporting workers are running (38.35 KB, image/png)
2015-11-13 15:03 EST, Alex Krzos
5.5.0.13-2 Appliance CPU+Memory Metrics Graph (389.35 KB, image/png)
2016-01-11 20:37 EST, Alex Krzos
5.5.0.13-2 MiqReportingWorkers CPU,Memory,Process/Thread Count (207.75 KB, image/png)
2016-01-11 20:41 EST, Alex Krzos

Description Alex Krzos 2015-11-13 14:59:24 EST
Created attachment 1093827 [details]
EVM Log file showing 4 reporting workers existing at the same time.

Description of problem:
When a worker exceeds its memory threshold, a new worker is spawned in its place on the assumption that the worker that exceeded the threshold will exit quickly.  In most cases that is what happens.  In specific instances, however, the over-threshold worker may not exit quickly because it is handling a large piece of work.  When this occurs, we risk consuming all available hardware resources on the appliance: both memory and CPU can be exhausted.

I have found several instances where this can occur.  The easiest to reproduce is running reports against a large-scale database.
The default reports scheduled to run once a day will trigger 2 more reporting workers than are normally allowed to run at a given time.  This consumes all available CPU on an appliance with the default 4 vCPUs and also consumes additional memory.  (Note that exiting a worker for exceeding its memory threshold is intended to prevent more memory from being allocated; in this case the opposite occurs.)
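The race can be sketched as follows (an illustrative Ruby sketch, not the actual ManageIQ MiqServer monitor; the class and method names here are hypothetical):

```ruby
# Illustrative sketch of the worker-monitor race described above.
# Hypothetical names -- not the actual ManageIQ MiqServer code.
class WorkerMonitor
  attr_reader :workers

  def initialize(configured_count:, memory_threshold:)
    @configured_count = configured_count  # e.g. 2 reporting workers
    @memory_threshold = memory_threshold  # bytes
    @workers = []
  end

  def add_worker(pid, memory: 0)
    @workers << { pid: pid, memory: memory, exiting: false }
  end

  # Called on each monitor pass.  A worker over the threshold is asked to
  # exit, and a replacement is spawned immediately -- without waiting for
  # the old worker to finish the (possibly long-running) report it holds.
  def monitor(next_pid)
    @workers.select { |w| !w[:exiting] && w[:memory] > @memory_threshold }
            .each do |w|
      w[:exiting] = true        # old worker keeps running its report
      add_worker(next_pid)      # replacement starts right away
      next_pid += 1
    end
    next_pid
  end

  # Exiting workers still burn CPU and memory until they actually exit,
  # so the live count can exceed the configured count.
  def live_count
    @workers.size
  end
end
```

With 2 workers configured and both over threshold, the live count becomes 4 until the old pids actually exit.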

Version-Release number of selected component (if applicable):
5.4
also demonstrated in 5.3 and 5.5

How reproducible:
Consistently, in the specific scenarios described above; easiest to hit with a large-scale database and scheduled reports.

Steps to Reproduce:
1. Configure an appliance (default 4 vCPUs, 2 reporting workers) against a large-scale database.
2. Let the default daily scheduled reports run so that the reporting workers pick up memory-intensive reports and exceed their memory threshold.
3. Observe that replacement workers are spawned while the over-threshold workers are still processing their reports, exceeding the configured worker count.

Actual results:
Even more memory and CPU are consumed: the workers that exceeded the threshold keep running alongside their replacements.

Expected results:
A worker that exceeds its memory threshold should not be able to cause hardware resource starvation.
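One way to get the expected behavior would be a guard like the following (a sketch of an assumed mitigation, not a shipped fix; `spawn_replacement?` is a hypothetical helper):

```ruby
# Hypothetical guard: before spawning a replacement for a worker that
# exceeded its memory threshold, count every reporting worker that is
# still alive -- including ones already asked to exit but still
# finishing a report -- against the configured worker count.
def spawn_replacement?(live_pids, configured_count)
  live_pids.size < configured_count
end
```

With 2 workers configured and pids 51173 and 27681 still draining, no replacement would start until one of them actually exits, capping the appliance at its intended CPU and memory footprint.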

Additional info:
This is most easily reproduced in large-scale environments.

Attached is a 30-minute chunk of logs from a 5.4.3.1 appliance that shows the following:

Two existing reporting worker pids (51173 and 27681) each pick up a report for processing at 19:00:12.  They both promptly exceed the default memory threshold at 19:00:16 (4 seconds later).  These pids do not exit until 19:18:06 and 19:19:34.  Two new reporting workers (pids 2010 and 2013) are spawned and pick up the remaining reports at 19:00:35 and 19:00:36.  For 5 minutes there are 4 reporting workers burning 100% CPU, as shown in the CPU graph.  Since the first two workers are the ones that have been asked to exit and have also picked up the most memory-intensive reports, which take the longest to process, we have 4 reporting workers for almost 20 minutes using additional memory (see the used-memory line in the memory graph).  In this appliance's case, it evicts file system cache from memory to make room for the extra reporting workers.  Had the appliance had less memory, it could have pushed processes into swap and slowed everything down.
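The overlap window follows directly from the timestamps in the excerpt (a quick sketch; the times are taken from the attached log, same day, EST):

```ruby
require "time"

# Timestamps from the attached evm.log excerpt.
threshold_exceeded   = Time.parse("19:00:16")  # both old pids over the limit
replacements_spawned = Time.parse("19:00:36")  # pids 2010/2013 pick up reports
last_old_worker_exit = Time.parse("19:19:34")  # pid 27681 finally exits

# Seconds between the threshold breach and the doubled-up worker count,
# and minutes during which 4 reporting workers ran concurrently.
spawn_delay = (replacements_spawned - threshold_exceeded).to_i
overlap_min = ((last_old_worker_exit - replacements_spawned) / 60).round

puts "replacements up #{spawn_delay}s after the threshold breach"
puts "#{overlap_min} minutes with 4 concurrent reporting workers"
```

That is roughly 19 minutes of doubled-up reporting workers for a 20-second spawn decision, which matches the CPU and memory graphs.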
Comment 2 Alex Krzos 2015-11-13 15:02 EST
Created attachment 1093833 [details]
Appliance CPU graph displaying 100% cpu usage during which 4 reporting workers are running
Comment 3 Alex Krzos 2015-11-13 15:03 EST
Created attachment 1093834 [details]
Appliance memory graph displaying additional memory usage when 4 reporting workers are running
Comment 5 Alex Krzos 2016-01-11 20:37 EST
Created attachment 1113758 [details]
5.5.0.13-2 Appliance CPU+Memory Metrics Graph

Attached are appliance-level system performance metrics showing the same issue on 5.5.0.13-2.
Comment 6 Alex Krzos 2016-01-11 20:41 EST
Created attachment 1113771 [details]
5.5.0.13-2 MiqReportingWorkers CPU,Memory,Process/Thread Count

Attached is a per-process graph showing how MiqReportingWorkers jump in memory usage and how more than the intended count can coexist during long-running work.  In this case 5 workers exist concurrently even though the appliance is configured for 2 workers.
