Bug 1281921

Summary: Worker exceeding memory causes more workers to spawn before worker exceeding threshold is exited
Product: Red Hat CloudForms Management Engine
Reporter: Alex Krzos <akrzos>
Component: Performance
Assignee: Gregg Tanzillo <gtanzill>
Status: CLOSED WONTFIX
QA Contact: Pradeep Kumar Surisetty <psuriset>
Severity: medium
Priority: medium
Version: 5.5.0
CC: arcsharm, cpelland, dmetzger, hroy, jhardy, jrafanie, obarenbo
Target Milestone: GA
Keywords: ZStream
Target Release: cfme-future
Hardware: Unspecified
OS: Unspecified
Whiteboard: perf
Doc Type: Bug Fix
Last Closed: 2017-08-21 13:06:53 UTC
Type: Bug
Bug Blocks: 1348203
Attachments:
- EVM Log file showing 4 reporting workers existing at the same time
- Appliance CPU graph displaying 100% CPU usage during which 4 reporting workers are running
- Appliance memory graph displaying additional memory usage when 4 reporting workers are running
- 5.5.0.13-2 Appliance CPU+Memory Metrics Graph
- 5.5.0.13-2 MiqReportingWorkers CPU, Memory, Process/Thread Count

Description Alex Krzos 2015-11-13 19:59:24 UTC
Created attachment 1093827 [details]
EVM Log file showing 4 reporting workers existing at the same time.

Description of problem:
When a worker exceeds its memory threshold, a new worker is spawned in its place on the assumption that the worker that exceeded the threshold will exit quickly.  In most cases that is what happens.  In specific instances, however, the worker that has exceeded the memory threshold may not exit quickly because it is handling a large piece of work.  When this occurs, we run the risk of consuming all available hardware resources on an appliance: both memory and CPU can be consumed completely.

I have found several instances where this can occur.  The easiest to reproduce is running reports against a large-scale database.
The default reports scheduled for creation once a day will cause 2 more reporting workers to run than are normally allowed at a given time.  This will consume all available CPU on an appliance with the default of 4 vCPUs, and will consume additional memory.  (Note that exiting a worker for exceeding its memory threshold is meant to prevent more memory from being allocated; in this case the opposite occurs.)
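The spawn-before-exit race described above can be sketched as follows.  This is an illustrative Python model, not actual CloudForms/ManageIQ code; all names (`Monitor`, `Worker`, `MEMORY_THRESHOLD`) and numbers are hypothetical stand-ins.

```python
# Illustrative model of the spawn-before-exit race. Not ManageIQ code;
# all names and values here are hypothetical.

class Worker:
    def __init__(self, pid, memory=0, busy_until=0):
        self.pid = pid
        self.memory = memory          # current memory usage, in MB
        self.busy_until = busy_until  # tick at which the current report finishes
        self.exit_requested = False

    def finished(self, now):
        # A worker honors an exit request only between pieces of work,
        # so a long-running report delays the exit for its full duration.
        return self.exit_requested and now >= self.busy_until


class Monitor:
    MEMORY_THRESHOLD = 400  # MB, stand-in for the configured threshold

    def __init__(self, workers):
        self.workers = workers
        self.next_pid = 2000

    def tick(self, now):
        for w in list(self.workers):
            if w.memory > self.MEMORY_THRESHOLD and not w.exit_requested:
                # The monitor asks the worker to exit AND immediately
                # spawns a replacement -- without waiting for the exit.
                w.exit_requested = True
                self.workers.append(Worker(self.next_pid))
                self.next_pid += 1
            if w.finished(now):
                self.workers.remove(w)


# Two workers pick up large reports that run for many ticks and exceed
# the threshold almost immediately (mirroring pids 51173/27681 below).
m = Monitor([Worker(51173, memory=600, busy_until=18),
             Worker(27681, memory=600, busy_until=19)])
m.tick(now=0)
print(len(m.workers))  # 4 workers coexist, double the configured count
```

Because the over-threshold workers keep running until their reports finish, the process count (and therefore CPU and memory usage) temporarily doubles.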

Version-Release number of selected component (if applicable):
5.4
also demonstrated in 5.3 and 5.5

How reproducible:
Occurs only in specific scenarios; most easily reproduced with a large-scale database and scheduled report runs.

Steps to Reproduce:
1. On an appliance with a large-scale database, let the default daily report schedule run (or queue enough reports to keep the reporting workers busy).
2. Watch the reporting workers exceed the memory threshold while processing large reports.
3. Observe replacement workers being spawned and picking up work before the over-threshold workers exit.

Actual results:
Even more memory and CPU are consumed than the worker limits are meant to allow.

Expected results:
A worker that exceeds its memory threshold should not be able to cause hardware resource starvation while it finishes its current work.
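One way to get the expected behavior would be to count every still-running process, including workers that have been asked to exit, against the configured worker count before spawning a replacement.  The sketch below is a hypothetical guard, not an actual ManageIQ API; `WorkerProc` and `may_spawn_replacement` are invented names.

```python
# Hypothetical guard: a replacement is only spawned once the total
# number of live worker processes drops below the configured count.
from dataclasses import dataclass

@dataclass
class WorkerProc:
    pid: int
    running: bool = True
    exit_requested: bool = False

def may_spawn_replacement(workers, configured_count):
    # Count every process that is still alive, including workers that
    # were asked to exit but are finishing a long-running report.
    alive = sum(1 for w in workers if w.running)
    return alive < configured_count

# With 2 configured workers both still finishing reports after being
# asked to exit, no replacement is spawned yet:
pool = [WorkerProc(51173, exit_requested=True),
        WorkerProc(27681, exit_requested=True)]
print(may_spawn_replacement(pool, configured_count=2))  # False
```

The trade-off is reduced throughput while the exiting worker drains its work, in exchange for a hard cap on concurrent worker processes.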

Additional info:
This is most easily represented in large scale environments.

Attached is a 30-minute chunk of logs from a 5.4.3.1 appliance that shows the following:

Two existing reporting worker pids (51173 and 27681) each pick up a report for processing at 19:00:12.  Both promptly exceed the default memory threshold at 19:00:16 (4 seconds later), yet these pids do not exit until 19:18:06 and 19:19:34.  Two new reporting workers (pids 2010 and 2013) are spawned and pick up the remaining reports at 19:00:35 and 19:00:36.  For 5 minutes there are 4 reporting workers burning 100% CPU, as shown in the CPU graph.  Since the first two workers are the ones that were asked to exit and also picked up the most memory-intensive reports (which take the longest to process), there are 4 reporting workers using additional memory for almost 20 minutes (see used memory in the memory graph).  In this appliance's case, file system cache is ejected from memory to make room for the extra reporting workers.  Had the appliance had less memory, processes could have been pushed into swap, slowing everything down.

Comment 2 Alex Krzos 2015-11-13 20:02:45 UTC
Created attachment 1093833 [details]
Appliance CPU graph displaying 100% cpu usage during which 4 reporting workers are running

Comment 3 Alex Krzos 2015-11-13 20:03:52 UTC
Created attachment 1093834 [details]
Appliance memory graph displaying additional memory usage when 4 reporting workers are running

Comment 5 Alex Krzos 2016-01-12 01:37:43 UTC
Created attachment 1113758 [details]
5.5.0.13-2 Appliance CPU+Memory Metrics Graph

Attached are appliance-level system performance metrics showing the same issue on 5.5.0.13-2.

Comment 6 Alex Krzos 2016-01-12 01:41:17 UTC
Created attachment 1113771 [details]
5.5.0.13-2 MiqReportingWorkers CPU,Memory,Process/Thread Count

Attached is a per-process graph showing how MiqReportingWorkers jump in memory usage and how more than the intended count can coexist during long-running work.  In this case 5 workers exist concurrently even though the appliance is configured for 2 workers.

Comment 8 Chris Pelland 2017-08-21 13:06:53 UTC
This bug has been open for more than a year and is assigned to an older release of CloudForms. 

If you would like to keep this Bugzilla open and if the issue is still present in the latest version of the product, please file a new Bugzilla which will be added and assigned to the latest release of CloudForms.
