Bug 1281921 - Worker exceeding memory causes more workers to spawn before worker exceeding threshold is exited
Status: ASSIGNED
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: GA
Target Release: cfme-future
Assigned To: Gregg Tanzillo
QA Contact: Pradeep Kumar Surisetty
Whiteboard: perf
Keywords: ZStream
Depends On:
Blocks: 1348203
 
Reported: 2015-11-13 14:59 EST by Alex Krzos
Modified: 2017-08-07 01:31 EDT (History)
CC: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1348203
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
EVM Log file showing 4 reporting workers existing at the same time. (1.45 MB, text/plain)
2015-11-13 14:59 EST, Alex Krzos
Appliance CPU graph displaying 100% cpu usage during which 4 reporting workers are running (90.52 KB, image/png)
2015-11-13 15:02 EST, Alex Krzos
Appliance memory graph displaying additional memory usage when 4 reporting workers are running (38.35 KB, image/png)
2015-11-13 15:03 EST, Alex Krzos
5.5.0.13-2 Appliance CPU+Memory Metrics Graph (389.35 KB, image/png)
2016-01-11 20:37 EST, Alex Krzos
5.5.0.13-2 MiqReportingWorkers CPU,Memory,Process/Thread Count (207.75 KB, image/png)
2016-01-11 20:41 EST, Alex Krzos

Description Alex Krzos 2015-11-13 14:59:24 EST
Created attachment 1093827 [details]
EVM Log file showing 4 reporting workers existing at the same time.

Description of problem:
When a worker exceeds its memory threshold, a new worker is spawned in its place on the assumption that the worker that exceeded the threshold will exit quickly.  In most cases that is what happens.  In specific instances, however, the over-threshold worker may not exit quickly because it is handling a large piece of work.  When this occurs, we risk consuming all available hardware resources on the appliance: both memory and CPU can be exhausted.

I have found several instances where this can occur.  The easiest to reproduce is running reports against a large-scale database.
The default reports scheduled to run once a day will trigger 2 more reporting workers than are normally allowed to run at a given time.  This consumes all available CPU on an appliance with the default 4 vCPUs and also consumes additional memory.  (Note that exiting a worker for exceeding its memory threshold is intended to prevent more memory from being allocated; in this case the opposite occurs.)
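The race can be sketched as follows (an illustrative Ruby sketch, not the actual ManageIQ MiqServer monitor; the class and method names here are hypothetical):

```ruby
# Illustrative sketch of the worker-monitor race described above.
# Hypothetical names -- not the actual ManageIQ MiqServer code.
class WorkerMonitor
  attr_reader :workers

  def initialize(configured_count:, memory_threshold:)
    @configured_count = configured_count  # e.g. 2 reporting workers
    @memory_threshold = memory_threshold  # bytes
    @workers = []
  end

  def add_worker(pid, memory: 0)
    @workers << { pid: pid, memory: memory, exiting: false }
  end

  # Called on each monitor pass.  A worker over the threshold is asked to
  # exit, and a replacement is spawned immediately -- without waiting for
  # the old worker to finish the (possibly long-running) report it holds.
  def monitor(next_pid)
    @workers.select { |w| !w[:exiting] && w[:memory] > @memory_threshold }
            .each do |w|
      w[:exiting] = true        # old worker keeps running its report
      add_worker(next_pid)      # replacement starts right away
      next_pid += 1
    end
    next_pid
  end

  # Exiting workers still burn CPU and memory until they actually exit,
  # so the live count can exceed the configured count.
  def live_count
    @workers.size
  end
end
```

With 2 workers configured and both over threshold, the live count becomes 4 until the old pids actually exit.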

Version-Release number of selected component (if applicable):
5.4
also demonstrated in 5.3 and 5.5

How reproducible:
Consistently, in the specific scenarios described above; easiest to hit with a large-scale database and scheduled reports.

Steps to Reproduce:
1. Configure an appliance (default 4 vCPUs, 2 reporting workers) against a large-scale database.
2. Let the default daily scheduled reports run so that the reporting workers pick up memory-intensive reports and exceed their memory threshold.
3. Observe that replacement workers are spawned while the over-threshold workers are still processing their reports, exceeding the configured worker count.

Actual results:
Even more memory and CPU are consumed: the workers that exceeded the threshold keep running alongside their replacements.

Expected results:
A worker that exceeds its memory threshold should not be able to cause hardware resource starvation.
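One way to get the expected behavior would be a guard like the following (a sketch of an assumed mitigation, not a shipped fix; `spawn_replacement?` is a hypothetical helper):

```ruby
# Hypothetical guard: before spawning a replacement for a worker that
# exceeded its memory threshold, count every reporting worker that is
# still alive -- including ones already asked to exit but still
# finishing a report -- against the configured worker count.
def spawn_replacement?(live_pids, configured_count)
  live_pids.size < configured_count
end
```

With 2 workers configured and pids 51173 and 27681 still draining, no replacement would start until one of them actually exits, capping the appliance at its intended CPU and memory footprint.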

Additional info:
This is most easily reproduced in large-scale environments.

Attached is a 30-minute chunk of logs from a 5.4.3.1 appliance that shows the following:

Two existing reporting worker pids (51173 and 27681) each pick up a report for processing at 19:00:12.  They both promptly exceed the default memory threshold at 19:00:16 (4 seconds later).  These pids do not exit until 19:18:06 and 19:19:34.  Two new reporting workers (pids 2010 and 2013) are spawned and pick up the remaining reports at 19:00:35 and 19:00:36.  For 5 minutes there are 4 reporting workers burning 100% CPU, as shown in the CPU graph.  Since the first two workers are the ones that have been asked to exit and have also picked up the most memory-intensive reports, which take the longest to process, we have 4 reporting workers for almost 20 minutes using additional memory (see the used-memory line in the memory graph).  In this appliance's case, it evicts file system cache from memory to make room for the extra reporting workers.  Had the appliance had less memory, it could have pushed processes into swap and slowed everything down.
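The overlap window follows directly from the timestamps in the excerpt (a quick sketch; the times are taken from the attached log, same day, EST):

```ruby
require "time"

# Timestamps from the attached evm.log excerpt.
threshold_exceeded   = Time.parse("19:00:16")  # both old pids over the limit
replacements_spawned = Time.parse("19:00:36")  # pids 2010/2013 pick up reports
last_old_worker_exit = Time.parse("19:19:34")  # pid 27681 finally exits

# Seconds between the threshold breach and the doubled-up worker count,
# and minutes during which 4 reporting workers ran concurrently.
spawn_delay = (replacements_spawned - threshold_exceeded).to_i
overlap_min = ((last_old_worker_exit - replacements_spawned) / 60).round

puts "replacements up #{spawn_delay}s after the threshold breach"
puts "#{overlap_min} minutes with 4 concurrent reporting workers"
```

That is roughly 19 minutes of doubled-up reporting workers for a 20-second spawn decision, which matches the CPU and memory graphs.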
Comment 2 Alex Krzos 2015-11-13 15:02 EST
Created attachment 1093833 [details]
Appliance CPU graph displaying 100% cpu usage during which 4 reporting workers are running
Comment 3 Alex Krzos 2015-11-13 15:03 EST
Created attachment 1093834 [details]
Appliance memory graph displaying additional memory usage when 4 reporting workers are running
Comment 5 Alex Krzos 2016-01-11 20:37 EST
Created attachment 1113758 [details]
5.5.0.13-2 Appliance CPU+Memory Metrics Graph

Attached are appliance-level system performance metrics showing the same issue on 5.5.0.13-2.
Comment 6 Alex Krzos 2016-01-11 20:41 EST
Created attachment 1113771 [details]
5.5.0.13-2 MiqReportingWorkers CPU,Memory,Process/Thread Count

Attached is a per-process graph showing how MiqReportingWorkers jump in memory usage and how more than the intended count can coexist during long-running work.  In this case 5 workers exist concurrently even though the appliance is configured for 2 workers.
