Red Hat Bugzilla – Bug 1281921
Worker exceeding memory causes more workers to spawn before worker exceeding threshold is exited
Last modified: 2018-01-05 18:50:38 EST
Created attachment 1093827 [details]
EVM Log file showing 4 reporting workers existing at the same time.
Description of problem:
When a worker exceeds its memory threshold, a new worker is spawned in its place on the assumption that the worker that exceeded the threshold will exit quickly. In most cases this is what occurs. In specific instances, the worker that has exceeded the memory threshold may not exit quickly because it is handling a large piece of work. When this occurs, we run the risk of consuming all available hardware resources on an appliance: both memory and CPU can be consumed completely.
I have found several instances when this can occur. The easiest to reproduce is with a large scale database and running reports.
The default reports scheduled for creation once a day will trigger 2 reporting workers beyond the number normally allowed to run at a given time. This will consume all available CPU on an appliance with the default 4 vCPUs and will consume additional memory. (Note that exiting a worker for exceeding its memory threshold is intended to prevent more memory from being allocated; in this case the opposite occurs.)
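The monitor behavior described above can be sketched as follows. This is a simplified illustration, not the actual CloudForms code; the names (`check_workers`, `MEMORY_THRESHOLD`, the worker dicts) are hypothetical. The key point is that the replacement is spawned as soon as the exit is requested, while the old worker keeps running until it finishes its current report:

```python
MEMORY_THRESHOLD = 400 * 1024 * 1024  # hypothetical per-worker limit, in bytes

def check_workers(workers, get_memory, spawn_worker):
    """Naive monitor pass: ask an over-threshold worker to exit and
    immediately spawn a replacement, without waiting for the old
    worker to actually terminate."""
    for worker in list(workers):          # snapshot; new workers aren't re-checked
        if get_memory(worker) > MEMORY_THRESHOLD:
            worker["exit_requested"] = True   # old worker exits only after
                                              # finishing its current report
            workers.append(spawn_worker())    # replacement starts now, so both
                                              # processes run concurrently
    return workers
```

With two busy workers over threshold, one pass leaves four live processes, which matches the behavior seen in the attached logs.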
Version-Release number of selected component (if applicable):
also demonstrated in 5.3 and 5.5
Specific scenarios in which this will occur:

Steps to Reproduce:
See description: run the default once-a-day scheduled reports against a large-scale database.

Actual results:
Even more memory and CPU are used.

Expected results:
A worker that exceeds its memory threshold does not cause hardware resource starvation.

Additional info:
This is most easily reproduced in large-scale environments.
Attached is a 30-minute chunk of logs from a 18.104.22.168 appliance that shows the following:
Two existing reporting worker pids (51173 and 27681) each pick up a report for processing at 19:00:12. They both promptly exceed the default memory threshold at 19:00:16 (4 seconds later...). These pids do not exit until 19:18:06 and 19:19:34. Two new reporting workers (pids 2010 and 2013) are spawned and pick up the remaining reports at 19:00:35 and 19:00:36. For 5 minutes there are 4 reporting workers burning 100% CPU, as shown in the CPU graph. Since the first two workers are the ones that were asked to exit and also picked up the most memory-intensive reports, which take the longest to process, we have 4 reporting workers using additional memory for almost 20 minutes (see the used-memory series in the memory graph). In this appliance's case, it evicts file system cache from memory to make room for the extra reporting workers. Had this appliance had less memory, it could have pushed processes into swap and slowed everything down.
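One possible mitigation for the timeline above, sketched here with hypothetical names (`spawn_replacements`, worker dicts with `alive`/`exit_requested` flags), is to count workers that have been asked to exit but are still running against the configured maximum, deferring replacements until the old processes actually terminate:

```python
def spawn_replacements(workers, max_workers, spawn_worker):
    """Mitigation sketch: workers that were asked to exit but are still
    alive count against the configured maximum, so replacements are
    only spawned once the total number of live processes is under the cap."""
    alive = [w for w in workers if w["alive"]]
    working = [w for w in alive if not w["exit_requested"]]
    # Spawn only while the total count of live processes stays under the cap.
    while len(working) < max_workers and len(alive) < max_workers:
        w = spawn_worker()
        workers.append(w)
        alive.append(w)
        working.append(w)
    return workers
```

Under this policy the appliance trades reporting throughput (fewer workers picking up queued reports) for a hard bound on concurrent worker processes, which is the resource-starvation concern this bug describes.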
Created attachment 1093833 [details]
Appliance CPU graph displaying 100% cpu usage during which 4 reporting workers are running
Created attachment 1093834 [details]
Appliance memory graph displaying additional memory usage when 4 reporting workers are running
Created attachment 1113758 [details]
22.214.171.124-2 Appliance CPU+Memory Metrics Graph
Attached are appliance-level system performance metrics showing the same issue on 126.96.36.199-2.
Created attachment 1113771 [details]
188.8.131.52-2 MiqReportingWorkers CPU,Memory,Process/Thread Count
Attached is a per-process graph showing how MiqReportingWorkers jump in memory usage and how more workers than the intended count can coexist during long-running work. In this case 5 workers exist concurrently even though the appliance is configured for 2 workers.
This bug has been open for more than a year and is assigned to an older release of CloudForms.
If you would like to keep this Bugzilla open and if the issue is still present in the latest version of the product, please file a new Bugzilla which will be added and assigned to the latest release of CloudForms.