Created attachment 1343861 [details]
CSV file documenting Generic Worker memory exceptions

Description of problem:
The customer is experiencing workers exceeding their memory threshold. After a system has run for a period of time, the MIQ Server process becomes extremely large (one instance of 2.2 GB was seen in the logs), and when it spawns a new worker, that worker inherits the large RSS footprint. Once this situation is encountered, the worker termination / restart cycle repeats continuously; see the attached mem_threshold_exceptions.csv for an example timeline of the Generic worker going through this cycle on one appliance. During this time period the MIQ Server grew from 1.72 GB RSS to 1.73 GB. The evm.log and top_output.log from this appliance are attached.

I have asked the customer to restart one appliance and collect all logs covering the period from the reboot until the MIQ Server grows to 1 GB (and, if they can, until the worker thrashing starts).

Version-Release number of selected component (if applicable):

How reproducible:
Happens consistently in the customer environment.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1343862 [details] Associated EVM log
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set the severity to Low.
Created attachment 1343863 [details] Associated TOP_OUTPUT.LOG
Dennis, the attached logs show the server starting at 1.7 GB, so the damage has already been done. We need to figure out what made the server grow that large. Do you have top_output and evm logs from before the server process grew, preferably while it was still under 800 MB?
I requested that the appliance be rebooted, that the size of the EVM Server process be monitored, and that logs be captured when the server reaches the 700 MB-1 GB size range. I have not received the logs back yet.
Thanks Ryan. We're still researching the slow memory leak in the MiqServer process. These logs show behavior similar to what we've seen in house, and we are continuing to try to track down the root cause. I'll update the BZ when we have more information.
From the last top_output.log, the system has 4 GB of memory free and 0% swap used, so this is clearly a case of proactively (and incorrectly) claiming a worker is consuming too much memory when the system is actually fine. The server is too big and still growing, but that is a separate issue we're still trying to fix.

The "workaround" below has been tested only on my test systems, but it requires no schema changes. It changes our worker validation to look at USS instead of PSS for the "exceeding memory threshold" check, so it better targets workers that are truly growing without bound.

https://github.com/ManageIQ/manageiq/pull/16480

top - 11:26:51 up 13 days,  1:29,  0 users,  load average: 4.54, 2.90, 2.85
Tasks: 233 total,  10 running, 223 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.2 us,  1.9 sy, 42.5 ni, 37.1 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 24516804 total,  4178152 free, 17580208 used,  2758444 buff/cache
KiB Swap:  9957372 total,  9957372 free,        0 used.  6529544 avail Mem

  PID  PPID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
31956 31955 root  20   0  801492 690004   880 R  99.8  2.8  13:37.08 xz
 4487  2516 root  23   3 1389288 999.8m  4884 S  48.7  4.2   3473:45 MIQ: MiqVimBrokerWorker id: 10000001778609
49717  2516 root  23   3 1805544 1.378g  3688 S  32.4  5.9   1224:19 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881854, queue: vmwar
49966  2516 root  23   3 1813072 1.379g  3688 R  30.6  5.9   1228:32 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881859, queue: vmwar
49709  2516 root  23   3 1805608 1.376g  3688 S  29.4  5.9   1232:42 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881853, queue: vmwar
49701  2516 root  23   3 1813680 1.382g  3688 R  27.7  5.9   1224:42 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881852, queue: vmwar
49979  2516 root  23   3 1806948 1.377g  3688 R  26.3  5.9   1231:56 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881861, queue: vmwar
49974  2516 root  23   3 1808980 1.377g  3688 R  26.1  5.9   1219:36 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881860, queue: vmwar
12570  2516 root  27   7 1727176 1.299g  3184 S  15.9  5.6   3:45.62 MIQ: MiqEmsMetricsProcessorWorker id: 10000001914941, queue: ems_metrics_processor
 2516     1 root  20   0 1672216 1.198g  7156 R  11.7  5.1   2328:01 MIQ Server
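For illustration only, here is a minimal sketch (not the actual code in the PR above) of what a USS-based check can look like on a Linux appliance, assuming /proc is available. The method names uss_in_bytes and exceeding_memory_threshold?, and the threshold_bytes parameter, are made up for this example. USS is taken as the sum of Private_Clean and Private_Dirty in /proc/<pid>/smaps, i.e. only memory unique to the process, so pages shared with the forked parent server are not counted:

# Sketch only: compute a worker's unique set size (USS) from /proc/<pid>/smaps
# and compare it to a memory threshold. Shared/inherited pages are excluded.
def uss_in_bytes(pid)
  total_kb = 0
  File.foreach("/proc/#{pid}/smaps") do |line|
    # Private_Clean + Private_Dirty are the pages unique to this process.
    total_kb += $1.to_i if line =~ /\A(?:Private_Clean|Private_Dirty):\s+(\d+)\s+kB/
  end
  total_kb * 1024
end

# Hypothetical check mirroring the "exceeding memory threshold" validation.
def exceeding_memory_threshold?(pid, threshold_bytes)
  uss_in_bytes(pid) > threshold_bytes
end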
Note: the changes to use unique set size (USS) instead of proportional set size (PSS) don't fix the server leak. We will now no longer penalize workers for inheriting a large amount of memory from a large MIQ Server process when they are forked. top will still show these workers with high memory usage (RSS) if they inherited memory from a large server; you need to use a tool such as smem (for example, smem -P MIQ) to see the USS. bin/rake evm:status will now show the USS value. We will continue to track down and fix the server memory growth, but at least now we won't be prematurely killing workers.
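As a rough way to see the difference on a running appliance, you can compare RSS (from /proc/<pid>/status) with USS (from /proc/<pid>/smaps) for the MIQ processes. This is an illustrative script, not something shipped with the product, and it assumes it is run as root so smaps is readable:

# Sketch only: print RSS next to USS for each MIQ process, to show how much of
# a worker's apparent size in top is memory shared with the parent server.
def rss_kb(pid)
  File.foreach("/proc/#{pid}/status") do |line|
    return line[/\d+/].to_i if line.start_with?("VmRSS:")
  end
  0
end

def uss_kb(pid)
  File.foreach("/proc/#{pid}/smaps").sum do |line|
    line =~ /\A(?:Private_Clean|Private_Dirty):\s+(\d+)\s+kB/ ? $1.to_i : 0
  end
end

Dir.glob("/proc/[0-9]*/cmdline").each do |path|
  pid = path[/\d+/].to_i
  cmd = File.read(path).tr("\0", " ") rescue next  # process may have exited
  next unless cmd.start_with?("MIQ")
  puts format("%-7d rss=%8d kB  uss=%8d kB  %s", pid, rss_kb(pid), uss_kb(pid), cmd.strip)
end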
This is basically a duplicate of bug 1479356.
Closing; the side effect of workers being killed because they inherit a large amount of shared memory from a leaking MIQ Server process is being fixed in bug 1479356.

*** This bug has been marked as a duplicate of bug 1479356 ***
Since this was closed as a duplicate, moving the hotfix request flag to bug #1526474 (the 5.7.z clone of bug #1479356).