Created attachment 1343861 [details]
CSV file documenting Generic Worker memory exceptions

Description of problem:
The customer is experiencing workers exceeding their memory threshold. After a system has run for a period of time, the MIQ Server process becomes extremely large (one instance of 2.2 GB was seen in the logs), and when it spawns a new worker, that worker inherits the large RSS footprint. Once this situation is encountered, the worker termination / restart cycle repeats continuously; see the attached mem_threshold_exceptions.csv for an example timeline of the Generic worker going through this cycle on one appliance. During this time period the MIQ Server grew from 1.72 GB RSS to 1.73 GB. The evm.log and top_output.log from this appliance are attached.

I have asked the customer to restart one appliance and collect all logs covering the period from the reboot until the MIQ Server grows to 1 GB (and, if they can, until the worker thrashing starts).

Version-Release number of selected component (if applicable):

How reproducible:
Happens consistently in the customer environment.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1343862 [details] Associated EVM log
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set the severity to Low.
Created attachment 1343863 [details] Associated TOP_OUTPUT.LOG
Dennis, the attached logs show the server starting at 1.7 GB, so the damage has already been done. We need to figure out what made the server grow that large. Do you have top_output and evm logs from before the server process grew, preferably while it was still under 800 MB?
I requested that the appliance be rebooted, that the size of the EVM Server process be monitored, and that logs be captured when the server reaches the 700 MB-1 GB size range. I have not received the logs back yet.
Thanks Ryan. We're still researching the slow memory leak in the MiqServer process. These logs show behavior similar to what we've seen in house, and we are continuing to try to track down the root cause. I'll update the BZ when we have more information.
From the last top_output.log, the system has 4 GB of memory free and 0% swap used, so this is clearly a case of proactively (and incorrectly) claiming a worker is consuming too much memory when the system is actually fine. The server is too big and still growing, but that is a separate issue we're still trying to fix.

The "workaround" below has been tested only on my test systems, but it requires no schema changes. It changes our worker validation to look at USS instead of PSS for the "exceeding memory threshold" check, so it better targets workers that are truly growing without bound.

https://github.com/ManageIQ/manageiq/pull/16480

top - 11:26:51 up 13 days,  1:29,  0 users,  load average: 4.54, 2.90, 2.85
Tasks: 233 total,  10 running, 223 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.2 us,  1.9 sy, 42.5 ni, 37.1 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 24516804 total,  4178152 free, 17580208 used,  2758444 buff/cache
KiB Swap:  9957372 total,  9957372 free,        0 used.  6529544 avail Mem

  PID  PPID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
31956 31955 root  20   0  801492 690004   880 R  99.8  2.8  13:37.08 xz
 4487  2516 root  23   3 1389288 999.8m  4884 S  48.7  4.2   3473:45 MIQ: MiqVimBrokerWorker id: 10000001778609
49717  2516 root  23   3 1805544 1.378g  3688 S  32.4  5.9   1224:19 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881854, queue: vmwar
49966  2516 root  23   3 1813072 1.379g  3688 R  30.6  5.9   1228:32 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881859, queue: vmwar
49709  2516 root  23   3 1805608 1.376g  3688 S  29.4  5.9   1232:42 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881853, queue: vmwar
49701  2516 root  23   3 1813680 1.382g  3688 R  27.7  5.9   1224:42 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881852, queue: vmwar
49979  2516 root  23   3 1806948 1.377g  3688 R  26.3  5.9   1231:56 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881861, queue: vmwar
49974  2516 root  23   3 1808980 1.377g  3688 R  26.1  5.9   1219:36 MIQ: Vmware::InfraManager::MetricsCollectorWorker id: 10000001881860, queue: vmwar
12570  2516 root  27   7 1727176 1.299g  3184 S  15.9  5.6   3:45.62 MIQ: MiqEmsMetricsProcessorWorker id: 10000001914941, queue: ems_metrics_processor
 2516     1 root  20   0 1672216 1.198g  7156 R  11.7  5.1   2328:01 MIQ Server
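For illustration only, here is a minimal sketch (not the actual code in the PR above) of what a USS-based check can look like on a Linux appliance, assuming /proc is available. The method names uss_in_bytes and exceeding_memory_threshold?, and the threshold_bytes parameter, are made up for this example. USS is taken as the sum of Private_Clean and Private_Dirty in /proc/<pid>/smaps, i.e. only memory unique to the process, so pages shared with the forked parent server are not counted:

# Sketch only: compute a worker's unique set size (USS) from /proc/<pid>/smaps
# and compare it to a memory threshold. Shared/inherited pages are excluded.
def uss_in_bytes(pid)
  total_kb = 0
  File.foreach("/proc/#{pid}/smaps") do |line|
    # Private_Clean + Private_Dirty are the pages unique to this process.
    total_kb += $1.to_i if line =~ /\A(?:Private_Clean|Private_Dirty):\s+(\d+)\s+kB/
  end
  total_kb * 1024
end

# Hypothetical check mirroring the "exceeding memory threshold" validation.
def exceeding_memory_threshold?(pid, threshold_bytes)
  uss_in_bytes(pid) > threshold_bytes
end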
Note: the changes to use unique set size (USS) instead of proportional set size (PSS) don't fix the server leak. We will now no longer penalize workers for inheriting a large amount of memory from a large MIQ Server process when they are forked. top will still show these workers with high memory usage (RSS) if they inherited memory from a large server; you need to use a tool such as smem (for example, smem -P MIQ) to see the USS. bin/rake evm:status will now show the USS value. We will continue to track down and fix the server memory growth, but at least now we won't be prematurely killing workers.
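As a rough way to see the difference on a running appliance, you can compare RSS (from /proc/<pid>/status) with USS (from /proc/<pid>/smaps) for the MIQ processes. This is an illustrative script, not something shipped with the product, and it assumes it is run as root so smaps is readable:

# Sketch only: print RSS next to USS for each MIQ process, to show how much of
# a worker's apparent size in top is memory shared with the parent server.
def rss_kb(pid)
  File.foreach("/proc/#{pid}/status") do |line|
    return line[/\d+/].to_i if line.start_with?("VmRSS:")
  end
  0
end

def uss_kb(pid)
  File.foreach("/proc/#{pid}/smaps").sum do |line|
    line =~ /\A(?:Private_Clean|Private_Dirty):\s+(\d+)\s+kB/ ? $1.to_i : 0
  end
end

Dir.glob("/proc/[0-9]*/cmdline").each do |path|
  pid = path[/\d+/].to_i
  cmd = File.read(path).tr("\0", " ") rescue next  # process may have exited
  next unless cmd.start_with?("MIQ")
  puts format("%-7d rss=%8d kB  uss=%8d kB  %s", pid, rss_kb(pid), uss_kb(pid), cmd.strip)
end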
This is basically a duplicate of bug 1479356.
Closing; the side effect of workers being killed because they inherit a large amount of shared memory from a leaking MIQ Server process is being fixed in bug 1479356.

*** This bug has been marked as a duplicate of bug 1479356 ***
Since this was closed as a duplicate, moving the hotfix request flag to bug #1526474 (the 5.7.z clone of bug #1479356).