Bug 1425217 - smartproxy workers are being terminated for memory limit exceeded before they get their first message to process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.7.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Joe Rafaniello
QA Contact: Dave Johnson
URL:
Whiteboard: worker:smartstate
Depends On:
Blocks:
 
Reported: 2017-02-20 21:35 UTC by Thomas Hennessy
Modified: 2020-03-11 15:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-31 17:04:20 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments
full log of smartproxy worker that is terminated after about 25 seconds for memory limit exceeded reason before doing any work (1.22 KB, application/x-gzip)
2017-02-20 21:35 UTC, Thomas Hennessy

Description Thomas Hennessy 2017-02-20 21:35:49 UTC
Created attachment 1255875 [details]
full log of smartproxy worker that is terminated after about 25 seconds for memory limit exceeded reason before doing any work

Description of problem:


Version-Release number of selected component (if applicable):
Version: 5.7.0.17
 Build:   20161219135818_725f92f


How reproducible: in the customer environment, just start up a smartproxy worker and the issue occurs immediately.


Steps to Reproduce:
1.
2.
3.

Actual results:
worker shuts down after about 25 seconds

Expected results: the worker process should survive 7200 seconds if there is no work to do


Additional info:

Comment 6 Thomas Hennessy 2017-02-21 22:09:51 UTC
Initial case pointed to smartproxy workers being created with memory already exceeding the memory threshold and being terminated within 25 seconds, causing a restart flood.

The most recent upload, for pid 22460, shows a similar issue with very high used memory, but the process is not terminated until the restart interval expires, with each active worker holding 15% or more of appliance memory while doing no work at all.

Comment 9 Joe Rafaniello 2017-02-21 22:34:41 UTC
For now, you can increase the memory threshold for the smart proxy workers so they can do their work.  It's not a fix but will allow us to diagnose what's happening.

It looks like 600 MB of PSS is not enough for the smart proxy workers.  

From numbers in the logs:
The limit is printed as 629145600 (600 MB)
Some of the workers are at 617332000 (588 MB).

Is this a new problem?  We can increase the limit permanently if needed.  

In the past, we had several workers with very little room for memory growth, so we should figure out whether that is the case here or whether something was added recently to the smart proxy worker that bloats its memory.
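
For illustration only, the override in the advanced settings would look roughly like the YAML below. The exact key path is an assumption (it presumes the smart proxy worker threshold lives under the usual queue_worker_base hierarchy) and should be confirmed in Configuration -> Advanced before applying:

:workers:
  :worker_base:
    :queue_worker_base:
      :smart_proxy_worker:
        :memory_threshold: 800.megabytes  # example value above the current 600 MB default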

Comment 11 Joe Rafaniello 2017-02-21 22:41:02 UTC
Note, we bumped other worker memory limits in bug 1391687 via https://github.com/ManageIQ/manageiq/pull/12484.  It's not clear whether the smart proxy worker was excluded because it wasn't a problem back then or was simply missed.

Comment 12 Joe Rafaniello 2017-02-21 22:47:18 UTC
Note, the top output logs are not entirely helpful for forked processes since RSS doesn't take shared memory into account.  This is why we use Proportional Set Size (PSS), viewable via the smem utility.  You'll see many of the smart proxy workers in the top output logs with nearly the same RSS as the parent server process that forked them, which is unfortunate but expected.

With that said, the logs in comment 5 show the smart proxies consuming 1.5 GB of RSS but nearly 600 MB of PSS, so they exceed the 600 MB threshold shortly after starting.
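
For reference, per-process PSS on the appliance can be spot-checked with smem, for example:

# smem -k -P MIQ -c "pid pss rss command"

The "MIQ" process filter and the column selection here are just illustrative (they assume the worker process titles contain "MIQ"); -k prints sizes in human-readable units.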

Comment 13 Joe Rafaniello 2017-02-23 15:58:37 UTC
Hi Tom, sorry, I thought I did a needinfo but it looks like a bugzilla midair collision ate it.  Can you confirm whether this is a new problem or something we missed in bug 1391687?  I haven't benchmarked idle worker memory in a while, and any information you have from this bug vs. the other would be helpful.  Thanks!

Comment 14 Thomas Hennessy 2017-03-06 15:12:30 UTC
Joe,
The BZ I opened was from a specific customer experience that was unusual, to say the least.  The top output, together with the evm.log lines provided, is the best and only diagnostic information available.  This customer has moved beyond the issue by increasing the memory limit for these workers.  The case was opened in an attempt to report the issue and possibly prevent it from becoming common amongst the entire CFME community.  If there is a lesson here, perhaps it is that CFME engineering needs to begin exploiting the new settings_changes table to identify those settings changes that need to be reviewed (especially as part of a migration) and expose them in the evmserverd startup settings, where other one-shot information such as provider, database latency, and schema validity is exposed.
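
As a rough sketch of that idea (assuming the settings_changes table is exposed through a SettingsChange model, as the table name suggests), a Rails console snippet along these lines could dump the recorded overrides for review at startup:

# Illustrative only: list the settings overrides recorded in settings_changes
SettingsChange.order(:resource_type, :resource_id, :key).each do |change|
  puts "#{change.resource_type} #{change.resource_id}: #{change.key} => #{change.value.inspect}"
end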

Tom Hennessy

Comment 15 Joe Rafaniello 2017-03-06 18:33:07 UTC
Reassigning to the performance area to get test results of idle workers on various versions.

Comment 18 Nick LaMuro 2017-06-27 17:38:46 UTC
Joe, Tom,

Based on the data gathered by the QE Performance team, the idle memory for the MiqSmartProxyWorker is almost half of the default threshold, even in the most recent version, where memory usage is at its highest:



--- Idle Memory tests ---

                 Start of Test           End of Test
 Version  |     PSS     |     RSS     |     PSS     |     RSS     
------------------------------------------------------------------
 5.8.0.17 |   162.81 MB |   296.20 MB |   167.01 MB |   297.22 MB
 5.7.3.01 |   168.77 MB |   289.18 MB |   173.32 MB |   289.23 MB
 5.7.2.01 |   152.96 MB |   279.10 MB |   154.69 MB |   279.14 MB
 5.6.4.02 |   153.92 MB |   249.71 MB |   155.83 MB |   249.72 MB



--- Smart State Analysis (SSA) tests ---

                 Start of Test           End of Test
 Version  |     PSS     |     RSS     |     PSS     |     RSS     
------------------------------------------------------------------
 5.8.0.17 |   185.25 MB |   303.53 MB |   261.01 MB |   355.40 MB
 5.7.3.01 |   184.84 MB |   304.39 MB |   240.76 MB |   342.84 MB
 5.7.2.01 |   188.61 MB |   304.03 MB |   338.46 MB |   442.13 MB
 5.6.4.02 |   158.23 MB |   251.07 MB |   162.62 MB |   251.34 MB




Even when actively exercising SSA with the QE perf SSA workload, the proxy worker stays well below the threshold, so I think what you are seeing is specific to the customer.

Comment 20 Joe Rafaniello 2017-07-25 16:05:59 UTC
Dave, can you have someone see if the reported problem is still the case?

From the reported problem, the 600 MB PSS memory_threshold for the smart proxy worker is exceeded just by booting the worker; the attached logs named "full log of smartproxy worker that is terminated after about 25 seconds..." show the worker's PSS is over 700 MB almost immediately.

This is contradicted by comment 18, so I'm not sure if there's some other variable in the mix.  

In the provided top output logs, the server process is ~1.2 GB and each forked worker starts at roughly 1.2 GB RSS. Note, we use PSS (private memory plus shared memory divided by the number of processes sharing it) for threshold checking, but if Ruby's copy-on-write-unfriendly GC is in play, perhaps that is causing workers to start at or near the normal worker threshold.
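
For clarity, a process's PSS is the sum of the Pss fields reported in /proc/<pid>/smaps; a minimal Ruby sketch of that calculation (illustrative only, not the actual worker-monitoring code) is:

# Sum the Pss entries from /proc/<pid>/smaps, returning kilobytes
def pss_kb(pid)
  File.readlines("/proc/#{pid}/smaps").grep(/^Pss:/).sum { |line| line.split[1].to_i }
end

puts pss_kb(Process.pid)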

If the server process is already at ~1.2 GB RSS and each forked child starts at ~1.2 GB RSS, it may be more likely that a forked child will allocate/free objects in the OS heap "next" to objects shared with the server, causing more of the process memory to be copied than for a process that starts at the typical 200-250 MB RSS.  It still doesn't seem likely that this would happen nearly immediately.

This is also speculation without a way to recreate it.

Comment 21 Dave Johnson 2017-07-26 03:24:37 UTC
Satyajit, can you please try to retest this and see how far you get?  Thanks!

Comment 25 Joe Rafaniello 2017-08-07 16:11:14 UTC
From the QE appliance, the 3 workers that exited due to memory were:


pid 16927 (up for 50+ minutes):
[----] I, [2017-08-07T06:46:55.548146 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [130], PID [16927], GUID [bad191ca-7b5d-11e7-a2ea-fa163e66cb7d]
[----] W, [2017-08-07T07:37:27.890495 #10092:97f130]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [130], PID: [16927], GUID: [bad191ca-7b5d-11e7-a2ea-fa163e66cb7d] process memory usage [641531000] exceeded limit [629145600], requesting worker to exit


pid 16919, up for 52+ minutes:
[----] I, [2017-08-07T06:46:55.454327 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [129], PID [16919], GUID [ba90b920-7b5d-11e7-a2ea-fa163e66cb7d]
[----] W, [2017-08-07T07:38:53.138111 #10092:97f130]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [129], PID: [16919], GUID: [ba90b920-7b5d-11e7-a2ea-fa163e66cb7d] process memory usage [678153000] exceeded limit [629145600], requesting worker to exit


pid 24773, up for nearly 9 minutes:
[----] I, [2017-08-07T07:39:09.427025 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [132], PID [24773], GUID [06793d60-7b65-11e7-a2ea-fa163e66cb7d]
[----] W, [2017-08-07T07:47:51.289314 #10092:97f130]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [132], PID: [24773], GUID: [06793d60-7b65-11e7-a2ea-fa163e66cb7d] process memory usage [678898000] exceeded limit [629145600], requesting worker to exit

So, while it looks like we're hitting that limit after scanning VMs, it's happening after actual scanning work rather than right after boot.  The workers are also "getting the message", which means they're completing in time and exiting gracefully.

It's important to note that the MIQ server process forking these smart proxy workers is only at 546 MB RSS, so if the "hitting the limit at worker boot" behavior occurs only when the server is already larger than 1 GB, we're not actually exercising that scenario here.



Here is a grep of all of the smart proxy worker starts and any "exceeded limit" warning messages in the log:

# grep -E "WARN.+SmartProxy.+exceeded limit|SmartProxyWorker#start" log/evm.log
[----] I, [2017-08-07T04:46:37.407680 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [126], PID [7270], GUID [ec2cb60c-7b4c-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T04:46:37.562291 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [127], PID [7279], GUID [ec874bb2-7b4c-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T06:46:55.454327 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [129], PID [16919], GUID [ba90b920-7b5d-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T06:46:55.548146 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [130], PID [16927], GUID [bad191ca-7b5d-11e7-a2ea-fa163e66cb7d]
[----] W, [2017-08-07T07:37:27.890495 #10092:97f130]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [130], PID: [16927], GUID: [bad191ca-7b5d-11e7-a2ea-fa163e66cb7d] process memory usage [641531000] exceeded limit [629145600], requesting worker to exit
[----] I, [2017-08-07T07:37:44.085023 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [131], PID [24103], GUID [d39d4aee-7b64-11e7-a2ea-fa163e66cb7d]
[----] W, [2017-08-07T07:38:53.138111 #10092:97f130]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [129], PID: [16919], GUID: [ba90b920-7b5d-11e7-a2ea-fa163e66cb7d] process memory usage [678153000] exceeded limit [629145600], requesting worker to exit
[----] I, [2017-08-07T07:39:09.427025 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [132], PID [24773], GUID [06793d60-7b65-11e7-a2ea-fa163e66cb7d]
[----] W, [2017-08-07T07:47:51.289314 #10092:97f130]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [132], PID: [24773], GUID: [06793d60-7b65-11e7-a2ea-fa163e66cb7d] process memory usage [678898000] exceeded limit [629145600], requesting worker to exit
[----] I, [2017-08-07T07:48:07.582496 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [133], PID [29120], GUID [4732f778-7b66-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T09:38:14.118467 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [135], PID [10650], GUID [a914b5bc-7b75-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T09:48:27.139265 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [136], PID [11673], GUID [167cb158-7b77-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T11:38:41.714228 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [138], PID [26811], GUID [7d102ca6-7b86-11e7-a2ea-fa163e66cb7d]
[----] I, [2017-08-07T11:48:56.147272 #10092:97f130]  INFO -- : MIQ(MiqSmartProxyWorker#start) Worker started: ID [139], PID [28070], GUID [eb4a7a04-7b87-11e7-a2ea-fa163e66cb7d]

Comment 26 Nick LaMuro 2017-08-07 16:24:13 UTC
Joe,

Is it possible that, because we are forking here, the MIQ Server instance in the customer's case was already above the threshold for an individual worker, so the worker process started out over the limit to begin with?

I think you pretty much suggested this might be the case already, but wanted to make sure.

-Nick

Comment 27 Joe Rafaniello 2017-08-07 16:47:37 UTC
Great point Nick.  Our appliance memory threshold code checks PSS (taking shared memory into account), not RSS, so even if the server's PSS is greater than the worker threshold, the forked worker will start at the same RSS but its PSS should be significantly smaller, since it is sharing that memory with the server.
It should not immediately start above the threshold.

Next is only my best guess.  I have no evidence.
What I'm suggesting is that some combination of OS swapping and Ruby's copy-on-write-unfriendly GC is causing the OS memory pages shared with the server process to be written to, either from freeing or from allocating memory for objects on the same OS heap page as the shared memory.  I'm not exactly sure how shared memory pages behave when OS memory is low and pages are swapped out and back in; my guess is that shared memory that is swapped out becomes private memory for each process when swapped back in.  Since scanning VMs seems to balloon the process to about 600 MB, perhaps it's more likely that these allocations land on OS heap pages that also contain memory shared with the server.  Again, the test reports in comments 22 and 25 neither confirm nor refute this, since the server process was still relatively small and under the 600 MB threshold.
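
The copy-on-write effect described above can be demonstrated in isolation: a forked child's PSS grows once it allocates objects and runs GC over heap pages inherited from the parent, even though very little of its memory is private at fork time. This is only a sketch of the mechanism on a generic Ruby, not a reproduction of the bug:

# Illustrative copy-on-write demo (Linux only): the child's PSS grows as it
# dirties Ruby heap pages that were shared with the parent at fork time
def pss_kb(pid)
  File.readlines("/proc/#{pid}/smaps").grep(/^Pss:/).sum { |line| line.split[1].to_i }
end

parent_data = Array.new(500_000) { |i| "parent string #{i}" } # heap populated before fork, shared with the child

child = fork do
  before = pss_kb(Process.pid)
  100_000.times { |i| "child string #{i}" } # new allocations land on partially shared heap pages
  GC.start                                  # GC also writes to those pages, un-sharing them
  puts "child PSS: #{before} kB -> #{pss_kb(Process.pid)} kB"
end

Process.wait(child)
puts "parent kept #{parent_data.size} strings alive across the fork"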

Comment 28 Joe Rafaniello 2017-08-29 19:16:11 UTC
Note, we still haven't been able to recreate this "exceeding memory threshold upon smart proxy worker boot" behavior.  My hunch is still that the issue only occurs when the server process that forks the workers has already ballooned to over 1 GB and the system is under significant load.  I don't believe there is anything we can do here without first recreating it.  The next version of CFME will most likely no longer use fork, so if forking is part of the problem, it will go away.

