Created attachment 1310601 [details]
Generic worker growth (yellow legend) w.r.t. metrics collector queue

+++ This bug was initially created as a clone of Bug #1479339 +++

Description of problem:
Observed memory leaks in the Generic Worker with a 10k-VM VMware infra provider connected. The run lasted over 3 days, but the leaks appeared within the first day.

Version-Release number of selected component (if applicable):
CFME 5.8.0.17

How reproducible:
I had increased the memory thresholds / counts for specific worker processes on all appliances, just enough to accommodate that many VMs on a 6-appliance setup.

Steps to Reproduce:
1. Set up 6 appliances (1 DB, 5 workers).
2. Turn on C&U on all worker appliances (and cluster-wide C&U collection settings in config) and keep server roles to a minimum on the DB appliance.
3. Connect to a 10k-VM VMware infra provider and let it run for 2-3 days while keeping an eye on C&U data collector worker memory usage.

For reference (worker count, memory threshold):

# DB appliance
- Generic - 2, 500 MB
- Priority - 2, 600 MB

# Worker appliances
- Generic - 4, 500 MB
- Priority - 2, 800 MB
- C&U Data Collectors - 6, 600 MB
- C&U Data Processors - 4, 800 MB
- Refresh - 2 GB

Actual results:
Generic Worker memory grew from about 1.5 GB to 2.8 GB.

Expected results:
Little or no memory growth after the initial C&U/refresh period.

Additional info:
Attaching a screenshot for reference.

Original comment (from the BZ about the MetricsCollector worker leak with multiple providers): https://bugzilla.redhat.com/show_bug.cgi?id=1456775#c29
Also, this occurs at the same time as the metrics processor and collector leak: https://bugzilla.redhat.com/show_bug.cgi?id=1479339
Created attachment 1310615 [details]
Generic worker leakage (blue legend) w.r.t. stable processor queue and powered-on/off VMs
Created attachment 1311306 [details]
PSS & RSS utilization - 4+ day test run

Chart showing PSS & RSS memory utilization on a CFME 5.8.0.17-1 appliance.

Worker Config: Single Generic Worker, 1.5 GB Memory Threshold

Provider:
- Clusters: 10
- Hosts: 50
- Datastores: 61
- VMs: 1,000
- Type: VMware VC 5.5.0
Data from a 4+ day run on my test CFME 5.8.0.17-1 appliance (see comment #4 above) with C&U enabled does not show the leak. Can we get a full set of logs to review?
Created attachment 1311826 [details]
PSS Utilization (Short Test)

Adding two charts to show the following observation: CPU and PSS memory usage spike on the hour. CPU utilization decreases over the hour, while the PSS size is not reduced. Data collected on a CFME 5.8.0.17-1 appliance configured with a single instance of each worker type and no provider added. The appliance was booted at 09:24.

NOTE: The long-duration test (4+ days) shows PSS utilization leveled off after about 12 hours.
Created attachment 1311827 [details]
CPU Utilization (Short Test)

Adding two charts to show the following observation: CPU and PSS memory usage spike on the hour. CPU utilization decreases over the hour, while the PSS size is not reduced. Data collected on a CFME 5.8.0.17-1 appliance configured with a single instance of each worker type and no provider added. The appliance was booted at 09:24.

NOTE: The long-duration test (4+ days) shows PSS utilization leveled off after about 12 hours.
Comment on attachment 1311826 [details]
PSS Utilization (Short Test)

Note: the previous comment associated with the short-test PSS utilization chart incorrectly stated that no provider was added.
Comment on attachment 1311827 [details]
CPU Utilization (Short Test)

Note: the previous comment associated with the short-test CPU utilization chart incorrectly stated that no provider was added.
Does anyone have both the historical and current sets of evm.log and top_output.log files? We need to track the RSS of the server process (the one that forks workers) and what it was doing, and the same for the forked worker exhibiting the problem. I've looked at the logs provided, and they include only the evm.log.
If top_output.log shows the server process is > 1 GB and the forked child worker starts > 1 GB RSS, it's likely related to bug 1425217
Relative to comment https://bugzilla.redhat.com/show_bug.cgi?id=1479356#c11

1. Did the failure mode begin recently, or has this been an ongoing issue that has been addressed thus far by increasing the threshold value over time?
2. Is the failure mode happening on all appliances in the environment?
3. Do we have an ETA on the full log set(s)?
Accidentally cleared the need info flag, resetting it.
Created attachment 1349013 [details] austx01-generic_workers_pss.pdf: Generic workers PSS usage
Created attachment 1349014 [details] austx01-max_worker_pss.pdf: Shows maximum PSS used per worker type
Created attachment 1349015 [details] austx01-mem_threshold_exceptions_counts_by_worker.pdf: # of times the memory threshold was exceeded, by worker
Created attachment 1349016 [details] austx01-free.pdf: Appliance memory usage - OS perspective
https://github.com/ManageIQ/manageiq-gems-pending/pull/313
https://github.com/ManageIQ/manageiq-gems-pending/pull/314
https://github.com/ManageIQ/manageiq-gems-pending/pull/312
https://github.com/ManageIQ/manageiq/pull/16480
Archit: Do you have an environment that I can access that has this leak? Or can you reproduce it again?

Thanks,
Keenan
There could possibly
Premature save-changes click. Basically, there might have been a leak in the generic worker, but based on comment 25 by Dennis, the fundamental problem is that a leaking miq server process, growing in size, causes new generic workers (and all workers) to be created with a very large amount of shared memory inherited from the server process. This leads to a very high PSS and causes the worker to be killed. PSS is the wrong metric for deciding whether a worker is leaking, because we can wrongly conclude a worker is leaking when it is merely inheriting a large amount of shared memory (and therefore a large PSS). We're changing this to use USS.
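To make the difference concrete, here is a minimal Ruby sketch (not ManageIQ code; it assumes a Linux /proc filesystem) that inflates a parent "server" process and then forks a child "worker". The freshly forked child shows a large PSS purely from its share of the inherited copy-on-write pages, while its USS stays near zero:

#!/usr/bin/env ruby
# Sums the Pss and Private_Clean/Private_Dirty fields of /proc/<pid>/smaps
# to approximate the PSS and USS of a process, in kB.
def smaps_kb(pid)
  pss = uss = 0
  File.foreach("/proc/#{pid}/smaps") do |line|
    case line
    when /\APss:\s+(\d+)\s+kB/                     then pss += $1.to_i
    when /\APrivate_(?:Clean|Dirty):\s+(\d+)\s+kB/ then uss += $1.to_i
    end
  end
  {:pss_kb => pss, :uss_kb => uss}
end

# Inflate the parent ("server") process, then fork a child ("worker").
junk = Array.new(2_000_000) { "leak" }  # keep a reference to ~100+ MB of private memory

child = fork do
  sleep 2  # the child does no work of its own; its pages are still shared COW
  # Large PSS (a share of the inherited pages), near-zero USS.
  puts "worker: #{smaps_kb(Process.pid).inspect}"
end

# Large PSS *and* large USS: the parent really owns this memory.
puts "server: #{smaps_kb(Process.pid).inspect}"
Process.wait(child)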
The master and gaprindashvili branches also change the worker validation to use unique set size, but there it is done through a schema change that adds this column to the miq_workers and miq_servers tables:

https://github.com/ManageIQ/manageiq-schema/pull/139
https://github.com/ManageIQ/manageiq-gems-pending/pull/317
https://github.com/ManageIQ/manageiq/pull/16569
https://github.com/ManageIQ/manageiq/pull/16570
Note, the changes to use unique set size (USS) instead of proportional set size (PSS) don't fix the server leak; they mean we no longer penalize workers for the large amount of memory they inherit from a large miq server process when they're forked. top will still show high memory usage (RSS) for workers that inherited memory from a large server; to see the USS you need a tool such as smem (e.g., smem -P MIQ). bin/rake evm:status now shows the USS value as well. We will continue to track down and fix the server memory growth, but at least we are no longer prematurely killing workers.
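For reference, here is a rough Ruby sketch of the kind of per-process totals that smem -P MIQ reports (a hedged approximation, not the smem implementation; it assumes a Linux /proc filesystem and enough privilege to read other processes' smaps files):

#!/usr/bin/env ruby
# For every process whose command line matches a pattern, total up
# RSS, PSS, and USS from /proc/<pid>/smaps.
pattern = Regexp.new(ARGV[0] || "MIQ")

Dir.glob("/proc/[0-9]*").each do |dir|
  begin
    cmdline = File.read("#{dir}/cmdline").tr("\0", " ").strip
    next if cmdline.empty? || cmdline !~ pattern

    rss = pss = uss = 0
    File.foreach("#{dir}/smaps") do |line|
      case line
      when /\ARss:\s+(\d+)/                     then rss += $1.to_i
      when /\APss:\s+(\d+)/                     then pss += $1.to_i
      when /\APrivate_(?:Clean|Dirty):\s+(\d+)/ then uss += $1.to_i
      end
    end

    printf("%6s  rss=%8d kB  pss=%8d kB  uss=%8d kB  %s\n",
           File.basename(dir), rss, pss, uss, cmdline[0, 60])
  rescue Errno::ENOENT, Errno::EACCES
    next # process exited or is not readable; skip it
  end
end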
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/06030f3826af407ef18ee54efcd0e0c5b48b8044

commit 06030f3826af407ef18ee54efcd0e0c5b48b8044
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Nov 13 16:26:26 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 11:18:36 2017 -0500

    Store unique set size (USS) in the PSS column

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1526474

    Unique set size is a better way to detect workers that are growing
    unbounded since any memory/reference leaks would be shown in their uss.
    If the server process is large when forking, new workers would inherit
    a big pss immediately.

    We should really rename the column/hash key to uss.

 gems/pending/util/miq-process.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
*** Bug 1506737 has been marked as a duplicate of this bug. ***
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/2d758778b55484a96e423317bbf832c5293eb0ed

commit 2d758778b55484a96e423317bbf832c5293eb0ed
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Nov 30 15:27:11 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 16:55:02 2017 -0500

    Also log the server/worker unique set size (USS)

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

 app/models/miq_server/status_management.rb | 2 +-
 app/models/miq_worker.rb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/eb028e0c47df57b19dc830175ade75b140146646

commit eb028e0c47df57b19dc830175ade75b140146646
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Nov 30 15:29:17 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 16:55:02 2017 -0500

    Show USS instead of PSS in rake evm:status

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

 lib/tasks/evm_application.rb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/ef84a880817385e5f0edc6cbb4410819140999a4

commit ef84a880817385e5f0edc6cbb4410819140999a4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Nov 30 15:29:26 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 16:55:32 2017 -0500

    Change worker validation to check USS not PSS

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1479356

    Why? USS is a more reliable mechanism for tracking workers with runaway
    memory growth. PSS is great, until the server process that forks new
    processes grows large. As each new worker is forked, it inherits a share
    of the large amount of the parent process' memory and therefore starts
    with a large PSS, possibly exceeding our limits before doing any work.

    USS only measures a process' private memory and is a better indicator
    when a process is responsible for allocating too much memory without
    freeing it.

 app/models/miq_server/worker_management/monitor/validation.rb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
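For illustration, the shape of that validation change is roughly the following (a hedged Ruby sketch; worker_memory_exceeded?, unique_set_size, and proportional_set_size are hypothetical names, not the actual MiqWorker API in validation.rb):

# Hypothetical sketch of a USS-based worker memory check.
def worker_memory_exceeded?(worker, threshold_bytes)
  # Before: a freshly forked worker could trip the limit immediately,
  # purely from memory inherited (shared) from a large server process:
  #   worker.proportional_set_size > threshold_bytes

  # After: only the worker's private (unshared) memory counts against it.
  worker.unique_set_size > threshold_bytes
end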
New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/d51285ca19c96304c7e3b521ae16713d75cfcee1

commit d51285ca19c96304c7e3b521ae16713d75cfcee1
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Nov 13 16:26:26 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Dec 15 11:20:13 2017 -0500

    Store unique set size (USS) in the PSS column

    https://bugzilla.redhat.com/show_bug.cgi?id=1479356

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1526473

    Unique set size is a better way to detect workers that are growing
    unbounded since any memory/reference leaks would be shown in their uss.
    If the server process is large when forking, new workers would inherit
    a big pss immediately.

    We should really rename the column/hash key to uss.

 lib/gems/pending/util/miq-process.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)