Description of problem:
Worker processes that have been requested to exit are not being killed, allowing them to run indefinitely while consuming PostgreSQL sessions, appliance CPU, and appliance memory.

Version-Release number of selected component (if applicable):
5.6.1.2

How reproducible:
It should be sufficient to create an automation task that simply waits for hours, exceeding the default message timeout of 600 seconds (a minimal sketch of such a task is at the end of this comment). The worker process will be signaled to terminate, but it will not exit until the automate task's wait time expires; only then will the worker honor the terminate request, which can be hours later.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I am providing an extract from a customer log in which a reporting worker has exceeded the memory threshold and is requested to terminate, but does not do so until the default message time for reports (3600 seconds) has expired; only then does it (apparently) recognize the request to terminate. An extract from top shows that the process (pid 21426) continues to run until the 3600-second message time is exhausted.
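For reference, a minimal sketch of an Automate method body that would reproduce the hang (this assumes the standard Automate $evm context; the timing and log messages are illustrative, not taken from the customer's task):

  $evm.log("info", "simulating a long-running automate task")
  sleep(2 * 3600) # two hours, well past the 600-second msg_timeout default
  $evm.log("info", "wait finished; the worker can only now honor a pending exit request")

While the sleep is running, the worker is asked to terminate but keeps running until the sleep returns.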
Partially related PR: https://github.com/ManageIQ/manageiq/pull/12932

While reviewing the code changed in that PR, it became clear that "stopping" workers (ones that exceed the memory threshold and are asked to exit) will continue to heartbeat to the server via DRb, but the server will not persist the heartbeat to the worker row's last_heartbeat column. This is why we see "stopping" workers with a very old last heartbeat value. Neither the code change nor the information above fixes the reported problem.

The options we have are:
* Create both "gracefully exit" (the existing one) and "kill" thresholds. This means more configuration, and there is still the chance that workers could run too long or get killed too quickly. Note that in the old MIQ bug in comment 8 above, the original idea was to ask the worker to gracefully exit at the "threshold" and kill it at 2 times the threshold (a rough sketch of that idea follows below).
* Continue to allow workers doing work to possibly run far too long.
* Possibly kill workers too early, i.e. workers that would complete their work just beyond the memory threshold but would be killed if we took the "kill when exceeding the threshold" option.
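A rough sketch of that "two thresholds" idea (this is not ManageIQ code; the Process.kill calls stand in for whatever exit-request/kill plumbing the server actually uses):

  def enforce_memory_threshold(pid, usage_bytes, threshold_bytes)
    if usage_bytes > 2 * threshold_bytes
      Process.kill("KILL", pid)   # hard kill once usage doubles the threshold
    elsif usage_bytes > threshold_bytes
      Process.kill("TERM", pid)   # graceful exit request at the threshold
    end
  end

The downsides noted above still apply: between the two thresholds a worker may keep running far too long, and just past 2x it may be killed while finishing useful work.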
The side effect of not killing the worker for possibly another hour is that there are often two processes running at the same time. This often forces us into swap, which hits performance hard. Solving this side effect may be simpler and would tide us over until we solve the core issue.
Sorry, not enough detail. When a worker is running too long or taking up too much memory, we send a signal to the process to shut down. At about the same time, we start up a new process to take over the work. This works if the first process shuts down quickly, but if it takes a while to shut down, we end up with two processes running: one that is already taking up too much memory, plus a second one. When the appliance hits swap, the presence of the two processes is often what customers report, along with the fact that the original process is still using more memory than the threshold. See the sketch below.
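To make the overlap concrete, a hypothetical sketch (the helper name and the "bin/worker" command are made up; ManageIQ's real worker start/stop plumbing is different):

  def replace_worker(old_pid)
    Process.kill("TERM", old_pid)          # ask the old worker to shut down gracefully
    new_pid = Process.spawn("bin/worker")  # start the replacement right away
    # Until old_pid actually exits, both processes are alive. If the old one
    # ignores the request for an hour, both run (and consume memory) for that hour.
    new_pid
  end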
Kenan, you are on the right track, but the analysis is incomplete. If one worker is signaled to terminate for exceeding some limit (memory or time threshold), another worker starts while the signaled worker does whatever it is going to do. If the replacement worker encounters a similar problem while the original worker is still active and consuming resources on the appliance, it too will be signaled to terminate and a replacement worker started. If the replacement's replacement encounters a similar problem as the first two, the cycle repeats, and so on. IMHO, there is really only one solution, as distasteful as it might seem: honor the limits strictly. Signal the worker to terminate, and if it does not terminate within the default worker heartbeat interval (currently between 10 and 30 seconds) after being signaled, kill it.
Just some notes from discussions we've had regarding this issue:

* The existing mechanism of stopping a worker gracefully and starting a replacement while the original is shutting down is done to avoid losing data. Changing this requires more coordination logic and more chances for bugs, when the start/kill algorithm should be there to protect us from runaway workers.
* The start/kill algorithm [1] should be used to avoid starting new workers, or to kill existing workers, when a specific amount of swap usage is exceeded. This should be the safety measure that prevents the prior bullet point from going out of control. This needs to be confirmed by me. (See the sketch after these notes.)
* The start/kill algorithm values above need to be refined; 60 and 80 are way too high. My personal thought is to no longer start new workers (start) at 5% swap used and to start killing workers (kill) at 10%. Once we're in swap, performance is severely impacted, so we should probably err on the side of caution. Besides, we raise events when these occur, so we should be able to react to them. Note, 5/10 is probably wrong, but the existing numbers are too high.
* When a worker exceeds the memory threshold, it is asked to exit gracefully and marked as stopping. It is not killed.
* Stopping workers are never killed, but they should be. We should check stopping workers the way we handle not_responding or memory_exceeded workers, but give the stopping ones a grace period of 10 minutes or so.

Either way, we're going to be changing behavior from how it's always been, so we'll be killing workers doing useful work even though we're not invading swap. At the same time, a queue worker probably shouldn't be doing a report for 30+ minutes. If a report takes that long, we should kill the worker, and the error should tell us the report is not optimized and that we should fix the custom report, either in the report itself or in our backend code if it's inefficient.

[1]
:worker_monitor:
  :kill_algorithm:
    :name: :used_swap_percent_gt_value
    :value: 80
  :start_algorithm:
    :name: :used_swap_percent_lt_value
    :value: 60
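For clarity, a small illustrative sketch of how the start/kill thresholds in [1] are intended to behave. This is not ManageIQ's actual implementation; it reads swap usage straight from /proc/meminfo rather than going through the appliance's own system helpers:

  def used_swap_percent
    info  = File.read("/proc/meminfo")
    total = info[/SwapTotal:\s+(\d+)/, 1].to_f
    free  = info[/SwapFree:\s+(\d+)/, 1].to_f
    total.zero? ? 0.0 : ((total - free) / total) * 100
  end

  start_threshold = 60 # :start_algorithm => :used_swap_percent_lt_value
  kill_threshold  = 80 # :kill_algorithm  => :used_swap_percent_gt_value

  ok_to_start_new_workers = used_swap_percent < start_threshold
  must_start_killing      = used_swap_percent > kill_threshold

Dropping the thresholds to something like 5/10 only changes the two numbers; the shape of the check stays the same.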
Opened bug https://bugzilla.redhat.com/show_bug.cgi?id=1415290 to track fixing the critical-section access of the worker heartbeat without a mutex. While it is not causing this issue, nor is it likely a big problem, it was a bug and something that needed to be fixed.
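For context on that bug, an illustrative example (not the actual fix) of the general pattern: guarding a shared heartbeat timestamp with a Mutex so the monitoring thread and the heartbeating thread never read and write it at the same time:

  require "thread"

  class HeartbeatBox
    def initialize
      @mutex          = Mutex.new
      @last_heartbeat = Time.now
    end

    # called by the heartbeating thread
    def beat!
      @mutex.synchronize { @last_heartbeat = Time.now }
    end

    # called by the monitoring thread
    def last_heartbeat
      @mutex.synchronize { @last_heartbeat }
    end
  end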
Please disregard comment 21; it was for the wrong bug.
This is the direction we're thinking of trying:
* Allow workers to gracefully exit
* Give them 10 minutes to exit gracefully before killing them, as sketched below
* Expose this as a configurable option
* Raise an event when this happens

The system limits code should keep appliances from getting destroyed:
* Currently, we stop starting new workers at 60% swap usage (maybe drop this to 10%)
* Currently, we start killing workers at 80% swap usage (maybe drop to 40%)
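A minimal sketch of the grace-period check mentioned in the list above (the names here are hypothetical; the actual implementation and its configurable setting landed later in the linked PRs):

  STOPPING_TIMEOUT = 10 * 60 # seconds; this is the value we'd expose as configurable

  def enforce_stopping_timeout(worker_pid, stopping_since, now = Time.now)
    return if now - stopping_since < STOPPING_TIMEOUT
    Process.kill("KILL", worker_pid) # grace period expired: kill the stopping worker
    # a "worker killed" event would be raised here as well
  end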
I would suggest not changing the swap limits at this time, as that action will itself have unexpected consequences (it is not uncommon for swap to be invaded by 2-3 GB without appliance operations being noticeably impacted). Besides, since these values (60% and 80%) are already exposed and presumably changeable, it is already possible for any customer to make such a change if they deem it necessary. Let's just enforce the memory_threshold and restart_interval limits by ensuring that worker processes (and their children) that are requested to terminate are removed if they do not comply within a reasonable timeframe.
https://github.com/ManageIQ/manageiq/pull/13805
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/e5f4bd3fe1299070e40d235be963428b4f9a2d14

commit e5f4bd3fe1299070e40d235be963428b4f9a2d14
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Jan 3 11:18:07 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Feb 10 16:35:55 2017 -0500

    Kill workers that don't stop after a configurable time

    Previously, we'd gracefully ask them to exit and if the queue work they're
    doing, takes 1 hour to do, they'd exceed memory thresholds, keep running
    until the work is done and finally respond to the exit request.

    Now, we mark them as 'stopping' when they exceed a threshold and they have
    up to 10 minutes to finish before we'd kill them. This value is configurable
    in the 'stopping_timeout' field in each worker's advanced settings.

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 8 ++++-
 app/models/miq_worker.rb | 8 +++++
 config/settings.yml | 1 +
 .../miq_server/worker_management/monitor_spec.rb | 34 ++++++++++++++++++++++
 spec/models/miq_worker_spec.rb | 21 +++++++++++++
 5 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 spec/models/miq_server/worker_management/monitor_spec.rb
https://github.com/ManageIQ/manageiq/pull/13919
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/66368530acd2e31a13e225d360f9da853cd6cfa3

commit 66368530acd2e31a13e225d360f9da853cd6cfa3
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Feb 14 12:02:51 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Feb 14 12:29:00 2017 -0500

    Spec cleanup

    * remove any_instance_of since we have an instance of a server! :tada:
    * use lets, remove some ivars

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 spec/models/miq_server/worker_monitor_spec.rb | 74 +++++++++++++++------------
 1 file changed, 40 insertions(+), 34 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/3ea55faefa466bab32ef7debdc77b303de52b16d

commit 3ea55faefa466bab32ef7debdc77b303de52b16d
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Feb 14 12:17:31 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Feb 14 12:29:00 2017 -0500

    Fix Reason NameError using constants included in MiqServer

    Due to module inclusion spaghetti, it's easier and less confusing to
    reference the Reason constants consistently in the MiqServer class,
    which is the ultimate destination for all of these modules.

    Fixes #13901 introduced in #13805

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 4 ++--
 app/models/miq_server/worker_management/monitor/validation.rb | 2 +-
 spec/models/miq_server/worker_monitor_spec.rb | 6 ++++++
 3 files changed, 9 insertions(+), 3 deletions(-)
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/97aa6571357239d9db9bc50012a77c21773687a7

commit 97aa6571357239d9db9bc50012a77c21773687a7
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Mon Feb 13 22:12:59 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:32:16 2017 -0500

    Merge pull request #13805 from jrafanie/stopping_worker

    Kill workers that don't stop after a configurable time

    (cherry picked from commit 97648709a719ce59679ef0d4833ff7eb16c6d6d3)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 8 +++--
 .../miq_server/worker_management/monitor/reason.rb | 3 ++
 .../worker_management/monitor/validation.rb | 2 +-
 app/models/miq_worker.rb | 8 +++++
 config/settings.yml | 1 +
 .../miq_server/worker_management/monitor_spec.rb | 34 ++++++++++++++++++++++
 spec/models/miq_worker_spec.rb | 21 +++++++++++++
 7 files changed, 74 insertions(+), 3 deletions(-)
 create mode 100644 spec/models/miq_server/worker_management/monitor_spec.rb
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/fbb362598910958de9cd48a74392e2fdf1c9ace6

commit fbb362598910958de9cd48a74392e2fdf1c9ace6
Author:     Chris Arcand <chrisarcand.github.com>
AuthorDate: Tue Feb 14 12:13:38 2017 -0600
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:32:56 2017 -0500

    Merge pull request #13919 from jrafanie/fix_missing_reason_contants

    Fix missing reason contants

    (cherry picked from commit 89dfba8b95059d7349ebca79f84f7f277928a080)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 4 +-
 .../worker_management/monitor/validation.rb | 2 +-
 spec/models/miq_server/worker_monitor_spec.rb | 78 +++++++++++++---------
 3 files changed, 48 insertions(+), 36 deletions(-)
*** Bug 1430106 has been marked as a duplicate of this bug. ***
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/1d603c9c0a8ca39cc9d27f63db39c1f196698606

commit 1d603c9c0a8ca39cc9d27f63db39c1f196698606
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Mon Feb 13 22:12:59 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:35:09 2017 -0500

    Merge pull request #13805 from jrafanie/stopping_worker

    Kill workers that don't stop after a configurable time

    (cherry picked from commit 97648709a719ce59679ef0d4833ff7eb16c6d6d3)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 8 +++--
 .../miq_server/worker_management/monitor/reason.rb | 3 ++
 .../worker_management/monitor/validation.rb | 2 +-
 app/models/miq_worker.rb | 8 +++++
 config/settings.yml | 1 +
 .../miq_server/worker_management/monitor_spec.rb | 34 ++++++++++++++++++++++
 spec/models/miq_worker_spec.rb | 21 +++++++++++++
 7 files changed, 74 insertions(+), 3 deletions(-)
 create mode 100644 spec/models/miq_server/worker_management/monitor_spec.rb
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/2b5d162a01afffb5e8a812913fffa72783e7507e

commit 2b5d162a01afffb5e8a812913fffa72783e7507e
Author:     Chris Arcand <chrisarcand.github.com>
AuthorDate: Tue Feb 14 12:13:38 2017 -0600
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:35:10 2017 -0500

    Merge pull request #13919 from jrafanie/fix_missing_reason_contants

    Fix missing reason contants

    (cherry picked from commit 89dfba8b95059d7349ebca79f84f7f277928a080)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 4 +-
 .../worker_management/monitor/validation.rb | 2 +-
 spec/models/miq_server/worker_monitor_spec.rb | 78 +++++++++++++---------
 3 files changed, 48 insertions(+), 36 deletions(-)