Description of problem:
Worker processes that have been requested to exit are not being killed, allowing them to run indefinitely while consuming PostgreSQL sessions, appliance CPU, and appliance memory.

Version-Release number of selected component (if applicable):
5.6.1.2

How reproducible:
It should be sufficient to create an automation task that simply waits for hours, exceeding the default message timeout of 600 seconds (a minimal sketch of such a task is at the end of this comment). The worker process will be signaled to terminate, but it will not exit until the automate task's wait time expires; only then will the worker honor the terminate request, which can be hours later.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I am providing an extract from a customer log in which a reporting worker has exceeded the memory threshold and is requested to terminate, but does not do so until the default message time for reports (3600 seconds) has expired; only then does it (apparently) recognize the request to terminate. An extract from top shows that the process (pid 21426) continues to run until the 3600-second message time is exhausted.
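For reference, a minimal sketch of an Automate method body that would reproduce the hang (this assumes the standard Automate $evm context; the timing and log messages are illustrative, not taken from the customer's task):

  $evm.log("info", "simulating a long-running automate task")
  sleep(2 * 3600) # two hours, well past the 600-second msg_timeout default
  $evm.log("info", "wait finished; the worker can only now honor a pending exit request")

While the sleep is running, the worker is asked to terminate but keeps running until the sleep returns.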
Partially related PR: https://github.com/ManageIQ/manageiq/pull/12932

While reviewing the code changed in that PR, it became clear that "stopping" workers (ones that exceed the memory threshold and are asked to exit) will continue to heartbeat to the server via DRb, but the server will not persist the heartbeat to the worker row's last_heartbeat column. This is why we see "stopping" workers with a very old last heartbeat value. Neither the code change nor the information above fixes the reported problem.

The options we have are:
* Create both "gracefully exit" (the existing one) and "kill" thresholds. This means more configuration, and there is still the chance that workers could run too long or get killed too quickly. Note that in the old MIQ bug in comment 8 above, the original idea was to ask the worker to gracefully exit at the "threshold" and kill it at 2 times the threshold (a rough sketch of that idea follows below).
* Continue to allow workers doing work to possibly run far too long.
* Possibly kill workers too early, i.e. workers that would complete their work just beyond the memory threshold but would be killed if we took the "kill when exceeding the threshold" option.
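A rough sketch of that "two thresholds" idea (this is not ManageIQ code; the Process.kill calls stand in for whatever exit-request/kill plumbing the server actually uses):

  def enforce_memory_threshold(pid, usage_bytes, threshold_bytes)
    if usage_bytes > 2 * threshold_bytes
      Process.kill("KILL", pid)   # hard kill once usage doubles the threshold
    elsif usage_bytes > threshold_bytes
      Process.kill("TERM", pid)   # graceful exit request at the threshold
    end
  end

The downsides noted above still apply: between the two thresholds a worker may keep running far too long, and just past 2x it may be killed while finishing useful work.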
The side effect of not killing the worker for possibly another hour is that there are often two processes running at the same time. This often forces us into swap, which hits performance hard. Solving this side effect may be simpler and would tide us over until we solve the core issue.
Sorry, not enough detail. When a worker is running too long or taking up too much memory, we send a signal to the process to shut down. At about the same time, we start up a new process to take over the work. This works if the first process shuts down quickly, but if it takes a while to shut down, we end up with two processes running: one that is already taking up too much memory, plus a second one. When the appliance hits swap, the presence of the two processes is often what customers report, along with the fact that the original process is still using more memory than the threshold. See the sketch below.
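To make the overlap concrete, a hypothetical sketch (the helper name and the "bin/worker" command are made up; ManageIQ's real worker start/stop plumbing is different):

  def replace_worker(old_pid)
    Process.kill("TERM", old_pid)          # ask the old worker to shut down gracefully
    new_pid = Process.spawn("bin/worker")  # start the replacement right away
    # Until old_pid actually exits, both processes are alive. If the old one
    # ignores the request for an hour, both run (and consume memory) for that hour.
    new_pid
  end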
Kenan, you are on the right track, but the analysis is incomplete. If one worker is signaled to terminate for exceeding some limit (memory or time threshold), another worker starts while the signaled worker does whatever it is going to do. If the replacement worker encounters a similar problem while the original worker is still active and consuming resources on the appliance, it too will be signaled to terminate and a replacement worker started. If the replacement's replacement encounters a similar problem as the first two, the cycle repeats, and so on. IMHO, there is really only one solution, as distasteful as it might seem: honor the limits strictly. Signal the worker to terminate, and if it does not terminate within the default worker heartbeat interval (currently between 10 and 30 seconds) after being signaled, kill it.
Just some notes from discussions we've had regarding this issue:

* The existing mechanism of stopping a worker gracefully and starting a replacement while the original is shutting down is done to avoid losing data. Changing this requires more coordination logic and more chances for bugs, when the start/kill algorithm should be there to protect us from runaway workers.
* The start/kill algorithm [1] should be used to avoid starting new workers, or to kill existing workers, when a specific amount of swap usage is exceeded. This should be the safety measure that prevents the prior bullet point from going out of control. This needs to be confirmed by me. (See the sketch after these notes.)
* The start/kill algorithm values above need to be refined; 60 and 80 are way too high. My personal thought is to no longer start new workers (start) at 5% swap used and to start killing workers (kill) at 10%. Once we're in swap, performance is severely impacted, so we should probably err on the side of caution. Besides, we raise events when these occur, so we should be able to react to them. Note, 5/10 is probably wrong, but the existing numbers are too high.
* When a worker exceeds the memory threshold, it is asked to exit gracefully and marked as stopping. It is not killed.
* Stopping workers are never killed, but they should be. We should check stopping workers the way we handle not_responding or memory_exceeded workers, but give the stopping ones a grace period of 10 minutes or so.

Either way, we're going to be changing behavior from how it's always been, so we'll be killing workers doing useful work even though we're not invading swap. At the same time, a queue worker probably shouldn't be doing a report for 30+ minutes. If a report takes that long, we should kill the worker, and the error should tell us the report is not optimized and that we should fix the custom report, either in the report itself or in our backend code if it's inefficient.

[1]
:worker_monitor:
  :kill_algorithm:
    :name: :used_swap_percent_gt_value
    :value: 80
  :start_algorithm:
    :name: :used_swap_percent_lt_value
    :value: 60
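For clarity, a small illustrative sketch of how the start/kill thresholds in [1] are intended to behave. This is not ManageIQ's actual implementation; it reads swap usage straight from /proc/meminfo rather than going through the appliance's own system helpers:

  def used_swap_percent
    info  = File.read("/proc/meminfo")
    total = info[/SwapTotal:\s+(\d+)/, 1].to_f
    free  = info[/SwapFree:\s+(\d+)/, 1].to_f
    total.zero? ? 0.0 : ((total - free) / total) * 100
  end

  start_threshold = 60 # :start_algorithm => :used_swap_percent_lt_value
  kill_threshold  = 80 # :kill_algorithm  => :used_swap_percent_gt_value

  ok_to_start_new_workers = used_swap_percent < start_threshold
  must_start_killing      = used_swap_percent > kill_threshold

Dropping the thresholds to something like 5/10 only changes the two numbers; the shape of the check stays the same.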
Opened bug https://bugzilla.redhat.com/show_bug.cgi?id=1415290 to track fixing the critical-section access of the worker heartbeat without a mutex. While it is not causing this issue, nor is it likely a big problem, it was a bug and something that needed to be fixed.
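For context on that bug, an illustrative example (not the actual fix) of the general pattern: guarding a shared heartbeat timestamp with a Mutex so the monitoring thread and the heartbeating thread never read and write it at the same time:

  require "thread"

  class HeartbeatBox
    def initialize
      @mutex          = Mutex.new
      @last_heartbeat = Time.now
    end

    # called by the heartbeating thread
    def beat!
      @mutex.synchronize { @last_heartbeat = Time.now }
    end

    # called by the monitoring thread
    def last_heartbeat
      @mutex.synchronize { @last_heartbeat }
    end
  end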
Please disregard comment 21; it was for the wrong bug.
This is the direction we're thinking of trying:
* Allow workers to gracefully exit
* Give them 10 minutes to exit gracefully before killing them, as sketched below
* Expose this as a configurable option
* Raise an event when this happens

The system limits code should keep appliances from getting destroyed:
* Currently, we stop starting new workers at 60% swap usage (maybe drop this to 10%)
* Currently, we start killing workers at 80% swap usage (maybe drop to 40%)
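A minimal sketch of the grace-period check mentioned in the list above (the names here are hypothetical; the actual implementation and its configurable setting landed later in the linked PRs):

  STOPPING_TIMEOUT = 10 * 60 # seconds; this is the value we'd expose as configurable

  def enforce_stopping_timeout(worker_pid, stopping_since, now = Time.now)
    return if now - stopping_since < STOPPING_TIMEOUT
    Process.kill("KILL", worker_pid) # grace period expired: kill the stopping worker
    # a "worker killed" event would be raised here as well
  end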
I would suggest not changing the swap limits at this time, as that action will itself have unexpected consequences (it is not uncommon for swap to be invaded by 2-3 GB without appliance operations being noticeably impacted). Besides, since these values (60% and 80%) are already exposed and presumably changeable, it is already possible for any customer to make such a change if they deem it necessary. Let's just enforce the memory_threshold and restart_interval limits by ensuring that worker processes (and their children) that are requested to terminate are removed if they do not comply within a reasonable timeframe.
https://github.com/ManageIQ/manageiq/pull/13805
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/e5f4bd3fe1299070e40d235be963428b4f9a2d14

commit e5f4bd3fe1299070e40d235be963428b4f9a2d14
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Jan 3 11:18:07 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Feb 10 16:35:55 2017 -0500

    Kill workers that don't stop after a configurable time

    Previously, we'd gracefully ask them to exit and if the queue work they're
    doing, takes 1 hour to do, they'd exceed memory thresholds, keep running
    until the work is done and finally respond to the exit request.

    Now, we mark them as 'stopping' when they exceed a threshold and they have
    up to 10 minutes to finish before we'd kill them. This value is configurable
    in the 'stopping_timeout' field in each worker's advanced settings.

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 8 ++++-
 app/models/miq_worker.rb | 8 +++++
 config/settings.yml | 1 +
 .../miq_server/worker_management/monitor_spec.rb | 34 ++++++++++++++++++++++
 spec/models/miq_worker_spec.rb | 21 +++++++++++++
 5 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 spec/models/miq_server/worker_management/monitor_spec.rb
https://github.com/ManageIQ/manageiq/pull/13919
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/66368530acd2e31a13e225d360f9da853cd6cfa3

commit 66368530acd2e31a13e225d360f9da853cd6cfa3
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Feb 14 12:02:51 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Feb 14 12:29:00 2017 -0500

    Spec cleanup

    * remove any_instance_of since we have an instance of a server! :tada:
    * use lets, remove some ivars

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 spec/models/miq_server/worker_monitor_spec.rb | 74 +++++++++++++++------------
 1 file changed, 40 insertions(+), 34 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/3ea55faefa466bab32ef7debdc77b303de52b16d

commit 3ea55faefa466bab32ef7debdc77b303de52b16d
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Feb 14 12:17:31 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Feb 14 12:29:00 2017 -0500

    Fix Reason NameError using constants included in MiqServer

    Due to module inclusion spaghetti, it's easier and less confusing to
    reference the Reason constants consistently in the MiqServer class,
    which is the ultimate destination for all of these modules.

    Fixes #13901 introduced in #13805

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 4 ++--
 app/models/miq_server/worker_management/monitor/validation.rb | 2 +-
 spec/models/miq_server/worker_monitor_spec.rb | 6 ++++++
 3 files changed, 9 insertions(+), 3 deletions(-)
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/97aa6571357239d9db9bc50012a77c21773687a7

commit 97aa6571357239d9db9bc50012a77c21773687a7
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Mon Feb 13 22:12:59 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:32:16 2017 -0500

    Merge pull request #13805 from jrafanie/stopping_worker

    Kill workers that don't stop after a configurable time

    (cherry picked from commit 97648709a719ce59679ef0d4833ff7eb16c6d6d3)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 8 +++--
 .../miq_server/worker_management/monitor/reason.rb | 3 ++
 .../worker_management/monitor/validation.rb | 2 +-
 app/models/miq_worker.rb | 8 +++++
 config/settings.yml | 1 +
 .../miq_server/worker_management/monitor_spec.rb | 34 ++++++++++++++++++++++
 spec/models/miq_worker_spec.rb | 21 +++++++++++++
 7 files changed, 74 insertions(+), 3 deletions(-)
 create mode 100644 spec/models/miq_server/worker_management/monitor_spec.rb
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/fbb362598910958de9cd48a74392e2fdf1c9ace6

commit fbb362598910958de9cd48a74392e2fdf1c9ace6
Author:     Chris Arcand <chrisarcand.github.com>
AuthorDate: Tue Feb 14 12:13:38 2017 -0600
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:32:56 2017 -0500

    Merge pull request #13919 from jrafanie/fix_missing_reason_contants

    Fix missing reason contants

    (cherry picked from commit 89dfba8b95059d7349ebca79f84f7f277928a080)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 4 +-
 .../worker_management/monitor/validation.rb | 2 +-
 spec/models/miq_server/worker_monitor_spec.rb | 78 +++++++++++++---------
 3 files changed, 48 insertions(+), 36 deletions(-)
*** Bug 1430106 has been marked as a duplicate of this bug. ***
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/1d603c9c0a8ca39cc9d27f63db39c1f196698606

commit 1d603c9c0a8ca39cc9d27f63db39c1f196698606
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Mon Feb 13 22:12:59 2017 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:35:09 2017 -0500

    Merge pull request #13805 from jrafanie/stopping_worker

    Kill workers that don't stop after a configurable time

    (cherry picked from commit 97648709a719ce59679ef0d4833ff7eb16c6d6d3)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 8 +++--
 .../miq_server/worker_management/monitor/reason.rb | 3 ++
 .../worker_management/monitor/validation.rb | 2 +-
 app/models/miq_worker.rb | 8 +++++
 config/settings.yml | 1 +
 .../miq_server/worker_management/monitor_spec.rb | 34 ++++++++++++++++++++++
 spec/models/miq_worker_spec.rb | 21 +++++++++++++
 7 files changed, 74 insertions(+), 3 deletions(-)
 create mode 100644 spec/models/miq_server/worker_management/monitor_spec.rb
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/2b5d162a01afffb5e8a812913fffa72783e7507e

commit 2b5d162a01afffb5e8a812913fffa72783e7507e
Author:     Chris Arcand <chrisarcand.github.com>
AuthorDate: Tue Feb 14 12:13:38 2017 -0600
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Feb 16 13:35:10 2017 -0500

    Merge pull request #13919 from jrafanie/fix_missing_reason_contants

    Fix missing reason contants

    (cherry picked from commit 89dfba8b95059d7349ebca79f84f7f277928a080)

    https://bugzilla.redhat.com/show_bug.cgi?id=1395736

 app/models/miq_server/worker_management/monitor.rb | 4 +-
 .../worker_management/monitor/validation.rb | 2 +-
 spec/models/miq_server/worker_monitor_spec.rb | 78 +++++++++++++---------
 3 files changed, 48 insertions(+), 36 deletions(-)