1652056 – Enhance resiliency mechanism to avoid memory recycler leading to tasks paused with 'Abnormal termination (previous state: running)' error

Bug 1652056 - Enhance resiliency mechanism to avoid memory recycler leading to tasks paused with 'Abnormal termination (previous state: running)' error

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Tasks Plugin
Sub Component:
Version:	6.3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	6.8.0
Assignee:	satellite6-bugs
QA Contact:	Peter Ondrejka
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-11-21 13:42 UTC by Ivan Necas
Modified:	2024-12-20 18:47 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-09-08 15:00:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Foreman Issue Tracker	25528	0	Normal	New	Enhance resiliency mechanism to avoid memory recycler leading to tasks paused with 'Abnormal termination (previous state...	2021-01-07 16:14:50 UTC
Red Hat Bugzilla	1652060	0	unspecified	CLOSED	Singleton actions may not start after unclean shutdown	2021-12-10 18:17:23 UTC

Internal Links: 1652060

Description Ivan Necas 2018-11-21 13:42:35 UTC

Description of problem:
With the memory recycler enabled, tasks get interrupted during execution
more often. For the sake of keeping the recycling process transparent, we should
try to handle this situation better so that the user doesn't have to deal with
the error explicitly.

Version-Release number of selected component (if applicable):
6.3.0

How reproducible:
Occasionally

Steps to Reproduce:
1. Set a memory limit in /etc/sysconfig/foreman-tasks (EXECUTOR_MEMORY_LIMIT=2gb; to reproduce more easily, one might decrease
EXECUTOR_MEMORY_MONITOR_DELAY so that restarts happen more often)
2. Restart foreman-tasks
3. Start using Satellite in a larger environment (continuous registration of hosts + content view publishes in combination with multiple capsules)
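
For illustration, the settings from step 1 might look like the fragment below. This is a sketch only: the 2gb limit comes from the report, while the delay value of 600 and the assumption that it is expressed in seconds are not stated in this bug.

```shell
# /etc/sysconfig/foreman-tasks (fragment)
# Memory limit from the reproduction steps above.
EXECUTOR_MEMORY_LIMIT=2gb
# Illustrative value, assumed to be in seconds; lowering it makes the
# recycler start checking, and therefore restarting, sooner.
EXECUTOR_MEMORY_MONITOR_DELAY=600
```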

Actual results:

After some time, some tasks can end up in a paused/error state with `Abnormal termination (previous state: running)`


Expected results:
We should analyse these cases and find a way to resume the affected tasks before
requiring the user to intervene manually.

Additional info:

We will try to find a more reliable reproducer as we develop the fix for this issue.

Comment 1 Adam Ruzicka 2018-11-21 13:57:53 UTC

Created redmine issue http://projects.theforeman.org/issues/25528 from this bug

Comment 17 Adam Ruzicka 2020-07-30 09:59:34 UTC

The original memory recycler was introduced in foreman-tasks-0.9.2 which first landed in Satellite 6.3. It was described in the tuning guide until 6.6 inclusive, but it was removed from the tuning guide in 6.7 so whether it was supported in 6.7 or even 6.8 is rather questionable. Now with 6.8, the memory recycler is gone and therefore cannot cause tasks to get paused with abnormal termination errors.

If the workers grow too much and someone wants to reclaim the memory, the workers can be relatively safely restarted manually with systemctl restart dynflow-sidekiq@$worker, where $worker is the worker instance id. The workers should be able to deal with being restarted this way without any impact on the jobs.

If we want to be super safe, it would be better to do:
systemctl kill --signal TSTP dynflow-sidekiq@$worker
while ! systemctl status dynflow-sidekiq@$worker | grep -Po '\[0 of \d+ busy\]'; do sleep 5; done
systemctl restart dynflow-sidekiq@$worker

The first systemctl command tells the worker not to accept new jobs, the loop then waits until the worker is no longer processing anything, and the last command restarts the service.
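
The same three steps could be wrapped into a small function with a timeout, so that a worker that never becomes idle does not block the loop forever. This is only a sketch: the function name and the SYSTEMCTL, POLL_INTERVAL and MAX_WAIT variables are made up for this example, and it assumes the status output contains "[N of M busy]" as in the commands above.

```shell
#!/bin/sh
# Sketch: quiet one dynflow-sidekiq worker, wait until it is idle, restart it.
# SYSTEMCTL, POLL_INTERVAL and MAX_WAIT are hypothetical knobs for this example.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
POLL_INTERVAL="${POLL_INTERVAL:-5}"   # seconds between status checks
MAX_WAIT="${MAX_WAIT:-600}"           # give up after this many seconds

drain_and_restart() {
    worker="$1"
    unit="dynflow-sidekiq@${worker}"
    waited=0

    # TSTP tells the worker to stop accepting new jobs.
    "$SYSTEMCTL" kill --signal TSTP "$unit" || return 1

    # Poll until the status line reports "[0 of N busy]".
    # The closing "]" is literal outside a bracket expression.
    while ! "$SYSTEMCTL" status "$unit" | grep -Eq '\[0 of [0-9]+ busy]'; do
        if [ "$waited" -ge "$MAX_WAIT" ]; then
            echo "timed out waiting for $unit to become idle" >&2
            return 2
        fi
        sleep "$POLL_INTERVAL"
        waited=$((waited + POLL_INTERVAL))
    done

    "$SYSTEMCTL" restart "$unit"
}

# Usage: drain_and_restart "$worker"
```

Note that on a timeout the worker is left quieted and is not restarted, so someone has to decide whether to restart it anyway.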

If the workers are killed hard (kill -9), the workers should be able to recover from that, but it will be on a best-effort basis. The harder you kill the workers, the harder the recovery becomes. It may just work, it may take time or it may take time and then fail the job. Here be dragons.

I put together a writeup[1] on how a memory recycler *could* be implemented using systemd from 6.8 onwards. Please read it carefully, as there are a few catches, mostly regarding how systemd kills services when the memory limit is reached and how the recovery is handled, but the tl;dr is:
- it can be done
- the services will get killed hard
- with dynflow <= 1.4.6, a patch[2] needs to be applied otherwise the recovery will not be successful
- even with the patch from the previous line, the recovery is best-effort only

[1] - https://gist.github.com/adamruzicka/8abb3c65aa8ff1c84c1b81599a6d42b0
[2] - https://github.com/Dynflow/dynflow/pull/360
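
As a rough illustration of the general idea (not the content of the writeup[1], which should be followed instead), a systemd memory cap is usually set through a drop-in for the unit. In the sketch below the function name, the drop-in file name, the 2G value and the choice of Restart=on-failure are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that caps memory for every
# dynflow-sidekiq@ instance. File name and default limit are illustrative.
write_memory_dropin() {
    limit="${1:-2G}"
    dir="${2:-/etc/systemd/system/dynflow-sidekiq@.service.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/memory-limit.conf" <<EOF
[Service]
# MemoryLimit= is the cgroup v1 directive; cgroup v2 hosts use MemoryMax=.
MemoryLimit=$limit
# Start the worker again after it is killed at the limit.
Restart=on-failure
EOF
}

# Usage (as root), then reload systemd so the drop-in is picked up:
#   write_memory_dropin 2G && systemctl daemon-reload
```

A worker that hits the cap is killed hard, so the caveats from the list above (the dynflow patch and best-effort recovery) still apply.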

@Ashish from your point of view, was this considered supported given it wasn't mentioned anywhere in the docs starting with 6.7?

Not sure what the right state for this should be, so I am removing the triaged keyword to let the triage team decide what to do with this bz.

Comment 21 Marek Hulan 2020-09-08 15:00:58 UTC

Thanks for the confirmation. The memory recycler is gone; there will be a KCS article on how to achieve similar behavior, even though it's not generally recommended. Closing this now. Please reopen if I missed something.