Bug 1652056
Summary: | Enhance resiliency mechanism to avoid memory recycler leading to tasks paused with 'Abnormal termination (previous state: running)' error | ||
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Ivan Necas <inecas> |
Component: | Tasks Plugin | Assignee: | satellite6-bugs <satellite6-bugs> |
Status: | CLOSED WONTFIX | QA Contact: | Peter Ondrejka <pondrejk> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 6.3.4 | CC: | ahumbe, aruzicka, bkearney, cmarinea, hyu, inecas, mmccune, vsedmik |
Target Milestone: | 6.8.0 | ||
Target Release: | Unused | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-09-08 15:00:58 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Ivan Necas
2018-11-21 13:42:35 UTC
Created redmine issue http://projects.theforeman.org/issues/25528 from this bug.

The original memory recycler was introduced in foreman-tasks-0.9.2, which first landed in Satellite 6.3. It was described in the tuning guide up to and including 6.6, but was removed from the tuning guide in 6.7, so whether it was supported in 6.7 or even 6.8 is rather questionable.

Now with 6.8, the memory recycler is gone and therefore cannot cause tasks to get paused with abnormal termination errors. If the workers grow too much and someone wants to reclaim the memory, the workers can be restarted manually relatively safely with

    systemctl restart dynflow-sidekiq@$worker

where $worker is the worker instance id. The workers should be able to deal with being restarted this way without any impact on the jobs.

If we want to be super safe, it would be better to do:

    systemctl kill --signal TSTP dynflow-sidekiq@$worker
    while ! systemctl status dynflow-sidekiq@$worker | grep -Po '\[0 of \d+ busy\]'; do
        sleep 5
    done
    systemctl restart dynflow-sidekiq@$worker

The first systemctl command tells the worker not to accept new jobs, the loop then waits until the worker is not doing anything, and the last command restarts the service.

If the workers are killed hard (kill -9), they should be able to recover from that, but on a best-effort basis. The harder you kill the workers, the harder the recovery becomes: it may just work, it may take time, or it may take time and then fail the job. Here be dragons.

I put together a writeup[1] on how a memory recycler *could* be implemented using systemd from 6.8 onwards.
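The drain-then-restart procedure above can be sketched as a small script. The `worker_is_idle` helper and the sample status strings below are illustrative additions; only the systemctl commands and the grep pattern come from the comment itself:

```shell
# Sketch of the drain-then-restart procedure described above. On a live
# Satellite host (as root) the steps would be:
#
#   systemctl kill --signal TSTP dynflow-sidekiq@$worker
#   while ! systemctl status dynflow-sidekiq@$worker | grep -Po '\[0 of \d+ busy\]'; do sleep 5; done
#   systemctl restart dynflow-sidekiq@$worker
#
# The idle check is the same pattern, factored into a helper so it can be
# demonstrated against captured status output instead of a live service
# (the sample strings below are illustrative, not real systemctl output).

worker_is_idle() {
  # Succeeds when the given status text reports zero busy threads,
  # e.g. "[0 of 5 busy]" in the sidekiq process title.
  printf '%s\n' "$1" | grep -q -P '\[0 of [0-9]+ busy\]'
}

worker_is_idle 'sidekiq 5.2 dynflow [2 of 5 busy]' && echo idle || echo busy  # prints "busy"
worker_is_idle 'sidekiq 5.2 dynflow [0 of 5 busy]' && echo idle || echo busy  # prints "idle"
```

TSTP only stops the worker from picking up new jobs; jobs already running are allowed to finish before the restart, which is what makes this variant safer than a plain restart.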
Please read it carefully, as there are a few catches, mostly regarding how systemd kills services when the memory limit is reached and how the recovery is handled. The tl;dr is:

- it can be done
- the services will get killed hard
- with dynflow <= 1.4.6, a patch[2] needs to be applied, otherwise the recovery will not be successful
- even with that patch, the recovery is best-effort only

[1] https://gist.github.com/adamruzicka/8abb3c65aa8ff1c84c1b81599a6d42b0
[2] https://github.com/Dynflow/dynflow/pull/360

@Ashish, from your point of view, was this considered supported, given it wasn't mentioned anywhere in the docs starting with 6.7?

Not sure what the right state for this should be; removing the triaged keyword to let the triage team decide what to do with this bz.

Thanks for the confirmation. The memory recycler is gone; there will be a KCS article on how to achieve similar behavior, even though it's not generally recommended. Closing this now. Please reopen if I missed something.
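For reference, the systemd-based approach from the writeup boils down to putting a memory cap on the worker units. A minimal sketch of such a drop-in is below; the 2G limit is a hypothetical example value, not a recommendation, and hitting the limit kills the worker hard, so the best-effort recovery caveats above apply:

```ini
# /etc/systemd/system/dynflow-sidekiq@.service.d/memory-limit.conf
# Hypothetical drop-in sketching the systemd memory recycler from the
# writeup. Apply with `systemctl daemon-reload` and a worker restart.
[Service]
# cgroup v2 memory cap; exceeding it OOM-kills the worker (a hard kill).
MemoryMax=2G
# Bring the worker back up after it has been killed.
Restart=always
```

With dynflow <= 1.4.6, the patch[2] mentioned above would also be needed for the worker to recover its jobs after such a kill.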