Bug 2152609

Summary: REX task running during logrotate to foreman-proxy goes to suspended state forever
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Tasks Plugin
Assignee: Adam Ruzicka <aruzicka>
Status: CLOSED ERRATA
QA Contact: Peter Ondrejka <pondrejk>
Severity: high
Docs Contact:
Priority: medium
Version: 6.11.2
CC: aruzicka, jalviso, pcreech
Target Milestone: 6.13.0
Keywords: Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: foreman-proxy-3.5.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2251037 (view as bug list)
Environment:
Last Closed: 2023-05-03 13:23:38 UTC
Type: Bug
Bug Blocks: 2251037    

Description Pavel Moravec 2022-12-12 12:54:57 UTC
Description of problem:
When logrotate is applied to foreman-proxy, any currently running (REX) tasks get stuck forever. They remain in running/pending state, and the previously running dynflow step stays suspended.


Version-Release number of selected component (if applicable):
Sat 6.11


How reproducible:
100%


Steps to Reproduce:
1. Invoke a REX job, e.g. to execute the "sleep 60" command
2. Check that the foreman-task is really running
3. Run "logrotate -f /etc/logrotate.d/foreman-proxy" to forcefully run logrotate on foreman-proxy
4. Check the foreman-task again, now or after 5 minutes
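
For reference, a rough shell sketch of the reproducer (hammer must be configured; the job template and host names below are placeholders and the exact template name may differ per Satellite version):

# 1. invoke a REX job running "sleep 60" (template/host are placeholders)
hammer job-invocation create \
  --job-template "Run Command - SSH Default" \
  --inputs command="sleep 60" \
  --search-query "name = client.example.com"

# 2. confirm the corresponding foreman-task is really running
hammer task list --search "state = running"

# 3. force logrotate on foreman-proxy
logrotate -f /etc/logrotate.d/foreman-proxy

# 4. check the task again now, and once more after ~5 minutes
hammer task list --search "state = running"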


Actual results:
4. The task stays in running/pending state forever, with the Actions::ProxyAction dynflow step remaining suspended


Expected results:
The task resumes (or at least fails) one way or another, instead of hanging forever.


Additional info:

Comment 1 Pavel Moravec 2022-12-12 14:36:42 UTC
Some corrections made to the BZ description:
- logrotate must happen on foreman-proxy and not foreman-tasks
- Actions::ProxyAction dynflow step is always suspended (even before the logrotate)


/var/log/foreman-proxy/proxy.log.1.gz (prior to the logrotate):

..
2022-12-12T15:25:02 80e95288 [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (5.65 ms)
2022-12-12T15:25:03 870b6f87 [I] Started GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status 
2022-12-12T15:25:03 870b6f87 [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (5.01 ms)
2022-12-12T15:25:43 a648262a [I] Started GET /dynflow/tasks/count state=running
2022-12-12T15:25:43 a648262a [I] Finished GET /dynflow/tasks/count with 200 (4.31 ms)
2022-12-12T15:25:44 a648262a [I] Started POST /dynflow/tasks/launch 
2022-12-12T15:25:44 a648262a [I] Finished POST /dynflow/tasks/launch with 200 (41.71 ms)
2022-12-12T15:25:45 94e9ccfb [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:45 94e9ccfb [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (8.05 ms)
2022-12-12T15:25:48 0c7f3c6f [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:48 0c7f3c6f [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (18.59 ms)
2022-12-12T15:25:49 0c5e4a9d [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:49 0c5e4a9d [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (14.16 ms)
2022-12-12T15:25:50 1fede890 [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:50 1fede890 [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (29.1 ms)

/var/log/foreman-proxy/proxy.log (after the logrotate that was triggered at 15:25:57):
2022-12-12T15:25:57  [I] Logging file reopened via USR1 signal
2022-12-12T15:30:39 b993a41d [I] Started GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status 
2022-12-12T15:30:39 b993a41d [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (7.05 ms)
2022-12-12T15:31:56 8d12af38 [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:31:56 8d12af38 [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (6.62 ms)

The ..-1f7488a76717 task is *the* stuck one; 8221ba33-.. is another, independent task (that I was checking concurrently)

Comment 2 Adam Ruzicka 2022-12-12 15:54:18 UTC
logrotate does `systemctl kill --signal=SIGUSR1 foreman-proxy`. This instructs systemd to send SIGUSR1 to all processes within the foreman-proxy.service cgroup. If a REX job is running, those processes include the ssh processes, which simply terminate upon receiving this signal. The additional issue here is that the machinery inside smart_proxy_remote_execution_ssh fails to notice that this happened, leaving the job hanging.
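
For illustration, the scope of that signal can be seen by looking at the unit's cgroup while a job is running (service name as shipped by the packaging):

# the CGroup: section lists the main ruby process plus any ssh child processes
systemctl status foreman-proxy

# what logrotate currently does - every process in that cgroup gets SIGUSR1,
# including the ssh processes of any in-flight REX job:
systemctl kill --signal=SIGUSR1 foreman-proxy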

If only that follow-up issue were fixed, the job would fail immediately instead of hanging.

As a fix, we should probably make systemd notify only the top-level process by adding `--kill-who main` to the systemctl line in /etc/logrotate.d/foreman-proxy
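
A minimal sketch of what the postrotate hook in /etc/logrotate.d/foreman-proxy could look like with that change; the rotation options shown are illustrative, only the --kill-who=main addition is the point:

/var/log/foreman-proxy/*.log {
    missingok
    notifempty
    compress
    postrotate
        # signal only the main foreman-proxy process, not its ssh children
        /usr/bin/systemctl kill --kill-who=main --signal=SIGUSR1 foreman-proxy 2>/dev/null || true
    endscript
}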

Comment 3 Adam Ruzicka 2022-12-12 15:59:51 UTC
Addendum to my previous comment:
Also, this is quite possibly a regression from when we still had net/ssh, and it is (again, quite possibly) still unfixed. On 6.13, the follow-up issue might be fixed, so the job would fail, but the root cause should still be present there.

Comment 4 Adam Ruzicka 2022-12-13 08:39:52 UTC
To confirm my previous comments: on 6.13 the follow-up issue is indeed fixed, so the job fails immediately, but the root cause is still present there.

Comment 5 Adam Ruzicka 2022-12-14 09:24:15 UTC
The upstream PR was merged. It was only a packaging change, so we should be able to cherry-pick it into any release we want.

Comment 7 Peter Ondrejka 2023-01-09 13:33:11 UTC
Verified on Satellite 6.12 snap 5: tasks no longer get stuck in the aforementioned scenario; jobs terminate with
"Actions::ProxyAction::ProxyActionMissing: Capsule task gone missing from the capsule"

Comment 8 Peter Ondrejka 2023-01-09 13:34:36 UTC
Correction to comment 7: the fix was verified on Satellite 6.13.

Comment 11 errata-xmlrpc 2023-05-03 13:23:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097