Bug 2152609
Summary: | REX task running during logrotate to foreman-proxy goes to suspended state forever | |
---|---|---|---
Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec>
Component: | Tasks Plugin | Assignee: | Adam Ruzicka <aruzicka>
Status: | CLOSED ERRATA | QA Contact: | Peter Ondrejka <pondrejk>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 6.11.2 | CC: | aruzicka, jalviso, pcreech
Target Milestone: | 6.13.0 | Keywords: | Triaged
Target Release: | Unused | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | foreman-proxy-3.5.0 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 2251037 | Environment: |
Last Closed: | 2023-05-03 13:23:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2251037 | |
Description
Pavel Moravec
2022-12-12 12:54:57 UTC
Some corrections made to the BZ description:

- logrotate must happen on foreman-proxy and not foreman-tasks
- the Actions::ProxyAction dynflow step is always suspended (even before the logrotate)

/var/log/foreman-proxy/proxy.log.1.gz (prior to the logrotate):

```
..
2022-12-12T15:25:02 80e95288 [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (5.65 ms)
2022-12-12T15:25:03 870b6f87 [I] Started GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status
2022-12-12T15:25:03 870b6f87 [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (5.01 ms)
2022-12-12T15:25:43 a648262a [I] Started GET /dynflow/tasks/count state=running
2022-12-12T15:25:43 a648262a [I] Finished GET /dynflow/tasks/count with 200 (4.31 ms)
2022-12-12T15:25:44 a648262a [I] Started POST /dynflow/tasks/launch
2022-12-12T15:25:44 a648262a [I] Finished POST /dynflow/tasks/launch with 200 (41.71 ms)
2022-12-12T15:25:45 94e9ccfb [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status
2022-12-12T15:25:45 94e9ccfb [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (8.05 ms)
2022-12-12T15:25:48 0c7f3c6f [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status
2022-12-12T15:25:48 0c7f3c6f [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (18.59 ms)
2022-12-12T15:25:49 0c5e4a9d [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status
2022-12-12T15:25:49 0c5e4a9d [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (14.16 ms)
2022-12-12T15:25:50 1fede890 [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status
2022-12-12T15:25:50 1fede890 [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (29.1 ms)
```

/var/log/foreman-proxy/proxy.log (after the logrotate that was triggered at 15:25:57):

```
2022-12-12T15:25:57 [I] Logging file reopened via USR1 signal
```
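The "Logging file reopened via USR1 signal" entry comes from the proxy's reopen-log-on-SIGUSR1 handling, which is what lets logrotate rotate the file without restarting the service. A minimal Python sketch of that pattern (the path and message here are illustrative; the actual proxy is Ruby code, not this):

```python
import os
import signal

LOG_PATH = "/tmp/demo-proxy.log"  # stand-in path, not the real proxy log

log = open(LOG_PATH, "w")

def reopen_log(signum, frame):
    # Close the old handle (which after a rotation points at the renamed
    # file) and reopen the path, so new writes land in the fresh file.
    global log
    log.close()
    log = open(LOG_PATH, "a")
    log.write("Logging file reopened via USR1 signal\n")
    log.flush()

signal.signal(signal.SIGUSR1, reopen_log)

# This is what logrotate's postrotate step triggers via systemctl kill:
os.kill(os.getpid(), signal.SIGUSR1)

print(open(LOG_PATH).read().strip())
# → Logging file reopened via USR1 signal
```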
```
2022-12-12T15:30:39 b993a41d [I] Started GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status
2022-12-12T15:30:39 b993a41d [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (7.05 ms)
2022-12-12T15:31:56 8d12af38 [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status
2022-12-12T15:31:56 8d12af38 [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (6.62 ms)
```

The cd04b990-7ae0-47d9-aa5d-1f7488a76717 task is *the* stuck task; 8221ba33- is another, independent one (that I was checking concurrently).

logrotate runs `systemctl kill --signal=SIGUSR1 foreman-proxy`. This instructs systemd to send SIGUSR1 to all processes within foreman-proxy.service's cgroup. If a REX job is running, that includes the ssh processes, which upon receiving this signal just terminate. The additional issue here is that the machinery inside smart_proxy_remote_execution_ssh fails to notice that this happened, leading to the job hanging; without that followup issue, the job would fail immediately. As a fix we should probably make systemd notify only the top-level process by adding `--kill-who main` to the systemctl line in /etc/logrotate.d/foreman-proxy.

Addendum to my previous comment: this is quite possibly a regression from when we still had net/ssh, and is (again, quite possibly) still unfixed. On 6.13, the followup issue might be fixed, so the job would fail, but the root cause should still be present there.

To confirm my previous comments: on 6.13 the followup issue is indeed fixed, so the job fails immediately, but the root cause is still present there.

The upstream PR was merged. It was only a packaging change, so we should be able to cherry-pick it into any release we want.
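The failure mode described above, SIGUSR1 delivered to every process in the unit's cgroup, can be reproduced in miniature: a main process that installs a SIGUSR1 handler survives the signal, while a child left with the default disposition (here a `sleep`, standing in for a REX job's ssh processes) is terminated by it. This is a hedged Python sketch, not the proxy's actual code:

```python
import os
import signal
import subprocess
import time

# The main proxy process handles SIGUSR1 (it reopens its log file);
# a simple flag stands in for that handler here.
handled = []
signal.signal(signal.SIGUSR1, lambda signum, frame: handled.append(signum))

# A child with no SIGUSR1 handler stands in for the ssh processes of a
# running REX job. The default disposition for SIGUSR1 is "terminate".
child = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)  # give the child a moment to start

# `systemctl kill --signal=SIGUSR1 foreman-proxy` signals every process
# in the unit's cgroup, i.e. both of these:
os.kill(os.getpid(), signal.SIGUSR1)  # main process: handler runs, survives
os.kill(child.pid, signal.SIGUSR1)    # child: terminated by the signal

print("main process handled SIGUSR1:", bool(handled))
print("child killed by signal:", child.wait() < 0)  # negative = killed
```

With `--kill-who main`, only the first `os.kill` would happen, which is exactly the proposed fix.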
Verified on Satellite 6.12 snap 5: tasks no longer get stuck in the aforementioned scenario; jobs terminate with "Actions::ProxyAction::ProxyActionMissing: Capsule task gone missing from the capsule".

Correction of comment 7: the fix was verified on Satellite 6.13.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097
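The shipped fix amounts to the one-flag logrotate change proposed earlier in this bug. A sketch of how the postrotate step in /etc/logrotate.d/foreman-proxy looks with `--kill-who main` applied (the surrounding directives are illustrative and may not match the packaged file exactly):

```
/var/log/foreman-proxy/*.log {
    missingok
    notifempty
    sharedscripts
    postrotate
        # --kill-who main signals only the unit's main process, so the
        # ssh children of running REX jobs are left untouched
        /usr/bin/systemctl kill --kill-who main --signal=SIGUSR1 foreman-proxy
    endscript
}
```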