Bug 2152609 - REX task running during logrotate to foreman-proxy goes to suspended state forever
Summary: REX task running during logrotate to foreman-proxy goes to suspended state forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Tasks Plugin
Version: 6.11.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: 6.13.0
Assignee: Adam Ruzicka
QA Contact: Peter Ondrejka
URL:
Whiteboard:
Depends On:
Blocks: 2251037
 
Reported: 2022-12-12 12:54 UTC by Pavel Moravec
Modified: 2023-11-22 14:02 UTC
CC List: 3 users

Fixed In Version: foreman-proxy-3.5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2251037
Environment:
Last Closed: 2023-05-03 13:23:38 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
- Foreman Issue Tracker 35859 (Closed): foreman-proxy logrotate sends signal to all processes under foreman-proxy.service (last updated 2022-12-14 09:24:15 UTC)
- GitHub theforeman/foreman-packaging pull 8835 (open): Send signal to only main process foreman-proxy on logrotate (last updated 2022-12-13 09:00:39 UTC)
- Red Hat Issue Tracker SAT-14679 (last updated 2023-01-05 00:49:48 UTC)
- Red Hat Product Errata RHSA-2023:2097 (last updated 2023-05-03 13:24:22 UTC)

Description Pavel Moravec 2022-12-12 12:54:57 UTC
Description of problem:
When logrotate is applied to foreman-proxy, any currently running (REX) tasks get stuck forever. They remain in the running/pending state, and the previously running dynflow step stays suspended.


Version-Release number of selected component (if applicable):
Sat 6.11


How reproducible:
100%


Steps to Reproduce:
1. Invoke a REX job to e.g. execute "sleep 60" command
2. Check the foreman-task is really running
3. Run "logrotate -f /etc/logrotate.d/foreman-proxy" to forcefully execute logrotate on foreman-tasks
4. Check the foreman-task again now, or after 5 minutes


Actual results:
4. The task is in running/pending state forever, with the Actions::ProxyAction dynflow step being suspended forever


Expected results:
The task should be resumed (or failed) one way or another, not left suspended forever.


Additional info:
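A minimal shell sketch of the reproduction, run on the Satellite server itself; the job template name and the host search below are illustrative placeholders, not values taken from this report:

  # 1. Invoke a long-running REX job against a registered host
  hammer job-invocation create --job-template "Run Command - SSH Default" \
    --inputs command="sleep 60" --search-query "name = client.example.com"

  # 2. While the job is still running, force the rotation
  logrotate -f /etc/logrotate.d/foreman-proxy

  # 3. Check the foreman task again (e.g. Monitor > Tasks in the web UI); it stays
  #    in running/pending with the Actions::ProxyAction step suspended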

Comment 1 Pavel Moravec 2022-12-12 14:36:42 UTC
Some corrections to the BZ description:
- the logrotate must happen on foreman-proxy, not on foreman-tasks
- the Actions::ProxyAction dynflow step is always suspended (even before the logrotate)


/var/log/foreman-proxy/proxy.log.1.gz (prior to the logrotate):

..
2022-12-12T15:25:02 80e95288 [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (5.65 ms)
2022-12-12T15:25:03 870b6f87 [I] Started GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status 
2022-12-12T15:25:03 870b6f87 [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (5.01 ms)
2022-12-12T15:25:43 a648262a [I] Started GET /dynflow/tasks/count state=running
2022-12-12T15:25:43 a648262a [I] Finished GET /dynflow/tasks/count with 200 (4.31 ms)
2022-12-12T15:25:44 a648262a [I] Started POST /dynflow/tasks/launch 
2022-12-12T15:25:44 a648262a [I] Finished POST /dynflow/tasks/launch with 200 (41.71 ms)
2022-12-12T15:25:45 94e9ccfb [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:45 94e9ccfb [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (8.05 ms)
2022-12-12T15:25:48 0c7f3c6f [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:48 0c7f3c6f [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (18.59 ms)
2022-12-12T15:25:49 0c5e4a9d [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:49 0c5e4a9d [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (14.16 ms)
2022-12-12T15:25:50 1fede890 [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:25:50 1fede890 [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (29.1 ms)

/var/log/foreman-proxy/proxy.log (after the logrotate that was triggered at 15:25:57):
2022-12-12T15:25:57  [I] Logging file reopened via USR1 signal
2022-12-12T15:30:39 b993a41d [I] Started GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status 
2022-12-12T15:30:39 b993a41d [I] Finished GET /dynflow/tasks/8221ba33-2108-47d3-9436-d585a0a5895e/status with 200 (7.05 ms)
2022-12-12T15:31:56 8d12af38 [I] Started GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status 
2022-12-12T15:31:56 8d12af38 [I] Finished GET /dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status with 200 (6.62 ms)

The cd04b990-..-1f7488a76717 task is *the* stuck task; 8221ba33-.. is another, independent one (that I was checking concurrently).
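For reference, the proxy-side task status that the log lines above show being polled can also be queried directly from the Satellite host; a sketch, assuming the installer-default client certificate paths under /etc/foreman:

  curl --cert /etc/foreman/client_cert.pem \
       --key /etc/foreman/client_key.pem \
       --cacert /etc/foreman/proxy_ca.pem \
       https://$(hostname -f):9090/dynflow/tasks/cd04b990-7ae0-47d9-aa5d-1f7488a76717/status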

Comment 2 Adam Ruzicka 2022-12-12 15:54:18 UTC
logrotate runs `systemctl kill --signal=SIGUSR1 foreman-proxy`. This instructs systemd to send SIGUSR1 to all processes within the foreman-proxy.service cgroup. If a REX job is running, that includes the ssh processes, which simply terminate upon receiving the signal. The additional issue here is that the machinery inside smart_proxy_remote_execution_ssh fails to notice this happened, leading to the job hanging.
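The cgroup-wide signal delivery is easy to observe while a job is running; a sketch (unit contents and PIDs will of course differ per system):

  # list every process that `systemctl kill --signal=SIGUSR1 foreman-proxy` would signal
  systemd-cgls -u foreman-proxy.service

  # or look at the CGroup: section of
  systemctl status foreman-proxy

  # the ssh processes spawned for the running REX job show up there next to the
  # main smart-proxy ruby process, so they receive the SIGUSR1 as well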

If only that follow-up issue were fixed, the job would fail immediately instead of hanging.

As a fix we should probably make systemd signal only the top-level process by adding `--kill-who main` to the systemctl line in /etc/logrotate.d/foreman-proxy.
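A sketch of what the postrotate script in /etc/logrotate.d/foreman-proxy would look like with that change; the surrounding directives are illustrative, only the --kill-who addition matters, and the actual change is the linked foreman-packaging pull 8835:

  /var/log/foreman-proxy/*.log {
      missingok
      notifempty
      sharedscripts
      postrotate
          # signal only the unit's main process so it reopens its log file;
          # ssh children belonging to running REX jobs are left alone
          /usr/bin/systemctl kill --kill-who=main --signal=SIGUSR1 foreman-proxy >/dev/null 2>&1 || true
      endscript
  }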

Comment 3 Adam Ruzicka 2022-12-12 15:59:51 UTC
Addendum to my previous comment:
Also, this is quite possibly a regression from when we still had net/ssh, and it is (again, quite possibly) still unfixed. On 6.13 the follow-up issue might already be fixed, so the job would fail instead of hanging, but the root cause should still be present there.

Comment 4 Adam Ruzicka 2022-12-13 08:39:52 UTC
To confirm my previous comments: on 6.13 the follow-up issue is indeed fixed, so the job fails immediately, but the root cause is still present there.

Comment 5 Adam Ruzicka 2022-12-14 09:24:15 UTC
The upstream PR was merged. It was only a packaging change, so we should be able to cherry-pick it into any release we want.

Comment 7 Peter Ondrejka 2023-01-09 13:33:11 UTC
Verified on Satellite 6.12 snap 5: tasks no longer get stuck in the aforementioned scenario; jobs terminate with
"Actions::ProxyAction::ProxyActionMissing: Capsule task gone missing from the capsule"

Comment 8 Peter Ondrejka 2023-01-09 13:34:36 UTC
Correction of comment 7: the fix was verified on Satellite 6.13.

Comment 11 errata-xmlrpc 2023-05-03 13:23:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097

