Bug 2117087 - systemctl is-system-running sometimes returns exit code 15 in 1minutetip
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: tuned
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jaroslav Škarvada
QA Contact: Robin Hack
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-09 23:22 UTC by Jaroslav Škarvada
Modified: 2023-05-05 09:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-18 11:25:39 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System: Red Hat Issue Tracker
ID: RHELPLAN-130777
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2022-08-09 23:28:54 UTC

Description Jaroslav Škarvada 2022-08-09 23:22:33 UTC
Description of problem:
We run 'systemctl is-system-running' in TuneD when TuneD shuts down:
https://github.com/redhat-performance/tuned/blob/master/tuned/daemon/daemon.py#L181

When running in 1minutetip, the 'systemctl is-system-running' command sometimes returns exit code 15 and produces no output. This is not reproducible on RHEL-8.7.

Version-Release number of selected component (if applicable):
systemd-250-7.el9.x86_64

How reproducible:
Sometimes

Steps to Reproduce:
1. from systemd service which is shutting down, execute 'systemctl is-system-running'
2. check exitcode

Actual results:
Exit code 15

Expected results:
Exit code 0

Additional info:
It looks like a race condition.

Comment 1 David Tardon 2022-08-16 08:57:39 UTC
(In reply to Jaroslav Škarvada from comment #0)
> Steps to Reproduce:
> 1. from systemd service which is shutting down, execute 'systemctl
> is-system-running'

Do you mean "service which is being stopped"? Or does it only happen if the service is being stopped _at shutdown_?

> Actual results:
> Exit code 15

That's weird. If I understand the code correctly `systemctl is-system-running` cannot return anything except 0 and 1...

Comment 2 David Tardon 2022-08-16 09:12:25 UTC
(In reply to David Tardon from comment #1)
> > Actual results:
> > Exit code 15
> 
> That's weird. If I understand the code correctly `systemctl
> is-system-running` cannot return anything except 0 and 1...

Maybe this is not an exit code but a signal number, i.e., the systemctl process has been terminated by SIGTERM (by systemd, as it cleans up tuned.service's cgroup). That would make sense and it would also explain why it only happens sometimes.

Comment 3 Jaroslav Škarvada 2022-08-17 07:23:19 UTC
(In reply to David Tardon from comment #1)
> (In reply to Jaroslav Škarvada from comment #0)
> > Steps to Reproduce:
> > 1. from systemd service which is shutting down, execute 'systemctl
> > is-system-running'
> 
> Do you mean "service which is being stopped"? Or does it only happen if the
> service is being stopped _at shutdown_?
> 
Yes, I mean a service that is being stopped. I haven't tried it at machine shutdown.

Comment 4 Jaroslav Škarvada 2022-08-17 07:30:57 UTC
(In reply to David Tardon from comment #2)
> (In reply to David Tardon from comment #1)
> > > Actual results:
> > > Exit code 15
> > 
> > That's weird. If I understand the code correctly `systemctl
> > is-system-running` cannot return anything except 0 and 1...
> 
> Maybe this is not an exit code but a signal number, i.e., the systemctl
> process has been terminated by SIGTERM (by systemd, as it cleans up
> tuned.service's cgroup). That would make sense and it would also explain why
> it only happens sometimes.

Yes, you are right. Sorry for the noise. From the Python documentation for Popen.returncode [1]:
> A negative value -N indicates that the child was terminated by signal N (POSIX only).

The return code that TuneD sees is literally -15, which means SIGTERM. I will update the TuneD code to cope with it.

The remaining question is why is the TuneD service receiving the second SIGTERM? That is, the first SIGTERM triggers the shutdown code on line 181 of daemon.py, which is quick: it takes no more than one or two seconds. So why does it receive another SIGTERM while it is in the 'popen/exec' call that runs 'systemctl is-system-running'? I wasn't able to reproduce this on RHEL-8.7, but on RHEL-9.1 it happens quite often.

[1] https://docs.python.org/3/library/subprocess.html#subprocess.Popen.returncode
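To illustrate the returncode behaviour quoted above, here is a minimal sketch (not the TuneD code). It starts a placeholder child process, terminates it with SIGTERM, and decodes the negative return code. The helper name describe_returncode and the sleeping child are assumptions made for the example only; it applies to POSIX systems.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Translate a Popen.returncode value into a readable description."""
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


# Placeholder child that would otherwise sleep for 30 seconds.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
child.send_signal(signal.SIGTERM)  # stands in for the signal sent by systemd
child.wait()

print(child.returncode, describe_returncode(child.returncode))
```

A caller that only logs the absolute value of the return code would report "15" here, which is how a signal number can be mistaken for an exit code.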

Comment 5 David Tardon 2022-08-17 12:11:49 UTC
(In reply to Jaroslav Škarvada from comment #4)
> The remaining question is why is the TuneD service receiving the second
> SIGTERM?

That SIGTERM is received not by the TuneD service, but by the forked systemctl. systemd sends the signal to _all_ processes in the cgroup, even to newly forked ones.

> I wasn't able to reproduce this on RHEL-8.7,
> but on RHEL-9.1 it happens quite often.

I guess it's a difference between cgroups v1 and v2.
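When comparing the two releases, it may help to check which cgroup hierarchy a machine actually uses. The sketch below relies on the fact that a cgroup v2 (unified) mount exposes a cgroup.controllers file at its root. The function name cgroup_version is an assumption for this example, and hybrid setups are reported as 1. It only identifies the hierarchy; it does not confirm that the hierarchy explains the different behaviour.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 if `root` is a cgroup v2 (unified) mount, otherwise 1."""
    # Only the root of a cgroup v2 hierarchy contains cgroup.controllers.
    if os.path.exists(os.path.join(root, "cgroup.controllers")):
        return 2
    return 1


print("cgroup version:", cgroup_version())
```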

Comment 6 Jaroslav Škarvada 2022-08-18 11:25:39 UTC
(In reply to David Tardon from comment #5)
> (In reply to Jaroslav Škarvada from comment #4)
> > The remaining question is why is the TuneD service receiving the second
> > SIGTERM?
> 
> That SIGTERM is received not by the TuneD service, but by the forked
> systemctl. systemd sends the signal to _all_ processes in the cgroup, even
> to newly forked ones.
> 
> > I wasn't able to reproduce this on RHEL-8.7,
> > but on RHEL-9.1 it happens quite often.
> 
> I guess it's a difference between cgroups v1 and v2.

Thanks for info, it now makes sense.

Comment 7 Jaroslav Škarvada 2022-08-18 11:28:15 UTC
Maybe one more question, how to reliably detect whether the TuneD service is shut down by systemctl or by system shutdown/reboot? In the latter case we need to do full rollback in TuneD. I guess that in this case just calling the 'systemctl' command again from the TuneD wouldn't help.

I also don't understand why it's reproducible sometimes.

Comment 8 Jaroslav Škarvada 2022-08-18 11:29:48 UTC
We need to handle it somehow, thus reopening and moving to TuneD.

Comment 10 David Tardon 2022-08-23 07:50:35 UTC
(In reply to Jaroslav Škarvada from comment #7)
> Maybe one more question, how to reliably detect whether the TuneD service is
> shut down by systemctl or by system shutdown/reboot? 

The way you already do it. If the system is shutting down, `systemctl is-system-running` returns "stopping" (this is also available via D-Bus as org.freedesktop.systemd1.Manager#SystemState property on /org/freedesktop/systemd1).

> I also don't understand why it's reproducible sometimes.

Well, there's a race there between systemd reacting to a new process in the cgroup and the process finishing.

I can think of several ways to avoid the issue:
1. Use D-Bus API instead of calling systemctl.
2. Implement some stop signalling mechanism and use it in ExecStop=. Note that this must be synchronous, i.e., the ExecStop= command must wait till the daemon finishes stopping, as the killing of the cgroup processes will start just after the ExecStop= command finishes. (This method is discouraged in general, but it might be a suitable choice here.)
3. If the rollback can be done independently of the tuned daemon (i.e., the states/changes are somehow serialized on disk), do it from a separate script in ExecStopPost=. (This does have the additional advantage that the rollback will be done even if tuned crashes or is killed.)
4. Use KillMode=process. (This is not recommended, but possible, if you're absolutely sure that tuned doesn't leave any stray processes around. In any case, it can be used as a temporary workaround.)
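Option 1 could look roughly like the following sketch. It assumes the third-party dbus-python bindings are installed and reads the SystemState property mentioned above, so no child process is forked into the service's cgroup. The function names read_system_state and needs_full_rollback are hypothetical and are not part of TuneD.

```python
def read_system_state():
    """Read systemd's SystemState property over the system D-Bus."""
    import dbus  # third-party dbus-python bindings (assumed to be installed)

    bus = dbus.SystemBus()
    manager = bus.get_object("org.freedesktop.systemd1",
                             "/org/freedesktop/systemd1")
    props = dbus.Interface(manager, "org.freedesktop.DBus.Properties")
    return str(props.Get("org.freedesktop.systemd1.Manager", "SystemState"))


def needs_full_rollback(system_state):
    """Return True when systemd reports that the system is shutting down."""
    return system_state == "stopping"
```

In the daemon's stop path, this would be called as needs_full_rollback(read_system_state()). Error handling for an unreachable bus is left out of the sketch.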

