1272368 – Race condition on systemd-run --scope --slice=foo

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1272368 - Race condition on systemd-run --scope --slice=foo

Summary: Race condition on systemd-run --scope --slice=foo

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	systemd
Sub Component:
Version:	7.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	systemd-maint
QA Contact:	Branislav Blaškovič
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1154205 1259468 1283192 1283245 1289485
TreeView+	depends on / blocked

Reported:	2015-10-16 09:21 UTC by Petr Horáček
Modified:	2016-11-04 00:44 UTC (History)
CC List:	14 users (show)
Fixed In Version:	systemd-219-20.el7
Doc Type:	Bug Fix
Doc Text:	Cause: it might happen that by the time PID 1 adds our process to the scope unit the process might already have died, if the process is short-running (such as an invocation to /bin/true). Consequence: When systemd picked some recycled name for scope following error could appear: 'Failed to start transient scope unit: Unit run-XXXXX.scope already exists.' Fix: Synchronously wait until the scope unit we create is started. Result: It should work now.
Clone Of:
Clones:	1283192 (view as bug list)
Environment:
Last Closed:	2016-11-04 00:44:19 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:2216	0	normal	SHIPPED_LIVE	systemd bug fix and enhancement update	2016-11-03 13:24:51 UTC

Description Petr Horáček 2015-10-16 09:21:52 UTC

Description of problem:
Similar bug was discussed on [1] and [2]. Sometimes due to race condition, unit might show up as already existing, even though the process has already finished.

We face this problem on oVirt VDSM testing server, sometimes this happens:
$ /usr/sbin/ifdown enp1s0f1
$ /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup enp1s0f1
'Failed to start transient scope unit: Unit run-13034.scope already exists.'
When we wait a second and then retry, it's OK.

A fix for this problem was introduced in systemd v220 which is not available for EL7. Would it be possible to backport it?



Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2 Beta (Maipo)
Linux 3.10.0-322.el7.x86_64
systemd-219-18.el7.x86_64



How reproducible:
On our test server sometimes, but in mentioned bugs [1] [2] they were able to reproduce it always.




Steps to Reproduce [1]:
[root@axis-00408cc563e5 /mnt/flash/root]27929# cat stress
#!/bin/sh

systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
[root@axis-00408cc563e5 /mnt/flash/root]27929# ./stress
[root@axis-00408cc563e5 /mnt/flash/root]27929# Running as unit run-27947.scope.
Running as unit run-27946.scope.
Running as unit run-27945.scope.
Running as unit run-27948.scope.
Running as unit run-27952.scope.
Running as unit run-27950.scope.
Running as unit run-27951.scope.
Running as unit run-27949.scope.

[root@axis-00408cc563e5 /mnt/flash/root]27929# systemctl -t scope
UNIT            LOAD   ACTIVE SUB     DESCRIPTION
run-27945.scope loaded active running /bin/true
run-27946.scope loaded active running /bin/true
run-27947.scope loaded active running /bin/true
run-27948.scope loaded active running /bin/true
run-27949.scope loaded active running /bin/true
run-27950.scope loaded active running /bin/true
run-27951.scope loaded active running /bin/true
run-27952.scope loaded active running /bin/true

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

8 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
[root@axis-00408cc563e5 /mnt/flash/root]27929# systemctl status run-27945.scope
● run-27945.scope - /bin/true
   Loaded: loaded (/run/systemd/system/run-27945.scope; static)
  Drop-In: /run/systemd/system/run-27945.scope.d
           └─50-Description.conf
   Active: active (running) since Fri 2014-10-24 13:13:26 GMT; 15s ago




Steps to reproduce [2]:
[robryk@sharya-rana bin]$ systemctl --user daemon-reload
[robryk@sharya-rana bin]$ systemctl --version
systemd 226
+PAM -AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN
[robryk@sharya-rana bin]$ systemd-run --user --scope --unit=foobar /bin/sleep 5
Running scope as unit foobar.scope.
[robryk@sharya-rana bin]$ systemctl status --user foobar.scope
● foobar.scope - /bin/sleep 5
   Loaded: loaded (/run/user/1000/systemd/user/foobar.scope; static; vendor preset: enabled)
  Drop-In: /run/user/1000/systemd/user/foobar.scope.d
           └─50-Description.conf
   Active: active (running) since Wed 2015-09-23 02:51:56 CEST; 30s ago

Sep 23 02:51:56 sharya-rana systemd[763]: Started /bin/sleep 5.
[robryk@sharya-rana bin]$ systemd-run --user --scope --unit=foobar /bin/sleep 5
Failed to start transient scope unit: Unit foobar.scope already exists.



[1] https://bugs.freedesktop.org/show_bug.cgi?id=86520
[2] https://github.com/systemd/systemd/issues/1351

Comment 4 GenadiC 2015-10-21 05:21:05 UTC

Run with systemd from https://brewweb.devel.redhat.com/taskinfo?taskID=9969138 and it looks like the problem of bringing interface up is solved

Comment 6 Dan Kenigsberg 2015-10-22 07:30:33 UTC

Lukáš, would you explain why we cannot rush this fix into 0day errata of rhel-7.2.0?

If not part of 7.2.0, may we have it in a near z-stream?

Comment 8 Petr Horáček 2015-10-22 09:22:57 UTC

Note this mail from Lennart Poettering. This patch may not be 100% solution, but it's better than nothing:

On Thu, 15.10.15 13:25, Petr Horacek (phoracek) wrote:

> Hello,
>
> recently we encountered strange systemd problem on automated tests of
> networking
> part of oVirt VDSM project.
>
> Sometimes this happens:
> $ /usr/sbin/ifdown enp1s0f1
> $ /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup enp1s0f1
> Failed to start transient scope unit: Unit run-13034.scope already exists.
>
> systemd-run should create a new scope every time it's called, should not
> it? Could it be
> a racefull bug in systemd?

The code for this is actually really naive... the number is just the
PID of the caller, and there's no check at all to ensure it is
unique. PIDs overrun easily, hence this is not nice at all...

What's even worse: when you use -H or -M to invoke things remotely we
still pick the client side PID for the name....

I figure we should rework this to pick some sufficiently large random
token instead, so that this is unlikely to conflict without actually
having to check for conflicts.

In the meantime, you should be able to fix this by explicitly picking
a randomized name for the scope using --unit=. For example, consider
just adding --unit=`uuidgen` to your command line, and the clashes
should not happen anymore.

> I found recently added issue [1] which describes similar problem,
> but with --unit instead of --slice. Note that our machine which
> reproduced it has systemd older than v220.
>
> Is it possible, that this is the same case as described in [1] and
> therefore it should be
> fixed in systemd 220?
>
> Is it possible to backport [1]'s fix to EL7?

Well, there are still cases where we unable to clean up scope units
properly, because we don't get any notifications for them when they
run empty. But yeah the current upstream versions should be better
than older versions.

Lennart

--
Lennart Poettering, Red Hat

Comment 22 errata-xmlrpc 2016-11-04 00:44:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2216.html

Note You need to log in before you can comment on or make changes to this bug.