Bug 1272368

Summary:	Race condition on systemd-run --scope --slice=foo
Product:	Red Hat Enterprise Linux 7	Reporter:	Petr Horáček <phoracek>
Component:	systemd	Assignee:	systemd-maint
Status:	CLOSED ERRATA	QA Contact:	Branislav Blaškovič <bblaskov>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	7.2	CC:	bblaskov, bmcclain, danken, gcheresh, gklein, jkurik, jscotka, lmiksik, lnykryn, snagar, systemd-maint-list, systemd-maint, tlavigne, ylavi
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	systemd-219-20.el7	Doc Type:	Bug Fix
Doc Text:	Cause: it might happen that by the time PID 1 adds our process to the scope unit the process might already have died, if the process is short-running (such as an invocation to /bin/true). Consequence: When systemd picked some recycled name for scope following error could appear: 'Failed to start transient scope unit: Unit run-XXXXX.scope already exists.' Fix: Synchronously wait until the scope unit we create is started. Result: It should work now.	Story Points:	---
Clone Of:
Clones:	1283192 (view as bug list)		Environment:
Last Closed:	2016-11-04 00:44:19 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1154205, 1259468, 1283192, 1283245, 1289485

Description Petr Horáček 2015-10-16 09:21:52 UTC

Description of problem:
Similar bug was discussed on [1] and [2]. Sometimes due to race condition, unit might show up as already existing, even though the process has already finished.

We face this problem on oVirt VDSM testing server, sometimes this happens:
$ /usr/sbin/ifdown enp1s0f1
$ /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup enp1s0f1
'Failed to start transient scope unit: Unit run-13034.scope already exists.'
When we wait a second and then retry, it's OK.

A fix for this problem was introduced in systemd v220 which is not available for EL7. Would it be possible to backport it?



Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2 Beta (Maipo)
Linux 3.10.0-322.el7.x86_64
systemd-219-18.el7.x86_64



How reproducible:
On our test server sometimes, but in mentioned bugs [1] [2] they were able to reproduce it always.




Steps to Reproduce [1]:
[root@axis-00408cc563e5 /mnt/flash/root]27929# cat stress
#!/bin/sh

systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
systemd-run --scope /bin/true &
[root@axis-00408cc563e5 /mnt/flash/root]27929# ./stress
[root@axis-00408cc563e5 /mnt/flash/root]27929# Running as unit run-27947.scope.
Running as unit run-27946.scope.
Running as unit run-27945.scope.
Running as unit run-27948.scope.
Running as unit run-27952.scope.
Running as unit run-27950.scope.
Running as unit run-27951.scope.
Running as unit run-27949.scope.

[root@axis-00408cc563e5 /mnt/flash/root]27929# systemctl -t scope
UNIT            LOAD   ACTIVE SUB     DESCRIPTION
run-27945.scope loaded active running /bin/true
run-27946.scope loaded active running /bin/true
run-27947.scope loaded active running /bin/true
run-27948.scope loaded active running /bin/true
run-27949.scope loaded active running /bin/true
run-27950.scope loaded active running /bin/true
run-27951.scope loaded active running /bin/true
run-27952.scope loaded active running /bin/true

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

8 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
[root@axis-00408cc563e5 /mnt/flash/root]27929# systemctl status run-27945.scope
● run-27945.scope - /bin/true
   Loaded: loaded (/run/systemd/system/run-27945.scope; static)
  Drop-In: /run/systemd/system/run-27945.scope.d
           └─50-Description.conf
   Active: active (running) since Fri 2014-10-24 13:13:26 GMT; 15s ago




Steps to reproduce [2]:
[robryk@sharya-rana bin]$ systemctl --user daemon-reload
[robryk@sharya-rana bin]$ systemctl --version
systemd 226
+PAM -AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN
[robryk@sharya-rana bin]$ systemd-run --user --scope --unit=foobar /bin/sleep 5
Running scope as unit foobar.scope.
[robryk@sharya-rana bin]$ systemctl status --user foobar.scope
● foobar.scope - /bin/sleep 5
   Loaded: loaded (/run/user/1000/systemd/user/foobar.scope; static; vendor preset: enabled)
  Drop-In: /run/user/1000/systemd/user/foobar.scope.d
           └─50-Description.conf
   Active: active (running) since Wed 2015-09-23 02:51:56 CEST; 30s ago

Sep 23 02:51:56 sharya-rana systemd[763]: Started /bin/sleep 5.
[robryk@sharya-rana bin]$ systemd-run --user --scope --unit=foobar /bin/sleep 5
Failed to start transient scope unit: Unit foobar.scope already exists.



[1] https://bugs.freedesktop.org/show_bug.cgi?id=86520
[2] https://github.com/systemd/systemd/issues/1351

Comment 4 GenadiC 2015-10-21 05:21:05 UTC

Run with systemd from https://brewweb.devel.redhat.com/taskinfo?taskID=9969138 and it looks like the problem of bringing interface up is solved

Comment 6 Dan Kenigsberg 2015-10-22 07:30:33 UTC

Lukáš, would you explain why we cannot rush this fix into 0day errata of rhel-7.2.0?

If not part of 7.2.0, may we have it in a near z-stream?

Comment 8 Petr Horáček 2015-10-22 09:22:57 UTC

Note this mail from Lennart Poettering. This patch may not be 100% solution, but it's better than nothing:

On Thu, 15.10.15 13:25, Petr Horacek (phoracek) wrote:

> Hello,
>
> recently we encountered strange systemd problem on automated tests of
> networking
> part of oVirt VDSM project.
>
> Sometimes this happens:
> $ /usr/sbin/ifdown enp1s0f1
> $ /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup enp1s0f1
> Failed to start transient scope unit: Unit run-13034.scope already exists.
>
> systemd-run should create a new scope every time it's called, should not
> it? Could it be
> a racefull bug in systemd?

The code for this is actually really naive... the number is just the
PID of the caller, and there's no check at all to ensure it is
unique. PIDs overrun easily, hence this is not nice at all...

What's even worse: when you use -H or -M to invoke things remotely we
still pick the client side PID for the name....

I figure we should rework this to pick some sufficiently large random
token instead, so that this is unlikely to conflict without actually
having to check for conflicts.

In the meantime, you should be able to fix this by explicitly picking
a randomized name for the scope using --unit=. For example, consider
just adding --unit=`uuidgen` to your command line, and the clashes
should not happen anymore.

> I found recently added issue [1] which describes similar problem,
> but with --unit instead of --slice. Note that our machine which
> reproduced it has systemd older than v220.
>
> Is it possible, that this is the same case as described in [1] and
> therefore it should be
> fixed in systemd 220?
>
> Is it possible to backport [1]'s fix to EL7?

Well, there are still cases where we unable to clean up scope units
properly, because we don't get any notifications for them when they
run empty. But yeah the current upstream versions should be better
than older versions.

Lennart

--
Lennart Poettering, Red Hat

Comment 22 errata-xmlrpc 2016-11-04 00:44:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2216.html