Bug 1378974 - Restart of systemd-udev-trigger.service in systemd-udev %postun causes 'replug' of graphics adapter, can cause X to crash
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 24
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedFreezeException
Duplicates: 1382749 1383410
Depends On:
Blocks: F25BetaFreezeException
 
Reported: 2016-09-23 18:03 UTC by Tomas Sieger
Modified: 2016-11-04 23:01 UTC (History)

Fixed In Version: systemd-229-16.fc24 systemd-231-10.fc25
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-14 05:00:04 UTC
Type: Bug
Embargoed:


Attachments
journalctl -f logs from udevadm trigger --type=devices --action=add (16.90 KB, text/plain)
2016-10-04 22:33 UTC, Zbigniew Jędrzejewski-Szmek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1341327 0 unspecified CLOSED [abrt] xorg-x11-server-Xorg: Segmentation fault at address 0x34 when restart of systemd-udev-trigger.service causes 'rep... 2021-02-22 00:41:40 UTC

Internal Links: 1341327 1381596

Description Tomas Sieger 2016-09-23 18:03:13 UTC
Description of problem:

Running "dnf upgrade" on a fresh FC 24 install also upgrades systemd-udev (updated version .x86_64, 229-13.fc24). Upgrading systemd-udev restarts the running X session (the user has to log in again) and breaks the dnf transaction: the state of the install is invalid, as many packages are marked as installed in both the old and new versions (as found by running "dnf repoquery --duplicated"), but in fact the new version is not installed (e.g. kernel).


Steps to Reproduce:
1. install FC 24 (incl. setting up wifi networking)
2. "dnf upgrade"
3. packages not updated, but marked as duplicated

Actual results:
Install broken: packages are marked as installed (and duplicated), but in fact were not updated.

Expected results:
"dnf update" updates packages (possibly without restarting the X session, or with a restart, but with all packages updated)

Additional info:

Comment 1 Igor Gnatenko 2016-09-26 11:12:04 UTC
Unfortunately DNF can't do much about this. I guess something in packaging is wrong, though I'm not sure what exactly.

Comment 2 Zbigniew Jędrzejewski-Szmek 2016-09-26 17:57:48 UTC
It sounds a lot like #1367766, but that bug was only present in F25+. I'm not aware of anything which would cause this in F24. Can you attach the logs from around the upgrade?

Comment 3 Tomas Sieger 2016-09-26 18:22:26 UTC
Sorry, the logs are gone - I reinstalled the system. 
However, after reinstall, I updated systemd-udev alone (before updating all the rest), and issues similar to those described in #1367766 are now present in the newly installed system. So if some log related to this scenario would be helpful, let me know which one to attach.

Comment 4 Zbigniew Jędrzejewski-Szmek 2016-10-04 20:50:24 UTC
*** Bug 1341327 has been marked as a duplicate of this bug. ***

Comment 5 Adam Williamson 2016-10-04 20:58:06 UTC
There's some more info in 1341327, which we're leaving open for the X end of this problem.

Basically, restarting systemd-udev-trigger.service causes (we think) systemd-logind to pull the graphics adapter out from under X and immediately give it back again. This service is (currently) restarted in %postun of systemd-udev . From reports we've received so far, it seems that on systems with hybrid graphics, this causes X to crash. On systems with dedicated graphics, it doesn't.

This bug is for the systemd end of the problem: the spurious graphics adapter 'replug' probably just shouldn't happen at all. The other bug is for making X not crash if it *does* happen, if X folks want to do that.

I'm proposing this bug as a Beta freeze exception. Since the restart is in %postun , if we ship F25 Beta with the current systemd package, then the first update to systemd-udev will trigger this bug - even if it's updating to a systemd-udev which takes the restart out of %postun. To ensure F25 Beta users don't encounter this bug, we have to include a systemd-udev build with the systemd-udev-trigger restart taken out of %postun in the frozen images.

http://koji.fedoraproject.org/koji/buildinfo?buildID=807101 is the build that should fix this, I'll submit an update once it's complete.

Comment 6 Lennart Poettering 2016-10-04 22:15:46 UTC
Nah, "udevadm trigger" is an operation that should always be safe. Software (be it apps or drivers) that cannot deal with such a replug is broken, and needs to be fixed. 

We have been retriggering udevadm either fully or only specific subsystems since about always. If X11 is broken now with that it needs to be fixed really.

I don't see anything to change in systemd here. Sorry.

Comment 7 Adam Williamson 2016-10-04 22:24:41 UTC
Lennart: the thing zbyszek thinks may be wrong is not the udevadm trigger operation itself, but the fact that it results in this 'hardware replug' happening. He says he's not sure that's actually intended or wanted.

Comment 8 Zbigniew Jędrzejewski-Szmek 2016-10-04 22:33:44 UTC
Created attachment 1207377
journalctl -f logs from udevadm trigger --type=devices --action=add

The issue is not caused by udevadm trigger --type=devices --action=add directly, but through systemd-logind. If systemd-logind is SIGSTOPed, nothing happens. If systemd-logind is running normally, there is a bunch of remove/add events logged by Xorg (see attachment). Unfortunately, systemd-logind doesn't log anything, even at debug level. I think it should at least log when it adds/removes devices.
I also don't think it should remove the devices from clients, even temporarily. I would expect this to cause glitches at least.

Comment 9 Kevin Fenzi 2016-10-04 22:58:05 UTC
+1 FE

Comment 10 Mohan Boddu 2016-10-04 23:16:03 UTC
+1 FE

Comment 11 Adam Williamson 2016-10-04 23:21:24 UTC
For the record, I checked release day F23 and F24 lives in a VM, and restarting systemd-udev-trigger.service appears to trigger the hardware 'replug' in both; I see:

Oct 04 19:13:40 localhost /usr/libexec/gdm-x-session[1542]: (II) config/udev: removing GPU device /sys/devices/pci0000:00/0000:00:02.0/drm/card0 /dev/dri/card0
Oct 04 19:13:40 localhost /usr/libexec/gdm-x-session[1542]: xf86: remove device 0 /sys/devices/pci0000:00/0000:00:02.0/drm/card0
Oct 04 19:13:40 localhost /usr/libexec/gdm-x-session[1542]: failed to find screen to remove

even on F23. However, it seems like there was a change between F23 and F24: the introduction of the systemd-udev subpackage, which did not exist in F23. This commit created it:

http://pkgs.fedoraproject.org/cgit/rpms/systemd.git/commit/?id=c16b573717a4fc657d8bac8e12f734f574b8ec42

and added the postun scriptlet:

+%postun udev
+%systemd_postun_with_restart systemd-udev-{settle,trigger}.service systemd-udevd-{control,kernel}.socket systemd-udevd.service

at least just looking at that commit diff, this wasn't simply moved from somewhere else - we actually weren't doing that before, though the systemd-udev-trigger service did exist. So I think that's why this showed up in F24.

Comment 12 Adam Williamson 2016-10-04 23:22:29 UTC
That's +3 FE (counting myself too), setting accepted.

Comment 13 Andrea Oliveri 2016-10-05 08:10:54 UTC
I think that this bug also affects F25, because yesterday I updated my Thinkpad T430 (Intel + Nvidia) with F25 during a Gnome session and X crashed.

Comment 14 Fedora Update System 2016-10-05 11:21:04 UTC
systemd-229-16.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-faf2598d0c

Comment 15 Fedora Update System 2016-10-05 20:28:16 UTC
systemd-231-8.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-d458ee281a

Comment 16 Fedora Update System 2016-10-06 00:50:19 UTC
systemd-229-16.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 17 Adam Williamson 2016-10-06 04:06:27 UTC
I did have a thought about how we could further mitigate this to prevent people updating a fresh Fedora 24 install from encountering it.

Basically, have systemd-udev do something like this (pseudocode):

%pre
%if (current systemd package is older than systemd-229-16.fc24)
systemctl mask systemd-udev-trigger.service
%endif

%posttrans
systemctl unmask systemd-udev-trigger.service

the %pre will run before the old systemd-udev package's %postun and effectively negate its restart of the service, I think, then the %posttrans would restore it to normal.

We might need a few more hedges - perhaps only do this on update(?), and definitely check if systemd-udev-trigger.service was *already* masked and don't unmask it in %posttrans in that case (systemctl is-enabled can tell us if it's already masked) - but what do people think of the general idea? Too hacky?

Comment 18 Zbigniew Jędrzejewski-Szmek 2016-10-06 11:05:05 UTC
That's way too hacky and error prone. Instead, we could add a drop-in with '[Unit] RefuseManualStop=true' to the service and do 'systemctl daemon-reload' in %post. This is enough to prevent the subsequent 'systemctl try-restart systemd-udev-trigger.service' from doing anything.
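
As a rough illustration of the drop-in approach described above, the following shell sketch writes such a drop-in into a unit directory. The helper name write_refuse_stop_dropin and the file name refuse-manual-stop.conf are hypothetical, chosen only for this example; the actual package may ship the drop-in differently (for instance as a packaged file under /usr/lib/systemd/system).

```shell
#!/bin/sh
# Sketch only: the helper name and the drop-in file name are hypothetical.
# A real package would more likely ship the drop-in as a packaged file.

# write_refuse_stop_dropin UNIT_DIR
# Creates UNIT_DIR/systemd-udev-trigger.service.d/ containing a drop-in that
# sets RefuseManualStop=true, and prints the path of the file it wrote.
write_refuse_stop_dropin() {
    unit_dir=$1
    dropin_dir="$unit_dir/systemd-udev-trigger.service.d"
    dropin_file="$dropin_dir/refuse-manual-stop.conf"
    mkdir -p "$dropin_dir" || return 1
    printf '[Unit]\nRefuseManualStop=true\n' > "$dropin_file" || return 1
    printf '%s\n' "$dropin_file"
}

# On a real system (as root) the drop-in only takes effect after systemd
# reloads its configuration, which is why the comment above mentions
# 'systemctl daemon-reload' in %post:
#   write_refuse_stop_dropin /etc/systemd/system
#   systemctl daemon-reload
```

The function takes the unit directory as a parameter so it can be tried against a scratch directory without touching the running system.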

Comment 19 Fedora Update System 2016-10-07 03:33:56 UTC
systemd-231-8.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

Comment 20 Zbigniew Jędrzejewski-Szmek 2016-10-07 16:49:47 UTC
*** Bug 1382749 has been marked as a duplicate of this bug. ***

Comment 21 Zbigniew Jędrzejewski-Szmek 2016-10-11 16:58:53 UTC
*** Bug 1383410 has been marked as a duplicate of this bug. ***

Comment 22 Zbigniew Jędrzejewski-Szmek 2016-10-11 20:13:13 UTC
systemd-231-10.fc25 now adds RefuseManualStop=true to the unit. This should fix the issues during upgrade.

Comment 23 Kevin Kofler 2016-10-12 13:48:05 UTC
Are you also going to push the RefuseManualStop workaround to F24?

Comment 24 Zbigniew Jędrzejewski-Szmek 2016-10-12 14:19:19 UTC
It's not necessary. The scriptlets which caused the issue were added in F25, and the F24→F25 upgrade should now be fixed. F24 itself is not affected.

Comment 25 Adam Williamson 2016-10-12 15:12:46 UTC
Huh? No, that's wrong. The scriptlets are already in F24. You can reproduce this bug simply by installing a clean F24 and updating systemd-udev from the update repositories.

Comment 26 Zbigniew Jędrzejewski-Szmek 2016-10-12 15:13:50 UTC
Hm, OK. I'll check F24 then too.

Comment 27 Fedora Update System 2016-10-13 05:53:05 UTC
systemd-231-10.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-005ad5dcb1

Comment 28 Fedora Update System 2016-10-14 05:00:04 UTC
systemd-231-10.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

Comment 29 Ray Holme 2016-10-14 14:13:11 UTC
So the Fedora 24 solution to this is to upgrade to 25. :=[[[

Comment 30 Adam Williamson 2016-10-14 15:18:49 UTC
No, it's already fixed in F24. The extra fix that prevents it happening on the first update isn't there yet, but Zbigniew is still planning to do that, I think. Bugzilla / Bodhi integration has limitations in dealing with bugs that affect multiple releases.

Comment 31 Ray Holme 2016-10-14 20:58:23 UTC
OK, thanks, it did happen last time I did an update
when I can get a systemd update without failure, I will know.
I now separate systemd out from "all" and will continue till it does not fail.

:=]

Comment 32 Adam Williamson 2016-10-14 21:08:55 UTC
Ray: it happened when you did the update because of the details of how the bug is triggered.

The bug is triggered by a command in systemd-udev's `%postun` script. When you do a package update from, say, foo-1.0 to foo-2.0, the `%postun` script from foo-1.0 is run as part of that transaction.

So here's how it went down: the *existing* systemd-udev package on your system had a `%postun` script that would trigger the bug. Up until systemd-229-16.fc24, all F24 systemd packages had that script.

We released systemd-229-16.fc24 as an update which *removed* that script. However, because it's the %postun of the *old* package that is run on update - not the %postun from the *new* package - when you install systemd-udev-229-16.fc24 , the bug will happen one last time, because the old systemd-udev package still has the bad %postun.

What the update ensures is that any time you update the package *after* the update to 229-16, you won't hit the bug.

We've since come up with a trick which allows the new package to suppress the old package's %postun, so that the bug will no longer happen when you first update to the 'fixed' package. That trick hasn't been built for F24 yet; I hope Zbigniew will build it, though. Still, now you've got 229-16 installed, you should be safe from this bug in future in any case.
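
To make the ordering described above concrete, here is a small shell simulation of the sequence in which rpm runs scriptlets during an upgrade. It installs nothing and only prints the steps; the function name upgrade_scriptlet_order is hypothetical, and foo-1.0 / foo-2.0 are the placeholder package names used in the explanation above.

```shell
#!/bin/sh
# Simulation only: prints the order in which rpm runs scriptlets when a
# package is upgraded from an old build to a new one. Nothing is installed.

# upgrade_scriptlet_order OLD NEW
# The ($1=N) notes show the argument rpm passes to each scriptlet on upgrade.
upgrade_scriptlet_order() {
    old=$1
    new=$2
    printf '%s\n' \
        "%pre of $new (\$1=2)" \
        "install files of $new" \
        "%post of $new (\$1=2)" \
        "%preun of $old (\$1=1)" \
        "remove files of $old" \
        "%postun of $old (\$1=1)" \
        "%posttrans of $new"
}

upgrade_scriptlet_order foo-1.0 foo-2.0
```

The old package's %postun runs after the new package's %pre and %post, so a fixed package can only influence it by changing system state beforehand.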

Comment 33 Ray Holme 2016-10-15 12:35:59 UTC
Many thanks for your thorough explanation. I knew it was some command in the "clean" phase. I will expect the next one to succeed and remove the need for the separate "dnf update systemd" --- :=]]]

Comment 34 Boyan Anastasov 2016-11-04 08:38:00 UTC
Just for the record: I had the same problem upgrading from F23->F24 - X was terminated. Now it happened again with upgrade from F24->F25, and the systemd-udev package was updated long ago before upgrade from F24->F25 on Oct 04 2016:

/var/log/dnf.rpm.log:Nov 03 14:26:34 INFO Upgraded: systemd-udev-231-10.fc25.x86_64
/var/log/dnf.rpm.log-20161009:Oct 04 18:20:24 INFO Upgraded: systemd-udev-229-15.fc24.x86_64
/var/log/dnf.rpm.log-20161009:Oct 04 18:20:30 INFO Cleanup: systemd-udev-229-13.fc24.x86_64
/var/log/dnf.rpm.log-20161009:Oct 07 10:51:11 INFO Upgraded: systemd-udev-229-16.fc24.x86_64
/var/log/dnf.rpm.log-20161009:Oct 07 10:51:20 INFO Cleanup: systemd-udev-229-15.fc24.x86_64

It's not a big deal, but with dnf it is a bit difficult to clean up the mess after the crash. I use this:

dnf remove $(dnf repoquery --duplicated --latest-limit -1 -q)

which complains about the removal of systemd* and dnf packages, so I need to remove those manually with rpm. This time the new problem was the missing /usr/lib/locale/locale-archive, which leads to:


-bash: warning: setlocale: LC_CTYPE: cannot change locale (en_US.UTF-8): No such file or directory
-bash: warning: setlocale: LC_COLLATE: cannot change locale (en_US.UTF-8): No such file or directory
-bash: warning: setlocale: LC_MESSAGES: cannot change locale (en_US.UTF-8): No such file or directory
-bash: warning: setlocale: LC_NUMERIC: cannot change locale (en_US.UTF-8): No such file or directory
-bash: warning: setlocale: LC_TIME: cannot change locale (en_US.UTF-8): No such file or directory

Executing build-locale-archive fixed it.

I have one more system on which to test the upgrade, if there is a fix for F24's systemd-udev package. I did not have a problem upgrading systemd* within F23 or F24, only when upgrading to the next FNN.

Comment 35 Ray Holme 2016-11-04 12:16:06 UTC
I think this is fixed now but FTR

I do updates in two steps till I am sure

  dnf update systemd
  dnf update

If and when the first fails, I do this before the next one

  dnf clean all

Comment 36 Adam Williamson 2016-11-04 23:01:05 UTC
that's not a great defence. the best defences are the ones we documented: use offline updates, update from a VT, or update from a tmux/screen session.

this specific bug should now be basically fixed, yes.
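
As a rough aid for the defences listed above, the following shell sketch checks whether the current shell looks like it would survive an X crash. It is a heuristic under stated assumptions, not an official check: the function name session_survives_x_crash is hypothetical, and it relies only on the TMUX and STY variables that tmux and screen set, plus the absence of DISPLAY and WAYLAND_DISPLAY as a hint that the shell is on a VT.

```shell
#!/bin/sh
# Heuristic sketch, not an official check: reports whether the current shell
# looks like it would keep running if the X session went away.

# session_survives_x_crash
# Returns 0 when inside tmux ($TMUX set) or screen ($STY set), or when no
# graphical display variable is set (for example on a VT); returns 1 otherwise.
session_survives_x_crash() {
    if [ -n "${TMUX:-}" ] || [ -n "${STY:-}" ]; then
        return 0
    fi
    if [ -z "${DISPLAY:-}" ] && [ -z "${WAYLAND_DISPLAY:-}" ]; then
        return 0
    fi
    return 1
}

# Example use before an update:
#   session_survives_x_crash || echo "consider running the update inside tmux" >&2
```

A plain terminal emulator inside the graphical session fails this check, which is the case where an X crash would take the dnf transaction down with it.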

