Bug 1931034 - systemd pid-1 crashed during update of systemd
Summary: systemd pid-1 crashed during update of systemd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedFreezeException
Duplicates: 1930793 1930900 1934549 1936027 1936559 1937134 1937407 1937504
Depends On:
Blocks: F34BetaFreezeException
 
Reported: 2021-02-20 07:50 UTC by Villy Kruse
Modified: 2021-03-23 15:54 UTC
CC List: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-17 20:16:01 UTC
Type: Bug
Embargoed:


Attachments
Picture of screen showing the frozen state (3.40 MB, image/jpeg)
2021-02-20 07:50 UTC, Villy Kruse
Running "journalctl -b -1 -p err" after system successfully rebooted (1.84 KB, text/plain)
2021-02-20 07:51 UTC, Villy Kruse
New version systemd (2.04 KB, text/plain)
2021-03-04 14:37 UTC, Villy Kruse
Old version of systemd (2.31 KB, text/plain)
2021-03-04 14:38 UTC, Villy Kruse
Raw crash dump (591.92 KB, application/octet-stream)
2021-03-04 14:38 UTC, Villy Kruse
Raw crash dump (625.91 KB, application/octet-stream)
2021-03-04 14:39 UTC, Villy Kruse
coredumpctl gdb, thread apply all bt full (10.02 KB, text/plain)
2021-03-06 04:25 UTC, Chris Murphy

Description Villy Kruse 2021-02-20 07:50:00 UTC
Created attachment 1758380
Picture of screen showing the frozen state

Description of problem:

During update of systemd on Fedora 34 the system completely froze.

Version-Release number of selected component (if applicable):

systemd 247.3-2.fc34 updating to 247.3-3.fc34

How reproducible:

Don't know

Steps to Reproduce:
1.  dnf update systemd

Actual results:

System freezing

Expected results:

Normal update

Additional info:

xfce environment running in graphical mode.

Comment 1 Villy Kruse 2021-02-20 07:51:50 UTC
Created attachment 1758381
Running "journalctl -b -1 -p err" after system successfully rebooted

Comment 2 Villy Kruse 2021-02-20 14:12:37 UTC
If I run

  systemctl daemon-reexec

it looks like pid-1 fails to establish a connection to D-Bus. In fact, running "lsof -p 1" shows no sockets at all opened by pid-1.

Comment 3 Zbigniew Jędrzejewski-Szmek 2021-02-20 19:33:27 UTC
Feb 20 07:38:39 fc33 systemd[1]: Assertion 'p->n_ref > 0' failed at src/shared/varlink.c:386, function varlink_unref(). Aborting.
Feb 20 07:38:40 fc33 systemd-coredump[4513]: Process 4512 (systemd) of user 0 dumped core.
                                             
                                             Stack trace of thread 4512:
                                             #0  0x00007f38e5ee058b kill (libc.so.6 + 0x3d58b)
                                             #1  0x000055c7b55d348e crash (/usr/lib/systemd/systemd (deleted) + 0x4848e)
                                             #2  0x00007f38e6085960 __restore_rt (libpthread.so.0 + 0x13960)
                                             #3  0x00007f38e5ee0292 raise (libc.so.6 + 0x3d292)
                                             #4  0x00007f38e5ec98a4 abort (libc.so.6 + 0x268a4)
                                             #5  0x00007f38e62218d2 n/a (/usr/lib/systemd/libsystemd-shared-247.so (deleted) + 0x728d2)
Feb 20 07:38:40 fc33 systemd[1]: Caught <ABRT>, dumped core as pid 4512.
Feb 20 07:38:40 fc33 systemd[1]: Freezing execution.
Feb 20 07:38:40 fc33 systemd-oomd[367]: Failed to connect to /run/systemd/io.system.ManagedOOM: Connection refused
Feb 20 07:38:40 fc33 systemd-oomd[367]: Failed to acquire varlink connection
Feb 20 07:38:40 fc33 systemd-oomd[367]: Event loop failed: Connection refused
Feb 20 07:39:39 fc33 abrt-notification[4650]: Process 4512 (systemd) crashed in ??()

Oops, that's https://github.com/systemd/systemd/issues/18025.
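For readers unfamiliar with the failed check: an assertion of the form 'p->n_ref > 0' inside an *_unref() function guards a reference count. The C sketch below is a generic illustration of that pattern only; the conn type, conn_new(), conn_ref(), conn_unref() and the conn_frees counter are hypothetical names, not systemd's varlink code, and nothing here claims to show the actual root cause fixed upstream. If some code path drops one reference more than it took, the count reaches zero early and the object is freed while another owner still holds a pointer to it; a later unref through that stale pointer is a use-after-free, which an assertion like this can only catch on a best-effort basis.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; illustration only, not systemd's Varlink type. */
typedef struct conn {
    unsigned n_ref;   /* number of live owners */
} conn;

/* Counts frees so the lifetime can be observed in this sketch. */
int conn_frees = 0;

conn *conn_new(void) {
    conn *c = calloc(1, sizeof(*c));
    if (c)
        c->n_ref = 1;          /* the creator owns the first reference */
    return c;
}

conn *conn_ref(conn *c) {
    assert(c->n_ref > 0);      /* taking a reference on a dead object is a bug */
    c->n_ref++;
    return c;
}

/* Drops one reference and frees the object when the last one is released.
 * Always returns NULL so callers can write `c = conn_unref(c);`. */
conn *conn_unref(conn *c) {
    if (!c)
        return NULL;
    assert(c->n_ref > 0);      /* same shape of check as 'p->n_ref > 0' in the log */
    if (--c->n_ref == 0) {
        free(c);
        conn_frees++;
    }
    return NULL;
}
```

In the balanced sequence new, ref, unref, unref, each owner releases exactly the reference it took, so the object is freed once, on the last unref. Returning NULL from conn_unref() lets each owner clear its own pointer in the same statement, which makes an accidental second unref through that pointer a harmless no-op.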

Comment 4 Fedora Blocker Bugs Application 2021-03-02 21:40:01 UTC
Proposed as a Freeze Exception for 34-beta by Fedora user pwalter using the blocker tracking app because:

 systemd crashing during update makes it very difficult to test Beta.

Comment 5 Zbigniew Jędrzejewski-Szmek 2021-03-02 22:13:57 UTC
Is this happening to other people too?

Comment 6 Villy Kruse 2021-03-03 05:18:47 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #5)
> Is this happening to other people too?

I saw a crash during an update only once. That said, when it crashed during the update I was in graphical mode. Since then I only upgrade in multi-user mode.

"systemctl daemon-reexec" consistently crashes when running in a virtual machine.  Both qemu/kvm and VirtualBox.

It does not crash on bare hardware on my system.

"systemctl daemon-reexec" crash is seen in version 247 and 248 on Fedora 34.

Comment 7 Kalev Lember 2021-03-03 10:26:33 UTC
I am seeing this as well. I have systemd-247.3-2.fc34.x86_64 installed in a VM, and updating systemd ends up with the whole update hanging. I've been working around this by reverting to an older VM snapshot and updating with dnf update -x 'systemd*'

Comment 8 Adam Williamson 2021-03-03 17:47:53 UTC
Well, the correct "workaround" is to update offline. Like we recommend everyone does :D It works OK in an offline update.

Comment 9 Adam Williamson 2021-03-03 17:50:13 UTC
+3 in https://pagure.io/fedora-qa/blocker-review/issue/272 , marking accepted, but as noted there I'm only inclined to pull a very targeted and safe fix for this.

Comment 10 Villy Kruse 2021-03-04 05:55:58 UTC
(In reply to Adam Williamson from comment #8)
> Well, the correct "workaround" is to update offline. Like we recommend
> everyone does :D It works OK in an offline update.

That would still run "systemctl daemon-reexec" from the post-install scriptlet, wouldn't it?

Comment 11 Zbigniew Jędrzejewski-Szmek 2021-03-04 13:06:49 UTC
I cannot reproduce this. Maybe there's some additional ingredient missing.
Also, the backtrace in Comment 3 is not useful, since it was generated after the files had already been replaced.
Villy, please attach the coredump if you have it.

Also, you say that you are able to reproduce this on 'daemon-reload'. Which versions?
Can you provide the backtrace or coredump?

I'd assume that this is fixed with https://github.com/systemd/systemd/commit/9807fdc1da8e037ddedfa4e2c6d2728b6e60051e,
except that you're saying it also happens with 248.

Comment 12 Villy Kruse 2021-03-04 14:08:37 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #11)

> 
> Also, you say that you are able to reproduce this on 'daemon-reload'. Which
> versions?

Daemon-reexec not reload.


> Can you provide the backtrace or coredump?
> 

Later today.

Comment 13 Villy Kruse 2021-03-04 14:37:25 UTC
Created attachment 1760694
New version systemd

Comment 14 Villy Kruse 2021-03-04 14:38:12 UTC
Created attachment 1760695
Old version of systemd

Comment 15 Villy Kruse 2021-03-04 14:38:51 UTC
Created attachment 1760696
Raw crash dump

Comment 16 Villy Kruse 2021-03-04 14:39:38 UTC
Created attachment 1760697
Raw crash dump

Comment 17 Zbigniew Jędrzejewski-Szmek 2021-03-05 17:53:05 UTC
*** Bug 1934549 has been marked as a duplicate of this bug. ***

Comment 18 Matt Fagnani 2021-03-06 01:37:21 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #5)
> Is this happening to other people too?

Zbigniew, I reported a crash with the same failed assertion 'Assertion 'p->n_ref > 0' failed at src/shared/varlink.c:386, function varlink_unref().' with the full trace which happened during the systemd-247.3-3.fc34 post-install scriptlet when I ran sudo dnf reinstall systemd*247.3* at
https://bugzilla.redhat.com/show_bug.cgi?id=1930900 Many systemd services timed out during the transaction, and the update took several minutes when it would normally take only a few seconds. After that, the system couldn't switch VTs, reboot or shut down normally.

I reported a systemd abort during the systemd-247.3-3.fc34 post-install scriptlet when running sudo dnf upgrade. The trace of the systemd crash showed a "corrupted double-linked list" in malloc_printerr at malloc.c:5626 in glibc-2.32.9000-29.fc34.x86_64 in frame 6. https://bugzilla.redhat.com/show_bug.cgi?id=1930793 A systemd crash with the same trace and other characteristics also happened when I upgraded to systemd-248~rc2-1.fc34. I can provide more information if it would help. Thanks.

Comment 19 Chris Murphy 2021-03-06 02:14:52 UTC
*** Bug 1936027 has been marked as a duplicate of this bug. ***

Comment 20 Chris Murphy 2021-03-06 02:20:54 UTC
There is quite a lot in bug 1936027, which I filed before I found this one. But the gist is that the coredump produced is bad and can't be used by coredumpctl. And I can reproduce it 100% of the time on Fedora Workstation: do a graphical login, launch GNOME Terminal, and run 'dnf update *rpm' in a directory that contains the systemd RPMs. It never happens if I do a 'dnf offline-upgrade' or a GNOME Software-initiated PackageKit offline update. Maybe it has something to do with a full user stack being in use, or with the update being initiated from inside that user stack *shrug*. But in all cases the coredump isn't usable.

Comment 21 Adam Williamson 2021-03-06 02:34:36 UTC
I actually tried setting the scratch build I did for the resolver fix as a 'workaround' in openQA so the test that fails intermittently would work reliably (I hoped)...and that ran into this problem too. The openQA tests install the builds set as 'workarounds' before running the test proper, but when they tried to do that, things blew up, like so:

https://openqa.fedoraproject.org/tests/801288#step/_console_wait_login_2/3

There's a bunch of "Failed to" <do something> errors, the "reboot" command after the update completes doesn't work, and logging into another console doesn't work. I can't tell if anything outright *crashed* because the test wasn't able to log into another console to find out. I can fiddle with it some more next week if desired.

Comment 22 Chris Murphy 2021-03-06 04:11:21 UTC
$ sudo reboot
Failed to open initctl fifo: No such device or address
Failed to talk to init daemon.

Recovery requires reboot -f or SysRq+B.

Comment 23 Chris Murphy 2021-03-06 04:25:48 UTC
Created attachment 1761097
coredumpctl gdb, thread apply all bt full

I got most of thread 2, thread 1 looks unusable.

Comment 24 Zbigniew Jędrzejewski-Szmek 2021-03-07 15:53:22 UTC
You can do 'systemctl stop systemd-oomd' to work around the issue ;)
https://github.com/systemd/systemd/pull/18915 should fix the crash.

Comment 25 Adam Williamson 2021-03-08 22:30:33 UTC
yup, that does indeed seem to work.

Comment 26 Zbigniew Jędrzejewski-Szmek 2021-03-09 13:42:30 UTC
*** Bug 1936559 has been marked as a duplicate of this bug. ***

Comment 27 Zbigniew Jędrzejewski-Szmek 2021-03-10 16:44:40 UTC
*** Bug 1937407 has been marked as a duplicate of this bug. ***

Comment 28 Zbigniew Jędrzejewski-Szmek 2021-03-11 11:45:20 UTC
*** Bug 1937134 has been marked as a duplicate of this bug. ***

Comment 29 Fedora Update System 2021-03-11 14:11:43 UTC
FEDORA-2021-7bd2ec6c13 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-7bd2ec6c13

Comment 30 Fedora Update System 2021-03-11 19:51:56 UTC
FEDORA-2021-7bd2ec6c13 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-7bd2ec6c13`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-7bd2ec6c13

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 31 Zbigniew Jędrzejewski-Szmek 2021-03-12 14:05:06 UTC
*** Bug 1937504 has been marked as a duplicate of this bug. ***

Comment 32 Zbigniew Jędrzejewski-Szmek 2021-03-12 14:18:11 UTC
*** Bug 1930900 has been marked as a duplicate of this bug. ***

Comment 33 Zbigniew Jędrzejewski-Szmek 2021-03-12 14:19:10 UTC
*** Bug 1930793 has been marked as a duplicate of this bug. ***

Comment 34 George R. Goffe 2021-03-12 21:30:07 UTC
Howdy,

I'm looking forward to installing this fix... 

I have seen this problem twice on native hardware... in the past two weeks or so... same version of systemd (#1 was the upgrade to systemd... I thought it was a fluke... #2 was a dnf reinstall systemd...).

I would bet money that it could happen again if I do a dnf reinstall systemd.

George...

Comment 35 Xose Vazquez Perez 2021-03-12 21:39:44 UTC
(In reply to George R. Goffe from comment #34)

> I'm looking forward to installing this fix... 
> 
> I have seen this problem twice on native hardware... in the past two weeks
> or so... same version of systemd (#1 was the upgrade to systemd... I thought
> it was a fluke... #2 was a dnf reinstall systemd...).
> 
> I would bet money that it could happen again if I do a dnf reinstall systemd.

It really is fixed.

1. If you do not want to wait or use the testing repo: # systemctl stop systemd-oomd ; systemctl disable systemd-oomd ; dracut -f
2. # dnf update --enablerepo=updates-testing systemd* ; dracut -f ; # and init 6

Comment 36 Zbigniew Jędrzejewski-Szmek 2021-03-13 09:49:12 UTC
I'll take that in cash, thanks.

Comment 37 Fedora Update System 2021-03-13 19:27:46 UTC
FEDORA-2021-c2bfa5e4f6 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-c2bfa5e4f6`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-c2bfa5e4f6

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 38 Fedora Update System 2021-03-16 00:29:09 UTC
FEDORA-2021-c2bfa5e4f6 has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 39 Adam Williamson 2021-03-17 20:16:01 UTC
I think we can close this now, the update was tested to fix it and is in stable...

Comment 40 Villy Kruse 2021-03-18 12:49:50 UTC
(In reply to Adam Williamson from comment #39)
> I think we can close this now, the update was tested to fix it and is in
> stable...

I can no longer cause a failure when calling "systemctl daemon-reexec" even when systemd-oomd is running.
So the real problem seems to be fixed as well.

