Bug 1867930 - systemd 246 and logind hang talking to each other at 100% CPU causing the system to be stuck at boot
Summary: systemd 246 and logind hang talking to each other at 100% CPU causing the sys...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 33
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-11 08:31 UTC by Hans de Goede
Modified: 2020-09-08 17:04 UTC (History)
11 users (show)

Fixed In Version: systemd-246.4-1.fc33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-08 17:04:22 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Strace of systemd hanging (seems to be talking to logind endlessly) (146.00 KB, application/gzip)
2020-08-22 15:10 UTC, Hans de Goede
no flags Details
strace -s 200 -p 1 (systemd hang strace) (5.77 KB, application/gzip)
2020-08-24 13:59 UTC, Hans de Goede
no flags Details

Description Hans de Goede 2020-08-11 08:31:27 UTC
After doing a dnf distro-sync from F32 to Rawhide/F33 (for dogfooding reasons) I noticed that my system would no longer boot.

The strange thing is, the very first boot when it also had to build some akmods for virtualbox, and the boot thus was slow, it did boot. Note I've disable the service loading the virtualbox hypervisor kernel modules, that did not help, so this is not caused by them.

It took my a couple of hours to find a workaround. I had to pass systemd.debug-shell=1 to get a shell and then I had to jump through some more hoops to get top to work.

Once I had top working I saw systemd pegged at 100% CPU and systemd-logind and dbus-broker at 40-80% CPU. The system did boot successfully once, and if I kill dbus-broker then sometimes after that the 100% CPU usage goes away, even agter dbus-broker has restarted, but other times as soon as dbus-broker starts again systemd causing 100% CPU load is back again. The same symptoms also happen when swapping dbus-daemon for dbus-broker.

There seems to be some race here where systemd keeps asking something from, presumably, logind over and over and over.

I've done a "strace -p 1" and hit CTRL+C after a couple of seconds, that resulted in a 22MB strace (which compresses really well) I'll attach it here.

Since I do have another machine with F33 which does not show this problem which is at systemd 246-rc2, rather then 246.1, I first tried downgrading to rc2, but that did not help.

For me the only way to fix this ultimately was to downgrade systemd to the latest 245.7-1 build from F32, with that everything works fine again.

Comment 1 Hans de Goede 2020-08-11 08:38:01 UTC
Bug 1866367 is a possible duplicate, I've asked the reporter to try adding a debug-shell and run top to see if he is really seeing the same thing.

Comment 2 Ben Cotton 2020-08-11 15:19:06 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.

Comment 3 Hans de Goede 2020-08-22 15:10:43 UTC
Created attachment 1712237 [details]
Strace of systemd hanging (seems to be talking to logind endlessly)

I just realized I forgot to attach the promised strace, here it is.

I will also install all the latest Fedora 33 updates and retry to make sure this is still happening.

Comment 4 Zbigniew Jędrzejewski-Szmek 2020-08-22 15:55:28 UTC
The repeating part seems to be:
stat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/etc/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory)
stat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
stat("/run/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory)
stat("/usr/local/lib/systemd/system", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/etc/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory)
stat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
stat("/run/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory)
stat("/usr/local/lib/systemd/system", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
sendmsg(70, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\0006\264\25\0\236\0\0\0\1\1o\0=\0\0\0/org/fre"..., iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Uni"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068

For such straces, -t and -s 200 help: right now it's hard to say even what call is being made.

Comment 5 Hans de Goede 2020-08-24 13:59:11 UTC
Ok good news, I'm no longer seeing this on my X1C8 after installing all the latest updates.

I was able to reproduce it on another machine with a slightly older F33 version, even with the latest systemd. I will leave that other systems as is for now in case you want to debug this further.

And I will attach a strave -s 200 from that other machine, it seems the sendmsg is zram generator related ? I wonder what makes systemd-logind consume around 25% CPU during the hang then?

Comment 6 Hans de Goede 2020-08-24 13:59:57 UTC
Created attachment 1712363 [details]
strace -s 200 -p 1 (systemd hang strace)

Comment 7 Hans de Goede 2020-08-24 14:06:33 UTC
Oh BTW I can also fully update the other system to all the latest f33 bits and see if the problem goes away there too. If it does I guess the root cause lay elsewhere and has been fixed; and then we can close this bug.

Either way let me know how you want to proceed with this bug.

Comment 8 Zbigniew Jędrzejewski-Szmek 2020-08-24 14:50:32 UTC
PID1 is sending org.freedesktop.DBus.Properties for the same proprties in a tight loop,
the two messages are for org.freedesktop.systemd1.Service and org.freedesktop.systemd1.Unit.

15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0003\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base=" \0\0\0org.freedesktop.systemd1.Service\0\0\0\0\374\1\0\0\0\0\0\0\7\0\0\0MainPID\0\1u\0\0\0\0\0\0\0\0\0\0\n\0\0\0ControlPID\0\1u\0\0\0\0\0\0\0\n\0\0\0StatusText\0\1s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0StatusErrno\0\1i\0\0\0\0\0\0\6\0\0\0Result\0\1s\0\0\0\7\0\0\0success\0\0\0\0\0\f\0\0\0ReloadResult"..., iov_len=819}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 995
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\0004\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0004\3\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\f\0\0\0FreezerState\0\1s\0\7\0\0\0running\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\24\0\0\0StateChangeTimestamp\0\1t\0\0\0\0\0\0\0\0\0\0\0\0\0\35\0\0\0StateChangeT"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0005\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base=" \0\0\0org.freedesktop.systemd1.Service\0\0\0\0\374\1\0\0\0\0\0\0\7\0\0\0MainPID\0\1u\0\0\0\0\0\0\0\0\0\0\n\0\0\0ControlPID\0\1u\0\0\0\0\0\0\0\n\0\0\0StatusText\0\1s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0StatusErrno\0\1i\0\0\0\0\0\0\6\0\0\0Result\0\1s\0\0\0\7\0\0\0success\0\0\0\0\0\f\0\0\0ReloadResult"..., iov_len=819}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 995
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\0006\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0004\3\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\f\0\0\0FreezerState\0\1s\0\7\0\0\0running\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\24\0\0\0StateChangeTimestamp\0\1t\0\0\0\0\0\0\0\0\0\0\0\0\0\35\0\0\0StateChangeT"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0007\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base=" \0\0\0org.freedesktop.systemd1.Service\0\0\0\0\374\1\0\0\0\0\0\0\7\0\0\0MainPID\0\1u\0\0\0\0\0\0\0\0\0\0\n\0\0\0ControlPID\0\1u\0\0\0\0\0\0\0\n\0\0\0StatusText\0\1s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0StatusErrno\0\1i\0\0\0\0\0\0\6\0\0\0Result\0\1s\0\0\0\7\0\0\0success\0\0\0\0\0\f\0\0\0ReloadResult"..., iov_len=819}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 995
15:36:54 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\08\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0004\3\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\f\0\0\0FreezerState\0\1s\0\7\0\0\0running\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\24\0\0\0StateChangeTimestamp\0\1t\0\0\0\0\0\0\0\0\0\0\0\0\0\35\0\0\0StateChangeT"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068

> Oh BTW I can also fully update the other system to all the latest f33 bits and see if the problem goes away there too.

No, please keep it for now if you may. Having a reliable reproducer is nice.

Comment 9 Hans de Goede 2020-08-24 15:28:52 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #8)
> No, please keep it for now if you may. Having a reliable reproducer is nice.

Ok, this system is one of my many x86 Cherry Trail tablets, so I can keep it in this state for a while without issues.

Comment 10 Zbigniew Jędrzejewski-Szmek 2020-08-26 13:45:15 UTC
I pasted this in a few bugs which seem to be duplicates:
After some upstream discussion we found one suspicious patch. In
https://copr.fedorainfracloud.org/coprs/zbyszek/systemd/build/1635482/
there's build systemd-246.3-2 that reverts the patch. It is otherwise the same as systemd-246.3-1 in rawhide/f33
right now. It would be great if someone who can reproduce the issue checked whether it still occurs
with systemd-246.3-2. (And if *not*, then whether it still occurs with systemd-246.3-1.)

If that doesn't work, we'll need to think of something different.

Comment 11 Hans de Goede 2020-08-26 15:09:13 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #10)
> I pasted this in a few bugs which seem to be duplicates:
> After some upstream discussion we found one suspicious patch. In
> https://copr.fedorainfracloud.org/coprs/zbyszek/systemd/build/1635482/

It looks like the build failed...

> there's build systemd-246.3-2 that reverts the patch. It is otherwise the
> same as systemd-246.3-1 in rawhide/f33
> right now. It would be great if someone who can reproduce the issue checked
> whether it still occurs
> with systemd-246.3-2. (And if *not*, then whether it still occurs with
> systemd-246.3-1.)

I can confirm that the problem still reproduces for me with systemd-246.3-1.fc33.

Comment 13 Hans de Goede 2020-08-26 15:54:07 UTC
Ok, testing the new build now. This takes a while because the rpm -Uvh takes a long time because all the %post scripts timeout,
I guess I could try using --noscripts, but so far I've done all up/downgrades by just waiting for the timeouts so for consistency's sake I'll keep doing that.

One interesting observation which I just made is that systemd is not using 100% CPU when booting into text-mode (adding " 3" on the kernel cmdline) and no-one has logged in yet. It jumps to 100% (and presumably showing the behavior from the strace) as soon as I try to login either on tty1 or over ssh.

Comment 14 Hans de Goede 2020-08-26 16:06:36 UTC
I can confirm that the copr build fixes the hang.

Comment 15 Fedora Update System 2020-09-02 11:07:15 UTC
FEDORA-2020-2255b438a2 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2020-2255b438a2

Comment 16 Hans de Goede 2020-09-02 12:23:54 UTC
I can confirm that the 246.4 build fixes this on the system were I could still reproduce this with 246.3-1 (and where it was also fixed with the 246.3-2 copr).

Comment 17 Fedora Update System 2020-09-02 16:20:20 UTC
FEDORA-2020-2255b438a2 has been pushed to the Fedora 33 testing repository.
In short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-2255b438a2`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-2255b438a2

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 18 Fedora Update System 2020-09-08 17:04:22 UTC
FEDORA-2020-2255b438a2 has been pushed to the Fedora 33 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.