Bug 1867930
Summary: | systemd 246 and logind hang talking to each other at 100% CPU causing the system to be stuck at boot | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Hans de Goede <hdegoede> | ||||||
Component: | systemd | Assignee: | systemd-maint | ||||||
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
Severity: | unspecified | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 33 | CC: | bugzilla, ego.cordatus, filbranden, flepied, lnykryn, msekleta, ssahani, s, systemd-maint, zbyszek, z | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | systemd-246.4-1.fc33 | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2020-09-08 17:04:22 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Hans de Goede
2020-08-11 08:31:27 UTC
Bug 1866367 is a possible duplicate, I've asked the reporter to try adding a debug-shell and run top to see if he is really seeing the same thing. This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle. Changing version to 33. Created attachment 1712237 [details]
Strace of systemd hanging (seems to be talking to logind endlessly)
I just realized I forgot to attach the promised strace, here it is.
I will also install all the latest Fedora 33 updates and retry to make sure this is still happening.
The repeating part seems to be: stat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 stat("/etc/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory) stat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0 stat("/run/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory) stat("/usr/local/lib/systemd/system", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory) stat("/usr/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0 stat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 stat("/etc/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory) stat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0 stat("/run/systemd/system.attached", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory) stat("/usr/local/lib/systemd/system", 0x7ffdba1a9eb0) = -1 ENOENT (No such file or directory) stat("/usr/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0 sendmsg(70, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\0006\264\25\0\236\0\0\0\1\1o\0=\0\0\0/org/fre"..., iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Uni"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068 For such straces, -t and -s 200 help: right now it's hard to say even what call is being made. Ok good news, I'm no longer seeing this on my X1C8 after installing all the latest updates. I was able to reproduce it on another machine with a slightly older F33 version, even with the latest systemd. I will leave that other systems as is for now in case you want to debug this further. And I will attach a strave -s 200 from that other machine, it seems the sendmsg is zram generator related ? I wonder what makes systemd-logind consume around 25% CPU during the hang then? Created attachment 1712363 [details]
strace -s 200 -p 1 (systemd hang strace)
Oh BTW I can also fully update the other system to all the latest f33 bits and see if the problem goes away there too. If it does I guess the root cause lay elsewhere and has been fixed; and then we can close this bug. Either way let me know how you want to proceed with this bug. PID1 is sending org.freedesktop.DBus.Properties for the same proprties in a tight loop,
the two messages are for org.freedesktop.systemd1.Service and org.freedesktop.systemd1.Unit.
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0003\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base=" \0\0\0org.freedesktop.systemd1.Service\0\0\0\0\374\1\0\0\0\0\0\0\7\0\0\0MainPID\0\1u\0\0\0\0\0\0\0\0\0\0\n\0\0\0ControlPID\0\1u\0\0\0\0\0\0\0\n\0\0\0StatusText\0\1s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0StatusErrno\0\1i\0\0\0\0\0\0\6\0\0\0Result\0\1s\0\0\0\7\0\0\0success\0\0\0\0\0\f\0\0\0ReloadResult"..., iov_len=819}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 995
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\0004\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0004\3\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\f\0\0\0FreezerState\0\1s\0\7\0\0\0running\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\24\0\0\0StateChangeTimestamp\0\1t\0\0\0\0\0\0\0\0\0\0\0\0\0\35\0\0\0StateChangeT"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0005\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base=" \0\0\0org.freedesktop.systemd1.Service\0\0\0\0\374\1\0\0\0\0\0\0\7\0\0\0MainPID\0\1u\0\0\0\0\0\0\0\0\0\0\n\0\0\0ControlPID\0\1u\0\0\0\0\0\0\0\n\0\0\0StatusText\0\1s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0StatusErrno\0\1i\0\0\0\0\0\0\6\0\0\0Result\0\1s\0\0\0\7\0\0\0success\0\0\0\0\0\f\0\0\0ReloadResult"..., iov_len=819}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 995
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\0006\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0004\3\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\f\0\0\0FreezerState\0\1s\0\7\0\0\0running\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\24\0\0\0StateChangeTimestamp\0\1t\0\0\0\0\0\0\0\0\0\0\0\0\0\35\0\0\0StateChangeT"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068
15:36:53 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0007\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base=" \0\0\0org.freedesktop.systemd1.Service\0\0\0\0\374\1\0\0\0\0\0\0\7\0\0\0MainPID\0\1u\0\0\0\0\0\0\0\0\0\0\n\0\0\0ControlPID\0\1u\0\0\0\0\0\0\0\n\0\0\0StatusText\0\1s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0StatusErrno\0\1i\0\0\0\0\0\0\6\0\0\0Result\0\1s\0\0\0\7\0\0\0success\0\0\0\0\0\f\0\0\0ReloadResult"..., iov_len=819}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 995
15:36:54 sendmsg(81, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1|\3\0\08\274\7\0\236\0\0\0\1\1o\0=\0\0\0/org/freedesktop/systemd1/unit/zram_2dsetup_40zram0_2eservice\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0", iov_len=176}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0004\3\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\f\0\0\0FreezerState\0\1s\0\7\0\0\0running\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\24\0\0\0StateChangeTimestamp\0\1t\0\0\0\0\0\0\0\0\0\0\0\0\0\35\0\0\0StateChangeT"..., iov_len=892}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 1068
> Oh BTW I can also fully update the other system to all the latest f33 bits and see if the problem goes away there too.
No, please keep it for now if you may. Having a reliable reproducer is nice.
(In reply to Zbigniew Jędrzejewski-Szmek from comment #8) > No, please keep it for now if you may. Having a reliable reproducer is nice. Ok, this system is one of my many x86 Cherry Trail tablets, so I can keep it in this state for a while without issues. I pasted this in a few bugs which seem to be duplicates: After some upstream discussion we found one suspicious patch. In https://copr.fedorainfracloud.org/coprs/zbyszek/systemd/build/1635482/ there's build systemd-246.3-2 that reverts the patch. It is otherwise the same as systemd-246.3-1 in rawhide/f33 right now. It would be great if someone who can reproduce the issue checked whether it still occurs with systemd-246.3-2. (And if *not*, then whether it still occurs with systemd-246.3-1.) If that doesn't work, we'll need to think of something different. (In reply to Zbigniew Jędrzejewski-Szmek from comment #10) > I pasted this in a few bugs which seem to be duplicates: > After some upstream discussion we found one suspicious patch. In > https://copr.fedorainfracloud.org/coprs/zbyszek/systemd/build/1635482/ It looks like the build failed... > there's build systemd-246.3-2 that reverts the patch. It is otherwise the > same as systemd-246.3-1 in rawhide/f33 > right now. It would be great if someone who can reproduce the issue checked > whether it still occurs > with systemd-246.3-2. (And if *not*, then whether it still occurs with > systemd-246.3-1.) I can confirm that the problem still reproduces for me with systemd-246.3-1.fc33. Sorry for that. https://copr.fedorainfracloud.org/coprs/zbyszek/systemd/build/1635567/ is the replacement for https://copr.fedorainfracloud.org/coprs/zbyszek/systemd/build/1635482/. Ok, testing the new build now. This takes a while because the rpm -Uvh takes a long time because all the %post scripts timeout, I guess I could try using --noscripts, but so far I've done all up/downgrades by just waiting for the timeouts so for consistency's sake I'll keep doing that. One interesting observation which I just made is that systemd is not using 100% CPU when booting into text-mode (adding " 3" on the kernel cmdline) and no-one has logged in yet. It jumps to 100% (and presumably showing the behavior from the strace) as soon as I try to login either on tty1 or over ssh. I can confirm that the copr build fixes the hang. FEDORA-2020-2255b438a2 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2020-2255b438a2 I can confirm that the 246.4 build fixes this on the system were I could still reproduce this with 246.3-1 (and where it was also fixed with the 246.3-2 copr). FEDORA-2020-2255b438a2 has been pushed to the Fedora 33 testing repository. In short time you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-2255b438a2` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-2255b438a2 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates. FEDORA-2020-2255b438a2 has been pushed to the Fedora 33 stable repository. If problem still persists, please make note of it in this bug report. |