Bug 1650951

Summary: After upgrading Fedora Server from 28 to 29, systemd becomes unresponsive after activating multi-user.target
Product: [Fedora] Fedora
Reporter: Mike Cronce <mike>
Component: systemd
Assignee: systemd-maint
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 29
CC: lnykryn, mike, msekleta, s, systemd-maint, zbyszek
Hardware: x86_64   
OS: Linux   
Last Closed: 2019-11-27 22:38:16 UTC
Type: Bug
Attachments (no flags set on any):
  virt01: rpm -qa
  virt03: rpm -qa
  rpm -qa diff (left is virt01, right is virt03)
  journald output
  virt02: rpm -qa

Description Mike Cronce 2018-11-17 23:45:25 UTC
Description of problem:
I have, thus far, upgraded two of my three systems from Fedora Server 28 to Fedora Server 29 and found that after booting them up on 29, systemd is wholly unresponsive to systemctl commands - they all print "Failed to [insert action here]: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)" and exit.

The first system (virt03) just...stopped doing it after several reboots when I was trying to troubleshoot it.  I have no explanation whatsoever as to why it stopped occurring.

The second system (virt01) is still doing it.  I've found that if I boot to rescue.target I can activate every service that's wanted by multi-user.target and have no issue, but when I activate multi-user.target itself, the problem occurs.
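
The unit-by-unit approach described above (booting to rescue.target and starting each unit wanted by multi-user.target individually) could be scripted roughly as follows. This is a minimal sketch, not the exact commands used in this report. The helper names `split_wants` and `start_wanted_units` are hypothetical, and the sketch assumes that `systemctl show -p Wants --value` prints a space-separated unit list. It covers only direct `Wants=` dependencies, not `Requires=` or transitive ones.

```sh
#!/bin/sh
# Hypothetical helper: turn a space-separated unit list into one unit per line.
split_wants() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | sed '/^$/d'
}

# Start each unit wanted by multi-user.target one at a time, so that a hang
# can be attributed to a specific unit. Intended to be run as root from
# rescue.target. Assumption: only direct Wants= dependencies are of interest.
start_wanted_units() {
    wants=$(systemctl show -p Wants --value multi-user.target) || return 1
    split_wants "$wants" | while read -r unit; do
        echo "starting $unit"
        systemctl start "$unit" || echo "failed: $unit"
    done
}
```

Calling `start_wanted_units` after sourcing the file starts the units in the order systemd reports them. If every unit starts cleanly but `systemctl isolate multi-user.target` still hangs, that matches the behaviour reported here.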

Both systems have similar sets of software installed and are being used for the same purpose - members of Ceph and Kubernetes clusters.  I removed a couple of packages from virt01 last night while trying to troubleshoot it.  I'll attach the output of `rpm -qa | sort` from both, along with a diff of the two.

A keen reader may notice that I skipped a box named virt02 :)  That is correct.  virt02 is only a Kubernetes node and Ceph mon/mds; it has no OSDs, while the other two do.  I'm going to try to upgrade it tonight and see what it does, and will report back.

While trying to find similar bug reports, I came across https://bugzilla.redhat.com/show_bug.cgi?id=1548417 - it MIGHT be the same, but the logging I see from `journalctl -f` (attached) from the time when I activated multi-user.target on virt01 doesn't match the reporter's all that closely.

Any other details I can provide, please ask.  I'm at a total loss as to how to troubleshoot this one, so I'm sure what I have isn't as helpful as it could potentially be.

Version-Release number of selected component (if applicable):
systemd 239
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid

How reproducible:
I'm not sure yet.  It's happened to me on 100% of boxes that I've upgraded from Fedora 28 to Fedora 29, but my sample size is 2.  Not super scientific.


Steps to Reproduce:
1.  Have a Fedora Server 28 system running kubelet (from the Fedora repo's kubernetes-node package) and Ceph (from Ceph's repo, located at http://download.ceph.com/rpm-mimic/el7)
2.  Upgrade to Fedora Server 29 with `dnf upgrade --refresh && dnf system-upgrade download --releasever=29 && dnf system-upgrade reboot`
3.  On boot, observe this issue

Actual results:
systemd is unresponsive in multi-user.target


Expected results:
systemd is responsive in multi-user.target


Additional info:

Comment 1 Mike Cronce 2018-11-17 23:46:06 UTC
Created attachment 1506844
virt01: rpm -qa

Comment 2 Mike Cronce 2018-11-17 23:46:34 UTC
Created attachment 1506845
virt03: rpm -qa

Comment 3 Mike Cronce 2018-11-17 23:50:35 UTC
Created attachment 1506846
rpm -qa diff (left is virt01, right is virt03)

Comment 4 Mike Cronce 2018-11-17 23:52:19 UTC
Created attachment 1506847
journald output

Comment 5 Mike Cronce 2018-11-19 15:51:22 UTC
I was a bit delayed, but got the last box (virt02) upgraded to Fedora 29 last night.  It did not exhibit the issue.  I'll attach the `rpm -qa` output from that one as well.

Comment 6 Mike Cronce 2018-11-19 15:52:25 UTC
Created attachment 1507302
virt02: rpm -qa

Comment 7 Zbigniew Jędrzejewski-Szmek 2018-11-21 13:37:45 UTC
#1548417 was a memory corruption in some udev / device hashmap code called when parsing the mount table. In your logs, I see bus_process_object, so it's responding to a dbus message. Maybe https://github.com/systemd/systemd/issues/10716 is related?

Comment 8 Mike Cronce 2018-11-21 15:07:11 UTC
It could be related.  Taking a look at the logs he posted (https://pastebin.com/b9wZt0s6), though, his systemd is catching a SIGSEGV while mine is catching a SIGABRT.  We all know how these things can end up intertwined, though ;)

It is worth noting - Lennart mentions that the issue on https://github.com/systemd/systemd/issues/10716 is probably fallout from https://github.com/systemd/systemd/commit/a7a7163df7fc8a9f794f6803b2f6c9c9b0745a1f, which is intended to fix a race condition between daemon-reload and other commands.  In both that issue and my issue, the crash occurs during a period with frequent daemon-reloads.  My logs show six of them in the two seconds prior to the crash.  It's hard to ignore all the similarities.
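
For anyone comparing their own logs against this pattern, the two observations above (the number of daemon-reloads shortly before the crash, and which signal PID 1 caught) can be pulled out of journal text with something like the sketch below. The function names are hypothetical. The sketch assumes that PID 1 logs a line containing `systemd[1]: Reloading.` for each daemon-reload and a line of the form `systemd[1]: Caught <SIG>, ...` when it crashes; the exact wording may differ between systemd versions.

```sh
#!/bin/sh
# Count daemon-reload events in journal text read from stdin.
# Assumption: PID 1 logs "systemd[1]: Reloading." once per reload.
count_reloads() {
    grep -c 'systemd\[1\]: Reloading\.'
}

# Print the name of each signal PID 1 reports catching (e.g. ABRT or SEGV).
# Assumption: the crash handler logs "systemd[1]: Caught <SIG>, ...".
caught_signals() {
    sed -n 's/.*systemd\[1\]: Caught <\([A-Z]*\)>.*/\1/p'
}
```

A possible use is `journalctl -b --since "$START" --until "$END" | count_reloads`, where `$START` and `$END` are placeholders for a window of a few seconds before the crash.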

Is there any way to roll back to systemd 238 on Fedora 29 while we await a fix?

Comment 9 Mike Cronce 2018-11-24 00:14:34 UTC
Update - it looks like `dnf install --downgrade systemd-239.3` makes it go away until a fix is released :)

Comment 10 Mike Cronce 2018-11-24 00:40:23 UTC
Sorry, I lost the terminal and got the command wrong from memory.  The correct command is `dnf install systemd-239-3.fc29`.
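
To confirm that the workaround build is the one actually installed, a check along these lines may help. It is only a sketch: the function name is hypothetical, and the build string is the one named in this comment, which is known to work only on the reporter's system.

```sh
#!/bin/sh
# Hypothetical check: does an "rpm -q systemd" style string name the
# workaround build from this report (systemd-239-3.fc29)?
is_workaround_build() {
    case "$1" in
        systemd-239-3.fc29*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_workaround_build "$(rpm -q systemd)" && echo "workaround build installed"`. A later `dnf upgrade` would presumably pull in the newer build again unless systemd is excluded from updates.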

Comment 11 Ben Cotton 2019-10-31 19:14:21 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 reaches end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Ben Cotton 2019-11-27 22:38:16 UTC
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.