Bug 1825232

Summary:	System drops into emergency mode for no obvious reason after upgrading to latest systemd [rhel-7.9.z]
Product:	Red Hat Enterprise Linux 7	Reporter:	Renaud Métrich <rmetrich>
Component:	systemd	Assignee:	Michal Sekletar <msekleta>
Status:	CLOSED ERRATA	QA Contact:	Frantisek Sumsal <fsumsal>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.8	CC:	amarecek, asamir, bcao, fkrska, fsumsal, jreznik, kwalker, mhatanak, mschena, msekleta, myamazak, ovasik, pdwyer, qguo, rblakley, systemd-maint-list
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1889314 1889315		Environment:
Last Closed:	2020-11-10 12:58:04 UTC	Type:	Bug
Bug Blocks:	1889314, 1889315

Description Renaud Métrich 2020-04-17 12:49:30 UTC

Description of problem:

Several customers (at least 3) now face a boot issue after updating their systems to RHEL 7.8's systemd.
They all run on VMware, but that is probably unrelated, since I can reproduce the issue on KVM myself with some hacks (see the reproducer below).

In a nutshell, the boot proceeds and then enters emergency.target because initrd-switch-root.service fails:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
● initrd-switch-root.service - Switch Root
   Loaded: loaded (/usr/lib/systemd/system/initrd-switch-root.service; static; vendor preset: disabled)
   Active: failed (Result: signal) since Fri 2020-04-17 14:36:17 CEST; 5min ago
  Process: 502 ExecStart=/usr/bin/systemctl --no-block --force switch-root /sysroot (code=killed, signal=TERM)
 Main PID: 502 (code=killed, signal=TERM)

Apr 17 14:36:17 vm-up76 systemd[1]: Starting Switch Root...
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Hitting this seems to require both of the following:
- systemd in the initramfs is the "old" systemd from before the system update (e.g. "systemd-219-62.el7_6.7.x86_64")
- no serial console configured


In real customer scenarios, there is indeed an old systemd because the customers updated in 2 phases:
- kernel + microcode
- rest of the system

Due to this, *no* initramfs is rebuilt after updating systemd.

My reproducer uses the following:
- update everything except systemd --> builds a new initramfs with old systemd inside
- reboot then update systemd --> initramfs not rebuilt
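
A rough way to spot a host in this state is to compare when the systemd package was installed with when the current initramfs image was last written. The following is only a sketch of that heuristic: the function name initramfs_is_stale is hypothetical, and the image path in the usage comment assumes the default RHEL 7 location.

```shell
#!/bin/bash
# Hypothetical helper (not part of systemd or dracut).
# $1 = systemd install time, $2 = initramfs modification time (both epoch seconds).
# Prints "stale" when the image was written before systemd was installed/updated.
initramfs_is_stale() {
    local systemd_installed="$1" image_mtime="$2"
    if [ "$image_mtime" -lt "$systemd_installed" ]; then
        echo "stale"
    else
        echo "current"
    fi
}

# Usage on a live system (assumes the default image path):
#   initramfs_is_stale \
#       "$(rpm -q --qf '%{INSTALLTIME}\n' systemd)" \
#       "$(stat -c %Y "/boot/initramfs-$(uname -r).img")"
```

A "stale" result only says the image predates the systemd update; it does not prove the host will hit the failure.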


Version-Release number of selected component (if applicable):

systemd-219-62.el7_6.7.x86_64 -> systemd-219-73.el7_8.XX


How reproducible:

Always on customer sites
Using a hack in my lab


Steps to Reproduce:
1. Install a system with 2 CPUs with RHEL 7.6 DVD
2. Update the system to RHEL 7.6 Latest and reboot
3. Update the system to RHEL 7.8 latest *except* systemd and reboot
4. Update systemd to latest

Actual results:

Booting with the initramfs which contains old systemd enters Emergency mode, 100% reproducible

Expected results:


Additional info:

Rebuilding the initramfs with latest systemd fixes the issue for some reason.
We need to understand why ...

Indeed, if updating systemd requires an initramfs rebuild, then the systemd post-install scriptlet should be updated to do so.
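
As an illustration of the rebuild mentioned above, the sketch below only prints the dracut command for a given kernel version so it can be reviewed before being run as root. The function name rebuild_cmd is hypothetical and the image path assumes the default RHEL 7 naming.

```shell
#!/bin/bash
# Hypothetical helper: prints (does not run) the dracut command that
# force-regenerates the initramfs image for the kernel version given in $1.
rebuild_cmd() {
    local kver="$1"
    printf 'dracut -f /boot/initramfs-%s.img %s\n' "$kver" "$kver"
}

# Usage (as root), for the running kernel:
#   rebuild_cmd "$(uname -r)"          # review the command
#   eval "$(rebuild_cmd "$(uname -r)")" # then run it
```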

Comment 2 Renaud Métrich 2020-04-17 12:57:37 UTC

In order to reproduce easily, I perform the following hack:

1. Update the system to RHEL 7.6 Latest and reboot

2. Edit /usr/lib/systemd/system/initrd-cleanup.service to delay its end

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
ExecStart=/bin/bash -c '/usr/bin/systemctl --no-block isolate initrd-switch-root.target && sleep 5'
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

3. Update the system to RHEL 7.8 latest *except* systemd and reboot

4. Update systemd to latest


Doing so triggers the issue.
I then get the following journal (with "debug"):

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Trying to enqueue job initrd-switch-root.target/start/isolate
Installed new job systemd-udevd-control.socket/stop as 80
Installed new job timers.target/stop as 90
Installed new job initrd.target/stop as 85
Installed new job swap.target/stop as 81
Installed new job paths.target/stop as 100
Installed new job remote-fs.target/stop as 96
Installed new job systemd-udev-trigger.service/stop as 91
Installed new job local-fs.target/stop as 95
Installed new job sockets.target/stop as 99
Installed new job systemd-tmpfiles-setup-dev.service/stop as 102

HERE: job canceled

Job initrd-cleanup.service/start finished, result=canceled
Sent message type=signal sender=n/a destination=n/a object=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=1 reply_cookie=0 error=n/a

Installed new job initrd-cleanup.service/stop as 94
Installed new job dracut-cmdline.service/stop as 82
Installed new job systemd-udevd-kernel.socket/stop as 78 
Installed new job dracut-pre-udev.service/stop as 92 
Installed new job dracut-initqueue.service/stop as 88
Installed new job remote-fs-pre.target/stop as 101
Installed new job initrd-switch-root.service/start as 55 
Installed new job plymouth-switch-root.service/start as 58
Installed new job initrd-switch-root.target/start as 54
Installed new job slices.target/stop as 89
Installed new job basic.target/stop as 83
Installed new job initrd-udevadm-cleanup-db.service/start as 77
Installed new job sysinit.target/stop as 97
Installed new job dracut-pre-pivot.service/stop as 86
Installed new job systemd-sysctl.service/stop as 87
Installed new job systemd-udevd.service/stop as 79
Installed new job kmod-static-nodes.service/stop as 93
Enqueued job initrd-switch-root.target/start as 54

[...]

initrd-cleanup.service changed start -> stop-sigterm
Received SIGCHLD from PID 492 (bash).
Child 492 (bash) died (code=killed, status=15/TERM)
Child 492 belongs to initrd-cleanup.service
initrd-cleanup.service: main process exited, code=killed, status=15/TERM
initrd-cleanup.service changed stop-sigterm -> dead
Job initrd-cleanup.service/stop finished, result=done
Stopped Cleaning Up and Shutting Down Daemons.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

The weird thing is that emergency.target is entered not because of initrd-cleanup.service, but because of initrd-switch-root.service, which doesn't print any suspicious log!
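
To pick the canceled jobs out of a long debug journal like the one above, a small filter can help. This is a sketch only: the function name canceled_jobs is hypothetical, and it assumes the "Job <unit>/<type> finished, result=canceled" line format shown in the excerpt.

```shell
#!/bin/bash
# Hypothetical helper: reads debug log lines on stdin and prints the
# unit/job-type of every job that finished with result=canceled.
canceled_jobs() {
    sed -n 's/^.*Job \([^ ]*\) finished, result=canceled.*$/\1/p'
}

# Usage, e.g. on a saved copy of the debug journal:
#   canceled_jobs < journal.txt
```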

Comment 4 Renaud Métrich 2020-04-17 13:21:10 UTC

This may be due to BZ #1754053 but I would like to be really sure.

Comment 7 Renaud Métrich 2020-04-20 08:57:57 UTC

A customer reported seeing this while upgrading systemd from the latest 7.7 to 7.8.

Comment 19 Michal Sekletar 2020-09-21 08:33:48 UTC

Folks from Alibaba are also running into the same problem, and they proposed a solution upstream.

https://github.com/systemd-rhel/rhel-7/pull/117

Even though the proposed fix is a hack, we have decided to go ahead and merge it (after the issues pointed out in code review get fixed) due to the number of cases attached to the BZ.

Comment 21 Renaud Métrich 2020-09-21 08:42:46 UTC

Making the BZ public.

Comment 26 Lukáš Nykrýn 2020-09-30 08:28:47 UTC

Fix merged to the GitHub master branch -> https://github.com/systemd-rhel/rhel-7/pull/117

Comment 40 errata-xmlrpc 2020-11-10 12:58:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (systemd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5007