Bug 2172912

Summary:	Broken /dev/log socket created during boot in recovery, causing grub2-mkconfig to hang forever
Product:	Red Hat Enterprise Linux 9	Reporter:	Renaud Métrich <rmetrich>
Component:	rear	Assignee:	Pavel Cahyna <pcahyna>
Status:	CLOSED ERRATA	QA Contact:	Jakub Haruda <jharuda>
Severity:	high	Docs Contact:	Šárka Jana <sjanderk>
Priority:	high
Version:	9.1	CC:	jharuda, pcahyna
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	rear-2.6-18.el9	Doc Type:	Bug Fix
Doc Text:	.The `rsyslog` logging service now starts at boot of the rescue system. Previously, the `rsyslog` service for message logging did not automatically start in the rescue system. The `/dev/log` socket kept receiving messages during the recovery process with no service listening at this socket. Consequently, the `/dev/log` socket filled up with messages and caused the recovery process to get stuck. For example, the `grub2-mkconfig` command, which regenerates the GRUB configuration, produces a large number of log messages, depending on the number of mounted file systems. If you used ReaR to recover systems with many mounted file systems, numerous log messages would fill the `/dev/log` socket, and the recovery process froze. With this fix, the `systemd` units in the rescue system now include the sockets target in the boot procedure to start the logging socket at boot. As a result, the `rsyslog` service starts in the rescue environment when required, and the processes that need to log messages during recovery are no longer stuck. The recovery process completes successfully and you can find the log messages in the `/var/log/messages` file in the rescue RAM disk.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-11-07 08:37:21 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Renaud Métrich 2023-02-23 13:48:43 UTC

Description of problem:

On RHEL 9, /dev/log is supposed to be a symlink to /run/systemd/journal/dev-log.
In the system booted from the ReaR ISO this is not the case: /dev/log is a plain socket with nothing listening on it.
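
To tell the two cases apart on a running system, it is enough to look at the inode type. The following is a minimal sketch, not part of ReaR; the function name dev_log_kind is hypothetical, and it relies only on the shell's built-in file tests and readlink.

```shell
#!/bin/sh
# dev_log_kind is a hypothetical helper, not part of ReaR.
# It prints how the given path (default: /dev/log) is implemented.
dev_log_kind() {
    path=${1:-/dev/log}
    # Test for a symlink first: the -S and -e tests follow symlinks.
    if [ -L "$path" ]; then
        echo "symlink -> $(readlink "$path")"
    elif [ -S "$path" ]; then
        echo "socket"
    elif [ -e "$path" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

Given the description above, running dev_log_kind without arguments should report a symlink on an installed RHEL 9 system and a plain socket in the affected rescue system.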

This is harmless until programs log to /dev/log: the socket then fills up, and once it is full, any program writing to it hangs.

Any program can be affected, but the most likely victims are grub2-mkconfig and its children (including os-prober), which run in the chroot after recovery:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
++ chroot /mnt/local /bin/bash --login -c 'grub2-mkconfig -o /boot/grub2/grub.cfg'
Generating grub configuration file ...

--> HANG
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

In this scenario, the hang happens when there are many mount points: os-prober scans all of them and sends many debug messages, such as "debug: /dev/mapper/vg-lvname is not an HFS+ partition: exiting", through /dev/log.

The root cause of the broken /dev/log socket is that ReaR uses its own templates for some systemd units, e.g. /usr/share/rear/skel/default/usr/lib/systemd/system/syslog.socket

This template is out of sync with the systemd units shipped in RHEL 9, which causes the issue.

The workaround consists of two steps, to be performed before recovering:

1. Tell ReaR to copy the standard systemd units to the ReaR ISO:

   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
   COPY_AS_IS+=( /usr/lib/systemd/system/systemd-journald-dev-log.socket /usr/lib/systemd/system/systemd-journald.socket /usr/lib/systemd/system/systemd-journald.service /usr/lib/systemd/system/sockets.target.wants/systemd-journald-dev-log.socket )
   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

2. Delete /usr/share/rear/skel/default/usr/lib/systemd/system/syslog.socket
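
The two steps can be sketched as a small script. This is only an illustration under stated assumptions: apply_workaround is a hypothetical helper, and /etc/rear/local.conf as the place for the COPY_AS_IS setting is an assumption that is not stated in this report. Both paths are parameters, so the sketch can be tried on scratch files first.

```shell
#!/bin/sh
# apply_workaround is a hypothetical helper, not part of ReaR.
# Assumption: ReaR reads site settings from /etc/rear/local.conf.
apply_workaround() {
    conf=${1:-/etc/rear/local.conf}
    template=${2:-/usr/share/rear/skel/default/usr/lib/systemd/system/syslog.socket}
    units=/usr/lib/systemd/system

    # Step 1: ask ReaR to copy the distribution's journald units to the ISO.
    # Skipped if the setting is already present, so re-running is harmless.
    if ! grep -qF 'systemd-journald-dev-log.socket' "$conf" 2>/dev/null; then
        cat >> "$conf" <<EOF
COPY_AS_IS+=( $units/systemd-journald-dev-log.socket $units/systemd-journald.socket $units/systemd-journald.service $units/sockets.target.wants/systemd-journald-dev-log.socket )
EOF
    fi

    # Step 2: delete the out-of-date template shipped with ReaR.
    rm -f "$template"
}
```

The rescue image has to be recreated after these changes for them to take effect.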

The proper solution is probably to remove all the templates that stand in for systemd units and to copy the real systemd units to the ISO instead.

Version-Release number of selected component (if applicable):

rear-2.6-15

How reproducible:

Always

Steps to Reproduce:

1. Create a VM with many filesystems

   /dev/mapper/rhel-root   /                       xfs     defaults        0 0
   UUID=01d8a9ea-ee10-4ec2-b839-bac3c7e36db6 /boot                   xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint1 /datamntpoint1          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint10 /datamntpoint10         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint11 /datamntpoint11         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint12 /datamntpoint12         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint13 /datamntpoint13         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint14 /datamntpoint14         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint15 /datamntpoint15         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint16 /datamntpoint16         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint17 /datamntpoint17         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint18 /datamntpoint18         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint19 /datamntpoint19         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint2 /datamntpoint2          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint20 /datamntpoint20         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint21 /datamntpoint21         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint22 /datamntpoint22         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint23 /datamntpoint23         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint24 /datamntpoint24         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint25 /datamntpoint25         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint26 /datamntpoint26         xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint3 /datamntpoint3          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint4 /datamntpoint4          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint5 /datamntpoint5          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint6 /datamntpoint6          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint7 /datamntpoint7          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint8 /datamntpoint8          xfs     defaults        0 0
   /dev/mapper/rhel-datamntpoint9 /datamntpoint9          xfs     defaults        0 0
   /dev/mapper/rhel-swap   none                    swap    defaults        0 0

2. Create a ReaR backup
3. Restore the backup

Actual results:

Hang while executing grub2-mkconfig

Expected results:

No hang; /dev/log is a symlink rather than a plain socket.

Comment 1 Pavel Cahyna 2023-02-23 14:05:00 UTC

(In reply to Renaud Métrich from comment #0)
> Description of problem:
> 
> With RHEL9, the /dev/log inode is supposed to be a symlink to
> /run/systemd/journal/dev-log.

Thank you for the analysis. Is it a new problem in RHEL 9, or has it existed in RHEL 8 as well?
I see a similar situation in RHEL 8:

# ls -l /dev/log
lrwxrwxrwx. 1 root root 28 Feb 22 04:16 /dev/log -> /run/systemd/journal/dev-log

Comment 2 Renaud Métrich 2023-02-23 14:15:35 UTC

I don't know whether this affects RHEL 8.

What I do know is that the correct inode is:

# ls -l /dev/log
lrwxrwxrwx. 1 root root 28 Feb 22 04:16 /dev/log -> /run/systemd/journal/dev-log

Comment 3 Pavel Cahyna 2023-02-23 14:20:46 UTC

I am curious, though: how does having the correct systemd unit outside the chroot help a program running inside the chroot? Is it because /run is shared, so that connecting to /run/systemd/journal/dev-log in the chroot actually connects to the daemon that runs outside?

Comment 4 Renaud Métrich 2023-02-23 14:29:29 UTC

It's because /dev/log outside the chroot is broken, which makes /dev/log inside the chroot broken as well, since it is a bind mount.
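
One way to confirm that both paths are the same object is to compare device and inode numbers, which are identical on both sides of a bind mount. A minimal sketch, assuming GNU coreutils stat (the -c option) is available; same_inode is a hypothetical helper name.

```shell
#!/bin/sh
# same_inode is a hypothetical helper, not part of ReaR.
# It succeeds when both paths are the same inode on the same device.
# Symlinks are not followed, so the /dev/log entries themselves are compared.
# Assumption: GNU coreutils stat (the -c option) is available.
same_inode() {
    a=$(stat -c '%d:%i' "$1" 2>/dev/null) || return 2
    b=$(stat -c '%d:%i' "$2" 2>/dev/null) || return 2
    [ "$a" = "$b" ]
}
```

For example, with the chroot at /mnt/local as in the description: same_inode /dev/log /mnt/local/dev/log && echo "shared". A return status of 2 means one of the paths could not be examined.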

Comment 6 Pavel Cahyna 2023-06-16 11:40:00 UTC

Hi Renaud, thank you again for the analysis. I have looked into the details of systemd unit startup in the rescue system. IMO, your proposed workaround (copying all the logging-related systemd units) is not well suited for inclusion upstream: ReaR needs to support many distros, and these details vary among them. At the very least, it would require a lot of difficult testing on all the supported distros. Therefore, I propose a less invasive solution.

I found that there are multiple problems with the current systemd units. Nothing wants basic.target, and therefore the services and sockets that it contains never get started (this affects the /dev/log socket and the rsyslogd service that listens on it). Moreover, if I fix this, the socket starts very early, and for some reason this does not work. If I order it after basic system initialization, everything starts working: the socket gets started, and when something attempts to log to it, rsyslogd is spawned and sends the messages to /var/log/messages. (/dev/log is not a symlink to /run/systemd/journal/dev-log, but I don't think that is a big problem.)

By the way, I can also reproduce the problem using a simple for loop:
for i in $(seq 1 1000); do logger "foo$i"; done
This hangs when the problem occurs, because the socket gets filled.
With my fixes to the systemd units, it is fine: the output goes to /var/log/messages. I can also see the output from grub2-mkconfig (actually, from os-prober) there. So the problem you are seeing should be fixed. The changes are on my branch: https://github.com/pcahyna/rear/tree/rsyslog . What do you think?
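
When testing for the hang, it can help to run the reproducer under a time limit, so that a full socket shows up as a report instead of a stuck shell. A minimal sketch, assuming timeout(1) from GNU coreutils is available; probe_logging is a hypothetical helper name.

```shell
#!/bin/sh
# probe_logging is a hypothetical helper, not part of ReaR.
# It runs a command under a time limit (in seconds) and classifies the result.
# Assumption: timeout(1) from GNU coreutils is available.
probe_logging() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo "ok" ;;
        124) echo "hang" ;;   # timeout(1) exits with 124 when the limit is hit
        *)   echo "error ($rc)" ;;
    esac
}
```

For example: probe_logging 10 sh -c 'for i in $(seq 1 1000); do logger "foo$i"; done'. Using logger as the command that writes to /dev/log is an assumption here. A result of "hang" only means the loop did not finish within the limit, so the limit should be generous.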

Regarding RHEL 8, I see that the logs go into the systemd journal by default, so it seems that the problem does not occur there and so I won't touch it.

Comment 18 errata-xmlrpc 2023-11-07 08:37:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (rear bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:6571