Bug 2039091

Summary: Boot fails when /var is an LV
Product: Red Hat Enterprise Linux 9
Reporter: Gordon Messmer <gordon.messmer>
Component: lvm2
Assignee: LVM and device-mapper development team <lvm-team>
lvm2 sub component: Activating existing Logical Volumes
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
CC: agk, bstinson, heinzm, jbrassow, jstodola, jwboyer, msnitzer, prajnoha, teigland, zkabelac
Version: CentOS Stream   
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-01-12 14:52:33 UTC
Type: Bug
Attachments:
    system.devices and lsblk -o +UUID
    journal from failed boot, with udev debug logs

Description Gordon Messmer 2022-01-10 22:42:27 UTC
Description of problem:

I've installed a CentOS Stream 9 system from a kickstart file that specified (among other things) several logical volumes:

logvol / --fstype="ext4" --size=10240 --name=lv_root --vgname=VolGroup
logvol /var --fstype="ext4" --size=4096 --name=lv_var --vgname=VolGroup
logvol swap --fstype="swap" --size=2048 --name=lv_swap --vgname=VolGroup

When that system rebooted, the kernel args did specify "rd.lvm.lv=VolGroup/lv_root rd.lvm.lv=VolGroup/lv_swap", but did not specify "rd.lvm.lv=VolGroup/lv_var", so boot failed because the filesystem required for /var couldn't be found.

I'd like to suggest that Anaconda be simplified to specify "rd.lvm.vg=VolGroup" rather than enumerating individual LVs.  As far as I know, the LVs inside VolGroup can't be activated unless that VG is complete, and if the VG is complete, I can see no good reason for Anaconda to add individual LVs to the kernel command line rather than the whole VG.
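
For illustration, the two forms side by side (LV/VG names taken from the kickstart above; the rest of the command line is omitted):

    # what Anaconda writes today:
    rd.lvm.lv=VolGroup/lv_root rd.lvm.lv=VolGroup/lv_swap
    # what I'm suggesting instead:
    rd.lvm.vg=VolGroup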


Version-Release number of selected component (if applicable):
34.25.0.23-1.el9

How reproducible:

I assume always, but I have only done the one installation.  I also don't know yet whether the same problem occurs when the filesystems are created manually.

Steps to Reproduce:
1. Create a kickstart file specifying multiple LVs.
2. Install a new system using that kickstart config.
3. Reboot.

Actual results:

Boot fails, unable to mount /var because the LV is missing.

Expected results:

First boot after a clean install should succeed.

Additional info:
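
When this happens, the failed boot can usually be recovered from the emergency shell by activating the missing LV by hand. A minimal sketch, assuming the VG/LV names from the kickstart above (not something I've scripted):

    lvchange -ay VolGroup/lv_var   # activate the LV that dracut skipped
    mount /var                     # mount it per /etc/fstab
    systemctl default              # resume the normal boot target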

Comment 1 Jan Stodola 2022-01-11 08:54:20 UTC
Could you please provide the content of /etc/lvm/devices/system.devices on the installed system? Does the system boot successfully if you (re)move the file?
This could be the same problem discussed in bug 2037905.

Comment 2 Gordon Messmer 2022-01-11 18:17:47 UTC
Created attachment 1850151 [details]
system.devices and lsblk -o +UUID

I'm attaching a file that contains both system.devices and the output of "lsblk -o +UUID" from a new VM on which I've replicated the problem (with manual partitioning instead of a kickstart file).

The file looks correct.  It's identical (other than the time) to a new file generated by "vgimportdevices cs".  Removing the file does not allow the system to boot, even if I regenerate the initrd with "dracut -f".

If I understand correctly, system.devices identifies the devices that are intended to be used as PVs.  If system.devices were incorrect, then I'd expect *no* VGs (and by extension, no LVs) to be activated.  But that's not the problem.  The problem is that dracut is only activating LVs named in the kernel args with rd.lvm.lv=<LV>, and that list is incomplete.
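
For reference, the checks behind the statements above, spelled out as commands (the exact invocations are mine; "cs" is this VM's default VG name):

    tr ' ' '\n' </proc/cmdline | grep '^rd\.lvm'   # which LVs the initrd was told to activate
    grep /var /etc/fstab                           # /var is expected to come from an LV
    lvs -o lv_name,lv_active cs                    # "var" shows as inactive after the failed boot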

Comment 3 Gordon Messmer 2022-01-11 18:47:13 UTC
I'm unable to see bug 2024100, so I can't verify that this is a duplicate.

I've updated the VM with packages from https://kojihub.stream.centos.org/koji/buildinfo?buildID=16070 , rebuilt the initrd, and verified that the problem still exists.  I don't see any changes from -2 to -3 that would address the problem.  There doesn't seem to be anything wrong with lvm2; the system is doing exactly what the documentation says that it will.  The man page for dracut.cmdline says that if rd.lvm.lv= is provided, then only those LVs will be activated.  That is what is causing the boot failure.  Some LVs exist, and they're required for mounts defined in /etc/fstab, but they aren't being activated because Anaconda has told dracut not to activate them.

I believe the practice of specifying rd.lvm.lv= is itself a bug.  Even if Anaconda is fixed so that it specifies all of the LVs needed for /etc/fstab, anyone who creates new LVs in the default volume group after the initial installation is going to struggle to figure out why those LVs are missing when the system is rebooted.  Specifying rd.lvm.vg= instead should be a more reliable option.
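
On an already-installed system, switching from per-LV to per-VG activation can be done with grubby. A sketch, assuming the default CS9 VG name "cs" (adjust the names to match the local configuration):

    # replace the per-LV arguments with a single per-VG argument on all boot entries:
    grubby --update-kernel=ALL \
           --remove-args="rd.lvm.lv=cs/root rd.lvm.lv=cs/swap" \
           --args="rd.lvm.vg=cs"
    # if the initramfs embeds a host-only command line, rebuild it so the old
    # arguments don't linger there:
    dracut -f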

Comment 4 Gordon Messmer 2022-01-11 19:23:41 UTC
In CS8, additional LVs are also not named on the kernel command line.  On that release, early-boot activation of LVs that aren't given to rd.lvm.lv= is handled by /usr/lib/systemd/system/lvm2-pvscan@.service.  However, that unit doesn't exist on CS9.  There, early-boot activation looks like it's intended to be handled by a rule in /usr/lib/udev/rules.d/69-dm-lvm.rules, but that rule appears to fire only for PVs that weren't handled during dracut init, so it doesn't fire for VGs that dracut partially activated.
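
A quick way to see which of the two mechanisms a given release ships (the paths are the ones named above; grepping for systemd-run is my assumption, based on the transient lvm-activate-<VG>.service units described in the next comment):

    # CS8: per-PV activation via a template unit
    ls /usr/lib/systemd/system/lvm2-pvscan@.service
    # CS9: activation driven from a udev rule instead
    grep -n systemd-run /usr/lib/udev/rules.d/69-dm-lvm.rules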

Comment 5 Gordon Messmer 2022-01-11 23:48:14 UTC
Created attachment 1850205 [details]
journal from failed boot, with udev debug logs

I added a device to the VM that does not boot and created a new VG on that device.  (The new PV is /dev/vdc1, and the VG is BackupVG).  Then, I booted the VM with "udev.log-priority=debug" as a boot parameter.

The VM will boot to a rescue environment after the /var mount times out.  At that point, /dev/cs/root, /dev/cs/swap, and /dev/BackupVG/lv01 exist, but /dev/cs/var does not.  The transient systemd unit lvm-activate-BackupVG.service exists, which suggests that the rule in 69-dm-lvm.rules is being triggered for /dev/vdc1.  And, indeed, the log includes the command from line 82 of that rules file.

However, there is no "lvm-activate-cs.service" unit.  And, while there are udev events for vdc1 both before and after the pivot, there are no events for md127 (the PV backing the cs VG) after the pivot.
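
One way to probe that last observation (a hypothetical check, not something I ran on this VM): synthesize the missing post-pivot event for the PV and see whether 69-dm-lvm.rules then activates the rest of the cs VG:

    udevadm trigger --action=change /dev/md127   # replay a change event for the PV
    udevadm settle
    lvs -o lv_name,lv_active cs                  # does "var" become active now?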

Comment 6 Jan Stodola 2022-01-12 09:31:13 UTC
rd.lvm.lv= arguments provided on the kernel command line should be just the LVs used by dracut/initramfs to mount the root filesystem.  The other LVs, which are not needed in the initramfs, are activated later in the boot process.  So the kernel command line arguments created by the installer look OK.
I'm reassigning this bug to lvm2 for further review.

Comment 7 David Teigland 2022-01-12 14:52:33 UTC
root on lvm on md is broken until 64-lvm.rules in dracut is fixed.

As for the broader suggestion, dracut activating all LVs in the root VG is reasonable in many cases, but where there are thousands of LVs (of varying, complex types) in the root VG, it would interfere with and delay the main job of the initrd, which is activating the root fs LV.

*** This bug has been marked as a duplicate of bug 2033737 ***