After a power failure the system gets stuck during reboot and drops to a recovery shell.

The issue is reproducible with kernel-6.6.8-200.fc39.x86_64.
The issue does not manifest itself with kernel-6.6.8-100.fc38.x86_64.

Currently the only way to get the system running again is to boot with the fc38 kernel.

Reproducible: Always

Steps to Reproduce:
1. Power the system off (a shortcut is sketched at the end of this comment)
2. Boot the system
3. See it hang during boot

Normally at point 3 there's a line "device-mapper: cache: You have created a cache device with a lot of individual cache blocks (3258112)" after which the boot continues.

Actual Results:
Hang during boot

Expected Results:
Successful boot

lsblk:
sda                           8:0    0 931.5G  0 disk
├─sda1                        8:1    0   199M  0 part
│ └─md125                     9:125  0   199M  0 raid1 /boot/efi
├─sda2                        8:2    0     1G  0 part
│ └─md127                     9:127  0  1022M  0 raid1 /boot
└─sda3                        8:3    0 930.3G  0 part
  └─md126                     9:126  0   1.8T  0 raid5
    └─base-lukscache_corig    253:3  0   1.8T  0 lvm
      └─base-lukscache        253:4  0   1.8T  0 lvm
        └─luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7
                              253:5  0   1.8T  0 crypt
          └─crypt--fedora-root
                              253:6  0   1.7T  0 lvm   /var/lib/snapd/snap /
sdb                           8:16   0 931.5G  0 disk
├─sdb1                        8:17   0   199M  0 part
│ └─md125                     9:125  0   199M  0 raid1 /boot/efi
├─sdb2                        8:18   0     1G  0 part
│ └─md127                     9:127  0  1022M  0 raid1 /boot
└─sdb3                        8:19   0 930.3G  0 part
  └─md126                     9:126  0   1.8T  0 raid5
    └─base-lukscache_corig    253:3  0   1.8T  0 lvm
      └─base-lukscache        253:4  0   1.8T  0 lvm
        └─luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7
                              253:5  0   1.8T  0 crypt
          └─crypt--fedora-root
                              253:6  0   1.7T  0 lvm   /var/lib/snapd/snap /
sdc                           8:32   0 931.5G  0 disk
├─sdc1                        8:33   0   199M  0 part
│ └─md125                     9:125  0   199M  0 raid1 /boot/efi
├─sdc2                        8:34   0     1G  0 part
│ └─md127                     9:127  0  1022M  0 raid1 /boot
└─sdc3                        8:35   0 930.3G  0 part
  └─md126                     9:126  0   1.8T  0 raid5
    └─base-lukscache_corig    253:3  0   1.8T  0 lvm
      └─base-lukscache        253:4  0   1.8T  0 lvm
        └─luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7
                              253:5  0   1.8T  0 crypt
          └─crypt--fedora-root
                              253:6  0   1.7T  0 lvm   /var/lib/snapd/snap /
sr0                          11:0    1  1024M  0 rom
nvme0n1                     259:0    0 238.5G  0 disk
├─nvme0n1p1                 259:1    0   200G  0 part
│ └─base-cache_cvol         253:0    0   199G  0 lvm
│   ├─base-cache_cvol-cdata 253:1    0 198.9G  0 lvm
│   │ └─base-lukscache      253:4    0   1.8T  0 lvm
│   │   └─luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7
│   │                       253:5    0   1.8T  0 crypt
│   │     └─crypt--fedora-root
│   │                       253:6    0   1.7T  0 lvm   /var/lib/snapd/snap /
│   └─base-cache_cvol-cmeta 253:2    0   144M  0 lvm
│     └─base-lukscache      253:4    0   1.8T  0 lvm
│       └─luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7
│                           253:5    0   1.8T  0 crypt
│         └─crypt--fedora-root
│                           253:6    0   1.7T  0 lvm   /var/lib/snapd/snap /
└─nvme0n1p2                 259:2    0    16G  0 part  [SWAP]
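Note: judging by later comments in this report, a hard power cut does not need a physical switch; an immediate forced reboot reproduces the same dirty-metadata state:

reboot -f   # skip the clean shutdown path, leaving the dm-cache metadata dirty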
Created attachment 2008746 [details]
Screenshot while hang occurs
Created attachment 2008747 [details]
dracut shell when boot fails
Created attachment 2008748 [details]
Normal boot screenshot
The issue could also be attributed to device-mapper, because that has changed in FC39.
Changed the component to LVM2 because I now have a clear indication that dm-cache fails to start after a power failure.
Created attachment 2010698 [details]
The dm-cache status while not fully started after power failure

I added another screenshot showing the status of dm-cache while it is not being started after a power failure.
The boot log suggests the device for decryption is not present in the system. Unfortunately, from the provided info it's not yet visible why. Is it possible your MD RAID arrays are not working properly with the newer kernel? Or is the cached LV activated, but failing during the activation?

When you are in the dracut rescue shell:
- Are you able to see LVs? (see: lvs -a, or "lvm lvs -a")
- Are you able to activate your LVs? (see: vgchange -ay, or "lvm vgchange -ay")
- Do you even see PVs? (see: pvs, or "lvm pvs")

Are you using a devices file? In that case you need to have a properly configured /etc/lvm/devices/devicesfile (see "lvm lvmdevices"). What errors do you get? The commands are summarized in the sketch below.

You should probably attach to this BZ the file generated by 'lvmdump' so we can more easily see the device stack. ATM I'd think the issue is related to mdraid, which has received some patches recently, while there are no changes on the lvm2 side (especially if you only switch the kernel on boot).
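For reference, a minimal sketch of the diagnostics above as they could be run from the dracut rescue shell (assuming the initramfs ships only the single 'lvm' binary, as dracut usually does, so the subcommands are invoked through it):

lvm pvs            # are the PVs (here: the md array) visible at all?
lvm lvs -a         # are the cache LVs listed, and with what attributes?
lvm vgchange -ay   # try to activate everything; note any errors
lvm lvmdevices     # is a devices file in use / correctly populated?
cat /proc/mdstat   # are the MD arrays assembled, and in what state?
lvmdump            # if available, collect a report to attach here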
Right after upgrading I ran the same kernel both in F38 (kernel-6.6.8-100.fc38.x86_64) and F39 (kernel-6.6.8-200.fc39.x86_64); at that time I was able to use both kernels + initramfs to boot F39. However, after a power failure the F39 kernel would fail and the F38 one was my lifeline. Not sure if these are actually the same kernels, but they seem close.

As may be seen in the lsblk output in previous comments, the LUKS device stacks on top of the dm-cache device. So it's definitely not present (yet), because the dm-cache is not started/created properly.

In the failing situation it's noteworthy though that both md126 and md127 are active (auto-read-only). Not sure if that's normal behaviour (not sure what rd.break setting is required to capture this on a healthy system). Is the read-only situation causing dm-cache to fail (or wait, or not do anything at all)? See the sketch at the end of this comment for a way to rule that out.

This is the mdstat situation on a functional system:

[root@home07 ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active raid1 sdb1[1] sda1[0] sdc1[2]
      203776 blocks super 1.0 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdb2[1] sdc2[2] sda2[0]
      1046528 blocks super 1.2 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid5 sdb3[1] sdc3[3] sda3[0]
      1950750720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 7/8 pages [28KB], 65536KB chunk

unused devices: <none>
[root@home07 ~]#

Note that only md127 is required for dm-cache; the other arrays are the (non-cached) /boot and /boot/efi.

One of the screenshots shows the available dmsetup output (not actual lvs output); this is an approximation (manual typing, see the screenshot for accurate data):

[root]# dmsetup info -c -o name,uuid,suspended,readonly,open,tables_loaded,names_using_dev
Name                   UUID                                                                        Suspended  Read-only  Open  Tables  RefNames
base-cache_cvol        LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XEwQE1n1DCedKMKA4IsP3k257IyimpgYC-cvol   Active     Writeable  2     Live    base-cache_cvol-cdata
base-cache_cvol-cdata  LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XEwQE1n1DCedKMKA4IsP3k257IyimpgYC-cdata  Active     Writeable  1     Live
base-cache_cvol-cmeta  LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XEwQE1n1DCedKMKA4IsP3k257IyimpgYC-cmeta  Active     Writeable  1     Live
[root]#

A functional system shows this:

[root@home07 ~]# dmsetup info -c -o name,uuid,suspended,readonly,open,tables_loaded,names_using_dev
Name                                       UUID                                                                                    Suspended  Read-only  Open  Tables  RefNames
base-cache_cvol                            LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XEwQE1n1DCedKMKA4IsP3k257IyimpgYC-cvol               Active     Writeable  2     Live    base-cache_cvol-cdata,base-cache_cvol-cmeta
base-cache_cvol-cdata                      LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XEwQE1n1DCedKMKA4IsP3k257IyimpgYC-cdata              Active     Writeable  1     Live    base-lukscache
base-cache_cvol-cmeta                      LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XEwQE1n1DCedKMKA4IsP3k257IyimpgYC-cmeta              Active     Writeable  1     Live    base-lukscache
base-lukscache                             LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XKoCD1KnAEnW59iNak7M0A7Cps6RSQKxd                    Active     Writeable  1     Live    luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7
base-lukscache_corig                       LVM-RLs0VskICCirKMnzAfDyVug37GpWhB1XV9EVvnfuYxJa4ywz1Bsen2BujY9yEok9-real               Active     Writeable  1     Live    base-lukscache
crypt--fedora-root                         LVM-XZBlAXyOzRw9LWedyo6luoH7fAQpbW0lFqHAJt8Q1acUW5JUSGZmk3AccAWHOiKd                    Active     Writeable  1     Live
luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7  CRYPT-LUKS2-3d87465c4df64215811493a5b9d4cdd7-luks-3d87465c-4df6-4215-8114-93a5b9d4cdd7  Active     Writeable  1     Live    crypt--fedora-root
[root@home07 ~]#

Regarding lvmdevices, there is no /etc/lvm/devices/devicesfile:

[root@home07 ~]# lvm lvmdevices
Devices file does not exist.
[root@home07 ~]#

There are no errors; dracut just times out waiting for the LUKS stuff. See the screenshot. Furthermore, dmesg shows no additional info compared to the console.

I'll get back on the other questions/remarks after hitting the power switch and creating the failing situation.
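To rule out the auto-read-only state as the culprit, one could try this from the dracut shell (a sketch, assuming md127 is the array holding the cache PV as in the mdstat above, and that 'base' is the VG name as shown in later lvs output; auto-read-only arrays normally switch to read-write on the first write anyway):

mdadm --readwrite /dev/md127   # force the array out of auto-read-only
lvm vgchange -ay base          # then retry activating the cached VG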
Created attachment 2010792 [details]
screened lvmdump output
Created attachment 2010794 [details]
lvm lvs output
Attached a screenshot showing lvm lvs during a failing boot; this is the output on a working system:

[root@home07 ~]# lvm lvs -a
  LV                 VG           Attr       LSize   Pool         Origin             Data%  Meta%  Move Log Cpy%Sync Convert
  [cache_cvol]       base         Cwi-aoC--- 199.00g
  lukscache          base         Cwi-aoC--- <1.82t  [cache_cvol] [lukscache_corig]  99.99  26.47               98.36
  [lukscache_corig]  base         owi-aoC--- <1.82t
  root               crypt-fedora -wi-ao---- <1.68t
[root@home07 ~]#
FWIW I installed the latest FC38 kernel:

dnf install --releasever=38 kernel-6.6.13-100.fc38

It fails to boot after "reboot -f", the same as kernel-6.6.13-200.fc39. No surprise, probably.
Created attachment 2011132 [details]
cache_check error: insert at index NNN exceeds fixbitset size NNN
Assuming a needs_check flag is triggered in the status during a sudden reboot, one could assume that a cache_check is required. Doing so when the system fails to boot renders (more details in the screenshot):

cache_check /dev/mapper/base-cache_cvol-cmeta
thread 'main' panicking ...
insert at index 30479487 exceeds fixbitset size 30479487
...

This could mean anything, related or not to FC39, in combination with bad metadata and/or the cache_check tool. A sketch for inspecting the flag follows below.
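For reference, a sketch of how one might verify whether the needs_check flag is actually set, on a system where the cache LV did activate (e.g. booted via the fc38 kernel); per the kernel's dm-cache documentation, the last two fields of the status line are the metadata mode (rw/ro) and either "needs_check" or "-":

dmsetup status base-lukscache   # inspect the trailing fields of the cache status line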
Passing to the author of cache_check - it looks like some issue is detected within the metadata. Not sure yet how this may influence target correctness, and whether it's related to the --auto-repair issue (see bug #2233177).

Lvm2 on activation calls cache_check, which happens to detect some problems within the metadata and thus prevents the activation (see the sketch below for where this hook is configured).

What I don't really get here is how it would be possible to spot the issue with the newer kernel and not with the old one. So let me also clarify this: when you run the 'newer' kernel, are all other tools in the system bit-to-bit exactly the same? Or do you just try with a standard 'FC38' installation and a standard 'FC39' installation? (The FC39 has a newer cache_check tool.)
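For context, a sketch of the lvm.conf settings involved (the option names exist in the stock /etc/lvm/lvm.conf; the path and defaults shown are assumptions that may differ between Fedora releases):

# /etc/lvm/lvm.conf (excerpt)
global {
    # Binary that lvm2 runs to check cache metadata before activation.
    # cache_check_executable = "/usr/sbin/cache_check"

    # Options passed to it; --clear-needs-check-flag clears the flag
    # once the check passes.
    # cache_check_options = [ "-q", "--clear-needs-check-flag" ]
}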
Hi,

(In reply to Rolf Fokkens from comment #14)
> Assuming a needs_check flag is triggered in the status during a sudden
> reboot, one could assume that a cache_check is required. Doing so when the
> system fails to boot renders (more details in the screenshot):
> 
> cache_check /dev/mapper/base-cache_cvol-cmeta
> thread 'main' panicking ...
> insert at index 30479487 exceeds fixbitset size 30479487
> ...
> 
> This could mean anything, related or not to FC39, in combination with bad
> metadata and/or the cache_check tool.

Would you be able to provide the cache metadata dump for us to do a further check?

dd if=/dev/mapper/base-cache_cvol-cmeta of=cmeta.bin
tar -czvf cmeta.tar.gz cmeta.bin

Then upload the compressed cmeta.tar.gz if possible. Thanks.
(In reply to Rolf Fokkens from comment #12)
> FWIW I installed the latest FC38 kernel:
> 
> dnf install --releasever=38 kernel-6.6.13-100.fc38
> 
> It fails to boot after "reboot -f", the same as kernel-6.6.13-200.fc39.
> No surprise, probably.

It looks like you just downgraded the kernel but kept the userland tools unchanged (still at cache_check v1.0.x), so the issue remains.
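A quick way to confirm which userland tool is actually in play (a sketch; device-mapper-persistent-data is the Fedora package that ships cache_check):

rpm -q device-mapper-persistent-data   # package version of the tools
cache_check --version                  # version reported by the binary itself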
Created attachment 2013100 [details]
/dev/mapper/base-cache_cvol-cmeta
As requested:

[root@home07 ~]# dd if=/dev/mapper/base-cache_cvol-cmeta of=cmeta.bin
294912+0 records in
294912+0 records out
150994944 bytes (151 MB, 144 MiB) copied, 0.719608 s, 210 MB/s
[root@home07 ~]# tar -czvf cmeta.tar.gz cmeta.bin
cmeta.bin
[root@home07 ~]# ls -l cmeta.tar.gz cmeta.bin
-rw-r--r--. 1 root root 150994944 Jan 29 22:30 cmeta.bin
-rw-r--r--. 1 root root  22664211 Jan 29 22:30 cmeta.tar.gz
[root@home07 ~]# tar -cJvf cmeta.tar.xz cmeta.bin
cmeta.bin
[root@home07 ~]# ls -l cmeta.tar.xz cmeta.bin
-rw-r--r--. 1 root root 150994944 Jan 29 22:30 cmeta.bin
-rw-r--r--. 1 root root  10171900 Jan 29 22:34 cmeta.tar.xz
[root@home07 ~]#

Used xz compression because the gz-compressed file was too large to upload.
(In reply to Rolf Fokkens from comment #18)
> Created attachment 2013100 [details]
> /dev/mapper/base-cache_cvol-cmeta

There is a bug in checking the addresses of cached blocks, specifically for metadata that was not shut down normally (e.g., on power failure). I'll fix the bug soon.

Thank you for the detailed information provided, helping us improve product quality.
Thanks for addressing the issue so swiftly.
Fixed in thin-provisioning-tools v1.0.11 upstream, commit d2390a50f3. Two new test cases are added accordingly:

$ cargo test [--release]
...
test metadata_v1_without_slow_dev_size_info ... ok
test metadata_v2_without_slow_dev_size_info ... ok
FEDORA-2024-64e5a607ae (device-mapper-persistent-data-1.0.11-1.fc39) has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2024-64e5a607ae
FEDORA-2024-3381af2ed1 (device-mapper-persistent-data-1.0.11-1.fc38) has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2024-3381af2ed1
FEDORA-2024-3381af2ed1 has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-3381af2ed1`
You can provide feedback for this update here:
https://bodhi.fedoraproject.org/updates/FEDORA-2024-3381af2ed1

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
I can confirm that this fixes the reported issue:
* updated to the device-mapper-persistent-data-1.0.11-1.fc39 package
* did "dracut /boot/initramfs-6.6.13-200.fc39.x86_64.img.2 6.6.13-200.fc39.x86_64"
* did "reboot -f"
* hit <shift> to enter grub and selected this initramfs
* system booted fine
FEDORA-2024-64e5a607ae has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-64e5a607ae`
You can provide feedback for this update here:
https://bodhi.fedoraproject.org/updates/FEDORA-2024-64e5a607ae

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2024-64e5a607ae (device-mapper-persistent-data-1.0.11-1.fc39) has been pushed to the Fedora 39 stable repository. If the problem still persists, please make note of it in this bug report.
FEDORA-2024-3381af2ed1 (device-mapper-persistent-data-1.0.11-1.fc38) has been pushed to the Fedora 38 stable repository. If the problem still persists, please make note of it in this bug report.