##Description of problem:
Activating a thin pool with minor errors in its metadata fails unless `--auto-repair` is passed to thin_check. See [the `thin-provisioning-tools` issue](https://github.com/jthornber/thin-provisioning-tools/issues/244).

##Version-Release number of selected component (if applicable):
archlinux lvm2-2.03.22-1

##How reproducible:
Difficult; requires a thin pool with errors that `--auto-repair` can fix.

##Steps to Reproduce:
1. Make a thin pool.
2. Deactivate the thin pool.
3. Break the thin pool such that `--auto-repair` is required to repair it.
4. Attempt a `vgchange -ay` to reactivate the thin pool.

##Actual results:
The thin pool fails to activate with the error
```
Check of pool myvgnamehere/mythinpoolnamehere failed (status:64). Manual repair required!
```

##Expected results:
The pool should just activate if the errors are recoverable with `--auto-repair`.

##Additional info:
This can be worked around by adding `--auto-repair` to `thin_check_options` in `/etc/lvm/lvm.conf`. It affects booting of systems whose root fs is in a thin pool: when there are errors in the thin pool, the system won't boot, even if those errors are recoverable. Even if this option is only appropriate for systems whose root fs is in a thin pool, perhaps a note about this potential issue could be added to lvm.conf.
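For reference, the workaround described above amounts to the following lvm.conf setting (a sketch: the first two options are lvm's stock defaults, and only `--auto-repair` is added; adjust to match your distribution's defaults):

```
# /etc/lvm/lvm.conf, in the "global" section
global {
    thin_check_options = [ "-q", "--clear-needs-check-flag", "--auto-repair" ]
}
```

If the root fs lives in the thin pool, the initramfs must be regenerated afterwards, since early boot uses the copy of lvm.conf embedded there.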
I can 100% confirm this is a major and painful issue to track down. It took me weeks to figure out why my system stopped booting with some newer kernels (which were actually innocent). It turns out the update of the package device-mapper-persistent-data from version 0.9 to version 1.0 breaks the boot on my system, because I had a thin pool in need of repair. There was no information about this at all; I could only see that some of my LVM volumes (the ones in the affected thin pool, so all of them except swap) were not active. Please note that booting with an older kernel and an older initramfs, which used the older version of device-mapper-persistent-data, booted the system successfully.

Adding `--auto-repair` to `thin_check_options` in `/etc/lvm/lvm.conf`, as suggested, and regenerating the initramfs made the system bootable again. If automatic repair is not desired, an error message should be displayed pointing to the thin pool issue, eliminating the guessing game.

Thank you so much for the --auto-repair hint. Kind regards.
Did the --auto-repair hint come from the mailing list?
(In reply to Jonathan Earl Brassow from comment #3)
> Did the --auto-repair hint come from the mailing list?

It came from the first comment of this bug report, from Eric, who got it from the GitHub link he posted in the same comment. That GitHub issue also suggests that distributions should set --auto-repair by default, as thin_check now does more comprehensive checks on the metadata and might return non-zero even if you just have some leaked blocks (which was my case) that can be repaired automatically and safely.

The problem when the repair cannot be done automatically still remains: it's not very clear why the boot process is failing. You are only told, after a very long time, that your root volume is not appearing.
(In reply to Enrico Tagliavini from comment #4)
> The problem when the repair cannot be done automatically still remains, it's
> not very clear why the boot process is failing. You are only told, after a
> very long time, that your root volume is not appearing.

Sorry for the inconvenience. We addressed this issue in device-mapper-persistent-data v1.0.9 by fixing the leaked blocks under the default lvm options (-q --clear-needs-check-flag). Users no longer need to set the --auto-repair option manually, as long as "thin_check_options" in lvm.conf is not explicitly set. This approach also ensures compatibility between lvm and older versions of thin_check, so users are free to downgrade device-mapper-persistent-data to versions prior to v0.9.0, where the --auto-repair option is not available. Here's the [patch link](https://github.com/jthornber/thin-provisioning-tools/commit/eb28ab94b2).

I hope this patch solves your issue. If you have any further questions or suggestions, let me know.
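To illustrate the comment above: with device-mapper-persistent-data >= 1.0.9, the stock lvm.conf, where `thin_check_options` stays commented out, should be sufficient (a sketch of the relevant lvm.conf lines; the default option values are the ones named in the comment above):

```
# /etc/lvm/lvm.conf — leave the option commented out so lvm passes its
# built-in defaults, which thin_check >= 1.0.9 makes auto-repair capable:
# thin_check_options = [ "-q", "--clear-needs-check-flag" ]
```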
This happened on another one of my machines, and it's not booting even with --auto-repair. thin_check reports:

```
root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricossd-pool00
bad checksum in superblock
root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricohdd-hddpool00
bad checksum in superblock
```

Booting the older kernel with the older initramfs works without issues.

I tried regenerating the initramfs with dracut and even reinstalled the kernel. The timestamp of the initramfs is current, so it should contain the updated lvm.conf.
(In reply to Enrico Tagliavini from comment #6)
> This happened on another one of my machines, but it's not booting, even with
> --auto-repair. Thin_check reports:
>
> root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricossd-pool00
> bad checksum in superblock
> root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricohdd-hddpool00
> bad checksum in superblock
>
> Booting the older kernel with the older initramfs works without issues.

I think there are errors within the metadata that can only be detected by thin_check v1.0.x. However, you're using the wrong device for thin_check: LVM-created thin metadata devices end with a "_tmeta" suffix, e.g. "/dev/mapper/vg_enricohdd-hddpool00_tmeta". Also, thin_check cannot run on an online pool, so you'll have to activate the metadata device individually, e.g.:

```
# lvchange -an vg_enricossd
# lvchange -ay vg_enricossd/pool00_tmeta
# thin_check /dev/mapper/vg_enricossd-pool00_tmeta
```

Then deactivate the metadata device manually after checking:

```
# lvchange -an vg_enricossd/pool00_tmeta
```

The older initramfs probably comes with a thin_check older than v0.9.0, so the errors are ignored and you're able to activate the thin pool. It's not about the kernel version in most cases. Please provide the logs from thin_check v1.0.x if possible. Thanks.
I can't activate the device; the device symlink doesn't exist in /dev/mapper. It's not created because the pool needs repair. I used the newer version of thin_check for the repair, from the dracut shell in the new initramfs. The device was offline and is not shown in /dev/mapper or /dev/vg_enricossd. I cannot run thin_check manually; the only way I can activate is lvchange -ay, but it refuses to do so because the device needs manual repair (status 64). It's a chicken-and-egg problem.

Moreover, I tried to activate the metadata volume only, but LVM refused: it would not allow that because the volume is part of a pool, and the entire pool needs to be activated.

From the initramfs it's difficult to provide the logs. I'll try with a Fedora 39 live USB, but updating thin_check to the latest version first. Thank you for the help.
I believe you should be able to start the linux kernel with Dracut's 'rd.break=pre-pivot' attached to the kernel boot line. (Eventually you will be dropped into a rescue shell anyway.)

Then, within the 'ramdisk' rescue shell, edit '/etc/lvm/lvm.conf' (e.g. 'vi /etc/lvm/lvm.conf') and add the "--auto-repair" option to the 'thin_check_options' list. Then try to activate the thin pool with 'lvm vgchange -ay'.

This should auto-repair the pool, and you can then reboot and start the system normally.
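The recovery steps above, consolidated into a rough rescue-shell transcript (a sketch; the volume names and exact prompt are system-specific):

```
# 1. append to the kernel boot line in the bootloader:
rd.break=pre-pivot

# 2. in the dracut rescue shell, add "--auto-repair" to the
#    thin_check_options list:
vi /etc/lvm/lvm.conf

# 3. activate (and thereby auto-repair) the thin pool, then reboot:
lvm vgchange -ay
reboot
```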
I did that, but unfortunately it didn't work. The --auto-repair option doesn't seem to trigger a repair.
Can you please check with '-vvvv' that the trace from 'vgchange -ay -vvvv' really shows --auto-repair being passed to the thin_check tool? Then attach this command's log to the BZ.
Created attachment 2014567 [details]
vgchange verbose output without the --auto-repair in lvm.conf

This is the output of vgchange -aay -vvvv with the stock lvm.conf (no --auto-repair).
Created attachment 2014569 [details]
vgchange verbose output with the --auto-repair in lvm.conf

This is the output of vgchange -aay -vvvv after editing lvm.conf and adding --auto-repair to the thin_check options. It looks like the options are not picked up from lvm.conf. I checked lvm.conf multiple times and can't find an error; I'll attach a copy as the next attachment, in case I'm editing it wrong.

Note this was all done from a Fedora 39 live USB, after updating device-mapper-persistent-data to version 1.0.9-1 with dnf update device-mapper-persistent-data.
Created attachment 2014570 [details]
/etc/lvm/lvm.conf file
Created attachment 2014571 [details]
Output of thin_check for the metadata device only

Note I had to activate the metadata device read-only; otherwise it doesn't activate.
Small correction: the --auto-repair option is actually set correctly (I was looking at the wrong file, sorry!), but the auto-repair seems to fail; thin_check still returns status 64.
(In reply to Enrico Tagliavini from comment #15)
> Created attachment 2014571 [details]
> Output of thin_check for the meta data device only
>
> Note I had to activate the metadata device read only, otherwise it doesn't
> activate.

It's unusual that there's a block-out-of-bounds error. Could you provide a metadata dump of the pool in vg_enricossd? There's a thin_metadata_pack command that does the job (assuming you have activated the metadata device read-only):

```
# thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack
```

The output tmeta.pack should be around 10MB, so you can upload it directly. Thanks.
Created attachment 2015217 [details]
output of thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack

Sure, I attached the output of

```
thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack
```

as requested. Thank you for the help.
(In reply to Enrico Tagliavini from comment #18)
> Created attachment 2015217 [details]
> output of thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o
> tmeta.pack
>
> Sure, I attached the output as requested. Thank you for the help.

It's the special case that we just solved in the recent upstream release (v1.0.10). Until the new version is available on Fedora, you can fix it by running lvconvert --repair:

```
# lvchange -an vg_enricossd
# lvconvert vg_enricossd/pool00 --repair
```

Let me know if you have questions running the commands. Thanks.
I think that fixed it. lvconvert reported an error:

```
output_error: value error: I/O error: Broken pipe (os error 32)
```

but then it proceeded, created a snapshot of the metadata, and did some repair. I then rebooted normally and the system booted fine. Thank you so much for all the help; I really appreciate it.

It might be useful if the reason why the system is not booting were displayed on screen. Maybe something for systemd-bsod?
(In reply to Enrico Tagliavini from comment #20)
> I think that fixed it. lvconvert reported an error:
>
> output_error: value error: I/O error: Broken pipe (os error 32)
>
> but then it proceeded, created a snapshot of the metadata and did some
> repair. I then rebooted normally and now the system booted fine.

That's odd, since I could run lvconvert successfully with the metadata you provided, and running lvconvert doesn't require snapshotting the metadata volume. That raises a few questions:

- Which version of lvm did you use? Did you run lvconvert --repair from a Fedora 39 live CD?
- Was the snapshot created automatically while running lvconvert, or by something else?
(In reply to Ming-Hung Tsai from comment #21)
> that's odd since I could run lvconvert successfully with the metadata you
> provided, and running lvconvert doesn't require snapshotting the metadata
> volume, so that raises several questions:
>
> - Which version of lvm did you use? Did you run lvconvert --repair from a
> Fedora 39 live CD?
> - Was the snapshot created automatically while running lvconvert, or
> something else?

I ran lvconvert from a Fedora 39 live CD, but I ran dnf update lvm2 device-mapper-persistent-data before running lvconvert. The host itself is still running Fedora 38, if that matters in any way.

I didn't create the snapshot. I honestly don't know what did.
(In reply to Enrico Tagliavini from comment #20)
> I think that fixed it. lvconvert reported an error:
>
> output_error: value error: I/O error: Broken pipe (os error 32)
>
> but then it proceeded, created a snapshot of the metadata and did some
> repair. I then rebooted normally and now the system booted fine.

The "Broken pipe" error message was from thin_dump, and it is not harmful since the exit code of lvconvert was zero. I'm not entirely sure what the "snapshot of metadata" you mentioned refers to, but if you followed my instructions without additional steps, the result should be fine. I'll work on suppressing the confusing messages. Thank you for the feedback.
(In reply to Ming-Hung Tsai from comment #23)
> The "Broken pipe" error message was from thin_dump, which is not harmful as
> the exit code of lvconvert is zero. I'm not entirely sure what the "snapshot
> of metadata" you mentioned, but if you followed my instructions without
> additional steps, the result should be fine. I'll work on stopping the
> confusing messages. Thank you for the feedback.

Understood. I already booted with the latest kernel / initramfs (which was previously failing to boot) several days ago, and it worked without issues. Maybe it was me creating the snapshot while trying to run thin_repair manually (which I didn't do in the end), but I'm really not sure about it. I don't remember the details any longer; it happened in the middle of many other things I was working on. Sorry about that.

Thank you so much again for your help, I appreciated it a lot.
All the issues mentioned above have been fixed in thin-provisioning-tools v1.0.12 upstream:

* thin_check defaulting to --auto-repair: fixed in v1.0.9 upstream, commit eb28ab94, by making the lvm-default --clear-needs-check-flag option auto-repair capable.
* Allow non-zero values in unused index block entries: fixed in v1.0.10 upstream, commit d5fe6a1e.
* Remove the confusing error message from thin_dump on broken pipe: fixed in v1.0.12 upstream, commit b3e05f2e.