Bug 2233177 - `--auto-repair` in thin_check_options by default
Summary: `--auto-repair` in thin_check_options by default
Keywords:
Status: POST
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: unspecified
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Ming-Hung Tsai
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-21 16:10 UTC by Eric Toombs
Modified: 2024-04-27 05:52 UTC
CC List: 8 users

Fixed In Version: thin-provisioning-tools v1.0.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
pm-rhel: lvm-technical-solution?
pm-rhel: lvm-test-coverage?


Attachments
* vgchange verbose output without the --auto-repair in lvm.conf (212.16 KB, text/plain), 2024-02-02 14:51 UTC, Enrico Tagliavini
* vgchange verbose output with the --auto-repair in lvm.conf (212.19 KB, text/plain), 2024-02-02 14:54 UTC, Enrico Tagliavini
* /etc/lvm/lvm.conf file (110.41 KB, text/plain), 2024-02-02 14:55 UTC, Enrico Tagliavini
* Output of thin_check for the meta data device only (203 bytes, text/plain), 2024-02-02 14:56 UTC, Enrico Tagliavini
* output of thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack (3.51 MB, application/octet-stream), 2024-02-05 14:42 UTC, Enrico Tagliavini


Links
* GitHub: jthornber/thin-provisioning-tools issue 244 (closed): "Version 1.0.0 does not activate my thinpool." Last updated 2023-08-21 16:10:00 UTC
* Red Hat Issue Tracker: RHEL-24964. Last updated 2024-02-09 11:51:53 UTC

Description Eric Toombs 2023-08-21 16:10:01 UTC
##Description of problem:
Activating thin pools with minor errors in them will fail if `--auto-repair` isn't passed to thin-check. See [the `thin-provisioning-tools` issue](https://github.com/jthornber/thin-provisioning-tools/issues/244).


##Version-Release number of selected component (if applicable):
archlinux lvm2-2.03.22-1

##How reproducible:
Difficult: it requires a thin pool with errors that can be fixed by `--auto-repair`.


##Steps to Reproduce:
1. Make a thin pool.
2. Deactivate the thin pool.
3. Break the thin pool such that `--auto-repair` is required.
4. Attempt a `vgchange -ay` to reactivate the thin pool.

##Actual results:
The thin pool fails to activate with error
```
Check of pool myvgnamehere/mythinpoolnamehere failed (status:64). Manual repair required!
```

##Expected results:
It should just activate if the errors are recoverable with --auto-repair.

##Additional info:
This can be fixed by adding `--auto-repair` to `thin_check_options` in `/etc/lvm/lvm.conf`.
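For reference, a minimal sketch of the relevant `lvm.conf` fragment (the first two options are lvm's stock defaults; your distribution's list may differ):

```
global {
    # thin_check runs before a thin pool is activated; --auto-repair lets it
    # fix recoverable errors instead of refusing activation.
    thin_check_options = [ "-q", "--clear-needs-check-flag", "--auto-repair" ]
}
```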

This is affecting the booting of systems whose root fs is in a thin pool. When there are errors in the thin pool, the system won't boot, even if these errors are recoverable.

Even if this option is only appropriate for systems whose root fs is in a thin pool, perhaps a note about this potential issue could be added to lvm.conf.

Comment 1 Enrico Tagliavini 2024-01-27 18:04:27 UTC
I can 100% confirm this is a major and painful issue to track down. It took me weeks to figure out why my system stopped booting with some newer kernels (which were actually innocent). It turns out the update of device-mapper-persistent-data from version 0.9 to version 1.0 broke the boot on my system, because I had a thin pool in need of repair but there was no information about this at all; I could only see that some of my LVM volumes (the ones in the affected thin pool, so all of them except swap) were not active.

Please note that booting with an older kernel and an older initramfs, using the older version of device-mapper-persistent-data, booted the system successfully.

Adding `--auto-repair` to `thin_check_options` in `/etc/lvm/lvm.conf`, as suggested, and regenerating the initramfs made the system bootable again.
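In concrete terms this amounts to something like the following (a sketch; 'dracut --force' rebuilds the initramfs for the currently running kernel):

# vi /etc/lvm/lvm.conf      (add "--auto-repair" to the thin_check_options list)
# dracut --force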

If automatic repair is not desired, an error message pointing to the thin pool issue should be displayed, eliminating the guessing game.

Thank you so much for the --auto-repair hint.

Kind regards.

Comment 3 Jonathan Earl Brassow 2024-01-28 00:12:50 UTC
Did the --auto-repair hint come from the mailing list?

Comment 4 Enrico Tagliavini 2024-01-28 09:08:42 UTC
(In reply to Jonathan Earl Brassow from comment #3)
> Did the --auto-repair hint come from the mailing list?

It came from the first comment of this bug report, from Eric, who got it from the GitHub link he posted in the same comment. That GitHub issue also suggests that distributions should set --auto-repair by default, since thin_check now does more comprehensive checks on the metadata and might return non-zero even if you just have some leaked blocks (which was my case) that can be repaired automatically and safely.

The problem of repairs that cannot be done automatically still remains: it's not very clear why the boot process is failing. You are only told, after a very long time, that your root volume is not appearing.

Comment 5 Ming-Hung Tsai 2024-01-28 09:46:54 UTC
(In reply to Enrico Tagliavini from comment #4)
> The problem when the repair cannot be done automatically still remains, it's
> not very clear why the boot process is failing. You are only told, after a
> very long time, that your root volume is not appearing.

Sorry for the inconvenience. We addressed this issue in device-mapper-persistent-data v1.0.9 by fixing leaked blocks under the default lvm options (-q --clear-needs-check-flag). Users no longer need to set the --auto-repair option manually if "thin_check_options" in lvm.conf is not explicitly set. This approach also keeps lvm compatible with older versions of thin_check, so users are free to downgrade device-mapper-persistent-data to versions prior to v0.9.0, where the --auto-repair option is not available. Here's the [patch link](https://github.com/jthornber/thin-provisioning-tools/commit/eb28ab94b2).
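To confirm what lvm will actually pass to thin_check, something along these lines should work (lvmconfig ships with lvm2; --type full includes built-in defaults for options not explicitly set in lvm.conf):

# lvmconfig --type full global/thin_check_options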

I hope this patch solves your issue. If you have any further questions or suggestions, let me know.

Comment 6 Enrico Tagliavini 2024-02-01 16:15:17 UTC
This happened on another one of my machines, and it's not booting even with --auto-repair. thin_check reports:

root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricossd-pool00
bad checksum in superblock
root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricohdd-hddpool00
bad checksum in superblock

Booting the older kernel with the older initramfs works without issues.

I tried to regenerate the initramfs with dracut and even reinstalled the kernel. The timestamp of the initramfs is current, so it should have the updated lvm.conf.
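One way to verify that the regenerated initramfs really carries the updated config is something like the following (a sketch; lsinitrd is part of dracut, and the image path is an assumption about the default layout):

# lsinitrd -f /etc/lvm/lvm.conf /boot/initramfs-$(uname -r).img | grep thin_check_options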

Comment 7 Ming-Hung Tsai 2024-02-02 03:07:14 UTC
(In reply to Enrico Tagliavini from comment #6)
> This happened on another one of my machines, but it's not booting, even with
> --auto-repair. Thin_check reports:
> 
> root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricossd-pool00
> bad checksum in superblock
> root@enrico-desktop ~ # thin_check /dev/mapper/vg_enricohdd-hddpool00
> bad checksum in superblock
> 
> Booting the older kernel with the older initramfs works without issues.
> 
> I tried to regenerate the initramfs with dracut and even reinstalled the
> kernel. The timestamp of the initarmfs is from now, it should have the
> updated lvm.conf.

I think there are errors within the metadata that can only be detected by thin_check v1.0.x. However, you're using the wrong device for thin_check. LVM-created thin metadata ends with a "_tmeta" suffix, e.g. "/dev/mapper/vg_enricohdd-hddpool00_tmeta". Also, thin_check cannot run on an online pool, so you'll have to activate the metadata device individually, e.g.:

# lvchange -an vg_enricossd
# lvchange -ay vg_enricossd/pool00_tmeta
# thin_check /dev/mapper/vg_enricossd-pool00_tmeta

Then deactivate the metadata manually after checking

# lvchange -an vg_enricossd/pool00_tmeta

The older initramfs might come with a thin_check older than v0.9.0, so the errors are ignored and you're able to activate the thin pool. It's not about the kernel version in most cases.
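As a quick cross-check (assuming thin_check accepts the usual -V version flag and lsinitrd from dracut is available), you could compare the tool on the running system with what is bundled in each initramfs:

# thin_check -V
# lsinitrd /boot/initramfs-$(uname -r).img | grep thin_check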

Please help provide the logs of thin_check v1.0.x if possible. Thanks.

Comment 8 Enrico Tagliavini 2024-02-02 10:25:47 UTC
I can't activate the device; the device symlink doesn't exist in /dev/mapper. It's not created because the pool needs repair...

I used the newer version of thin_check for the repair, from the dracut shell in the new initramfs. The device was offline and is not shown in /dev/mapper or /dev/vg_enricossd. I cannot run thin_check manually; the only way I can activate it is lvchange -ay, but that refuses to do so because the device needs manual repair (status 64). It's a chicken-and-egg problem.

Moreover, I tried to activate only the metadata volume, but LVM refused: it would not allow that because the volume is part of a pool and the entire pool needs to be activated.

From the initramfs it's difficult to provide the logs. I'll try with a Fedora 39 live USB, but updating thin_check to the latest version first.

Thank you for the help.

Comment 9 Zdenek Kabelac 2024-02-02 11:30:04 UTC
I believe you should be able to start the Linux kernel with Dracut's 'rd.break=pre-pivot' option (appended to the kernel boot line).
(Eventually you will be dropped into a rescue shell anyway.)

Then, within the 'ramdisk' rescue shell, you can edit the config with 'vi /etc/lvm/lvm.conf' and add the "--auto-repair" option to the 'thin_check_options' list.

Then try to activate the thin pool with 'lvm vgchange -ay'.

This should auto-repair the pool, and you can then reboot and start the system normally.
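Condensed into commands, the recovery path above looks roughly like this from the dracut emergency shell (note the 'lvm' wrapper in front of vgchange, as used above, since the initramfs exposes lvm as a single binary):

# vi /etc/lvm/lvm.conf       (add "--auto-repair" to the thin_check_options list)
# lvm vgchange -ay
# exit                       (continue the boot, or reboot once the pool activates)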

Comment 10 Enrico Tagliavini 2024-02-02 12:35:25 UTC
I did that; unfortunately it didn't work. The --auto-repair option doesn't seem to trigger a repair.

Comment 11 Zdenek Kabelac 2024-02-02 12:44:47 UTC
Can you please check with '-vvvv' whether the trace from 'vgchange -ay -vvvv' really passes --auto-repair to the thin_check tool?
Then attach that command's log to the BZ.
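For example (a sketch; the -vvvv debug output goes to stderr, so redirect it to a file and then search for the thin_check invocation):

# vgchange -ay -vvvv 2> vgchange-trace.log
# grep -i thin_check vgchange-trace.log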

Comment 12 Enrico Tagliavini 2024-02-02 14:51:30 UTC
Created attachment 2014567 [details]
vgchange verbose output without the --auto-repair in lvm.conf

This is the output of vgchange -aay -vvvv with the stock lvm.conf (no --auto-repair)

Comment 13 Enrico Tagliavini 2024-02-02 14:54:59 UTC
Created attachment 2014569 [details]
vgchange verbose output with the --auto-repair in lvm.conf

This is the output of vgchange -aay -vvvv after editing lvm.conf and adding --auto-repair to the thin_check options. 

It looks like the options are not picked up from lvm.conf. I checked lvm.conf multiple times and I can't find an error. I'll attach a copy as the next attachment, in case I'm editing it wrong.

Note this was all done from a Fedora 39 live USB, after updating device-mapper-persistent-data to version 1.0.9-1 with 'dnf update device-mapper-persistent-data'.

Comment 14 Enrico Tagliavini 2024-02-02 14:55:32 UTC
Created attachment 2014570 [details]
/etc/lvm/lvm.conf file

Comment 15 Enrico Tagliavini 2024-02-02 14:56:32 UTC
Created attachment 2014571 [details]
Output of thin_check for the meta data device only

Note I had to activate the metadata device read-only; otherwise it doesn't activate.

Comment 16 Enrico Tagliavini 2024-02-02 14:59:17 UTC
Small correction: the --auto-repair option is actually set correctly (I was looking at the wrong file, sorry!), but the auto-repair seems to fail; it still returns status 64.

Comment 17 Ming-Hung Tsai 2024-02-05 12:08:41 UTC
(In reply to Enrico Tagliavini from comment #15)
> Created attachment 2014571 [details]
> Output of thin_check for the meta data device only
> 
> Note I had to activate the metadata device read only, otherwise it doesn't
> activate.

It's unusual that there's a block-out-of-bounds error. Could you help provide the metadata dump of the pool in vg_enricossd? There's a thin_metadata_pack command that does the job:

(assuming you have activated the metadata device read-only)

# thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack

The output tmeta.pack should be around 10 MB, so you can upload it directly. Thanks.

Comment 18 Enrico Tagliavini 2024-02-05 14:42:12 UTC
Created attachment 2015217 [details]
output of thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack

Sure, I attached the output of 

thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack 

as requested.

Thank you for the help.

Comment 19 Ming-Hung Tsai 2024-02-05 17:00:37 UTC
(In reply to Enrico Tagliavini from comment #18)
> Created attachment 2015217 [details]
> output of thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o
> tmeta.pack
> 
> Sure, I attached the output of 
> 
> thin_metadata_pack -i /dev/mapper/vg_enricossd-pool00_tmeta -o tmeta.pack 
> 
> as requested.
> 
> Thank you for the help.

This is a special case that we just solved in the recent upstream release (v1.0.10). Until the new version is available on Fedora, you can fix it by running lvconvert --repair:

# lvchange -an vg_enricossd
# lvconvert vg_enricossd/pool00 --repair
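
If you want to double-check afterwards, re-running thin_check on the inactive pool's metadata sub-LV should now come back clean (a sketch, using the same activation steps as in comment 7):

# lvchange -ay vg_enricossd/pool00_tmeta
# thin_check /dev/mapper/vg_enricossd-pool00_tmeta
# lvchange -an vg_enricossd/pool00_tmeta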

Let me know if you have any questions about running the commands. Thanks.

Comment 20 Enrico Tagliavini 2024-02-09 14:48:57 UTC
I think that fixed it. lvconvert reported an error:

output_error: value error: I/O error: Broken pipe (os error 32)

but then it proceeded, created a snapshot of the metadata and did some repair. I then rebooted normally and now the system booted fine.

Thank you so much for all the help, I really appreciate it. It might be useful if the reason why the system is not booting were displayed on screen. Maybe something for systemd-bsod?

Comment 21 Ming-Hung Tsai 2024-02-15 18:29:33 UTC
(In reply to Enrico Tagliavini from comment #20)
> I think that fixed it. lvconvert reported an error:
> 
> output_error: value error: I/O error: Broken pipe (os error 32)
> 
> but then it proceeded, created a snapshot of the metadata and did some
> repair. I then rebooted normally and now the system booted fine.

That's odd, since I could run lvconvert successfully with the metadata you provided, and running lvconvert doesn't require snapshotting the metadata volume, so that raises several questions:

- Which version of lvm did you use? Did you run lvconvert --repair from a Fedora 39 live CD?
- Was the snapshot created automatically while running lvconvert, or something else?

Comment 22 Enrico Tagliavini 2024-02-15 20:25:48 UTC
(In reply to Ming-Hung Tsai from comment #21)
> that's odd since I could run lvconvert successfully with the metadata you
> provided, and running lvconvert doesn't require snapshotting the metadata
> volume, so that raises several questions:
> 
> - Which version of lvm did you use? Did you run lvconvert --repair from a
> Fedora 39 live CD?
> - Was the snapshot created automatically while running lvconvert, or
> something else?


I ran lvconvert from a Fedora 39 live CD, but I ran 'dnf update lvm2 device-mapper-persistent-data' before running lvconvert. The host itself is still running Fedora 38, if that matters in any way.

I didn't create the snapshot. I honestly don't know what did.

Comment 23 Ming-Hung Tsai 2024-02-16 16:52:21 UTC
(In reply to Enrico Tagliavini from comment #20)
> I think that fixed it. lvconvert reported an error:
> 
> output_error: value error: I/O error: Broken pipe (os error 32)
> 
> but then it proceeded, created a snapshot of the metadata and did some
> repair. I then rebooted normally and now the system booted fine.
> 

The "Broken pipe" error message was from thin_dump, which is not harmful as the exit code of lvconvert is zero. I'm not entirely sure what the "snapshot of metadata" you mentioned, but if you followed my instructions without additional steps, the result should be fine. I'll work on stopping the confusing messages. Thank you for the feedback.

Comment 24 Enrico Tagliavini 2024-02-16 17:04:23 UTC
(In reply to Ming-Hung Tsai from comment #23)
> The "Broken pipe" error message was from thin_dump, which is not harmful as
> the exit code of lvconvert is zero. I'm not entirely sure what the "snapshot
> of metadata" you mentioned, but if you followed my instructions without
> additional steps, the result should be fine. I'll work on stopping the
> confusing messages. Thank you for the feedback.

Understood. I already booted with the latest kernel / initramfs (that was previously failing to boot) several days ago and it worked without issues.

Maybe it was me creating the snapshot while trying to run thin_repair manually (which I didn't do in the end), but I'm really not sure about it. I don't remember the details any longer; it was happening in the middle of many other things I was working on. Sorry about that.

Thank you so much again for your help, I appreciated it a lot.

Comment 25 Ming-Hung Tsai 2024-02-27 14:22:13 UTC
All the issues mentioned above have been fixed in thin-provisioning-tools v1.0.12 upstream:

* thin_check defaults to --auto-repair: fixed in v1.0.9 upstream, commit eb28ab94, by making the lvm-default --clear-needs-check-flag option auto-repair capable.
* Allow non-zero values in unused index block entries: fixed in v1.0.10 upstream, commit d5fe6a1e.
* Remove the confusing error message from thin_dump on broken pipe: fixed in v1.0.12 upstream, commit b3e05f2e.

