Created attachment 1841775 [details]
Debug logs from two users

Description of problem:
Some Proxmox VE users reported [0] that autoactivation of a thin pool no longer works for them. Two users also provided debug logs, and to me it looks like the pvscan command during the initramfs stage crashes. For both users, there is also an lvchange command running at the same time. Manual activation after boot works just fine.

Version-Release number of selected component (if applicable):
2.03.11-2.1 from Debian 11

How reproducible:
I was not able to reproduce the issue myself.

Actual results:
The thin pool LV and the LVs it contains are not activated, but the _tdata and _tmeta partial LVs might stay activated.

Expected results:
The thin pool LV and the LVs it contains are activated.

Additional info:
Please see the attached debug logs and [0] for more information.

[0]: https://forum.proxmox.com/threads/local-lvm-not-available-after-kernel-update-on-pve-7.97406/
In the lvm group we don't test event-based autoactivation from the initrd as described in the forum, so there may be some issues that we've not run across. Comment 14 in the forum thread is correct. The file /run/lvm/vgs_online/<vgname> prevents the VG from being activated a second time. If the VG was not activated the first time, then that's the problem to look at. One possibility is that the VG contains thin pools, and lvm uses external tools to check thin pools prior to activating them (i.e. thin_check). If the initrd does not contain that command, then the lvm autoactivation command will likely fail to autoactivate thin pools.
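For testing, something along these lines can be used to inspect and reset that state (a sketch only; the VG name "pve" is taken from the attached logs, adjust as needed):

  # list VGs that pvscan has already marked as autoactivated
  ls /run/lvm/vgs_online/

  # removing the flag file lets a later pvscan attempt autoactivation again
  rm /run/lvm/vgs_online/pve

  # retrigger autoactivation manually
  pvscan --cache -aay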
Thank you for taking a look. Yes, in both cases it's a thin pool that doesn't activate properly. The initrd should contain the thin_check command (as a symlink to pdata_tools). If the binary were simply missing, I would expect a "WARNING: Check is skipped, please install recommended missing binary <path/to/binary>!" message in the log. I suspect a crash, because there is no "<VG>: autoactivation failed." message in the log either. The last log entries from the pvscan command are from right before it executes the thin_check command. But even if the check command itself were problematic, it should still be handled gracefully, right?
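The presence of thin_check in the initrd can be verified with something like the following (lsinitramfs ships with Debian's initramfs-tools; the image path is the default one and may differ):

  lsinitramfs /boot/initrd.img-$(uname -r) | grep -E 'thin_check|pdata_tools'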
After these messages pvscan just calls wait(), so it's probably just waiting for thin_check to finish running (I think it can run for a long time):

  15:04:45.986728 pvscan[474] activate/dev_manager.c:2297  Running check command on /dev/mapper/pve-data_tmeta
  15:04:46.5731 pvscan[474] config/config.c:1474  global/thin_check_options not found in config: defaulting to thin_check_options = [ "-q" ]
  15:04:46.5756 pvscan[474] misc/lvm-exec.c:71  Executing: /usr/sbin/thin_check -q /dev/mapper/pve-data_tmeta
  15:04:46.5966 pvscan[474] misc/lvm-flock.c:37  _drop_shared_flock /run/lock/lvm/V_pve.
  15:04:46.6054 pvscan[474] mm/memlock.c:694  memlock reset.
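To get a feel for how long the check actually takes on the affected pool, it can be run by hand after boot, while the pool is not active (the device path below is taken from the log above):

  time /usr/sbin/thin_check -q /dev/mapper/pve-data_tmeta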
A few points, since Debian is mentioned. The upstream rules on e.g. Fedora or RHEL use systemd services for the actual autoactivation. The old style was to execute 'pvscan' within a udev rule (see the end of the 69-dm-lvm-metad.rules file). When that happens, there is an upper time limit enforced by udev, so if checking the thin-pool metadata takes too long, the udev rule is killed and the activation is broken, most likely in the middle of 'thin_check'.

Since it's not clear what the machine state is and which rules and logic are used, I can suggest a couple of workarounds (a sketch of 1. and 3. follows below):

1. There is a relatively easy way to dramatically speed up thin_check with the option '--skip-mappings'. You can add this option to the thin_check_options list in lvm.conf (just make sure lvm.conf is also propagated to your ramdisk).

2. Switch the system to use the systemd service for autoactivation - this probably requires some cooperation with the Debian developers.

3. Add your own startup rule at the end of the boot process and simply call 'vgchange -ay' from there.
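A rough sketch of 1. and 3. (the paths, option list and unit name below are only examples, not tested on Debian/Proxmox):

  # 1. in /etc/lvm/lvm.conf, inside the global { } section:
  thin_check_options = [ "-q", "--skip-mappings" ]

  # then rebuild the initramfs so the changed lvm.conf is picked up during early boot:
  update-initramfs -u

  # 3. a minimal oneshot unit, e.g. /etc/systemd/system/lvm-activate-fallback.service:
  [Unit]
  Description=Fallback LVM activation late in boot

  [Service]
  Type=oneshot
  ExecStart=/sbin/vgchange -ay

  [Install]
  WantedBy=multi-user.target

  # enable it with: systemctl enable lvm-activate-fallback.service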
Thank you very much for the suggestions! This (i.e. thin_check taking too long and pvscan getting killed after the udev time limit) does indeed seem to be what's happening. To test it, I replaced my thin_check with a binary that sleeps for 6 minutes, and now I get the same behavior our users reported. I'll suggest 1. to the affected users as a quick fix, and will discuss 2./3. with my co-workers for the long term.
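For reference, the replacement is essentially just a shell script along these lines, installed in place of /usr/sbin/thin_check inside the initrd (the 6-minute sleep is simply meant to exceed udev's event timeout):

  #!/bin/sh
  # stand-in for thin_check that simulates a very slow metadata check
  sleep 360
  exit 0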