Bug 2023213 - Autoactivation of thin pool during boot fails
Summary: Autoactivation of thin pool during boot fails
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: unspecified
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-15 08:42 UTC by Fiona Ebner
Modified: 2021-11-17 11:18 UTC
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-11-17 11:18:00 UTC
Embargoed:
pm-rhel: lvm-technical-solution?
pm-rhel: lvm-test-coverage?


Attachments
Debug logs from two users (489.63 KB, application/octet-stream)
2021-11-15 08:42 UTC, Fiona Ebner

Description Fiona Ebner 2021-11-15 08:42:16 UTC
Created attachment 1841775 [details]
Debug logs from two users

Description of problem:

Some Proxmox VE users reported [0] that autoactivation of a thin pool no longer works for them. Two users also provided debug logs, and it looks to me like the pvscan command run during the initramfs stage crashes. For both users there is also an lvchange command running at the same time. Manual activation after boot seems to work just fine.


Version-Release number of selected component (if applicable):

2.03.11-2.1 from Debian 11


How reproducible:

I was not able to reproduce the issue myself.


Actual results:

The thin pool LV and the LVs it contains are not activated, but the hidden _tdata and _tmeta sub-LVs might stay active.


Expected results:

The thin pool LV and the LVs it contains are activated.


Additional info:

Please see the attached debug logs and [0] for more information.

[0]: https://forum.proxmox.com/threads/local-lvm-not-available-after-kernel-update-on-pve-7.97406/

Comment 1 David Teigland 2021-11-15 16:25:30 UTC
In the lvm group we don't test event-based autoactivation from the initrd as described in the forum, so there may be some issues that we've not run across.

Comment 14 in the forum thread is correct.  The file /run/lvm/vgs_online/<vgname> prevents the VG from being activated a second time.  If the VG was not activated the first time, then that's the problem to look at.

One possibility is that the VG contains thin pools, and lvm uses external tools to check thin pools prior to activating them (i.e. thin_check).  If the initrd does not contain that command, then the lvm autoactivation command will likely fail to autoactivate thin pools.
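
(A quick way to check this on Debian - a sketch assuming the stock initramfs-tools initrd, with an illustrative initrd path - is to list the initrd contents and grep for the checker:

   lsinitramfs /boot/initrd.img-$(uname -r) | grep -E 'thin_check|pdata_tools'

If nothing shows up, the autoactivation command in the initrd has no thin_check to call.)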

Comment 2 Fiona Ebner 2021-11-16 09:11:51 UTC
Thank you for taking a look.

Yes, in both cases it's a thin pool that doesn't activate properly. The initrd should contain the thin_check command (as a symlink to pdata_tools).

If the binary were simply not present, I would expect a "WARNING: Check is skipped, please install recommended missing binary <path/to/binary>!" message in the log. I suspect that there is a crash, because there is no "<VG>: autoactivation failed." message in the log either. The last log entries from the pvscan command are right before executing the thin_check command. But if the check command itself were problematic, it should still be handled gracefully, right?

Comment 3 David Teigland 2021-11-16 14:35:46 UTC
After these messages, pvscan just calls wait(), so it's probably just waiting for thin_check to finish running (I think it can run for a long time).

15:04:45.986728 pvscan[474] activate/dev_manager.c:2297  Running check command on /dev/mapper/pve-data_tmeta
15:04:46.5731 pvscan[474] config/config.c:1474  global/thin_check_options not found in config: defaulting to thin_check_options = [ "-q" ]
15:04:46.5756 pvscan[474] misc/lvm-exec.c:71  Executing: /usr/sbin/thin_check -q /dev/mapper/pve-data_tmeta
15:04:46.5966 pvscan[474] misc/lvm-flock.c:37  _drop_shared_flock /run/lock/lvm/V_pve.
15:04:46.6054 pvscan[474] mm/memlock.c:694  memlock reset.

Comment 4 Zdenek Kabelac 2021-11-16 18:57:06 UTC
A few points, since Debian is mentioned.

Upstream rules on e.g. Fedora or RHEL use systemd services for the actual autoactivation.
The old style executed 'pvscan' from within a udev rule (see the end of the 69-dm-lvm-metad.rules file).
In that case there is an upper time limit enforced by udev - so if checking the thin-pool metadata
takes too long, the udev worker running the rule is killed and the activation is broken - most likely
in the middle of 'thin_check'.

Since it's not clear what the machine state is and which rules and overall logic are in use,
I can suggest a couple of tricks:

1. There is a relatively easy way to dramatically speed up thin_check with the option '--skip-mappings'.
   You can add this option to the thin_check_options list in lvm.conf (see the sketch after this list).
   (Just make sure lvm.conf is also propagated to your ramdisk.)

2. Switch the system to use the systemd service for autoactivation - this probably requires some
   cooperation with the Debian developers.

3. Add your own startup rule at the end of the boot process and simply call 'vgchange -ay' from there.
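
(A minimal sketch of option 1, assuming a Debian system where the initrd is rebuilt with update-initramfs; merge the option into whatever thin_check_options your lvm.conf already sets:

   # /etc/lvm/lvm.conf
   global {
       # keep the default "-q" and add "--skip-mappings" to speed up the metadata check
       thin_check_options = [ "-q", "--skip-mappings" ]
   }

   # propagate the changed lvm.conf into the ramdisk
   update-initramfs -u

For option 3, the call itself is simply 'vgchange -ay', run late in boot from e.g. a systemd unit or an rc.local-style script.)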

Comment 5 Fiona Ebner 2021-11-17 11:18:00 UTC
Thank you very much for the suggestions!

This (i.e. thin_check taking too long and pvscan being killed once the udev time limit is reached) does indeed seem to be what's happening. To test it, I replaced my thin_check with a binary that sleeps for 6 minutes, and now I get the same behavior that our users reported.

I'll suggest 1. to the affected users as a quick fix, and will discuss 2./3. with my co-workers for the long term.
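
(For reference, a rough sketch of the reproducer - the stand-in script, paths and sleep duration are illustrative, and the real binary has to be restored afterwards:

   # move the real checker aside and install a stand-in that just stalls for 6 minutes
   mv /usr/sbin/thin_check /usr/sbin/thin_check.orig
   { echo '#!/bin/sh'; echo 'sleep 360'; } > /usr/sbin/thin_check
   chmod +x /usr/sbin/thin_check
   update-initramfs -u    # the stand-in must end up in the initrd
   reboot

Restoring is the reverse: move thin_check.orig back and run update-initramfs -u again.)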

