Bug 2158628

Summary:

lvextend of a volume backed by a thin pool sometimes triggers unmount

Product:

Red Hat Enterprise Linux 9

Reporter:

Steve Baker <sbaker>

Component:

lvm2

Assignee:

Peter Rajnoha <prajnoha>

lvm2 sub component:

Udev

QA Contact:

cluster-qe <cluster-qe>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

urgent

CC:

afazekas, agk, alfrgarc, apevec, bstinson, cmarthal, dhughes, dlehman, dtardon, heinzm, hjensas, jbrassow, jgrosso, jwboyer, kthakre, mcsontos, mgarciac, mpatocka, msekleta, msnitzer, mvollmer, prajnoha, pvlasin, rdiazcam, spower, stchen, systemd-maint-list, yuwatana, zkabelac

Version:

CentOS Stream

Keywords:

Triaged

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Unspecified

Whiteboard:

Fixed In Version:

lvm2-2.03.17-6.el9

Last Closed:

2023-05-09 08:23:51 UTC

Type:

Bug

Attachments:

Description	Flags
journal of successful growvols run	none
systemd debug of successful boot	none
systemd debug of /home (dm-9) being unmounted after lvextend	none
A patch for the upstream kernel	none

Description Steve Baker 2023-01-05 21:41:33 UTC

Description of problem:
When lvextend is called to extend a volume backed by a thin pool, the volume is sometimes unmounted. Depending on which volume is unmounted, the server can become degraded or completely non-functional.

Version-Release number of selected component (if applicable):
This issue started when CentOS Stream 9 updated from systemd-250 to systemd-252-2.el9.x86_64.

Downgrading to systemd-250 avoids this issue
Downgrading lvm2, device-mapper, kernel, libblockdev to previous 9-stream versions has no effect, so we believe this is a systemd regression

How reproducible:
Around 20% of attempts

Steps to Reproduce:
Use a cloud image that has multiple LVM volumes backed by a thin pool. After boot, a script grows the pool to the size of the server disk and then grows each logical volume so that the thin pool is fully (but not over) allocated.
Sometimes the lvextend results in the volume being unmounted. The severity of the resulting failures depends on which volume is affected.

Actual results:
Jan 03 21:28:06 lp2000226-1 growvols[2428]: [INFO] Running: lvextend --size +17200840704B /dev/mapper/vg-lv_srv
Jan 03 21:28:06 lp2000226-1 dmeventd[775]: No longer monitoring thin pool vg-lv_thinpool-tpool.
Jan 03 21:28:06 lp2000226-1 kernel: dm-10: detected capacity change from 98304 to 33693696
Jan 03 21:28:06 lp2000226-1 systemd[1]: Stopped target Local File Systems.
Jan 03 21:28:06 lp2000226-1 systemd[1]: Unmounting /srv...
Jan 03 21:28:06 lp2000226-1 kernel: XFS (dm-10): Unmounting Filesystem
Jan 03 21:28:06 lp2000226-1 dmeventd[775]: Monitoring thin pool vg-lv_thinpool-tpool.
Jan 03 21:28:06 lp2000226-1 systemd[1]: srv.mount: Deactivated successfully.
Jan 03 21:28:06 lp2000226-1 systemd[1]: Unmounted /srv.
Jan 03 21:28:06 lp2000226-1 systemd[1]: systemd-fsck@dev-disk-by\x2dlabel-fs_srv.service: Deactivated successfully.
Jan 03 21:28:06 lp2000226-1 systemd[1]: Stopped File System Check on /dev/disk/by-label/fs_srv.
Jan 03 21:28:07 lp2000226-1 growvols[2428]: [DEBUG] Result:   Size of logical volume vg/lv_srv changed from 48.00 MiB (12 extents) to <16.07 GiB (4113 extents).
Jan 03 21:28:07 lp2000226-1 growvols[2428]:   Logical volume vg/lv_srv successfully resized.


Expected results:
Jan 05 16:31:56 lp2000226-s-1 growvols[2420]: [INFO] Running: lvextend --size +17200840704B /dev/mapper/vg-lv_srv
Jan 05 16:31:56 lp2000226-s-1 dmeventd[771]: No longer monitoring thin pool vg-lv_thinpool-tpool.
Jan 05 16:31:56 lp2000226-s-1 growvols[2420]: [DEBUG] Result:   Size of logical volume vg/lv_srv changed from 48.00 MiB (12 extents) to <16.07 GiB (4113 extents).
Jan 05 16:31:56 lp2000226-s-1 growvols[2420]:   Logical volume vg/lv_srv successfully resized
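
The difference between the two journal excerpts above can be checked mechanically. The following is a minimal sketch in POSIX sh, not part of growvols: the function name unmounts_after_lvextend and the matching patterns are assumptions derived from the log lines in this report. It reads journal text on standard input and prints every mount point that systemd starts unmounting after an lvextend invocation was logged.

```shell
# Sketch only: the function name and patterns are assumptions based on
# the journal excerpts in this report.
# Reads journal text on stdin. Prints each mount point from a systemd
# "Unmounting <path>..." line that follows a "Running: lvextend" line.
# Exit status is 0 if at least one such unmount was found, 1 otherwise.
unmounts_after_lvextend() {
    awk '
        /Running: lvextend/ { seen = 1; next }
        # Match only systemd messages, not the kernel XFS message.
        seen && /systemd\[[0-9]+\]: Unmounting / {
            path = $NF
            sub(/\.\.\.$/, "", path)   # strip the trailing "..."
            print path
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Possible usage on an affected host:
#   journalctl -b --no-pager | unmounts_after_lvextend
```

A non-empty result only shows that an unmount was logged after lvextend ran; it does not prove that the lvextend caused it.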



Additional info:

The initial state of the logical volumes is:
# lvs
  LV          VG Attr       LSize   Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_audit    vg Vwi-aotz-- 192.00m lv_thinpool        4.13                                   
  lv_home     vg Vwi-aotz-- 240.00m lv_thinpool        3.36                                   
  lv_log      vg Vwi-aotz-- 240.00m lv_thinpool        4.35                                   
  lv_root     vg Vwi-aotz--  <3.62g lv_thinpool        57.99                                  
  lv_srv      vg Vwi-aotz--  48.00m lv_thinpool        15.62                                  
  lv_thinpool vg twi-aotz--  <4.93g                    51.39  21.97                           
  lv_tmp      vg Vwi-aotz-- 240.00m lv_thinpool        3.46                                   
  lv_var      vg Vwi-aotz-- 952.00m lv_thinpool        42.18   

After the growvols script is run, the state shows that no volume or pool is anywhere near capacity, so the unmount guardrails should not be triggered:
# lvs
  LV          VG Attr       LSize   Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_audit    vg Vwi-aotz--  <2.05g lv_thinpool        0.62                                   
  lv_home     vg Vwi-aotz--   1.16g lv_thinpool        0.88                                   
  lv_log      vg Vwi-aotz--  <9.55g lv_thinpool        0.31                                   
  lv_root     vg Vwi-aotz-- <11.07g lv_thinpool        18.97                                  
  lv_srv      vg Vwi-aotz-- <16.07g lv_thinpool        0.31                                   
  lv_thinpool vg twi-aotz--  77.92g                    3.36   1.68                            
  lv_tmp      vg Vwi-aotz--   1.16g lv_thinpool        0.88                                   
  lv_var      vg Vwi-aotz-- <37.43g lv_thinpool        1.10    

The portion of the journal when growvols is running will be attached, showing both successful and failed runs.

An attempt was made to run lvextend with LVM_RUN_BY_DMEVENTD=1 (see man dmeventd) but this made the unmount issue worse.

Comment 1 Steve Baker 2023-01-05 21:42:58 UTC

Setting to high Severity since this is blocking upstream OpenStack CI and the only workaround is to pin to systemd-250

Comment 2 Steve Baker 2023-01-06 03:55:48 UTC

Created attachment 1936098 [details]
journal of successful growvols run

Comment 3 Steve Baker 2023-01-06 03:56:55 UTC

Created attachment 1936099 [details]
journal of failed growvols run, volume gets unmounted

Comment 4 Michal Sekletar 2023-01-09 13:11:56 UTC

Please enable systemd debug logging, reproduce and attach the log. Thanks!

Comment 5 Steve Baker 2023-01-16 22:11:13 UTC

Created attachment 1938491 [details]
systemd debug of successful boot

Comment 6 Steve Baker 2023-01-16 22:13:09 UTC

Created attachment 1938492 [details]
systemd debug of /home (dm-9) being unmounted after lvextend

Comment 10 Attila Fazekas 2023-01-18 17:21:25 UTC

The commands executed by `/usr/local/sbin/growvols -yv /=8GB /tmp=1GB /var/log=10GB /var/log/audit=2GB /home=1GB /var=100%` when the disk size is extended to 50G:

sgdisk --new=5:11487232:104857566 --change-name=5:growvols /dev/vda
partprobe
pvcreate /dev/vda5
vgextend vg /dev/vda5
lvextend --poolmetadatasize +1073741824B /dev/mapper/vg-lv_thinpool /dev/vda5
lvextend -L+46728740864B /dev/mapper/vg-lv_thinpool /dev/vda5
lvextend --size +7998537728B /dev/mapper/vg-lv_root
lvextend --size +998244352B /dev/mapper/vg-lv_tmp
lvextend --size +9999220736B /dev/mapper/vg-lv_log
lvextend --size +1996488704B /dev/mapper/vg-lv_audit
#sleep 60
lvextend --size +998244352B /dev/mapper/vg-lv_home
lvextend --size +24738004992B /dev/mapper/vg-lv_var
xfs_growfs /dev/mapper/vg-lv_root
xfs_growfs /dev/mapper/vg-lv_tmp
xfs_growfs /dev/mapper/vg-lv_log
xfs_growfs /dev/mapper/vg-lv_audit
xfs_growfs /dev/mapper/vg-lv_home
xfs_growfs /dev/mapper/vg-lv_var


The "sleep 60" seems to work around the issue.

Comment 11 Attila Fazekas 2023-01-18 18:07:47 UTC

If I reboot before 

lvextend --size +7998537728B /dev/mapper/vg-lv_root
lvextend --size +998244352B /dev/mapper/vg-lv_tmp
lvextend --size +9999220736B /dev/mapper/vg-lv_log
lvextend --size +1996488704B /dev/mapper/vg-lv_audit
lvextend --size +998244352B /dev/mapper/vg-lv_home
lvextend --size +24738004992B /dev/mapper/vg-lv_var

The issue can still happen; it is likely that multiple lvextend operations "in flight" are required to trigger it.

Comment 12 Attila Fazekas 2023-01-19 10:34:43 UTC

Repeating the above 6 lvextend commands, even with minimal sizes (+512B becomes 4M), can trigger the issue within 1 to 11 loops.

If I downgrade systemd to 250-12.el9_1.2 before the reboot, the issue disappears.
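
The loop described in this comment could be scripted roughly as follows. This is a sketch under stated assumptions, not the script that was actually used: the LV names come from comment 10, the LV-to-mount-point mapping is inferred from the growvols arguments there, and LVEXTEND, MOUNTS_FILE and SETTLE_SECS are hypothetical knobs that allow a dry run without touching real volumes.

```shell
# Sketch only: the variable names, function names and the LV to mount
# point mapping are assumptions for illustration.
LVEXTEND=${LVEXTEND:-lvextend}
MOUNTS_FILE=${MOUNTS_FILE:-/proc/self/mounts}
SETTLE_SECS=${SETTLE_SECS:-1}

# "lv_name:mount_point" pairs (mount points are assumed, see above).
VOLUMES="lv_root:/ lv_tmp:/tmp lv_log:/var/log lv_audit:/var/log/audit lv_home:/home lv_var:/var"

# Succeeds if the mount point appears as field 2 in MOUNTS_FILE.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit (found ? 0 : 1) }' "$MOUNTS_FILE"
}

# Runs up to $1 rounds (default 11) of minimal lvextend calls and stops
# after the first round in which a mount point has disappeared.
# Returns 0 if nothing was unmounted, 1 on a lost mount, 2 if lvextend fails.
repro_loop() {
    rounds=${1:-11}
    round=1
    while [ "$round" -le "$rounds" ]; do
        for pair in $VOLUMES; do
            # +512B is rounded up to 4M, as noted in this comment.
            $LVEXTEND --size +512B "/dev/mapper/vg-${pair%%:*}" >/dev/null || return 2
        done
        # Give systemd a moment to act on the uevents before checking.
        sleep "$SETTLE_SECS"
        for pair in $VOLUMES; do
            mp=${pair#*:}
            if ! is_mounted "$mp"; then
                echo "round $round: $mp is no longer mounted"
                return 1
            fi
        done
        round=$((round + 1))
    done
    echo "no unmount observed in $rounds rounds"
}
```

Running `repro_loop` as root on a disposable test image grows each volume by one extent per round, so it should not be pointed at a system whose thin pool must stay within its allocation.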

Comment 13 Attila Fazekas 2023-01-19 15:47:24 UTC

v251 does not seem to be affected, so bisecting the change might be possible.

Comment 14 Attila Fazekas 2023-01-20 12:50:02 UTC

bisect ended up:

"""
4228306b9d50df9a804859d00e84588a9fc4c4b9 is the first bad commit
commit 4228306b9d50df9a804859d00e84588a9fc4c4b9
Author: Yu Watanabe <watanabe.yu+github>
Date:   Thu Sep 1 01:17:27 2022 +0900

    core/device: always update existing devlink or alias units on uevent
    
    Previously, existing device units for devlinks or aliases were not
    removed unless the main device unit is removed. This makes all existing
    device units for devlinks and aliases are checked if they are still
    required, and remove if not necessary anymore.
    
    Fixes #24518.

 src/core/device.c | 315 +++++++++++++++++++++++++-----------------------------
 1 file changed, 146 insertions(+), 169 deletions(-)
"""

Not verified yet; I hope the random noise did not lead to a wrong result.
Increasing the vCPU count in the VM appears to make the issue less reproducible.

Comment 15 Steve Baker 2023-01-25 22:16:30 UTC

Yu, do you think LVM volumes backed by a thin-provisioned pool change some timing assumptions in commit 4228306b9d50df9a804859d00e84588a9fc4c4b9, causing this issue?

Comment 17 Yu Watanabe 2023-02-01 12:00:23 UTC

Thank you for bisecting the commits.

The issue is caused by 13-dm-disk.rules not enabling the device node symlinks (SYMLINK+=) based on the filesystem label (and also on the UUID). Upstream lvm2 has a fix for a similar issue, and that fix is included in v2.03.15.
https://github.com/lvmteam/lvm2/commit/e10f67e91728f1e576803df884049ecbd92874d0

Note that no .rules file provided by the LVM2 package contains IMPORT{db}="ID_FS_LABEL_ENC"; the import is done by 11-dm-parts.rules, which is provided by kpartx.rpm (at least on Fedora 37; I am not familiar with RHEL, sorry).

In summary, please try LVM2-2.03.15 or newer with kpartx.rpm.

Comment 18 Alan Pevec 2023-02-01 14:01:11 UTC

CS9 has had lvm2-2.03.16-1.el9 since June 2022 and kpartx-0.8.7 from the start, so it must be something else?

Comment 19 Yu Watanabe 2023-02-01 15:26:30 UTC

Ah, 11-dm-parts.rules from kpartx.rpm does not work for LVM devices, which satisfy DM_UUID=="LVM-*".
So kpartx.rpm is not relevant here, sorry.

And, https://github.com/lvmteam/lvm2/commit/e10f67e91728f1e576803df884049ecbd92874d0 is not enough to fix the issue.

Could you test the following?
=============
diff --git a/udev/13-dm-disk.rules.in b/udev/13-dm-disk.rules.in
index 5cc08121e..dca00bc01 100644
--- a/udev/13-dm-disk.rules.in
+++ b/udev/13-dm-disk.rules.in
@@ -17,12 +17,22 @@ ENV{DM_UDEV_DISABLE_DISK_RULES_FLAG}=="1", GOTO="dm_end"
 SYMLINK+="disk/by-id/dm-name-$env{DM_NAME}"
 ENV{DM_UUID}=="?*", SYMLINK+="disk/by-id/dm-uuid-$env{DM_UUID}"
 
-ENV{DM_SUSPENDED}=="1", ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}=="1", GOTO="dm_link"
-ENV{DM_NOSCAN}=="1", ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}=="1", GOTO="dm_link"
+ENV{DM_SUSPENDED}=="1", ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}=="1", GOTO="dm_import"
+ENV{DM_NOSCAN}=="1", ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}=="1", GOTO="dm_import"
 ENV{DM_SUSPENDED}=="1", GOTO="dm_end"
 ENV{DM_NOSCAN}=="1", GOTO="dm_watch"
 
 (BLKID_RULE)
+GOTO="dm_link"
+
+LABEL="dm_import"
+IMPORT{db}="ID_FS_USAGE"
+IMPORT{db}="ID_FS_UUID_ENC"
+IMPORT{db}="ID_FS_LABEL_ENC"
+IMPORT{db}="ID_PART_ENTRY_NAME"
+IMPORT{db}="ID_PART_ENTRY_UUID"
+IMPORT{db}="ID_PART_ENTRY_SCHEME"
+IMPORT{db}="ID_PART_GPT_AUTO_ROOT"
 
 LABEL="dm_link"
 ENV{DM_UDEV_LOW_PRIORITY_FLAG}=="1", OPTIONS="link_priority=-100"

Comment 20 Yu Watanabe 2023-02-01 15:37:03 UTC

The above patch is submitted as https://github.com/lvmteam/lvm2/pull/105

Comment 21 Steve Baker 2023-02-02 00:03:23 UTC

(In reply to Yu Watanabe from comment #20)
> The above patch is submitted as https://github.com/lvmteam/lvm2/pull/105

I've tested this change on my reproducer image for 17 runs with zero failures.

We could install a modified /usr/lib/udev/rules.d/13-dm-disk.rules on the image until this fix is packaged in a device-mapper-9 rpm. This will give some more test coverage and unblock our release pipeline.
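
One way to carry the patched rules on an image without editing the packaged file is a local override. The sketch below rests on assumptions: it relies on udev preferring a file in /etc/udev/rules.d over a file of the same name in /usr/lib/udev/rules.d, and the function name install_dm_disk_override as well as the RULES_DIR and UDEVADM variables are hypothetical knobs for this example. It is not necessarily how the image was actually modified.

```shell
# Sketch only: names and defaults below are assumptions for illustration.
RULES_DIR=${RULES_DIR:-/etc/udev/rules.d}
UDEVADM=${UDEVADM:-udevadm}

# Installs the patched rules file given as $1 as a local override and
# asks the running udev daemon to re-read its rules.
install_dm_disk_override() {
    src=$1
    # Refuse a file that lacks the dm_import label added by the patch.
    if ! grep -q '^LABEL="dm_import"' "$src"; then
        echo "error: $src does not contain the dm_import label" >&2
        return 1
    fi
    mkdir -p "$RULES_DIR" &&
        cp "$src" "$RULES_DIR/13-dm-disk.rules" &&
        $UDEVADM control --reload
}

# Possible usage, as root, with a locally patched copy of the rules:
#   install_dm_disk_override ./13-dm-disk.rules
```

Once a fixed lvm2 package is installed, the override should be deleted so that the packaged rules take effect again.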

Comment 24 Peter Rajnoha 2023-02-06 13:22:05 UTC

The issue here is that there are two uevents generated with a race. From my test, I can see (the dm-6 is the top level thin LV):

  KERNEL[1632.888585] change   /devices/virtual/block/dm-6 (block)
  ACTION=change
  DEVPATH=/devices/virtual/block/dm-6
  SUBSYSTEM=block
  RESIZE=1
  DEVNAME=/dev/dm-6
  DEVTYPE=disk
  DISKSEQ=26
  SEQNUM=2942
  MAJOR=253
  MINOR=6

  KERNEL[1632.889364] change   /devices/virtual/block/dm-6 (block)
  ACTION=change
  DEVPATH=/devices/virtual/block/dm-6
  SUBSYSTEM=block
  DM_COOKIE=6333296
  DEVNAME=/dev/dm-6
  DEVTYPE=disk
  DISKSEQ=26
  SEQNUM=2943
  MAJOR=253
  MINOR=6


The first uevent is generated to notify about the change in size (it contains RESIZE="1"). This is relatively new in kernel: https://github.com/torvalds/linux/commit/e598a72faeb543599bdf0d930df3a71906404e6f

The second uevent notifies about the DM device being resumed, that is, flipping DM device state from "suspended" to "active".

While the first uevent is being processed by udev, the DM device may not have been resumed yet, so we may see the device as suspended. This causes the blkid scan to be skipped, so the appropriate ID_FS_* variables are not set. As described earlier in this thread (or in the GitHub PR from comment #20), this causes systemd to unmount the mount point for which we lose the appropriate identification.

To fix this, we either need to reorder the uevents so that the "resize" one comes after the "DM resume" one (that would probably be the more correct solution), or, alternatively, we make the udev rules account for this situation (which is basically what is proposed in the PR from comment #20). Since it's easier to change userspace and we need to fix this quickly, we will change the udev rules.
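
The ordering of the two uevents can be spotted in captured monitor output. The following is a minimal sketch, assuming the input has the same shape as the excerpt in this comment (as printed by `udevadm monitor --kernel --property`); the function name resize_before_resume and the wording of its output are assumptions. Treating the DM_COOKIE uevent as the resume notification follows the description in this comment and may not hold for every DM operation.

```shell
# Sketch only: the function name and output format are assumptions.
# Reads uevent blocks on stdin and prints one line for every device
# where a RESIZE=1 uevent is later followed by a DM_COOKIE uevent.
resize_before_resume() {
    awk '
        # Evaluate the uevent collected so far, then reset the state.
        function flush_event() {
            if (dev != "") {
                if (resize) {
                    pending[dev] = seq
                } else if (cookie && (dev in pending)) {
                    printf "%s: RESIZE uevent %s precedes DM_COOKIE uevent %s\n", dev, pending[dev], seq
                    delete pending[dev]
                }
            }
            dev = ""; seq = ""; resize = 0; cookie = 0
        }
        { sub(/^[ \t]+/, "") }     # tolerate indented copies of the output
        /^KERNEL\[/   { flush_event(); next }
        /^DEVNAME=/   { dev = substr($0, 9) }
        /^SEQNUM=/    { seq = substr($0, 8) }
        /^RESIZE=1$/  { resize = 1 }
        /^DM_COOKIE=/ { cookie = 1 }
        END           { flush_event() }
    '
}

# Possible usage: capture events in one terminal while lvextend runs in
# another, then analyse the capture:
#   udevadm monitor --kernel --property > uevents.txt
#   resize_before_resume < uevents.txt
```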

Comment 25 Peter Rajnoha 2023-02-07 08:41:26 UTC

Applied patch from comment #20: https://sourceware.org/git/?p=lvm2.git;a=commit;h=94f77a4d8d9737fca05fb4e451678ec440c68670

Comment 26 Mikuláš Patočka 2023-02-07 13:30:27 UTC

Created attachment 1942712 [details]
A patch for the upstream kernel

Hi

Here I'm submitting the upstream kernel patch for this bug. Please test if it helps.

Comment 29 Peter Rajnoha 2023-02-08 15:40:41 UTC

For the purpose of solving the issue described in this BZ report, the fix from comment #19 (comment #20) is enough: it is a patch for 13-dm-disk.rules, which belongs to the lvm2 package. With this fix, the ID_FS_* variables in udev are no longer temporarily lost when a DM/LVM device is suspended and we happen to receive a uevent during the suspended period. This situation occurs during a DM/LVM device resize, where we first suspend the device, then resize it (producing the CHANGE uevent with RESIZE="1"), and then resume it (producing another CHANGE uevent that notifies about the resume itself). Keeping the ID_FS_* variables is important so that the /dev/disk/* content is not lost and so that systemd does not trigger an unmount operation.

The kernel patch from comment #26 is the more correct solution: it ensures that no udev rule at all is skipped during a resize operation (not just 13-dm-disk and the associated systemd hooks), because udev never gets a chance to see the device as suspended.

Comment 32 Corey Marthaler 2023-02-13 16:09:51 UTC

Marking Verified: Tested with the latest rpms.

kernel-5.14.0-252.el9    BUILT: Wed Feb  1 03:30:10 PM CET 2023
lvm2-2.03.17-6.el9    BUILT: Thu Feb  9 09:52:52 PM CET 2023
lvm2-libs-2.03.17-6.el9    BUILT: Thu Feb  9 09:52:52 PM CET 2023


[root@virt-506 ~]# vgcreate test /dev/sd[ab]
  Physical volume "/dev/sda" successfully created.
  Physical volume "/dev/sdb" successfully created.
  Volume group "test" successfully created
[root@virt-506 ~]# lvcreate -L200M test
  Logical volume "lvol0" created.
[root@virt-506 ~]# mkfs.xfs /dev/test/lvol0
meta-data=/dev/test/lvol0        isize=512    agcount=4, agsize=12800 blks
[...]
[root@virt-506 ~]# mount /dev/test/lvol0 /mnt
[root@virt-506 ~]# ls -l /dev/mapper/
lrwxrwxrwx. 1 root root       7 Feb 13 17:02 test-lvol0 -> ../dm-2
[root@virt-506 ~]# udevadm info --name /dev/mapper/test-lvol0 | grep ID_FS_
E: ID_FS_UUID=44b900ba-3d39-4870-a8ca-6a0acba5b882
E: ID_FS_UUID_ENC=44b900ba-3d39-4870-a8ca-6a0acba5b882
E: ID_FS_SIZE=204111872
E: ID_FS_LASTBLOCK=51200
E: ID_FS_BLOCKSIZE=4096
E: ID_FS_TYPE=xfs
E: ID_FS_USAGE=filesystem
[root@virt-506 ~]# dmsetup suspend /dev/dm-2
[root@virt-506 ~]# echo change > /sys/block/dm-2/uevent

[root@virt-506 ~]# udevadm info --name /dev/mapper/test-lvol0 | grep ID_FS_
E: ID_FS_USAGE=filesystem
E: ID_FS_UUID_ENC=44b900ba-3d39-4870-a8ca-6a0acba5b882

[root@virt-506 ~]# mount
/dev/mapper/test-lvol0 on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
[root@virt-506 ~]# df
Filesystem                      1K-blocks    Used Available Use% Mounted on
/dev/mapper/test-lvol0             199328    1660    197668   1% /mnt
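
The check performed by hand above can be wrapped in a small helper. This is a sketch only: the function name fs_ids_retained is an assumption, and the two properties it looks for are simply the ones that the transcript above shows surviving a change uevent on a suspended device.

```shell
# Sketch only: the function name is an assumption for this example.
# Reads `udevadm info` output on stdin and succeeds if both ID_FS_USAGE
# and ID_FS_UUID_ENC are present with non-empty values.
fs_ids_retained() {
    awk '
        /^E: ID_FS_USAGE=./    { usage = 1 }
        /^E: ID_FS_UUID_ENC=./ { uuid = 1 }
        END { exit ((usage && uuid) ? 0 : 1) }
    '
}

# Possible usage, as root, mirroring the manual steps above:
#   dmsetup suspend /dev/dm-2
#   echo change > /sys/block/dm-2/uevent
#   udevadm settle
#   udevadm info --name /dev/mapper/test-lvol0 | fs_ids_retained && echo retained
#   dmsetup resume /dev/dm-2
```

The final `dmsetup resume` is not part of the transcript; it is added here so that the test device does not stay suspended.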

Comment 33 Marius Vollmer 2023-02-15 09:43:39 UTC

This was probably also fixed with https://github.com/systemd/systemd/pull/24177.

Comment 34 Marius Vollmer 2023-02-15 09:47:25 UTC

(In reply to Marius Vollmer from comment #33)
> This was probably also fixed with
> https://github.com/systemd/systemd/pull/24177.

Ah, no, the systemd change was about SYSTEMD_READY; this bug is about additional attributes with the same problem.

Comment 38 Corey Marthaler 2023-02-16 16:59:10 UTC

Marking VERIFIED in the latest build as well.

kernel-5.14.0-252.el9    BUILT: Wed Feb  1 03:30:10 PM CET 2023
lvm2-2.03.17-7.el9    BUILT: Thu Feb 16 03:24:54 PM CET 2023
lvm2-libs-2.03.17-7.el9    BUILT: Thu Feb 16 03:24:54 PM CET 2023

[root@virt-497 ~]# udevadm info --name /dev/mapper/test-lvol0 | grep ID_FS_
E: ID_FS_UUID=940b98d7-392d-4e9b-99e5-6afd78be88d8
E: ID_FS_UUID_ENC=940b98d7-392d-4e9b-99e5-6afd78be88d8
E: ID_FS_SIZE=204111872
E: ID_FS_LASTBLOCK=51200
E: ID_FS_BLOCKSIZE=4096
E: ID_FS_TYPE=xfs
E: ID_FS_USAGE=filesystem
[root@virt-497 ~]# dmsetup suspend /dev/dm-2
[root@virt-497 ~]# echo change > /sys/block/dm-2/uevent
[root@virt-497 ~]# udevadm info --name /dev/mapper/test-lvol0 | grep ID_FS_
E: ID_FS_USAGE=filesystem
E: ID_FS_UUID_ENC=940b98d7-392d-4e9b-99e5-6afd78be88d8

Comment 40 errata-xmlrpc 2023-05-09 08:23:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (lvm2 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2544