Description of problem:
-------------------------
LVM snapshot does not get deleted after merging the snapshot on LVs that could not be mounted and the system needs to be rebooted for the snapshot to get merged. For example the `/var` LV. We need to restart `lvm2-monitor.service` to remove the snapshot. Merging is ok though.

Version-Release number of selected component (if applicable):
-------------------------
# uname -a
Linux dhcp223.example.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
# rpm -qa |grep lvm2
lvm2-2.02.130-5.el7.x86_64
lvm2-libs-2.02.130-5.el7.x86_64

How reproducible:
-------------------------
Every time

Steps to Reproduce:
-------------------------
# mkdir /var/testdata
# cp /etc/a* /etc/b* /var/testdata/
# ls -l /var/testdata/
total 32
-rw-r--r--. 1 root root    16 Jan  4 13:33 adjtime
-rw-r--r--. 1 root root  1518 Jan  4 13:33 aliases
-rw-r--r--. 1 root root 12288 Jan  4 13:33 aliases.db
-rw-------. 1 root root   541 Jan  4 13:33 anacrontab
-rw-r--r--. 1 root root    55 Jan  4 13:33 asound.conf
-rw-r--r--. 1 root root  2835 Jan  4 13:34 bashrc
#
# lvcreate --size 300M --name snap --snapshot rhel/var

Copied some data to /var/testdata

# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root rhel -wi-ao----  24.41g                                                     /dev/sda2(0)
  snap rhel swi-a-s--- 300.00m      var    0.58                                    /dev/sda3(0)
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var  rhel owi-aos---   7.81g                                                     /dev/sda2(6250)
# lvconvert --merge rhel/snap
  Logical volume rhel/var contains a filesystem in use.
  Can't merge over open origin volume.
  Merging of snapshot rhel/snap will occur on next activation of rhel/var.
# lvs -a -o +devices
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root   rhel -wi-ao----  24.41g                                                     /dev/sda2(0)
  [snap] rhel Swi-a-s--- 300.00m      var    100.00                                  /dev/sda3(0)
  swap   rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var    rhel Owi-aos---   7.81g                                                     /dev/sda2(6250)
#

After reboot

# lvs -a -o +devices
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root   rhel -wi-ao----  24.41g                                                     /dev/sda2(0)
  [snap] rhel Swi-a-s--- 300.00m      var    0.00                                    /dev/sda3(0)
  swap   rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var    rhel Owi-aos---   7.81g                                                     /dev/sda2(6250)
# ls -l /var/testdata/
total 32
-rw-r--r--. 1 root root    16 Jan  4 13:33 adjtime
-rw-r--r--. 1 root root  1518 Jan  4 13:33 aliases
-rw-r--r--. 1 root root 12288 Jan  4 13:33 aliases.db
-rw-------. 1 root root   541 Jan  4 13:33 anacrontab
-rw-r--r--. 1 root root    55 Jan  4 13:33 asound.conf
-rw-r--r--. 1 root root  2835 Jan  4 13:34 bashrc
# systemctl restart lvm2-monitor.service
# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root rhel -wi-ao----  24.41g                                                     /dev/sda2(0)
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var  rhel -wi-ao----   7.81g                                                     /dev/sda2(6250)

Actual results:
When we do `lvconvert --merge rhel/snap` and reboot the server, snapshot LV does not get removed and we have to restart lvm2-monitor.service to remove the same.
Expected results:
When we do `lvconvert --merge rhel/snap` and reboot the server, snapshot LV should get removed

Additional info:
Similar issue is not seen in RHEL7.1

RHEL7.1
# rpm -qa |grep lvm2
lvm2-2.02.115-3.el7.x86_64
lvm2-libs-2.02.115-3.el7.x86_64
Linux dhcp162.example.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
# mkdir /var/testdata
# cp /etc/c* /etc/d* /var/testdata/
# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root rhel -wi-ao----  19.53g                                                     /dev/sda2(0)
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var  rhel -wi-ao----   4.88g                                                     /dev/sda2(5000)
# ls -l /var/testdata/
total 44
-rw-------. 1 root root     0 Jan  4 15:14 cron.deny
-rw-r--r--. 1 root root   451 Jan  4 15:14 crontab
-rw-------. 1 root root     0 Jan  4 15:14 crypttab
-rw-r--r--. 1 root root  1602 Jan  4 15:14 csh.cshrc
-rw-r--r--. 1 root root   841 Jan  4 15:14 csh.login
-rw-r--r--. 1 root root 25213 Jan  4 15:14 dnsmasq.conf
-rw-r--r--. 1 root root  1285 Jan  4 15:14 dracut.conf
# lvcreate --size 300M --name snap --snapshot rhel/var
  Logical volume "snap" created.
# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root rhel -wi-ao----  19.53g                                                     /dev/sda2(0)
  snap rhel swi-a-s--- 300.00m      var    0.00                                    /dev/sda3(0)
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var  rhel owi-aos---   4.88g                                                     /dev/sda2(5000)
# cp -avr /etc/e* /etc/f* /etc/g* /etc/h* /var/testdata/
# ls /var/testdata/
cron.deny  csh.cshrc     dracut.conf  ethertypes   filesystems  gcrypt  group      grub.d    gss        hosts
crontab    csh.login     e2fsck.conf  exports      firewalld    gnupg   group-     gshadow   host.conf  hosts.allow
crypttab   dnsmasq.conf  environment  favicon.png  fstab        groff   grub2.cfg  gshadow-  hostname   hosts.deny
# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root rhel -wi-ao----  19.53g                                                     /dev/sda2(0)
  snap rhel swi-a-s--- 300.00m      var    0.11                                    /dev/sda3(0)
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var  rhel owi-aos---   4.88g                                                     /dev/sda2(5000)
# lvconvert --merge rhel/snap
  Logical volume rhel/var contains a filesystem in use.
  Can't merge over open origin volume.
  Merging of snapshot rhel/snap will occur on next activation of rhel/var.
# lvs -a -o +devices
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root   rhel -wi-ao----  19.53g                                                     /dev/sda2(0)
  [snap] rhel Swi-a-s--- 300.00m      var    100.00                                  /dev/sda3(0)
  swap   rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var    rhel Owi-aos---   4.88g                                                     /dev/sda2(5000)

After reboot

# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root rhel -wi-ao----  19.53g                                                     /dev/sda2(0)
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var  rhel -wi-ao----   4.88g                                                     /dev/sda2(5000)

So this looks to be a regression.
Description of problem:
-------------------------
LVM snapshot does not get deleted after merging the snapshot on LVs that could not be ***unmounted*** and the system needs to be rebooted for the snapshot to get merged. For example the `/var` LV.
-------------------
s/mounted/unmounted
Could it be that the lvm2-monitor service wasn't running until after the merge completed?
Hi Nitin,

Could you please verify that 'vgchange -ay rhel' or 'lvchange -ay rhel/var' is enough to fix this issue after the reboot (or after unmounting the fs residing on top of the origin volume rhel/var)? It may be that the lvm2-monitor service restart fixes it only as a side effect of actually rerunning the vgchange/lvchange command internally.

Also, could you try to reproduce it (the whole reproducer) with 'use_lvmpolld = 0' in the /etc/lvm/lvm.conf file? (Anyway, I'm going to try to reproduce it locally myself.)
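The checks requested above can be sketched in shell. This is only an illustration: the helper names `get_use_lvmpolld` and `print_activation_checks` are made up for this note, and the parsing assumes the plain `use_lvmpolld = 0|1` form used in /etc/lvm/lvm.conf.

```shell
#!/bin/sh
# Hypothetical helper: print the active use_lvmpolld value (0 or 1) found in
# an lvm.conf-style file. Prints nothing if the setting is absent or only
# present in comments.
get_use_lvmpolld() {
    conf="${1:-/etc/lvm/lvm.conf}"
    # Drop comments first, then pick up "use_lvmpolld = <0|1>".
    sed -e 's/#.*//' "$conf" |
        sed -n 's/^[[:space:]]*use_lvmpolld[[:space:]]*=[[:space:]]*\([01]\).*/\1/p' |
        tail -n 1
}

# Hypothetical helper: print (do not run) the commands to try after the
# reboot, for a given VG and origin LV.
print_activation_checks() {
    vg="$1"
    lv="$2"
    echo "vgchange -ay $vg"
    echo "lvchange -ay $vg/$lv"
    echo "lvs -a -o +devices $vg"
}
```

For the setup in this report, `print_activation_checks rhel var` lists the commands to run by hand, and `get_use_lvmpolld` shows which polling mode the reproducer ran with.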
Reproduced locally. The lvm command fails to query the status of the kernel target in the case where the actual snapshot merge had to be postponed until the origin LV was unmounted (or the origin LV open count equals 0). If you're not comfortable with the lvm2-monitor service restart, you can trigger the snapshot LV cleanup by deactivating and reactivating the origin LV (with lvchange -an, lvchange -ay).

What's not yet clear to me is why this doesn't work after a full system restart. The bug manifests with or without lvmpolld. I'll add a full analysis tomorrow.
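A minimal sketch of the deactivate/reactivate workaround mentioned above. The `run` and `reactivate_origin` helpers and the `DRY_RUN` variable are hypothetical; by default the commands are only printed. Note that `lvchange -an` cannot deactivate an LV whose filesystem is still mounted.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command in dry-run mode (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Deactivate and reactivate the origin LV so that the postponed merge is
# picked up and the merged snapshot LV gets cleaned up.
reactivate_origin() {
    origin="$1"    # e.g. rhel/var
    run lvchange -an "$origin" &&
        run lvchange -ay "$origin"
}
```

Running `DRY_RUN=0 reactivate_origin rhel/var` as root would execute both commands for the LV from this report.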
We have this very test case as a part of our snapshot regression suite, however we are masking/hacking around this problem by performing a refresh to remove the merged snapshot.

[root@host-109 ~]# lvs -a -o +devices
  LV             VG      Attr       LSize Pool Origin Data%  Devices
  [merge_reboot] snapper Swi-a-s--- 1.00g      origin 0.00   /dev/sde1(1024)
  origin         snapper Owi-a-s--- 4.00g                    /dev/sde1(0)
[root@host-109 ~]# vgchange --refresh snapper
[root@host-109 ~]# lvs -a -o +devices
  LV     VG      Attr       LSize Pool Origin Data%  Devices
  origin snapper -wi-a----- 4.00g                    /dev/sde1(0)

We'll have to test w/o that once this gets fixed?

3.10.0-327.el7.x86_64
lvm2-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-libs-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-cluster-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-persistent-data-0.5.5-1.el7    BUILT: Thu Aug 13 09:58:10 CDT 2015
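The check-then-refresh workaround from this comment could be scripted roughly as follows. The function name `list_merging_snapshots` is made up for this sketch; it relies on the first character of the lvs Attr field being 'S' for a merging snapshot, as seen in the listings above.

```shell
#!/bin/sh
# Hypothetical helper: read `lvs --noheadings -a -o lv_name,lv_attr` output
# on stdin and print the names of LVs whose attribute string starts with 'S'
# (merging snapshot). Hidden LVs are shown in brackets, which are stripped.
list_merging_snapshots() {
    awk '$2 ~ /^S/ { gsub(/\[|\]/, "", $1); print $1 }'
}

# Example use (needs root and a real VG, so left commented out):
#   if [ -n "$(lvs --noheadings -a -o lv_name,lv_attr snapper | list_merging_snapshots)" ]; then
#       vgchange --refresh snapper
#   fi
```

Once the fix is in, the same helper can serve as the assertion that no merging snapshot is left behind after the reboot, without the refresh step.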
Hi,

it's more complicated than I thought in the beginning. First of all, I found the commit responsible for the regression:

-----
commit c26d81d6e6939906729d91fae83cd8bbdd743bb7
Author: Ondrej Kozina <okozina> <----!!!------
Date:   Wed Apr 8 12:05:14 2015 +0200

    toollib: do not spawn polling in lv_change_activate

    spawning a background polling from within the lv_change_activate fn went to two problems:

    1) vgchange should not spawn any background polling until after the whole activation process for a VG is finished. Otherwise it could lead to a duplicite request for spawning background polling. This statement was alredy true with one exception of mirror up-conversion polling (fixed by this commit).

    2) due to current conditions in lv_change_activate lvchange cmd couldn't start background polling for pvmove LVs if such LV was about to get activated by the command in the same time.

    This commit however doesn't alter the lvchange cmd so that it works same as vgchange with regard to not to spawn duplicate background pollings per unique LV.
----

Unfortunately I can't simply revert it because I would reintroduce the bug it was supposed to fix.

What went wrong: This commit breaks snapshot merge on autoactivation during device discovery on boot. (This is the reason the snapshot will not get removed after reboot.) The autoactivation works only with lvmetad enabled. To test this regression you can simply run the following:

0) have lvmetad enabled in lvm.conf
1) create VG on single device (i.e.: sdx)
2) create origin lv
3) mount lv
4) create snapshot 'snap'
5) write some data to mounted origin lv
6) call lvconvert --merge vg/snap (you'll get the warning about deferred merge until open count == 0)
7) umount origin lv
8) deactivate whole vg
9) call pvscan --cache -aay major:minor (of sdx)

This will simulate the bug on autoactivation the customer has experienced.

Expected result: origin lv in a VG is active and snapshot lv is removed after some time.

Now the harder thing.
I strongly suspect it's not the only bug related to snapshot merge. For example, when I call vgchange -ay vg while the 'vg' is still active, I'll receive errors in the lvmpolld log about not being able to query the snapshot merge state. And yes, lvchange --refresh vg/origin is a much saner workaround for the time being. Thanks Corey!
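The numbered reproducer above can be sketched as a script. Everything beyond the listed steps is an assumption made for illustration: the LV name `origin`, the 1G size, the xfs filesystem, the mount point, the copied file, and the `run`/`reproduce` helpers. Commands are only printed unless DRY_RUN=0 is set, since the real run needs root and a scratch device.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command in dry-run mode (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 0 (lvmetad enabled in lvm.conf) is a precondition and is not checked.
reproduce() {
    dev="$1"                  # scratch device, e.g. /dev/sdx
    majmin="$2"               # its major:minor, e.g. from: lsblk -dno MAJ:MIN "$dev"
    mnt="${3:-/mnt/origin}"   # assumed mount point

    run vgcreate vg "$dev" || return                                    # 1
    run lvcreate -n origin -L 1G vg || return                           # 2
    run mkfs.xfs /dev/vg/origin || return
    run mkdir -p "$mnt" || return
    run mount /dev/vg/origin "$mnt" || return                           # 3
    run lvcreate --snapshot --name snap --size 300M vg/origin || return # 4
    run cp -a /etc/hosts "$mnt"/ || return                              # 5
    # 6: expected to warn that the merge is deferred; exit status not checked.
    run lvconvert --merge vg/snap
    run umount "$mnt" || return                                         # 7
    run vgchange -an vg || return                                       # 8
    run pvscan --cache -aay "$majmin"                                   # 9
}
```

After a real run, `lvs -a vg` should eventually show only the origin LV if autoactivation handles the deferred merge correctly.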
Fixed upstream: https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=40701af9696a302c904fad30951385eb5a5adb85
Adding QA ACK for 7.3. Once verified the test case might be modified to not use 'vgchange --refresh' as mentioned in Comment #5.
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
It happens if you reboot as well. lvm2-2.02.130-5.el7.x86_64 and kernel lvm2-2.02.130-5.el7.x86_64
Verified with latest rpms. Also tested manually with real reboot, since the scenario shown below simulates it by vgchange --sysinit and --refresh. The fix does not allow us to remove the 'vgchange --refresh' part (mentioned above).

Automated:

SCENARIO - [reboot_before_thin_snap_merge_starts]
Attempt to merge an inuse snapshot, then "reboot" the machine before the merge can take place
Making pool volume
lvcreate --thinpool POOL -L 4G --profile thin-performance --zero y --poolmetadatasize 4M snapper_thinp
Sanity checking pool device (POOL) metadata
examining superblock
examining devices tree
examining mapping tree
checking space map counts
Making origin volume
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n origin
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other1
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other2
lvcreate -V 1G -T snapper_thinp/POOL -n other3
lvcreate -V 1G -T snapper_thinp/POOL -n other4
  WARNING: Sum of all thin volume sizes (5.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other5
  WARNING: Sum of all thin volume sizes (6.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
Placing an xfs filesystem on origin volume
Mounting origin volume
Making snapshot of origin volume
lvcreate -k n -s /dev/snapper_thinp/origin -n merge_reboot
Mounting snap volume
Attempt to merge snapshot snapper_thinp/merge_reboot
lvconvert --merge snapper_thinp/merge_reboot --yes
  Logical volume snapper_thinp/merge_reboot contains a filesystem in use.
umount and deactivate volume group
vgchange --sysinit -ay snapper_thinp
vgchange --refresh snapper_thinp
Check if snapshot merged successfully.
  Failed to find logical volume "snapper_thinp/merge_reboot"
OK. Snapshot is not present.
Removing thin origin and other virtual thin volumes
Removing thinpool snapper_thinp/POOL

=======================================
Manual:
Continue from point where vg is deactivated during snapshot merge.

# lvs -a
  LV             VG            Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  ...
  [merge_reboot] snapper_thinp Swi---t--- 1.00g POOL origin
  origin         snapper_thinp Owi---t--- 1.00g POOL
  ...
# reboot
...
# lvs -a
  LV     VG            Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  ...
  origin snapper_thinp Vwi-a-t--- 1.00g POOL        0.37
  ...

Tested with:
3.10.0-475.el7.x86_64
lvm2-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
lvm2-libs-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
lvm2-cluster-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-libs-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-event-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-event-libs-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 12:29:13 CEST 2016
cmirror-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1445.html