| Summary: | LVM snapshot does not get deleted after merging the snapshot on LVs that could not be unmounted; the system needs to be rebooted for the snapshot to get merged. | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nitin Yewale <nyewale> | |
| Component: | lvm2 | Assignee: | Ondrej Kozina <okozina> | |
| lvm2 sub component: | Snapshots | QA Contact: | cluster-qe <cluster-qe> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | medium | |||
| Priority: | high | CC: | agk, aperotti, cmarthal, ealcaniz, heinzm, jbrassow, jmagrini, msnitzer, nkshirsa, nyewale, prajnoha, prockai, rbednar, salmy, ssundarr, zkabelac | |
| Version: | 7.2 | Keywords: | Regression, ZStream | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | lvm2-2.02.152-1.el7 | Doc Type: | Bug Fix | |
| Doc Text: |
Due to a bug (regression), lvm2 was unable to remove successfully merged snapshot LVs during autoactivation of logical volumes. Typically this occurred on system boot when the lvmetad caching daemon was enabled (which is the default).
With this fix applied, snapshot LVs are again correctly removed and the workaround mentioned in the bugzilla is no longer needed.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1328799 (view as bug list) | Environment: | ||
| Last Closed: | 2016-11-04 04:13:51 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1203710, 1295577, 1313485, 1328799 | |||
|
Description
Nitin Yewale
2016-01-04 21:07:51 UTC
Description of problem:
-------------------------
LVM snapshot does not get deleted after merging the snapshot on LVs that could not be ***unmounted***, and the system needs to be rebooted for the snapshot to get merged. For example, the `/var` LV.

-------------------
s/mounted/unmounted

Could it be that the lvm2-monitor service wasn't running until after the merge completed?

Hi Nitin,

Could you please verify that 'vgchange -ay rhel' or 'lvchange -ay rhel/var' is enough to fix this issue after the reboot (or after unmounting the fs residing on top of the origin volume rhel/var)? It may be that the lvm2-monitor service restart fixes it only as a side effect of actually rerunning the vgchange/lvchange command internally.

Also, could you try to reproduce it (the whole reproducer) with 'use_lvmpolld = 0' in the /etc/lvm/lvm.conf file? (Anyway, I'm going to try to reproduce it locally myself.)

Reproduced locally. The lvm command fails to query the status of the kernel target in the case when the actual snapshot merge had to be postponed until the origin LV was unmounted (or the origin LV open count equals 0). If you're not comfortable with an lvm2-monitor service restart, you can trigger the snapshot LV cleanup by deactivating and reactivating the origin LV (with lvchange -an, lvchange -ay).

What's not yet clear to me is why this doesn't work after a full system restart. The bug manifests with or without lvmpolld. I'll add a full analysis tomorrow.

We have this very test case as a part of our snapshot regression suite; however, we are masking/hacking around this problem by performing a refresh to remove the merged snapshot.
[root@host-109 ~]# lvs -a -o +devices
  LV             VG      Attr       LSize Pool Origin Data%  Devices
  [merge_reboot] snapper Swi-a-s--- 1.00g      origin 0.00   /dev/sde1(1024)
  origin         snapper Owi-a-s--- 4.00g                    /dev/sde1(0)
[root@host-109 ~]# vgchange --refresh snapper
[root@host-109 ~]# lvs -a -o +devices
  LV     VG      Attr       LSize Pool Origin Data% Devices
  origin snapper -wi-a----- 4.00g                   /dev/sde1(0)

We'll have to test w/o that once this gets fixed?

3.10.0-327.el7.x86_64
lvm2-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-libs-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-cluster-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-persistent-data-0.5.5-1.el7    BUILT: Thu Aug 13 09:58:10 CDT 2015

Hi,
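The two workarounds discussed above (refreshing the VG, or deactivating and reactivating the origin LV) can be collected into a small shell sketch. This is a hedged illustration, not a supported fix: the VG and LV names (`snapper`, `origin`, `merge_reboot`) are placeholders taken from the output above.

```shell
#!/bin/bash
# Workaround sketch: trigger the deferred cleanup of a merged-but-lingering
# snapshot LV without restarting lvm2-monitor. Names are placeholders.
VG=snapper
ORIGIN=origin
SNAP=merge_reboot

# Hidden/merging LVs show up bracketed in 'lvs -a' output, e.g. [merge_reboot];
# strip spaces and brackets before matching the name exactly.
if lvs -a --noheadings -o lv_name "$VG" | tr -d ' []' | grep -qx "$SNAP"; then
    # Either refresh the whole VG (the "saner" workaround) ...
    vgchange --refresh "$VG"
    # ... or deactivate/reactivate the origin LV instead:
    # lvchange -an "$VG/$ORIGIN" && lvchange -ay "$VG/$ORIGIN"
fi
```

Requires root and an affected VG; the `lvs`/`vgchange` calls are no-ops to verify on a machine without the stale snapshot.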
it's more complicated than I thought in the beginning. First of all, I found the commit responsible for the regression:
-----
commit c26d81d6e6939906729d91fae83cd8bbdd743bb7
Author: Ondrej Kozina <okozina> <----!!!------
Date: Wed Apr 8 12:05:14 2015 +0200
toollib: do not spawn polling in lv_change_activate
spawning a background polling from within the lv_change_activate
fn went to two problems:
1) vgchange should not spawn any background polling until after
the whole activation process for a VG is finished. Otherwise
it could lead to a duplicite request for spawning background
polling. This statement was alredy true with one exception of
mirror up-conversion polling (fixed by this commit).
2) due to current conditions in lv_change_activate lvchange cmd
couldn't start background polling for pvmove LVs if such LV was
about to get activated by the command in the same time.
This commit however doesn't alter the lvchange cmd so that it works same as
vgchange with regard to not to spawn duplicate background pollings per
unique LV.
----
Unfortunately I can't simply revert it, because I would reintroduce the bug it was supposed to fix.
What went wrong: this commit breaks snapshot merge on autoactivation during device discovery at boot. (This is the reason the snapshot will not get removed after a reboot.) Autoactivation works only with lvmetad enabled. To test this regression you can simply run the following:
0) have lvmetad enabled in lvm.conf
1) create VG on single device (i.e.: sdx)
2) create origin lv
3) mount lv
4) create snapshot 'snap'
5) write some data to mounted origin lv
6) call lvconvert --merge vg/snap (you'll get the warning about deferred merge until open count == 0)
7) umount origin lv
8) deactivate whole vg
9) call pvscan --cache -aay major:minor (of sdx)
this will simulate the autoactivation bug the customer has experienced.
expected result: origin lv in a VG is active and snapshot lv is removed after some time.
Now the harder part. I strongly suspect it's not the only bug related to snapshot merge. For example, when I call 'vgchange -ay vg' while the 'vg' is still active, I receive errors in the lvmpolld log about not being able to query the snapshot merge state.
And yes the lvchange --refresh vg/origin is much saner workaround for the time being. Thanks Corey!
Fixed upstream: https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=40701af9696a302c904fad30951385eb5a5adb85

Adding QA ACK for 7.3. Once verified, the test case might be modified to not use 'vgchange --refresh' as mentioned in Comment #5.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

It happens if you reboot as well. lvm2-2.02.130-5.el7.x86_64.

Verified with the latest rpms.
Also tested manually with real reboot, since the scenario shown below simulates it by vgchange --sysinit and --refresh.
The fix does not allow us to remove the 'vgchange --refresh' part (mentioned above).
Automated:
SCENARIO - [reboot_before_thin_snap_merge_starts]
Attempt to merge an inuse snapshot, then "reboot" the machine before the merge can take place
Making pool volume
lvcreate --thinpool POOL -L 4G --profile thin-performance --zero y --poolmetadatasize 4M snapper_thinp
Sanity checking pool device (POOL) metadata
examining superblock
examining devices tree
examining mapping tree
checking space map counts
Making origin volume
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n origin
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other1
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other2
lvcreate -V 1G -T snapper_thinp/POOL -n other3
lvcreate -V 1G -T snapper_thinp/POOL -n other4
WARNING: Sum of all thin volume sizes (5.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other5
WARNING: Sum of all thin volume sizes (6.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
Placing an xfs filesystem on origin volume
Mounting origin volume
Making snapshot of origin volume
lvcreate -k n -s /dev/snapper_thinp/origin -n merge_reboot
Mounting snap volume
Attempt to merge snapshot snapper_thinp/merge_reboot
lvconvert --merge snapper_thinp/merge_reboot --yes
Logical volume snapper_thinp/merge_reboot contains a filesystem in use.
umount and deactivate volume group
vgchange --sysinit -ay snapper_thinp
vgchange --refresh snapper_thinp
Check if snapshot merged successfully.
Failed to find logical volume "snapper_thinp/merge_reboot"
OK. Snapshot is not present.
Removing thin origin and other virtual thin volumes
Removing thinpool snapper_thinp/POOL
=======================================
Manual:
Continue from point where vg is deactivated during snapshot merge.
# lvs -a
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
...
[merge_reboot] snapper_thinp Swi---t--- 1.00g POOL origin
origin snapper_thinp Owi---t--- 1.00g POOL
...
# reboot
...
# lvs -a
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
...
origin snapper_thinp Vwi-a-t--- 1.00g POOL 0.37
...
Tested with:
3.10.0-475.el7.x86_64
lvm2-2.02.162-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
lvm2-libs-2.02.162-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
lvm2-cluster-2.02.162-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-1.02.132-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-libs-1.02.132-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-event-1.02.132-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-event-libs-1.02.132-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-persistent-data-0.6.3-1.el7 BUILT: Fri Jul 22 12:29:13 CEST 2016
cmirror-2.02.162-1.el7 BUILT: Fri Jul 29 09:26:36 CEST 2016
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1445.html |