Bug 1295562 - LVM snapshot does not get deleted after merging the snapshot on LVs that could not be unmounted and system needs to be rebooted for the snapshot to get merged.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.2
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Ondrej Kozina
QA Contact: cluster-qe@redhat.com
Keywords: Regression, ZStream
Depends On:
Blocks: 1203710 1313485 1295577 1328799
Reported: 2016-01-04 16:07 EST by Nitin Yewale
Modified: 2016-11-04 00:13 EDT

See Also:
Fixed In Version: lvm2-2.02.152-1.el7
Doc Type: Bug Fix
Doc Text:
Due to a regression, lvm2 was unable to remove successfully merged snapshot LVs during autoactivation of logical volumes. Typically this occurred on system boot when the lvmetad caching daemon was enabled (which is the default). With this fix applied, snapshot LVs are again correctly removed and the workaround mentioned in this bug report is no longer needed.
Story Points: ---
Clone Of:
Clones: 1328799
Environment:
Last Closed: 2016-11-04 00:13:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2111861 None None None 2016-01-04 16:31 EST
Red Hat Product Errata RHBA-2016:1445 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2016-11-03 09:46:41 EDT

Description Nitin Yewale 2016-01-04 16:07:51 EST
Description of problem:
-------------------------

LVM snapshot does not get deleted after merging the snapshot on LVs that could not be mounted and system needs to be rebooted for the snapshot to get merged.

For example, the `/var` LV.

We need to restart `lvm2-monitor.service` to remove the snapshot. The merge itself is OK, though.

Version-Release number of selected component (if applicable):
-------------------------

# uname -a
Linux dhcp223.example.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
# rpm -qa |grep lvm2
lvm2-2.02.130-5.el7.x86_64
lvm2-libs-2.02.130-5.el7.x86_64

How reproducible:
-------------------------

Every time

Steps to Reproduce:
-------------------------

# mkdir /var/testdata

# cp /etc/a*  /etc/b* /var/testdata/

# ls -l /var/testdata/
total 32
-rw-r--r--. 1 root root    16 Jan  4 13:33 adjtime
-rw-r--r--. 1 root root  1518 Jan  4 13:33 aliases
-rw-r--r--. 1 root root 12288 Jan  4 13:33 aliases.db
-rw-------. 1 root root   541 Jan  4 13:33 anacrontab
-rw-r--r--. 1 root root    55 Jan  4 13:33 asound.conf
-rw-r--r--. 1 root root  2835 Jan  4 13:34 bashrc
# 


# lvcreate --size 300M --name snap --snapshot rhel/var


Copied some data to /var/testdata


# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root rhel -wi-ao----  24.41g                                                     /dev/sda2(0)   
  snap rhel swi-a-s--- 300.00m      var    0.58                                    /dev/sda3(0)   
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var  rhel owi-aos---   7.81g                                                     /dev/sda2(6250)



# lvconvert --merge rhel/snap
  Logical volume rhel/var contains a filesystem in use.
  Can't merge over open origin volume.
  Merging of snapshot rhel/snap will occur on next activation of rhel/var.


# lvs -a -o +devices
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root   rhel -wi-ao----  24.41g                                                     /dev/sda2(0)   
  [snap] rhel Swi-a-s--- 300.00m      var    100.00                                  /dev/sda3(0)   
  swap   rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var    rhel Owi-aos---   7.81g                                                     /dev/sda2(6250)
# 
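
The change that matters in the listings above is the first character of the Attr column. Per lvs(8), `S` marks a merging snapshot and `O` an origin with a merging snapshot, while lowercase `s` and `o` are an ordinary snapshot and origin. A minimal sketch of a classifier for that character follows; the function name `lv_merge_state` is hypothetical and not part of LVM.

```shell
#!/bin/sh
# lv_merge_state: hypothetical helper, not an LVM command.
# Classifies an lv_attr string (as printed by lvs) by its first character.
lv_merge_state() {
    case "$1" in
        S*) echo "merging-snapshot" ;;
        O*) echo "origin-with-merging-snapshot" ;;
        s*) echo "snapshot" ;;
        o*) echo "origin" ;;
        *)  echo "other" ;;
    esac
}
```

On a live system it could be fed from something like `lvs --noheadings -o lv_attr rhel/var`, after trimming the leading whitespace that lvs prints.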

After reboot

# lvs -a -o +devices
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root   rhel -wi-ao----  24.41g                                                     /dev/sda2(0)   
  [snap] rhel Swi-a-s--- 300.00m      var    0.00                                    /dev/sda3(0)   
  swap   rhel -wi-ao----   1.00g                                                     /dev/sda2(8250)
  var    rhel Owi-aos---   7.81g                                                     /dev/sda2(6250)


# ls -l /var/testdata/
total 32
-rw-r--r--. 1 root root    16 Jan  4 13:33 adjtime
-rw-r--r--. 1 root root  1518 Jan  4 13:33 aliases
-rw-r--r--. 1 root root 12288 Jan  4 13:33 aliases.db
-rw-------. 1 root root   541 Jan  4 13:33 anacrontab
-rw-r--r--. 1 root root    55 Jan  4 13:33 asound.conf
-rw-r--r--. 1 root root  2835 Jan  4 13:34 bashrc


# systemctl restart lvm2-monitor.service

# lvs -a -o +devices
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root rhel -wi-ao---- 24.41g                                                     /dev/sda2(0)   
  swap rhel -wi-ao----  1.00g                                                     /dev/sda2(8250)
  var  rhel -wi-ao----  7.81g                                                     /dev/sda2(6250)



Actual results:

When we do `lvconvert --merge rhel/snap` and reboot the server, the snapshot LV does not get removed and we have to restart lvm2-monitor.service to remove it.

Expected results:

When we do `lvconvert --merge rhel/snap` and reboot the server, the snapshot LV should get removed.

Additional info:

Similar issue is not seen in RHEL7.1

RHEL7.1 

# rpm -qa |grep lvm2
lvm2-2.02.115-3.el7.x86_64
lvm2-libs-2.02.115-3.el7.x86_64

Linux dhcp162.example.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux


# mkdir /var/testdata

# cp /etc/c* /etc/d* /var/testdata/

# lvs -a -o +devices
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root rhel -wi-ao---- 19.53g                                                     /dev/sda2(0)   
  swap rhel -wi-ao----  1.00g                                                     /dev/sda2(6250)
  var  rhel -wi-ao----  4.88g                                                     /dev/sda2(5000)

# ls -l /var/testdata/
total 44
-rw-------. 1 root root     0 Jan  4 15:14 cron.deny
-rw-r--r--. 1 root root   451 Jan  4 15:14 crontab
-rw-------. 1 root root     0 Jan  4 15:14 crypttab
-rw-r--r--. 1 root root  1602 Jan  4 15:14 csh.cshrc
-rw-r--r--. 1 root root   841 Jan  4 15:14 csh.login
-rw-r--r--. 1 root root 25213 Jan  4 15:14 dnsmasq.conf
-rw-r--r--. 1 root root  1285 Jan  4 15:14 dracut.conf



# lvcreate --size 300M --name snap --snapshot rhel/var
  Logical volume "snap" created.

# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root rhel -wi-ao----  19.53g                                                     /dev/sda2(0)   
  snap rhel swi-a-s--- 300.00m      var    0.00                                    /dev/sda3(0)   
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var  rhel owi-aos---   4.88g                                                     /dev/sda2(5000)


# cp -avr /etc/e* /etc/f* /etc/g* /etc/h* /var/testdata/


# ls /var/testdata/
cron.deny  csh.cshrc     dracut.conf  ethertypes   filesystems  gcrypt  group      grub.d    gss        hosts
crontab    csh.login     e2fsck.conf  exports      firewalld    gnupg   group-     gshadow   host.conf  hosts.allow
crypttab   dnsmasq.conf  environment  favicon.png  fstab        groff   grub2.cfg  gshadow-  hostname   hosts.deny


# lvs -a -o +devices
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root rhel -wi-ao----  19.53g                                                     /dev/sda2(0)   
  snap rhel swi-a-s--- 300.00m      var    0.11                                    /dev/sda3(0)   
  swap rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var  rhel owi-aos---   4.88g                                                     /dev/sda2(5000)



# lvconvert --merge rhel/snap
  Logical volume rhel/var contains a filesystem in use.
  Can't merge over open origin volume.
  Merging of snapshot rhel/snap will occur on next activation of rhel/var.


# lvs -a -o +devices
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root   rhel -wi-ao----  19.53g                                                     /dev/sda2(0)   
  [snap] rhel Swi-a-s--- 300.00m      var    100.00                                  /dev/sda3(0)   
  swap   rhel -wi-ao----   1.00g                                                     /dev/sda2(6250)
  var    rhel Owi-aos---   4.88g                                                     /dev/sda2(5000)



After reboot


# lvs -a -o +devices
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices        
  root rhel -wi-ao---- 19.53g                                                     /dev/sda2(0)   
  swap rhel -wi-ao----  1.00g                                                     /dev/sda2(6250)
  var  rhel -wi-ao----  4.88g                                                     /dev/sda2(5000)


So this looks like a regression.
Comment 1 Nitin Yewale 2016-01-04 16:22:56 EST
Description of problem:
-------------------------

LVM snapshot does not get deleted after merging the snapshot on LVs that could not be ***unmounted*** and system needs to be rebooted for the snapshot to get merged.

For example `/var` LV. 

-------------------
s/mounted/unmounted
Comment 2 Mike Snitzer 2016-01-04 16:57:58 EST
Could it be that the lvm2-monitor service wasn't running until after the merge completed?
Comment 3 Ondrej Kozina 2016-01-05 10:27:39 EST
Hi Nitin,

Could you please verify that 'vgchange -ay rhel' or 'lvchange -ay rhel/var' is enough to fix this issue after the reboot (or after unmounting the fs residing on top of the origin volume rhel/var)? It may be that the lvm2-monitor service restart fixes it only as a side effect of rerunning the vgchange/lvchange command internally.

Also could you try to reproduce it (the whole reproducer) with 'use_lvmpolld = 0' in /etc/lvm/lvm.conf file?
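
Before rerunning the reproducer it can help to confirm what the configuration file actually says. A rough sketch, assuming the setting appears as a plain `name = value` line; `lvm_conf_value` is a hypothetical helper that does a text match only and is not a real lvm.conf parser.

```shell
#!/bin/sh
# lvm_conf_value: hypothetical helper, not part of LVM.
# Prints the value of the last uncommented "SETTING = value" line in an
# lvm.conf-style file. Sections are ignored, so the same setting name in
# two sections would not be told apart.
lvm_conf_value() {   # usage: lvm_conf_value SETTING [FILE]
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" \
        "${2:-/etc/lvm/lvm.conf}" | tail -n 1
}
```

For example, `lvm_conf_value use_lvmpolld` would print `0` once the setting requested above is in place.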

(anyway I'm going to try to reproduce it locally myself)
Comment 4 Ondrej Kozina 2016-01-05 12:22:12 EST
Reproduced locally. lvm command fails to query status of kernel target in a case when actual snapshot merge had to be postponed until the origin LV was unmounted (or origin LV open count equals 0).

If you're not comfortable with restarting the lvm2-monitor service, you can trigger the snapshot LV cleanup by deactivating and then reactivating the origin LV (with lvchange -an, lvchange -ay). What's not yet clear to me is why this doesn't work after a full system restart.

The bug manifests with or without lvmpolld.

I'll add full analysis tomorrow.
Comment 5 Corey Marthaler 2016-01-05 13:15:00 EST
We have this very test case as a part of our snapshot regression suite; however, we are masking/hacking around this problem by performing a refresh to remove the merged snapshot.

[root@host-109 ~]# lvs -a -o +devices
  LV             VG       Attr       LSize   Pool Origin Data% Devices        
  [merge_reboot] snapper  Swi-a-s---   1.00g      origin 0.00  /dev/sde1(1024)
  origin         snapper  Owi-a-s---   4.00g                   /dev/sde1(0)   

[root@host-109 ~]# vgchange --refresh snapper

[root@host-109 ~]# lvs -a -o +devices
  LV     VG       Attr       LSize   Pool Origin Data% Devices       
  origin snapper  -wi-a-----   4.00g                   /dev/sde1(0)  

We'll have to test w/o that once this gets fixed?


3.10.0-327.el7.x86_64
lvm2-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-libs-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-cluster-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-persistent-data-0.5.5-1.el7    BUILT: Thu Aug 13 09:58:10 CDT 2015
Comment 6 Ondrej Kozina 2016-01-06 11:27:39 EST
Hi,

it's more complicated than I thought in the beginning. First of all, I found the commit responsible for the regression:

-----
commit c26d81d6e6939906729d91fae83cd8bbdd743bb7
Author: Ondrej Kozina <okozina@redhat.com>      <----!!!------
Date:   Wed Apr 8 12:05:14 2015 +0200

    toollib: do not spawn polling in lv_change_activate
    
    spawning a background polling from within the lv_change_activate
    fn went to two problems:
    
    1) vgchange should not spawn any background polling until after
       the whole activation process for a VG is finished. Otherwise
       it could lead to a duplicite request for spawning background
       polling. This statement was alredy true with one exception of
       mirror up-conversion polling (fixed by this commit).
    
    2) due to current conditions in lv_change_activate lvchange cmd
       couldn't start background polling for pvmove LVs if such LV was
       about to get activated by the command in the same time.
    
    This commit however doesn't alter the lvchange cmd so that it works same as
    vgchange with regard to not to spawn duplicate background pollings per
    unique LV.
----

Unfortunately I can't simply revert it because I would reintroduce the bug it was supposed to fix.

What went wrong: This commit breaks snapshot merge on autoactivation during device discovery on boot. (This is the reason the snapshot does not get removed after reboot.) The autoactivation works only with lvmetad enabled. To test this regression you can simply run the following:

0) have lvmetad enabled in lvm.conf
1) create VG on single device (i.e.: sdx)
2) create origin lv
3) mount lv
4) create snapshot 'snap'
5) write some data to mounted origin lv
6) call lvconvert --merge vg/snap (you'll get the warning about deferred merge until open count == 0)
7) umount origin lv
8) deactivate whole vg
9) call pvscan --cache -aay major:minor (of sdx)

this will simulate the bug on autoactivation the customer has experienced.

expected result: origin lv in a VG is active and snapshot lv is removed after some time.
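
The steps above can be sketched as a script. This is a dry-run sketch under stated assumptions, not the exact commands used in the analysis: the VG and LV names, the sizes, the xfs filesystem, the mount point and the `dd` write are all illustrative choices, and step 0 (lvmetad enabled) is assumed to be done already. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of reproducer steps 1-9. Names, sizes, filesystem and
# mount point are assumptions. Setting DRY_RUN=0 executes the commands,
# which is destructive for DEVICE and requires root.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

reproduce() {   # usage: reproduce DEVICE MAJOR:MINOR
    dev=$1; majmin=$2; vg=vg; mnt=/mnt/origin
    run vgcreate "$vg" "$dev"                                     # step 1
    run lvcreate --name origin --size 1G "$vg"                    # step 2
    run mkfs.xfs "/dev/$vg/origin"
    run mkdir -p "$mnt"
    run mount "/dev/$vg/origin" "$mnt"                            # step 3
    run lvcreate --snapshot --name snap --size 300M "$vg/origin"  # step 4
    run dd if=/dev/zero of="$mnt/data" bs=1M count=10             # step 5
    run lvconvert --merge "$vg/snap"                              # step 6
    run umount "$mnt"                                             # step 7
    run vgchange -an "$vg"                                        # step 8
    run pvscan --cache -aay "$majmin"                             # step 9
}
```

Calling `reproduce /dev/sdx 8:16` prints the eleven commands in order; the major:minor pair has to match the real device.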

Now the harder thing. I strongly suspect it's not the only bug related to snapshot merge. For example, when I call vgchange -ay vg while the 'vg' is still active, I receive errors in the lvmpolld log about not being able to query the snapshot merge state.

And yes, lvchange --refresh vg/origin is a much saner workaround for the time being. Thanks Corey!
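
The two refresh workarounds mentioned in this thread can be wrapped as follows. This is a sketch only: the function names and the `DRY_RUN` switch are hypothetical conveniences, and the commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the refresh workarounds from comments 5 and 6.
# Prints the command by default; DRY_RUN=0 executes it (requires root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Refresh only the origin LV (comment 6).
refresh_origin() {   # usage: refresh_origin VG LV
    run lvchange --refresh "$1/$2"
}

# Refresh every LV in the VG (comment 5).
refresh_vg() {       # usage: refresh_vg VG
    run vgchange --refresh "$1"
}
```

For the reporter's system this would be `refresh_origin rhel var`.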
Comment 9 Roman Bednář 2016-02-10 08:46:46 EST
Adding QA ACK for 7.3. 

Once verified the test case might be modified to not use 'vgchange --refresh' as mentioned in Comment #5.
Comment 10 Mike McCune 2016-03-28 19:14:23 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 12 Edu Alcaniz 2016-05-03 03:39:42 EDT
It happens if you reboot as well. 

lvm2-2.02.130-5.el7.x86_64
and kernel lvm2-2.02.130-5.el7.x86_64
Comment 17 Roman Bednář 2016-08-03 08:14:33 EDT
Verified with latest rpms. 

Also tested manually with real reboot, since the scenario shown below simulates it by vgchange --sysinit and --refresh.
The fix does not allow us to remove the 'vgchange --refresh' part (mentioned above).


Automated:

SCENARIO - [reboot_before_thin_snap_merge_starts]
Attempt to merge an inuse snapshot, then "reboot" the machine before the merge can take place
Making pool volume
lvcreate  --thinpool POOL -L 4G --profile thin-performance --zero y --poolmetadatasize 4M snapper_thinp

Sanity checking pool device (POOL) metadata
examining superblock
examining devices tree
examining mapping tree
checking space map counts


Making origin volume
lvcreate  --virtualsize 1G -T snapper_thinp/POOL -n origin
lvcreate  --virtualsize 1G -T snapper_thinp/POOL -n other1
lvcreate  --virtualsize 1G -T snapper_thinp/POOL -n other2
lvcreate  -V 1G -T snapper_thinp/POOL -n other3
lvcreate  -V 1G -T snapper_thinp/POOL -n other4
  WARNING: Sum of all thin volume sizes (5.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
lvcreate  --virtualsize 1G -T snapper_thinp/POOL -n other5
  WARNING: Sum of all thin volume sizes (6.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
Placing an xfs filesystem on origin volume
Mounting origin volume

Making snapshot of origin volume
lvcreate  -k n -s /dev/snapper_thinp/origin -n merge_reboot
Mounting snap volume

Attempt to merge snapshot snapper_thinp/merge_reboot
lvconvert --merge snapper_thinp/merge_reboot --yes
  Logical volume snapper_thinp/merge_reboot contains a filesystem in use.

umount and deactivate volume group
vgchange --sysinit -ay snapper_thinp
vgchange --refresh snapper_thinp
Check if snapshot merged successfully.
  Failed to find logical volume "snapper_thinp/merge_reboot"
OK. Snapshot is not present.
Removing thin origin and other virtual thin volumes
Removing thinpool snapper_thinp/POOL
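
The "Check if snapshot merged successfully" step can be approximated by polling until the snapshot LV can no longer be found. A minimal sketch, assuming `lvs` exits non-zero when the named LV does not exist (consistent with the "Failed to find logical volume" message above); `wait_lv_gone` and the `LVS_CMD` override are hypothetical and are not taken from the QE test suite.

```shell
#!/bin/sh
# wait_lv_gone: hypothetical helper, not part of LVM or the test suite.
# Polls once per second until "lvs -a VG/LV" fails, which is taken to
# mean the LV no longer exists. LVS_CMD may be overridden for testing.
wait_lv_gone() {   # usage: wait_lv_gone VG/LV [TIMEOUT_SECONDS]
    lv=$1; left=${2:-60}
    while ${LVS_CMD:-lvs -a} "$lv" >/dev/null 2>&1; do
        [ "$left" -le 0 ] && return 1   # still present after the timeout
        sleep 1
        left=$((left - 1))
    done
    return 0
}
```

A call such as `wait_lv_gone snapper_thinp/merge_reboot 30` returns 0 as soon as the snapshot is gone and 1 if it is still listed after the timeout.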

=======================================
Manual:

Continue from point where vg is deactivated during snapshot merge.

# lvs -a
  LV              VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  ...
  [merge_reboot]  snapper_thinp Swi---t---   1.00g POOL origin                                        
  origin          snapper_thinp Owi---t---   1.00g POOL                                               
  ...
                                              
# reboot
...

# lvs -a
  LV              VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  ...                                   
  origin          snapper_thinp Vwi-a-t---   1.00g POOL        0.37   

                                
  ... 


Tested with:
3.10.0-475.el7.x86_64

lvm2-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
lvm2-libs-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
lvm2-cluster-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-libs-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-event-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-event-libs-1.02.132-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 12:29:13 CEST 2016
cmirror-2.02.162-1.el7    BUILT: Fri Jul 29 09:26:36 CEST 2016
Comment 21 errata-xmlrpc 2016-11-04 00:13:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1445.html
