Bug 1536015

Summary: locking error when converting raid1 volumes to cache pool
Product: Red Hat Enterprise Linux 7
Reporter: Roman Bednář <rbednar>
Component: lvm2
Assignee: LVM and device-mapper development team <lvm-team>
lvm2 sub component: Cache Logical Volumes
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: agk, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, rbednar, rhandlin, zkabelac
Version: 7.5
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: lvm2-2.02.177-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 15:23:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Roman Bednář 2018-01-18 12:56:54 UTC
Converting raid1 "slow" and raid1 "fast" volumes to a cache pool occasionally fails with a locking error.


This does not seem to be 100% reproducible, although there are scenarios that should eventually trigger the bug.
Unfortunately I cannot attach verbose lvconvert output, since I have not been able to reproduce the failure again yet.


=============================================
SCENARIO - [many_caches]
 Create 500 cached volumes
 
 *** Cache info for this scenario ***
 *  origin (slow):  /dev/sdb1 /dev/sdf1
 *  pool (fast):    /dev/sdc1 /dev/sdd1
 ************************************
 
 Recreating VG and PVs to increase metadata size
 Adding "slow" and "fast" tags to corresponding pvs
 Create origin (slow) volume
 lvcreate --activate ey --type raid1 -m 1 -L 20M -n origin_1 cache_sanity @slow
 Waiting until all mirror|raid volumes become fully syncd...
    1/1 mirror(s) are fully synced: ( 100.00% )
 Sleeping 15 sec
 
 Create cache data and cache metadata (fast) volumes
 lvcreate --activate ey --type raid1 -m 1 -L 10M -n pool_1 cache_sanity @fast
 lvcreate --activate ey --type raid1 -m 1 -L 12M -n pool_1_meta cache_sanity @fast
 Waiting until all mirror|raid volumes become fully syncd...
    2/2 mirror(s) are fully synced: ( 100.00% 100.00% )
 Sleeping 15 sec
 Sleeping 15 sec
 
 Create cache pool volume by combining the cache data and cache metadata (fast) volumes with policy: cleaner  mode: writethrough
 lvconvert --yes --type cache-pool --cachepolicy cleaner --cachemode writethrough -c 32 --poolmetadata cache_sanity/pool_1_meta cache_sanity/pool_1
   WARNING: Converting cache_sanity/pool_1 and cache_sanity/pool_1_meta to cache pool's data and metadata volumes with metadata wiping.
   THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
 Create cached volume by combining the cache pool (fast) and origin (slow) volumes
 lvconvert --yes --type cache --cachemetadataformat 1 --cachepool cache_sanity/pool_1 cache_sanity/origin_1
 dmsetup status | grep cache_sanity-origin_1 | grep writethrough | grep -w cleaner
 
 
 Create origin (slow) volume
 lvcreate --activate ey --type raid1 -m 1 -L 20M -n origin_2 cache_sanity @slow
 Waiting until all mirror|raid volumes become fully syncd...
    1/1 mirror(s) are fully synced: ( 100.00% )
 Sleeping 15 sec
 
 Create cache data and cache metadata (fast) volumes
 lvcreate --activate ey --type raid1 -m 1 -L 10M -n pool_2 cache_sanity @fast
 lvcreate --activate ey --type raid1 -m 1 -L 12M -n pool_2_meta cache_sanity @fast
 Waiting until all mirror|raid volumes become fully syncd...
    2/2 mirror(s) are fully synced: ( 100.00% 100.00% )
 Sleeping 15 sec
 Sleeping 15 sec
 
 Create cache pool volume by combining the cache data and cache metadata (fast) volumes with policy: mq  mode: writethrough
 lvconvert --yes --type cache-pool --cachepolicy mq --cachemode writethrough -c 32 --poolmetadata cache_sanity/pool_2_meta cache_sanity/pool_2
   WARNING: Converting cache_sanity/pool_2 and cache_sanity/pool_2_meta to cache pool's data and metadata volumes with metadata wiping.
   THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
   Error locking on node 1: Command timed out
   Aborting. Failed to deactivate metadata lv. Manual intervention required.
 couldn't create combined cache pool volume

=============================================
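
A minimal sketch of one iteration of the scenario above, for anyone trying to reproduce this outside the test harness. The function name emit_iteration is hypothetical; the commands and options are copied from the log. The harness's "wait until synced" step is not shown in the log, so a plain sleep stands in for it here, which is an assumption. The function only prints the commands and runs nothing itself.

```sh
# Hypothetical helper: print (do not run) the commands for iteration N of the
# many_caches scenario. Usage: emit_iteration N POLICY  (e.g. cleaner or mq)
# The body runs in a subshell so the variables stay local to the function.
emit_iteration() (
    i=$1 policy=$2 vg=cache_sanity
    # The sleeps are an assumed stand-in for the harness's raid sync wait.
    cat <<EOF
lvcreate --activate ey --type raid1 -m 1 -L 20M -n origin_$i $vg @slow
sleep 15
lvcreate --activate ey --type raid1 -m 1 -L 10M -n pool_$i $vg @fast
lvcreate --activate ey --type raid1 -m 1 -L 12M -n pool_${i}_meta $vg @fast
sleep 15
lvconvert --yes --type cache-pool --cachepolicy $policy --cachemode writethrough -c 32 --poolmetadata $vg/pool_${i}_meta $vg/pool_$i
lvconvert --yes --type cache --cachemetadataformat 1 --cachepool $vg/pool_$i $vg/origin_$i
EOF
)
```

On a scratch cluster VG set up as in the log (tagged PVs, VG named cache_sanity), the output of something like `emit_iteration 2 mq` could be piped to `sh -e` inside a loop that stops at the first failing lvconvert. These commands destroy the content of the volumes involved, so this is only meant for a test machine.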

### Layout after the failed lvconvert attempt (retrying the conversion manually afterwards succeeded)

# lvs -a
  LV                        VG             Attr       LSize   Pool     Origin           Data%  Meta%  Move Log Cpy%Sync Convert
  [...]                                                               
  pool_2                    cache_sanity   rwi-a-r---  12.00m                                                  100.00          
  pool_2_meta               cache_sanity   rwi---r---  12.00m                                                                  
  [pool_2_meta_rimage_0]    cache_sanity   Iwi---r---  12.00m                                                                  
  [pool_2_meta_rimage_1]    cache_sanity   Iwi---r---  12.00m                                                                  
  [pool_2_meta_rmeta_0]     cache_sanity   ewi---r---   4.00m                                                                  
  [pool_2_meta_rmeta_1]     cache_sanity   ewi---r---   4.00m                                                                  
  [pool_2_rimage_0]         cache_sanity   iwi-aor---  12.00m                                                                  
  [pool_2_rimage_1]         cache_sanity   iwi-aor---  12.00m                                                                  
  [pool_2_rmeta_0]          cache_sanity   ewi-aor---   4.00m                                                                  
  [pool_2_rmeta_1]          cache_sanity   ewi-aor---   4.00m                                                                  
  [...]                           
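
For reading the listing above: according to lvs(8), the fifth character of the Attr field is the LV state, with "a" meaning active. pool_2 and its sub-LVs show "a" there, while pool_2_meta and its sub-LVs do not, at least as seen from the node where lvs was run. A small sketch for picking out the non-active LVs follows; the function name list_inactive is hypothetical, and it expects "name attr" pairs such as those printed by `lvs -a --noheadings -o lv_name,lv_attr cache_sanity`.

```sh
# Hypothetical helper: read "LV_NAME LV_ATTR" pairs on stdin and print the
# names of LVs whose state character (5th character of lv_attr) is not 'a'.
list_inactive() {
    while read -r name attr; do
        # Skip blank or incomplete lines.
        [ -n "$attr" ] || continue
        case $attr in
            ????a*) ;;                      # active: nothing to report
            *) printf '%s\n' "$name" ;;     # any other state
        esac
    done
}
```

Only the local view is covered by this check; it says nothing about the activation state on other cluster nodes.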


=============================================
3.10.0-829.el7.x86_64

lvm2-2.02.176-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
lvm2-libs-2.02.176-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
lvm2-cluster-2.02.176-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
lvm2-python-boom-0.8.1-5.el7    BUILT: Wed Dec  6 11:15:40 CET 2017
cmirror-2.02.176-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
device-mapper-1.02.145-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
device-mapper-libs-1.02.145-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
device-mapper-event-1.02.145-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
device-mapper-event-libs-1.02.145-5.el7    BUILT: Wed Dec  6 11:13:07 CET 2017
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 12:07:18 CET 2017
vdo-6.1.0.98-13    BUILT: Tue Dec 12 14:35:03 CET 2017
kmod-kvdo-6.1.0.98-11.el7    BUILT: Tue Dec 12 15:08:52 CET 2017

Comment 2 Zdenek Kabelac 2018-01-18 17:10:33 UTC
Assuming the same patch as for Bug 1533932 fixes this issue as well.

https://www.redhat.com/archives/lvm-devel/2018-January/msg00049.html

Comment 7 Zdenek Kabelac 2018-02-08 12:50:31 UTC
I don't think it's the same case here.

Please open a new BZ for this case.


Using 'mirror' without dmeventd is a risky thing on its own.


A cluster requires that the nodes have no leaked 'devices' occupying a DM device.

So for full exploration we need to get traces & tables from all nodes.
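
One way to gather the device-mapper tables mentioned above on a single node is sketched below. The function name collect_dm_state and the output file names are hypothetical. It would have to be run as root on every node, and it does not capture the command trace itself, which would come from rerunning the failing lvconvert with -vvvv.

```sh
# Hypothetical helper: save the local node's device-mapper and LVM state into
# a directory so it can be attached to the bug. Usage: collect_dm_state DIR
# The body runs in a subshell so 'out' does not leak into the caller.
collect_dm_state() (
    out=${1:?usage: collect_dm_state DIR}
    mkdir -p "$out" || exit 1
    # Errors are written into the same files so failures are visible too.
    dmsetup table      > "$out/dmsetup-table.txt"  2>&1
    dmsetup status     > "$out/dmsetup-status.txt" 2>&1
    dmsetup info -c    > "$out/dmsetup-info.txt"   2>&1
    lvs -a -o +devices > "$out/lvs.txt"            2>&1
)
```

For example, `collect_dm_state /tmp/bz-$(hostname)` keeps the files from different nodes apart.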

Comment 8 Roman Bednář 2018-02-08 13:13:33 UTC
OK, thank you for the explanation. Opening a separate BZ#1543429 to track this.

Just a note: dmeventd was actually running before (and after) the lvcreate.

Comment 11 errata-xmlrpc 2018-04-10 15:23:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853