Bug 1014758

Summary: LVM RAID + thin do not work in a cluster (even when activated exclusively)
Product: Red Hat Enterprise Linux 7 Reporter: Jonathan Earl Brassow <jbrassow>
Component: lvm2Assignee: Heinz Mauelshagen <heinzm>
lvm2 sub component: Thin Provisioning QA Contact: cluster-qe <cluster-qe>
Status: CLOSED WONTFIX Docs Contact: Milan Navratil <mnavrati>
Severity: medium    
Priority: medium CC: agk, cmarthal, dwysocha, heinzm, jbrassow, me, msnitzer, pasik, prajnoha, slevine, thornber, tscherf, zkabelac
Version: 7.3Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
No support for thin provisioning on top of RAID in a cluster While RAID logical volumes and thinly provisioned logical volumes can be used in a cluster when activated exclusively, there is currently no support for thin provisioning on top of RAID in a cluster. This is the case even if the combination is activated exclusively. Currently this combination is only supported in LVM's single machine non-clustered mode.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-28 20:16:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1075263    

Description Jonathan Earl Brassow 2013-10-02 17:15:19 UTC
We have expressed that thinpools should be underpinned by some redundant device, like a RAID LV.  This combination does not work in a cluster, however.  There are two issues:

1) Inability to query a cluster lock on a sub-LV.
The RAID LV becomes a sub-LV when it is under a thinpool.  There are no cluster locks associated with sub-LVs.  Thus, the tests used to identify whether the RAID LV is active exclusively or not fail.

2) Suspend/Resume only operate on LVs for which there is an associated lock.
Since RAID LVs under a thinpool have no lock associated with them, they cannot be suspended/resumed independently of the top-level/locked LV.

This means that operations that require exclusive activation checks or suspend/resume will fail.  Things that are know to fail at the moment are:
- Replacing RAID devices (for repair or otherwise)
- Refreshing (i.e. 'lvchange --refresh vg/lv')
- Changes to writemostly, writebehind and sync I/O throttling
Things that will still work:
- Scrubbing

Comment 1 Jonathan Earl Brassow 2013-10-02 17:18:04 UTC
Here is one example where the command fails because it cannot determine the exclusive nature of the activation of the underlying RAID LV:

[root@bp-xen-01 ~]# devices vg
  LV                  Attr       Cpy%Sync Devices                                  
  lv                  twi-a-tz--          lv_tdata(0)                              
  [lv_tdata]          rwi-aor---   100.00 lv_tdata_rimage_0(0),lv_tdata_rimage_1(0)
  [lv_tdata_rimage_0] iwi-aor---          /dev/sda1(1)                             
  [lv_tdata_rimage_1] iwi-aor---          /dev/sdb1(1)                             
  [lv_tdata_rmeta_0]  ewi-aor---          /dev/sda1(0)                             
  [lv_tdata_rmeta_1]  ewi-aor---          /dev/sdb1(0)                             
  [lv_tmeta]          ewi-ao----          /dev/sda1(257)                           
  [lvol0_pmspare]     ewi-------          /dev/sda1(258)                           
[root@bp-xen-01 ~]# lvconvert --replace /dev/sdb1 vg/lv_tdata
  vg/lv_tdata must be active exclusive locally to perform this operation.

It is, in fact, activated exclusively but the 'lv_is_active_*' functions don't work properly on sub-LVs with no locks.

Comment 3 Zdenek Kabelac 2013-10-03 08:26:34 UTC
(In reply to Jonathan Earl Brassow from comment #1)
> Here is one example where the command fails because it cannot determine the
> exclusive nature of the activation of the underlying RAID LV:

> [root@bp-xen-01 ~]# lvconvert --replace /dev/sdb1 vg/lv_tdata
>   vg/lv_tdata must be active exclusive locally to perform this operation.
> 
> It is, in fact, activated exclusively but the 'lv_is_active_*' functions
> don't work properly on sub-LVs with no locks.

There has been recently introduced support for function:

lv_info(cmd, LVptr, 0/1, >>>NULL<<<, 0, 0)

(commit: 3b604e5c8e505138fc86e009375fd33160ab84ff)

The idea behind this would be - if you know - the whole subtree is supposed to be active on given node - you may query local devices through local dm ioctl
calls instead of doing more generic queries here.

There will be a very thin-line where the locking operations is supposed to be used (in the example above - we always have to query for   thinLV/thinpoolLV)
but one the tool goes more deeply and it knows it's on right node,
it may start to use  lv_info(NULL) variant to check for local presence of device.

We should probably introduce generic function which will seek for master LV lock from any given subLV -  i.e.     get_lock_lv(logical_volume*lv, int *lock)

The function will try to obtain lock status from 'master' LV (i.e. for thins it needs to check for the first of any active thinLVs associated with the pool or the pool itself so it's a bit more complex) - but function user should get some information about availability state on the node.

The /tools code gets slowly merged into  /lib.
In the /lib we should be able to use  lv_info() easily, in the /tools we need to carefully check where to use  lv_is_active() and where is lv_info() valid shortcut.

Comment 4 Jonathan Earl Brassow 2013-10-03 21:48:45 UTC
Locks are behaving very strangely in a cluster for thin volumes:

## LVs are 1 thin-LV, 1 thinpool with tdata on RAID
[root@bp-xen-01 ~]# lvs -a vg
  LV                  VG   Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  lv                  vg   twi---tz-- 500.00m                                             
  [lv_tdata]          vg   rwi---r--- 500.00m                                             
  [lv_tdata_rimage_0] vg   Iwi---r--- 500.00m                                             
  [lv_tdata_rimage_1] vg   Iwi---r--- 500.00m                                             
  [lv_tdata_rmeta_0]  vg   ewi---r---   4.00m                                             
  [lv_tdata_rmeta_1]  vg   ewi---r---   4.00m                                             
  [lv_tmeta]          vg   ewi-------   4.00m                                             
  [lvol0_pmspare]     vg   ewi-------   4.00m                                             
  thinlv              vg   Vwi---tz--   1.00g lv                                          

## When not active, there are no outstanding clvmd locks (obviously)
[root@bp-xen-01 ~]# vgchange -an vg
  0 logical volume(s) in volume group "vg" now active
[root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
[root@bp-xen-01 lvm2]# 


## Activate with 'vgchange', and there are 2 EX locks
[root@bp-xen-01 ~]# vgchange -ay vg
  2 logical volume(s) in volume group "vg" now active

[root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
id 03710001 gr EX rq IV pid 9920 master 0 "R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2"
id 016b0001 gr EX rq IV pid 9920 master 0 "R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08"

[root@bp-xen-01 lvm2]# dmsetup info -c | egrep '9wZ2|AL08'
vg-lv-tpool          253   8 L--w    2    1      0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2-tpool
vg-lv                253   9 L--w    0    1      0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2      
vg-thinlv            253  10 L--w    0    1      0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08      


## Activate with 'lvchange', and there are 1 EX locks
[root@bp-xen-01 ~]# lvchange -ay vg/thinlv

[root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
id 02a80001 gr EX rq IV pid 9920 master 0 "R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08"

[root@bp-xen-01 lvm2]# dmsetup info -c | egrep AL08
vg-thinlv            253   9 L--w    0    1      0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08

Comment 5 Jonathan Earl Brassow 2013-10-03 21:54:43 UTC
I have added a mechanism (just for my testing) that walks up a tree to find the top-level LV that it can query.  This seems to allow things to work properly if I use 'vgchange' for activation, because the thinpool also has a lock in this case.

If we could generalize so that thinpools have locks in every case, it would simplify this problem with RAID and greatly simplify the problem of limiting all thin-LVs associated with a particular thinpool to be active on only one node.

Comment 6 Zdenek Kabelac 2013-10-04 06:50:13 UTC
(In reply to Jonathan Earl Brassow from comment #4)
> Locks are behaving very strangely in a cluster for thin volumes:
> 
> ## LVs are 1 thin-LV, 1 thinpool with tdata on RAID
> [root@bp-xen-01 ~]# lvs -a vg
>   LV                  VG   Attr       LSize   Pool Origin Data%  Move Log
> Cpy%Sync Convert
>   lv                  vg   twi---tz-- 500.00m                               
> 
>   thinlv              vg   Vwi---tz--   1.00g lv                            


Nope - works as designed here.

> ## Activate with 'vgchange', and there are 2 EX locks
> [root@bp-xen-01 ~]# vgchange -ay vg
>   2 logical volume(s) in volume group "vg" now active
> 
> [root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
> id 03710001 gr EX rq IV pid 9920 master 0
> "R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2"
> id 016b0001 gr EX rq IV pid 9920 master 0
> "R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08"


You get the lock for  thinpool (wrapping device) and thin volume device.
The wrapping device might or might not be active - it's 'something'
like control node for thin-pool -  if there is 'any' thin volume from pool active - you do not need to have this volume active and you may deactivate
this thin-pool.

However if there is no thin volume active - and you want to know some information about the pool itself- you could active only the thinpool volume and have all other thin volumes inactive.

> [root@bp-xen-01 lvm2]# dmsetup info -c | egrep '9wZ2|AL08'
> vg-lv-tpool          253   8 L--w    2    1      0
> LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2-tpool
> vg-lv                253   9 L--w    0    1      0
> LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2      
> vg-thinlv            253  10 L--w    0    1      0
> LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08      
> 
> 
> ## Activate with 'lvchange', and there are 1 EX locks
> [root@bp-xen-01 ~]# lvchange -ay vg/thinlv
> 
> [root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
> id 02a80001 gr EX rq IV pid 9920 master 0
> "R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08"


Yep - if you only selectively active thin volume  there is no need to activate 'wrapping' device thin pool so only thin volume is active.

Comment 7 Zdenek Kabelac 2013-10-04 07:31:34 UTC
(In reply to Jonathan Earl Brassow from comment #5)
> I have added a mechanism (just for my testing) that walks up a tree to find
> the top-level LV that it can query.  This seems to allow things to work
> properly if I use 'vgchange' for activation, because the thinpool also has a
> lock in this case.
> 
> If we could generalize so that thinpools have locks in every case, it would
> simplify this problem with RAID and greatly simplify the problem of limiting
> all thin-LVs associated with a particular thinpool to be active on only one
> node.

This can't be done easily this way.

The problem here is - that we could always taky only 1 lock  (while we played with idea of grabbing 2 LV locks at the same time - it has appeared too much complex).

So the activation of thin-volume LV (i.e. when it's being stacked) is implicitly activating thin-pool LV  - thus to avoid locking collisions - implicit activation is using -tpool device.  So when you take a lock for thin pool volume there is 'wrapping' device which holds the lock and then there implicit reference again on -tpool device.

The 'dark' side of this solution is - that for detection of active pool on some node - we need to check  lv_is_active() for every thin volume for a thin-pool with the thin-pool itself - since each of them may hold the dlm lock - this is the main purpose of the  'pool_is_active()' function - however there are still unresolved problems - mainly I've not yet added code to prevent cross activation of thin-pools on different nodes (and I should do that since now I guess all pieces are in place - it just needs to be in the right place - which is a bit problematic when it should be also efficient).

Anyway - for the subtree scan function  'pool_is_active()' is the best.

'The other way around' - i.e. to always active pool prior activation of any thin volume would need to always active thin pool explicitly before activation of thin volume itself - which would require to 'hack' active_lv() somehow or to check every place where this may happen.

Comment 8 Jonathan Earl Brassow 2013-10-04 15:32:55 UTC
Why not just make the user activate the thinpool?  I think all the arguments about it being difficult to take 2 locks melt away then.

If the user tries to activate a thin-LV without the pool being active, it fails and tells the user that the thinpool "vg/pool_lv" must be active before it's thin-LVs can be activated.

Why is this such a bad option?

Comment 9 Jonathan Earl Brassow 2013-10-04 15:38:50 UTC
It also doesn't make sense to me that the thinpool can be "active" or "in-active" at the same time it is being used by a thin-LV.

I understand that if the pool has no thin-LVs or when none of them are active, it can be either active or inactive.

I don't understand why the pool can be either active or inactive when it has a loaded map in the kernel and is being used by thin-LVs.

I understand this is the way it was designed.  I'm not sure it is right.

Comment 10 Jonathan Earl Brassow 2013-10-04 15:40:17 UTC
I suppose the thinpool is "active" in a sense... 'lv_is_active' will return true - it's just there is no lock associated with it...

Comment 11 Alasdair Kergon 2013-10-04 15:46:48 UTC
Think in terms of "accessible from /dev/vg" i.e. something the user would be expected to open and read from/write to.

Peter has proposed to start to use udev flags for a related problem - zeroing a device before use - so it's possible that extensions along similar lines to that discussion will help us here too.

Comment 12 Jonathan Earl Brassow 2013-10-04 16:19:41 UTC
I guess my gripe is that the thinpool is a top-level, visible device that may or may not have a lock associated with it.  This makes it unique from all other LV types.  It is also extremely useful to have a lock associated with the thinpool and bothersome (and inefficient) when it does not have one.

I am advocating that the thinpool always have a lock when it is active.  You can't say thinpools shouldn't have a lock because there are cases when it does.  IOW, I want to move from the current position that says it may or may not have a lock to a position that is more certain - it must have a lock if it is active.

Whether we do this through code that implicitly detects the necessity to activate the thinpool (with a lock) or requires the user to activate the thinpool before any thin-LVs, I don't care.

Comment 13 Zdenek Kabelac 2013-10-07 09:04:23 UTC
Yep - thin volume and thin pool are unique since they combine multiple devices together - however I need to emphasize why it's not so simple to 'always' activate thin-pool first  - so while I see you are advocating for this solution, it's been given up for it's huge complexity it would add to the rest of lvm2 code.

For now - the whole 'logic' is kept in dm-tree calculation after the lock has been taken - so the code in lvm2 doesn't need to resolve anything - it just activates some 'top-level' LV - and code after lock does the magic.

(i.e. imagine  external origin is another thin volume from different pool - which is currently easily possible -

external origin has also it's fake & real device - so when you activate it only through the  top-level LV - you bring are  -real & -tpool related device.
While if you would be advocating for taking locks prior activate - you would need to duplicate whole code for activation which currently happens after the lock an takes that code in front - this would be very very complex - not to mentioning probably complex locking ordering issues.

So that's why it's been decided we will go with independent lock. It's fairly easier to just add code that detects lock collisions before activation (though this code is still missing in upstream code) - which could be made more efficiently.

There is even very simple example which is hard to resolve with all locks being taking in front of lock -

activate thin-volume ->  pre-auto-activates thin-pool

now when you deactivate this thin-volume - you do not know if you should or should not deactivate thin-pool (and activation/deactivation of thin-pool could be expensive operation)

With explicit mode - you may activate thin-pool and thus keep pool active all the time even if there is no active thin volume.

Comment 17 Jonathan Earl Brassow 2014-05-05 21:14:58 UTC
Yes, full RAID + thin support will need to be tested when we get the locking ideas worked out.  I've given my suggestions for how locking might be done, but I probably won't be the one coding that and should assign this bug to someone else.

Before I do, I can itemize the necessary tests better if you like; but basically we need to ensure that someone using thin on RAID will be just as happy as using RAID alone.  That means all the failure and I/O tests of RAID while thin is on top.

Comment 23 Steven J. Levine 2015-10-19 16:18:43 UTC
Adding summary title to Doc Text for release note format.