No support for thin provisioning on top of RAID in a cluster
While RAID logical volumes and thinly provisioned logical volumes can be used in a cluster when activated exclusively, there is currently no support for thin provisioning on top of RAID in a cluster. This is the case even if the combination is activated exclusively. Currently this combination is only supported in LVM's single machine non-clustered mode.
We have expressed that thinpools should be underpinned by some redundant device, like a RAID LV. This combination does not work in a cluster, however. There are two issues:
1) Inability to query a cluster lock on a sub-LV.
The RAID LV becomes a sub-LV when it is under a thinpool. There are no cluster locks associated with sub-LVs. Thus, the tests used to identify whether the RAID LV is active exclusively or not fail.
2) Suspend/Resume only operate on LVs for which there is an associated lock.
Since RAID LVs under a thinpool have no lock associated with them, they cannot be suspended/resumed independently of the top-level/locked LV.
This means that operations that require exclusive activation checks or suspend/resume will fail. Things that are know to fail at the moment are:
- Replacing RAID devices (for repair or otherwise)
- Refreshing (i.e. 'lvchange --refresh vg/lv')
- Changes to writemostly, writebehind and sync I/O throttling
Things that will still work:
Here is one example where the command fails because it cannot determine the exclusive nature of the activation of the underlying RAID LV:
[root@bp-xen-01 ~]# devices vg
LV Attr Cpy%Sync Devices
lv twi-a-tz-- lv_tdata(0)
[lv_tdata] rwi-aor--- 100.00 lv_tdata_rimage_0(0),lv_tdata_rimage_1(0)
[lv_tdata_rimage_0] iwi-aor--- /dev/sda1(1)
[lv_tdata_rimage_1] iwi-aor--- /dev/sdb1(1)
[lv_tdata_rmeta_0] ewi-aor--- /dev/sda1(0)
[lv_tdata_rmeta_1] ewi-aor--- /dev/sdb1(0)
[lv_tmeta] ewi-ao---- /dev/sda1(257)
[lvol0_pmspare] ewi------- /dev/sda1(258)
[root@bp-xen-01 ~]# lvconvert --replace /dev/sdb1 vg/lv_tdata
vg/lv_tdata must be active exclusive locally to perform this operation.
It is, in fact, activated exclusively but the 'lv_is_active_*' functions don't work properly on sub-LVs with no locks.
(In reply to Jonathan Earl Brassow from comment #1)
> Here is one example where the command fails because it cannot determine the
> exclusive nature of the activation of the underlying RAID LV:
> [root@bp-xen-01 ~]# lvconvert --replace /dev/sdb1 vg/lv_tdata
> vg/lv_tdata must be active exclusive locally to perform this operation.
> It is, in fact, activated exclusively but the 'lv_is_active_*' functions
> don't work properly on sub-LVs with no locks.
There has been recently introduced support for function:
lv_info(cmd, LVptr, 0/1, >>>NULL<<<, 0, 0)
The idea behind this would be - if you know - the whole subtree is supposed to be active on given node - you may query local devices through local dm ioctl
calls instead of doing more generic queries here.
There will be a very thin-line where the locking operations is supposed to be used (in the example above - we always have to query for thinLV/thinpoolLV)
but one the tool goes more deeply and it knows it's on right node,
it may start to use lv_info(NULL) variant to check for local presence of device.
We should probably introduce generic function which will seek for master LV lock from any given subLV - i.e. get_lock_lv(logical_volume*lv, int *lock)
The function will try to obtain lock status from 'master' LV (i.e. for thins it needs to check for the first of any active thinLVs associated with the pool or the pool itself so it's a bit more complex) - but function user should get some information about availability state on the node.
The /tools code gets slowly merged into /lib.
In the /lib we should be able to use lv_info() easily, in the /tools we need to carefully check where to use lv_is_active() and where is lv_info() valid shortcut.
Locks are behaving very strangely in a cluster for thin volumes:
## LVs are 1 thin-LV, 1 thinpool with tdata on RAID
[root@bp-xen-01 ~]# lvs -a vg
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
lv vg twi---tz-- 500.00m
[lv_tdata] vg rwi---r--- 500.00m
[lv_tdata_rimage_0] vg Iwi---r--- 500.00m
[lv_tdata_rimage_1] vg Iwi---r--- 500.00m
[lv_tdata_rmeta_0] vg ewi---r--- 4.00m
[lv_tdata_rmeta_1] vg ewi---r--- 4.00m
[lv_tmeta] vg ewi------- 4.00m
[lvol0_pmspare] vg ewi------- 4.00m
thinlv vg Vwi---tz-- 1.00g lv
## When not active, there are no outstanding clvmd locks (obviously)
[root@bp-xen-01 ~]# vgchange -an vg
0 logical volume(s) in volume group "vg" now active
[root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
## Activate with 'vgchange', and there are 2 EX locks
[root@bp-xen-01 ~]# vgchange -ay vg
2 logical volume(s) in volume group "vg" now active
[root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
id 03710001 gr EX rq IV pid 9920 master 0 "R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2"
id 016b0001 gr EX rq IV pid 9920 master 0 "R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08"
[root@bp-xen-01 lvm2]# dmsetup info -c | egrep '9wZ2|AL08'
vg-lv-tpool 253 8 L--w 2 1 0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2-tpool
vg-lv 253 9 L--w 0 1 0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7zT3ms98yhmEpLLrbERePmSmI7rd79wZ2
vg-thinlv 253 10 L--w 0 1 0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08
## Activate with 'lvchange', and there are 1 EX locks
[root@bp-xen-01 ~]# lvchange -ay vg/thinlv
[root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
id 02a80001 gr EX rq IV pid 9920 master 0 "R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08"
[root@bp-xen-01 lvm2]# dmsetup info -c | egrep AL08
vg-thinlv 253 9 L--w 0 1 0 LVM-R5CV0prWKunffFzen8Xwwng5wksedgh7727mXdeSowp6Vcdf1U5omavum6ujAL08
I have added a mechanism (just for my testing) that walks up a tree to find the top-level LV that it can query. This seems to allow things to work properly if I use 'vgchange' for activation, because the thinpool also has a lock in this case.
If we could generalize so that thinpools have locks in every case, it would simplify this problem with RAID and greatly simplify the problem of limiting all thin-LVs associated with a particular thinpool to be active on only one node.
(In reply to Jonathan Earl Brassow from comment #4)
> Locks are behaving very strangely in a cluster for thin volumes:
> ## LVs are 1 thin-LV, 1 thinpool with tdata on RAID
> [root@bp-xen-01 ~]# lvs -a vg
> LV VG Attr LSize Pool Origin Data% Move Log
> Cpy%Sync Convert
> lv vg twi---tz-- 500.00m
> thinlv vg Vwi---tz-- 1.00g lv
Nope - works as designed here.
> ## Activate with 'vgchange', and there are 2 EX locks
> [root@bp-xen-01 ~]# vgchange -ay vg
> 2 logical volume(s) in volume group "vg" now active
> [root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
> id 03710001 gr EX rq IV pid 9920 master 0
> id 016b0001 gr EX rq IV pid 9920 master 0
You get the lock for thinpool (wrapping device) and thin volume device.
The wrapping device might or might not be active - it's 'something'
like control node for thin-pool - if there is 'any' thin volume from pool active - you do not need to have this volume active and you may deactivate
However if there is no thin volume active - and you want to know some information about the pool itself- you could active only the thinpool volume and have all other thin volumes inactive.
> [root@bp-xen-01 lvm2]# dmsetup info -c | egrep '9wZ2|AL08'
> vg-lv-tpool 253 8 L--w 2 1 0
> vg-lv 253 9 L--w 0 1 0
> vg-thinlv 253 10 L--w 0 1 0
> ## Activate with 'lvchange', and there are 1 EX locks
> [root@bp-xen-01 ~]# lvchange -ay vg/thinlv
> [root@bp-xen-01 lvm2]# dlm_tool lockdump clvmd
> id 02a80001 gr EX rq IV pid 9920 master 0
Yep - if you only selectively active thin volume there is no need to activate 'wrapping' device thin pool so only thin volume is active.
(In reply to Jonathan Earl Brassow from comment #5)
> I have added a mechanism (just for my testing) that walks up a tree to find
> the top-level LV that it can query. This seems to allow things to work
> properly if I use 'vgchange' for activation, because the thinpool also has a
> lock in this case.
> If we could generalize so that thinpools have locks in every case, it would
> simplify this problem with RAID and greatly simplify the problem of limiting
> all thin-LVs associated with a particular thinpool to be active on only one
This can't be done easily this way.
The problem here is - that we could always taky only 1 lock (while we played with idea of grabbing 2 LV locks at the same time - it has appeared too much complex).
So the activation of thin-volume LV (i.e. when it's being stacked) is implicitly activating thin-pool LV - thus to avoid locking collisions - implicit activation is using -tpool device. So when you take a lock for thin pool volume there is 'wrapping' device which holds the lock and then there implicit reference again on -tpool device.
The 'dark' side of this solution is - that for detection of active pool on some node - we need to check lv_is_active() for every thin volume for a thin-pool with the thin-pool itself - since each of them may hold the dlm lock - this is the main purpose of the 'pool_is_active()' function - however there are still unresolved problems - mainly I've not yet added code to prevent cross activation of thin-pools on different nodes (and I should do that since now I guess all pieces are in place - it just needs to be in the right place - which is a bit problematic when it should be also efficient).
Anyway - for the subtree scan function 'pool_is_active()' is the best.
'The other way around' - i.e. to always active pool prior activation of any thin volume would need to always active thin pool explicitly before activation of thin volume itself - which would require to 'hack' active_lv() somehow or to check every place where this may happen.
Why not just make the user activate the thinpool? I think all the arguments about it being difficult to take 2 locks melt away then.
If the user tries to activate a thin-LV without the pool being active, it fails and tells the user that the thinpool "vg/pool_lv" must be active before it's thin-LVs can be activated.
Why is this such a bad option?
It also doesn't make sense to me that the thinpool can be "active" or "in-active" at the same time it is being used by a thin-LV.
I understand that if the pool has no thin-LVs or when none of them are active, it can be either active or inactive.
I don't understand why the pool can be either active or inactive when it has a loaded map in the kernel and is being used by thin-LVs.
I understand this is the way it was designed. I'm not sure it is right.
I suppose the thinpool is "active" in a sense... 'lv_is_active' will return true - it's just there is no lock associated with it...
Think in terms of "accessible from /dev/vg" i.e. something the user would be expected to open and read from/write to.
Peter has proposed to start to use udev flags for a related problem - zeroing a device before use - so it's possible that extensions along similar lines to that discussion will help us here too.
I guess my gripe is that the thinpool is a top-level, visible device that may or may not have a lock associated with it. This makes it unique from all other LV types. It is also extremely useful to have a lock associated with the thinpool and bothersome (and inefficient) when it does not have one.
I am advocating that the thinpool always have a lock when it is active. You can't say thinpools shouldn't have a lock because there are cases when it does. IOW, I want to move from the current position that says it may or may not have a lock to a position that is more certain - it must have a lock if it is active.
Whether we do this through code that implicitly detects the necessity to activate the thinpool (with a lock) or requires the user to activate the thinpool before any thin-LVs, I don't care.
Yep - thin volume and thin pool are unique since they combine multiple devices together - however I need to emphasize why it's not so simple to 'always' activate thin-pool first - so while I see you are advocating for this solution, it's been given up for it's huge complexity it would add to the rest of lvm2 code.
For now - the whole 'logic' is kept in dm-tree calculation after the lock has been taken - so the code in lvm2 doesn't need to resolve anything - it just activates some 'top-level' LV - and code after lock does the magic.
(i.e. imagine external origin is another thin volume from different pool - which is currently easily possible -
external origin has also it's fake & real device - so when you activate it only through the top-level LV - you bring are -real & -tpool related device.
While if you would be advocating for taking locks prior activate - you would need to duplicate whole code for activation which currently happens after the lock an takes that code in front - this would be very very complex - not to mentioning probably complex locking ordering issues.
So that's why it's been decided we will go with independent lock. It's fairly easier to just add code that detects lock collisions before activation (though this code is still missing in upstream code) - which could be made more efficiently.
There is even very simple example which is hard to resolve with all locks being taking in front of lock -
activate thin-volume -> pre-auto-activates thin-pool
now when you deactivate this thin-volume - you do not know if you should or should not deactivate thin-pool (and activation/deactivation of thin-pool could be expensive operation)
With explicit mode - you may activate thin-pool and thus keep pool active all the time even if there is no active thin volume.
Yes, full RAID + thin support will need to be tested when we get the locking ideas worked out. I've given my suggestions for how locking might be done, but I probably won't be the one coding that and should assign this bug to someone else.
Before I do, I can itemize the necessary tests better if you like; but basically we need to ensure that someone using thin on RAID will be just as happy as using RAID alone. That means all the failure and I/O tests of RAID while thin is on top.
Adding summary title to Doc Text for release note format.