Bug 1100514
Summary: support non-power-of-2 VG physical extents
Product: Red Hat Enterprise Linux 7
Component: lvm2
lvm2 sub component: LVM Metadata / lvmetad
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 7.1
Keywords: FutureFeature, Triaged
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Mike Snitzer <msnitzer>
Assignee: Alasdair Kergon <agk>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cmarthal, heinzm, jbrassow, mpillai, msnitzer, prajnoha, prockai, rcyriac, zkabelac
Fixed In Version: lvm2-2.02.112-1.el7
Doc Type: Enhancement
Doc Text: LVM2 now supports extent sizes that are not a power of 2, provided the extent size is larger than 128 KB.
Clones: 1153312 (view as bug list)
Last Closed: 2015-03-05 13:08:39 UTC
Type: Bug
Bug Blocks: 1156164
Description (Mike Snitzer, 2014-05-23 01:12:21 UTC)
Heh, in comment#0 I meant to say "HW raid6 device is composed of 12 devices, ..." (rather than 10).

For historical reasons lvm2 uses two 32-bit numbers to express an LV size (Heinz, agk?): 32 bits for the extent size and 32 bits for the extent count. Together that forms a 64-bit value that can address at most 16 exabytes. Allowing an arbitrary extent size would probably mean dropping extent_size and just using a 64-bit size, but that is a much bigger change to the lvm2 code base and would make the metadata incompatible (likely an lvm3 format).

A reasonable workaround for cases like this is to lower the extent_size to some power of 2 that divides the stripe size on all PVs in the VG. For the case in the Description, an extent_size of 64KiB would fit nicely: it lets lvm2 create LVs in sizes such as 640KiB (10 extents). (Of course many other values, 1, 2, 4, ... 256KiB, would fit as well.)

But there are limitations: the default extent_size of 4MiB (22 bits) limits the maximum LV size (and also the sum of all LVs in the VG) to 2^(22+32) bytes = 16PB. With 64KiB extents the limit drops to 2^(16+32) = 256TB, which is starting to be reachable even with relatively low-cost hardware these days. (With 256KiB the limit is 1PB; see the quick arithmetic sketch at the end of this comment.)

As a short-term goal, lvm2 could detect misalignment at the PV->LV level, print a warning, and possibly suggest suitable values. However, since quite different PVs may be combined into a single VG, it is hard to write generic code that always fits, unless we go with a very small extent size such as the 4KiB page size, which limits the total size of all LVs in a VG to 16TB.

This is a similar class of problem to creating an LV across PVs with different geometry parameters: it is unclear how such a device should advertise itself to the system, and the answer may even change dynamically while the device is in use (e.g. when the volume is pvmoved to a different disk).

It is also worth noting that the 'virtual' extent size limitation in the lvm2 code affects virtual devices such as virtual snapshots or thin volumes, where the maximum size is limited in the same way; unless a VG is created with 4G extents, the user can't play with 8-exabyte disks. (For some reason glibc/kernel? seems somewhat broken when playing with larger devices.)

Another important point: LVM2 now handles raid6 internally, which allows easier integration for device stacking, since PVs can be used for multiple different purposes. In that case the striping is handled at the lvm2 level and alignment works.

So the question is:

Are there any good reasons why LVM still needs to restrict the extent size to be a power of 2?
- If we relaxed that constraint, would any other code need changing, or would any situations need additional restrictions?

We need to audit the code to find out.

To explain some background on how allocation works: lvm2 maintains a 'hidden' LV (with a _pmspare suffix) to provide space for metadata recovery; its size is always at least the size of the biggest metadata area in the VG (be it a thin pool or a cache pool). It is allocated first, to ensure the VG has at least this much space before allocation of the metadata and data devices continues (this seemed the most logical order). A skilled user who takes responsibility for metadata repair into their own hands may disable creation of this hidden LV with '--poolmetadataspare n'.
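A quick sanity check of the size-limit arithmetic quoted above; this is plain shell arithmetic for illustration, not an lvm2 command:

# maximum addressable size = extent_size * 2^32 (the extent count is a 32-bit number)
for pe_kib in 4096 256 64 4; do
  echo "${pe_kib} KiB extents -> $(( pe_kib * 1024 * (1 << 32) / (1 << 40) )) TiB max LV size"
done
# prints 16384 TiB (16 PiB), 1024 TiB (1 PiB), 256 TiB and 16 TiB respectively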
To work around this manually, the lvm2 commands already support step-by-step creation on an empty VG with a 256k chunk size:

lvcreate -L16G --name meta vg
lvcreate -l100%PVS --name pool vg /dev/pv0:65540-
lvconvert --thinpool vg/pool --poolmetadata meta --poolmetadataspare n
lvcreate -V size -T vg/pool

I'm able to configure a thinlv that is properly aligned relative to the HW RAID6 1280K stripesize (without any false alignment_offset!=0) by using this sequence of lvm2 commands as a workaround:

pvcreate --dataalignment 1280K /dev/sdb
vgcreate --physicalextentsize 256K bricks /dev/sdb

# create ~16GB metadata LV that has a size which is a multiple of 1280K
lvcreate -L $((13107*1280))K --name metadata bricks

# create ~512GB data LV that has a size which is a multiple of 1280K
lvcreate -L $((419430*1280))K --name pool bricks

# create thin-pool without creating a spare metadata area
# (though a spare metadata area is useful if/when thin-pool metadata repair is needed)
# -- NOTE: could also use --chunksize 256K or 128K here since they are a factor of 1280K
lvconvert --chunksize 1280K --thinpool bricks/pool --poolmetadata bricks/metadata --poolmetadataspare n

# create 1280K aligned ~256GB thin volume:
lvcreate -V $((209715*1280))K -T /dev/bricks/pool -n thinlv

# cat /sys/block/dm-6/alignment_offset
0
# cat /sys/block/dm-7/alignment_offset
0

(In reply to Alasdair Kergon from comment #3)
> So the question is:
>
> Are there any good reasons why LVM still needs to restrict the extent size
> to be a power of 2?
> - If we relaxed that constraint, would any other code need changing or
> would any situations need additional restrictions?
>
> We need to audit the code to find out.

I've performed this audit and, as I had hoped, found no issues. The extent size is only validated when it is set, and I've not found any code that assumes it is a power of 2.

I propose replacing the "must be a power of 2" validation (at 2 sites in the code) with "must be a multiple of (say) 64KB". This will kick in for format_text through a new flag that vgconvert will also test for. The vgcreate and vgchange man pages need updating. I've not found any reason yet why existing versions of LVM won't work perfectly fine with this metadata.

I'm going to restrict this bugzilla to these basic changes. Anything more sophisticated, such as better automatic extent size calculations or handling PVs with different extent sizes in the same VG, needs to be covered by separate bugzillas.

https://lists.fedorahosted.org/pipermail/lvm2-commits/2014-October/002944.html

(Like I said, any further changes should go onto separate bugzillas for separate consideration.)
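Purely as an illustration of the proposed change (not a command sequence from this report): once the power-of-2 restriction is relaxed, the extent size could be set directly to the full 1280K RAID6 stripe, since 1280K is a multiple of both the 64KB and 128KB granularities discussed above. A sketch:

pvcreate --dataalignment 1280k /dev/sdb
vgcreate --physicalextentsize 1280k bricks /dev/sdb   # non-power-of-2 extent size, one full stripe
lvcreate -L 512g --name pool bricks                   # -L rounds up to whole 1280k extents, so the LV stays stripe-aligned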
(In reply to Mike Snitzer from comment #11)
> I'm able to configure a thinlv that is properly aligned relative to the HW
> RAID6 1280K stripesize (without any false alignment_offset!=0) by using this
> sequence of lvm2 commands as a workaround:
> [...]
> lvconvert --chunksize 1280K --thinpool bricks/pool --poolmetadata
> bricks/metadata --poolmetadataspare n

We can leave out the "--poolmetadataspare n" part when we do it this way, and still get the correct alignment, right?

That will get rid of this ominous warning:

  Logical volume "tp01" created
  WARNING: recovery of pools without pool metadata spare LV is not automated.
  Converted vg0/tp01 to thin pool.

(In reply to Manoj from comment #18)
> We can leave out the "--poolmetadataspare n" part when we do it this way,
> and still get the correct alignment, right?

Yes, it'll just be a slightly less optimal layout in that the metadata volume will be _after_ the data volume. (The start of the disk is generally fastest on rotational storage.) But in practice that likely doesn't matter on this RAID6 storage, and having the metadata spare gives Gluster the added benefit of more convenient repair of thinp metadata (if/when that is ever needed).

Should there be any checks for attempting to create extent sizes that are way larger than your PVs?
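For reference, a sketch of the comment #11 recipe with the spare kept, i.e. simply omitting "--poolmetadataspare n" as discussed above; the device names and sizes are the illustrative ones used earlier in this report, and the VG needs roughly one metadata LV's worth of free space left over for the spare:

pvcreate --dataalignment 1280K /dev/sdb
vgcreate --physicalextentsize 256K bricks /dev/sdb
lvcreate -L $((13107*1280))K --name metadata bricks
lvcreate -L $((419430*1280))K --name pool bricks
# no --poolmetadataspare n: lvm2 allocates the _pmspare LV automatically,
# at the cost of it landing after the data LV on disk
lvconvert --chunksize 1280K --thinpool bricks/pool --poolmetadata bricks/metadata
lvcreate -V $((209715*1280))K -T bricks/pool -n thinlv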
:)

[root@host-110 ~]# pvcreate --dataalignment 360960 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  Physical volume "/dev/sda1" successfully created
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created
  Physical volume "/dev/sdd1" successfully created
  Physical volume "/dev/sde1" successfully created

[root@host-110 ~]# pvs -o +pe_start
  PV         VG   Fmt  Attr PSize  PFree  1st PE
  /dev/sda1       lvm2 ---  15.00g 15.00g 352.50m
  /dev/sdb1       lvm2 ---  15.00g 15.00g 352.50m
  /dev/sdc1       lvm2 ---  15.00g 15.00g 352.50m
  /dev/sdd1       lvm2 ---  15.00g 15.00g 352.50m
  /dev/sde1       lvm2 ---  15.00g 15.00g 352.50m

[root@host-110 ~]# vgcreate --physicalextentsize 72192 snapper_thinp /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  Volume group "snapper_thinp" successfully created

[root@host-110 ~]# pvs -o +pe_start
  PV         VG            Fmt  Attr PSize  PFree 1st PE
  /dev/sda1  snapper_thinp lvm2 a--  15.00g     0 352.50m
  /dev/sdb1  snapper_thinp lvm2 a--  15.00g     0 352.50m
  /dev/sdc1  snapper_thinp lvm2 a--  15.00g     0 352.50m
  /dev/sdd1  snapper_thinp lvm2 a--  15.00g     0 352.50m
  /dev/sde1  snapper_thinp lvm2 a--  15.00g     0 352.50m

[root@host-110 ~]# pvscan
  PV /dev/sda1   VG snapper_thinp   lvm2 [0 / 0 free]
  PV /dev/sdb1   VG snapper_thinp   lvm2 [0 / 0 free]
  PV /dev/sdc1   VG snapper_thinp   lvm2 [0 / 0 free]
  PV /dev/sdd1   VG snapper_thinp   lvm2 [0 / 0 free]
  PV /dev/sde1   VG snapper_thinp   lvm2 [0 / 0 free]

[root@host-110 ~]# lvcreate --thinpool POOL --zero y -L 1G --poolmetadatasize 100M snapper_thinp
  Rounding up size to full physical extent 70.50 GiB
  Rounding up size to full physical extent 70.50 GiB
  WARNING: Maximum supported pool metadata size is 16.00 GiB.
  Rounding up size to full physical extent 70.50 GiB
  Volume group "snapper_thinp" has insufficient free space (0 extents): 1 required.

Raid1 currently can't handle non-power-of-2 physical extents. Is this a new raid1-specific bug, or does it block this bug from being verified?

Recreating PVs/VG with non-power-of-2 dataalignment and extent sizes:

pvcreate --dataalignment 3072k /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
vgcreate --physicalextentsize 384k snapper_thinp /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1

[root@host-114 ~]# lvcreate --type raid1 -m 1 -L 100M -n raid snapper_thinp
  Rounding up size to full physical extent 100.12 MiB
  Using reduced mirror region size of 768 sectors.
  device-mapper: reload ioctl on failed: Invalid argument
  Failed to activate new LV.

Nov 24 18:08:02 host-114 kernel: device-mapper: table: 253:6: raid: Region size is not a power of 2
Nov 24 18:08:02 host-114 kernel: device-mapper: ioctl: error adding target to table

[root@host-114 ~]# dmsetup ls
snapper_thinp-raid_rmeta_1      (253:4)
snapper_thinp-raid_rmeta_0      (253:2)
snapper_thinp-raid_rimage_1     (253:5)
snapper_thinp-raid_rimage_0     (253:3)

We need to look into this.

1) If the kernel is being too strict, then it needs relaxing with a kernel patch. (Bug remains ON_QA, retest after kernel is patched.)
2) If the kernel is right, then userspace needs to warn users that they cannot use raid volumes in these VGs. (Bug goes FailedQA and we add an lvm2 patch.)

I hope the first case is possible...

So more work is needed to deal with the raid and raid1 targets: if we have to keep power-of-2 region sizes, we either 'lose' a bit of space at the ends of the devices or we stop people from using these targets completely in these VGs. pvmove probably also can't be supported.
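A possible manual workaround in the meantime (my suggestion, not taken from the report): explicitly request a power-of-2 region size that divides the 384k extent size, which is the value the later fix ends up choosing automatically (256 sectors = 128KiB):

lvcreate --type raid1 -m 1 -L 100M --regionsize 128k -n raid snapper_thinp
# 128k is a power of 2 and divides the 384k physical extent size,
# so the dm-raid target should no longer reject the table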
As for the "no data space" PVs, you can have a PV that only contains a metadata area, but it would be sensible to warn you when you're doing this. So the first change I've made (for the next build) is: https://lists.fedorahosted.org/pipermail/lvm2-commits/2014-December/003224.html which should prevent the mirror code selecting a non-power-of-2 region size. In your example, this reduces the mirror region size: Rounding up size to full physical extent 100.12 MiB Using reduced mirror region size of 256 sectors. I verified that mirrors, raids, cache, thin, and snapshot volumes now work on top of volume groups with non power of two physical extents. Also verified no misalignment issues when running the configuration case in comment #6. Marking verified in the latest rpms. 3.10.0-220.el7.x86_64 lvm2-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 lvm2-libs-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 lvm2-cluster-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-libs-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-event-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-event-libs-1.02.92-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 device-mapper-persistent-data-0.4.1-2.el7 BUILT: Wed Nov 12 12:39:46 CST 2014 cmirror-2.02.114-4.el7 BUILT: Wed Jan 7 07:07:47 CST 2015 [root@host-115 ~]# dmsetup status bricks-thinlv: 0 536870400 thin 0 - bricks-mypool: 0 1073740800 linear bricks-mypool-tpool: 0 1073740800 thin-pool 1 288/4145152 0/419430 - rw no_discard_passdown queue_if_no_space bricks-mypool_tdata: 0 1073740800 linear bricks-mypool_tmeta: 0 33161216 linear [root@host-115 ~]# lsblk /dev/sda NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 750G 0 disk ├─bricks-mypool_tmeta 253:2 0 15.8G 0 lvm │ └─bricks-mypool-tpool 253:4 0 512G 0 lvm │ ├─bricks-mypool 253:5 0 512G 0 lvm │ └─bricks-thinlv 253:6 0 256G 0 lvm └─bricks-mypool_tdata 253:3 0 512G 0 lvm └─bricks-mypool-tpool 253:4 0 512G 0 lvm ├─bricks-mypool 253:5 0 512G 0 lvm └─bricks-thinlv 253:6 0 256G 0 lvm [root@host-115 ~]# cat /sys/block/dm-2/alignment_offset 0 [root@host-115 ~]# cat /sys/block/dm-3/alignment_offset 0 [root@host-115 ~]# cat /sys/block/dm-4/alignment_offset 0 [root@host-115 ~]# cat /sys/block/dm-5/alignment_offset 0 [root@host-115 ~]# cat /sys/block/dm-6/alignment_offset 0 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0513.html |