Bug 1100514

Summary: support non-power-of-2 VG physical extents
Product: Red Hat Enterprise Linux 7
Reporter: Mike Snitzer <msnitzer>
Component: lvm2
Assignee: Alasdair Kergon <agk>
lvm2 sub component: LVM Metadata / lvmetad
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: agk, cmarthal, heinzm, jbrassow, mpillai, msnitzer, prajnoha, prockai, rcyriac, zkabelac
Version: 7.1
Keywords: FutureFeature, Triaged
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: lvm2-2.02.112-1.el7
Doc Type: Enhancement
Doc Text:
LVM2 now supports extent sizes that are not a power of 2, provided the extent size is larger than 128 KB.
Story Points: ---
Clone Of:
Cloned By: 1153312 (view as bug list)
Environment:
Last Closed: 2015-03-05 13:08:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1156164    

Description Mike Snitzer 2014-05-23 01:12:21 UTC
Description of problem:

lvm2 needs to support the creation of non-power-of-2 aligned logical volumes.

If a HW raid6 device is composed of 10 devices, with a chunk size of 64K, the overall raid stripe is 640K.  640K is _not_ a power of 2.  The pvcreate --dataalignment flag supports non-power-of-2 values, so the PVs' data area can be aligned relative to the underlying storage's alignment... but when these non-power-of-2-aligned PVs are added to a VG, the Logical Volumes layered on top cannot be properly aligned to the underlying PVs' alignment, because the VG's physical extent size cannot be configured to a non-power-of-2 value.

The current workaround is to use a physical extent size that is compatible with the underlying PVs' non-power-of-2 alignment (e.g. use 64k if the raid stripe size is 640k).  But the lvcreate command doesn't know to align the LV relative to the underlying PVs' data alignment (as specified via pvcreate --dataalignment).  So the user also needs to use lvcreate -l <# extents> to specify a number of extents that perfectly aligns the LV to the underlying PVs' data alignment.
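
For illustration, that workaround for the 640K-stripe example looks roughly like this (device name and LV size are placeholders):

pvcreate --dataalignment 640K /dev/sdX
vgcreate --physicalextentsize 64K vg /dev/sdX
# extent count must be a multiple of 10, since 10 * 64K = 640K
lvcreate -l 163840 -n lv vg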

(NOTE: if possible, the current workaround needs to be added to the Admin guide for 7.0)

Comment 1 Mike Snitzer 2014-05-23 01:14:13 UTC
Heh, in comment#0 I meant to say "HW raid6 device is composed of 12 devices, ..." (rather than 10).

Comment 2 Zdenek Kabelac 2014-05-23 07:28:19 UTC
For historical reasons lvm2 uses two 32-bit numbers to make up an LV size.
(Heinz, agk ?)

32 bits for the extent size and 32 bits for the extent count.

Together this gives a 64-bit number, allowing at most 16EiB to be addressed.

So allowing an arbitrary extent size would probably mean dropping extent_size and just using a 64-bit size - but that would be a bigger hack into the lvm2 code base and would make the metadata incompatible (likely an lvm3 format).

A reasonable 'workaround' for cases like this is to lower the extent_size to some power of 2 that divides the stripe size on all PVs in the VG.

In a case like the one in the Description, an extent_size of 64KiB would fit nicely. That lets lvm2 create LVs in sizes like 640KiB (10 extents).
(Of course many other values (1,2,4..256KiB) would fit as well.)

But of course there are limitations: the default extent_size of 4MiB (22 bits)
limits the max LV size (and also the sum of all LVs in the VG) to 16PiB (22+32=54 bits).

With 64KiB the limit drops to 256TiB (16+32=48 bits), which is starting to be reachable even with relatively 'low cost' hardware these days.
(With 256KiB the limit is 1PiB.)
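
Just to make the arithmetic explicit (a throwaway shell check, nothing more):

# capacity limit = extent_size * 2^32 extents
echo $(( (4 * 1024 * 1024) * (1 << 32) >> 50 ))   # 4MiB extents   -> 16  (PiB)
echo $(( (64 * 1024) * (1 << 32) >> 40 ))         # 64KiB extents  -> 256 (TiB)
echo $(( (256 * 1024) * (1 << 32) >> 50 ))        # 256KiB extents -> 1   (PiB)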

As a short-term goal, I guess lvm2 could detect misalignment at the PV->LV level and at least print a warning and perhaps suggest some values.

However, since quite different PVs may be combined into a single VG, it is likely to be problematic to create generic code that always fits - unless we go with a very small extent size like the 4KiB page size, which limits the maximum size of all LVs in the VG to 16TiB.

It's a similar class of problem to creating an LV across different PVs with different geometry parameters - it's unclear how such a device should advertise itself in the system, and, even better, that may change dynamically while the device is in use (e.g. the volume is pvmoved to a different disk).

It's probably also worth noting here that the extent size limitation in the lvm2 code also affects virtual devices like virtual snapshots or thin volumes, where the maximum size is likewise limited - thus unless a VG is created with 4GiB extents, the user can't play with 8EiB disks. (For some reason glibc/kernel? is somewhat broken when playing with larger devices.)

Another important thing to note: LVM2 now handles raid6 internally, which allows easier integration for device stacking, since PVs can be used for multiple different purposes. In that case the striping is handled at the lvm2 level and alignment works.
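
For example (values are purely illustrative - this needs 12 PVs in the VG), a raid6 LV with 10 data stripes and a 64k stripe size handled entirely by lvm2:

lvcreate --type raid6 -i 10 -I 64k -L 1T -n r6lv vg

Here -i is the number of data stripes and -I the stripe size, so the alignment becomes lvm2's own business rather than the HW controller's.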

Comment 3 Alasdair Kergon 2014-07-16 02:14:14 UTC
So the question is:

  Are there any good reasons why LVM still needs to restrict the extent size to be a power of 2?
  - If we relaxed that constraint, would any other code need changing or would any situations need additional restrictions?

We need to audit the code to find out.

Comment 8 Zdenek Kabelac 2014-10-08 20:52:06 UTC
To explain some background on how the allocation works:

lvm2 maintains a 'hidden' LV (with the _pmspare suffix) to provide space for metadata recovery - its size is always at least the size of the 'biggest' metadata area in the VG (be it a thin pool or a cache pool).

It's allocated first, to ensure the VG has at least this much space before it continues with allocation of the metadata and data devices (this seemed the most logical option).

A skilled user who takes responsibility for metadata repair into their own hands may disable creation of this hidden LV with '--poolmetadataspare n'.
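
To check whether a VG already has this spare, listing hidden LVs as well works (the VG name is just an example):

lvs -a -o lv_name,lv_size vg

The spare shows up with the _pmspare suffix mentioned above.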


To work around things here manually, the lvm2 commands support creation on a step-by-step basis.

On an empty VG with a 256k chunk size:

# reserve a metadata LV (16GiB is the maximum supported thin-pool metadata size)
lvcreate -L16G --name meta vg
# data LV on an explicitly chosen PE range of the PV (PV:PE-range syntax)
lvcreate -l100%PVS --name pool vg /dev/pv0:65540-
# join them into a thin pool, without the automatic metadata spare
lvconvert --thinpool vg/pool --poolmetadata meta --poolmetadataspare n

lvcreate -V size -T vg/pool

Comment 11 Mike Snitzer 2014-10-09 03:20:51 UTC
I'm able to configure a thinlv that is properly aligned relative to the HW RAID6 1280K stripesize (without any false alignment_offset!=0) by using this sequence of lvm2 commands as a workaround:

pvcreate --dataalignment 1280K /dev/sdb
vgcreate --physicalextentsize 256K bricks /dev/sdb

# create ~16GB metadata LV that has a size which is a multiple of 1280K
lvcreate -L $((13107*1280))K --name metadata bricks

# create ~512GB data LV that has a size which is a multiple of 1280K
lvcreate -L $((419430*1280))K --name pool bricks

# create thin-pool without creating a spare metadata area
# (though spare metadata area is useful if/when thin-pool metadata repair is needed)
# -- NOTE: could also use --chunksize 256K or 128K here since they are a factor of 1280K
lvconvert --chunksize 1280K --thinpool bricks/pool --poolmetadata bricks/metadata --poolmetadataspare n

# create 1280K aligned ~256GB thin volume:
lvcreate -V $((209715*1280))K -T /dev/bricks/pool -n thinlv

# cat /sys/block/dm-6/alignment_offset 
0
# cat /sys/block/dm-7/alignment_offset 
0
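
For completeness, the I/O topology that the stacked devices advertise can also be checked (dm numbers and the lsblk invocation are just examples):

# lsblk -t /dev/bricks/thinlv
# cat /sys/block/dm-6/queue/optimal_io_size

A non-zero alignment_offset or an unexpected optimal_io_size would indicate a stacking problem.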

Comment 12 Alasdair Kergon 2014-10-10 22:13:35 UTC
(In reply to Alasdair Kergon from comment #3)
> So the question is:
> 
>   Are there any good reasons why LVM still needs to restrict the extent size
> to be a power of 2?
>   - If we relaxed that constraint, would any other code need changing or
> would any situations need additional restrictions?
> 
> We need to audit the code to find out.

I've performed this audit and, as I had hoped, found no issues.

The extent size is only validated when it is set and I've not found any code that assumes it is a power of 2.

I propose replacing the "must be a power of 2" validation (at 2 sites in the code) with "must be a multiple of (say) 64KB".  This will kick in for format_text through a new flag that vgconvert will also test for.  The vgcreate and vgchange man pages need updating.
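
In practice that means a VG like the one in the description could then be created directly with a matching extent size, e.g. (device name is only an example):

vgcreate --physicalextentsize 640k vg /dev/sdX

i.e. a non-power-of-2 value that is still a multiple of the new granularity.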

I've not found any reason yet why existing versions of LVM won't work perfectly fine with this metadata.

I'm going to restrict this bugzilla to these basic changes.

Anything more sophisticated, such as better automatic extent size calculations or handling PVs with different extent sizes in the same VG, needs to be covered by different bugzillas.

Comment 16 Alasdair Kergon 2014-10-14 17:20:17 UTC
https://lists.fedorahosted.org/pipermail/lvm2-commits/2014-October/002944.html

(Like I said, any further changes should go onto separate bugzillas for separate consideration.)

Comment 18 Manoj Pillai 2014-10-22 08:24:06 UTC
(In reply to Mike Snitzer from comment #11)
> I'm able to configure a thinlv that is properly aligned relative to the HW
> RAID6 1280K stripesize (without any false alignment_offset!=0) by using this
> sequence of lvm2 commands as a workaround:
> 
> pvcreate --dataalignment 1280K /dev/sdb
> vgcreate --physicalextentsize 256K bricks /dev/sdb
> 
> # create ~16GB metadata LV that has a size which is a multiple of 1280K
> lvcreate -L $((13107*1280))K --name metadata bricks
> 
> # create ~512GB data LV that has a size which is a multiple of 1280K
> lvcreate -L $((419430*1280))K --name pool bricks
> 
> # create thin-pool without creating a spare metadata area
> # (though spare metadata area is useful if/when thin-pool metadata repair is
> needed)
> # -- NOTE: could also use --chunksize 256K or 128K here since they are a
> factor of 1280K
> lvconvert --chunksize 1280K --thinpool bricks/pool --poolmetadata
> bricks/metadata --poolmetadataspare n

We can leave out the "--poolmetadataspare n" part when we do it this way, and still get the correct alignment, right?

That will get rid of this ominous warning:
  Logical volume "tp01" created
  WARNING: recovery of pools without pool metadata spare LV is not automated.
  Converted vg0/tp01 to thin pool.

Comment 19 Mike Snitzer 2014-10-22 18:12:21 UTC
(In reply to Manoj Pillai from comment #18)
> (In reply to Mike Snitzer from comment #11)
> > I'm able to configure a thinlv that is properly aligned relative to the HW
> > RAID6 1280K stripesize (without any false alignment_offset!=0) by using this
> > sequence of lvm2 commands as a workaround:
> > 
> > pvcreate --dataalignment 1280K /dev/sdb
> > vgcreate --physicalextentsize 256K bricks /dev/sdb
> > 
> > # create ~16GB metadata LV that has a size which is a multiple of 1280K
> > lvcreate -L $((13107*1280))K --name metadata bricks
> > 
> > # create ~512GB data LV that has a size which is a multiple of 1280K
> > lvcreate -L $((419430*1280))K --name pool bricks
> > 
> > # create thin-pool without creating a spare metadata area
> > # (though spare metadata area is useful if/when thin-pool metadata repair is
> > needed)
> > # -- NOTE: could also use --chunksize 256K or 128K here since they are a
> > factor of 1280K
> > lvconvert --chunksize 1280K --thinpool bricks/pool --poolmetadata
> > bricks/metadata --poolmetadataspare n
> 
> We can leave out the "--poolmetadataspare n" part when we do it this way,
> and still get the correct alignment, right?
> 
> That will get rid of this ominous warning:
>   Logical volume "tp01" created
>   WARNING: recovery of pools without pool metadata spare LV is not automated.
>   Converted vg0/tp01 to thin pool.

Yes, it'll just be a slightly less optimal layout in that the metadata volume will be _after_ the data volume.  (The start of the disk is generally fastest on rotational storage).

But in practice that likely doesn't matter on this RAID6 storage and having the metadata spare gives Gluster the added benefit of more convenient repair of thinp metadata (if/when that is ever needed).

Comment 21 Corey Marthaler 2014-11-14 23:08:10 UTC
Should there be any checks for attempting to create extent sizes that are way larger than your PVs? :)

[root@host-110 ~]# pvcreate --dataalignment 360960 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  Physical volume "/dev/sda1" successfully created
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created
  Physical volume "/dev/sdd1" successfully created
  Physical volume "/dev/sde1" successfully created

[root@host-110 ~]# pvs -o +pe_start
  PV         VG            Fmt  Attr PSize  PFree  1st PE 
  /dev/sda1                lvm2 ---  15.00g 15.00g 352.50m
  /dev/sdb1                lvm2 ---  15.00g 15.00g 352.50m
  /dev/sdc1                lvm2 ---  15.00g 15.00g 352.50m
  /dev/sdd1                lvm2 ---  15.00g 15.00g 352.50m
  /dev/sde1                lvm2 ---  15.00g 15.00g 352.50m

[root@host-110 ~]# vgcreate --physicalextentsize 72192 snapper_thinp /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  Volume group "snapper_thinp" successfully created

[root@host-110 ~]# pvs -o +pe_start
  PV         VG            Fmt  Attr PSize  PFree  1st PE 
  /dev/sda1  snapper_thinp lvm2 a--  15.00g     0  352.50m
  /dev/sdb1  snapper_thinp lvm2 a--  15.00g     0  352.50m
  /dev/sdc1  snapper_thinp lvm2 a--  15.00g     0  352.50m
  /dev/sdd1  snapper_thinp lvm2 a--  15.00g     0  352.50m
  /dev/sde1  snapper_thinp lvm2 a--  15.00g     0  352.50m

[root@host-110 ~]# pvscan
  PV /dev/sda1   VG snapper_thinp   lvm2 [0    / 0    free]
  PV /dev/sdb1   VG snapper_thinp   lvm2 [0    / 0    free]
  PV /dev/sdc1   VG snapper_thinp   lvm2 [0    / 0    free]
  PV /dev/sdd1   VG snapper_thinp   lvm2 [0    / 0    free]
  PV /dev/sde1   VG snapper_thinp   lvm2 [0    / 0    free]

[root@host-110 ~]# lvcreate  --thinpool POOL  --zero y -L 1G --poolmetadatasize 100M snapper_thinp
  Rounding up size to full physical extent 70.50 GiB
  Rounding up size to full physical extent 70.50 GiB
  WARNING: Maximum supported pool metadata size is 16.00 GiB.
  Rounding up size to full physical extent 70.50 GiB
  Volume group "snapper_thinp" has insufficient free space (0 extents): 1 required.

Comment 22 Corey Marthaler 2014-11-25 00:14:02 UTC
Raid1 currently can't handle non-power-of-2 physical extents. Is this a new raid1-specific bug, or does it block this bug from being verified?


Recreating the PVs/VG with non-power-of-2 data alignment and extent sizes:

pvcreate --dataalignment 3072k /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
vgcreate --physicalextentsize 384k snapper_thinp /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1

[root@host-114 ~]# lvcreate  --type raid1  -m 1 -L 100M -n raid snapper_thinp
 Rounding up size to full physical extent 100.12 MiB
 Using reduced mirror region size of 768 sectors.
 device-mapper: reload ioctl on  failed: Invalid argument
 Failed to activate new LV.

Nov 24 18:08:02 host-114 kernel: device-mapper: table: 253:6: raid: Region size is not a power of 2
Nov 24 18:08:02 host-114 kernel: device-mapper: ioctl: error adding target to table

[root@host-114 ~]# dmsetup ls
snapper_thinp-raid_rmeta_1      (253:4)
snapper_thinp-raid_rmeta_0      (253:2)
snapper_thinp-raid_rimage_1     (253:5)
snapper_thinp-raid_rimage_0     (253:3)

Comment 23 Alasdair Kergon 2014-11-25 02:51:22 UTC
We need to look into this.

1) If the kernel is being too strict, then it needs relaxing with a kernel patch.  (Bug remains ON_QA, retest after kernel is patched.)

2) If the kernel is right, then userspace needs to warn users that they cannot use raid volumes in these VGs. (Bug goes FailedQA and we add a lvm2 patch.)

I hope the first case is possible...

Comment 24 Alasdair Kergon 2014-11-25 03:21:00 UTC
So more work is needed to deal with raid and raid1 targets - if we have to keep the powers of 2 region sizes we either 'lose' a bit of space at the ends of the devices or we just stop people using these targets completely in these VGs.  pvmove probably also can't be supported.

As for the "no data space" PVs, you can have a PV that only contains a metadata area, but it would be sensible to warn you when you're doing this.

Comment 25 Alasdair Kergon 2014-12-03 17:16:01 UTC
So the first change I've made (for the next build) is:
  https://lists.fedorahosted.org/pipermail/lvm2-commits/2014-December/003224.html
which should prevent the mirror code from selecting a non-power-of-2 region size.

In your example, this reduces the mirror region size:

  Rounding up size to full physical extent 100.12 MiB
  Using reduced mirror region size of 256 sectors.
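
The region size can also be given explicitly if the default is not wanted - it just has to remain a power of 2 for the kernel target, e.g. (size chosen only as an illustration):

lvcreate --type raid1 -m 1 -L 100M --regionsize 128k -n raid snapper_thinp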

Comment 28 Corey Marthaler 2015-01-05 23:37:08 UTC
I verified that mirror, raid, cache, thin, and snapshot volumes now work on top of volume groups with non-power-of-2 physical extents.

Comment 29 Corey Marthaler 2015-01-07 20:03:52 UTC
Also verified that there are no misalignment issues when running the configuration case in comment #6. Marking this verified in the latest rpms.


3.10.0-220.el7.x86_64
lvm2-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
lvm2-libs-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
lvm2-cluster-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-libs-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-event-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-event-libs-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-persistent-data-0.4.1-2.el7    BUILT: Wed Nov 12 12:39:46 CST 2014
cmirror-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015



[root@host-115 ~]# dmsetup status
bricks-thinlv: 0 536870400 thin 0 -
bricks-mypool: 0 1073740800 linear 
bricks-mypool-tpool: 0 1073740800 thin-pool 1 288/4145152 0/419430 - rw no_discard_passdown queue_if_no_space 
bricks-mypool_tdata: 0 1073740800 linear 
bricks-mypool_tmeta: 0 33161216 linear 

[root@host-115 ~]# lsblk /dev/sda
NAME                    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                       8:0    0  750G  0 disk 
├─bricks-mypool_tmeta   253:2    0 15.8G  0 lvm  
│ └─bricks-mypool-tpool 253:4    0  512G  0 lvm  
│   ├─bricks-mypool     253:5    0  512G  0 lvm  
│   └─bricks-thinlv     253:6    0  256G  0 lvm  
└─bricks-mypool_tdata   253:3    0  512G  0 lvm  
  └─bricks-mypool-tpool 253:4    0  512G  0 lvm  
    ├─bricks-mypool     253:5    0  512G  0 lvm  
    └─bricks-thinlv     253:6    0  256G  0 lvm  

[root@host-115 ~]# cat /sys/block/dm-2/alignment_offset
0
[root@host-115 ~]# cat /sys/block/dm-3/alignment_offset
0
[root@host-115 ~]# cat /sys/block/dm-4/alignment_offset
0
[root@host-115 ~]# cat /sys/block/dm-5/alignment_offset
0
[root@host-115 ~]# cat /sys/block/dm-6/alignment_offset
0

Comment 31 errata-xmlrpc 2015-03-05 13:08:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0513.html