Bug 1270792 - console partitioned brick devices are created with incorrect data alignment [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhsc
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.2
Assigned To: Timothy Asir
QA Contact: Triveni Rao
Keywords: ZStream
Depends On:
Blocks: 1260783
 
Reported: 2015-10-12 08:09 EDT by Martin Bukatovic
Modified: 2016-05-16 00:39 EDT

See Also:
Fixed In Version: vdsm-4.16.30-1.3
Doc Type: Bug Fix
Doc Text:
Previously, when a brick was created from the Red Hat Gluster Storage Console, an MBR partition was created on the disk and the Logical Volume Manager (LVM) physical volume (PV) was then created on that partition. The offset introduced by the MBR partition invalidated the alignment that the Console applied to the PV, so the PV was not correctly aligned with the underlying RAID volume. With this fix, no partition is created on the disk, and the brick is aligned with the parameters of the underlying RAID volume for better performance.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-01 01:12:16 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
divya: needinfo? (rnachimu)




External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
oVirt gerrit | 47959 | master | MERGED | gluster: fix alignment issue for brick creation on JBOD | 2015-12-21 02:27 EST
oVirt gerrit | 50705 | master | MERGED | gluster: fix size conversion issues in brick create | 2015-12-21 06:32 EST
oVirt gerrit | 50769 | master | MERGED | gluster: create bricks without MBR partitions | 2015-12-21 02:26 EST
oVirt gerrit | 50770 | master | MERGED | gluster add mount options for gluster brick. | 2015-12-21 02:29 EST
oVirt gerrit | 50802 | master | MERGED | gluster: ensure LV size is multiple of VG's PE value | 2015-12-21 08:59 EST
oVirt gerrit | 50805 | ovirt-3.5-gluster | MERGED | gluster: create bricks without MBR partitions | 2015-12-22 00:42 EST
oVirt gerrit | 50806 | ovirt-3.5-gluster | MERGED | gluster: fix alignment issue for brick creation on JBOD | 2015-12-22 00:43 EST
oVirt gerrit | 50807 | ovirt-3.5-gluster | MERGED | gluster add mount options for gluster brick. | 2015-12-22 00:44 EST
oVirt gerrit | 50808 | ovirt-3.5-gluster | MERGED | gluster: fix size conversion issues in brick create | 2015-12-22 00:45 EST
oVirt gerrit | 50809 | ovirt-3.5-gluster | MERGED | gluster: ensure LV size is multiple of VG's PE value | 2015-12-22 00:46 EST
oVirt gerrit | 50850 | ovirt-3.6 | MERGED | gluster: create bricks without MBR partitions | 2015-12-24 03:47 EST
oVirt gerrit | 50851 | ovirt-3.6 | MERGED | gluster: fix alignment issue for brick creation on JBOD | 2015-12-24 03:47 EST
oVirt gerrit | 50852 | ovirt-3.6 | MERGED | gluster add mount options for gluster brick. | 2015-12-24 03:47 EST
oVirt gerrit | 50853 | ovirt-3.6 | MERGED | gluster: fix size conversion issues in brick create | 2015-12-24 03:47 EST
oVirt gerrit | 50854 | ovirt-3.6 | MERGED | gluster: ensure LV size is multiple of VG's PE value | 2015-12-24 03:48 EST

Description Martin Bukatovic 2015-10-12 08:09:23 EDT
Description of problem
======================

When an admin uses the "Create Brick" button[1] to initialize a given storage
device (e.g. vdc), Console partitions it in the following way:

 * a single mbr partition (e.g. vdc1) is created on the device
 * an lvm pv (physical volume) is created in the vdc1 partition
 * an lvm vg (volume group) is created on the pv
 * an lvm thin pool and a thin volume are created inside the vg

So that the final partitioning looks like this:

    [vdc------------------------------- ... --]
    [mbr-gap][vdc1--------------------- ... --]
             [pv-gap][pv-data---------- ... --]
    ^        ^       ^
lba 0        1MB     2MB (pe_start)

Where lba is the address of a sector, counted from the start of the device.

As you can see, the first mbr partition starts on a 1MB boundary (the default
value), and there is another gap for lvm metadata, again 1MB by default. This
means that pe_start is actually located 2MB from the start of the device.

The problem is that when you need to tweak data alignment (e.g. when RAID is
used), Console uses the --dataalignment option of the pvcreate command to do
it. But pvcreate only cares about the start of the device it is created on
(vdc1 in our case) and, by design, doesn't know about the 1MB wide mbr gap.

So, for example, when one needs to set data alignment to a value which is not
a multiple of 1MB, let's say 1.25MB, Console aligns the data incorrectly:
pe_start ends up at 1MB (mbr gap) + 1.25MB (lvm data alignment) = 2.25MB from
the start of the device, which is not a multiple of 1.25MB.
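
The arithmetic above can be checked with a short shell sketch. It is only an illustration, not what Console runs: the function name check_alignment is made up here, sizes are given in KiB so that 1.25MB stays an integer, and it assumes that pvcreate puts pe_start at the first multiple of the requested alignment counted from the start of the partition, which matches the pvs output in this report.

```shell
# Print "aligned" if the absolute start of the PV data area is a multiple
# of the requested alignment, "misaligned" otherwise.
#   $1 = offset of the partition from the start of the disk (KiB)
#   $2 = value passed to pvcreate --dataalignment (KiB)
check_alignment() {
    part_offset_kib=$1
    align_kib=$2
    # Assumption: pe_start is the first multiple of the alignment,
    # counted from the start of the partition.
    abs_start_kib=$((part_offset_kib + align_kib))
    if [ $((abs_start_kib % align_kib)) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

# 1024 KiB is the default 1MB mbr gap described above.
for align in 256 512 1024 1280 2048; do
    echo "$align KiB: $(check_alignment 1024 "$align")"
done
```

Under that assumption the loop reports 256, 512 and 1024 KiB as aligned and 1280 and 2048 KiB as misaligned, which suggests that the gap only goes unnoticed when the alignment value evenly divides 1MB.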

Moreover, Console seems to break our own recommendations. The process for
aligning partitions properly is described in great detail in chapter "11.2.
Brick Configuration"[2] of the RHS Admin Guide, and mbr partitions are not
mentioned there at all.

[1] the button is available inside "Storage Devices" tab for each Host.
[2] https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/Brick_Configuration.html

Version-Release number of selected component (if applicable)
============================================================

rhsc-3.1.0-0.62.el6.noarch

How reproducible
================

100%

Steps to Reproduce
==================

1. Prepare a clean disk device on one host managed by Console
2. Use the "Create Brick" function to initialize the disk device while
   selecting a nontrivial setup which requires a special data
   alignment which is not a multiple of 1MB (e.g. some RAID).
3. Check the partitioning created on the disk device.

Actual results
==============

Let's look at the partition table on the disk device (/dev/vdc in my case):

~~~
# fdisk -cul /dev/vdc

Disk /dev/vdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders, total 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0009f9f6

   Device Boot      Start         End      Blocks   Id  System
/dev/vdc1            2048   209715199   104856576   8e  Linux LVM
~~~

As we can see, Console created the mbr partition vdc1 on the device. This
means that the first sector of the vdc1 partition is at lba 2048 (1MB from
the beginning of the disk). So far so good.

But when we look at pv setup:

~~~
# pvs -o +pe_start /dev/vdc1
  PV         VG                Fmt  Attr PSize   PFree 1st PE
  /dev/vdc1  vg-alignmentbrick lvm2 a--  100.00g 3.50g   1.25m
~~~

We see that a data alignment of 1.25 MB was used. If I read the pvcreate
manpage right, this means that the actual first sector of the pv data area
(pe_start) lies at 1MB + 1.25 MB = 2.25 MB from the start of the disk. This
is a problem, because 2.25 MB is not a multiple of the required data
alignment value of 1.25 MB, and so the alignment setup is wrong.
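
The same check can be written down with the numbers from the fdisk and pvs output above. This is a sketch for illustration only. The variable names are made up, and the pe_start and alignment values are 1.25m converted to bytes by hand.

```shell
# Values copied from the fdisk and pvs output above.
sector_size=512          # bytes, from "Units = sectors of 1 * 512"
part_start_sector=2048   # Start column of /dev/vdc1
pe_start_bytes=1310720   # "1st PE" of 1.25m, i.e. 1.25 * 1024 * 1024
align_bytes=1310720      # requested data alignment, also 1.25m

# Offset of the first physical extent from the start of /dev/vdc.
abs_pe_start=$((part_start_sector * sector_size + pe_start_bytes))
# A remainder of 0 would mean the data area is aligned.
remainder=$((abs_pe_start % align_bytes))

echo "absolute pe_start: $abs_pe_start bytes"
echo "remainder: $remainder bytes"
```

On a live system the inputs could presumably be read from /sys/class/block/vdc1/start (in 512 byte sectors) and from "pvs --noheadings --nosuffix --units b -o pe_start /dev/vdc1", but that was not done as part of this report.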

Expected results
================

Console makes sure that actual data alignment matches the requirements.

It seems to me that the easiest way to achieve this would be to drop the mbr
partition entirely. Note that chapter "11.2. Brick Configuration"[2] doesn't
mention mbr partitioning at all.
Comment 2 Martin Bukatovic 2015-10-19 12:01:27 EDT
If this BZ is fixed by not creating an mbr partition when a brick is created
on a clean device, as suggested in the expected results section, would it make
sense to prevent the storage admin from creating a brick on an already existing
(not created by Console) mbr partition?
Comment 4 Ramesh N 2015-12-01 00:41:59 EST
(In reply to Martin Bukatovic from comment #2)
> If this BZ is fixed by not creating an mbr partition when a brick is created
> on a clean device, as suggested in the expected results section, would it
> make sense to prevent the storage admin from creating a brick on an already
> existing (not created by Console) mbr partition?

Yes, this makes sense. Let me clone bz#1211140 downstream.
Comment 6 Triveni Rao 2015-12-28 06:37:45 EST
Before creating any bricks:
=============================================================================================

[root@dhcp35-99 ~]# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/rhgs-root   18G  1.5G   16G   9% /
devtmpfs               1.9G     0  1.9G   0% /dev
tmpfs                  1.9G     0  1.9G   0% /dev/shm
tmpfs                  1.9G  8.5M  1.9G   1% /run
tmpfs                  1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1              497M   89M  409M  18% /boot
tmpfs                  389M     0  389M   0% /run/user/0
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# pvs
  PV         VG   Fmt  Attr PSize  PFree 
  /dev/vda2  rhgs lvm2 a--  19.51g 40.00m
[root@dhcp35-99 ~]# vgs
  VG   #PV #LV #SN Attr   VSize  VFree 
  rhgs   1   2   0 wz--n- 19.51g 40.00m
[root@dhcp35-99 ~]# lvs
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root rhgs -wi-ao---- 17.47g                                                    
  swap rhgs -wi-ao----  2.00g                                                    
[root@dhcp35-99 ~]# fdisk -l |grep vd
Disk /dev/vda: 21.5 GB, 21474836480 bytes, 41943040 sectors
/dev/vda1   *        2048     1026047      512000   83  Linux
/dev/vda2         1026048    41943039    20458496   8e  Linux LVM
Disk /dev/vdb: 53.7 GB, 53687091200 bytes, 104857600 sectors
Disk /dev/vdc: 53.7 GB, 53687091200 bytes, 104857600 sectors
Disk /dev/vdd: 53.7 GB, 53687091200 bytes, 104857600 sectors
Disk /dev/vde: 53.7 GB, 53687091200 bytes, 104857600 sectors
Disk /dev/vdf: 53.7 GB, 53687091200 bytes, 104857600 sectors
Disk /dev/vdg: 53.7 GB, 53687091200 bytes, 104857600 sectors
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# 

=======================================================================================================================

After creating a brick from Console: RAID 6 was used, with a stripe size of 128KB and 6 disks,
so data alignment = 4 data disks * 128KB = 512KB
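
The expected alignment values used in this comment (RAID 6 here, RAID 10 further down) follow from data disks * stripe size. A minimal sketch of that calculation, with made-up function names and only the two RAID levels tested here:

```shell
# Number of data-bearing disks for the RAID levels used in this comment.
#   $1 = RAID level (6 or 10), $2 = total number of disks
data_disks() {
    case "$1" in
        6)  echo $(($2 - 2)) ;;   # RAID 6 spends two disks' worth on parity
        10) echo $(($2 / 2)) ;;   # RAID 10 keeps a mirror copy of every disk
        *)  echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

# Data alignment in KiB = data disks * stripe size in KiB.
#   $1 = RAID level, $2 = total disks, $3 = stripe size (KiB)
data_alignment_kib() {
    disks=$(data_disks "$1" "$2") || return 1
    echo $((disks * $3))
}

data_alignment_kib 6 6 128    # RAID 6, 6 disks, 128KB stripe size
data_alignment_kib 10 4 256   # RAID 10, 4 disks, 256KB stripe size
```

Both calls print 512, the value expected in the "1st PE" column of the pvs output for these bricks.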

[root@dhcp35-99 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/rhgs-root           18G  1.5G   16G   9% /
devtmpfs                       1.9G     0  1.9G   0% /dev
tmpfs                          1.9G     0  1.9G   0% /dev/shm
tmpfs                          1.9G  8.6M  1.9G   1% /run
tmpfs                          1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1                      497M   89M  409M  18% /boot
tmpfs                          389M     0  389M   0% /run/user/0
/dev/mapper/vg--brick1-brick1   50G   33M   50G   1% /rhgs/brick1
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# pvs
  PV         VG        Fmt  Attr PSize  PFree 
  /dev/vda2  rhgs      lvm2 a--  19.51g 40.00m
  /dev/vdg   vg-brick1 lvm2 a--  50.00g  1.50m
[root@dhcp35-99 ~]# vgs
  VG        #PV #LV #SN Attr   VSize  VFree 
  rhgs        1   2   0 wz--n- 19.51g 40.00m
  vg-brick1   1   2   0 wz--n- 50.00g  1.50m
[root@dhcp35-99 ~]# lvs
  LV          VG        Attr       LSize  Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root        rhgs      -wi-ao---- 17.47g                                                           
  swap        rhgs      -wi-ao----  2.00g                                                           
  brick1      vg-brick1 Vwi-aot--- 50.00g pool-brick1        0.07                                   
  pool-brick1 vg-brick1 twi-aot--- 49.75g                    0.07   0.03                            
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# fdisk -l /dev/vdg

Disk /dev/vdg: 53.7 GB, 53687091200 bytes, 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

[root@dhcp35-99 ~]# 


[root@dhcp35-99 ~]# pvs -o +pe_start /dev/vdg
  PV         VG        Fmt  Attr PSize  PFree 1st PE 
  /dev/vdg   vg-brick1 lvm2 a--  50.00g 1.50m 512.00k
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# cat /etc/fstab 

#
# /etc/fstab
# Created by anaconda on Sat Dec 26 04:42:40 2015
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhgs-root   /                       xfs     defaults        0 0
UUID=0ebedc93-5a4c-4709-8860-b398ed59ec7e /boot                   xfs     defaults        0 0
/dev/mapper/rhgs-swap   swap                    swap    defaults        0 0
/dev/mapper/vg--brick1-brick1	/rhgs/brick1	xfs	inode64,noatime	0	0
[root@dhcp35-99 ~]# 


=======================================================================================================


For RAID 10, the stripe size is 256KB and the number of disks is 4,
so data alignment = 2 data disks * 256KB = 512KB


[root@dhcp35-99 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/rhgs-root           18G  1.6G   16G   9% /
devtmpfs                       1.9G     0  1.9G   0% /dev
tmpfs                          1.9G     0  1.9G   0% /dev/shm
tmpfs                          1.9G  8.6M  1.9G   1% /run
tmpfs                          1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1                      497M   89M  409M  18% /boot
tmpfs                          389M     0  389M   0% /run/user/0
/dev/mapper/vg--brick1-brick1   50G   33M   50G   1% /rhgs/brick1
/dev/mapper/vg--brick2-brick2   50G   33M   50G   1% /rhgs/brick2
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# pvs
  PV         VG        Fmt  Attr PSize  PFree 
  /dev/vda2  rhgs      lvm2 a--  19.51g 40.00m
  /dev/vdf   vg-brick2 lvm2 a--  50.00g  1.50m
  /dev/vdg   vg-brick1 lvm2 a--  50.00g  1.50m
[root@dhcp35-99 ~]# vgs
  VG        #PV #LV #SN Attr   VSize  VFree 
  rhgs        1   2   0 wz--n- 19.51g 40.00m
  vg-brick1   1   2   0 wz--n- 50.00g  1.50m
  vg-brick2   1   2   0 wz--n- 50.00g  1.50m
[root@dhcp35-99 ~]# lvs
  LV          VG        Attr       LSize  Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root        rhgs      -wi-ao---- 17.47g                                                           
  swap        rhgs      -wi-ao----  2.00g                                                           
  brick1      vg-brick1 Vwi-aot--- 50.00g pool-brick1        0.07                                   
  pool-brick1 vg-brick1 twi-aot--- 49.75g                    0.07   0.03                            
  brick2      vg-brick2 Vwi-aot--- 50.00g pool-brick2        0.06                                   
  pool-brick2 vg-brick2 twi-aot--- 49.75g                    0.06   0.04                            
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# fdisk -l /dev/vdf

Disk /dev/vdf: 53.7 GB, 53687091200 bytes, 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]#  pvs -o +pe_start /dev/vdf
  PV         VG        Fmt  Attr PSize  PFree 1st PE 
  /dev/vdf   vg-brick2 lvm2 a--  50.00g 1.50m 512.00k
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# 


===========================================================================================================================

With normal brick creation without a RAID configuration, the usual JBOD stripe size of 256KB is used.

[root@dhcp35-99 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/rhgs-root           18G  1.6G   16G   9% /
devtmpfs                       1.9G     0  1.9G   0% /dev
tmpfs                          1.9G     0  1.9G   0% /dev/shm
tmpfs                          1.9G  8.6M  1.9G   1% /run
tmpfs                          1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1                      497M   89M  409M  18% /boot
tmpfs                          389M     0  389M   0% /run/user/0
/dev/mapper/vg--brick1-brick1   50G   33M   50G   1% /rhgs/brick1
/dev/mapper/vg--brick2-brick2   50G   33M   50G   1% /rhgs/brick2
/dev/mapper/vg--brick3-brick3   50G   33M   50G   1% /rhgs/brick3
[root@dhcp35-99 ~]# pvs
  PV         VG        Fmt  Attr PSize  PFree 
  /dev/vda2  rhgs      lvm2 a--  19.51g 40.00m
  /dev/vde   vg-brick3 lvm2 a--  50.00g  1.75m
  /dev/vdf   vg-brick2 lvm2 a--  50.00g  1.50m
  /dev/vdg   vg-brick1 lvm2 a--  50.00g  1.50m
[root@dhcp35-99 ~]# vgs
  VG        #PV #LV #SN Attr   VSize  VFree 
  rhgs        1   2   0 wz--n- 19.51g 40.00m
  vg-brick1   1   2   0 wz--n- 50.00g  1.50m
  vg-brick2   1   2   0 wz--n- 50.00g  1.50m
  vg-brick3   1   2   0 wz--n- 50.00g  1.75m
[root@dhcp35-99 ~]# lvs
  LV          VG        Attr       LSize  Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root        rhgs      -wi-ao---- 17.47g                                                           
  swap        rhgs      -wi-ao----  2.00g                                                           
  brick1      vg-brick1 Vwi-aot--- 50.00g pool-brick1        0.07                                   
  pool-brick1 vg-brick1 twi-aot--- 49.75g                    0.07   0.03                            
  brick2      vg-brick2 Vwi-aot--- 50.00g pool-brick2        0.06                                   
  pool-brick2 vg-brick2 twi-aot--- 49.75g                    0.06   0.04                            
  brick3      vg-brick3 Vwi-aot--- 50.00g pool-brick3        0.06                                   
  pool-brick3 vg-brick3 twi-aot--- 49.75g                    0.06   0.04                            
[root@dhcp35-99 ~]# 

[root@dhcp35-99 ~]# fdisk -l /dev/vde

Disk /dev/vde: 53.7 GB, 53687091200 bytes, 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]#  pvs -o +pe_start /dev/vde
  PV         VG        Fmt  Attr PSize  PFree 1st PE 
  /dev/vde   vg-brick3 lvm2 a--  50.00g 1.75m 256.00k
[root@dhcp35-99 ~]# 
[root@dhcp35-99 ~]# 


Conclusion: no mbr partitions were created when creating bricks from Console, so the bricks are aligned with the RAID parameters.
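
For reference, the "1st PE" values reported above can be checked against the expected alignment with a small sketch like the one below. The helper name to_kib is made up, the device/value triples are typed in by hand from the pvs output in this comment, and it assumes that the lowercase k/m/g suffixes printed by pvs are powers of 1024.

```shell
# Convert an LVM size such as "512.00k" or "1.25m" to whole KiB.
to_kib() {
    echo "$1" | awk '{
        unit = substr($0, length($0), 1)       # trailing unit letter
        value = substr($0, 1, length($0) - 1)  # numeric part
        if (unit == "k") mult = 1
        else if (unit == "m") mult = 1024
        else if (unit == "g") mult = 1024 * 1024
        else exit 1
        printf "%d\n", value * mult
    }'
}

# Each input line: device, reported 1st PE, expected data alignment.
while read -r dev pe_start align; do
    pe_kib=$(to_kib "$pe_start")
    align_kib=$(to_kib "$align")
    # The PV sits on the whole disk, so pe_start is already the
    # offset from the start of the device.
    if [ $((pe_kib % align_kib)) -eq 0 ]; then
        echo "$dev: aligned"
    else
        echo "$dev: misaligned"
    fi
done <<EOF
/dev/vdg 512.00k 512k
/dev/vdf 512.00k 512k
/dev/vde 256.00k 256k
EOF
```

All three devices are reported as aligned, since no partition offset has to be added to pe_start any more.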
Comment 9 errata-xmlrpc 2016-03-01 01:12:16 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0310.html
