Bug 1412455

Summary: [Bug] Gluster brick created with RHEV manager is overallocated
Product: Red Hat Enterprise Virtualization Manager
Reporter: Sachin Raje <sraje>
Component: vdsm
Assignee: Gobinda Das <godas>
Status: CLOSED ERRATA
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.9
CC: bkunal, godas, lsurette, mpillai, nkshirsa, pstehlik, ratamir, sabose, sasundar, sraje, srevivo, tjelinek, ycui, ykaul
Target Milestone: ovirt-4.2.0
Keywords: Reopened
Target Release: 4.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text: If this bug requires documentation, please select an appropriate Doc Type value.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 17:49:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sachin Raje 2017-01-12 04:42:14 UTC
Description of problem:

When a Gluster brick is added using RHEV Manager, RHEV creates the brick on an LVM thin pool. In one customer's case this led to corruption of the Gluster filesystem.

The storage team suggested the cause could be "thin_pool_autoextend_threshold = 100" being set in lvm.conf.


Version-Release number of selected component (if applicable):
RHEV-3.6
RHEV-H 7.x


How reproducible: Always


Steps to Reproduce:
1. Attach a new empty disk to the hypervisor.
2. In the RHEV-M portal, select the Host.
3. Select Storage Devices and click on the newly added empty disk.
4. Click on "Create Brick" and create a brick.

Actual results: 

The "thin_pool_autoextend_threshold" value in lvm.conf of host is set to "100".

Expected results:
The value for "thin_pool_autoextend_threshold" should be set to less than 100 while using 'lvm thin pool'. 

It should be set less than 100 by default by rhev / vdsm while using gluster host in RHEV or allow user to manually set it while creating gluster brick using RHEVM portal.
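
For illustration only, a minimal lvm.conf fragment with autoextend enabled could look like the following; 70 and 20 are the example values from the lvm.conf comments, not values mandated by this bug, and the right threshold should still be chosen by the admin or by the tool:

    # /etc/lvm/lvm.conf, activation section (illustrative values)
    thin_pool_autoextend_threshold = 70   # extend the pool once it is 70% full
    thin_pool_autoextend_percent = 20     # grow it by 20% of its current size
    monitoring = 1                        # keep dmeventd monitoring enabled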

Additional info:

I checked by installing the latest Gluster 3.1 host, and it also has the default value "thin_pool_autoextend_threshold = 100", which according to the storage team should be less than 100.


# egrep 'monitoring =|thin_pool_autoextend_threshold|thin_pool_autoextend_percent' /etc/lvm/lvm.conf
    # 'thin_pool_autoextend_threshold' and 'thin_pool_autoextend_percent' define
    # For example, if you set thin_pool_autoextend_threshold to 70 and
    # thin_pool_autoextend_percent to 20, whenever a pool exceeds 70% usage,
    # Setting thin_pool_autoextend_threshold to 100 disables automatic
    thin_pool_autoextend_threshold = 100
    thin_pool_autoextend_percent = 20
    monitoring = 1

--------------

I received the reply below from the storage team regarding the lvm.conf value "thin_pool_autoextend_threshold = 100" and how it can cause filesystem corruption.

~~~
If thin_pool_autoextend_threshold is set to 100, autoextend is off.

       lvm.conf(5) thin_pool_autoextend_threshold
       is a percentage full value that defines when the thin pool LV should
       be extended.  Setting this to 100 disables automatic extension.  The
       minimum value is 50.

So if automatic extension is disabled and the thin pool fills up, we run into undefined behavior. A full thin pool does not behave like a full filesystem: the state of pending I/Os is uncertain, and filesystem corruption can certainly occur in such cases.

We generally advise sysadmins to make sure that this situation never occurs: monitoring must be on, and autoextend must be in place. If the default lvm.conf generated through gluster/rhev does not set this value correctly, then I suspect it's a bug.

# egrep 'monitoring =|thin_pool_autoextend_threshold|thin_pool_autoextend_percent' /etc/lvm/lvm.conf
	# Configuration option activation/thin_pool_autoextend_threshold.
	# Also see thin_pool_autoextend_percent.
	# thin_pool_autoextend_threshold = 70
	thin_pool_autoextend_threshold = 100
	# Configuration option activation/thin_pool_autoextend_percent.
	# thin_pool_autoextend_percent = 20
	thin_pool_autoextend_percent = 20
	monitoring = 1

~~~
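
To make the quoted behaviour concrete (a hedged, hypothetical example, not data from the customer system): with thin_pool_autoextend_threshold = 70 and thin_pool_autoextend_percent = 20, a 1 TiB thin pool is extended by 20% (to roughly 1.2 TiB) as soon as its data usage crosses 70%, provided monitoring is enabled and the VG still has free extents. Pool usage can be checked with standard lvs report fields:

    # Data% / Meta% show how full the pool data and metadata are.
    lvs -o lv_name,lv_size,data_percent,metadata_percent vg-brick04/pool-brick04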

Comment 1 Sachin Raje 2017-01-12 04:45:27 UTC
if the pv, the vg and the thinly provisioned lv are all the same size, where do we store the metadata? Don't we need to reserve at least some space?

  --- Physical volume ---
  PV Name               /dev/mapper/mpathb
  VG Name               vg-brick04
  PV Size               36.33 TiB / not usable 0   
  Allocatable           yes 
  PE Size               1.25 MiB
  Total PE              30478511
  Free PE               1
  Allocated PE          30478510
  PV UUID               CQdZxE-Lj3D-kuRD-zYma-cn8t-3s9P-hCo30s

  --- Volume group ---
  VG Name               vg-brick04
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  7
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               36.33 TiB
  PE Size               1.25 MiB
  Total PE              30478511
  Alloc PE / Size       30478510 / 36.33 TiB
  Free  PE / Size       1 / 1.25 MiB
  VG UUID               PKYtXv-Zk4Z-6b5a-k6Ym-JSO3-5uxW-NyIsZZ

  --- Logical volume ---
  LV Path                /dev/vg-brick04/brick04
  LV Name                brick04
  VG Name                vg-brick04
  LV UUID                2owAeG-qPIC-GFK7-l6yC-F4fF-XYAj-5JMMZl
  LV Write Access        read/write
  LV Creation host, time svg302.sst.rad.lan, 2016-08-11 10:36:52 +0200
  LV Pool name           pool-brick04
  LV Status              available
  # open                 1
  LV Size                36.33 TiB
  Mapped size            97.49%
  Current LE             30478511
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:27

  LV                   VG         Attr       LSize   Pool         Origin Data%  Meta%  Move Log Cpy%Sync Convert PE Ranges        
pool-brick04         vg-brick04 twi-aot---  36.32t                     97.53  5.53                             pool-brick04_tdata:0-30465402       
  [pool-brick04_tdata] vg-brick04 Twi-ao----  36.32t                                                             /dev/mapper/mpathb:0-30465402       
  [pool-brick04_tmeta] vg-brick04 ewi-ao----  16.00g                                                             /dev/mapper/mpathb:30465403-30478509

---------------------

This is why the storage team suggested setting "thin_pool_autoextend_threshold" to less than 100; the exact value should be decided by the admin or by the tool that creates the brick.

Comment 2 Allon Mureinik 2017-01-12 09:48:16 UTC
This bug is about managing Gluster bricks in RHV, not about how RHV consumes them.

Sahina - I'm assigning it to you for initial research. I guess the component should be changed too, but I'm not sure to what.

Comment 3 Sahina Bose 2017-01-23 10:31:38 UTC
Ramesh - can you take a look?

Comment 4 Ramesh N 2017-01-23 12:55:15 UTC
(In reply to Sachin Raje from comment #1)
> if the pv, the vg and the thinly provisioned lv are all the same size, where
> do we store the metadata? Don't we need to reserve at least some space?
> 
>   --- Physical volume ---
>   PV Name               /dev/mapper/mpathb
>   VG Name               vg-brick04
>   PV Size               36.33 TiB / not usable 0   
>   Allocatable           yes 
>   PE Size               1.25 MiB
>   Total PE              30478511
>   Free PE               1
>   Allocated PE          30478510
>   PV UUID               CQdZxE-Lj3D-kuRD-zYma-cn8t-3s9P-hCo30s
> 
>   --- Volume group ---
>   VG Name               vg-brick04
>   System ID             
>   Format                lvm2
>   Metadata Areas        1
>   Metadata Sequence No  7
>   VG Access             read/write
>   VG Status             resizable
>   MAX LV                0
>   Cur LV                2
>   Open LV               1
>   Max PV                0
>   Cur PV                1
>   Act PV                1
>   VG Size               36.33 TiB
>   PE Size               1.25 MiB
>   Total PE              30478511
>   Alloc PE / Size       30478510 / 36.33 TiB
>   Free  PE / Size       1 / 1.25 MiB
>   VG UUID               PKYtXv-Zk4Z-6b5a-k6Ym-JSO3-5uxW-NyIsZZ
> 
>   --- Logical volume ---
>   LV Path                /dev/vg-brick04/brick04
>   LV Name                brick04
>   VG Name                vg-brick04
>   LV UUID                2owAeG-qPIC-GFK7-l6yC-F4fF-XYAj-5JMMZl
>   LV Write Access        read/write
>   LV Creation host, time svg302.sst.rad.lan, 2016-08-11 10:36:52 +0200
>   LV Pool name           pool-brick04
>   LV Status              available
>   # open                 1
>   LV Size                36.33 TiB
>   Mapped size            97.49%
>   Current LE             30478511
>   Segments               1
>   Allocation             inherit
>   Read ahead sectors     auto
>   - currently set to     8192
>   Block device           253:27
> 
>   LV                   VG         Attr       LSize   Pool         Origin
> Data%  Meta%  Move Log Cpy%Sync Convert PE Ranges        
> pool-brick04         vg-brick04 twi-aot---  36.32t                     97.53
> 5.53                             pool-brick04_tdata:0-30465402       
>   [pool-brick04_tdata] vg-brick04 Twi-ao----  36.32t                        
> /dev/mapper/mpathb:0-30465402       
>   [pool-brick04_tmeta] vg-brick04 ewi-ao----  16.00g                        
> /dev/mapper/mpathb:30465403-30478509
> 
> ---------------------
> 
> This is the reason storage team suggested to set
> 'thin_pool_autoextend_threshold" less than 100 and the exact value should be
> decided by the admin or tool which creates it ?

We are not modifying the RHEL default value of "thin_pool_autoextend_threshold". However, during Gluster brick creation 16 GB is reserved for the pool metadata, so in this case the thin LV is overallocated by 16 GB. We should subtract this from the thin LV. But how will that work with LVM snapshots? Is there a possibility of hitting the same problem when LVM snapshots are used?
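
Restating the numbers from comment #1 to make the overallocation concrete:

    thin pool metadata (tmeta)   16.00 GiB
    thin pool data (tdata)      ~36.32 TiB  (VG size minus the metadata reservation)
    thin LV virtual size         36.33 TiB  (more than the pool can ever hold: overallocated)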

Comment 12 Ramesh N 2017-02-20 11:10:39 UTC
This has to be fixed in vdsm-gluster, so I am changing the component accordingly.

Comment 13 Ramesh N 2017-02-20 12:51:25 UTC
I will post a patch to match the size of the thin LV with the size of the thin pool. This guarantees that there is no overallocation as long as there is no LVM snapshot/Gluster volume snapshot.
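
A minimal sketch of the corrected sizing, assuming a hypothetical empty device /dev/sdb and the names vg-brick1/pool-brick1/brick1 (this is not the exact command sequence VDSM runs, just an illustration of the sizing rule):

    pvcreate /dev/sdb
    vgcreate vg-brick1 /dev/sdb
    # Size the thin pool data below the VG so the 16 GiB metadata LV fits.
    lvcreate --type thin-pool --name pool-brick1 --size 1000G \
             --poolmetadatasize 16G vg-brick1
    # Give the thin LV a virtual size equal to the pool data size (1000G),
    # not the full VG size, so its Current LE matches the pool's.
    lvcreate --type thin --name brick1 --virtualsize 1000G \
             --thinpool pool-brick1 vg-brick1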

Comment 14 Ramesh N 2017-02-20 12:55:02 UTC
Before the fix:
You can see that the Current LE of the thin LV is higher than the Current LE of the thin pool.
   
  --- Logical volume ---
  LV Name                pool-brick1
  VG Name                vg-brick1
  LV UUID                Ynk3ou-kAnd-MuZe-d1Qn-owPQ-DHT5-fKwULZ
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 17:06:12 +0530
  LV Pool metadata       pool-brick1_tmeta
  LV Pool data           pool-brick1_tdata
  LV Status              available
  # open                 2
  LV Size                1.80 TiB
  Allocated pool data    0.05%
  Allocated metadata     0.01%
  Current LE             1512651
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:22
   
  --- Logical volume ---
  LV Path                /dev/vg-brick1/brick1
  LV Name                brick1
  VG Name                vg-brick1
  LV UUID                mH035s-CciV-By4G-RIIi-sHP0-oxaX-ZnFLoF
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 17:06:20 +0530
  LV Pool name           pool-brick1
  LV Status              available
  # open                 1
  LV Size                1.82 TiB
  Mapped size            0.05%
  Current LE             1525759
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:24


After the fix:
You can see that the Current LE of the thin pool and the thin LV match exactly.
  --- Logical volume ---
  LV Name                pool-brick2
  VG Name                vg-brick2
  LV UUID                nDngM3-tL4q-3Xaf-Azqn-ljHR-psZf-u8FQAr
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 18:14:41 +0530
  LV Pool metadata       pool-brick2_tmeta
  LV Pool data           pool-brick2_tdata
  LV Status              available
  # open                 2
  LV Size                1.80 TiB
  Allocated pool data    0.05%
  Allocated metadata     0.01%
  Current LE             1512651
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:27
   
  --- Logical volume ---
  LV Path                /dev/vg-brick2/brick2
  LV Name                brick2
  VG Name                vg-brick2
  LV UUID                SQgqXI-n564-6AkX-aJlR-KpM5-ajYD-IqDdRc
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 18:14:50 +0530
  LV Pool name           pool-brick2
  LV Status              available
  # open                 1
  LV Size                1.80 TiB
  Mapped size            0.05%
  Current LE             1512651
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:29

Comment 17 Allon Mureinik 2017-05-23 05:31:38 UTC
Ramesh - The patch attached to this BZ has been merged for several months. If this is indeed a z-stream bug, it needs to be cloned and the patch needs to be backported.
If it isn't, please retarget it to 4.2.0.

Comment 18 Kevin Alon Goldblatt 2017-06-27 15:42:25 UTC
moving to

Comment 23 Kevin Alon Goldblatt 2017-12-06 13:55:07 UTC
Verified with code:
--------------------------
vdsm-4.20.9-1.git8d0bd46.el7.centos.x86_64


Verified with scenario:
--------------------------
1. I attached a 110 GB disk to my host.
2. I created a brick via Host > Storage Devices > Create Brick.
3. The result is that the Logical Extents (Current LE) for the thin LV and the thin pool are the same.
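
A hedged way to spot-check this from the shell (standard lvs report fields; the names are the ones from the output below):

    # lv_size of brick2 should equal lv_size of pool-brick2 after the fix.
    lvs -o lv_name,lv_size,pool_lv,data_percent vg-brick2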

Moving to VERIFIED


 --- Logical volume ---
  LV Name                pool-brick2
  VG Name                vg-brick2
  LV UUID                nLGNxf-zl7d-Obh5-Vqbd-ixCb-z4hL-0GhTDw
  LV Write Access        read/write
  LV Creation host, time vm-83-162.scl.lab.tlv.redhat.com, 2017-12-05 14:29:46 +0200
  LV Pool metadata       pool-brick2_tmeta
  LV Pool data           pool-brick2_tdata
  LV Status              available
  # open                 2
  LV Size                <99.50 GiB
  Allocated pool data    0.05%
  Allocated metadata     0.03%  *****IS THIS CORRECT AS EXPECTED ??? *****
  Current LE             407548
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:3
   
  --- Logical volume ---
  LV Path                /dev/vg-brick2/brick2
  LV Name                brick2
  VG Name                vg-brick2
  LV UUID                lG5xm2-c6el-NOGc-Jgv5-SCYF-CYtS-HoRRnv
  LV Write Access        read/write
  LV Creation host, time vm-83-162.scl.lab.tlv.redhat.com, 2017-12-05 14:29:52 +0200
  LV Pool name           pool-brick2
  LV Status              available
  # open                 1
  LV Size                <99.50 GiB
  Mapped size            0.05%
  Current LE             407548
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:5

Comment 28 errata-xmlrpc 2018-05-15 17:49:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 29 Franta Kust 2019-05-16 13:05:56 UTC
BZ<2>Jira Resync

Comment 30 Daniel Gur 2019-08-28 12:57:52 UTC
sync2jira

Comment 31 Daniel Gur 2019-08-28 13:03:01 UTC
sync2jira

Comment 32 Daniel Gur 2019-08-28 13:13:11 UTC
sync2jira

Comment 33 Daniel Gur 2019-08-28 13:17:23 UTC
sync2jira