Description of problem:
While adding a gluster brick using RHEV Manager, RHEV uses an LVM thin pool to create the brick. In one customer case we hit a gluster filesystem corruption issue. The storage team suggested it could be due to "thin_pool_autoextend_threshold = 100" being set in lvm.conf.

Version-Release number of selected component (if applicable):
RHEV-3.6
RHEV-H 7.x

How reproducible:
Always

Steps to Reproduce:
1. Attach a new empty disk to the hypervisor.
2. From the RHEV-M portal, select the host.
3. Select Storage Devices and click on the newly added empty disk.
4. Click "Create Brick" and create a brick.

Actual results:
The "thin_pool_autoextend_threshold" value in lvm.conf on the host is left at "100".

Expected results:
The value of "thin_pool_autoextend_threshold" should be less than 100 when an LVM thin pool is used. It should be set below 100 by default by RHEV/VDSM when the host is used for gluster in RHEV, or the user should be allowed to set it manually while creating a gluster brick from the RHEV-M portal.

Additional info:
I checked a freshly installed Gluster 3.1 host and it also has the default value "thin_pool_autoextend_threshold = 100", which according to the storage team should be less than 100.

# egrep 'monitoring =|thin_pool_autoextend_threshold|thin_pool_autoextend_percent' /etc/lvm/lvm.conf
	# 'thin_pool_autoextend_threshold' and 'thin_pool_autoextend_percent' define
	# For example, if you set thin_pool_autoextend_threshold to 70 and
	# thin_pool_autoextend_percent to 20, whenever a pool exceeds 70% usage,
	# Setting thin_pool_autoextend_threshold to 100 disables automatic
	thin_pool_autoextend_threshold = 100
	thin_pool_autoextend_percent = 20
	monitoring = 1

--------------

I have got the below reply from the storage team regarding the lvm.conf value "thin_pool_autoextend_threshold = 100" and how it can cause filesystem corruption.

~~~
If thin_pool_autoextend_threshold is set to 100, autoextend is off.

lvm.conf(5):
  thin_pool_autoextend_threshold is a percentage full value that defines
  when the thin pool LV should be extended. Setting this to 100 disables
  automatic extension. The minimum value is 50.

So if automatic extension is disabled and the thin pool fills up, we run into undefined behavior. A full thin pool does not act like a full filesystem. We basically are not sure about pending I/Os, etc., and filesystem corruption can certainly occur in such cases. We generally advise sysadmins to make sure that this situation never occurs: monitoring must be on, and autoextend must be in place. If the default lvm.conf generated through gluster/RHEV is not setting this value correctly, then I suspect it's a bug.

# egrep 'monitoring =|thin_pool_autoextend_threshold|thin_pool_autoextend_percent' /etc/lvm/lvm.conf
	# Configuration option activation/thin_pool_autoextend_threshold.
	# Also see thin_pool_autoextend_percent.
	# thin_pool_autoextend_threshold = 70
	thin_pool_autoextend_threshold = 100
	# Configuration option activation/thin_pool_autoextend_percent.
	# thin_pool_autoextend_percent = 20
	thin_pool_autoextend_percent = 20
	monitoring = 1
~~~
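For reference, a minimal lvm.conf sketch with autoextend enabled. The threshold/percent values below are only illustrative, not a recommendation from this bug; the admin should choose them based on how much free space the VG actually has:

~~~
# /etc/lvm/lvm.conf -- illustrative values only
activation {
	# Extend the thin pool automatically once it is 70% full...
	thin_pool_autoextend_threshold = 70
	# ...growing it by 20% of its current size each time.
	thin_pool_autoextend_percent = 20
	# dmeventd monitoring must be enabled for autoextend to trigger.
	monitoring = 1
}
~~~

Note that autoextend only helps if the VG has free extents to grow into; in the customer layout shown in comment #1 the VG is almost fully allocated, so autoextend would have nothing to extend with.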
If the PV, the VG and the thinly provisioned LV are all the same size, where do we store the metadata? Don't we need to reserve at least some space?

  --- Physical volume ---
  PV Name               /dev/mapper/mpathb
  VG Name               vg-brick04
  PV Size               36.33 TiB / not usable 0
  Allocatable           yes
  PE Size               1.25 MiB
  Total PE              30478511
  Free PE               1
  Allocated PE          30478510
  PV UUID               CQdZxE-Lj3D-kuRD-zYma-cn8t-3s9P-hCo30s

  --- Volume group ---
  VG Name               vg-brick04
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  7
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               36.33 TiB
  PE Size               1.25 MiB
  Total PE              30478511
  Alloc PE / Size       30478510 / 36.33 TiB
  Free  PE / Size       1 / 1.25 MiB
  VG UUID               PKYtXv-Zk4Z-6b5a-k6Ym-JSO3-5uxW-NyIsZZ

  --- Logical volume ---
  LV Path                /dev/vg-brick04/brick04
  LV Name                brick04
  VG Name                vg-brick04
  LV UUID                2owAeG-qPIC-GFK7-l6yC-F4fF-XYAj-5JMMZl
  LV Write Access        read/write
  LV Creation host, time svg302.sst.rad.lan, 2016-08-11 10:36:52 +0200
  LV Pool name           pool-brick04
  LV Status              available
  # open                 1
  LV Size                36.33 TiB
  Mapped size            97.49%
  Current LE             30478511
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:27

  LV                   VG         Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert PE Ranges
  pool-brick04         vg-brick04 twi-aot--- 36.32t             97.53  5.53                            pool-brick04_tdata:0-30465402
  [pool-brick04_tdata] vg-brick04 Twi-ao---- 36.32t                                                    /dev/mapper/mpathb:0-30465402
  [pool-brick04_tmeta] vg-brick04 ewi-ao---- 16.00g                                                    /dev/mapper/mpathb:30465403-30478509

---------------------

This is why the storage team suggested setting 'thin_pool_autoextend_threshold' to less than 100. Should the exact value be decided by the admin, or by the tool which creates the brick?
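A quick way to see the over-allocation described above (a sketch; the LV/VG names are the ones from the output in this comment):

~~~
# Show the thin pool, its hidden _tdata/_tmeta sub-LVs and the thin LV,
# with allocated data/metadata percentages (-a includes hidden LVs).
lvs -a -o lv_name,lv_size,data_percent,metadata_percent,pool_lv vg-brick04

# Free extents left in the VG for autoextend to grow into (here: almost none).
vgs -o vg_name,vg_size,vg_free vg-brick04
~~~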
This bug is about managing gluster bricks in RHV, not about how RHV consumes them. Sahina - I'm assigning it to you for initial research. I guess the component should be changed too, but I'm not sure to what.
Ramesh - can you take a look?
(In reply to Sachin Raje from comment #1)
> This is why the storage team suggested setting 'thin_pool_autoextend_threshold'
> to less than 100. Should the exact value be decided by the admin, or by the
> tool which creates the brick?

We are not modifying the RHEL default value of 'thin_pool_autoextend_threshold'. But in Gluster brick creation, 16GB is reserved for metadata, so in this case the thin LV is over-allocated by 16GB. We should subtract this from the thin LV size. But how will it work with LVM snapshots? Is there a possibility of facing the same problem while using LVM snapshots?
This has to be fixed in vdsm-gluster, so changing the component accordingly.
I will post a patch to match the size of the thin LV with the size of the thin pool. This guarantees that there is no over-allocation as long as there are no LVM snapshots / Gluster volume snapshots.
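Roughly the idea (a hand-written sketch, not the actual vdsm-gluster patch; the names and sizes below are hypothetical): give the thin LV a virtual size equal to the pool's data size instead of the whole device, so nothing is over-committed by the 16G metadata reservation:

~~~
# Create the thin pool with an explicit 16G metadata LV (sizes hypothetical).
lvcreate -L 1.80T --thinpool pool-brick1 --poolmetadatasize 16G vg-brick1

# Read back the pool's data size and use exactly that as the thin LV's
# virtual size, so Current LE of the pool and the thin LV match.
POOL_BYTES=$(lvs --noheadings --units b -o lv_size vg-brick1/pool-brick1 | tr -d '[:space:]B')
lvcreate -V "${POOL_BYTES}B" --thin -n brick1 vg-brick1/pool-brick1
~~~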
Before the fix: you can see that Current LE of the thin LV is higher than Current LE of the thin pool.

  --- Logical volume ---
  LV Name                pool-brick1
  VG Name                vg-brick1
  LV UUID                Ynk3ou-kAnd-MuZe-d1Qn-owPQ-DHT5-fKwULZ
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 17:06:12 +0530
  LV Pool metadata       pool-brick1_tmeta
  LV Pool data           pool-brick1_tdata
  LV Status              available
  # open                 2
  LV Size                1.80 TiB
  Allocated pool data    0.05%
  Allocated metadata     0.01%
  Current LE             1512651
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:22

  --- Logical volume ---
  LV Path                /dev/vg-brick1/brick1
  LV Name                brick1
  VG Name                vg-brick1
  LV UUID                mH035s-CciV-By4G-RIIi-sHP0-oxaX-ZnFLoF
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 17:06:20 +0530
  LV Pool name           pool-brick1
  LV Status              available
  # open                 1
  LV Size                1.82 TiB
  Mapped size            0.05%
  Current LE             1525759
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:24

After the fix: you can see that Current LE of the thin pool and the thin LV match exactly.

  --- Logical volume ---
  LV Name                pool-brick2
  VG Name                vg-brick2
  LV UUID                nDngM3-tL4q-3Xaf-Azqn-ljHR-psZf-u8FQAr
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 18:14:41 +0530
  LV Pool metadata       pool-brick2_tmeta
  LV Pool data           pool-brick2_tdata
  LV Status              available
  # open                 2
  LV Size                1.80 TiB
  Allocated pool data    0.05%
  Allocated metadata     0.01%
  Current LE             1512651
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:27

  --- Logical volume ---
  LV Path                /dev/vg-brick2/brick2
  LV Name                brick2
  VG Name                vg-brick2
  LV UUID                SQgqXI-n564-6AkX-aJlR-KpM5-ajYD-IqDdRc
  LV Write Access        read/write
  LV Creation host, time headwig.lab.eng.blr.redhat.com, 2017-02-20 18:14:50 +0530
  LV Pool name           pool-brick2
  LV Status              available
  # open                 1
  LV Size                1.80 TiB
  Mapped size            0.05%
  Current LE             1512651
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:29
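The comparison above can be reproduced with a plain lvdisplay/grep (sketch; VG name taken from this comment):

~~~
# Current LE should be identical for the thin pool and the thin LV after the fix.
lvdisplay vg-brick2 | grep -E 'LV Name|LV Path|Current LE'
~~~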
Ramesh - The patch attached to this BZ has been merged for several months. If this is indeed a z-stream bug, it needs to be cloned and the patch needs to be backported. If it isn't, please retarget it to 4.2.0.
moving to
Verified with code:
--------------------------
vdsm-4.20.9-1.git8d0bd46.el7.centos.x86_64

Verified with scenario:
--------------------------
1. I attached a 110GB disk to my host.
2. I created a brick via Host > Storage Devices > Create Brick.
3. The result is that the Logical Extents for the LV and the thin pool are the same.

Moving to VERIFIED.

  --- Logical volume ---
  LV Name                pool-brick2
  VG Name                vg-brick2
  LV UUID                nLGNxf-zl7d-Obh5-Vqbd-ixCb-z4hL-0GhTDw
  LV Write Access        read/write
  LV Creation host, time vm-83-162.scl.lab.tlv.redhat.com, 2017-12-05 14:29:46 +0200
  LV Pool metadata       pool-brick2_tmeta
  LV Pool data           pool-brick2_tdata
  LV Status              available
  # open                 2
  LV Size                <99.50 GiB
  Allocated pool data    0.05%
  Allocated metadata     0.03%     *****IS THIS CORRECT AS EXPECTED ??? *****
  Current LE             407548
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:3

  --- Logical volume ---
  LV Path                /dev/vg-brick2/brick2
  LV Name                brick2
  VG Name                vg-brick2
  LV UUID                lG5xm2-c6el-NOGc-Jgv5-SCYF-CYtS-HoRRnv
  LV Write Access        read/write
  LV Creation host, time vm-83-162.scl.lab.tlv.redhat.com, 2017-12-05 14:29:52 +0200
  LV Pool name           pool-brick2
  LV Status              available
  # open                 1
  LV Size                <99.50 GiB
  Mapped size            0.05%
  Current LE             407548
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489