Bug 1560926

Summary: [RFE] [Volume Scale] Evaluate common thinpool for all bricks in a volume group
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manoj Pillai <mpillai>
Component: heketi
Assignee: Michael Adam <madam>
Status: CLOSED WONTFIX
QA Contact: Rahul Hinduja <rhinduja>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: ekuric, hchiramm, jmulligan, jstrunk, psuriset, rhs-bugs, rtalur, shberry, storage-qa-internal
Target Milestone: ---
Keywords: FutureFeature, Performance
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-23 19:30:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1641915

Description Manoj Pillai 2018-03-27 09:34:22 UTC
Description of problem:
The brick creation steps in heketi result in a thin pool (and a thin LV within that thin pool) being created for each brick.

The steps could potentially be changed to create a single thin pool in a volume group (as part of device setup, along with pvcreate and vgcreate); brick creation would then just create a thin LV for the brick in the common thin pool. This approach greatly reduces the total number of LVs created when there are a large number of bricks, which in turn potentially improves scalability.
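
A rough way to see the difference (based on the lvs output in comment 2, where each per-brick thin pool shows up as a pool LV plus hidden tdata and tmeta LVs, and the VG carries one hidden pmspare LV):

  current approach:  N bricks -> N thin LVs + N pool LVs + N tdata + N tmeta + 1 pmspare = 4N + 1 LV entries
  proposed approach: N bricks -> N thin LVs + 1 pool LV  + 1 tdata + 1 tmeta + 1 pmspare = N + 4 LV entries

For the 5-brick example in comment 2, that is 21 entries vs. 9.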

The main concerns I can think of with the proposed approach are:
1. Increased fragmentation in the thin pool from the allocation/deallocation patterns of multiple workloads (probably not a big concern when dozens of bricks are being created on a device anyway).
2. Performance impact of operations, particularly thin device deletion (deletion of a brick LV or snapshot LV), on I/O in progress on other bricks, due to contention on the shared thin pool metadata device.
3. Greater impact, in terms of the number of volumes affected, if the thin pool runs out of space (e.g. one volume with runaway snapshotting can eat up all free space in the thin pool and affect everyone else); a simple monitoring sketch follows this list.
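
A minimal way to keep an eye on concern 3 with a shared pool (a sketch, assuming the pool is named vg_1/tp_1 as in the proposed-approach commands in comment 2):

# report data and metadata usage of the shared thin pool
lvs -o lv_name,data_percent,metadata_percent vg_1/tp_1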

Given the push for ever higher scalability in CNS, it makes sense to evaluate this alternative brick configuration approach, perhaps as an option that can be chosen in heketi.

My main concern is 2. above, i.e. the performance impact of deletion on ongoing I/O, so that would need to be carefully evaluated; a rough test sketch follows.
The actual benefit to scalability also needs to be tested.
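
One possible way to measure the deletion impact (a rough sketch, assuming fio is available; device/LV names follow the proposed-approach commands in comment 2):

# steady-state random writes on one brick LV in the shared pool
fio --name=ongoing --filename=/dev/vg_1/brick1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=16 --direct=1 --time_based --runtime=300 \
    --write_lat_log=ongoing

# while fio is running, delete another thin LV in the same pool and compare
# latency before/during/after the removal
lvremove -f vg_1/brick5

The same run repeated against the current per-brick-thinpool layout would give the baseline.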

Comment 2 Manoj Pillai 2018-04-09 09:38:06 UTC
Detailing the current and proposed approaches in terms of the commands to create 5 bricks. This roughly mimics the commands as used by heketi, but the size calculations are not accurate; the focus is on the number of LVs created by the two approaches.

Note this is on a RHEL 7.5 system:
kernel-3.10.0-858.el7.x86_64
lvm2-2.02.177-4.el7.x86_64
lvm2-libs-2.02.177-4.el7.x86_64

*****************************************************
Current approach in commands:
device setup:
pvcreate --metadatasize=128M --dataalignment=256K /dev/sdc
vgcreate vg_1 /dev/sdc

brick setup:
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_1 -V 10485760K -n brick_1
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_2 -V 10485760K -n brick_2
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_3 -V 10485760K -n brick_3
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_4 -V 10485760K -n brick_4
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_5 -V 10485760K -n brick_5

# lvs -a vg_1:
  LV              VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick_1         vg_1 Vwi-a-tz-- 10.00g tp_1        0.00                                   
  brick_2         vg_1 Vwi-a-tz-- 10.00g tp_2        0.00                                   
  brick_3         vg_1 Vwi-a-tz-- 10.00g tp_3        0.00                                   
  brick_4         vg_1 Vwi-a-tz-- 10.00g tp_4        0.00                                   
  brick_5         vg_1 Vwi-a-tz-- 10.00g tp_5        0.00                                   
  [lvol0_pmspare] vg_1 ewi------- 60.00m                                                    
  tp_1            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_1_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_1_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_2            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_2_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_2_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_3            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_3_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_3_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_4            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_4_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_4_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_5            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_5_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_5_tmeta]    vg_1 ewi-ao---- 60.00m

# lvs -a vg_1 | wc -l
22

********************************************************
Proposed approach in commands:
device setup:
pvcreate --metadatasize=128M --dataalignment=256K /dev/sdc
vgcreate vg_1 /dev/sdc
lvcreate --poolmetadatasize 10485760K -c 256K --extents 100%FREE -T vg_1/tp_1

brick setup:
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick1
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick2
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick3
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick4
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick5

# lvs -a vg_1
  LV              VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick1          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick2          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick3          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick4          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick5          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  [lvol0_pmspare] vg_1 ewi-------  10.00g                                                    
  tp_1            vg_1 twi-aotz-- 910.87g             0.00   0.02                            
  [tp_1_tdata]    vg_1 Twi-ao---- 910.87g                                                    
  [tp_1_tmeta]    vg_1 ewi-ao----  10.00g

# lvs -a vg_1 | wc -l
10

*********************************************************
rtalur:
* Can you please check whether I have correctly captured the current heketi approach?
* heketi does not seem to skip block zeroing in the thin pool. Is there a reason for that?
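
For reference, a sketch of how zeroing of provisioned blocks could be skipped (flag usage only, reusing the pool name from the commands above; whether heketi should do this is the open question):

# at pool creation time
lvcreate --poolmetadatasize 10485760K -c 256K --extents 100%FREE --zero n -T vg_1/tp_1

# or on an existing thin pool
lvchange --zero n vg_1/tp_1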