Bug 1560926 - [RFE] [Volume Scale] Evaluate common thinpool for all bricks in a volume group
Summary: [RFE] [Volume Scale] Evaluate common thinpool for all bricks in a volume group
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: heketi
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Michael Adam
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: OCS-3.11.1-devel-triage-done
Reported: 2018-03-27 09:34 UTC by Manoj Pillai
Modified: 2019-01-23 19:30 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-23 19:30:30 UTC
Embargoed:


Description Manoj Pillai 2018-03-27 09:34:22 UTC
Description of problem:
The brick creation steps in heketi result in a separate thin pool (with a thin LV inside it) being created for each brick.

The steps could instead be changed to create a single thin pool per volume group (as part of device setup, along with pvcreate and vgcreate); brick creation would then only create a thin LV for the brick in the common thin pool. This approach greatly reduces the total number of LVs created when there are a large number of bricks, which in turn could improve scalability.

The main concerns I can think of with the proposed approach are:
1. increased fragmentation in the thin pool from the allocation/deallocation patterns of multiple workloads (probably not a big concern when you are in any case creating dozens of bricks on a device).
2. performance impact of operations, particularly thin device deletion (deletion of a brick LV or snapshot LV), on I/O in progress on other bricks (due to contention on the shared thin pool metadata device).
3. greater impact in terms of the number of volumes affected if you happen to run out of space in the thin pool (e.g. one volume with runaway snapshotting can eat up all free space in the thin pool and affect everyone else); a simple way to monitor pool usage is sketched after this list.
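A minimal monitoring sketch for concern 3, assuming the shared thin pool is named vg_1/tp_1 as in the command examples further below (the 80% threshold is just an illustration):

# report size plus data and metadata usage of the shared thin pool
lvs -o lv_name,lv_size,data_percent,metadata_percent vg_1/tp_1
# print a warning if data usage crosses 80%
lvs --noheadings -o data_percent vg_1/tp_1 | awk '$1+0 > 80 { print "vg_1/tp_1 above 80% data usage" }'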

Given the push for ever higher scalability in CNS, it makes sense to evaluate this alternate brick configuration approach, maybe as an option that can be chosen in heketi.

My main concern would be 2. above, i.e. the performance impact of deletion on ongoing I/O, so that would need to be carefully evaluated. The actual scalability benefit also needs to be tested; a rough way to exercise the deletion case is sketched below.
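A very rough sketch of such a deletion-impact test, assuming the proposed shared thin pool layout (the mount point, sizes and fio parameters are placeholders, and fio is assumed to be available):

# generate sustained random-write load on a filesystem mounted from brick_1
fio --name=bg-load --directory=/mnt/brick_1 --rw=randwrite --bs=4k --size=2g --runtime=300 --time_based &

# while the load runs, time the removal of another thin LV in the same pool
time lvremove -f vg_1/brick_2

# let the background fio job finish
wait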

Comment 2 Manoj Pillai 2018-04-09 09:38:06 UTC
Detailing the current and proposed approaches in terms of the commands to create 5 bricks. This roughly mimics the commands as used in heketi, but the size calculations are not accurate; the focus is on the number of LVs created by each approach.

Note this is on a RHEL 7.5 system:
kernel-3.10.0-858.el7.x86_64
lvm2-2.02.177-4.el7.x86_64
lvm2-libs-2.02.177-4.el7.x86_64

*****************************************************
Current approach in commands:
device setup:
pvcreate --metadatasize=128M --dataalignment=256K /dev/sdc
vgcreate vg_1 /dev/sdc

brick setup:
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_1 -V 10485760K -n brick_1
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_2 -V 10485760K -n brick_2
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_3 -V 10485760K -n brick_3
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_4 -V 10485760K -n brick_4
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_5 -V 10485760K -n brick_5

# lvs -a vg_1
  LV              VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick_1         vg_1 Vwi-a-tz-- 10.00g tp_1        0.00                                   
  brick_2         vg_1 Vwi-a-tz-- 10.00g tp_2        0.00                                   
  brick_3         vg_1 Vwi-a-tz-- 10.00g tp_3        0.00                                   
  brick_4         vg_1 Vwi-a-tz-- 10.00g tp_4        0.00                                   
  brick_5         vg_1 Vwi-a-tz-- 10.00g tp_5        0.00                                   
  [lvol0_pmspare] vg_1 ewi------- 60.00m                                                    
  tp_1            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_1_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_1_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_2            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_2_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_2_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_3            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_3_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_3_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_4            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_4_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_4_tmeta]    vg_1 ewi-ao---- 60.00m                                                    
  tp_5            vg_1 twi-aotz-- 11.00g             0.00   0.08                            
  [tp_5_tdata]    vg_1 Twi-ao---- 11.00g                                                    
  [tp_5_tmeta]    vg_1 ewi-ao---- 60.00m

# lvs -a vg_1 | wc -l
22

********************************************************
Proposed approach in commands:
device setup:
pvcreate --metadatasize=128M --dataalignment=256K /dev/sdc
vgcreate vg_1 /dev/sdc
lvcreate --poolmetadatasize 10485760K -c 256K --extents 100%FREE -T vg_1/tp_1

brick setup:
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick1
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick2
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick3
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick4
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick5

# lvs -a vg_1
  LV              VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick1          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick2          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick3          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick4          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  brick5          vg_1 Vwi-a-tz--  10.00g tp_1        0.00                                   
  [lvol0_pmspare] vg_1 ewi-------  10.00g                                                    
  tp_1            vg_1 twi-aotz-- 910.87g             0.00   0.02                            
  [tp_1_tdata]    vg_1 Twi-ao---- 910.87g                                                    
  [tp_1_tmeta]    vg_1 ewi-ao----  10.00g

# lvs -a vg_1 | wc -l
10

*********************************************************
rtalur:
* Can you please check if I correctly captured the current heketi approach?
* heketi does not seem to skip block zeroing in the thin pool. Is there a reason for that?
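For reference, a sketch of what skipping zeroing could look like when creating the thin pool (same sizes as the current-approach commands above; this is only to illustrate the question, not a recommendation):

# --zero n disables zeroing of provisioned blocks in the thin pool
lvcreate --zero n --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_1 -V 10485760K -n brick_1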

