Description of problem:

The brick creation steps in heketi result in a thin pool (and a thin lv within that thin pool) being created for each brick. The steps could potentially be changed to create a single thinpool in a volume group (as part of device setup, along with pvcreate and vgcreate); brick creation could then just create a thin lv for the brick in the common thinpool. This approach greatly reduces the total number of lvs that get created when there are a large number of bricks, which in turn potentially improves scalability.

The main concerns I can think of with the proposed approach are:

1. increased fragmentation in the thin pool from allocation/deallocation patterns of multiple workloads (probably not a big concern when you're anyway creating dozens of bricks on a device).
2. performance impact of operations, particularly thin device deletion (deletion of brick lv or snapshot lv), on I/O in progress on other bricks (due to contention on shared thinpool metadata device).
3. greater impact in terms of number of volumes affected if you happen to run out of space in the thinpool (e.g. one volume with runaway snapshotting can eat up all free space in the thin pool and affect everyone else).

Given the push for ever higher scalability in CNS, it makes sense to evaluate this alternate brick configuration approach, maybe as an option that can be chosen in heketi. My main concern would be 2. above, i.e. performance impact of deletion on ongoing I/O, so that would need to be carefully evaluated. Also need to test actual benefit to scalability.
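To put rough numbers on the reduction in lvs, here is a minimal sketch that counts the LVs `lvs -a` would list for N bricks under each layout. It assumes the default LVM behaviour of keeping one pool metadata spare (`[lvol0_pmspare]`) per volume group; the function names are made up for illustration and are not part of heketi.

```shell
#!/bin/sh
# One thin pool per brick: every brick adds 4 LVs
# (thin pool, its _tdata, its _tmeta, and the thin LV itself),
# plus a single [lvol0_pmspare] for the whole VG (assumed default).
lvs_per_brick_pool() {
    echo $(( 4 * $1 + 1 ))
}

# One shared thin pool: pool + _tdata + _tmeta + pmspare are created once,
# and every brick adds only its own thin LV.
lvs_shared_pool() {
    echo $(( $1 + 4 ))
}

for n in 5 50 500; do
    printf '%4d bricks: %5d LVs (pool per brick) vs %4d LVs (shared pool)\n' \
        "$n" "$(lvs_per_brick_pool "$n")" "$(lvs_shared_pool "$n")"
done
```

For 5 bricks this gives 21 and 9 LVs, which matches the `lvs -a vg_1 | wc -l` counts of 22 and 10 in the listings further down once the header line is included. Snapshots are not counted here; each one would add one more thin LV in either layout.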
Detailing the current and proposed approach in terms of commands to create 5 bricks. This roughly mimics the commands as used in heketi, but the size calculations are not accurate; the focus is more on the number of LVs created by the two approaches.

Note this is on a RHEL 7.5 system:
kernel-3.10.0-858.el7.x86_64
lvm2-2.02.177-4.el7.x86_64
lvm2-libs-2.02.177-4.el7.x86_64

*****************************************************
Current approach in commands:

device setup:
pvcreate --metadatasize=128M --dataalignment=256K /dev/sdc
vgcreate vg_1 /dev/sdc

brick setup:
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_1 -V 10485760K -n brick_1
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_2 -V 10485760K -n brick_2
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_3 -V 10485760K -n brick_3
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_4 -V 10485760K -n brick_4
lvcreate --poolmetadatasize 57672K -c 256K -L 11534336K -T vg_1/tp_5 -V 10485760K -n brick_5

# lvs -a vg_1
  LV              VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick_1         vg_1 Vwi-a-tz-- 10.00g tp_1        0.00
  brick_2         vg_1 Vwi-a-tz-- 10.00g tp_2        0.00
  brick_3         vg_1 Vwi-a-tz-- 10.00g tp_3        0.00
  brick_4         vg_1 Vwi-a-tz-- 10.00g tp_4        0.00
  brick_5         vg_1 Vwi-a-tz-- 10.00g tp_5        0.00
  [lvol0_pmspare] vg_1 ewi------- 60.00m
  tp_1            vg_1 twi-aotz-- 11.00g             0.00   0.08
  [tp_1_tdata]    vg_1 Twi-ao---- 11.00g
  [tp_1_tmeta]    vg_1 ewi-ao---- 60.00m
  tp_2            vg_1 twi-aotz-- 11.00g             0.00   0.08
  [tp_2_tdata]    vg_1 Twi-ao---- 11.00g
  [tp_2_tmeta]    vg_1 ewi-ao---- 60.00m
  tp_3            vg_1 twi-aotz-- 11.00g             0.00   0.08
  [tp_3_tdata]    vg_1 Twi-ao---- 11.00g
  [tp_3_tmeta]    vg_1 ewi-ao---- 60.00m
  tp_4            vg_1 twi-aotz-- 11.00g             0.00   0.08
  [tp_4_tdata]    vg_1 Twi-ao---- 11.00g
  [tp_4_tmeta]    vg_1 ewi-ao---- 60.00m
  tp_5            vg_1 twi-aotz-- 11.00g             0.00   0.08
  [tp_5_tdata]    vg_1 Twi-ao---- 11.00g
  [tp_5_tmeta]    vg_1 ewi-ao---- 60.00m

# lvs -a vg_1 | wc -l
22

********************************************************
Proposed approach in commands:

device setup:
pvcreate --metadatasize=128M --dataalignment=256K /dev/sdc
vgcreate vg_1 /dev/sdc
lvcreate --poolmetadatasize 10485760K -c 256K --extents 100%FREE -T vg_1/tp_1

brick setup:
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick1
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick2
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick3
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick4
lvcreate --thin -V 10485760K vg_1/tp_1 -n brick5

# lvs -a vg_1
  LV              VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick1          vg_1 Vwi-a-tz--  10.00g tp_1        0.00
  brick2          vg_1 Vwi-a-tz--  10.00g tp_1        0.00
  brick3          vg_1 Vwi-a-tz--  10.00g tp_1        0.00
  brick4          vg_1 Vwi-a-tz--  10.00g tp_1        0.00
  brick5          vg_1 Vwi-a-tz--  10.00g tp_1        0.00
  [lvol0_pmspare] vg_1 ewi-------  10.00g
  tp_1            vg_1 twi-aotz-- 910.87g             0.00   0.02
  [tp_1_tdata]    vg_1 Twi-ao---- 910.87g
  [tp_1_tmeta]    vg_1 ewi-ao----  10.00g

# lvs -a vg_1 | wc -l
10

*********************************************************

rtalur:
* Can you please check if I correctly captured current heketi approach?
* heketi does not seem to skip block zeroing in the thin pool. Is there a reason for that?
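For testing the scalability benefit with more than 5 bricks, the brick setup commands of the proposed approach can be generated in a loop. This is only a sketch: `gen_brick_cmds` is a hypothetical helper, not something heketi provides, and its defaults (vg_1, tp_1, 10485760K) are taken from the example above. It prints the commands without running them.

```shell
#!/bin/sh
# Print (do not run) the lvcreate commands that create COUNT thin LVs
# in one shared thin pool.
# Usage: gen_brick_cmds COUNT [VG] [POOL] [SIZE]
gen_brick_cmds() {
    count=$1
    vg=${2:-vg_1}          # volume group from the example above
    pool=${3:-tp_1}        # shared thin pool created during device setup
    size=${4:-10485760K}   # virtual size of each brick
    i=1
    while [ "$i" -le "$count" ]; do
        echo "lvcreate --thin -V $size $vg/$pool -n brick$i"
        i=$(( i + 1 ))
    done
}

gen_brick_cmds 5
```

After reviewing the output, it can be piped to `sh` as root on a test system where the pool already exists.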