Bug 1973181

Summary: Very poor random-write performance on a RAID 1 logical volume based on NVMe-oF block devices
Product: [Community] LVM and device-mapper
Component: lvm2
Sub component: Mirroring and RAID
Reporter: vbponomarev
Assignee: Heinz Mauelshagen <heinzm>
QA Contact: cluster-qe <cluster-qe>
Status: NEW
Severity: high
Priority: unspecified
Version: 2.02.185
Hardware: x86_64
OS: Linux
CC: agk, heinzm, jbrassow, msnitzer, ncroxon, prajnoha, vbponomarev, xni, zkabelac

Description vbponomarev 2021-06-17 11:25:08 UTC
Description of problem:
Very poor random-write performance on a RAID 1 logical volume based on NVMe-oF block devices.

Version-Release number of selected component (if applicable):
Oracle Linux Server release 8.3 (5.4.17-2102.201.3.el8uek.x86_64)
LVM version:     2.03.09(2)-RHEL8 (2020-05-28)
Library version: 1.02.171-RHEL8 (2020-05-28)
Driver version:  4.41.0


How reproducible:
100%

Steps to Reproduce:
You need two hosts connected by 10GbE (25/100GbE or InfiniBand should also work).
1. Set up the NVMe-oF target host (based on https://blogs.oracle.com/linux/nvme-over-tcp)
modprobe brd rd_nr=2 rd_size=10485760 max_part=1
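# (rd_size above is in KiB: 10485760 KiB = 10 GiB per RAM disk; rd_nr=2 gives /dev/ram0 and /dev/ram1)
# the nvmet and nvmet_tcp modules provide /sys/kernel/config/nvmet and the tcp transport used below
modprobe nvmet
modprobe nvmet-tcp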
mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
cd /sys/kernel/config/nvmet/subsystems/nvmet-test
echo 1 |sudo tee -a attr_allow_any_host > /dev/null
sudo mkdir namespaces/1
cd namespaces/1/
echo -n /dev/ram0  > device_path
echo 1 > enable
cd ../..
mkdir namespaces/2
cd namespaces/2
echo -n /dev/ram1 > device_path
echo 1 > enable
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 192.168.1.18 |sudo tee -a addr_traddr > /dev/null
echo tcp|sudo tee -a addr_trtype > /dev/null
echo 4420|sudo tee -a addr_trsvcid > /dev/null
echo ipv4|sudo tee -a addr_adrfam > /dev/null
ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/ /sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test
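
A quick sanity check that the subsystem is exported on the port (assuming the layout created above):

ls /sys/kernel/config/nvmet/ports/1/subsystems/
dmesg | grep -i nvmet    # should show the tcp port being enabled on 192.168.1.18:4420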

2. Set up the NVMe-oF initiator host (based on https://blogs.oracle.com/linux/nvme-over-tcp)
modprobe nvme-tcp
nvme discover -t tcp -a 192.168.1.18 -s 4420
nvme connect -t tcp -n nvmet-test -a 192.168.1.18 -s 4420
nvme list

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme4n1     06f045a09791c576     Linux                                    1          10.74  GB /  10.74  GB    512   B +  0 B   5.4.17-2
/dev/nvme4n2     06f045a09791c576     Linux                                    2          10.74  GB /  10.74  GB    512   B +  0 B   5.4.17-2
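
For reference, the transport paths can be inspected and the connection torn down again with:

nvme list-subsys
nvme disconnect -n nvmet-test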

3. Run the fio tests on the NVMe-oF devices
fio --name=random-writers --filename=/dev/nvme4n1 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting
....
write: IOPS=113k, BW=884MiB/s (927MB/s)(29.7GiB/34464msec); 0 zone resets
...
fio --name=random-writers --filename=/dev/nvme4n2 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting
...
write: IOPS=113k, BW=881MiB/s (924MB/s)(5661MiB/6425msec); 0 zone resets
...
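
(As a sanity check on those baseline numbers: 113k IOPS x 8 KiB ≈ 883 MiB/s, which matches the ~884 MiB/s and ~881 MiB/s reported per device.)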
4. Create an LVM RAID 1 LV and test it
pvcreate /dev/nvme4n1 /dev/nvme4n2
vgcreate vgt /dev/nvme4n1 /dev/nvme4n2
lvcreate -l 100%FREE -m 1 -n lvt vgt --nosync
lvs -a -o +devices
 LV                VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                         
  root              ol_odb-node01 -wi-ao---- 150.00g                                                     /dev/md126p3(0)                 
  swap              ol_odb-node01 -wi-ao----  64.00g                                                     /dev/md126p3(38400)             
  lvt               vgt           Rwi-a-r---   9.99g                                    100.00           lvt_rimage_0(0),lvt_rimage_1(0) 

fio --name=random-writers --filename=/dev/vgt/lvt --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting

Actual results:
...
write: IOPS=3345, BW=26.1MiB/s (27.4MB/s)(15.3GiB/600069msec); 0 zone resets
...

Expected results:
Based on the test results for /dev/nvme4n1 and /dev/nvme4n2, I expected to see about 100K IOPS.

Additional info:

Comment 1 Heinz Mauelshagen 2021-06-17 14:59:13 UTC
Not having NVMe-oF access at this point, I tried to reproduce this (randwrite, 8k as you did) on iSCSI LUs but failed to see the problem.

With your fio job as above, I get the following on two 10GiB LUs accessed in parallel (i.e. fio started on both legs in parallel):
  write: IOPS=20.2k, BW=158MiB/s (165MB/s)(325MiB/2065msec); 0 zone resets
  WRITE: bw=158MiB/s (165MB/s), 158MiB/s-158MiB/s (165MB/s-165MB/s), io=325MiB (341MB), run=2065-2065msec
  write: IOPS=24.8k, BW=194MiB/s (203MB/s)(389MiB/2008msec); 0 zone resets
  WRITE: bw=194MiB/s (203MB/s), 194MiB/s-194MiB/s (203MB/s-203MB/s), io=389MiB (408MB), run=2008-2008msec

On a 'raid1' on top of those aforementioned two:
  write: IOPS=18.0k, BW=140MiB/s (147MB/s)(282MiB/2007msec); 0 zone resets
  WRITE: bw=140MiB/s (147MB/s), 140MiB/s-140MiB/s (147MB/s-147MB/s), io=282MiB (295MB), run=2007-2007msec

Mind that raid1 duplicates writes (that is why I ran the fio job in parallel on the two iSCSI devices) and also stores write-intent bitmap metadata, hence it throttles a bit, which shows nicely in my 'raid1' fio measurements above.

It could be that the transport throttles more drastically on NVMe-oF with parallel I/O to multiple targets than anticipated; please try running the 2 fio jobs in parallel on both of your targets to tell whether they still hold up.
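
A minimal way to start them in parallel (assuming the same device names and job parameters as above):

fio --name=random-writers-n1 --filename=/dev/nvme4n1 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting &
fio --name=random-writers-n2 --filename=/dev/nvme4n2 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting &
wait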

Comment 2 vbponomarev 2021-06-21 15:19:23 UTC
I ran 2 fio jobs in parallel on both targets and got the results below:

Jobs: 100 (f=100): [w(100)][10.5%][w=443MiB/s][w=56.8k IOPS][eta 08m:58s]
...
Jobs: 100 (f=100): [w(100)][14.6%][w=463MiB/s][w=59.3k IOPS][eta 08m:33s]
...

dstat -rd
--io/total- -dsk/total-
 read  writ| read  writ
   0   117k|   0   917M
   0   116k|   0   899M
   0   115k|   0   897M
   0   115k|   0   896M
   0   116k|   0   904M
   0   116k|   0   907M
   0   109k|   0   854M
....

Then I tried the LVM raid1:

Jobs: 100 (f=100): [w(100)][24.3%][w=25.5MiB/s][w=3262 IOPS][eta 07m:34s]
...

psn -G syscall,wchan
Linux Process Snapper v1.1.0 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, syscall, wchan for 5 seconds...
finished.
=== Active Threads ===================================================================================================
 samples | avg_threads | comm                       | state                  | syscall         | wchan
----------------------------------------------------------------------------------------------------------------------
    3348 |       95.66 | (fio)                      | Disk (Uninterruptible) | io_submit       | rq_qos_wait
     142 |        4.06 | (fio)                      | Disk (Uninterruptible) | io_submit       | md_super_wait
...

Kernel stack for the first is:
__x64_sys_io_submit()
io_submit_one()
aio_write()
blkdev_write_iter()
blk_finish_plug()
blk_flush_plug_list()
raid1_unplug()
flush_bio_list()
generic_make_request()
nvme_ns_head_make_request()
direct_make_request()
blk_mq_make_request()
__rq_qos_throttle()
wbt_wait()
rq_qos_wait()
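
The rq_qos_wait()/wbt_wait() frames suggest block-layer writeback throttling (wbt) is gating these writes. One possible experiment (a sketch; the wbt_lat_usec attribute may not be exposed on every NVMe block node) is to inspect and temporarily disable wbt on the underlying devices, then rerun fio:

grep . /sys/block/nvme*/queue/wbt_lat_usec 2>/dev/null
for f in /sys/block/nvme*/queue/wbt_lat_usec; do echo 0 > "$f"; done    # 0 disables wbt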

Comment 3 Heinz Mauelshagen 2021-11-03 23:47:15 UTC
Also, the raid1 region size may be rather small given the relatively small LV size, hence throttling because of too many write-intent bitmap updates.
Try 'lvconvert -R 512M vgt/lvt' and retry your fio test.
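
For example, to compare the region size before and after the conversion (a sketch using lvs reporting fields):

lvs -a -o lv_name,lv_size,region_size,devices vgt
lvconvert -R 512M vgt/lvt
lvs -a -o lv_name,lv_size,region_size,devices vgt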