Description of problem:
Very poor random-write performance on a RAID1 logical volume built on NVMe-oF block devices.

Version-Release number of selected component (if applicable):
Oracle Linux Server release 8.3 (5.4.17-2102.201.3.el8uek.x86_64)
LVM version:     2.03.09(2)-RHEL8 (2020-05-28)
Library version: 1.02.171-RHEL8 (2020-05-28)
Driver version:  4.41.0

How reproducible:
100%

Steps to Reproduce:
You need 2 hosts connected by 10GE (25/100GE or IB also works).

1. Set up the NVMe-oF target host (based on https://blogs.oracle.com/linux/nvme-over-tcp):

modprobe brd rd_nr=2 rd_size=10485760 max_part=1
mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
cd /sys/kernel/config/nvmet/subsystems/nvmet-test
echo 1 | sudo tee -a attr_allow_any_host > /dev/null
sudo mkdir namespaces/1
cd namespaces/1/
echo -n /dev/ram0 > device_path
echo 1 > enable
cd ../..
mkdir namespaces/2
cd namespaces/2
echo -n /dev/ram1 > device_path
echo 1 > enable
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 192.168.1.18 | sudo tee -a addr_traddr > /dev/null
echo tcp | sudo tee -a addr_trtype > /dev/null
echo 4420 | sudo tee -a addr_trsvcid > /dev/null
echo ipv4 | sudo tee -a addr_adrfam > /dev/null
ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/ /sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test

2. Set up the NVMe-oF initiator host (based on https://blogs.oracle.com/linux/nvme-over-tcp):

modprobe nvme-tcp
nvme discover -t tcp -a 192.168.1.18 -s 4420
nvme connect -t tcp -n nvmet-test -a 192.168.1.18 -s 4420
nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme4n1     06f045a09791c576     Linux                                    1         10.74 GB / 10.74 GB        512 B + 0 B      5.4.17-2
/dev/nvme4n2     06f045a09791c576     Linux                                    2         10.74 GB / 10.74 GB        512 B + 0 B      5.4.17-2

3. Run fio against the raw NVMe-oF devices:

fio --name=random-writers --filename=/dev/nvme4n1 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting
...
  write: IOPS=113k, BW=884MiB/s (927MB/s)(29.7GiB/34464msec); 0 zone resets
...

fio --name=random-writers --filename=/dev/nvme4n2 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting
...
  write: IOPS=113k, BW=881MiB/s (924MB/s)(5661MiB/6425msec); 0 zone resets
...

4. Create an LVM RAID1 volume and test it:

pvcreate /dev/nvme4n1 /dev/nvme4n2
vgcreate vgt /dev/nvme4n1 /dev/nvme4n2
lvcreate -l 100%FREE -m 1 -n lvt vgt --nosync

lvs -a -o +devices
  LV   VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  root ol_odb-node01 -wi-ao---- 150.00g                                                     /dev/md126p3(0)
  swap ol_odb-node01 -wi-ao----  64.00g                                                     /dev/md126p3(38400)
  lvt  vgt           Rwi-a-r---   9.99g                                    100.00           lvt_rimage_0(0),lvt_rimage_1(0)

fio --name=random-writers --filename=/dev/vgt/lvt --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting

Actual results:
...
  write: IOPS=3345, BW=26.1MiB/s (27.4MB/s)(15.3GiB/600069msec); 0 zone resets
...

Expected results:
Based on the test results for the raw devices /dev/nvme4n1 and /dev/nvme4n2, I expected roughly 100k IOPS.

Additional info:
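One way to watch how much of the write load actually reaches each raid1 leg while the fio job against /dev/vgt/lvt runs (a sketch only; device names as in the nvme list output above, output column set depends on the sysstat version):

# extended per-device stats in MB, refreshed every second, for the two NVMe-oF legs
iostat -xm nvme4n1 nvme4n2 1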
Not having NVMe-oF access at this point, I tried to reproduce this (randwrite, 8K as you did) on iSCSI LUs but could not reproduce the drastic drop. With your fio job from above, run against two 10GiB LUs in parallel (i.e. fio started on both legs at the same time), I get:

write: IOPS=20.2k, BW=158MiB/s (165MB/s)(325MiB/2065msec); 0 zone resets
  WRITE: bw=158MiB/s (165MB/s), 158MiB/s-158MiB/s (165MB/s-165MB/s), io=325MiB (341MB), run=2065-2065msec

write: IOPS=24.8k, BW=194MiB/s (203MB/s)(389MiB/2008msec); 0 zone resets
  WRITE: bw=194MiB/s (203MB/s), 194MiB/s-194MiB/s (203MB/s-203MB/s), io=389MiB (408MB), run=2008-2008msec

On a 'raid1' LV on top of those two:

write: IOPS=18.0k, BW=140MiB/s (147MB/s)(282MiB/2007msec); 0 zone resets
  WRITE: bw=140MiB/s (147MB/s), 140MiB/s-140MiB/s (147MB/s-147MB/s), io=282MiB (295MB), run=2007-2007msec

Mind that raid1 duplicates writes (which is why I ran the fio job in parallel on the two iSCSI devices) and also stores write-intent bitmap metadata, hence it throttles a bit, which shows nicely in my 'raid1' fio measures above.

It could be that the transport throttles parallel I/O to multiple targets more drastically on NVMe-oF than anticipated. Please try running the 2 fio jobs in parallel on both your targets to tell whether they still hold up.
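Something along these lines (just a sketch of what I mean by "in parallel"; same fio options as your single-device runs, started concurrently from a shell):

fio --name=rw-n1 --filename=/dev/nvme4n1 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting > n1.log 2>&1 &
fio --name=rw-n2 --filename=/dev/nvme4n2 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --direct=1 --numjobs=100 --time_based=1 --runtime=600 --group_reporting > n2.log 2>&1 &
wait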
I ran 2 fio jobs in parallel on both targets and got results like below:

Jobs: 100 (f=100): [w(100)][10.5%][w=443MiB/s][w=56.8k IOPS][eta 08m:58s]
...
Jobs: 100 (f=100): [w(100)][14.6%][w=463MiB/s][w=59.3k IOPS][eta 08m:33s]
...

dstat -rd
--io/total- -dsk/total-
 read  writ| read  writ
   0   117k|   0   917M
   0   116k|   0   899M
   0   115k|   0   897M
   0   115k|   0   896M
   0   116k|   0   904M
   0   116k|   0   907M
   0   109k|   0   854M
...

Then I tried the same against the LVM raid1 LV:

Jobs: 100 (f=100): [w(100)][24.3%][w=25.5MiB/s][w=3262 IOPS][eta 07m:34s]
...

psn -G syscall,wchan

Linux Process Snapper v1.1.0 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, syscall, wchan for 5 seconds... finished.

=== Active Threads =========================================================================

 samples | avg_threads | comm  | state                  | syscall   | wchan
----------------------------------------------------------------------------------
    3348 |       95.66 | (fio) | Disk (Uninterruptible) | io_submit | rq_qos_wait
     142 |        4.06 | (fio) | Disk (Uninterruptible) | io_submit | md_super_wait
...

The kernel stack for the first one (rq_qos_wait) is:

__x64_sys_io_submit()
io_submit_one()
aio_write()
blkdev_write_iter()
blk_finish_plug()
blk_flush_plug_list()
raid1_unplug()
flush_bio_list()
generic_make_request()
nvme_ns_head_make_request()
direct_make_request()
blk_mq_make_request()
__rq_qos_throttle()
wbt_wait()
rq_qos_wait()
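The stack ends in wbt_wait()/rq_qos_wait(), i.e. block-layer writeback throttling (WBT) on the member devices. One way to check whether WBT is what holds the raid1 writes back (a sketch only; it assumes the raid1 legs are nvme4n1/nvme4n2 and that this kernel exposes the wbt_lat_usec queue attribute):

# current WBT target latency in usec (0 means WBT is disabled for that queue)
cat /sys/block/nvme4n1/queue/wbt_lat_usec /sys/block/nvme4n2/queue/wbt_lat_usec
# temporarily disable WBT on both legs, then rerun the fio job against /dev/vgt/lvt
echo 0 > /sys/block/nvme4n1/queue/wbt_lat_usec
echo 0 > /sys/block/nvme4n2/queue/wbt_lat_usec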
Also, the raid1 region size may be rather small given the relatively small LV size, hence throttling because of too many write-intent bitmap updates. Try 'lvconvert -R 512M vgt/lvt' and retry your fio test.
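For example (a sketch; the region_size reporting field should be available in this lvm2 version, adjust the field name if your build reports it differently):

# check the current raid1 region size
lvs -a -o +region_size vgt
# enlarge the region size so far fewer write-intent bitmap updates are needed
lvconvert -R 512M vgt/lvt
lvs -a -o +region_size vgt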