Bug 1265927 - commit "virtio-blk: introduce multiread" introduces a performance regression for 4k block size writes on an NFS backend
commit "virtio-blk: introduce multiread" introduces the performance regressio...
Status: ASSIGNED
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.2
Hardware: x86_64 Linux
Priority: high    Severity: high
Target Milestone: rc
Target Release: 7.4
Assigned To: Kevin Wolf
QA Contact: Yanhui Ma
Keywords: Regression
Depends On:
Blocks: 1401400 1469590 1473046 1288337
Reported: 2015-09-24 03:39 EDT by Yanhui Ma
Modified: 2017-07-28 09:16 EDT
CC: 22 users


Attachments: None
Description Yanhui Ma 2015-09-24 03:39:28 EDT
Description of problem:

In the NFS-backend block performance test, there is a ~26% throughput regression with commit "virtio-blk: introduce multiread" in the 4k block size write scenario.

Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1. Boot up the guest with one data disk on the NFS backend (see the note after step 3 on the assumed NFS mount):

numactl \
    -m 1 /home/qemu-kvm/x86_64-softmmu/qemu-system-x86_64  \
    -S \
    -name 'virt-tests-vm1' \
    -nodefaults \
    -chardev socket,id=hmp_id_humanmonitor1,path=/tmp/monitor-humanmonitor1-20150924-013103-vH7oCHFs,server,nowait \
    -mon chardev=hmp_id_humanmonitor1,mode=readline \
    -chardev socket,id=serial_id_serial1,path=/tmp/serial-serial1-20150924-013103-vH7oCHFs,server,nowait \
    -device isa-serial,chardev=serial_id_serial1 \
    -chardev socket,id=seabioslog_id_20150924-013103-vH7oCHFs,path=/tmp/seabios-20150924-013103-vH7oCHFs,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20150924-013103-vH7oCHFs,iobase=0x402 \
    -device ich9-usb-uhci1,id=usb1,bus=pci.0,addr=0x3 \
    -drive file='/usr/local/autotest/tests/virt/shared/data/images/RHEL-Server-7.2-64.qcow2',index=0,if=none,id=drive-virtio-disk1,media=disk,cache=none,snapshot=off,format=qcow2,aio=threads \
    -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,bootindex=0 \
    -drive file='/usr/local/autotest/tests/virt/shared/data/images/storage2.qcow2',index=2,if=none,id=drive-virtio-disk2,media=disk,cache=none,snapshot=off,format=qcow2,aio=threads \
    -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk2,bootindex=1 \
    -device virtio-net-pci,netdev=id6XIuwv,mac='9a:37:37:37:37:9e',bus=pci.0,addr=0x6,id='idem9I0u' \
    -netdev tap,id=id6XIuwv,vhost=on,vhostfd=27,fd=26 \
    -m 4096 \
    -smp 2,cores=1,threads=1,sockets=2 \
    -cpu 'Westmere' \
    -M pc \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -vnc :0 \
    -vga cirrus \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off   \
    -device sga \
    -enable-kvm
2. In the host:
#numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 8175 MB
node 0 free: 7645 MB
node 1 cpus: 4 5 6 7
node 1 size: 8192 MB
node 1 free: 7503 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

Pin the vhost thread to CPU 4.
Pin the two vCPU threads to CPU 5 and CPU 6, respectively (one possible way to do this is sketched after step 3).

3. In the guest:

#mkfs.xfs /dev/vdb
#mount /dev/vdb /mnt
#fio --rw=write --bs=4k --iodepth=1 --runtime=1m --direct=1 --filename=/mnt/write_4k_1 --name=job1 --ioengine=libaio --thread --group_reporting --numjobs=16 --size=512MB --time_based --output=/tmp/fio_result &> /dev/null
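
Notes on steps 1 and 2 (the commands below are illustrative sketches only, not
taken from the original test setup):

For step 1, the image directory used in the qemu command line is assumed to be
an NFS mount on the host; the actual NFS server and export are not recorded in
this report, so the following is only a hypothetical example:

    # hypothetical NFS server and export; only the mount point is from the report
    mount -t nfs -o rw,vers=3 nfs-server.example.com:/export/images \
        /usr/local/autotest/tests/virt/shared/data/images

For step 2, one possible way to do the pinning (the thread IDs are placeholders;
the vhost thread ID can be found with "ps -eLf | grep vhost" and the vCPU thread
IDs with the monitor command "info cpus"):

    taskset -pc 4 <vhost_tid>    # pin the vhost thread to host CPU 4 (node 1)
    taskset -pc 5 <vcpu0_tid>    # pin the vCPU 0 thread to host CPU 5 (node 1)
    taskset -pc 6 <vcpu1_tid>    # pin the vCPU 1 thread to host CPU 6 (node 1)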

Actual results:

Here are the results: w/ the bad commit vs w/o the bad commit
http://10.66.94.121/results/regression/2015-w39/firstbad.html

Expected results:

no regression

Additional info:
Comment 4 Ademar Reis 2015-09-24 17:14:41 EDT
For reference, this is the commit pointed out as introducing this performance regression:

commit 95f7142abc86a916682bd735aecd90172ffa0d30
Author: Peter Lieven <pl@kamp.de>
Date:   Mon Feb 2 14:52:21 2015 +0100

    virtio-blk: introduce multiread
    
    this patch finally introduces multiread support to virtio-blk. While
    multiwrite support was there for a long time, read support was missing.
    
    The complete merge logic is moved into virtio-blk.c which has
    been the only user of request merging ever since. This is required
    to be able to merge chunks of requests and immediately invoke callbacks
    for those requests. Secondly, this is required to switch to
    direct invocation of coroutines which is planned at a later stage.
    
    The following benchmarks show the performance of running fio with
    4 worker threads on a local ram disk. The numbers show the average
    of 10 test runs after 1 run as warmup phase.
    
                  |        4k        |       64k        |        4k
    MB/s          | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
    --------------+--------+---------+--------+---------+--------+--------
    master        | 1221   | 1187    | 4178   | 4114    | 1745   | 1213
    multiread     | 1829   | 1189    | 4639   | 4110    | 1894   | 1216
    
    Signed-off-by: Peter Lieven <pl@kamp.de>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>


Given that it's marked as a blocker, it should be investigated with high priority. I'm assigning it to Kevin.
Comment 6 Kevin Wolf 2015-09-25 11:11:23 EDT
Quick update from my side: I tried to reproduce this, and at first it indeed
looked as if I could, but now it seems that my results are too unstable to be
significant. I'll have to work some more on my benchmarking environment.

I also compared the old and new versions of the code and couldn't find any
suspicious differences. The principle of what the new code does isn't much
different from the old, so it might be an implementation detail that makes the
difference.

The only relatively big change I could see is that requests are now completed
individually instead of waiting for all requests of the same batch to complete.
This should be an important fix for latency (though QE's benchmark results
contradict that), but it might actually hurt throughput if we now get many small
completion events instead of a few events that each cover multiple requests.
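
One way to check this from the outside would be to compare the notification
rate between the two builds while fio runs, for example by counting KVM exits
on the host and virtio-blk interrupts in the guest (the QEMU PID below is a
placeholder, and the exact virtio interrupt line name depends on the guest):

    # on the host: record KVM exit statistics for the QEMU process during the run
    perf kvm stat record -p <qemu_pid> sleep 60
    perf kvm stat report

    # in the guest: compare the interrupt counts of the data disk's virtio queues
    # before and after the fio run
    grep virtio /proc/interrupts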
Comment 7 Kevin Wolf 2015-09-28 10:12:02 EDT
After a few more benchmarks, it looks to me as if the "batch completion" of
the whole multiwrite array plays a role for performance indeed. I tried a hack
that doesn't complete virtio-blk requests until there is no request in flight
any more. This did help the throughput in this test case, but obviously
randomly delaying completion of requests (which must hurt in terms of latency)
is not something we want to do as a fix.

Another thing I noticed is that virtio-blk dataplane notifies the guest of
completions only in a BH, which results in kind of a batch completion for
requests that complete more or less at the same time, without introducing
arbitrary delays. This seems much more reasonable to do for the non-dataplane
case as well, and I did see some difference, but it didn't help quite as much.

Anyway, for this specific bug report, QE has the setup that the problem was
originally found on. So I would ask you to give the following scratch build
a try (it adopts the dataplane approach to notification for non-dataplane):

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9894116
Comment 8 Yanhui Ma 2015-09-29 00:34:23 EDT
(In reply to Kevin Wolf from comment #7)
> After a few more benchmarks, it looks to me as if the "batch completion" of
> the whole multiwrite array plays a role for performance indeed. I tried a hack
> that doesn't complete virtio-blk requests until there is no request in flight
> any more. This did help the throughput in this test case, but obviously
> randomly delaying completion of requests (which must hurt in terms of latency)
> is not something we want to do as a fix.
> 
> Another thing I noticed is that virtio-blk dataplane notifies the guest of
> completions only in a BH, which results in kind of a batch completion for
> requests that complete more or less at the same time, without introducing
> arbitrary delays. This seems much more reasonable to do for the non-dataplane
> case as well, and I did see some difference, but it didn't help quite as much.
> 
> Anyway, for this specific bug report, QE has the setup that the problem was
> originally found on. So I would ask you to give the following scratch build
> a try (it adopts the dataplane approach to notification for non-dataplane):
> 
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9894116

Here are the results with the above scratch build:

### kvm-userspace-ver : qemu-kvm-rhev-2.3.0-26.el7.bz1265927_bh.x86_64
### kvm_version : 3.10.0-306.el7.x86_64
### guest-kernel-ver :3.10.0-306.el7.x86_64
Category:write
blocksize:4k

Iodepth| BW(MB/S)|  IOPS| Latency(ms)| Host_CPU|  BW/CPU|  KVM_Exits|  Util%
      1|    44.55| 11403|        1.40|    18.59|    2.40|    3255257|  99.91
      8|    43.00| 11006|       11.63|    16.84|    2.55|    3391894|  99.90



Based on the above results, the regression still exists.
Comment 9 Kevin Wolf 2015-09-29 05:54:20 EDT
Looks like no change at all on your setup. Then I'm afraid we won't be able to
fix this in 7.2.

Can you try the following build to get some more data points? It has the patch
that enforces batch completion, but which I said won't be suitable for
inclusion. If this doesn't work either, I might have to find some entirely
different cause for the regression.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9897363
Comment 12 Yanhui Ma 2015-10-08 05:14:19 EDT
(In reply to Kevin Wolf from comment #9)
> Looks like no change at all on your setup. Then I'm afraid we won't be able to
> fix this in 7.2.
> 
> Can you try the following build to get some more data points? It has the patch
> that enforces batch completion, but which I said won't be suitable for
> inclusion. If this doesn't work either, I might have to find some entirely
> different cause for the regression.
> 
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9897363

The host is busy. Once it is idle, I will update results asap.
Comment 13 Yanhui Ma 2015-10-26 04:19:18 EDT
(In reply to Yanhui Ma from comment #12)
> (In reply to Kevin Wolf from comment #9)
> > Looks like no change at all on your setup. Then I'm afraid we won't be able
> > to fix this in 7.2.
> > 
> > Can you try the following build to get some more data points? It has the
> > patch that enforces batch completion, but which I said won't be suitable for
> > inclusion. If this doesn't work either, I might have to find some entirely
> > different cause for the regression.
> > 
> > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9897363
> 
> The host is busy. Once it is idle, I will update results asap.

Hi Kevin Wolf,
I have been busy with other things, sorry for the delay.
The build (http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9897363) has expired. Could you provide a new one? Do I still need to test it?
Comment 15 Kevin Wolf 2015-11-19 09:22:35 EST
(In reply to Yanhui Ma from comment #13)
> I have been busy with other things, sorry for the delay.
> The build (http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9897363) has
> expired. Could you provide a new one? Do I still need to test it?

Sorry for taking so much time before providing a new build. Please try this one:
https://brewweb.devel.redhat.com/taskinfo?taskID=10124453
Comment 16 Yanhui Ma 2015-11-20 05:14:09 EST
(In reply to Kevin Wolf from comment #15)
> (In reply to Yanhui Ma from comment #13)
> > I have been busy with other things, sorry for the delay.
> > The build (http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9897363) has
> > expired. Could you provide a new one? Do I still need to test it?
> 
> Sorry for taking so much time before providing a new build. Please try this
> one:
> https://brewweb.devel.redhat.com/taskinfo?taskID=10124453

Here are the results with the above scratch build:

### kvm-userspace-ver : qemu-kvm-rhev-2.3.0-26.el7.bz1265927_gc1.x86_64
### kvm_version : 3.10.0-327.el7.x86_64
### guest-kernel-ver :3.10.0-306.el7.x86_64
Category:write
Block_size| Iodepth| Threads| BW(MB/S)|  IOPS| Latency(ms)| Host_CPU| BW/CPU| KVM_Exits| Util%
         4|       1|      16|    35.06|  8975|        1.78|     8.67|   4.04|    787472| 96.75
         4|       8|      16|    52.20| 13364|        9.58|     8.23|   6.34|    659240| 98.32



The regression still exists with iodepth 1, but BW with iodepth 8 is OK.
Comment 18 Ademar Reis 2016-01-15 07:27:50 EST
(In reply to Yanhui Ma from comment #0)

> Here are the results: w/ the bad commit vs w/o the bad commit
> http://10.66.94.121/results/regression/2015-w39/firstbad.html
> 

Can you please attach these results to the BZ? We need to compare the previous and current data to understand the regression, but the link above appears to be dead.
Comment 19 Yanhui Ma 2016-01-18 20:37:38 EST
(In reply to Ademar Reis from comment #18)
> (In reply to Yanhui Ma from comment #0)
> 
> > Here are the results: w/ the bad commit vs w/o the bad commit
> > http://10.66.94.121/results/regression/2015-w39/firstbad.html
> > 
> 
> Can you please attach these results to the BZ? We need to compare the
> previous and current data to understand the regression, but the link above
> appears to be dead.

There is something wrong with server 10.66.94.121. Once it is working again, I will attach the results.
Comment 20 Yanhui Ma 2016-01-19 03:32:59 EST
(In reply to Ademar Reis from comment #18)
> (In reply to Yanhui Ma from comment #0)
> 
> > Here are the results: w/ the bad commit vs w/o the bad commit
> > http://10.66.94.121/results/regression/2015-w39/firstbad.html
> > 
> 
> Can you please attach these results to the BZ? We need to compare the
> previous and current data to understand the regression, but the link above
> appears to be dead.

Now you can visit the link.
Comment 21 Ademar Reis 2016-08-07 14:57:57 EDT
There have been several changes upstream and we believe the performance has changed (hopefully improved). Please test again with a more recent version of QEMU. Deferring to 7.4 for now.
Comment 27 Yanhui Ma 2017-06-13 01:42:54 EDT
The results in comment 25 show that the regression still exists, so I am moving the BZ back to ASSIGNED. Please correct me if anything is wrong.
