Bug 1104748

Summary: 48% reduction in IO performance for KVM guest, io=native
Product: Red Hat Enterprise Linux 7
Reporter: Andrew Theurer <atheurer>
Component: qemu-kvm
Assignee: Fam Zheng <famz>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Priority: medium
Version: 7.0
CC: atheurer, famz, hhuang, juzhang, knoel, michen, pbonzini, rbalakri, sharpwiner, sluo, virt-maint, wquan, xigao
Target Milestone: rc
Target Release: 7.0
Keywords: Regression
Hardware: x86_64
OS: Linux
Fixed In Version: qemu-kvm-1.5.3-78.el7
Doc Type: Bug Fix
Cloned As: 1161535
Last Closed: 2015-03-05 08:09:55 UTC
Type: Bug
Bug Blocks: 1141705

Description Andrew Theurer 2014-06-04 15:15:05 UTC
Description of problem:

When using the io=native option in a KVM guest configuration, 4k random write throughput is 48% lower on a RHEL7 host compared to a RHEL6 host.

Version-Release number of selected component (if applicable):

kernel: 3.10.0-123.el7.x86_64
qemu: qemu-kvm-1.5.3-60.el7.x86_64

How reproducible: 

Easily with a single system and a ramdisk

Steps to Reproduce:

1. Configure linux ramdisk size for 1GB

2. Create a KVM VM with 4 vcpus and 2 extra virtio-blk test disks.  The first test disk uses /dev/ram0 as the device with io=threads; the second disk also uses /dev/ram0 but with io=native.  (A sketch of a matching qemu-kvm invocation follows the fio job file below.)

3. Run fio in the guest with the following job file:

[global]
bs=4k
ioengine=libaio
iodepth=32
direct=1
time_based=1
runtime=60
[job1]
rw=randwrite
filename=/dev/xyz   <-- match this to the block device for the io model you want to test (vdb or vdc)
size=896M
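
For reference, here is a minimal sketch of the setup above as host commands. The exact qemu-kvm invocation is not given in this report, so the guest image name, memory size, and other options are assumptions for illustration only:

# 1 GB Linux ramdisk (rd_size is in KiB)
modprobe brd rd_nr=1 rd_size=1048576

# Guest with 4 vcpus and two extra virtio-blk test disks backed by /dev/ram0,
# one using the thread pool (aio=threads) and one using Linux AIO (aio=native);
# cache=none so that aio=native actually uses Linux AIO with O_DIRECT
qemu-kvm -m 4096 -smp 4 \
  -drive file=rhel-guest.img,if=virtio,cache=none \
  -drive file=/dev/ram0,if=virtio,format=raw,cache=none,aio=threads \
  -drive file=/dev/ram0,if=virtio,format=raw,cache=none,aio=native

# Inside the guest, run fio with the job file above:
fio job.fio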

Actual results:

4K IOPS:
         threads  native
RHEL65     93050  107584 
RHEL7      91814   55743

Expected results:

RHEL7 results for io=native should be equal to RHEL65

Comment 2 Xiaomei Gao 2014-06-05 02:39:49 UTC
FYI, this issue looks like Bug 966398. Please see the following comment:
https://bugzilla.redhat.com/show_bug.cgi?id=966398#c17

From the above results, there is ~20%-60% degradation in the Windows guest and ~10%-40% degradation in the Linux guest.

Storage backend: ramdisk
aio: native

Comment 3 Andrew Theurer 2014-06-05 16:53:27 UTC
Xiaomei, this may be the same issue.  I did not realize when you were asking for FusionIO that it was for the same thing I was observing :)

I can reproduce with SSD (FusionIO):

4K rand-write IOPS:

         threads  native
RHEL65     93967   76046 
RHEL7      53864   35326

There is still a significant drop for io=native from RHEL6.5->7

I am not sure why we are seeing a drop for io=threads with the FusionIO, but not with the ramdisk.  Perhaps there are regressions in the FusionIO driver as well.  To focus on any qemu issues, it may be easier to just use the ramdisk first.  I did get a perf report for the RHEL7 io=native case:

     7.72%  qemu-kvm  qemu-kvm                  [.] 0x00000000000df224
     5.59%  qemu-kvm  [kernel.kallsyms]         [k] memcpy 
     4.60%  qemu-kvm  qemu-kvm                  [.] phys_page_find 
     2.39%  qemu-kvm  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave                                                  
     2.20%  qemu-kvm  qemu-kvm                  [.] lduw_phys                                                               
     2.09%  qemu-kvm  [kernel.kallsyms]         [k] fget_light                                                              
     1.92%  qemu-kvm  [kernel.kallsyms]         [k] _raw_spin_unlock_irqrestore                                             
     1.88%  qemu-kvm  [kernel.kallsyms]         [k] put_compound_page                                                       
     1.82%  qemu-kvm  [kernel.kallsyms]         [k] do_blockdev_direct_IO                                                   
     1.34%  qemu-kvm  libc-2.17.so              [.] _int_malloc                                                             
     1.18%  qemu-kvm  [kernel.kallsyms]         [k] vmx_vcpu_run                                                            
     1.18%  qemu-kvm  [kernel.kallsyms]         [k] vcpu_enter_guest                                                        
     1.11%  qemu-kvm  [kernel.kallsyms]         [k] fput                                                                    
     1.07%  qemu-kvm  [kernel.kallsyms]         [k] __radix_tree_lookup                                                     
     1.01%  qemu-kvm  libglib-2.0.so.0.3600.3   [.] g_private_get_impl                                                      
     1.00%  qemu-kvm  [kernel.kallsyms]         [k] __get_page_tail                                                         
     0.94%  qemu-kvm  libc-2.17.so              [.] _int_free                                                               
     0.86%  qemu-kvm  [vdso]                    [.] 0x0000000000000d02                        



I am not sure what the top line is (and I have -debuginfo installed for qemu-kvm)


We can probably mark this bug as a duplicate of Bug 966398.

Comment 4 Andrew Theurer 2014-06-05 17:12:05 UTC
Forgot to ask: Xiaomei, do you see this problem with iodepth=32?  My tests use only iodepth=32

Comment 5 Andrew Theurer 2014-06-05 19:43:11 UTC
I forgot to mention: the perf report is from the main qemu thread only, which is where all of the IO submissions are made.
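
For anyone reproducing that profile: one way to restrict perf to the main QEMU thread is to record against its TID, which for the main thread equals the process PID. The exact command used for the report above is not recorded in this bug, so treat the following as a sketch:

# main thread TID == process PID; pgrep -o picks the oldest matching process
QEMU_PID=$(pgrep -o qemu-kvm)
perf record -g -t "$QEMU_PID" -- sleep 60   # record that thread while fio runs in the guest
perf report --stdio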

Comment 6 Fam Zheng 2014-06-06 03:35:45 UTC
(In reply to Andrew Theurer from comment #3)
> Xiaomei, this may be the same issue.  I did not realize when you were asking
> for FusionIO that it was for the same thing I was observing :)
> 
> I can reproduce with SSD (FusionIO):
> 
> 4K rand-write IOPS:
> 
>          threads  native
> RHEL65     93967   76046 
> RHEL7      53864   35326
> 

Andrew,

Did you use host block device as the guest drive or qcow2 image over file system? And what's the filesystem type?

Fam

Comment 8 Andrew Theurer 2014-06-06 12:17:38 UTC
Fam, this is with a block device in the host.  No file-system is used.  I will get a perf report for RHEL65 host to see if it looks any different from the RHEL7 one.

Comment 9 Andrew Theurer 2014-06-10 14:28:50 UTC
Here is a comparison of strace between RHEL6.5 and RHEL7 host:

RHEL6.5:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 40.49    0.076373           0   1033499           io_submit
 26.07    0.049171           0    206167           ioctl
 18.50    0.034887           0     87103           select
 10.52    0.019837           0   1041535           write
  3.20    0.006040           0    209502     73828 read
  0.53    0.001000           7       138         2 futex
  0.41    0.000768           0     32826           io_getevents
  0.17    0.000322         161         2           brk
  0.07    0.000138           0      4066           timer_settime
  0.03    0.000048           0      3988           rt_sigaction
  0.02    0.000040           0      4291           timer_gettime
  0.00    0.000000           0        61           poll
  0.00    0.000000           0        10           rt_sigprocmask
  0.00    0.000000           0         3           clone
------ ----------- ----------- --------- --------- ----------------
100.00    0.188624               2623191     73830 total



RHEL7:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 41.41    1.254259           2    605299           io_submit
 30.15    0.913113           1    922925           write
  7.86    0.238163           7     32689           io_getevents
  7.71    0.233438           4     58046           poll
  7.52    0.227880           1    180574     41646 read
  4.94    0.149772           2     64732           ioctl
  0.13    0.004068           1      3638           rt_sigaction
  0.13    0.004033           1      3818           timer_settime
  0.13    0.004014           1      3819           timer_gettime
  0.00    0.000117           3        41         1 futex
  0.00    0.000071          24         3           clone
  0.00    0.000052          52         1           restart_syscall
  0.00    0.000033           6         6           rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00    3.029013               1875591     41647 total


Note that the per-call cost of io_submit on RHEL6.5 is ~28x lower (~0.07 vs ~2.07 usecs/call, computed from the seconds and calls columns above).
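
For reference, a per-thread syscall summary like the two above can be collected by attaching strace in counting mode to the main QEMU thread while fio runs in the guest. The exact command used here is not given in the report, so this is only a sketch:

QEMU_PID=$(pgrep -o qemu-kvm)
# -c prints the summary table when strace detaches (SIGINT sent by timeout)
timeout -s INT 60 strace -c -p "$QEMU_PID"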

Comment 10 Andrew Theurer 2014-06-12 21:26:12 UTC
I have built the RHEL7 kernel on a RHEL6.5 host and got similar performance to default RHEL6.5 (actually slightly higher), so I don't think this is a problem with the RHEL7 kernel itself.

I have also tried to install and/or build qemu across releases (the RHEL6.5 version on RHEL7, and the RHEL7 version on RHEL6.5) to see if we can narrow this down to a specific qemu version, but those attempts to install or build have failed.

Comment 11 Xiaomei Gao 2014-06-13 01:35:19 UTC
(In reply to Andrew Theurer from comment #10)
> I have built the RHEL7 kernel on a RHEL6.5 host and got similar performance
> to default RHEL6.5 (actually slightly higher), so I don't think this is a
> problem with the RHEL7 kernel itself.

Yes, it is a qemu-kvm issue.
- linux guest: (block_size=4k, iodepth=1, write)
   rhel 6 qemu + rhel 7 host  : 447.62 MB/s
   rhel 7 qemu + rhel 7 host  : 239.43 MB/s
- windows guest: (block_size=4k, iodepth=1, write)
   rhel 6 qemu + rhel 7 host  : 91.12 MB/s
   rhel 7 qemu + rhel 7 host  : 37.24 MB/s

> I have also tried to install and/or build qemu across releases (the RHEL6.5
> version on RHEL7, and the RHEL7 version on RHEL6.5) to see if we can narrow
> this down to a specific qemu version, but those attempts to install or build
> have failed.

We have tried to bisect the issue and found that the following patch is one factor introducing the degradation (a generic sketch of the bisect procedure is included after the commit below).

commit c90caf25e2b6945ae13560476a5ecd7992e9f945
Author: Paolo Bonzini <pbonzini>
Date:   Fri Feb 24 08:39:02 2012 +0100

    linux-aio: use event notifiers
    
    Since linux-aio already uses an eventfd, converting it to use the
    EventNotifier-based API simplifies the code even though it is not
    meant to be portable.
    
    Reviewed-by: Anthony Liguori <anthony>
    Signed-off-by: Paolo Bonzini <pbonzini>
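
The exact bisect commands are not included in this report; a generic sketch of bisecting qemu.git for this kind of regression (the good/bad revisions below are placeholders, not the ones actually used) might look like:

cd qemu
git bisect start
git bisect bad v1.5.3                   # regression present
git bisect good <known-good-revision>   # e.g. a release where io=native still performs well
# at each step: build qemu, boot the guest, run the fio job, then mark
git bisect good    # or: git bisect bad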

Comment 12 Andrew Theurer 2014-06-13 15:51:00 UTC
I tried qemu.git v2.0.0, and the regression is gone.  Also tried qemu.git v1.5.3 and the regression is present.  I'll try 1.5.3 again with that commit backed out to see if the performance is restored.

Comment 13 Andrew Theurer 2014-06-13 16:28:02 UTC
I have tested with qemu.git v1.5.3 with the above commit backed out, and performance is much better (119k IOPS).  Unless there's a great need for this commit in qemu, can we update the qemu-kvm-1.5.3 package with this commit reverted?
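
For reference, backing the commit out of a v1.5.3 tree for a test build looks roughly like the following; the clone URL and configure flags are assumptions (not the actual RHEL build options), and the revert may need manual conflict resolution:

git clone git://git.qemu.org/qemu.git
cd qemu
git checkout v1.5.3
git revert c90caf25e2b6945ae13560476a5ecd7992e9f945
./configure --target-list=x86_64-softmmu
make -j"$(nproc)"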

Comment 14 Paolo Bonzini 2014-06-17 14:26:40 UTC
That's really weird, but yes, we should simply revert this commit if it's already fixed upstream.

Comment 18 Miroslav Rezanina 2014-11-10 09:29:51 UTC
Fix included in qemu-kvm-1.5.3-78.el7

Comment 21 Fam Zheng 2014-11-24 03:34:35 UTC
Yes, I think it could be verified. And please open a new bug for the 15% gap.

Comment 22 Xiaomei Gao 2014-11-24 08:36:20 UTC
(In reply to Fam Zheng from comment #21)
> Yes, I think it could be verified. And please open a new bug for the 15% gap.

Okay, we have opened new Bug 1167210, so we are setting this bug to VERIFIED status.

Comment 24 Xiaomei Gao 2015-01-15 08:55:36 UTC
(In reply to Xiaomei Gao from comment #23)
> From the above results, there is ~5%-45% host cpu usage rise, but almost no
> cpu usage.

Sorry, that should read: "there is ~5%-45% host cpu usage rise, but almost no IOPS difference".

Comment 26 errata-xmlrpc 2015-03-05 08:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0349.html