Bug 705082

Summary: qemu-kvm takes excessive CPU resources due to _spin_lock_irqsave on a 64-CPU machine

Product: Red Hat Enterprise Linux 6
Reporter: Moran Goldboim <mgoldboi>
Component: kernel
Assignee: Karen Noel <knoel>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Priority: high
Version: 6.1
CC: juzhang, knoel, kzhang, mathieu-acct, tburke, toracat, wnefal+redhatbugzilla, xfu, xnaveira, ykaul, zliu
Target Milestone: rc
Keywords: TestBlocker
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.32-153.el6
Doc Type: Bug Fix
Doc Text:
Cause: The current posix-timers code has a scalability issue due to a global spin lock.
Consequence: KVM/QEMU workloads hit a single spinlock (idr_lock) in the posix-timers code very hard on a 48-core machine. Even on a 16-CPU machine (2x4x2), a single test can show 98% of CPU time spent in ticket_spin_lock, called from lock_timer.
Fix: Switch the timer lookup to RCU; the IDR is already RCU-ready.
Result: Benchmark on a 2x4x2 machine with 16 processes calling timer_gettime(): before, real 1m18.669s; after, real 0m3.296s.
Last Closed: 2011-12-06 13:31:41 UTC

Description Moran Goldboim 2011-05-16 15:13:43 UTC
Description of problem:
When running VMs on a 64-CPU machine, each VM takes ~35% CPU while idle, consuming all CPU resources.
Running perf report gives the following output:
# Overhead        Command        Shared Object                                     Symbol
# ........  .............  ...................  .........................................
#
    78.70%       qemu-kvm  [kernel.kallsyms]    [k] _spin_lock_irqsave
                 |
                 --- _spin_lock_irqsave
                    |          
                    |--99.30%-- lock_timer
                    |          |          
                    |          |--64.52%-- sys_timer_gettime
                    |          |          system_call_fastpath
                    |          |          __timer_gettime_new
                    |          |          
                    |           --35.48%-- sys_timer_settime
                    |                     system_call_fastpath
                    |                     __timer_settime_new
                     --0.70%-- [...]

     1.69%       qemu-kvm  [kvm_intel]          [k] vmx_vcpu_run
                 |
                 --- vmx_vcpu_run
                    |          
                    |--97.48%-- ftrace_raw_output_kvm_inj_exception
                    |          ftrace_raw_event_kvm_mmio
                    |          vfs_ioctl
                    |          do_vfs_ioctl
                    |          sys_ioctl
                    |          system_call_fastpath
                    |          __GI___ioctl
                    |          
                    |--2.31%-- __GI___ioctl
                     --0.20%-- [...]

     0.66%       qemu-kvm  [kernel.kallsyms]    [k] fget_light
                 |
                 --- fget_light
                    |          
                    |--76.00%-- do_select
                    |          core_sys_select
                    |          sys_select
                    |          system_call_fastpath
                    |          __GI_select
                    |          |          
                    |           --100.00%-- (nil)
                    |          
                    |--10.80%-- sys_read
                    |          system_call_fastpath
                    |          __read
                    |          
                    |--5.16%-- sys_write
                    |          system_call_fastpath
                    |          0x3f35a0e50d


Version-Release number of selected component (if applicable):
2.6.32-131.0.1.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Avi Kivity 2011-05-16 15:31:53 UTC
Upstream patch: https://lkml.org/lkml/2011/4/5/235

Comment 3 Moran Goldboim 2011-05-16 15:36:05 UTC
command line:
qemu     81662     1 32 06:17 ?        01:43:03 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Conroe -enable-kvm -m 256 -smp 1,sockets=1,cores=1,threads=1 -name R6A-113 -uuid dde328fe-d6ae-437a-b570-c50acba21953 -smbios type=1,manufacturer=Red Hat,product=RHEL,version=6Server-6.1.0.1.el6,serial=4C4C4544-004A-5A10-805A-C2C04F514E31_,uuid=dde328fe-d6ae-437a-b570-c50acba21953 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/R6A-113.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2011-05-16T10:17:25 -boot c -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive file=/rhev/data-center/06b29e0f-8c22-431e-a6f1-75989a582c52/2b27b725-9063-4a27-899a-3ec8d02c1ade/images/3e858fde-f85b-4b0e-8f60-3e523e7ffb52/f5fdc830-9e42-49e2-8680-73fbd92097c9,if=none,id=drive-virtio-disk0,format=qcow2,serial=0e-8f60-3e523e7ffb52,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=143 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1b:4a:16:9f:7a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/R6A-113.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -usb -device usb-tablet,id=input0 -vnc 0:118,password -k en-us -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0

Comment 4 Dor Laor 2011-05-17 08:30:30 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: The current posix-timers code has a scalability issue due to a global spin lock.
Consequence: KVM/QEMU workloads hit a single spinlock (idr_lock) in the posix-timers code very hard on a 48-core machine. Even on a 16-CPU machine (2x4x2), a single test can show 98% of CPU time spent in ticket_spin_lock, called from lock_timer.
Fix: Switch the timer lookup to RCU; the IDR is already RCU-ready.
Result: Benchmark on a 2x4x2 machine with 16 processes calling timer_gettime(): before, real 1m18.669s; after, real 0m3.296s.

Comment 6 RHEL Product and Program Management 2011-05-17 08:39:50 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 7 Avi Kivity 2011-05-17 11:56:47 UTC
(In reply to comment #4)

> Fix: Switching to RCU is quite easy, IDR being already RCU ready.
> Result: Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime().
> Before : real    1m18.669s 
> After : real    0m3.296s

This doesn't seem to belong in a technical note.

Comment 8 juzhang 2011-06-07 10:32:23 UTC
Ran 6 VMs on a host which has 64 CPUs and tested several times; _spin_lock_irqsave seems to be just around 0.27%.

Hi, Avi/Moran,

   Would you please tell me which part I missed?


Host kernel version
#uname -r
2.6.32-156.el6.x86_64
KVM version
qemu-kvm-0.12.1.2-2.163.el6.x86_64



#perf record -a -g -f
#perf report -g


-      0.27%         qemu-kvm  [kernel.kallsyms]                   [k] _spin_lock_irqsave
   - _spin_lock_irqsave
      + 15.82% add_wait_queue
      + 15.60% remove_wait_queue
      + 13.50% lock_timer_base
      + 8.54% lock_hrtimer_base
      ................
      + 6.01% eventfd_poll
      - 3.50% lock_timer
         + 59.73% sys_timer_gettime
         + 40.27% sys_timer_settime


Command line:
#/usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm  -smp 2,sockets=2,cores=1,threads=1 -m 4G -boot c -drive file=/root/zhangjunyi/RHEL-Server-6.1-64-virtio.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,boot=on,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -net none -vnc :9 -qmp tcp:0:4444,server,nowait -monitor stdio -rtc base=localtime -no-acpi

# /usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm -smp 1,sockets=1,cores=1,threads=1 -m 2G -uuid 9e6f04cf-2ad7-45aa-9333-2d2ee26570c6 -boot c -drive file=/root/zhang/RHEL-Server-6.1-64-virtio.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,boot=on,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -net none -vnc :11 -qmp tcp:0:5556,server,nowait -monitor stdio -rtc base=localtime -no-acpi

..........................

Comment 17 Avi Kivity 2011-06-13 15:00:05 UTC
Write a test program that executes timer_gettime() in a loop.  Run multiple instances, observe high spinlock contention.

With the patch, the same number of instances will not cause contention.

Example program:


#include <signal.h>
#include <stdio.h>   /* perror() */
#include <time.h>
#include <stdlib.h>

int main(int ac, char **av)
{
    timer_t timerid;
    struct itimerspec its;
    int r;

    /* Create one timer; looking it up is what takes idr_lock. */
    r = timer_create(CLOCK_REALTIME, NULL, &timerid);
    if (r == -1) {
	perror("timer_create");
	exit(1);
    }
    /* Spin on timer_gettime(); each call goes through lock_timer(). */
    while (1) {
	r = timer_gettime(timerid, &its);
	if (r == -1) {
	    perror("timer_gettime");
	    exit(1);
	}
    }
}


Note: this will probably need a large number of CPUs and instances.

Comment 18 juzhang 2011-06-14 03:04:49 UTC
> 
> #include <signal.h>
> #include <time.h>
> #include <stdlib.h>
> 
> int main(int ac, char **av)
> {
>     timer_t timerid;
>     struct itimerspec its;
>     int r;
> 
>     r = timer_create(CLOCK_REALTIME, NULL, &timerid);
>     if (r == -1) {
>  perror("timer_create");
>  exit(1);
>     }
>     while (1) {
>  r = timer_gettime(timerid, &its);
>  if (r == -1) {
>      perror("timer_gettime");
>      exit(1);
>  }
>     }
> }
> 
> 
> Note: will probably need a large number of cpus and instances.
#gcc -o test -lrt test.c

Launched 400 instances on an AMD host which has 48 CPUs, tried 4 times.
_spin_lock_irqsave is around 4.7%.
 

#ps -aux | grep test |wc -l
401

Hi, Avi,

   Would you please tell me which part I missed?

Best Regards,
Junyi

Comment 19 juzhang 2011-06-14 06:15:33 UTC
Launched 500 instances on an Intel host which has 64 CPUs.
_spin_lock_irqsave is around 6.5%.

Comment 22 Avi Kivity 2011-06-14 11:43:43 UTC
(In reply to comment #18)
> Hi,Avi
> 
>    Would you please tell me which part I missed?
> 

I don't think you missed anything.  It probably happens on specific hardware combinations.

Please try the reporter's machine.

Comment 25 juzhang 2011-06-15 08:09:04 UTC
Many thanks to mgoldboi for lending me his host.

Reproduced with qemu-kvm-0.12.1.2-2.164.el6.x86_64; marked qa_ack+.

#gcc -o test -lrt test.c

Launched 500 instances on an Intel L7555 host which has 64 CPUs.
 

Results:
_spin_lock_irqsave is around 96.4%.

Please note:
1. This issue can only be reproduced on specific hardware; Intel(R) Xeon(R) CPU L7555 @ 1.87GHz reproduces it:

model name	: Intel(R) Xeon(R) CPU           L7555  @ 1.87GHz

2. I also tried the X7550, Intel W3520, and AMD 6172; these platforms cannot reproduce the issue.

Comment 27 Aristeu Rozanski 2011-08-11 19:48:08 UTC
Patch(es) available on kernel-2.6.32-153.el6

Comment 30 FuXiangChun 2011-10-12 11:31:16 UTC
Thanks to mgoldboi for lending the Intel L7555 host to me.

Reproduced on kernel-2.6.32-128.el6.x86_64.

After the 3 steps below, actual result:
about 47.7% _spin_lock_irqsave [kernel.kallsyms]

Verified on kernel-2.6.32-202.el6.x86_64.

After the 3 steps below, actual result:
3.9% _spin_lock_irqsave [kernel.kallsyms]

Steps to reproduce:
1. Boot 500 guests
kernel128]# ps aux|grep qemu|wc -l
500
2. gcc -o test -lrt test.c
   ./test
3. perf top


Based on the above testing results, this bug has been fixed.

Comment 31 juzhang 2011-10-13 01:51:55 UTC
According to comment 30, setting this issue as verified.

Comment 32 errata-xmlrpc 2011-12-06 13:31:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html