Bug 705082 - qemu-kvm takes lots of CPU resources due to _spin_lock_irqsave on a 64 cpu machine
Summary: qemu-kvm takes lots of CPU resources due to _spin_lock_irqsave on a 64 cpu machine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
: ---
Assignee: Karen Noel
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-16 15:13 UTC by Moran Goldboim
Modified: 2014-10-29 15:37 UTC
CC List: 11 users

Fixed In Version: kernel-2.6.32-153.el6
Doc Type: Bug Fix
Doc Text:
Cause: The current posix-timers code has a scalability issue: timer lookups are serialized on a single spin lock (idr_lock).
Effect: KVM/QEMU workloads hit this single spinlock in the posix-timers code very hard on a 48-core machine. Even on a 16-CPU machine (2x4x2), a single test can show 98% of CPU time spent in ticket_spin_lock, called from lock_timer.
Fix: Switching to RCU is quite easy, IDR being already RCU ready.
Result: Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime(). Before : real 1m18.669s After : real 0m3.296s
Clone Of:
Environment:
Last Closed: 2011-12-06 13:31:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1530 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update 2011-12-06 01:45:35 UTC

Description Moran Goldboim 2011-05-16 15:13:43 UTC
Description of problem:
When running VMs on a 64-CPU machine, each VM takes ~35% CPU while idle, consuming all CPU resources.
Running perf report gives the following output:
# Overhead        Command        Shared Object                                     Symbol
# ........  .............  ...................  .........................................
#
    78.70%       qemu-kvm  [kernel.kallsyms]    [k] _spin_lock_irqsave
                 |
                 --- _spin_lock_irqsave
                    |          
                    |--99.30%-- lock_timer
                    |          |          
                    |          |--64.52%-- sys_timer_gettime
                    |          |          system_call_fastpath
                    |          |          __timer_gettime_new
                    |          |          
                    |           --35.48%-- sys_timer_settime
                    |                     system_call_fastpath
                    |                     __timer_settime_new
                     --0.70%-- [...]

     1.69%       qemu-kvm  [kvm_intel]          [k] vmx_vcpu_run
                 |
                 --- vmx_vcpu_run
                    |          
                    |--97.48%-- ftrace_raw_output_kvm_inj_exception
                    |          ftrace_raw_event_kvm_mmio
                    |          vfs_ioctl
                    |          do_vfs_ioctl
                    |          sys_ioctl
                    |          system_call_fastpath
                    |          __GI___ioctl
                    |          
                    |--2.31%-- __GI___ioctl
                     --0.20%-- [...]

     0.66%       qemu-kvm  [kernel.kallsyms]    [k] fget_light
                 |
                 --- fget_light
                    |          
                    |--76.00%-- do_select
                    |          core_sys_select
                    |          sys_select
                    |          system_call_fastpath
                    |          __GI_select
                    |          |          
                    |           --100.00%-- (nil)
                    |          
                    |--10.80%-- sys_read
                    |          system_call_fastpath
                    |          __read
                    |          
                    |--5.16%-- sys_write
                    |          system_call_fastpath
                    |          0x3f35a0e50d


Version-Release number of selected component (if applicable):
2.6.32-131.0.1.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Avi Kivity 2011-05-16 15:31:53 UTC
Upstream patch: https://lkml.org/lkml/2011/4/5/235

Comment 3 Moran Goldboim 2011-05-16 15:36:05 UTC
command line:
qemu     81662     1 32 06:17 ?        01:43:03 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Conroe -enable-kvm -m 256 -smp 1,sockets=1,cores=1,threads=1 -name R6A-113 -uuid dde328fe-d6ae-437a-b570-c50acba21953 -smbios type=1,manufacturer=Red Hat,product=RHEL,version=6Server-6.1.0.1.el6,serial=4C4C4544-004A-5A10-805A-C2C04F514E31_,uuid=dde328fe-d6ae-437a-b570-c50acba21953 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/R6A-113.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2011-05-16T10:17:25 -boot c -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive file=/rhev/data-center/06b29e0f-8c22-431e-a6f1-75989a582c52/2b27b725-9063-4a27-899a-3ec8d02c1ade/images/3e858fde-f85b-4b0e-8f60-3e523e7ffb52/f5fdc830-9e42-49e2-8680-73fbd92097c9,if=none,id=drive-virtio-disk0,format=qcow2,serial=0e-8f60-3e523e7ffb52,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=143 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1b:4a:16:9f:7a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/R6A-113.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -usb -device usb-tablet,id=input0 -vnc 0:118,password -k en-us -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0

Comment 4 Dor Laor 2011-05-17 08:30:30 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: The current posix-timers code has a scalability issue: timer lookups are serialized on a single spin lock (idr_lock).
Effect: KVM/QEMU workloads hit this single spinlock in the posix-timers code very hard on a 48-core machine. Even on a 16-CPU machine (2x4x2), a single test can show 98% of CPU time spent in ticket_spin_lock, called from lock_timer.
Fix: Switching to RCU is quite easy, IDR being already RCU ready.
Result: Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime(). Before : real    1m18.669s 
After : real    0m3.296s
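
For illustration, here is a rough sketch of the lock_timer() change this note describes (see the upstream patch linked in comment 2). It is a simplified reconstruction rather than the literal committed code; the names idr_lock, posix_timers_id, it_lock and it_signal follow the 2.6.32-era posix-timers code.

/* Before: every timer_gettime()/timer_settime() serializes on the global idr_lock. */
static struct k_itimer *lock_timer_spinlock(timer_t timer_id, unsigned long *flags)
{
	struct k_itimer *timr;

	spin_lock_irqsave(&idr_lock, *flags);
	timr = idr_find(&posix_timers_id, (int)timer_id);
	if (timr) {
		spin_lock(&timr->it_lock);		/* per-timer lock */
		if (timr->it_signal == current->signal) {
			spin_unlock(&idr_lock);		/* return with it_lock held, irqs off */
			return timr;
		}
		spin_unlock(&timr->it_lock);
	}
	spin_unlock_irqrestore(&idr_lock, *flags);
	return NULL;
}

/* After: the IDR lookup runs under rcu_read_lock(), so concurrent callers no
 * longer contend on a global lock; only the per-timer it_lock is taken.
 * This requires that struct k_itimer be freed via RCU. */
static struct k_itimer *lock_timer_rcu(timer_t timer_id, unsigned long *flags)
{
	struct k_itimer *timr;

	rcu_read_lock();
	timr = idr_find(&posix_timers_id, (int)timer_id);
	if (timr) {
		spin_lock_irqsave(&timr->it_lock, *flags);
		if (timr->it_signal == current->signal) {
			rcu_read_unlock();
			return timr;
		}
		spin_unlock_irqrestore(&timr->it_lock, *flags);
	}
	rcu_read_unlock();
	return NULL;
}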

Comment 6 RHEL Program Management 2011-05-17 08:39:50 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 7 Avi Kivity 2011-05-17 11:56:47 UTC
(In reply to comment #4)

> Fix: Switching to RCU is quite easy, IDR being already RCU ready.
> Result: Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime().
> Before : real    1m18.669s 
> After : real    0m3.296s

This doesn't seem to belong in a technical note.

Comment 8 juzhang 2011-06-07 10:32:23 UTC
Ran 6 VMs on a host that has 64 CPUs and tested several times; _spin_lock_irqsave is only around 0.27%.

Hi, Avi/Moran,

   Would you please tell me which part I missed?


Host kernel version
#uname -r
2.6.32-156.el6.x86_64
KVM version
qemu-kvm-0.12.1.2-2.163.el6.x86_64



#perf record -a -g -f
#perf report -g


-      0.27%         qemu-kvm  [kernel.kallsyms]                   [k] _spin_lock_irqsave
   - _spin_lock_irqsave
      + 15.82% add_wait_queue
      + 15.60% remove_wait_queue
      + 13.50% lock_timer_base
      + 8.54% lock_hrtimer_base
      ................
      + 6.01% eventfd_poll
      - 3.50% lock_timer
         + 59.73% sys_timer_gettime
         + 40.27% sys_timer_settime


Command lines:
#/usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm  -smp 2,sockets=2,cores=1,threads=1 -m 4G -boot c -drive file=/root/zhangjunyi/RHEL-Server-6.1-64-virtio.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,boot=on,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -net none -vnc :9 -qmp tcp:0:4444,server,nowait -monitor stdio -rtc base=localtime -no-acpi

#/usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm  -smp 2,sockets=2,cores=1,threads=1 -m 4G -boot c -drive file=/root/zhangjunyi/RHEL-Server-6.1-64-virtio.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,boot=on,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -net none -vnc :9 -qmp tcp:0:4444,server,nowait -monitor stdio -rtc base=localtime -no-acpi

# /usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm -smp 1,sockets=1,cores=1,threads=1 -m 2G -uuid 9e6f04cf-2ad7-45aa-9333-2d2ee26570c6 -boot c -drive file=/root/zhang/RHEL-Server-6.1-64-virtio.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,boot=on,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -net none -vnc :11 -qmp tcp:0:5556,server,nowait -monitor stdio -rtc base=localtime -no-acpi

..........................

Comment 17 Avi Kivity 2011-06-13 15:00:05 UTC
Write a test program that executes timer_gettime() in a loop.  Run multiple instances, observe high spinlock contention.

With the patch, the same number of instances will not cause contention.

Example program:


#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int ac, char **av)
{
    timer_t timerid;
    struct itimerspec its;
    int r;

    /* One POSIX timer per process; lock_timer() is taken on every
       timer_gettime() call against it. */
    r = timer_create(CLOCK_REALTIME, NULL, &timerid);
    if (r == -1) {
	perror("timer_create");
	exit(1);
    }
    /* Hammer timer_gettime() forever to expose the idr_lock contention. */
    while (1) {
	r = timer_gettime(timerid, &its);
	if (r == -1) {
	    perror("timer_gettime");
	    exit(1);
	}
    }
}


Note: will probably need a large number of cpus and instances.
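
A minimal launcher sketch, assuming the loop above has been compiled as ./test (as in comment 18 below); it simply forks the requested number of instances so a large run can be started in one step:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int ac, char **av)
{
    int n = (ac > 1) ? atoi(av[1]) : 500;	/* instance count, e.g. ./launcher 500 */
    int i;

    for (i = 0; i < n; i++) {
	pid_t pid = fork();
	if (pid == 0) {
	    execl("./test", "./test", (char *)NULL);	/* one timer_gettime() spinner */
	    perror("execl");
	    _exit(1);
	}
	if (pid < 0) {
	    perror("fork");
	    exit(1);
	}
    }
    /* The spinners never exit; wait here so they remain children of this process. */
    for (i = 0; i < n; i++)
	wait(NULL);
    return 0;
}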

Comment 18 juzhang 2011-06-14 03:04:49 UTC
> 
> #include <signal.h>
> #include <time.h>
> #include <stdlib.h>
> 
> int main(int ac, char **av)
> {
>     timer_t timerid;
>     struct itimerspec its;
>     int r;
> 
>     r = timer_create(CLOCK_REALTIME, NULL, &timerid);
>     if (r == -1) {
>  perror("timer_create");
>  exit(1);
>     }
>     while (1) {
>  r = timer_gettime(timerid, &its);
>  if (r == -1) {
>      perror("timer_gettime");
>      exit(1);
>  }
>     }
> }
> 
> 
> Note: will probably need a large number of cpus and instances.
#gcc -o test -lrt test.c

Launched 400 instances on an AMD host with 48 CPUs, tried 4 times.
_spin_lock_irqsave is around 4.7%.
 

#ps -aux | grep test |wc -l
401

Hi, Avi,

   Would you please tell me which part I missed?

Best Regards,
Junyi

Comment 19 juzhang 2011-06-14 06:15:33 UTC
Launched 500 instances on an Intel host with 64 CPUs.
_spin_lock_irqsave is around 6.5%.

Comment 22 Avi Kivity 2011-06-14 11:43:43 UTC
(In reply to comment #18)
> Hi,Avi
> 
>    Would you please tell me which part I missed?
> 

I don't think you missed anything.  It probably happens on specific hardware combinations.

Try the reporter's machine please.

Comment 25 juzhang 2011-06-15 08:09:04 UTC
Many thanks to mgoldboi for lending me his host.

Reproduced with qemu-kvm-0.12.1.2-2.164.el6.x86_64; marked qa_ack+.

#gcc -o test -lrt test.c

Launched 500 instances on an Intel L7555 host which has 64 CPUs.
 

Results:
_spin_lock_irqsave is around 96.4%.

Please note:
1. This issue can only be reproduced on specific hardware.
Intel(R) Xeon(R) CPU L7555 @ 1.87GHz can reproduce this issue.

model name	: Intel(R) Xeon(R) CPU           L7555  @ 1.87GHz

2. I also tried the Intel X7550, Intel(R) W3520, and AMD 6172; these platforms cannot reproduce this issue.

Comment 27 Aristeu Rozanski 2011-08-11 19:48:08 UTC
Patch(es) available on kernel-2.6.32-153.el6

Comment 30 FuXiangChun 2011-10-12 11:31:16 UTC
Thanks to mgoldboi for lending the Intel L7555 host to me.

Reproduced on kernel-2.6.32-128.el6.x86_64:

After the 3 steps below,
actual result:
about 47.7% _spin_lock_irqsave  [kernel.kallsyms]

Verified on kernel-2.6.32-202.el6.x86_64:

After the 3 steps below,
actual result:
3.9% _spin_lock_irqsave  [kernel.kallsyms]

Reproduction steps:
1. boot 500 guests
kernel128]# ps aux|grep qemu|wc -l
500
2. gcc -o test -lrt test.c
   ./test
3. perf top


Based on the above testing results, this bug has been fixed.

Comment 31 juzhang 2011-10-13 01:51:55 UTC
According to comment 30, setting this issue to VERIFIED.

Comment 32 errata-xmlrpc 2011-12-06 13:31:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

