Bug 1233568 - qemu eats 100% CPU after host suspend/resume
Summary: qemu eats 100% CPU after host suspend/resume
Keywords:
Status: CLOSED DUPLICATE of bug 1380893
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 23
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-06-19 07:57 UTC by Jakub Hrozek
Modified: 2017-03-16 15:38 UTC
CC: 13 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-05-05 13:27:34 UTC
Type: Bug
Embargoed:



Description Jakub Hrozek 2015-06-19 07:57:38 UTC
Description of problem:
Filing against qemu for lack of a better idea; feel free to redirect me.

Most of the time (not always, but most of the time), when I suspend my laptop while it is running some VMs and later resume it, the qemu-kvm process eats 100% CPU and the VMs are effectively locked up and must be power-cycled.

Version-Release number of selected component (if applicable):
qemu-kvm-2.3.0-5.fc22.x86_64, but this has been going on for a while.

How reproducible:
About 75% of the time, I'd say.

Steps to Reproduce:
1. Run VMs; I mostly run Fedora and RHEL VMs.
2. Suspend the laptop hosting the VMs (close the lid).
3. The next morning, open the lid.

Actual results:
qemu spikes the CPU and has to be killed.

Expected results:
Graceful resume of the guests.

Additional info:
I'm adding a backtrace. Please let me know what more info I can get.

#0  0x00007f77ec8e1731 in __GI_ppoll (fds=0x7f77fda90100, nfds=6, timeout=<optimized out>, timeout@entry=0x7fffae413720, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
        resultvar = 18446744073709551102
        oldtype = 0
        tval = {tv_sec = 0, tv_nsec = 521145404}
        result = <optimized out>
#1  0x00007f77fb672b89 in qemu_poll_ns (__ss=0x0, __timeout=0x7fffae413720, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
        ts = {tv_sec = 0, tv_nsec = 718571157}
Python Exception <class 'gdb.error'> That operation is not available on integers of more than 8 bytes.: 
#2  0x00007f77fb672b89 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=718571157) at qemu-timer.c:322
        ts = {tv_sec = 0, tv_nsec = 718571157}
Python Exception <class 'gdb.error'> That operation is not available on integers of more than 8 bytes.: 
#3  0x00007f77fb67243c in main_loop_wait (timeout=718571157) at main-loop.c:239
        ret = <optimized out>
        spin_counter = 0
        ret = -58739600
        timeout = 4294967295
        timeout_ns = <optimized out>
#4  0x00007f77fb67243c in main_loop_wait (nonblocking=<optimized out>) at main-loop.c:494
        ret = -58739600
        timeout = 4294967295
        timeout_ns = <optimized out>
#5  0x00007f77fb41c37b in main () at vl.c:1798
        nonblocking = <optimized out>
        last_io = 0
        i = <optimized out>
        snapshot = <optimized out>
        linux_boot = <optimized out>
        initrd_filename = <optimized out>
        kernel_filename = <optimized out>
        kernel_cmdline = <optimized out>
        boot_order = <optimized out>
        boot_once = <optimized out>
        ds = <optimized out>
        cyls = <optimized out>
        heads = <optimized out>
        secs = <optimized out>
        translation = <optimized out>
        hda_opts = <optimized out>
        opts = <optimized out>
        machine_opts = <optimized out>
        icount_opts = <optimized out>
        olist = <optimized out>
        optind = 55
        optarg = 0x7f77fc7f7970 "pc-i440fx-2.1"
        loadvm = <optimized out>
        machine_class = <optimized out>
        cpu_model = <optimized out>
        vga_model = 0x0
        qtest_chrdev = <optimized out>
        qtest_log = <optimized out>
        pid_file = <optimized out>
        incoming = <optimized out>
        defconfig = <optimized out>
        userconfig = false
        log_mask = <optimized out>
        log_file = <optimized out>
        mem_trace = {malloc = 0x7f77fb5164e0 <malloc_and_trace>, realloc = 0x7f77fb5164c0 <realloc_and_trace>, free = 0x7f77fb5164b0 <free_and_trace>, calloc = 0x0, try_malloc = 0x0, try_realloc = 0x0}
        trace_events = <optimized out>
        trace_file = <optimized out>
        maxram_size = <optimized out>
        ram_slots = <optimized out>
        vmstate_dump_file = <optimized out>
        main_loop_err = 0x0
        __func__ = "main"
#6  0x00007f77fb41c37b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4362
        i = <optimized out>
        snapshot = <optimized out>
        linux_boot = <optimized out>
        initrd_filename = <optimized out>
        kernel_filename = <optimized out>
        kernel_cmdline = <optimized out>
        boot_order = <optimized out>
        boot_once = <optimized out>
        ds = <optimized out>
        cyls = <optimized out>
        heads = <optimized out>
        secs = <optimized out>
        translation = <optimized out>
        hda_opts = <optimized out>
        opts = <optimized out>
        machine_opts = <optimized out>
        icount_opts = <optimized out>
        olist = <optimized out>
        optind = 55
        optarg = 0x7f77fc7f7970 "pc-i440fx-2.1"
        loadvm = <optimized out>
        machine_class = <optimized out>
        cpu_model = <optimized out>
        vga_model = 0x0
        qtest_chrdev = <optimized out>
        qtest_log = <optimized out>
        pid_file = <optimized out>
        incoming = <optimized out>
        defconfig = <optimized out>
        userconfig = false
        log_mask = <optimized out>
        log_file = <optimized out>
        mem_trace = {malloc = 0x7f77fb5164e0 <malloc_and_trace>, realloc = 0x7f77fb5164c0 <realloc_and_trace>, free = 0x7f77fb5164b0 <free_and_trace>, calloc = 0x0, try_malloc = 0x0, try_realloc = 0x0}
        trace_events = <optimized out>
        trace_file = <optimized out>
        maxram_size = <optimized out>
        ram_slots = <optimized out>
        vmstate_dump_file = <optimized out>
        main_loop_err = 0x0
        __func__ = "main"

Comment 1 Cole Robinson 2015-10-14 14:47:28 UTC
*** Bug 1165352 has been marked as a duplicate of this bug. ***

Comment 2 Cole Robinson 2015-10-14 14:54:31 UTC
*** Bug 1221518 has been marked as a duplicate of this bug. ***

Comment 3 Cole Robinson 2015-11-04 23:36:19 UTC
*** Bug 1178533 has been marked as a duplicate of this bug. ***

Comment 4 Dimitris 2015-11-04 23:44:33 UTC
BTW I've reproduced this under F23 too.

Comment 5 Cole Robinson 2015-11-05 00:26:59 UTC
Thanks, setting version to f23. I'll try and reproduce eventually :)

Comment 6 Cole Robinson 2016-05-02 21:04:33 UTC
Actually, it sounds like the commit below may have fixed the CPU burn issue. It's in qemu 2.4.0, which is in F23. So, can anyone else still reproduce this on an F23 host?


commit fb1a3a051d89975f26296163066bb0745ecca49d
Author: Paul Donohue <qemu-devel>
Date:   Fri Jun 12 10:08:45 2015 -0400

    qemu-timer: Call clock reset notifiers on forward jumps
    
    Commit 691a0c9c introduced a mechanism by which QEMU_CLOCK_HOST can
    notify other parts of the emulator when the host clock has jumped
    backward.  This is used to avoid stalling timers that were scheduled
    based on the host clock.
    
    However, if the host clock jumps forward, then timers that were
    scheduled based on the host clock may fire rapidly and cause other
    problems.  For example, the mc146818rtc periodic timer will block
    execution of the VM and consume host CPU while firing every interrupt
    for the time period that was skipped by the host clock.
    
    To correct that problem, this commit fires the reset notification if the
    host clock jumps forward by more than a hard-coded limit.  The limit is
    currently set to a value of 60 seconds, which should be small enough to
    prevent excessive timer loops, but large enough to avoid frequent resets
    in idle VMs.
    
    Signed-off-by: Paul Donohue <qemu-git>
    Message-Id: <20150612140845.GD2749>
    Signed-off-by: Paolo Bonzini <pbonzini>
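
For anyone curious how that fix works: commit 691a0c9c added reset notifiers that fire when QEMU_CLOCK_HOST jumps backward, and this commit extends the same check to large forward jumps. Below is a minimal self-contained sketch of the idea, with stand-in types and names (QEMUClock, notify_clock_reset, host_clock_get_ns) that only approximate the qemu-timer internals; it illustrates the technique the commit message describes, not the literal diff:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NANOSECONDS_PER_SECOND 1000000000LL

/* Stand-in for qemu's per-clock state; only the last-read time matters here. */
typedef struct QEMUClock {
    int64_t last;
} QEMUClock;

/* Host wall-clock time in nanoseconds. */
static int64_t get_clock_realtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * NANOSECONDS_PER_SECOND + ts.tv_nsec;
}

/* Stand-in for the notifier-list walk: in qemu this lets host-clock
 * timers (e.g. the mc146818rtc periodic timer) re-arm from 'now'
 * instead of replaying every tick they "missed" during suspend. */
static void notify_clock_reset(QEMUClock *clock, int64_t now)
{
    (void)clock;
    printf("host clock jumped; resetting host-clock timers (now=%lld)\n",
           (long long)now);
}

/* The hard-coded 60 s limit from the commit message. */
static int64_t get_max_clock_jump(void)
{
    return 60 * NANOSECONDS_PER_SECOND;
}

/* Reading the host clock: before the patch, only now < last (a backward
 * jump) fired the notifiers; the patch adds the forward-jump branch. */
static int64_t host_clock_get_ns(QEMUClock *clock)
{
    int64_t now = get_clock_realtime();
    int64_t last = clock->last;

    clock->last = now;
    if (now < last || now > last + get_max_clock_jump()) {
        notify_clock_reset(clock, now);
    }
    return now;
}

int main(void)
{
    QEMUClock host = { .last = get_clock_realtime() };
    host.last -= 2 * 3600 * NANOSECONDS_PER_SECOND; /* pretend we slept 2 h */
    host_clock_get_ns(&host);  /* forward jump > 60 s -> notifier fires */
    return 0;
}

The 60-second threshold is the trade-off the commit message spells out: large enough that an idle VM's normal drift doesn't trigger resets, small enough that a resumed VM doesn't spend minutes replaying RTC interrupts.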

Comment 7 Aleksandar Kostadinov 2016-05-02 22:29:10 UTC
I've hit a short CPU burn with Fedora 23, but nothing as significant as before. I'll report back if I happen to hit it again. I'm not sure when exactly the above patch became effective in Fedora 23.

Comment 8 Aleksandar Kostadinov 2016-05-02 22:29:43 UTC
I meant a few short ...

Comment 9 Jakub Hrozek 2016-05-03 08:23:29 UTC
Ah, thanks for the good news. I'm still on F-22 at the moment, would it be sufficient to upgrade to F-23's qemu/libvirt/etc or does this fix depend on other components (kernel maybe) as well?

Comment 10 Cole Robinson 2016-05-03 12:08:12 UTC
(In reply to Jakub Hrozek from comment #9)
> Ah, thanks for the good news. I'm still on F-22 at the moment, would it be
> sufficient to upgrade to F-23's qemu/libvirt/etc or does this fix depend on
> other components (kernel maybe) as well?

That patch is just in qemu, so updating only qemu should tell you whether it's the complete fix or not.

Comment 11 Dimitris 2016-05-05 03:29:09 UTC
As of kernel 4.4.8-300.fc23 (both host and guest) and qemu 2:2.4.1-8.fc23 this is no longer happening here.

Comment 12 Cole Robinson 2016-05-05 13:27:34 UTC
Okay, closing. But if anyone else can reliably reproduce with F23 or later, please reopen and list your package versions.

Comment 13 Dimitris 2016-07-05 01:20:49 UTC
This regressed for me after upgrading from F23 to F24.

The host was upgraded on 2016-07-02 and the guest on 2016-07-04.  This seems to have regressed when the guest was upgraded, but I can't be 100% sure about that - my use of the guest has been sporadic over the weekend.

The steps to reproduce are:

- Host and guest are both F24 on x86_64
- Start the guest
- Suspend the host
- 1-2 hours later, resume the host
- Initially the guest seems OK, but after a few minutes the qemu process pegs all its CPU cores (the host has 2 physical cores with hyperthreading; the guest is configured to see 2 cores). The graph in virt-manager shows 100% guest CPU and 50% host CPU.
- Guest is extremely slow to respond to e.g. SSH sessions at this point.
- But eventually a "virsh shutdown <guest>" will take effect.

qemu is at 2:2.6.0-4.fc24

Comment 14 Dimitris 2016-07-05 17:25:55 UTC
Update: this regression doesn't seem related to suspend; it happened to me on a qemu started after the host was resumed: bug 1352992

Comment 16 Aleksandar Kostadinov 2016-07-05 17:36:16 UTC
Advice generously provided by Cole on how to get debug info when that happens:
> sudo debuginfo-install qemu-system-x86
>
> Then on the next hang, grab the pid of the busted VM from ps axwww, and do:
>
> sudo pstack $pid
>
> Then dump that output in a bug report, along with
> /var/log/libvirt/qemu/$vmname.log. File it against qemu

Comment 17 Amit Shah 2016-07-13 09:43:42 UTC
(In reply to Dimitris from comment #13)
> This regressed for me after upgrading from F23 to F24.
> 
> The host was upgraded on 2016-07-02 and the guest on 2016-07-04.  This seems
> to have regressed when the guest was upgraded, but I can't be 100% sure
> about that - my use of the guest has been sporadic over the weekend.

I've seen similar behaviour.

If I let the guest sit in that state for a long time (10-15 mins), qemu eventually returns to normal CPU usage, and the guest behaves normally.

If I remove the guest agents in the guest (qemu-guest-agent and spice-vdagent), this doesn't happen, but sometimes the guest loses its network after resume.

Can you confirm if you're seeing the same bug as I am?

Comment 18 Dimitris 2016-07-13 16:06:49 UTC
(In reply to Amit Shah from comment #17)

I'll try stopping and disabling the guest agents. BTW, there's a host-side spice-gtk update (https://bodhi.fedoraproject.org/updates/FEDORA-2016-de01d538f0) that I've applied.

Since this bug is closed, shouldn't I be updating bug 1352992 instead?

Comment 19 Dimitris 2016-07-13 20:56:51 UTC
(In reply to Dimitris from comment #18)

> I'll try stopping and disabling the guest agents.

Tried, no change - after host suspend/resume the qemu process spins the CPU again.

Comment 20 Dimitris 2016-07-14 23:37:45 UTC
With the agents running, I noticed something else: this is now happening even when the guest was shut off while the host was suspended.

Also, I have a host->guest port redirection (host port 3000 redirects to guest port 3000). Immediately before the qemu process started to spin the CPU, I was loading a page from the guest-run web stack using a host-run browser. At that time, in the host log:

> [19582.635045] TCP: request_sock_TCP: Possible SYN flooding on port 3000. Sending cookies.  Check SNMP counters.

(followed by a thermal throttle event due to the CPU spin).

Comment 21 Aleksandar Kostadinov 2016-10-01 07:25:33 UTC
FYI, I created a new issue for Fedora 24 (bug 1380893). I don't see messages about SYN flooding in my logs. I attached my logs to the bug (though I decided to mark them private).

Comment 22 Cole Robinson 2017-03-16 15:38:30 UTC

*** This bug has been marked as a duplicate of bug 1380893 ***

