Description of problem:
Filing against qemu for lack of a better idea, feel free to redirect me. Most of the time (not always, but mostly) when I suspend my laptop that was running some VMs and later resume it, the qemu-kvm process eats 100% CPU and the VMs are effectively locked up and must be power-cycled.

Version-Release number of selected component (if applicable):
qemu-kvm-2.3.0-5.fc22.x86_64, but this has been going on for a while.

How reproducible:
About 75%, I'd say.

Steps to Reproduce:
1. Run VMs. I mostly run Fedora and RHEL VMs.
2. Suspend the laptop hosting the VMs (close the lid).
3. Next morning, open the lid.

Actual results:
qemu spiking CPU, has to be killed.

Expected results:
Graceful resume of the guests.

Additional info:
I'm adding a backtrace. Please let me know what more info I can get.

#0  0x00007f77ec8e1731 in __GI_ppoll (fds=0x7f77fda90100, nfds=6, timeout=<optimized out>, timeout@entry=0x7fffae413720, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
        resultvar = 18446744073709551102
        oldtype = 0
        tval = {tv_sec = 0, tv_nsec = 521145404}
        result = <optimized out>
#1  0x00007f77fb672b89 in qemu_poll_ns (__ss=0x0, __timeout=0x7fffae413720, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
        ts = {tv_sec = 0, tv_nsec = 718571157}
Python Exception <class 'gdb.error'> That operation is not available on integers of more than 8 bytes.:
#2  0x00007f77fb672b89 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=718571157) at qemu-timer.c:322
        ts = {tv_sec = 0, tv_nsec = 718571157}
Python Exception <class 'gdb.error'> That operation is not available on integers of more than 8 bytes.:
#3  0x00007f77fb67243c in main_loop_wait (timeout=718571157) at main-loop.c:239
        ret = <optimized out>
        spin_counter = 0
        ret = -58739600
        timeout = 4294967295
        timeout_ns = <optimized out>
#4  0x00007f77fb67243c in main_loop_wait (nonblocking=<optimized out>) at main-loop.c:494
        ret = -58739600
        timeout = 4294967295
        timeout_ns = <optimized out>
#5  0x00007f77fb41c37b in main () at vl.c:1798
        nonblocking = <optimized out>
        last_io = 0
        i = <optimized out>
        snapshot = <optimized out>
        linux_boot = <optimized out>
        initrd_filename = <optimized out>
        kernel_filename = <optimized out>
        kernel_cmdline = <optimized out>
        boot_order = <optimized out>
        boot_once = <optimized out>
        ds = <optimized out>
        cyls = <optimized out>
        heads = <optimized out>
        secs = <optimized out>
        translation = <optimized out>
        hda_opts = <optimized out>
        opts = <optimized out>
        machine_opts = <optimized out>
        icount_opts = <optimized out>
        olist = <optimized out>
        optind = 55
        optarg = 0x7f77fc7f7970 "pc-i440fx-2.1"
        loadvm = <optimized out>
        machine_class = <optimized out>
        cpu_model = <optimized out>
        vga_model = 0x0
        qtest_chrdev = <optimized out>
        qtest_log = <optimized out>
        pid_file = <optimized out>
        incoming = <optimized out>
        defconfig = <optimized out>
        userconfig = false
        log_mask = <optimized out>
        log_file = <optimized out>
        mem_trace = {malloc = 0x7f77fb5164e0 <malloc_and_trace>, realloc = 0x7f77fb5164c0 <realloc_and_trace>, free = 0x7f77fb5164b0 <free_and_trace>, calloc = 0x0, try_malloc = 0x0, try_realloc = 0x0}
        trace_events = <optimized out>
        trace_file = <optimized out>
        maxram_size = <optimized out>
        ram_slots = <optimized out>
        vmstate_dump_file = <optimized out>
        main_loop_err = 0x0
        __func__ = "main"
#6  0x00007f77fb41c37b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4362
        i = <optimized out>
        snapshot = <optimized out>
        linux_boot = <optimized out>
        initrd_filename = <optimized out>
        kernel_filename = <optimized out>
        kernel_cmdline = <optimized out>
        boot_order = <optimized out>
        boot_once = <optimized out>
        ds = <optimized out>
        cyls = <optimized out>
        heads = <optimized out>
        secs = <optimized out>
        translation = <optimized out>
        hda_opts = <optimized out>
        opts = <optimized out>
        machine_opts = <optimized out>
        icount_opts = <optimized out>
        olist = <optimized out>
        optind = 55
        optarg = 0x7f77fc7f7970 "pc-i440fx-2.1"
        loadvm = <optimized out>
        machine_class = <optimized out>
        cpu_model = <optimized out>
        vga_model = 0x0
        qtest_chrdev = <optimized out>
        qtest_log = <optimized out>
        pid_file = <optimized out>
        incoming = <optimized out>
        defconfig = <optimized out>
        userconfig = false
        log_mask = <optimized out>
        log_file = <optimized out>
        mem_trace = {malloc = 0x7f77fb5164e0 <malloc_and_trace>, realloc = 0x7f77fb5164c0 <realloc_and_trace>, free = 0x7f77fb5164b0 <free_and_trace>, calloc = 0x0, try_malloc = 0x0, try_realloc = 0x0}
        trace_events = <optimized out>
        trace_file = <optimized out>
        maxram_size = <optimized out>
        ram_slots = <optimized out>
        vmstate_dump_file = <optimized out>
        main_loop_err = 0x0
        __func__ = "main"
*** Bug 1165352 has been marked as a duplicate of this bug. ***
*** Bug 1221518 has been marked as a duplicate of this bug. ***
*** Bug 1178533 has been marked as a duplicate of this bug. ***
BTW I've reproduced this under F23 too.
Thanks, setting version to f23. I'll try and reproduce eventually :)
Actually, it sounds like this commit may have fixed the CPU burn issue. It's in 2.4.0, which is in F23. So can anyone else still reproduce this on an F23 host?

commit fb1a3a051d89975f26296163066bb0745ecca49d
Author: Paul Donohue <qemu-devel>
Date:   Fri Jun 12 10:08:45 2015 -0400

    qemu-timer: Call clock reset notifiers on forward jumps

    Commit 691a0c9c introduced a mechanism by which QEMU_CLOCK_HOST can
    notify other parts of the emulator when the host clock has jumped
    backward. This is used to avoid stalling timers that were scheduled
    based on the host clock. However, if the host clock jumps forward,
    then timers that were scheduled based on the host clock may fire
    rapidly and cause other problems. For example, the mc146818rtc
    periodic timer will block execution of the VM and consume host CPU
    while firing every interrupt for the time period that was skipped
    by the host clock.

    To correct that problem, this commit fires the reset notification
    if the host clock jumps forward by more than a hard-coded limit.
    The limit is currently set to a value of 60 seconds, which should
    be small enough to prevent excessive timer loops, but large enough
    to avoid frequent resets in idle VMs.

    Signed-off-by: Paul Donohue <qemu-git>
    Message-Id: <20150612140845.GD2749>
    Signed-off-by: Paolo Bonzini <pbonzini>
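For illustration, the check described in that commit message can be sketched as follows. This is a minimal sketch of the described behavior only; the type and function names are placeholders of my own, not QEMU's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Forward jumps beyond this limit fire the clock reset notifiers,
 * per the commit message above (60 seconds, in nanoseconds). */
#define HOST_CLOCK_JUMP_LIMIT_NS (60LL * 1000000000LL)

typedef struct {
    int64_t last; /* last observed host clock value, in ns */
} HostClockState;

/* Returns true if the reset notifiers should fire: either the host
 * clock went backward (the pre-existing 691a0c9c behavior), or it
 * jumped forward by more than the 60 s limit (the new behavior). */
static bool host_clock_jumped(HostClockState *s, int64_t now_ns)
{
    bool fire = (now_ns < s->last) ||
                (now_ns - s->last > HOST_CLOCK_JUMP_LIMIT_NS);
    s->last = now_ns;
    return fire;
}
```

Firing the notifiers on a large forward jump lets timers scheduled against QEMU_CLOCK_HOST be rescheduled instead of replaying every missed tick, which is what was pegging the CPU after resume.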
I've hit a short CPU burn with Fedora 23, but nothing as significant as before. I'll report back if I happen to hit it again. Not sure exactly when the above patch became effective in Fedora 23.
I meant a few short ...
Ah, thanks for the good news. I'm still on F-22 at the moment, would it be sufficient to upgrade to F-23's qemu/libvirt/etc or does this fix depend on other components (kernel maybe) as well?
(In reply to Jakub Hrozek from comment #9)
> Ah, thanks for the good news. I'm still on F-22 at the moment, would it be
> sufficient to upgrade to F-23's qemu/libvirt/etc or does this fix depend on
> other components (kernel maybe) as well?

That patch is just in qemu, so updating only qemu will determine whether it's the complete fix or not.
As of kernel 4.4.8-300.fc23 (both host and guest) and qemu 2:2.4.1-8.fc23 this is no longer happening here.
Okay, closing. But if anyone else can reliably reproduce with f23 or later, please reopen and list your package versions
This regressed for me after upgrading from F23 to F24.

The host was upgraded on 2016-07-02 and the guest on 2016-07-04. This seems to have regressed when the guest was upgraded, but I can't be 100% sure about that; my use of the guest has been sporadic over the weekend.

The steps to reproduce are:
- Host and guest are both F24 on x86_64
- Start the guest
- Suspend the host
- 1-2 hours later, resume the host
- Initially the guest seems OK, but after a few minutes the qemu process pegs all its CPU cores (host has 2 physical cores with hyperthreading; guest is configured to see 2 cores). Graph in virt-manager shows 100% guest CPU and 50% host CPU.
- Guest is extremely slow to respond to e.g. SSH sessions at this point.
- But eventually a "virsh shutdown <guest>" will take effect.

qemu is at 2:2.6.0-4.fc24
Update: this regression doesn't seem related to suspend; it happened to me on a qemu started after the host was resumed: bug 1352992
Advice generously provided by Cole on how to get debug info when that happens:
> sudo debuginfo-install qemu-system-x86
>
> Then on the next hang, grab the pid of the busted VM from ps axwww, and do:
>
> sudo pstack $pid
>
> Then dump that output in a bug report, along with
> /var/log/libvirt/qemu/$vmname.log. File it against qemu
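The pid-grabbing step above can be wrapped in a small helper. A sketch, with function names of my own; it assumes the guest name appears as "-name guest=<name>" on the qemu command line, which is how libvirt passes it:

```shell
# Filter "PID ARGS" lines on stdin, printing the PID of the qemu
# process whose command line mentions the given guest name.
match_qemu_pid() {
    awk -v name="guest=$1" '/qemu/ && index($0, name) { print $1; exit }'
}

# Look up the qemu PID for a running guest, then grab a stack dump
# with pstack as described above.
pstack_guest() {
    pid=$(ps axwww -o pid=,args= | match_qemu_pid "$1")
    [ -n "$pid" ] && sudo pstack "$pid"
}
```

Usage would be something like `pstack_guest f24vm > hang-backtrace.txt` on the next hang.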
(In reply to Dimitris from comment #13)
> This regressed for me after upgrading from F23 to F24.
>
> The host was upgraded on 2016-07-02 and the guest on 2016-07-04. This seems
> to have regressed when the guest was upgraded, but I can't be 100% sure
> about that - my use of the guest has been sporadic over the weekend.

I've seen similar behaviour. If I let the guest sit in that state for a long time (10-15 mins), qemu eventually returns to normal CPU usage and the guest behaves normally. If I remove the guest agents in the guest (qemu-guest-agent and spice-vdagent), this doesn't happen, but sometimes the guest loses its network after resume. Can you confirm whether you're seeing the same bug as I am?
(In reply to Amit Shah from comment #17)

I'll try stopping and disabling the guest agents.

BTW there's a host-side spice-gtk update (https://bodhi.fedoraproject.org/updates/FEDORA-2016-de01d538f0) that I've updated to.

Since this bug is closed, shouldn't I be updating bug 1352992 instead?
(In reply to Dimitris from comment #18) > I'll try stopping and disabling the guest agents. Tried, no change - after host suspend/resume the qemu process spins the CPU again.
With agents running, something else I noticed: this is now happening even with the guest shut off while the host is suspended.

Also, I have a host->guest port redirection (host port 3000 redirects to guest port 3000). Immediately before the qemu process started to spin the CPU, I was loading a page on the guest-run web stack using a host-run browser. At that time, in the host log:

> [19582.635045] TCP: request_sock_TCP: Possible SYN flooding on port 3000. Sending cookies. Check SNMP counters.

(followed by a thermal throttle event due to the CPU spin)
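For context, a host->guest redirection like the one described is typically set up with qemu's user-mode networking hostfwd option. This is only a sketch of that kind of configuration (the reporter's actual setup goes through libvirt and may differ; the image path and memory size are placeholders):

```shell
# Config sketch: forward host TCP port 3000 to guest port 3000 via
# qemu's user-mode (SLIRP) networking. Guest image is a placeholder.
qemu-system-x86_64 \
    -m 2048 -enable-kvm \
    -netdev user,id=net0,hostfwd=tcp::3000-:3000 \
    -device virtio-net-pci,netdev=net0 \
    -drive file=guest.qcow2,if=virtio
```

With user-mode networking, qemu itself answers the host-side connections, so a stalled qemu main loop would leave those SYNs unserviced, which is consistent with the SYN-flood warning above.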
FYI I created a new issue for Fedora 24 (bug 1380893). I don't see messages about syn flooding in my logs. I attached my logs to the bug (decided to mark them private though).
*** This bug has been marked as a duplicate of bug 1380893 ***