Description of problem:
Fedora 24 host and guest, both dnf-upgraded from F23. A few minutes after starting the guest, and while "top" in the guest shows it as idle, one of the host cores is spun at 100% by the qemu process. This continues for a few minutes, then stops.

Version-Release number of selected component (if applicable):
2:2.6.0-4.fc24

How reproducible:
Every time

Steps to Reproduce:
1. Start the guest
2. Do some work on the guest
3. After a few minutes, one of the host CPU cores is pegged at 100%.
4. After a few more minutes, the spinning stops.

Actual results:
One host CPU core spins without any work apparently being done in the guest.

Expected results:
Host CPU usage by qemu corresponds to the guest workload.

Additional info:
One more thing: there is also a Windows 10 guest running, which does not exhibit this behavior. When that VM takes up guest/host CPU, it's "legitimate" - one of Win10's many background, not-yet-disabled processes doing its thing, with the CPU shown as correspondingly busy in the Windows Task Manager.
Update: This happens more than once after the guest starts. Not sure what the trigger event is, as I keep starting/stopping an application stack in the guest throughout the day.

pstack output during a recent occurrence:

Thread 8 (Thread 0x7f17297fa700 (LWP 1645)):
#0  0x00007f19489f72b3 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f17297f7660, expected=0, futex_word=0x56171155a888) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x56171155a888, abstime=abstime@entry=0x7f17297f7660) at sem_waitcommon.c:111
#2  0x00007f19489f737f in __new_sem_wait_slow (sem=0x56171155a888, abstime=0x7f17297f7660) at sem_waitcommon.c:181
#3  0x00007f19489f7432 in sem_timedwait (sem=sem@entry=0x56171155a888, abstime=abstime@entry=0x7f17297f7660) at sem_timedwait.c:36
#4  0x000056170f9eae87 in qemu_sem_timedwait (sem=sem@entry=0x56171155a888, ms=ms@entry=10000) at util/qemu-thread-posix.c:245
#5  0x000056170f95043c in worker_thread (opaque=0x56171155a820) at thread-pool.c:92
#6  0x00007f19489ef5ca in start_thread (arg=0x7f17297fa700) at pthread_create.c:333
#7  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7f17217ea700 (LWP 1592)):
#0  0x00007f19489f72b3 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f17217e7660, expected=0, futex_word=0x56171155a888) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x56171155a888, abstime=abstime@entry=0x7f17217e7660) at sem_waitcommon.c:111
#2  0x00007f19489f737f in __new_sem_wait_slow (sem=0x56171155a888, abstime=0x7f17217e7660) at sem_waitcommon.c:181
#3  0x00007f19489f7432 in sem_timedwait (sem=sem@entry=0x56171155a888, abstime=abstime@entry=0x7f17217e7660) at sem_timedwait.c:36
#4  0x000056170f9eae87 in qemu_sem_timedwait (sem=sem@entry=0x56171155a888, ms=ms@entry=10000) at util/qemu-thread-posix.c:245
#5  0x000056170f95043c in worker_thread (opaque=0x56171155a820) at thread-pool.c:92
#6  0x00007f19489ef5ca in start_thread (arg=0x7f17217ea700) at pthread_create.c:333
#7  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7f171c7e0700 (LWP 31489)):
#0  0x00007f19489f72b3 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f171c7dd660, expected=0, futex_word=0x56171155a888) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x56171155a888, abstime=abstime@entry=0x7f171c7dd660) at sem_waitcommon.c:111
#2  0x00007f19489f737f in __new_sem_wait_slow (sem=0x56171155a888, abstime=0x7f171c7dd660) at sem_waitcommon.c:181
#3  0x00007f19489f7432 in sem_timedwait (sem=sem@entry=0x56171155a888, abstime=abstime@entry=0x7f171c7dd660) at sem_timedwait.c:36
#4  0x000056170f9eae87 in qemu_sem_timedwait (sem=sem@entry=0x56171155a888, ms=ms@entry=10000) at util/qemu-thread-posix.c:245
#5  0x000056170f95043c in worker_thread (opaque=0x56171155a820) at thread-pool.c:92
#6  0x00007f19489ef5ca in start_thread (arg=0x7f171c7e0700) at pthread_create.c:333
#7  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7f172d1ff700 (LWP 21368)):
#0  0x00007f194871d32d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f194a3d3f8f in poll (__timeout=<optimized out>, __nfds=20, __fds=0x561713ea0038) at /usr/include/bits/poll2.h:46
#2  red_worker_main (arg=<optimized out>) at red_worker.c:12420
#3  0x00007f19489ef5ca in start_thread (arg=0x7f172d1ff700) at pthread_create.c:333
#4  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7f1936777700 (LWP 21366)):
#0  0x00007f194871eaa7 in ioctl () at ../sysdeps/unix/syscall-template.S:84
#1  0x000056170f700b55 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561711ecc000, type=type@entry=44672) at /usr/src/debug/qemu-2.6.0/kvm-all.c:1984
#2  0x000056170f700c0e in kvm_cpu_exec (cpu=cpu@entry=0x561711ecc000) at /usr/src/debug/qemu-2.6.0/kvm-all.c:1834
#3  0x000056170f6ef2ae in qemu_kvm_cpu_thread_fn (arg=<optimized out>) at /usr/src/debug/qemu-2.6.0/cpus.c:1064
#4  0x00007f19489ef5ca in start_thread (arg=0x7f1936777700) at pthread_create.c:333
#5  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7f1936f78700 (LWP 21365)):
#0  0x00007f194871eaa7 in ioctl () at ../sysdeps/unix/syscall-template.S:84
#1  0x000056170f700b55 in kvm_vcpu_ioctl (cpu=cpu@entry=0x561711e4c000, type=type@entry=44672) at /usr/src/debug/qemu-2.6.0/kvm-all.c:1984
#2  0x000056170f700c0e in kvm_cpu_exec (cpu=cpu@entry=0x561711e4c000) at /usr/src/debug/qemu-2.6.0/kvm-all.c:1834
#3  0x000056170f6ef2ae in qemu_kvm_cpu_thread_fn (arg=<optimized out>) at /usr/src/debug/qemu-2.6.0/cpus.c:1064
#4  0x00007f19489ef5ca in start_thread (arg=0x7f1936f78700) at pthread_create.c:333
#5  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7f1939590700 (LWP 21363)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000056170f9eb048 in futex_wait (val=<optimized out>, ev=<optimized out>) at util/qemu-thread-posix.c:292
#2  qemu_event_wait (ev=ev@entry=0x5617103e4264 <rcu_call_ready_event>) at util/qemu-thread-posix.c:399
#3  0x000056170f9f91be in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:250
#4  0x00007f19489ef5ca in start_thread (arg=0x7f1939590700) at pthread_create.c:333
#5  0x00007f1948728ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7f1965d19f80 (LWP 21360)):
#0  0x00007f194871d3f1 in __GI_ppoll (fds=0x561713bb1000, nfds=80, timeout=<optimized out>, timeout@entry=0x7ffffd5b74a0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:50
#1  0x000056170f958ae9 in ppoll (__ss=0x0, __timeout=0x7ffffd5b74a0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=499000000) at qemu-timer.c:325
#3  0x000056170f95850a in os_host_main_loop_wait (timeout=499000000) at main-loop.c:252
#4  main_loop_wait (nonblocking=<optimized out>) at main-loop.c:506
#5  0x000056170f6bbdbd in main_loop () at vl.c:1934
#6  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4656
Created attachment 1176669 [details]
pstack output after the CPU spin stops

Attaching pstack output from when the qemu process has stopped spinning the CPU.
Another update: This morning I noticed that after the F23->F24 upgrade, for whatever reason, the domain's definition/XML was slightly different than before. I compared the current XML against the pre-upgrade backup, and this is one of the differences:

24,25c24,25
< <cpu mode='custom' match='exact'>
< <model fallback='allow'>Broadwell</model>
---
> <cpu mode='host-model'>
> <model fallback='allow'/>

Not sure what caused that and whether it's some libvirt or virt-manager bug - I've seen "mysterious" changes happen once or twice before in my guests' definitions, but put it down to fat-fingering some unrelated change in the guest. Anyway, I changed the CPU definition back to explicitly Broadwell (same as the host, btw, so should the diff above really matter?) and so far, no CPU spinning. I'll keep an eye on this, take it through a couple of host suspend/resume cycles (related old bug 1233568), and post back here. Still seems like a legitimate bug though - letting the CPU definition fall back to the host config shouldn't cause idle spinning.
Unfortunately this is still there. The behavior is a regression to how things were under F22: after the host resumes, the qemu process spins the CPU. "virsh shutdown" never completes; "virsh destroy" does after a while. Also, while this CPU spin is ongoing, gnome-terminal is almost completely unresponsive. I usually end up having to reboot to bring things back to normal.
I just reproduced this without the host suspending/resuming while the guest was running. I had shut the guest down before suspending, then resumed the host and restarted the guest afterwards. Several hours later, the symptoms above (CPU burn, unresponsive qemu process, and somehow an unresponsive gnome-terminal on the host too) occurred. I tried to run pstack on the qemu process and that command also went completely unresponsive. This spinning qemu is becoming quite the black hole.
Just an update, this seems to be getting worse. I got two incidents only minutes apart on the same guest - each time after the guest had been cleanly shut down and restarted before. This is making qemu Fedora guests almost unusable for daily work. Ironically, the Windows 10 guest on the same machine works without any issues.
Can you please try grabbing a "strace -ff -p QEMUPID" and two or three "sudo trace-cmd record -e kvm -- sleep 3" while the guest is spinning? The output may be huge, so compress it and feel free to contact me off Bugzilla to share it. Thanks!
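For reference, this is roughly how I'd capture those on the host while the spin is ongoing - a sketch only; the qemu binary name in the pgrep call and the use of xz for compression are my assumptions, so adjust the PID lookup and filenames to your setup:

QEMUPID=$(pgrep -f qemu-system-x86_64)      # or grab the PID from top/ps if several guests are running
strace -ff -o qemu-strace -p "$QEMUPID" &   # writes one qemu-strace.<tid> file per thread
sudo trace-cmd record -e kvm -- sleep 3     # writes trace.dat; rename it aside between runs and repeat 2-3 times
kill %1                                     # detach strace after a few seconds
xz qemu-strace.* trace.dat*                 # compress everything before attaching/sharing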
Just managed to reproduce this, I've been "provoking" it by keeping the guest running across host suspend/resume. But, by the time I realized it was happening and got ready to fire off the trace commands above, the CPU spin stopped. I do see this in the guest's journal though (including one line from before the host was suspended):

Jul 23 19:01:01 wk_vm run-parts[1476]: (/etc/cron.hourly) finished 0anacron
Jul 24 00:38:32 wk_vm kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 24 00:38:32 wk_vm kernel: 0-...: (1 GPs behind) idle=8d4/0/0 softirq=8862/8863 fqs=0
Jul 24 00:38:32 wk_vm kernel: (detected by 1, t=19791210 jiffies, g=10325, c=10324, q=0)
Jul 24 00:38:32 wk_vm kernel: Task dump for CPU 0:
Jul 24 00:38:32 wk_vm kernel: swapper/0 R running task 0 0 0 0x00000000
Jul 24 00:38:32 wk_vm kernel: ffffffff810655b6 0000000000000010 0000000000000246 ffffffff81c03e90
Jul 24 00:38:32 wk_vm kernel: 0000000000000018 ffffffff81c03eb8 ffffffff81038e00 ffffffff81c04000
Jul 24 00:38:32 wk_vm kernel: 0000000000000000 0000000000000000 ffffffff81c03ec8 ffffffff8103960f
Jul 24 00:38:32 wk_vm kernel: Call Trace:
Jul 24 00:38:32 wk_vm kernel: [<ffffffff810655b6>] ? native_safe_halt+0x6/0x10
Jul 24 00:38:32 wk_vm kernel: [<ffffffff81038e00>] ? default_idle+0x20/0xe0
Jul 24 00:38:32 wk_vm kernel: [<ffffffff8103960f>] ? arch_cpu_idle+0xf/0x20
Jul 24 00:38:32 wk_vm kernel: [<ffffffff810eb37a>] ? default_idle_call+0x2a/0x40
Jul 24 00:38:32 wk_vm kernel: [<ffffffff810eb67f>] ? cpu_startup_entry+0x2ef/0x350
Jul 24 00:38:32 wk_vm kernel: [<ffffffff817d3337>] ? rest_init+0x77/0x80
Jul 24 00:38:32 wk_vm kernel: [<ffffffff81d69006>] ? start_kernel+0x453/0x474
Jul 24 00:38:32 wk_vm kernel: [<ffffffff81d68120>] ? early_idt_handler_array+0x120/0x120
Jul 24 00:38:32 wk_vm kernel: [<ffffffff81d68346>] ? x86_64_start_reservations+0x2a/0x2c
Jul 24 00:38:32 wk_vm kernel: [<ffffffff81d68494>] ? x86_64_start_kernel+0x14c/0x16f
Jul 24 00:38:32 wk_vm kernel: rcu_sched kthread starved for 19791210 jiffies! g10325 c10324 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
Jul 24 00:38:32 wk_vm kernel: rcu_sched S ffff880236223d68 0 7 2 0x00000000
Jul 24 00:38:32 wk_vm kernel: ffff880236223d68 00ff880236223d50 ffff8800ba580000 ffff880236213d00
Jul 24 00:38:32 wk_vm kernel: ffff880236224000 ffff880236223da0 ffff88023fd0e0c0 ffff88023fd0e0c0
Jul 24 00:38:32 wk_vm kernel: 0000000000000003 ffff880236223d80 ffffffff817dc6b5 0000000100189add
Jul 24 00:38:32 wk_vm kernel: Call Trace:
Jul 24 00:38:32 wk_vm kernel: [<ffffffff817dc6b5>] schedule+0x35/0x80
Jul 24 00:38:32 wk_vm kernel: [<ffffffff817df543>] schedule_timeout+0x123/0x270
Jul 24 00:38:32 wk_vm kernel: [<ffffffff81112e70>] ? trace_raw_output_tick_stop+0x80/0x80
Jul 24 00:38:32 wk_vm kernel: [<ffffffff8110cd78>] rcu_gp_kthread+0x3b8/0x940
Jul 24 00:38:32 wk_vm kernel: [<ffffffff8110c9c0>] ? rcu_sched_qs.part.54+0x40/0x40
Jul 24 00:38:32 wk_vm kernel: [<ffffffff810c6f68>] kthread+0xd8/0xf0
Jul 24 00:38:32 wk_vm kernel: [<ffffffff817e0982>] ret_from_fork+0x22/0x40
Jul 24 00:38:32 wk_vm kernel: [<ffffffff810c6e90>] ? kthread_worker_fn+0x180/0x180
Jul 24 00:38:32 wk_vm kernel: clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
Jul 24 00:38:32 wk_vm kernel: clocksource: 'kvm-clock' wd_now: 13be7a8fa645 wd_last: 1be7b064317 mask: ffffffffffffffff
Jul 24 00:38:32 wk_vm kernel: clocksource: 'tsc' cs_now: 5a3ba369cee cs_last: 4863aa14c4e mask: ffffffffffffffff
Created attachment 1183500 [details]
strace and trace-cmd output

Reproduced - this time I managed to get the traces.
Created attachment 1189360 [details]
traces after reproducing on fresh boot, no host suspend

This has been happening mostly after host suspend/resume, but I just reproduced it on a fresh host boot without a host suspend - just minutes after the host and the guest were started.
Just a note that this is still present with 2.6.1-1.fc24
Still happening, years later (this first popped up in F21 or so), qemu 2:2.7.1-2.fc25
Dimitris, sorry this has dragged on so long. Hopefully we can get to the bottom of it.

To confirm, you are only seeing this with Linux guests? Can you provide:
- virsh dumpxml $vmname
- ~/.cache/libvirt/qemu/log/$vmname.log

When you reproduce without host suspend/resume, how are you starting and stopping your VMs? Via virsh, virt-manager, or gnome-boxes? It would be interesting to see whether in those cases the VM is actually starting from a fully shut-off state or from a 'saved' state (which is kind of a suspend/resume of the VM), which is what gnome-boxes will generally do.

Given some internal RH discussion, and https://bugs.launchpad.net/qemu/+bug/1174654 , this sounds related to timer stuff, probably qemu -rtc driftfix=slew (libvirt <timer name='rtc' tickpolicy='catchup'/>).

The launchpad bug suggests this libvirt config tweak as a workaround:

<clock ...>
  <timer name='rtc' track='guest'/>
  ...
</clock>

So that's something to try. But the proper fix is either changing virt-manager/boxes defaults, or fixing driftfix=slew to not have the pathological behavior.
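To make that concrete, a minimal sketch of how the edit might look after "virsh edit $vmname" - the offset value is just an example, keep whatever your domain already uses:

<clock offset='utc'>
  <timer name='rtc' track='guest'/>
</clock>

As I understand it, track='guest' maps to qemu's -rtc clock=vm, so the emulated RTC follows guest time rather than host time, which is what should avoid the catch-up behavior after a long host pause.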
(In reply to Paolo Bonzini from comment #8)
> Can you please try grabbing a "strace -ff -p QEMUPID" and two or three
> "sudo trace-cmd record -e kvm -- sleep 3" while the guest is spinning? The
> output may be huge, so compress it and feel free to contact me off Bugzilla
> to share it.

Paolo, see comment #10 and comment #11 - Dimitris provided these logs.
*** Bug 1380893 has been marked as a duplicate of this bug. ***
*** Bug 1393352 has been marked as a duplicate of this bug. ***
I'll try strace next time it happens. When it does, it often comes in "clusters": the CPU pegs, I notice it, shut down the guest, start the guest, put some load on it, and the CPU pegs immediately again. I don't suspend with the guest running any more, but it happens without host suspend/resume often enough that reproducing and grabbing the traces shouldn't take long. I start/stop the guest with virsh start/shutdown.
Created attachment 1256755 [details]
virsh dumpxml output for affected guest
Created attachment 1256757 [details]
log fragment from affected guest
I managed to reproduce this once, but I didn't grab the debug info and other attempts to reproduce have failed. Still trying though
*** Bug 1389226 has been marked as a duplicate of this bug. ***
Created attachment 1260143 [details]
pstack, trace-cmd, strace snippet

Okay, I grabbed some data. Up-to-date f25 host and guest, kernel-4.9.12-200.fc25.x86_64, qemu-system-x86-2.8.0-1.fc26.x86_64.

The laptop was suspended for about 10 hours with the VM running; on laptop resume the VM was spinning at 100%. I only grabbed one trace-cmd capture as I botched the command. Multiple pstack calls all showed the same results. The strace output is just a snippet of what was spamming while the VM was spinning - it was just hitting ppoll with a minor interruption every second or so.
Looks like Cole managed to get some stack traces with f25 packages, clearing NEEDINFO. Please let me know if you'd still like me to try to grab a stack trace.
The trace doesn't show much going on. driftfix=slew cannot really be fixed, it is working as intended (!), OTOH it is only needed for Windows VMs. The workaround is to suspend the VM too before suspending the host.
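For what it's worth, that workaround can be automated with a systemd sleep hook on the host. A rough sketch, assuming a qemu:///system guest named "wk_vm" and a script dropped at /usr/lib/systemd/system-sleep/pause-vms.sh (made executable); a qemu:///session guest would need the virsh calls run as that user instead:

#!/bin/sh
# systemd passes $1 = "pre" before sleep / "post" after resume, $2 = sleep type
case "$1" in
    pre)  virsh suspend wk_vm ;;   # pause the vCPUs before the host sleeps
    post) virsh resume wk_vm ;;    # un-pause them after the host wakes up
esac
exit 0

Note this only pauses the vCPUs (virsh suspend/resume); it doesn't save and restore guest RAM the way managedsave does.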
Hm, this happened again even though I'm running the guest with:

<clock offset='utc'>
  <timer name='rtc' track='guest'/>
</clock>

qemu: -rtc base=utc,clock=vm

Should I be using some other option to work around this? Suspending the VM has its own issues, due to virtio drivers last time I checked.
Tried to run pstack on the qemu process, but as often happens in that case, it somehow ended up driving my **host** desktop (GNOME/Wayland, though this happened before under F24/X too) to the point where the only way out was a messy Ctrl-Alt-Del. VTs were not available. This happens often if I don't stop the guest quickly after it starts spinning its qemu process.
@Paolo, then how would you explain that usually, after a night of being suspended, there is no issue after wake-up, but sometimes, when I just move to another location with only half an hour of sleep time, the VM may be unresponsive for several minutes? The only difference I can think of is that overnight I do not undock the machine, while when moving, the machine is undocked and usually power is not connected when waking up.
@Paolo also to be clear, this doesn't require a host suspend to occur. The key "requirement" is the guest being loaded (both its virtual cores at 100%) for a minute or two. It's after that that the host stays spinning, even after the guest load has dropped.
> @Paolo also to be clear, this doesn't require a host suspend to occur. The
> key "requirement" is the guest being loaded (both its virtual cores at
> 100%) for a minute or two. It's after that that the host stays spinning,
> even after the guest load has dropped.

Ok, this is different and should not happen.
Created attachment 1263182 [details]
Fresh pstack output in current (2.7.1-2.fc25) version

As of yesterday or so this is happening virtually constantly. Every 2-3 times I put load on the guest, it starts spinning. The "good" news is that I got a pstack trace without locking up the desktop like before.
In case this matters: I get these all the time with this guest:

[ 2544.011413] kvm [5588]: vcpu0, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x34
[ 2545.403615] kvm [5588]: vcpu0, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x606
[ 2545.421787] kvm [5588]: vcpu1, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x606
[ 2545.517298] kvm [5588]: vcpu0, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x611
[ 2545.517304] kvm [5588]: vcpu0, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x639
[ 2545.517306] kvm [5588]: vcpu0, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x641
[ 2545.517308] kvm [5588]: vcpu0, guest rIP: 0xffffffffa305e7e2 unhandled rdmsr: 0x619
One more possible datapoint: when this happens, in terms of reported CPU load I see:
- top/htop in guest: idle
- CPU in the VM "Details" view in virt-manager: one virtual core 100% busy
- Host: one core 100% busy

This is getting quite desperate at this point - reaching "forced to get a macbook" level in terms of workflow disruption.
I kinda wonder if this is two different issues... whatever you are hitting independent of suspend/resume, Dimitris, and the suspend/resume issue.

Some ideas to try:
- virsh edit $vmname, remove the whole <clock> block, see if it still reproduces
- if so, switch <cpu mode='host-model'/> to <cpu mode='host-passthrough'/> (see the sketch below)
- if it still reproduces, drop the <cpu> block entirely

Are the Windows virtio drivers up to date?
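To make the second idea concrete, a minimal sketch of what the <cpu> element might look like before and after - the "before" block is just a typical host-model definition, not necessarily your exact XML:

<!-- before: libvirt picks a named model matching the host -->
<cpu mode='host-model'>
  <model fallback='allow'/>
</cpu>

<!-- after: expose the host CPU directly to the guest -->
<cpu mode='host-passthrough'/>

Edit it with "virsh edit $vmname"; the change takes effect the next time the guest is started.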
I'll try these suggestions today (have already tried messing with the CPU model but not these options). BTW the guest is not Windows, it's F24 (also have a Windows guest, haven't seen this problem there). Have the virtio drivers in F24 changed recently (4.8/4.9 kernels)?
FYI, in my recent experience (last 2 months) I think I've seen the issue only when waking up without power connected to the laptop. At least I can't remember seeing the VM spinning when I open the laptop lid in the morning. So my current suspect is something related to power (at least for what I'm seeing). But I'm still on an F24 host.
(In reply to Dimitris from comment #36)
> I'll try these suggestions today (have already tried messing with the CPU
> model but not these options).
>
> BTW the guest is not Windows, it's F24 (also have a Windows guest, haven't
> seen this problem there). Have the virtio drivers in F24 changed recently
> (4.8/4.9 kernels)?

Sorry, I thought this was a Windows VM, so ignore the virtio driver comment.

There have been a lot of comments in this bug... Dimitris, can you give an update on the current state of when this reproduces? For you it's completely independent of host power/suspend/resume? Was this originally a host suspend/resume issue that morphed into the current state? If so, any idea when that changed? (host update, guest update, etc.)
(In reply to Aleksandar Kostadinov from comment #37)
> FYI, in my recent experience (last 2 months) I think I've seen the issue
> only when waking up without power to the laptop. At least I can't remember
> seeing the VM spinning when I open the laptop lid in the morning. So my
> current suspect is something related to power (at least for what I'm
> seeing). But I'm still on an f24 host.

Aleksandar, I reopened your bug 1380893 - let's use that one to track the CPU spin on host suspend/resume, since Dimitris' issue has a different pattern... it may very well be the same issue, but it's getting a little confusing.
No worries, it's a long thread by now.

Initially (since F21 or thereabouts, with F21 host and guest):
- Virtually never saw this without host suspend/resume involved.
- Repro was: guest running, run then stop app workload in guest, suspend/resume host; the qemu process then spins one host CPU and often becomes unresponsive.
- Briefly (roughly a year ago, for a short period) this seems to have gone away completely. Host was F23, guest was either F22 or F23.
- Then reversion to the original behavior, through F24 and F25. As of this point I consistently avoided suspending with the guest running - I'd shut it down first. But also, as of this point, I would very occasionally still see this without suspend: run app workload in guest, guest spins one host CPU.
- As of, very roughly, a week ago, this went from "very occasionally" to "almost every time I start the app workload in the guest and load a couple of pages".

The app workload consists of a mixed application stack (ruby/rails server, Java, node processes running) and some heavy initial CPU load as things initialize. Other tasks that keep the guest busy (dnf updates, compiles due to e.g. ruby gem installs) don't do this. So it doesn't seem to be a *CPU load* trigger, but something about the specific workload.
Thanks for the info, and for your patience... Any host or guest updates that timewise coincide with the increased frequency of the hang?

I wonder if the workload thing is related to network type=user, particularly if your workload is doing a bunch of gem installs from the network. With qemu:///session you can do this:

<interface type='bridge'>
  <source bridge='virbr0'/>
</interface>

which gives access to libvirt's 'default' network that qemu:///system uses by default. Your -redir args will stop working, but you can communicate with the guest IP like normal from the host, so they may not be required - depends on your setup.
@Cole I tried the steps from comment #35, no change: Starting and (minimally) exercising the app stack in the guest, via the userspace networking connection (using the redirs) results in the same behavior.
I haven't tried the default network yet - I vaguely remember removing virbr0 for some reason some time ago - it doesn't show up in e.g. "ip link". Not sure how to restore it... But anyway, I switched to using a browser inside the VM (viewing via the virt-manager viewer) to drive the app stack, and I can't reproduce this (tried several cycles, for 20-30 minutes now). So that usermode networking looks suspicious. The strange thing is that I'm still using that same usermode network to ssh into the VM, and even to sshfs-mount a guest directory on the host. That doesn't seem to trigger the problem.
Typical systems get virbr0 from the libvirt-daemon-config-network package. This adds a libvirt virtual NAT network which you can manage on the qemu:///system connection like:

sudo virsh net-list --all
sudo virsh net-start default
sudo virsh net-destroy default
sudo virsh net-autostart default

Installing that package should set it to 'autostart' by default, which means it will be available on host boot. Depending on how you removed the network definition, you might have to recreate it manually. The XML is:

<network>
  <name>default</name>
  <bridge name="virbr0"/>
  <forward/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254"/>
    </dhcp>
  </ip>
</network>

So just stuff that in a foo.xml and:

sudo virsh net-define foo.xml
sudo virsh net-start default
sudo virsh net-autostart default
Thanks Cole, I'll try to set up that bridge when I get a quiet moment, as I'd like to restore access to the app through my host-side browsers - so much easier to navigate. Having said that, I've been (ab)using the guest throughout the day, including suspending the host with the guest running. No issues so far. Let's please leave this issue open though since the root cause hasn't been fixed. I intend to switch to usermode net again over the weekend and try to capture some better traces, with debug info included.
Created attachment 1264453 [details]
pstack, strace, trace-cmd output when spinning

These were captured at this point:
- Start VM
- Start app stack
- Load a page over the user-mode connection
- qemu process (8401) is spinning one host core
Created attachment 1264454 [details]
Current dumpxml output for the guest
Thanks for the info. If this latest hang-under-load is usermode-network related, I'm not sure the trace-cmd stuff is useful (I'm not really sure how to read it anyway). So far, does it appear that type=user networking is indeed the culprit, or has it reproduced with bridge networking as well? Does the VM fully lock up for this hang, or does it recover after a while? Also, if possible it'd be interesting if you could come up with a generic reproducer that I can try out; then I can try to bisect it and poke at it myself.
Dimitris are you still seeing this? If so please provide the info requested in comment #48
Sorry, I've procrastinated on a reproducer. I haven't seen this since I switched to using a browser running on the guest to access the server stack also running in the guest. Networking is still usermode, but other than a few ssh sessions all traffic is initiated outbound from the guest, and so far I haven't seen it. I'll try to set up a reproducing guest, something with a server on it that generates some traffic - perhaps a Fedora server with Cockpit - over the weekend.
OK, I managed to find some time and applied a little thinking. I can reproduce this just by running a tinyproxy server in the guest, redirecting from the host with:

<qemu:commandline>
  <qemu:arg value='-redir'/>
  <qemu:arg value='tcp:8888::8888'/>
</qemu:commandline>

or similar (modulo firewall settings), then get a browser on the host to use that proxy as a SOCKS proxy. Load news.google.com, guest pegs one core as before.
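If it helps to reproduce without a browser, a rough sketch of driving traffic through that forward from the host with curl - this assumes tinyproxy is answering as a plain HTTP proxy on guest port 8888, and the URL is just an example:

# hammer the user-mode forward with a few proxied page loads
for i in $(seq 1 20); do
    curl -s -x http://127.0.0.1:8888 http://news.google.com/ > /dev/null
done

In my case it was a host browser pointed at 127.0.0.1:8888, but any client pushing a burst of traffic through the redirected port should exercise the same user-mode (slirp) path.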
Okay, I gave it a spin but I'm not sure I'm doing things correctly. F26 host and F26 guest.

* Install tinyproxy in the guest, add "Allow 10.0.2.2" to tinyproxy.conf, start it
* Set firefox to SOCKS host=127.0.0.1, port 8888, SOCKS v5, try to connect to a webpage
* Watch tinyproxy.log in the guest

I see confirmation that my host is connecting (CONNECT lines) but the page load never even starts, firefox just waits on the network. qemu doesn't spin either, though.

Am I missing a tinyproxy config step?
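For reference, the guest-side config on my end boils down to roughly these tinyproxy.conf lines (defaults plus the Allow line - sketching from memory, so treat it as approximate):

Port 8888            # tinyproxy's default port, matches the -redir forward
Allow 10.0.2.2       # the host as seen from qemu's user-mode network

(Or perhaps the problem is on the browser side - tinyproxy is an HTTP proxy, so pointing firefox at it as SOCKS v5 rather than as an HTTP proxy might stall exactly like this. Not sure.)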
Not sure, other than my host and guest still being on F25. I'm planning on doing the upgrade in a few days, I'll try to reproduce and post here.
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.