Bug 1380893 - VM hangs after wake-up from suspend to ram
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 25
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1165352 1178533 1221518 1233568 1389226 1393352
Depends On:
Blocks:
Reported: 2016-10-01 06:56 UTC by Aleksandar Kostadinov
Modified: 2017-09-15 17:03 UTC (History)

Fixed In Version:
Clone Of: 1221518
Environment:
Last Closed: 2017-09-15 17:03:47 UTC
Type: Bug
Embargoed:


Attachments
pstack from qemu VM process (1.57 KB, text/plain)
2016-10-01 06:56 UTC, Aleksandar Kostadinov
the Fedora 24 VM XML dump (4.15 KB, text/plain)
2016-10-01 07:04 UTC, Aleksandar Kostadinov


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1174654 0 None None None 2016-10-24 06:45:22 UTC

Description Aleksandar Kostadinov 2016-10-01 06:56:35 UTC
Created attachment 1206476 [details]
pstack from qemu VM process

Recently I started seeing an old bug again that had been fine for a while. My host and guest are now both Fedora 24.

Version-Release number of selected component (if applicable):
qemu-kvm-2.6.0-5.fc24.x86_64
Linux 4.6.4-301.fc24.x86_64

How reproducible:
Hard to reproduce; the machine usually has to be suspended for several hours.

Attaching pstack.

+++ This bug was initially created as a clone of Bug #1221518 +++

Description of problem:
Sometimes my VMs hang with a strange CPU usage pattern after the host wakes up from suspend. I'm running Fedora 20 and Fedora 21 VMs.

Sometimes even the Virtual Machine Manager GUI becomes unresponsive when trying to work with such a VM. Strangely, though, the machine sometimes recovers by itself after a while.
On this occasion both machines returned to normal after calling `pstack` on them.
<...>

Comment 1 Aleksandar Kostadinov 2016-10-01 07:02:02 UTC
Created attachment 1206477 [details]
qemu log for the Fedora 24 VM

Comment 2 Aleksandar Kostadinov 2016-10-01 07:04:53 UTC
Created attachment 1206478 [details]
the Fedora 24 VM XML dump

Comment 3 Aleksandar Kostadinov 2016-10-01 07:13:27 UTC
While filing the bug report the VM recovered from that high CPU usage.
The strange thing is that I don't see any difference in the pstack output or the QEMU log between the period of high usage and after the machine has recovered.

I forgot to mention earlier that network access to the VM is also lost during the high CPU usage. If you have other ideas on how to debug this, please let me know. Perhaps I should look at network stats during the high CPU usage period.

Comment 6 Dimitris 2016-10-01 23:48:37 UTC
See bug 1352992 (duplicate?).  It just happened again, no suspend involved - the host and guest were fresh boots.

In my case, and after this regression with F24 (it had been OK for a few months under F23), when this happens it's almost 100% when I start up the work-related Rails stack in the guest.  That stack has a rather CPU-heavy initialization workload, but it normally (and usually) is done with this "legitimate" CPU peg after a few seconds.

Comment 7 Dimitris 2016-10-01 23:49:53 UTC
In the minority of occurrences I also lose network access to the VM (usermode networking, qemu:///session user-run VM).

Comment 8 Dimitris 2017-01-09 20:53:33 UTC
Still happens to me after upgrading to F25, qemu 2:2.7.0-8.fc25 on x86_64

Comment 9 Cole Robinson 2017-02-23 01:52:06 UTC
Duping to 1352992 since it sounds like they probably have the same root issue. Let's follow up there.

*** This bug has been marked as a duplicate of bug 1352992 ***

Comment 10 Cole Robinson 2017-03-16 15:31:53 UTC
The duped bug has a different reproducing pattern, so reopening this one to track VM spin after host suspend/resume.

Comment 11 Cole Robinson 2017-03-16 15:32:51 UTC
*** Bug 1389226 has been marked as a duplicate of this bug. ***

Comment 12 Cole Robinson 2017-03-16 15:34:09 UTC
*** Bug 1393352 has been marked as a duplicate of this bug. ***

Comment 13 Cole Robinson 2017-03-16 15:38:30 UTC
*** Bug 1233568 has been marked as a duplicate of this bug. ***

Comment 14 Cole Robinson 2017-03-16 15:43:12 UTC
*** Bug 1165352 has been marked as a duplicate of this bug. ***

Comment 15 Cole Robinson 2017-03-16 15:43:15 UTC
*** Bug 1178533 has been marked as a duplicate of this bug. ***

Comment 16 Cole Robinson 2017-03-16 15:43:19 UTC
*** Bug 1221518 has been marked as a duplicate of this bug. ***

Comment 18 Cole Robinson 2017-03-16 16:28:14 UTC
Aleksandar, can you try these config changes:

* clear the <clock> xml. fully stop the VM, then do: sudo virt-xml fedora_work --edit --confirm --clock clearxml=yes
* if the issue still reproduces, fully stop the VM, then clear the <cpu> XML: sudo virt-xml fedora_work --edit --confirm --cpu clearxml=yes

if the issue still reproduces, report here and we can try some more. Please try to eliminate any other variables, like other VMs running, or any additional VM config changes.
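For anyone following along, the two steps above could be wrapped roughly like this. It is only a sketch: the `run` and `clear_section` helpers and the `DRY_RUN` variable are made up for illustration, the VM name `fedora_work` is the one from this comment, and `virsh shutdown` only requests a shutdown, so check with `virsh domstate` that the VM is really off before editing. By default it just prints what it would do.

```shell
#!/bin/sh
# Hypothetical helper, not part of virt-xml or libvirt.
# DRY_RUN defaults to 1 so nothing is changed unless explicitly requested.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# clear_section VM SECTION   (SECTION is "clock" or "cpu")
clear_section() {
    vm="$1"
    section="$2"
    case "$section" in
        clock|cpu) ;;
        *) echo "unsupported section: $section" >&2; return 1 ;;
    esac
    # Ask the guest to shut down; this does not wait for it to finish.
    run sudo virsh shutdown "$vm"
    # Same virt-xml invocation as in the steps above.
    run sudo virt-xml "$vm" --edit --confirm "--$section" clearxml=yes
}

# Step 1: clear <clock>. Only try "cpu" if the issue still reproduces.
clear_section fedora_work clock
```

Run it with `DRY_RUN=0` only after reading the printed commands, and change one section at a time so it stays clear which change made a difference.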

thanks for your patience, I realize this has been lingering for too long...

Comment 19 Cole Robinson 2017-03-17 18:22:25 UTC
So, I had this (or something like it) 100% reproducible and narrowed it down to kvmclock. Then I updated my f25 guest and now it's not reproducing :( Went from kernel-4.8.6 to kernel-4.9.14 in the guest.

So, question for other users that are still hitting this: what host and guest are you reproducing this with? Can anyone reproduce with an up to date f25 guest?



===

My reproducing steps for posterity:

Setup:
* Up to date f25 host (kernel-4.9.14-200.fc25.x86_64)
* Out of date f25 guest (kernel-4.8.6-300.fc25.x86_64). VM installed via virt-manager but with the VM config pared down to only:

/usr/bin/qemu-kvm \
  -no-user-config \
  -nodefaults \
  -cpu qemu64 \
  -m 4096 \
  -smp 4 \
  -drive file=/var/lib/libvirt/images/fedora25.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
  -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0 \
  -vga std -usb -usbdevice tablet \
  -monitor vc \
  -display sdl


Steps:
* Launch the VM, log into standard gnome-shell desktop
* Close the laptop lid
* Set a timer for 20 minutes (10 and 15 minutes _don't_ reproduce; 20, 30, 60 and 120 all reproduce the issue)
* After timer is up, open laptop, log in to host.
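If closing the lid and watching a timer is inconvenient, a timed suspend might be scripted with `rtcwake` from util-linux. This is an untested assumption on my part: a suspend started by `rtcwake` may not behave exactly like a lid-close suspend, and `suspend_cmd` is just an illustrative helper that prints the command rather than running it.

```shell
#!/bin/sh
# suspend_cmd MINUTES
# Prints an rtcwake command that suspends to RAM and wakes after MINUTES.
suspend_cmd() {
    echo "rtcwake -m mem -s $(( $1 * 60 ))"
}

# 20 minutes is the shortest suspend that reproduced the issue here.
suspend_cmd 20   # prints: rtcwake -m mem -s 1200
```

The printed command needs root to actually suspend the host.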

VM is frozen and unresponsive to any UI interaction. Top shows between 100% and 300% CPU spinning for qemu-system-x86. pstack is always completely uneventful, just showing CPU threads and main loop threads. The VM doesn't recover quickly, I waited 5 minutes once and it was still spinning before I gave up and killed it.
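A rough way to spot the spinning process from a script instead of watching top is to filter a "PID %CPU COMMAND" listing. The `filter_spinning` helper and the sample numbers below are made up for illustration; only the process name `qemu-system-x86` comes from the observation above.

```shell
#!/bin/sh
# filter_spinning THRESHOLD
# Reads "PID %CPU COMMAND" lines on stdin and prints those whose command
# starts with "qemu-system" and whose %CPU is at or above THRESHOLD.
filter_spinning() {
    awk -v limit="$1" \
        '$3 ~ /^qemu-system/ && $2 + 0 >= limit + 0 { print $1, $2, $3 }'
}

# Sample input (invented values) standing in for real process data.
printf '%s\n' \
    '1234 287.0 qemu-system-x86' \
    '2345 3.1 gnome-shell' |
    filter_spinning 100   # prints: 1234 287.0 qemu-system-x86
```

On a real host the input could come from something like `ps -eo pid=,pcpu=,comm=`, with the caveat that procps `ps` reports CPU usage averaged over the process lifetime, so `top -b -n 1` is closer to what is described here.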

Config variations that made no difference: disabled s3/s4, -cpu host and -cpu Broadwell, default virt-manager timer settings, -rtc clock=guest, all the default virt-manager devices like spice, agent channels, network devices, reproduces with sdl and gtk and spice UI.

The only thing that avoided the issue was -cpu qemu64,-kvmclock. I never managed to make it reproduce with that setup.
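To confirm from inside a Linux guest whether it is actually running on kvmclock, the kernel's current clocksource can be read from sysfs. A minimal sketch, assuming the standard sysfs path; the function names are mine, and the optional file argument exists only so the check can be tried against a sample file:

```shell
#!/bin/sh
# guest_clocksource [FILE]
# Prints the clocksource currently selected by the guest kernel.
guest_clocksource() {
    cat "${1:-/sys/devices/system/clocksource/clocksource0/current_clocksource}" 2>/dev/null
}

# uses_kvmclock [FILE]
# Succeeds if the selected clocksource is kvm-clock.
uses_kvmclock() {
    [ "$(guest_clocksource "$@")" = "kvm-clock" ]
}

if uses_kvmclock; then
    echo "guest is using kvm-clock"
else
    echo "guest is not using kvm-clock"
fi
```

With -cpu qemu64,-kvmclock the guest should report some other clocksource here, which is a quick sanity check that the workaround took effect.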

Comment 20 Cole Robinson 2017-03-21 18:46:43 UTC
RHEL/Centos 7 VMs are affected as well, so I filed a bug for that:
https://bugzilla.redhat.com/show_bug.cgi?id=1434566

Comment 21 Cole Robinson 2017-05-17 16:50:25 UTC
Been silent for a couple months. So:

- Anyone still hitting this? If so please report host/guest distro and kernel
- Anyone that was previously hitting this _not_ hitting it anymore? Guest kernel update fixed it for me but I want to be sure it's the same for others too
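To make reports easy to compare, something like the following could be run once on the host and once in the guest. It is just a convenience sketch: `describe_system` is a made-up helper, and it assumes the distro provides `/etc/os-release` with a `PRETTY_NAME` field.

```shell
#!/bin/sh
# describe_system LABEL [OS_RELEASE_FILE]
# Prints one line of the form: LABEL=<distro name> (Linux <kernel release>)
describe_system() {
    label="$1"
    os_release="${2:-/etc/os-release}"
    distro="unknown"
    if [ -r "$os_release" ]; then
        # Source the file in a subshell so its variables do not leak out.
        distro="$(. "$os_release" && printf '%s' "${PRETTY_NAME:-unknown}")"
    fi
    printf '%s=%s (Linux %s)\n' "$label" "$distro" "$(uname -r)"
}

# Use HOST on the host and GUEST inside the VM.
describe_system HOST
```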

Comment 22 Dimitris 2017-05-17 16:55:19 UTC
I haven't hit this for a long while, definitely not since I implemented the workaround (not using usermode networking too much) for bug 1352992.  I've been suspending with the Fedora guest running several times a day for months.

Comment 23 Cole Robinson 2017-07-11 13:25:28 UTC
Okay given lack of confirmation that this is still an issue, I think it's safe to assume that latest kernels fix this, so closing

Comment 24 mario.mendoza 2017-07-30 22:11:23 UTC
I'm experiencing the same problem. The guest hangs with high CPU usage after a suspend. Open connections stop working and it is impossible to open new ones.

HOST=Fedora 26 (Linux version 4.11.11-300.fc26.x86_64)
GUEST=RHEL 7.3-36 (Linux version 3.10.0-514.26.2.el7.x86_64)

Comment 25 Terry Wilson 2017-09-15 16:24:37 UTC
I'm also still experiencing this.

HOST=Fedora 25 - 4.10.15-200.fc25.x86_64
GUEST=Centos 7 - 3.10.0-693.2.2.el7.x86_64

I'd say ~80% of the time I suspend, I get 100% guest CPU usage. No noticeable difference whether I pause the VM before closing laptop.

Comment 26 Cole Robinson 2017-09-15 17:03:47 UTC
This is a guest kernel bug, so for centos/rhel guests you should follow https://bugzilla.redhat.com/show_bug.cgi?id=1434566

Reclosing since this is fixed for Fedora guests.

