Bug 747464

Summary: VMs occasionally become unresponsive and it is impossible to type into other applications until virt-manager is killed
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: spice-gtk
Assignee: Alon Levy <alevy>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 16
CC: bfay, cfergeau, crobinso, dblechte, ipilcher, jp, madko, marcandre.lureau, martin, mishu
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: xorg-x11-drv-qxl-0.0.22-0.fc17
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 794658
Environment:
Last Closed: 2012-03-27 15:54:20 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 794658    
Attachments:
Description Flags
backtrace from virt-manager while it's hung
none
qemu backtrace
none
backtrace from qemu when hung, with recent F17 as host+guest
none

Description Adam Williamson 2011-10-19 22:08:27 UTC
For the last few months on F16 I've been dealing with this frustrating bug. The nature of the VM doesn't seem to matter: I've had this happen on F16, F15 and even Windows VMs.

What happens is that the view of the VM - and all of virt-manager - will become unresponsive, essentially hung. The VM itself is running fine in the background. What's especially odd is that it becomes impossible to type into any other running application while this is happening: I can switch between and manipulate other apps with the mouse, but I cannot type into them.

If I switch to a VT and kill virt-manager it clears things: I can now type into apps on the desktop again, and I can re-run virt-manager and pick up where I left off with the VM. But it's a frustrating bug.

I haven't found a 100% reproducer of the bug yet, but one place where it does seem to trigger quite often, for whatever reason, is at anaconda's timezone selection screen - I click my city (Vancouver), and the hang happens.

Comment 1 Marc-Andre Lureau 2011-10-19 22:36:15 UTC
Most probably a deadlock between spice-gtk, libvirt and qemu, caused by qemu/qxl doing some I/O synchronously. A nasty kind of bug that involves several components.

You need a fairly recent version of spice & qemu (I think the one in f16 is recent enough) but the xorg-qxl driver fix is not yet committed, afaik.

See also:
https://bugzilla.redhat.com/show_bug.cgi?id=700134
https://bugs.freedesktop.org/show_bug.cgi?id=41622

Alon, care to share the status of the various components in f16?

Comment 2 Marc-Andre Lureau 2011-10-19 22:37:41 UTC
Adam, please get a backtrace when the hang happens, to be sure it's the same bug. Thanks.

Comment 3 Adam Williamson 2011-10-19 23:02:28 UTC
okay, I'll try.

[adamw@adam grub2 (f16)]$ rpm -q spice-gtk libvirt qemu-kvm xorg-x11-drv-qxl
spice-gtk-0.7.39-1.fc16.x86_64
libvirt-0.9.6-2.fc16.x86_64
qemu-kvm-0.15.0-5.fc16.x86_64
xorg-x11-drv-qxl-0.0.21-5.fc16.x86_64

Comment 4 Adam Williamson 2011-10-19 23:03:49 UTC
note that when I talk about 'other apps' I am of course talking about other apps running *on the host* besides virt-manager, in case that wasn't clear.

Comment 5 Adam Williamson 2011-10-20 01:54:56 UTC
wasn't entirely sure what you want a backtrace from, but here's the one from virt-manager. is it any use? virt-manager is python, right?

Comment 6 Adam Williamson 2011-10-20 01:56:30 UTC
Created attachment 529158 [details]
backtrace from virt-manager while it's hung

Comment 7 Marc-Andre Lureau 2011-10-20 10:23:33 UTC
(In reply to comment #5)
> wasn't entirely sure what you want a backtrace from, but here's the one from
> virt-manager. is it any use? virt-manager is python, right?

try the one from qemu while the client hangs. Thanks

Comment 8 Adam Williamson 2011-10-20 18:38:00 UTC
crap. I got the qemu one first then decided you probably wanted the virt-manager one. sigh =) okay, will get qemu next time it happens.

Comment 9 Marc-Andre Lureau 2011-10-20 20:52:20 UTC
(In reply to comment #8)
> crap. I got the qemu one first then decided you probably wanted the
> virt-manager one. sigh =) okay, will get qemu next time it happens.

Sorry, and btw, we need all threads :)
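
For anyone collecting the same data: one common way to get a backtrace of every thread from a running process is gdb's batch mode with 'thread apply all bt'. The sketch below is a minimal Python wrapper around that, assuming gdb and the relevant debuginfo packages are installed. The function names and the output file name are made up for illustration and are not part of any tool mentioned in this bug.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that prints a backtrace of every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid, outfile):
    """Attach to `pid`, write all-thread backtraces to `outfile`, then detach.

    Attaching usually requires root or the user that owns the process.
    Returns gdb's exit status.
    """
    with open(outfile, "w") as fh:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            stdout=fh,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

Something like capture_backtrace(qemu_pid, "qemu-backtrace.txt"), run while the client is hung, should give a file that can be attached here; qemu_pid is whatever pid 'ps' reports for the qemu-kvm process.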

Comment 10 Adam Williamson 2011-10-24 19:06:34 UTC
Created attachment 529945 [details]
qemu backtrace

here's the backtrace from qemu while the hang is happening

Comment 11 Alon Levy 2011-10-25 10:31:55 UTC
Thanks Adam. The hang is on an I/O write (update area) from the VM, which is waiting on a read from the spice server, which in turn is waiting on a read that strangely appears in libpthread. This may be the missing async support in the X qxl driver. Marc-Andre, have you done a build with the patch from Gerd that I sent you? If not, I can do one.

Alon

Comment 12 Adam Williamson 2012-01-24 02:04:11 UTC
I'm still hitting this in current Rawhide. Any updates?



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 13 Cole Robinson 2012-02-14 12:15:12 UTC
There's an f15 bug with multiple stack traces attached:

https://bugzilla.redhat.com/show_bug.cgi?id=768404

Though at least my report in that bug was from an f16. Any word on this? I can reproduce fairly regularly if more info is needed.

Comment 14 Adam Williamson 2012-02-14 17:25:25 UTC
I've heard nothing further.

You can use 'spicec' as a kind of workaround: use virt-manager to launch the VM, then run spicec and connect to localhost, port 5900. virt-manager bugs out so often it's kind of unusable at present.
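
The workaround above can be scripted roughly as follows. This is only a sketch: it assumes the guest's SPICE server listens on localhost:5900 as described, and that spicec accepts -h for the host and -p for the port. The helper names are hypothetical.

```python
import socket


def spice_port_open(host="localhost", port=5900, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def spicec_command(host="localhost", port=5900):
    """Build the spicec invocation for the given SPICE endpoint."""
    return ["spicec", "-h", host, "-p", str(port)]
```

Checking the port first simply avoids launching spicec before virt-manager has actually started the VM.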

Comment 15 Alon Levy 2012-02-14 17:56:04 UTC
Adam,

 Sorry for the long delay.

 Can you please install spice-server debug symbols and reproduce the qemu backtrace?

 Can you update the qxl driver in the F16 VM (not sure it's in F15) to 0.0.21-13 (that's when Marc-Andre added the async patches) and retest?

Thanks for the patience,
Alon

Comment 16 Alon Levy 2012-02-15 08:30:38 UTC
Adam, I missed the "async = QXL_SYNC" in the qemu stack trace, so even if you don't provide an updated stack trace I'm sure this is the result of using a driver that is too old, or of qemu otherwise telling the guest that it is using a device that is too old. So:

 1. update the driver as noted in comment 15 to xorg-x11-drv-qxl-0.0.21-13.fc17 or newer
 2. please provide the qemu command line, although as long as there is no "revision=1" property for the qxl-vga device, it should be fine.
 3. tell me if it still reproduces. If so please have the spice-server debug symbols installed.
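
The check in item 2 can be done by eye, but a short script makes it less error-prone on a command line as long as the ones libvirt generates. The following is a minimal sketch, not a complete qemu option parser: it only looks at -M, -spice, and a revision property set on qxl-vga through -global or -device. The function names are invented for this example.

```python
import re
import shlex


def inspect_qemu_cmdline(cmdline):
    """Extract the machine type, SPICE options and any explicit qxl-vga revision."""
    argv = shlex.split(cmdline)
    info = {"machine": None, "spice": {}, "qxl_revision": None}
    # Walk over (option, value) pairs; options without a value are ignored.
    for flag, value in zip(argv, argv[1:]):
        if flag == "-M":
            info["machine"] = value
        elif flag == "-spice":
            for item in value.split(","):
                key, _, val = item.partition("=")
                info["spice"][key] = val
        elif flag in ("-global", "-device"):
            # Matches "-global qxl-vga.revision=N" and "-device qxl-vga,...,revision=N".
            match = re.match(r"qxl-vga\b.*?[.,]revision=(\d+)", value)
            if match:
                info["qxl_revision"] = int(match.group(1))
    return info


def machine_version(machine):
    """Turn a machine type such as "pc-0.14" into (0, 14); None if unrecognised."""
    match = re.fullmatch(r"pc-(\d+)\.(\d+)", machine or "")
    return tuple(int(g) for g in match.groups()) if match else None
```

A qxl_revision of None means no revision property was given, which is the case item 2 describes as fine.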

Thanks,
Alon

Comment 17 Adam Williamson 2012-02-15 17:31:21 UTC
well, 'too old' seems...well, it's one way of putting it. I've been experiencing this bug since F16. If there's a fix for the bug in the latest bleeding-edge qxl or whatever, you should probably backport it to F16, since that's our actual current release and all.

Comment 18 Adam Williamson 2012-02-16 22:44:28 UTC
okay, I did hit this yesterday performing an install of F17 Alpha RC2, which has xorg-x11-drv-qxl-0.0.21-16.fc17 . The qemu command line is:

qemu     28580 23.5  6.2 6717988 1033380 ?     Sl   14:40   0:56 /usr/bin/qemu-kvm -S -M pc-0.14 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name Test_1 -uuid 1b76a7fb-b6a4-251e-f415-e2a5ff08404b -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/Test_1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/media/Sea500/images/Fedora-17-Alpha-x86_64-DVD.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -drive file=/media/Sea500/images/Test_1.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:03:ad:59,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7

I'll try and reproduce again and get a new trace.

Comment 19 Alon Levy 2012-02-17 07:04:27 UTC
Hi Adam.

 could you try http://people.freedesktop.org/~alon/qemu-1.0-7.fc18.src.rpm
 I think it should fix your hangs. (I'll attach a binary rpm if I manage to build it, it failed during tests for some reason, and I don't have permissions on the qemu package to do a scratch build).

Alon

Comment 20 Adam Williamson 2012-02-17 19:36:58 UTC
Nope, I can still reproduce even with that qemu, sorry.

Comment 21 Cole Robinson 2012-03-04 20:29:51 UTC
Okay, I came up with a 100% reliable reproducer by patching virt-manager to be a bit pathological.

F16 host, F16 guest. virt-manager git. patch:

diff --git a/src/virtManager/engine.py b/src/virtManager/engine.py
index 5bd12d2..820dcd0 100644
--- a/src/virtManager/engine.py
+++ b/src/virtManager/engine.py
@@ -1001,6 +1001,15 @@ class vmmEngine(vmmGObject):
             return
 
         logging.debug("Pausing vm '%s'", vm.get_name())
+        v = vm._backend
+        def func(idx):
+            print "lookup attempt", idx, v.info()
+            idx += 1
+            if idx < 10000:
+                gobject.idle_add(func, idx)
+
+        gobject.idle_add(func, 0)
+        return
         vmmAsyncJob.simple_async_noshow(vm.suspend, [], src,
                                         _("Error pausing domain"))


Makes it so whenever you click the pause button, it kicks off an idle loop to hit virDomainGetInfo, which hits the qemu monitor.

Run virt-manager --debug. Open the f16 guest, log in to gnome fallback desktop. Open a terminal. Hit the pause button to kick off the loop. In the guest, open glxgears from the terminal. If it doesn't hang, keep closing and reopening glxgears. It usually hangs for me the first time. Causes the weird host X lockup that Adam reported.

The good news is that xorg-x11-drv-qxl-0.0.21-13.fc16 from updates-testing fixes it for me, after installing in the guest and rebooting. I gave the build karma so it should be heading to stable soon.
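
For readers not familiar with the idle-callback pattern in the patch above, here is a small self-contained model of it. It does not use gobject or libvirt: FakeMainLoop and FakeBackend are stand-ins invented for this sketch, with idle_add treated as a one-shot callback queue, which is how the patch uses it (the callback re-adds itself each time).

```python
from collections import deque


class FakeMainLoop:
    """Stand-in for the GLib main loop: runs queued idle callbacks in order."""

    def __init__(self):
        self._idle = deque()

    def idle_add(self, func, *args):
        self._idle.append((func, args))

    def run(self):
        while self._idle:
            func, args = self._idle.popleft()
            func(*args)


class FakeBackend:
    """Stand-in for the libvirt domain object; counts info() calls."""

    def __init__(self):
        self.calls = 0

    def info(self):
        self.calls += 1
        return "running"


def start_polling(loop, backend, limit):
    """Queue a callback that calls backend.info() and re-queues itself."""

    def func(idx):
        backend.info()  # in the real patch this ends up at the qemu monitor
        idx += 1
        if idx < limit:
            loop.idle_add(func, idx)

    loop.idle_add(func, 0)
```

The point of the pattern is that every lookup runs on the main loop thread, the same thread that services the display, so a lookup that blocks also stalls the display client.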

Comment 22 Adam Williamson 2012-03-05 22:47:27 UTC
Can we get an equivalent update for F17? Now that it's branched from Rawhide, it's on the Bodhi treadmill: you have to fire off an update; just building through Koji isn't enough. Thanks!

Comment 23 Alon Levy 2012-03-06 00:09:13 UTC
The f17 package contains that patch already since 0.0.21-13, see http://koji.fedoraproject.org/koji/buildinfo?buildID=299027

Comment 24 Adam Williamson 2012-03-06 00:46:19 UTC
okay. I may already have tested with that version in the client, but I'll have to double check.

Comment 25 Adam Williamson 2012-03-06 20:31:38 UTC
sadly I can reproduce this with the current F17 nightly, with xorg-x11-drv-qxl-0.0.21-16.fc17.x86_64 . Happened to me twice while simply trying to do an install. Still doesn't happen with spicec.

Comment 26 Marc-Andre Lureau 2012-03-07 18:38:33 UTC
Adam

(In reply to comment #25)
> sadly I can reproduce this with the current F17 nightly, with
> xorg-x11-drv-qxl-0.0.21-16.fc17.x86_64 . Happened to me twice while simply
> trying to do an install. Still doesn't happen with spicec.
> 

Please get a full qemu backtrace of all threads (thread apply all bt) when this happens.

Something I forgot to mention is that you *need* -M pc-0.15 (>= 0.15) or qxl device revision >= 3.

It doesn't happen with spicec because it is a separate process, so there is no deadlock situation between spice and libvirt.
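
The difference between the two clients can be illustrated with a toy model. This is a deliberately simplified sketch built on assumptions, not the real spice/libvirt/qemu code: "qemu" is a thread that will not answer a monitor query until the display client has acknowledged its data (standing in for synchronous qxl I/O), and the caller plays the virt-manager main loop blocked in a libvirt call. All names are hypothetical, and a timeout replaces the real indefinite hang.

```python
import queue
import threading


def monitor_query_completes(separate_display_client, timeout=0.5):
    """Return True if the monitor query is answered, False if it times out."""
    display_ack = queue.Queue()    # display client -> qemu
    monitor_reply = queue.Queue()  # qemu -> libvirt caller
    stop = threading.Event()

    def qemu():
        # Synchronous I/O: wait for the display client before touching the monitor.
        while not stop.is_set():
            try:
                display_ack.get(timeout=0.05)
            except queue.Empty:
                continue
            monitor_reply.put("domain info")
            return

    def display_client():
        display_ack.put("display data consumed")

    workers = [threading.Thread(target=qemu)]
    if separate_display_client:
        # Like spicec: the display is serviced independently of the libvirt caller.
        workers.append(threading.Thread(target=display_client))
    for worker in workers:
        worker.start()

    # The "main loop" blocks here waiting for the monitor reply. When the display
    # client shares this thread, it never gets a chance to send its acknowledgement.
    try:
        monitor_reply.get(timeout=timeout)
        completed = True
    except queue.Empty:
        completed = False

    stop.set()
    for worker in workers:
        worker.join()
    return completed
```

In this model monitor_query_completes(True) returns promptly, while monitor_query_completes(False) only returns because of the timeout; without it, both sides would wait on each other forever.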

Comment 27 Adam Williamson 2012-03-10 02:18:40 UTC
ah, my test machine has pc-0.14. can i just bodge that up to 0.15 and re-test?

Comment 28 Adam Williamson 2012-03-10 02:35:19 UTC
Reproduced even after bumping that to 0.15. Attaching backtrace from qemu.

Comment 29 Adam Williamson 2012-03-10 02:36:01 UTC
Created attachment 569028 [details]
backtrace from qemu when hung, with recent F17 as host+guest

Comment 30 Alon Levy 2012-03-10 10:54:57 UTC
Hi Adam,

 That trace is totally normal:
 Thread 3 - vcpu thread, waiting on ioctl with KVM_RUN (optimized out but given by kvm_cpu_exec)
 Thread 2 is the red_worker thread waiting on epoll_wait
 Thread 1 is qemu main thread waiting on select

 Perhaps the hang is in virt-manager and it really isn't related to the async bug after all? Can you give a backtrace of virt-manager and libvirtd when the hang happens?

Alon

Comment 31 Adam Williamson 2012-03-12 19:38:27 UTC
okay, I'll try and find some time to reproduce and provide the backtraces soon.

Comment 32 Adam Williamson 2012-03-20 18:50:08 UTC
Interestingly, I think a qxl update may have fixed this. I have yet to hit this bug in any VM I've built that uses https://admin.fedoraproject.org/updates/FEDORA-2012-3887/xorg-x11-drv-qxl-0.0.22-0.fc17 . I'd like to have a bit more testing time before declaring that this is fixed, but it looks positive so far. I'll keep an eye on it.

Comment 33 Alon Levy 2012-03-20 18:58:12 UTC
(In reply to comment #32)
> Interestingly, I think a qxl update may have fixed this. I'm yet to hit this
> bug in any VM I've built that uses
> https://admin.fedoraproject.org/updates/FEDORA-2012-3887/xorg-x11-drv-qxl-0.0.22-0.fc17
> . I'd like to have a bit more testing time before declaring that this is fixed,
> but it looks positive so far. I'll keep an eye on it.
Thanks Adam,

Alon


Comment 34 Adam Williamson 2012-03-21 17:27:45 UTC
of course, since the bug affects F16 too, it would be good to either update qxl in F16 or isolate the fix and backport it...

Comment 35 Alon Levy 2012-03-27 15:54:20 UTC
Marking as CLOSED CURRENTRELEASE according to comment #32. I must say I'm surprised the fix apparently comes from something unrelated to the async patches, i.e. something that appeared between 0.0.21-16 and 0.0.22-0.

Alon

Comment 36 Adam Williamson 2012-03-29 05:32:13 UTC
it's definitely fixed. I still see this all the time when booting images with qxl < 0.0.22, and never when booting images with 0.0.22.

The 'timezone selection' screen of anaconda is a fairly reliable (but not quite 100%) reproducer, FWIW.
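
To tell quickly whether a guest image has a qxl driver at or past the fixed build (xorg-x11-drv-qxl-0.0.22-0, per "Fixed In Version"), the output of 'rpm -q xorg-x11-drv-qxl' can be compared against it. The sketch below is a simplification and an assumption on my part: it handles purely numeric dotted versions and a leading numeric release only, and is not a replacement for RPM's own version comparison. The function names are hypothetical.

```python
import re

# name-version-release, where neither version nor release contains a dash
_NVR = re.compile(r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)$")


def parse_nvr(package):
    """Split e.g. "xorg-x11-drv-qxl-0.0.21-16.fc17.x86_64" into comparable parts."""
    match = _NVR.match(package)
    if match is None:
        raise ValueError("not a name-version-release string: %r" % package)
    version = tuple(int(part) for part in match.group("version").split("."))
    release = int(match.group("release").split(".")[0])
    return match.group("name"), version, release


def has_qxl_fix(package, fixed=((0, 0, 22), 0)):
    """True if the package is at or past the fixed version and release."""
    _, version, release = parse_nvr(package)
    return (version, release) >= fixed
```

With the builds named in this bug, 0.0.21-16.fc17 compares as older than the fixed build and 0.0.22-0.fc17 as equal to it.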

Comment 37 Adam Williamson 2012-03-29 05:32:35 UTC
still, couldn't we get a backport for f16? it'd make life much better.