Bug 747464
Summary: | VMs occasionally become unresponsive and it is impossible to type into other applications until virt-manager is killed
---|---
Product: | [Fedora] Fedora
Reporter: | Adam Williamson <awilliam>
Component: | spice-gtk
Assignee: | Alon Levy <alevy>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high
Docs Contact: |
Priority: | unspecified
Version: | 16
CC: | bfay, cfergeau, crobinso, dblechte, ipilcher, jp, madko, marcandre.lureau, martin, mishu
Target Milestone: | ---
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: | xorg-x11-drv-qxl-0.0.22-0.fc17
Doc Type: | Bug Fix
Doc Text: |
Story Points: | ---
Clone Of: |
: | 794658
Environment: |
Last Closed: | 2012-03-27 15:54:20 UTC
Type: | ---
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 794658
Attachments: |
|
Description
Adam Williamson
2011-10-19 22:08:27 UTC
Most probably a deadlock between spice-gtk, libvirt and qemu, which is caused by qemu/qxl doing some IO synchronously. A nasty kind of bug that involves several parts. You need a fairly recent version of spice & qemu (I think the one in f16 is recent enough) but the xorg-qxl driver fix is not yet committed, afaik.

See also:
https://bugzilla.redhat.com/show_bug.cgi?id=700134
https://bugs.freedesktop.org/show_bug.cgi?id=41622

Alon, care to share the status of the various components in f16?

Adam, please get a backtrace when the hang happens, to be sure it's the same bug. thanks

okay, I'll try.

    [adamw@adam grub2 (f16)]$ rpm -q spice-gtk libvirt qemu-kvm xorg-x11-drv-qxl
    spice-gtk-0.7.39-1.fc16.x86_64
    libvirt-0.9.6-2.fc16.x86_64
    qemu-kvm-0.15.0-5.fc16.x86_64
    xorg-x11-drv-qxl-0.0.21-5.fc16.x86_64

note that when I talk about 'other apps' I am of course talking about other apps running *on the host* besides virt-manager, in case that wasn't clear.

wasn't entirely sure what you want a backtrace from, but here's the one from virt-manager. is it any use? virt-manager is python, right?

Created attachment 529158
backtrace from virt-manager while it's hung
(In reply to comment #5)
> wasn't entirely sure what you want a backtrace from, but here's the one from
> virt-manager. is it any use? virt-manager is python, right?

try the one from qemu while the client hangs. Thanks

crap. I got the qemu one first then decided you probably wanted the virt-manager one. sigh =) okay, will get qemu next time it happens.

(In reply to comment #8)
> crap. I got the qemu one first then decided you probably wanted the
> virt-manager one. sigh =) okay, will get qemu next time it happens.

Sorry, and btw, we need all threads :)

Created attachment 529945
qemu backtrace
here's the backtrace from qemu while the hang is happening
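For anyone collecting the same data, here is a minimal sketch of grabbing an all-threads backtrace non-interactively. It assumes gdb is installed on the host and that you are allowed to attach to the qemu process; the helper names and the PID are placeholders for illustration, not anything from this bug.

```python
import shlex
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that prints a backtrace for every thread."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the already-running process
        "-batch",                      # run the -ex commands, then exit
        "-ex", "thread apply all bt",  # backtrace of all threads
    ]


def capture_backtrace(pid, timeout=60):
    """Run gdb against pid and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


if __name__ == "__main__":
    # 12345 is a placeholder; substitute the PID of the hung qemu-kvm process.
    print(shlex.join(gdb_backtrace_command(12345)))
```

The frames are much easier to read if the relevant debug symbols are installed before attaching.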
Thanks Adam. The hang is on an io write (update area) from the vm that is waiting on a read from spice server, that is waiting on a read that strangely appears in libpthread, but maybe this is the missing async support for the X qxl driver.

Marc-Andre, have you done a build with the patch Gerd had that I sent you? if not I can do one.

Alon

I'm still hitting this in current Rawhide. Any updates?

--
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

There's an f15 bug with multiple stack traces attached: https://bugzilla.redhat.com/show_bug.cgi?id=768404

Though at least my report in that bug was from an f16. Any word on this? I can reproduce fairly regularly if more info is needed.

I've heard nothing further. You can use 'spicec' as a kind of workaround: use virt-manager to launch the VM, then run spicec and connect to localhost, port 5900. virt-manager bugs out so often it's kind of unusable at present.

Adam,

Sorry for the long delay. Can you please install spice-server debug symbols and reproduce the qemu backtrace?

Can you update the qxl driver in the F16 vm (not sure it's in F15) to 21-13, that's when Marc-Andre added the async patches, and retest?

Thanks for the patience,
Alon

Adam,

I missed the "async = QXL_SYNC" in the qemu stack trace, so even if you don't provide an updated stack trace I'm sure this is the result of using a too old driver, or otherwise letting qemu tell the guest it is using a too old device - so:
1. update the driver as noted in comment 15 to xorg-x11-drv-qxl-0.0.21-13.fc17 or newer
2. please provide the qemu command line, although as long as there is no "revision=1" property for the qxl-vga device, it should be fine.
3. tell me if it still reproduces. If so please have the spice-server debug symbols installed.

Thanks,
Alon

well, 'too old' seems...well, it's one way of putting it. I've been experiencing this bug since F16. If there's a fix for the bug in the latest bleeding-edge qxl or whatever, you should probably backport it to F16, since that's our actual current release and all.

okay, I did hit this yesterday performing an install of F17 Alpha RC2, which has xorg-x11-drv-qxl-0.0.21-16.fc17 . The qemu command line is:

    qemu 28580 23.5 6.2 6717988 1033380 ? Sl 14:40 0:56 /usr/bin/qemu-kvm -S -M pc-0.14 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name Test_1 -uuid 1b76a7fb-b6a4-251e-f415-e2a5ff08404b -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/Test_1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/media/Sea500/images/Fedora-17-Alpha-x86_64-DVD.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -drive file=/media/Sea500/images/Test_1.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:03:ad:59,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7

I'll try and reproduce again and get a new trace.
Hi Adam. could you try http://people.freedesktop.org/~alon/qemu-1.0-7.fc18.src.rpm

I think it should fix your hangs. (I'll attach a binary rpm if I manage to build it, it failed during tests for some reason, and I don't have permissions on the qemu package to do a scratch build).

Alon

Nope, I can still reproduce even with that qemu, sorry.

Okay, I came up with a 100% reliable reproducer by patching virt-manager to be a bit pathological. F16 host, F16 guest. virt-manager git. patch:

    diff --git a/src/virtManager/engine.py b/src/virtManager/engine.py
    index 5bd12d2..820dcd0 100644
    --- a/src/virtManager/engine.py
    +++ b/src/virtManager/engine.py
    @@ -1001,6 +1001,15 @@ class vmmEngine(vmmGObject):
                 return

             logging.debug("Pausing vm '%s'", vm.get_name())
    +        v = vm._backend
    +        def func(idx):
    +            print "lookup attempt", idx, v.info()
    +            idx += 1
    +            if idx < 10000:
    +                gobject.idle_add(func, idx)
    +
    +        gobject.idle_add(func, 0)
    +        return
             vmmAsyncJob.simple_async_noshow(vm.suspend, [], src,
                                             _("Error pausing domain"))

Makes it so whenever you click the pause button, it kicks off an idle loop to hit virDomainGetInfo, which hits the qemu monitor.

Run virt-manager --debug. Open the f16 guest, log in to gnome fallback desktop. Open a terminal. Hit the pause button to kick off the loop. In the guest, open glxgears from the terminal. If it doesn't hang, keep closing and reopening glxgears. It usually hangs for me the first time. Causes the weird host X lockup that Adam reported.

The good news is that xorg-x11-drv-qxl-0.0.21-13.fc16 from updates-testing fixes it for me, after installing in the guest and rebooting. I gave the build karma so it should be heading to stable soon.

Can we get an equivalent update for F17? Now it's branched from Rawhide, it's on the Bodhi treadmill, you have to fire off an update, just building through Koji isn't enough. thanks!
The f17 package contains that patch already since 0.0.21-13, see http://koji.fedoraproject.org/koji/buildinfo?buildID=299027

okay. I may already have tested with that version in the client, but I'll have to double check.

sadly I can reproduce this with the current F17 nightly, with xorg-x11-drv-qxl-0.0.21-16.fc17.x86_64 . Happened to me twice while simply trying to do an install. Still doesn't happen with spicec.

Adam

(In reply to comment #25)
> sadly I can reproduce this with the current F17 nightly, with
> xorg-x11-drv-qxl-0.0.21-16.fc17.x86_64 . Happened to me twice while simply
> trying to do an install. Still doesn't happen with spicec.
>

Please get a full "thread apply all bt" qemu backtrace when this happens.

Something I forgot to mention is that you *need* -M pc-0.15 (>= 0.15) or qxl device revision >= 3.

It doesn't happen with spicec because it is a separate process, so there is no deadlock situation between spice and libvirt.

ah, my test machine has pc-0.14. can I just bodge that up to 0.15 and re-test?

Reproduced even after bumping that to 0.15. Attaching backtrace from qemu.

Created attachment 569028
backtrace from qemu when hung, with recent F17 as host+guest
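A rough sketch of the command-line check discussed above (machine type >= pc-0.15, or qxl device revision >= 3, and no "revision=1" on the qxl device). The function names are hypothetical, the parser only understands the plain `-M pc-X.Y` form, and letting an explicit revision property override the machine type is an assumption rather than something stated in this bug.

```python
import re
import shlex


def machine_version(argv):
    """Return the X.Y of '-M pc-X.Y' as a tuple of ints, or None."""
    for flag, value in zip(argv, argv[1:]):
        if flag == "-M":
            match = re.match(r"pc-(\d+)\.(\d+)$", value)
            if match:
                return (int(match.group(1)), int(match.group(2)))
    return None


def qxl_revision(argv):
    """Return an explicit qxl 'revision=N' from -global/-device, or None."""
    for flag, value in zip(argv, argv[1:]):
        if flag in ("-global", "-device") and "qxl" in value:
            match = re.search(r"revision=(\d+)", value)
            if match:
                return int(match.group(1))
    return None


def meets_async_requirement(cmdline):
    """True if the command line satisfies the requirement quoted above."""
    argv = shlex.split(cmdline)
    revision = qxl_revision(argv)
    if revision is not None:
        # Assumption: an explicit revision wins over the machine type default.
        return revision >= 3
    version = machine_version(argv)
    return version is not None and version >= (0, 15)


if __name__ == "__main__":
    # Shortened form of the command line posted earlier in this bug.
    sample = ("/usr/bin/qemu-kvm -S -M pc-0.14 -enable-kvm -vga qxl "
              "-global qxl-vga.vram_size=67108864")
    print(meets_async_requirement(sample))
```

Passing this check is only a precondition: as reported above, the hang was still reproduced after bumping the machine type to 0.15.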
Hi Adam,

That trace is totally normal:
Thread 3 - vcpu thread, waiting on ioctl with KVM_RUN (optimized out but given by kvm_cpu_exec)
Thread 2 is the red_worker thread waiting on epoll_wait
Thread 1 is qemu main thread waiting on select

Perhaps the hang is in virt-manager and it really isn't related to the async bug after all? can you give a backtrace of virt-manager and libvirtd when the hang happens?

Alon

okay, I'll try and find some time to reproduce and provide the backtraces soon.

Interestingly, I think a qxl update may have fixed this. I'm yet to hit this bug in any VM I've built that uses https://admin.fedoraproject.org/updates/FEDORA-2012-3887/xorg-x11-drv-qxl-0.0.22-0.fc17 . I'd like to have a bit more testing time before declaring that this is fixed, but it looks positive so far. I'll keep an eye on it.

(In reply to comment #32)
> Interestingly, I think a qxl update may have fixed this. I'm yet to hit this
> bug in any VM I've built that uses
> https://admin.fedoraproject.org/updates/FEDORA-2012-3887/xorg-x11-drv-qxl-0.0.22-0.fc17
> . I'd like to have a bit more testing time before declaring that this is fixed,
> but it looks positive so far. I'll keep an eye on it.

Thanks Adam,
Alon

of course, since the bug affects F16 too, it would be good to either update qxl in F16 or isolate the fix and backport it...

Marking as CLOSED CURRENTRELEASE according to comment #32. I must say I'm surprised the fix apparently came from something not related to the async patches, i.e. something that appeared in [21.16, 22.0].

Alon

it's definitely fixed. I still see this all the time when booting images with qxl < 0.0.22, and never when booting images with 0.0.22. The 'timezone selection' screen of anaconda is a fairly reliable (but not quite 100%) reproducer, FWIW.

still, couldn't we get a backport for f16? it'd make life much better. |
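To sort guest images into "affected" and "fixed" by driver version, here is a minimal sketch that compares a package name against the Fixed In Version of this bug (xorg-x11-drv-qxl-0.0.22-0). The helper names are made up for illustration, and this is a simplified numeric comparison, not rpm's own version-comparison algorithm.

```python
import re

# Threshold taken from the "Fixed In Version" field: 0.0.22, release 0.
FIXED = ((0, 0, 22), 0)


def parse_qxl_package(name):
    """Split 'xorg-x11-drv-qxl-V-R...' into ((version ints), release int)."""
    match = re.match(r"xorg-x11-drv-qxl-(\d+(?:\.\d+)*)-(\d+)", name)
    if match is None:
        raise ValueError("not an xorg-x11-drv-qxl package name: %r" % name)
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2))


def has_fix(name):
    """True if the package is at or above the Fixed In Version."""
    return parse_qxl_package(name) >= FIXED


if __name__ == "__main__":
    # Package names as they appear in this bug.
    for pkg in (
        "xorg-x11-drv-qxl-0.0.21-5.fc16.x86_64",
        "xorg-x11-drv-qxl-0.0.21-16.fc17.x86_64",
        "xorg-x11-drv-qxl-0.0.22-0.fc17",
    ):
        print(pkg, has_fix(pkg))
```

The input string can be taken from `rpm -q xorg-x11-drv-qxl` run inside the guest.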