Bug 819617
| Summary: | error in keepalive handling can cause failure to receive domain events | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Zeeshan Ali <zeenix> |
| Component: | libvirt | Assignee: | Libvirt Maintainers <libvirt-maint> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 17 | CC: | berrange, cfergeau, clalancette, crobinso, dallan, dougsland, dyasny, itamar, jforbes, jyang, laine, libvirt-maint, mclasen, tommyhp2, veillard, virt-maint, zeenix |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-09-04 22:56:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Zeeshan Ali
2012-05-07 18:18:50 UTC
I've noticed a weird thing. gnome-boxes sometimes hangs and stops responding for more than 30 seconds. However, it's not waiting in a libvirt API call. But since it does not reply to the server's keepalive requests, the connection is closed in the meantime, and thus events are no longer delivered.

NB, regarding the hang, there are two symptoms which make me think of that:
1) I cannot click/type into the guest; even the clock on the guest's taskbar is not updated
2) the debug output stops as well

Zeeshan, do you experience the same?

I've taken a look into the libvirt-glib sources and I think I see the problem. It's not updating timeouts properly:
http://libvirt.org/git/?p=libvirt-glib.git;a=blob;f=libvirt-glib/libvirt-glib-event.c;h=94f4de8e306474dd66ae2cea45c10f19fb6fe218;hb=HEAD#l369

This should look like this: http://fpaste.org/XUYj/ because the libvirt internal APIs allow timeout=0, but the libvirt-glib mapping into the glib mainloop doesn't. I've tested locally and it fixes the problem. Therefore changing the component to libvirt-glib.

I've proposed a patch upstream:
https://www.redhat.com/archives/libvir-list/2012-May/msg01098.html

(In reply to comment #1)
> I've noticed a weird thing. gnome-boxes sometimes hangs and stops responding for
> more than 30 seconds. However, it's not waiting in libvirt API. But since it
> does not reply to server's keep alive requests, the connection is closed
> meanwhile and thus all events won't be delivered any longer.
>
> NB, the hang - there are two symptoms which make me think of that:
> 1) I cannot click/type into the guest, even the clock on guest's taskbar is
> not updated
> 2) the debug output stops as well
>
> Zeeshan, do you experience the same?

Yes, I have noticed this lately, but I'll need to try your patch for a while to be sure whether this is related to the original issue I reported. We don't need to keep the bug open after your patch goes in; I can re-open if I see the issue again.

Thanks so much for looking into and fixing this!
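To illustrate the timeout=0 remark above: in libvirt's event API, a timer interval of 0 means "fire on every event loop iteration" and -1 means the timer is disabled. The following is only a sketch of how a mapping layer could get this wrong and how it might be corrected. `timer_slot`, `update_timeout_buggy` and `update_timeout_fixed` are hypothetical names, not the libvirt-glib code, and the exact shape of the original bug is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a main-loop timer; not a libvirt-glib type. */
struct timer_slot {
    bool armed;      /* true when a main-loop source is scheduled */
    int interval_ms; /* valid only while armed */
};

/* Assumed buggy mapping: an interval of 0 is treated like "disabled". */
static void update_timeout_buggy(struct timer_slot *t, int interval_ms)
{
    assert(t != NULL);
    t->armed = interval_ms > 0;
    t->interval_ms = t->armed ? interval_ms : -1;
}

/* Corrected mapping: only negative intervals disable the timer,
 * so 0 stays armed and fires on the next loop iteration. */
static void update_timeout_fixed(struct timer_slot *t, int interval_ms)
{
    assert(t != NULL);
    t->armed = interval_ms >= 0;
    t->interval_ms = t->armed ? interval_ms : -1;
}
```

With the buggy variant, a caller that requests an immediate timer (interval 0) never gets its callback, which is the kind of missed wakeup that could leave keepalive handling starved.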
So, from a backtrace taken at the moment it was stuck, we can see gnome-boxes is calling a libvirt API within an event loop. The problem is that libvirt-glib maps the libvirt event loop into the gmainloop. Therefore, if a libvirt API called within the gmainloop gets stuck and the client receives an event (e.g. a keepalive request from the daemon), that event cannot be processed: it would be handled in the next iteration, which obviously won't come until we return from the stuck API (e.g. after the daemon kills off the connection due to keepalive timeout).

Backtrace: http://pastebin.test.redhat.com/90223

I've proposed a patch:
https://www.redhat.com/archives/libvir-list/2012-May/msg01157.html

IOW, it's gnome-boxes that's actually broken. But since (nearly) all glib applications are event driven, with the actual work done within callbacks called from the gmainloop, it's worth fixing libvirt-glib instead of all its consumers.

> IOW, it's gnome boxes what's broken actually.
I disagree. The libvirt keepalive code is broken. Boxes' API call is waiting in this function:
#1 0x000000340a324fe2 in virNetClientIOEventLoop (client=client@entry=0x7f5a60afb010, thiscall=thiscall@entry=0x3748b20) at rpc/virnetclient.c:1352
which should be responsible for processing the keepalive packets.
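The point above, that the loop waiting for an RPC reply should itself answer keepalive requests, can be sketched as follows. This is a simplified model, not libvirt's `virNetClientIOEventLoop`: the message types, `wait_stats`, and `wait_for_reply` are hypothetical names, and the incoming traffic is just an array rather than a socket.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds a client may see while blocked on a call. */
enum msg_type { MSG_KEEPALIVE_PING, MSG_EVENT, MSG_REPLY };

struct wait_stats {
    int pongs_sent;    /* keepalive responses sent while waiting */
    int events_queued; /* async events saved for later dispatch */
    int got_reply;     /* 1 once the awaited reply has arrived */
};

/* Drain incoming messages until the reply arrives. When answer_keepalive
 * is 0, pings are silently ignored, which models the broken behaviour:
 * the server sees no responses and eventually drops the connection. */
static struct wait_stats wait_for_reply(const enum msg_type *incoming,
                                        size_t n, int answer_keepalive)
{
    struct wait_stats s = {0, 0, 0};
    assert(incoming != NULL || n == 0);
    for (size_t i = 0; i < n && !s.got_reply; i++) {
        switch (incoming[i]) {
        case MSG_KEEPALIVE_PING:
            if (answer_keepalive)
                s.pongs_sent++;
            break;
        case MSG_EVENT:
            s.events_queued++;
            break;
        case MSG_REPLY:
            s.got_reply = 1;
            break;
        }
    }
    return s;
}
```

Calling `wait_for_reply` with `answer_keepalive` set to 1 answers every ping seen before the reply; with 0, no responses are sent, so a server-side keepalive timeout would close the connection even though the client is alive.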
This has been fixed by http://www.redhat.com/archives/libvir-list/2012-June/msg00445.html which will be part of the soon-to-be-released libvirt 0.9.13

Yeah, Christophe, thanks for updating this. Moving to POST per Comment 10.

(In reply to comment #10)
> This has been fixed by
> http://www.redhat.com/archives/libvir-list/2012-June/msg00445.html which
> will be part of the soon-to-be-released libvirt 0.9.13

That unfortunately is not true. Last week at GUADEC, Marc-Andre and I were able to reproduce this bug against git master of libvirt and libvirt-glib many times. Marc-Andre tried to trace the issue but then he couldn't reproduce it.

The keepalive issue mentioned in this bug has been fixed. There seems to be a much harder-to-reproduce bug still lurking, but this can be dealt with in a separate bug report.

(In reply to comment #13)
> The keepalive issue mentioned in this bug has been fixed, there seems to be
> a much harder bug to reproduce still lurking, but this can be dealt with in
> a separate bug report.

Well, the original bug I filed was not about the 'keep alive' issue but rather about libvirt events not being received. But yes, we can open another one for this new issue.

(In reply to comment #14)
> Well the original bug I filed was not about the 'keep alive' issue but
> rather libvirt events not received but yes, we can open another one for this
> new issue..

Yes, please do.

libvirt-0.9.11.5-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libvirt-0.9.11.5-1.fc17

Package libvirt-0.9.11.5-1.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-0.9.11.5-1.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-11838/libvirt-0.9.11.5-1.fc17
then log in and leave karma (feedback).
libvirt-0.9.11.5-2.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libvirt-0.9.11.5-2.fc17

libvirt-0.9.11.5-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libvirt-0.9.11.5-3.fc17

libvirt-0.9.11.5-3.fc17 fixed the error "missing auth scheme in graphics event" for me in addition to the autostart of the guests:

[root@host ~]# rpm -qa|grep -i libvirt
libvirt-daemon-config-nwfilter-0.9.11.5-3.fc17.x86_64
libvirt-daemon-config-network-0.9.11.5-3.fc17.x86_64
libvirt-python-0.9.11.5-3.fc17.x86_64
libvirt-client-0.9.11.5-3.fc17.x86_64
libvirt-daemon-0.9.11.5-3.fc17.x86_64
libvirt-0.9.11.5-3.fc17.x86_64

Thanks.

libvirt-0.9.11.5-3.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.