Bug 819617

Summary: error in keepalive handling can cause failure to receive domain events
Product: [Fedora] Fedora Reporter: Zeeshan Ali <zali>
Component: libvirtAssignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 17CC: berrange, cfergeau, clalancette, crobinso, dallan, dougsland, dyasny, itamar, jforbes, jyang, laine, libvirt-maint, tommyhp2, veillard, virt-maint, zali
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-09-04 18:56:12 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:

Description Zeeshan Ali 2012-05-07 14:18:50 EDT
Description of problem:

Over the past many months, I have observed that session (haven't tested against system) libvirtd stops to send domain state change events. So far I didn't care to report this issue because the issue mysteriously disappears each time I try to investigate. I was just able to reproduce it again and before it vanishes again because of me investigating it, I thought I should file it here first.

How reproducible: sometimes


Steps to Reproduce: Don't know exactly but try the following a few times:

1. Run gnome-boxes as 'G_MESSAGES_DEBUG=Libvirt.GObject gnome-boxes'.
1. Run a VM in gnome-boxes and play/click around a bit on the desktop.
2. Save the VM (just quit Boxes).
3. Restore the VM (start Boxes again and select the VM).
  
Actual results:

There is no appropriate messages on the console from 'Libvirt.GObject' in the format: %s: %s event:%d, detail:%d (These messages are emitted from libvirt-glib's event handler function) on each event.

Expected results:

There is at least one message on the console from 'Libvirt.GObject' in the format: %s: %s event:%d, detail:%d

Additional info:

Don't know if these errors I see in the log are relevant or not but thats the only thing I see there:

$ tail ~/.libvirt/libvirtd.log 
2012-05-07 18:07:18.461+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:18.518+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:18.707+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:27.602+0000: 28350: error : virNetSocketReadWire:991 : Cannot recv data: Connection reset by peer
2012-05-07 18:07:27.604+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:27.604+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:27.605+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:27.605+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:27.606+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
2012-05-07 18:07:27.606+0000: 28350: warning : qemuMonitorJSONHandleGraphics:683 : missing auth scheme in graphics event
Comment 1 Michal Privoznik 2012-05-22 04:02:33 EDT
I've noticed weird thing. gnome-boxes sometimes hangs, stop responding for more than 30 seconds. However, it's not waiting in libvirt API. But since it does not reply to server's keep alive requests, the connection is closed meanwhile and thus all events won't be delivered any longer.

NB, the hang - there are two symptoms which make me think of that:
1) I cannot click/type into the guest, even the clock on guest's taskbar is not updated
2) the debug output stops as well

Zeeshan, do you experience the same?
Comment 5 Michal Privoznik 2012-05-22 11:32:43 EDT
I've taken look into libvirt-glib sources and I think I see the problem. It's not updating timeouts properly:

http://libvirt.org/git/?p=libvirt-glib.git;a=blob;f=libvirt-glib/libvirt-glib-event.c;h=94f4de8e306474dd66ae2cea45c10f19fb6fe218;hb=HEAD#l369

This should look like this:

http://fpaste.org/XUYj/

because libvirt internal apis allow timeout=0; but libvirt-glib mapping into glib mainloop doesn't.

I've tested locally and it fixes the problem. Therefore changing the component to libvirt-glib.
Comment 6 Michal Privoznik 2012-05-22 11:39:04 EDT
I've proposed patch upstream:

https://www.redhat.com/archives/libvir-list/2012-May/msg01098.html
Comment 7 Zeeshan Ali 2012-05-22 14:08:40 EDT
(In reply to comment #1)
> I've noticed weird thing. gnome-boxes sometimes hangs, stop responding for
> more than 30 seconds. However, it's not waiting in libvirt API. But since it
> does not reply to server's keep alive requests, the connection is closed
> meanwhile and thus all events won't be delivered any longer.
> 
> NB, the hang - there are two symptoms which make me think of that:
> 1) I cannot click/type into the guest, even the clock on guest's taskbar is
> not updated
> 2) the debug output stops as well
> 
> Zeeshan, do you experience the same?

Yes, I have noticed this lately but I'll need to try your patch for a while to be sure if this is related to original issue I reported. We don't need to keep the bug open after your patch goes in, I can re-open if I see the isee again.

Thanks so much for looking into and fixing this!
Comment 8 Michal Privoznik 2012-05-24 05:20:38 EDT
So, from backtrace taken at the moment of stuck we can see gnome-boxes is calling libvirt API within an event loop. The problem is, libvirt-glib is mapping libvirt event loop into gmainloop. Therefore if an libvirt API called within gmainloop stuck, and client receive an event (e.g. keepalive request from the daemon) such event cannot be processed - it will be in the next iteration which, obviously, won't come until we jump from stuck API (e.g. after deamon kills off connection due to keepalive timeout).

Backtrace:
http://pastebin.test.redhat.com/90223

I've proposed patch:
https://www.redhat.com/archives/libvir-list/2012-May/msg01157.html

IOW, it's gnome boxes what's broken actually. But since (nearly) all glib applications are event driven - actual work is done within callbacks called from the gmainloop it's worth fixing libvirt-glib instead of all it's consumers.
Comment 9 Daniel Berrange 2012-05-24 05:39:49 EDT
> IOW, it's gnome boxes what's broken actually.

I disagree. The libvirt keepalive code is broken. Boxes' API call is waiting in this function:

#1  0x000000340a324fe2 in virNetClientIOEventLoop (client=client@entry=0x7f5a60afb010, thiscall=thiscall@entry=0x3748b20) at rpc/virnetclient.c:1352

which should be responsible for processing the keepalive packets.
Comment 10 Christophe Fergeau 2012-06-26 10:57:35 EDT
This has been fixed by http://www.redhat.com/archives/libvir-list/2012-June/msg00445.html which will be part of the soon-to-be-released libvirt 0.9.13
Comment 11 Michal Privoznik 2012-06-27 03:31:51 EDT
Yeah, Christophe thanks for updating this. Moving to POST per Comment 10.
Comment 12 Zeeshan Ali 2012-08-06 13:02:20 EDT
(In reply to comment #10)
> This has been fixed by
> http://www.redhat.com/archives/libvir-list/2012-June/msg00445.html which
> will be part of the soon-to-be-released libvirt 0.9.13

That unfortunately is not true. Last week at GUADEC, me and Marc-Andre were able to reproduce this bug against git master of libvirt and libvirt-glib many times. Marc-Andre tried to trace the issue but then he couldn't reproduce it.
Comment 13 Christophe Fergeau 2012-08-07 05:05:26 EDT
The keepalive issue mentioned in this bug has been fixed, there seems to be a much harder bug to reproduce still lurking, but this can be dealt with in a separate bug report.
Comment 14 Zeeshan Ali 2012-08-07 15:13:14 EDT
(In reply to comment #13)
> The keepalive issue mentioned in this bug has been fixed, there seems to be
> a much harder bug to reproduce still lurking, but this can be dealt with in
> a separate bug report.

Well the original bug I filed was not about the 'keep alive' issue but rather libvirt events not received but yes, we can open another one for this new issue..
Comment 15 Dave Allan 2012-08-07 21:25:41 EDT
(In reply to comment #14)
> Well the original bug I filed was not about the 'keep alive' issue but
> rather libvirt events not received but yes, we can open another one for this
> new issue..

Yes, please do.
Comment 16 Fedora Update System 2012-08-13 19:25:02 EDT
libvirt-0.9.11.5-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libvirt-0.9.11.5-1.fc17
Comment 17 Fedora Update System 2012-08-14 05:19:18 EDT
Package libvirt-0.9.11.5-1.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-0.9.11.5-1.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-11838/libvirt-0.9.11.5-1.fc17
then log in and leave karma (feedback).
Comment 18 Fedora Update System 2012-08-15 09:45:00 EDT
libvirt-0.9.11.5-2.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libvirt-0.9.11.5-2.fc17
Comment 19 Fedora Update System 2012-08-22 11:15:15 EDT
libvirt-0.9.11.5-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libvirt-0.9.11.5-3.fc17
Comment 20 Tommy P 2012-08-26 20:09:52 EDT
libvirt-0.9.11.5-3.fc17 fixed the error "missing auth scheme in graphics event" for me in addition to the autostart of the guests:

[root@host ~]# rpm -qa|grep -i libvirt
libvirt-daemon-config-nwfilter-0.9.11.5-3.fc17.x86_64
libvirt-daemon-config-network-0.9.11.5-3.fc17.x86_64
libvirt-python-0.9.11.5-3.fc17.x86_64
libvirt-client-0.9.11.5-3.fc17.x86_64
libvirt-daemon-0.9.11.5-3.fc17.x86_64
libvirt-0.9.11.5-3.fc17.x86_64

Thanks.
Comment 21 Fedora Update System 2012-09-04 18:56:12 EDT
libvirt-0.9.11.5-3.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.