Bug 1081227
Summary: | 100% CPU on 2nd migration | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Dr. David Alan Gilbert <dgilbert> | ||||||
Component: | spice-gtk | Assignee: | Marc-Andre Lureau <marcandre.lureau> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
Severity: | unspecified | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 22 | CC: | alon, berrange, cfergeau, crobinso, dblechte, dgilbert, hdegoede, kem, marcandre.lureau, sandmann, virt-maint | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2015-03-31 20:00:37 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Dr. David Alan Gilbert
2014-03-26 19:41:10 UTC
Hmm, interesting. Moving to spice-gtk, but maybe this is tickled by something virt-manager is doing Hans, this seems to be usbredir related. Hi, David, can you please start virt-manager on a terminal like this: LIBUSB_DEBUG=4 virt-manager &> log Reproduce the problem, gzip the log file and attach it here ? Regards, Hans You'll want to do virt-manager --debug or virt-manager --no-fork, since virt-manager forks by default and that command won't print much useful. Created attachment 881486 [details]
output of log file with libusb debug enabled
As requested.
Note that I think the thing that triggers the vast PCU usage might be having the
window open for the VM on both hosts; i.e. it starts off on vl403 and I open it's virt-manager windows, then I kick off a migrate (using migrate --live --compressed --verbose --desturi qemu+ssh://^Crtlab402/system --domain rhel7b1-v3 ) then when it gets to vl402 I open the window for the VM on that host.
Hi, We're you seeing the 100% cpu load problem when you generated that log file ? And if not can you please create another log file from a run where you actually see the problem ? If your initial analysis of usb_ev_thread being the culprit is correct, in combination with your backtrace showing libusb_handle_events failing with an error of -99, then I would expect to see a ton of log messages about this in the log file. But I only see a few regular libusb_handle_events calls, and not the ton of failures I would expect when libusb_handle_events fails unexpectedly and is called repeatedly. Regards, Hans In that run it did take ~100% towards the end of the run (at the point where the 2nd migration happened I think) - however, this time it wasn't showing the usb thread being hot in top - so where it was spending it's time I don't know. Dave Hi, (In reply to Dr. David Alan Gilbert from comment #7) > In that run it did take ~100% towards the end of the run (at the point where > the 2nd migration happened I think) - however, this time it wasn't showing > the usb thread being hot in top - so where it was spending it's time I don't > know. Hmm sounds like there is a deeper lying issue then which causes some thread to break out of poll with an error (or something like that) which happens to hit the usb thread sometimes, and sometimes another thread. Regards, Hans Hi, So looking at the new log, what stands out is that the poll in the usb-thread is returning with 2 because it is seeing activity on both the control pipe and the timer fd. The control pipe activity is normal the timer fd activity not so much. Then when handling the timer fd activity it gets an error while trying to disarm the timer. So it seems something has gone wrong with the timer fd, possibly some other thread has closed it and then later opened something else causing the fd number to now point to something which is not a timerfd. This would explain both the fact that the timerfd is waking poll while it should not as well as why disarming the timerfd is failing. Another possibilty is that libusb itself has closed the timerfd, because the context has been destroyed and that we're calling libusb_handle_events on a destroyed context. I've checked the spice-gtk code and it has g_return_if_fail code to check for this scenario, so if the context is getting destroyed underneath us it is likely not being done from the spice-gtk code. Regards, Hans One possible reason why this could be hapening is other libusb using code in virt-manager calling libusb_init(NULL) and libusb_exit(NULL) and it doing so in an unbalanced fashion. (In reply to Hans de Goede from comment #11) > One possible reason why this could be hapening is other libusb using code in > virt-manager calling libusb_init(NULL) and libusb_exit(NULL) and it doing so > in an unbalanced fashion. Without knowing anything about the innards of virt-manager, I would point out that when I trigger this, it normally involves a pair of migrations, so I've got the spice interface open on host A, then I've got it open on host B and then got it open on host A again; so whether there is some period where they fight as they hand off between the two hosts, or perhaps something like when it comes back to the same host again their is something stale - shrug. FWIW virt-manager doesn't use libusb directly, nor libvirt. (In reply to Cole Robinson from comment #13) > FWIW virt-manager doesn't use libusb directly, nor libvirt. Right, how do you get the list of usb-devices for the non-spice usb-redir parts of virt-manager ? Note I still think this whole USB thing is a red-herring as David is seeing the same hang sometimes occure without USB being involved at all. So chances are that we're just getting some memory corruption which sometimes hit USB. Or some init / exit race which sometime hits USB... (In reply to Hans de Goede from comment #14) > (In reply to Cole Robinson from comment #13) > > FWIW virt-manager doesn't use libusb directly, nor libvirt. > > Right, how do you get the list of usb-devices for the non-spice usb-redir > parts of virt-manager ? > via libvirt APIs that talk to udev. Okay, David gave me access to his setup and I can reliably reproduce. This seems to involve spice-gtk's seamless migration support, and an orphaned usb_device_manager or something along those lines. I added some printfs to spice-gtk to confirm that the usb_device_manager instance for each spice connection is correctly closed when we close the virt-manager console window, or the VM is destroyed. That all seems to work correctly. However, when the VM is migrated, despite us .destroy()ing the display widget and trying to clear all our state, the usb_device_manager isn't dispose()d at the spice-gtk level. I assume this is intended and part of the seamless migration magic... however we don't use seamless migration in virt-manager (would be nice to support, but would take a bit of work for us). So spice folks, any tips here? Is there something we can do to say 'dont use migration magic', or maybe this is just a spice-gtk or virt-manager bug? I'm happy to do more digging if people have suggestions Created attachment 941230 [details]
Debug output from virt-manager run
Although I did local migrations for testing (with a small patch in libvirt), I am unable to reproduce so far on f20. I try to open the Spice console as early as I can, and leave it open until the migration complete and closes it. Nevertheless, it would probably make sense to exit the loop on usb_handle_events() io/mem errors The main issue was virt-manager closing a fd it doesn't own [0]. For some reason, the same fd ends up being a libusb pipe. Then the USB thread spins, since no error are reported [1], and spice-gtk ignores errors [2]. Second issue I found, Spice session was leaked sometime on race cases (and USB pipe remained open too). This is related to spice-gtk audio manager [3] Then I figured when fd are provided to spice-gtk, there is no way migration can be made to work, and the migration session ends up stuck, and the spice-gtk session is stucked and leaked too. I solved a couple of migration issues in [4] Finally, there is a bad object model in spice-gtk, which keeps channels in session as long as the channel is alive. This is fine if client is aware of that, and only use weak references. But with bindings, this is problematic. Then calling spice_session_disconnect() several time might end up removing references that aren't owned, and I got virt-manager to crash then. I changed the spice-gtk mode [5] [0] https://www.redhat.com/archives/virt-tools-list/2014-November/msg00022.html [1] http://sourceforge.net/p/libusb/mailman/message/33015335/ [2] http://lists.freedesktop.org/archives/spice-devel/2014-November/017865.html [3] http://lists.freedesktop.org/archives/spice-devel/2014-November/017864.html [4] http://lists.freedesktop.org/archives/spice-devel/2014-November/017870.html [5] http://lists.freedesktop.org/archives/spice-devel/2014-November/017884.html Excellent work; a lot of fixes from one report :-) I'll give the new virt-manager a try when it lands for [0] but I guess it might be best to wait for the spice fixes to bubble through as well. Dave The virt-manager fix is in F21. spice-gtk fixes should be in F22+ now. Probably not big enough of a deal to backport the remaining stuff though, so just closing as CURRENTRELEASE for f22. |