Bug 1081227

Summary:

100% CPU on 2nd migration

Product:

[Fedora] Fedora

Reporter:

Dr. David Alan Gilbert <dgilbert>

Component:

spice-gtk

Assignee:

Marc-Andre Lureau <marcandre.lureau>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

alon, berrange, cfergeau, crobinso, dblechte, dgilbert, hdegoede, kem, marcandre.lureau, sandmann, virt-maint

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-03-31 20:00:37 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
output of log file with libusb debug enabled	none
Debug output from virt-manager run	none

Description Dr. David Alan Gilbert 2014-03-26 19:41:10 UTC

Description of problem:
virt-manager is currently using 100% cpu, apparently in usb_ev_thread
after a pair of migrations (but is otherwise working)

Version-Release number of selected component (if applicable):
virt-manager-1.0.0-6.fc20.noarch
libvirt-client-1.1.3.4-3.fc20.x86_64
spice-gtk-0.23-2.fc20.x86_64


How reproducible:
Not sure, just happened

Steps to Reproduce:
1. Connect virt-manager to a remote host ('a') running a RHEL7 guest, open it's display and see it's running
2. use virsh on that host to migrate it to another host ('b')
   migrate --live --compressed --verbose --desturi qemu+ssh://b/system --domain rhel7b1-v3
3. Open the VM on the new host - see it's display, (leave the window open for the original guest)
4. On 'b' use virsh to migrate the guest back to a
(Note this is happening while step 4 is in progress - they're long slow migrations).

Actual results:
virt-manager using 100% cpu apparently while not doing much,
top 'H' shows the time spent in usb_ev_thread.  Looks like it's a spice/USB problem
#0  0x00000037b660cb78 in __GI___pthread_getspecific (key=1) at pthread_getspecific.c:32
        data = <optimized out>
        result = <optimized out>
#1  0x00000037ba28a974 in g_private_get (key=key@entry=0x37ba52aae0 <private_thread_memory>) at gthread-posix.c:1019
No locals.
#2  0x00000037ba2655dc in thread_memory_from_self () at gslice.c:513
        tmem = <optimized out>
#3  g_slice_free1 (mem_size=mem_size@entry=24, mem_block=mem_block@entry=0x453e840) at gslice.c:1100
        tmem = 0x7f15300029b0
        ix = <optimized out>
        chunk_size = 32
#4  0x00000037ba268a02 in g_string_free (string=string@entry=0x453e840, free_segment=free_segment@entry=0) at gstring.c:229
        segment = 0x7f15300017e0 "\n(virt-manager:4187): GSpice-WARNING **: Error handling USB events: Other error [-99]\n"
        __PRETTY_FUNCTION__ = "g_string_free"
#5  0x00000037ba24fe8f in g_log_default_handler (log_domain=log_domain@entry=0x38d7acc5a5 "GSpice", log_level=log_level@entry=G_LOG_LEVEL_WARNING, message=message@entry=0x7f15300028e0 "Error handling USB events: Other error [-99]", 
    unused_data=unused_data@entry=0x0) at gmessages.c:1390
        level_prefix = "WARNING **\000\265\067\000\000\000\310\334 \272\067\000\000\000\060\335\177C\025\177\000\000\245Ŭ\327\070", '\000' <repeats 19 times>, "\300\351\177"
        string = <optimized out>
        gstring = 0x453e840
        fd = 2
        domains = <optimized out>
#6  0x00000037ba250429 in g_logv (log_domain=0x38d7acc5a5 "GSpice", log_level=G_LOG_LEVEL_WARNING, format=<optimized out>, args=args@entry=0x7f15437fde10) at gmessages.c:960
        domain = 0x0
        data = 0x0
        depth = 1
        log_func = 0x37ba24fc80 <g_log_default_handler>
        domain_fatal_mask = <optimized out>
        masquerade_fatal = 0
        test_level = G_LOG_LEVEL_WARNING
        was_fatal = <optimized out>
        was_recursion = <optimized out>
        msg = 0x7f15300028e0 "Error handling USB events: Other error [-99]"
        msg_alloc = 0x7f15300028e0 "Error handling USB events: Other error [-99]"
        i = 4
#7  0x00000037ba25063f in g_log (log_domain=log_domain@entry=0x38d7acc5a5 "GSpice", log_level=log_level@entry=G_LOG_LEVEL_WARNING, format=format@entry=0x38d7ad5cc8 "Error handling USB events: %s [%i]") at gmessages.c:1025
        args = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7f15437fdef0, reg_save_area = 0x7f15437fde30}}
#8  0x00000038d7a3519b in spice_usb_device_manager_usb_ev_thread (user_data=<optimized out>) at usb-device-manager.c:1152
        desc = <optimized out>
        self = <optimized out>
        priv = 0x4987920
        rc = -99
#9  0x00000037ba26ea45 in g_thread_proxy (data=0x39e4d40) at gthread.c:798
        thread = 0x39e4d40
#10 0x00000037b6607f33 in start_thread (arg=0x7f15437fe700) at pthread_create.c:309
        __res = <optimized out>
        pd = 0x7f15437fe700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139729303496448, -2763313655369288088, 0, 0, 139729303497152, 139729303496448, 2849716576536114792, -2753440861348940184}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#11 0x00000037b5ef4ded in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111



Expected results:

no particularly heavy CPU usage.

Additional info:

I'm not doing anything like USB redirect or anything like that.

<domain type='kvm'>
  <name>rhel7b1-v3</name>
  <uuid>7340a501-4171-4230-8000-48c9da15d4cd</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>6</vcpu>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/home/vms/rhel7b1-v3.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>

doesn't seem anything relevent in the log:
[Wed, 26 Mar 2014 19:25:33 virt-manager 4187] DEBUG (console:1469) Starting connect process for proto=spice trans=ssh connhost=vl402 connuser=root connport=None gaddr=127.0.0.1 gport=5900 gtlsport=None gsocket=None
[Wed, 26 Mar 2014 19:25:34 virt-manager 4187] DEBUG (console:271) Creating SSH tunnel: ssh -l root vl402 sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.
0.0.1 5900";fi;eval "$CMD";'
[Wed, 26 Mar 2014 19:25:34 virt-manager 4187] DEBUG (console:291) Open  tunnel PID=31903 OUTFD=19 ERRFD=23
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:660) Opening tunnel for channel: <RecordChannel object at 0x40dff50 (SpiceRecordChannel at 0x4e6dde0)>
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:660) Opening tunnel for channel: <PlaybackChannel object at 0x40dff50 (SpicePlaybackChannel at 0x4ecbb80)>
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:660) Opening tunnel for channel: <DisplayChannel object at 0x40dff50 (SpiceDisplayChannel at 0x4ef29c0)>
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:660) Opening tunnel for channel: <CursorChannel object at 0x40e5f00 (SpiceCursorChannel at 0x47e10e0)>
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:1376) Viewer connected
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:660) Opening tunnel for channel: <InputsChannel object at 0x40e9410 (SpiceInputsChannel at 0x4ebba10)>
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:271) Creating SSH tunnel: ssh -l root vl402 sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.
0.0.1 5900";fi;eval "$CMD";'
[Wed, 26 Mar 2014 19:25:37 virt-manager 4187] DEBUG (console:291) Open  tunnel PID=31905 OUTFD=37 ERRFD=50
[Wed, 26 Mar 2014 19:25:38 virt-manager 4187] DEBUG (console:271) Creating SSH tunnel: ssh -l root vl402 sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.
0.0.1 5900";fi;eval "$CMD";'
[Wed, 26 Mar 2014 19:25:38 virt-manager 4187] DEBUG (console:291) Open  tunnel PID=31906 OUTFD=52 ERRFD=55
[Wed, 26 Mar 2014 19:25:38 virt-manager 4187] DEBUG (console:271) Creating SSH tunnel: ssh -l root vl402 sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.
0.0.1 5900";fi;eval "$CMD";'
[Wed, 26 Mar 2014 19:25:38 virt-manager 4187] DEBUG (console:291) Open  tunnel PID=31907 OUTFD=59 ERRFD=63
[Wed, 26 Mar 2014 19:25:39 virt-manager 4187] DEBUG (console:271) Creating SSH tunnel: ssh -l root vl402 sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.
0.0.1 5900";fi;eval "$CMD";'
[Wed, 26 Mar 2014 19:25:39 virt-manager 4187] DEBUG (console:291) Open  tunnel PID=31908 OUTFD=67 ERRFD=70
[Wed, 26 Mar 2014 19:25:40 virt-manager 4187] DEBUG (console:271) Creating SSH tunnel: ssh -l root vl402 sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.
0.0.1 5900";fi;eval "$CMD";'
[Wed, 26 Mar 2014 19:25:40 virt-manager 4187] DEBUG (console:291) Open  tunnel PID=31909 OUTFD=72 ERRFD=74

    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <interface type='direct'>
      <mac address='52:54:00:70:10:01'/>
      <source dev='eno1' mode='bridge'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <graphics type='spice' autoport='yes'/>
    <sound model='ich6'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </sound>
    <video>
      <model type='qxl' ram='65536' vram='65536' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
</domain>

Comment 1 Cole Robinson 2014-03-27 12:52:20 UTC

Hmm, interesting. Moving to spice-gtk, but maybe this is tickled by something virt-manager is doing

Comment 2 Marc-Andre Lureau 2014-03-31 12:16:36 UTC

Hans, this seems to be usbredir related.

Comment 3 Hans de Goede 2014-03-31 14:46:27 UTC

Hi,

David, can you please start virt-manager on a terminal like this:

LIBUSB_DEBUG=4 virt-manager &> log

Reproduce the problem, gzip the log file and attach it here ?

Regards,

Hans

Comment 4 Cole Robinson 2014-03-31 14:48:10 UTC

You'll want to do virt-manager --debug or virt-manager --no-fork, since virt-manager forks by default and that command won't print much useful.

Comment 5 Dr. David Alan Gilbert 2014-04-01 18:45:06 UTC

Created attachment 881486 [details]
output of log file with libusb debug enabled

As requested.
Note that I think the thing that triggers the vast PCU usage might be having the
window open for the VM on both hosts; i.e. it starts off on vl403 and I open it's virt-manager windows, then I kick off a migrate (using migrate --live --compressed --verbose --desturi qemu+ssh://^Crtlab402/system --domain rhel7b1-v3 ) then when it gets to vl402 I open the window for the VM on that host.

Comment 6 Hans de Goede 2014-04-02 07:10:24 UTC

Hi,

We're you seeing the 100% cpu load problem when you generated that log file ?
And if not can you please create another log file from a run where you actually see the problem ?

If your initial analysis of usb_ev_thread being the culprit is correct, in combination with your backtrace showing libusb_handle_events failing with an error of -99, then I would expect to see a ton of log messages about this in the log file. But I only see a few regular libusb_handle_events calls, and not
the ton of failures I would expect when libusb_handle_events fails unexpectedly and is called repeatedly.

Regards,

Hans

Comment 7 Dr. David Alan Gilbert 2014-04-04 14:58:59 UTC

In that run it did take ~100% towards the end of the run (at the point where the 2nd migration happened I think) - however, this time it wasn't showing the usb thread being hot in top - so where it was spending it's time I don't know.

Dave

Comment 8 Hans de Goede 2014-04-04 15:11:31 UTC

Hi,

(In reply to Dr. David Alan Gilbert from comment #7)
> In that run it did take ~100% towards the end of the run (at the point where
> the 2nd migration happened I think) - however, this time it wasn't showing
> the usb thread being hot in top - so where it was spending it's time I don't
> know.

Hmm sounds like there is a deeper lying issue then which causes some thread
to break out of poll with an error (or something like that) which happens to
hit the usb thread sometimes, and sometimes another thread.

Regards,

Hans

Comment 10 Hans de Goede 2014-04-17 14:56:51 UTC

Hi,

So looking at the new log, what stands out is that the poll in the usb-thread is returning with 2 because it is seeing activity on both the control pipe and the timer fd. The control pipe activity is normal the timer fd activity not so much. Then when handling the timer fd activity it gets an error while trying to disarm the timer. So it seems something has gone wrong with the timer fd, possibly some other thread has closed it and then later opened something else causing the fd number to now point to something which is not a timerfd. This would explain both the fact that the timerfd is waking poll while it should not as well as why disarming the timerfd is failing.

Another possibilty is that libusb itself has closed the timerfd, because the context has been destroyed and that we're calling libusb_handle_events on a destroyed context. I've checked the spice-gtk code and it has g_return_if_fail code to check for this scenario, so if the context is getting destroyed underneath us it is likely not being done from the spice-gtk code.

Regards,

Hans

Comment 11 Hans de Goede 2014-04-17 15:00:53 UTC

One possible reason why this could be hapening is other libusb using code in virt-manager calling libusb_init(NULL) and libusb_exit(NULL) and it doing so in an unbalanced fashion.

Comment 12 Dr. David Alan Gilbert 2014-04-17 16:30:46 UTC

(In reply to Hans de Goede from comment #11)
> One possible reason why this could be hapening is other libusb using code in
> virt-manager calling libusb_init(NULL) and libusb_exit(NULL) and it doing so
> in an unbalanced fashion.

Without knowing anything about the innards of virt-manager, I would point out that when I trigger this, it normally involves a pair of migrations, so I've got the spice interface open on host A, then I've got it open on host B and then got it open on host A again; so whether there is some period where they fight as they hand off between the two hosts, or perhaps something like when it comes back to the same host again their is something stale - shrug.

Comment 13 Cole Robinson 2014-04-17 16:43:57 UTC

FWIW virt-manager doesn't use libusb directly, nor libvirt.

Comment 14 Hans de Goede 2014-04-18 07:21:52 UTC

(In reply to Cole Robinson from comment #13)
> FWIW virt-manager doesn't use libusb directly, nor libvirt.

Right, how do you get the list of usb-devices for the non-spice usb-redir parts of virt-manager ?

Note I still think this whole USB thing is a red-herring as David is seeing the same hang sometimes occure without USB being involved at all. So chances are that we're just getting some memory corruption which sometimes hit USB. Or some init / exit race which sometime hits USB...

Comment 15 Cole Robinson 2014-04-19 21:39:47 UTC

(In reply to Hans de Goede from comment #14)
> (In reply to Cole Robinson from comment #13)
> > FWIW virt-manager doesn't use libusb directly, nor libvirt.
> 
> Right, how do you get the list of usb-devices for the non-spice usb-redir
> parts of virt-manager ?
> 

via libvirt APIs that talk to udev.

Comment 16 Cole Robinson 2014-09-25 17:58:05 UTC

Okay, David gave me access to his setup and I can reliably reproduce. This seems to involve spice-gtk's seamless migration support, and an orphaned usb_device_manager or something along those lines.

I added some printfs to spice-gtk to confirm that the usb_device_manager instance for each spice connection is correctly closed when we close the virt-manager console window, or the VM is destroyed. That all seems to work correctly.

However, when the VM is migrated, despite us .destroy()ing the display widget and trying to clear all our state, the usb_device_manager isn't dispose()d at the spice-gtk level. I assume this is intended and part of the seamless migration magic... however we don't use seamless migration in virt-manager (would be nice to support, but would take a bit of work for us).

So spice folks, any tips here? Is there something we can do to say 'dont use migration magic', or maybe this is just a spice-gtk or virt-manager bug? I'm happy to do more digging if people have suggestions

Comment 17 Cole Robinson 2014-09-25 17:58:32 UTC

Created attachment 941230 [details]
Debug output from virt-manager run

Comment 18 Marc-Andre Lureau 2014-11-04 16:11:45 UTC

Although I did local migrations for testing (with a small patch in libvirt), I am unable to reproduce so far on f20. I try to open the Spice console as early as I can, and leave it open until the migration complete and closes it.

Nevertheless, it would probably make sense to exit the loop on usb_handle_events() io/mem errors

Comment 22 Marc-Andre Lureau 2014-11-09 21:12:27 UTC

The main issue was virt-manager closing a fd it doesn't own [0]. For some reason, the same fd ends up being a libusb pipe. Then the USB thread spins, since no error are reported [1], and spice-gtk ignores errors [2].

Second issue I found, Spice session was leaked sometime on race cases (and USB pipe remained open too). This is related to spice-gtk audio manager [3]

Then I figured when fd are provided to spice-gtk, there is no way migration can be made to work, and the migration session ends up stuck, and the spice-gtk session is stucked and leaked too. I solved a couple of migration issues in [4]

Finally, there is a bad object model in spice-gtk, which keeps channels in session as long as the channel is alive. This is fine if client is aware of that, and only use weak references. But with bindings, this is problematic. Then calling spice_session_disconnect() several time might end up removing references that aren't owned, and I got virt-manager to crash then. I changed the spice-gtk mode [5]


[0] https://www.redhat.com/archives/virt-tools-list/2014-November/msg00022.html
[1] http://sourceforge.net/p/libusb/mailman/message/33015335/
[2] http://lists.freedesktop.org/archives/spice-devel/2014-November/017865.html
[3] http://lists.freedesktop.org/archives/spice-devel/2014-November/017864.html
[4] http://lists.freedesktop.org/archives/spice-devel/2014-November/017870.html
[5] http://lists.freedesktop.org/archives/spice-devel/2014-November/017884.html

Comment 23 Dr. David Alan Gilbert 2014-11-17 13:04:36 UTC

Excellent work; a lot of fixes from one report :-) I'll give the new virt-manager a try when it lands for [0] but I guess it might be best to wait for the spice fixes to bubble through as well.

Dave

Comment 24 Cole Robinson 2015-03-31 20:00:37 UTC

The virt-manager fix is in F21. spice-gtk fixes should be in F22+ now. Probably not big enough of a deal to backport the remaining stuff though, so just closing as CURRENTRELEASE for f22.