Description of problem:
Starting a 5.9 VM via virt-manager on an F19 box crashes with an OOPS with the following trace. Note that I am able to reproduce it with both 3.9.8-300.fc19.x86_64 and 3.10.0-1.fc20.x86_64. I am not 100% sure if this is a dupe of 975065 as the trace is slightly different.

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1.fc20.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2013.07.02-23:23:41/vmcore [PARTIAL DUMP]
        CPUS: 4
        DATE: Wed Jul 3 00:23:40 2013
      UPTIME: 00:03:06
LOAD AVERAGE: 0.86, 1.17, 0.55
       TASKS: 221
    NODENAME: nas.int.rhx
     RELEASE: 3.10.0-1.fc20.x86_64
     VERSION: #1 SMP Mon Jul 1 20:50:44 UTC 2013
     MACHINE: x86_64 (2793 Mhz)
      MEMORY: 3.9 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 11876
     COMMAND: "vhost-11875"
        TASK: ffff880136d6af00 [THREAD_INFO: ffff88011ea26000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 11876 TASK: ffff880136d6af00 CPU: 1 COMMAND: "vhost-11875"
 #0 [ffff88013fa838d8] machine_kexec at ffffffff8103bfb2
 #1 [ffff88013fa83928] crash_kexec at ffffffff810c67f3
 #2 [ffff88013fa839f0] oops_end at ffffffff816492b0
 #3 [ffff88013fa83a18] no_context at ffffffff8163d46d
 #4 [ffff88013fa83a60] __bad_area_nosemaphore at ffffffff8163d4ed
 #5 [ffff88013fa83aa8] bad_area_nosemaphore at ffffffff8163d659
 #6 [ffff88013fa83ab8] __do_page_fault at ffffffff8164be7e
 #7 [ffff88013fa83bb0] do_page_fault at ffffffff8164c07e
 #8 [ffff88013fa83bc0] page_fault at ffffffff81648718
    [exception RIP: vhost_poll_queue+10]
    RIP: ffffffffa05d5baa RSP: ffff88013fa83c78 RFLAGS: 00010286
    RAX: 00000000c23988bd RBX: ffff88011f504200 RCX: 0000000000000003
    RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000080
    RBP: ffff88013fa83ca8 R8: ffffea0000dc1ba0 R9: 000000000009f414
    R10: 00000000000040b4 R11: 0000000000000000 R12: ffff88011ddb0000
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff88011dcf5a00
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #9 [ffff88013fa83c80] vhost_zerocopy_callback at ffffffffa05d8dac [vhost_net]
#10 [ffff88013fa83cb0] skb_copy_ubufs at ffffffff8152e0cd
#11 [ffff88013fa83d10] skb_clone at ffffffff8152e214
#12 [ffff88013fa83d30] deliver_clone at ffffffffa05840f5 [bridge]
#13 [ffff88013fa83d58] br_flood at ffffffffa05846ad [bridge]
#14 [ffff88013fa83da0] br_flood_forward at ffffffffa0584875 [bridge]
#15 [ffff88013fa83db0] br_handle_frame_finish at ffffffffa0585514 [bridge]
#16 [ffff88013fa83df0] br_handle_frame at ffffffffa0585825 [bridge]
#17 [ffff88013fa83e28] __netif_receive_skb_core at ffffffff8153b522
#18 [ffff88013fa83e80] __netif_receive_skb at ffffffff8153bae8
#19 [ffff88013fa83ea0] process_backlog at ffffffff8153c61e
#20 [ffff88013fa83ee8] net_rx_action at ffffffff8153bed9
#21 [ffff88013fa83f40] __do_softirq at ffffffff81064997
#22 [ffff88013fa83fb0] call_softirq at ffffffff81651adc
--- <IRQ stack> ---
#23 [ffff88011ea27ca0] netif_rx at ffffffff8153ad99
#24 [ffff88011ea27cd8] netif_rx_ni at ffffffff8153b218
#25 [ffff88011ea27cf0] tun_get_user at ffffffffa05a78ff [tun]
#26 [ffff88011ea27d80] tun_sendmsg at ffffffffa05a7d27 [tun]
#27 [ffff88011ea27da0] handle_tx at ffffffffa05d8830 [vhost_net]
#28 [ffff88011ea27e60] handle_tx_kick at ffffffffa05d8c55 [vhost_net]
#29 [ffff88011ea27e70] vhost_worker at ffffffffa05d578d [vhost_net]
#30 [ffff88011ea27ed0] kthread at ffffffff81080ba0
#31 [ffff88011ea27f50] ret_from_fork at ffffffff8165066c
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000 RSP: ffff88011ea27f58 RFLAGS: 00000202
    RAX: 0000000000000000 RBX: ffffffff81080ae0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff8801371cfcf8 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
bt: WARNING: possibly bogus exception frame

I have collected vmcores for both 3.9.8 and 3.10.0 if needed.
The patch that mst posted on netdev (it's not applied in any tree yet as it has to be respun) fixes it for me. I had to slightly change it as it didn't apply cleanly:

diff -up linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c.afterfree linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c
--- linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c.afterfree 2013-07-03 01:14:04.871884424 +0200
+++ linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c 2013-07-03 01:17:14.047924856 +0200
@@ -150,6 +150,11 @@ static void vhost_net_ubuf_put_and_wait(
 {
 	kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
 	wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
+}
+
+static void vhost_net_ubuf_put_wait_and_free(struct vhost_net_ubuf_ref *ubufs)
+{
+	vhost_net_ubuf_put_and_wait(ubufs);
 	kfree(ubufs);
 }

@@ -948,7 +953,7 @@ static long vhost_net_set_backend(struct
 	mutex_unlock(&vq->mutex);

 	if (oldubufs) {
-		vhost_net_ubuf_put_and_wait(oldubufs);
+		vhost_net_ubuf_put_wait_and_free(oldubufs);
 		mutex_lock(&vq->mutex);
 		vhost_zerocopy_signal_used(n, vq);
 		mutex_unlock(&vq->mutex);
@@ -966,7 +971,7 @@ err_used:
 	rcu_assign_pointer(vq->private_data, oldsock);
 	vhost_net_enable_vq(n, vq);
 	if (ubufs)
-		vhost_net_ubuf_put_and_wait(ubufs);
+		vhost_net_ubuf_put_wait_and_free(ubufs);
 err_ubufs:
 	fput(sock->file);
 err_vq:
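For anyone following along, here is a rough userspace sketch of what the split in the diff above appears to buy. As far as I can tell, the point is that "put and wait" no longer frees its argument, so a caller that waits and then keeps using the object is safe, while callers that are finished with it use the new "put, wait and free" variant. Everything in the sketch is hypothetical and not kernel code: `struct ubuf_ref`, `ubuf_ref_alloc` and `ubuf_ref_flush` are made-up names, the reference count is a plain `int`, and the wait is modelled as an assertion because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vhost_net_ubuf_ref: only a counter. */
struct ubuf_ref {
    int refcount;
};

/* Allocate an object holding one reference (the owner's). */
static struct ubuf_ref *ubuf_ref_alloc(void)
{
    struct ubuf_ref *u = malloc(sizeof(*u));
    if (!u)
        abort();
    u->refcount = 1;
    return u;
}

/*
 * Drop the owner's reference and "wait" for the count to reach zero.
 * In this single-threaded sketch nobody else holds a reference, so the
 * wait is reduced to an assertion. The object is NOT freed here.
 */
static void ubuf_ref_put_and_wait(struct ubuf_ref *u)
{
    u->refcount--;
    assert(u->refcount == 0);
}

/* For callers that are completely done with the object. */
static void ubuf_ref_put_wait_and_free(struct ubuf_ref *u)
{
    ubuf_ref_put_and_wait(u);
    free(u);
}

/*
 * A flush-style caller (hypothetical): it waits for outstanding users
 * and then re-arms the counter. Touching u after the wait is only safe
 * because ubuf_ref_put_and_wait() did not free it.
 */
static int ubuf_ref_flush(struct ubuf_ref *u)
{
    ubuf_ref_put_and_wait(u);
    u->refcount = 1;
    return u->refcount;
}
```

If `ubuf_ref_put_and_wait()` also called `free()`, as the pre-patch helper did with `kfree()`, the write in `ubuf_ref_flush()` would land in freed memory, which is the kind of use-after-free the fix is named for.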
For all of those having trouble with vhost and/or bridging in guests, please try the scratch build below when it completes. It contains the patch from bug 880035 for the timer fix and the use-after-free fix for vhost-net backported to 3.9.8. http://koji.fedoraproject.org/koji/taskinfo?taskID=5569247
Sigh. Of course, it would help if I didn't typo the patch. Anyway, here is a scratch build that should actually finish building: http://koji.fedoraproject.org/koji/taskinfo?taskID=5569571
Third time is a charm. This one actually looks like it built. Sigh, sorry about that. http://koji.fedoraproject.org/koji/taskinfo?taskID=5569631
Thanks Josh :) Just tested it and all is well here. Cheers, Michele
I've applied the patches to F17-F19 now. Assuming the testing holds, this should be fixed with the next update.
kernel-3.9.9-201.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/kernel-3.9.9-201.fc18
Package kernel-3.9.9-201.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.9.9-201.fc18'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-12530/kernel-3.9.9-201.fc18
then log in and leave karma (feedback).
kernel-3.9.9-302.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/kernel-3.9.9-302.fc19
kernel-3.9.9-201.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.
kernel-3.9.9-302.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.
kernel-3.9.10-100.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/kernel-3.9.10-100.fc17
kernel-3.9.10-100.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.