Bug 980643
| Summary: | Kernel oops with [exception RIP: vhost_poll_queue+10] when booting a 5.9 x64 VM on F19 host | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Michele Baldessari <michele> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 19 | CC: | gansalmon, itamar, jonathan, kernel-maint, luvilla, madhu.chinakonda |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-3.9.10-100.fc17 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-07-12 03:10:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 984722 | ||
The patch that mst posted on netdev (it's not applied in any tree yet as it has to be respun) fixes it for me. I had to slightly change it as it didn't apply cleanly:
diff -up linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c.afterfree linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c
--- linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c.afterfree	2013-07-03 01:14:04.871884424 +0200
+++ linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c	2013-07-03 01:17:14.047924856 +0200
@@ -150,6 +150,11 @@ static void vhost_net_ubuf_put_and_wait(
 {
 	kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
 	wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
+}
+
+static void vhost_net_ubuf_put_wait_and_free(struct vhost_net_ubuf_ref *ubufs)
+{
+	vhost_net_ubuf_put_and_wait(ubufs);
 	kfree(ubufs);
 }
 
@@ -948,7 +953,7 @@ static long vhost_net_set_backend(struct
 	mutex_unlock(&vq->mutex);
 
 	if (oldubufs) {
-		vhost_net_ubuf_put_and_wait(oldubufs);
+		vhost_net_ubuf_put_wait_and_free(oldubufs);
 		mutex_lock(&vq->mutex);
 		vhost_zerocopy_signal_used(n, vq);
 		mutex_unlock(&vq->mutex);
@@ -966,7 +971,7 @@ err_used:
 	rcu_assign_pointer(vq->private_data, oldsock);
 	vhost_net_enable_vq(n, vq);
 	if (ubufs)
-		vhost_net_ubuf_put_and_wait(ubufs);
+		vhost_net_ubuf_put_wait_and_free(ubufs);
 err_ubufs:
 	fput(sock->file);
 err_vq:
For all of those having trouble with vhost and/or bridging in guests, please try the scratch build below when it completes. It contains the patch from bug 880035 for the timer fix and the use-after-free fix for vhost-net backported to 3.9.8. http://koji.fedoraproject.org/koji/taskinfo?taskID=5569247

Sigh. Of course, it would help if I didn't typo the patch. Anyway, here is a scratch build that should actually finish building: http://koji.fedoraproject.org/koji/taskinfo?taskID=5569571

Third time is a charm. This one actually looks like it built. Sigh, sorry about that. http://koji.fedoraproject.org/koji/taskinfo?taskID=5569631

Thanks Josh :) Just tested it and all is well here. Cheers, Michele

I've applied the patches to F17-F19 now. Assuming the testing holds, this should be fixed with the next update.

kernel-3.9.9-201.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/kernel-3.9.9-201.fc18

Package kernel-3.9.9-201.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.9.9-201.fc18'
as soon as you are able to, then reboot. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-12530/kernel-3.9.9-201.fc18
then log in and leave karma (feedback).

kernel-3.9.9-302.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/kernel-3.9.9-302.fc19

kernel-3.9.9-201.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.

kernel-3.9.9-302.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.

kernel-3.9.10-100.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/kernel-3.9.10-100.fc17

kernel-3.9.10-100.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.
Description of problem:
Starting a 5.9 VM via virt-manager on an F19 box crashes with an OOPS with the following trace. Note that I am able to reproduce it with both 3.9.8-300.fc19.x86_64 and 3.10.0-1.fc20.x86_64. I am not 100% sure if this is a dupe of 975065 as the trace is slightly different.

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1.fc20.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2013.07.02-23:23:41/vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Wed Jul 3 00:23:40 2013
      UPTIME: 00:03:06
LOAD AVERAGE: 0.86, 1.17, 0.55
       TASKS: 221
    NODENAME: nas.int.rhx
     RELEASE: 3.10.0-1.fc20.x86_64
     VERSION: #1 SMP Mon Jul 1 20:50:44 UTC 2013
     MACHINE: x86_64  (2793 Mhz)
      MEMORY: 3.9 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 11876
     COMMAND: "vhost-11875"
        TASK: ffff880136d6af00  [THREAD_INFO: ffff88011ea26000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 11876  TASK: ffff880136d6af00  CPU: 1  COMMAND: "vhost-11875"
 #0 [ffff88013fa838d8] machine_kexec at ffffffff8103bfb2
 #1 [ffff88013fa83928] crash_kexec at ffffffff810c67f3
 #2 [ffff88013fa839f0] oops_end at ffffffff816492b0
 #3 [ffff88013fa83a18] no_context at ffffffff8163d46d
 #4 [ffff88013fa83a60] __bad_area_nosemaphore at ffffffff8163d4ed
 #5 [ffff88013fa83aa8] bad_area_nosemaphore at ffffffff8163d659
 #6 [ffff88013fa83ab8] __do_page_fault at ffffffff8164be7e
 #7 [ffff88013fa83bb0] do_page_fault at ffffffff8164c07e
 #8 [ffff88013fa83bc0] page_fault at ffffffff81648718
    [exception RIP: vhost_poll_queue+10]
    RIP: ffffffffa05d5baa  RSP: ffff88013fa83c78  RFLAGS: 00010286
    RAX: 00000000c23988bd  RBX: ffff88011f504200  RCX: 0000000000000003
    RDX: 0000000000000000  RSI: 00000000000000c0  RDI: 0000000000000080
    RBP: ffff88013fa83ca8   R8: ffffea0000dc1ba0   R9: 000000000009f414
    R10: 00000000000040b4  R11: 0000000000000000  R12: ffff88011ddb0000
    R13: 0000000000000000  R14: 0000000000000000  R15: ffff88011dcf5a00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88013fa83c80] vhost_zerocopy_callback at ffffffffa05d8dac [vhost_net]
#10 [ffff88013fa83cb0] skb_copy_ubufs at ffffffff8152e0cd
#11 [ffff88013fa83d10] skb_clone at ffffffff8152e214
#12 [ffff88013fa83d30] deliver_clone at ffffffffa05840f5 [bridge]
#13 [ffff88013fa83d58] br_flood at ffffffffa05846ad [bridge]
#14 [ffff88013fa83da0] br_flood_forward at ffffffffa0584875 [bridge]
#15 [ffff88013fa83db0] br_handle_frame_finish at ffffffffa0585514 [bridge]
#16 [ffff88013fa83df0] br_handle_frame at ffffffffa0585825 [bridge]
#17 [ffff88013fa83e28] __netif_receive_skb_core at ffffffff8153b522
#18 [ffff88013fa83e80] __netif_receive_skb at ffffffff8153bae8
#19 [ffff88013fa83ea0] process_backlog at ffffffff8153c61e
#20 [ffff88013fa83ee8] net_rx_action at ffffffff8153bed9
#21 [ffff88013fa83f40] __do_softirq at ffffffff81064997
#22 [ffff88013fa83fb0] call_softirq at ffffffff81651adc
--- <IRQ stack> ---
#23 [ffff88011ea27ca0] netif_rx at ffffffff8153ad99
#24 [ffff88011ea27cd8] netif_rx_ni at ffffffff8153b218
#25 [ffff88011ea27cf0] tun_get_user at ffffffffa05a78ff [tun]
#26 [ffff88011ea27d80] tun_sendmsg at ffffffffa05a7d27 [tun]
#27 [ffff88011ea27da0] handle_tx at ffffffffa05d8830 [vhost_net]
#28 [ffff88011ea27e60] handle_tx_kick at ffffffffa05d8c55 [vhost_net]
#29 [ffff88011ea27e70] vhost_worker at ffffffffa05d578d [vhost_net]
#30 [ffff88011ea27ed0] kthread at ffffffff81080ba0
#31 [ffff88011ea27f50] ret_from_fork at ffffffff8165066c
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff88011ea27f58  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: ffffffff81080ae0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff8801371cfcf8   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
bt: WARNING: possibly bogus exception frame

I have collected vmcores for both 3.9.8 and 3.10.0 if needed
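
The crash signature above (a page fault inside vhost_poll_queue, reached from vhost_zerocopy_callback while the bridge code clones a zerocopy skb) is consistent with a use-after-free on the vhost zerocopy bookkeeping: a completion callback firing against a structure that has already been freed. That is the class of bug the net.c patch earlier in this report addresses by splitting "wait for outstanding buffers" from "free the structure". As a rough, userspace-only illustration of the pattern (all names below are made up for the sketch; nothing here is kernel or vhost API):

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for the zerocopy ref structure that the
 * callback in the trace dereferences. Purely hypothetical. */
struct ubuf_ref {
	int refcount;
};

/* Pattern mirroring the fix: waiting and freeing are split, so a
 * caller that still needs the pointer afterwards stays safe. */
static void put_and_wait(struct ubuf_ref *u)
{
	u->refcount--;
	/* ...pretend we block here until refcount reaches zero... */
}

/* Convenience helper for callers that really are done with the object. */
static void put_wait_and_free(struct ubuf_ref *u)
{
	put_and_wait(u);
	free(u);
}

int main(void)
{
	struct ubuf_ref *a = calloc(1, sizeof(*a));
	struct ubuf_ref *b = calloc(1, sizeof(*b));

	if (!a || !b)
		return 1;

	a->refcount = 1;
	put_and_wait(a);            /* safe: a is still valid here */
	printf("a->refcount = %d\n", a->refcount);
	free(a);                    /* freed exactly once, by the owner */

	b->refcount = 1;
	put_wait_and_free(b);       /* fine only because b is never used again */
	/* Touching b past this point would be the same use-after-free that
	 * shows up in the oops above as a fault inside the callback path. */
	return 0;
}
```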