980643 – Kernel oops with [exception RIP: vhost_poll_queue+10] when booting a 5.9x64 VM on F19 host

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	19
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	CVE-2013-4127

Reported:	2013-07-02 22:44 UTC by Michele Baldessari
Modified:	2013-07-18 06:09 UTC
CC List:	6 users
Fixed In Version:	kernel-3.9.10-100.fc17
Clone Of:
Environment:
Last Closed:	2013-07-12 03:10:22 UTC
Type:	Bug
Embargoed:
Dependent Products:


Description Michele Baldessari 2013-07-02 22:44:00 UTC

Description of problem:
Starting a 5.9 x64 VM via virt-manager on an F19 box crashes the host kernel with an oops; the trace is below. I can reproduce it with both 3.9.8-300.fc19.x86_64 and 3.10.0-1.fc20.x86_64.

I am not 100% sure whether this is a duplicate of bug 975065, as the trace is slightly different.

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1.fc20.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2013.07.02-23:23:41/vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Wed Jul  3 00:23:40 2013
      UPTIME: 00:03:06
LOAD AVERAGE: 0.86, 1.17, 0.55
       TASKS: 221
    NODENAME: nas.int.rhx
     RELEASE: 3.10.0-1.fc20.x86_64
     VERSION: #1 SMP Mon Jul 1 20:50:44 UTC 2013
     MACHINE: x86_64  (2793 Mhz)
      MEMORY: 3.9 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 11876
     COMMAND: "vhost-11875"
        TASK: ffff880136d6af00  [THREAD_INFO: ffff88011ea26000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 11876  TASK: ffff880136d6af00  CPU: 1   COMMAND: "vhost-11875"
 #0 [ffff88013fa838d8] machine_kexec at ffffffff8103bfb2
 #1 [ffff88013fa83928] crash_kexec at ffffffff810c67f3
 #2 [ffff88013fa839f0] oops_end at ffffffff816492b0
 #3 [ffff88013fa83a18] no_context at ffffffff8163d46d
 #4 [ffff88013fa83a60] __bad_area_nosemaphore at ffffffff8163d4ed
 #5 [ffff88013fa83aa8] bad_area_nosemaphore at ffffffff8163d659
 #6 [ffff88013fa83ab8] __do_page_fault at ffffffff8164be7e
 #7 [ffff88013fa83bb0] do_page_fault at ffffffff8164c07e
 #8 [ffff88013fa83bc0] page_fault at ffffffff81648718
    [exception RIP: vhost_poll_queue+10]
    RIP: ffffffffa05d5baa  RSP: ffff88013fa83c78  RFLAGS: 00010286
    RAX: 00000000c23988bd  RBX: ffff88011f504200  RCX: 0000000000000003
    RDX: 0000000000000000  RSI: 00000000000000c0  RDI: 0000000000000080
    RBP: ffff88013fa83ca8   R8: ffffea0000dc1ba0   R9: 000000000009f414
    R10: 00000000000040b4  R11: 0000000000000000  R12: ffff88011ddb0000
    R13: 0000000000000000  R14: 0000000000000000  R15: ffff88011dcf5a00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88013fa83c80] vhost_zerocopy_callback at ffffffffa05d8dac [vhost_net]
#10 [ffff88013fa83cb0] skb_copy_ubufs at ffffffff8152e0cd
#11 [ffff88013fa83d10] skb_clone at ffffffff8152e214
#12 [ffff88013fa83d30] deliver_clone at ffffffffa05840f5 [bridge]
#13 [ffff88013fa83d58] br_flood at ffffffffa05846ad [bridge]
#14 [ffff88013fa83da0] br_flood_forward at ffffffffa0584875 [bridge]
#15 [ffff88013fa83db0] br_handle_frame_finish at ffffffffa0585514 [bridge]
#16 [ffff88013fa83df0] br_handle_frame at ffffffffa0585825 [bridge]
#17 [ffff88013fa83e28] __netif_receive_skb_core at ffffffff8153b522
#18 [ffff88013fa83e80] __netif_receive_skb at ffffffff8153bae8
#19 [ffff88013fa83ea0] process_backlog at ffffffff8153c61e
#20 [ffff88013fa83ee8] net_rx_action at ffffffff8153bed9
#21 [ffff88013fa83f40] __do_softirq at ffffffff81064997
#22 [ffff88013fa83fb0] call_softirq at ffffffff81651adc
--- <IRQ stack> ---
#23 [ffff88011ea27ca0] netif_rx at ffffffff8153ad99
#24 [ffff88011ea27cd8] netif_rx_ni at ffffffff8153b218
#25 [ffff88011ea27cf0] tun_get_user at ffffffffa05a78ff [tun]
#26 [ffff88011ea27d80] tun_sendmsg at ffffffffa05a7d27 [tun]
#27 [ffff88011ea27da0] handle_tx at ffffffffa05d8830 [vhost_net]
#28 [ffff88011ea27e60] handle_tx_kick at ffffffffa05d8c55 [vhost_net]
#29 [ffff88011ea27e70] vhost_worker at ffffffffa05d578d [vhost_net]
#30 [ffff88011ea27ed0] kthread at ffffffff81080ba0
#31 [ffff88011ea27f50] ret_from_fork at ffffffff8165066c
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff88011ea27f58  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: ffffffff81080ae0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff8801371cfcf8   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
bt: WARNING: possibly bogus exception frame

I have collected vmcores for both 3.9.8 and 3.10.0, in case they are needed.

Comment 1 Michele Baldessari 2013-07-03 05:55:44 UTC

The patch that mst posted on netdev (not yet applied in any tree, as it has to be respun) fixes it for me. It didn't apply cleanly, so I had to change it slightly:

diff -up linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c.afterfree linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c
--- linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c.afterfree	2013-07-03 01:14:04.871884424 +0200
+++ linux-3.10.0-1.fc20.x86_64/drivers/vhost/net.c	2013-07-03 01:17:14.047924856 +0200
@@ -150,6 +150,11 @@ static void vhost_net_ubuf_put_and_wait(
 {
 	kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
 	wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
+}
+
+static void vhost_net_ubuf_put_wait_and_free(struct vhost_net_ubuf_ref *ubufs)
+{
+	vhost_net_ubuf_put_and_wait(ubufs);
 	kfree(ubufs);
 }
 
@@ -948,7 +953,7 @@ static long vhost_net_set_backend(struct
 	mutex_unlock(&vq->mutex);
 
 	if (oldubufs) {
-		vhost_net_ubuf_put_and_wait(oldubufs);
+		vhost_net_ubuf_put_wait_and_free(oldubufs);
 		mutex_lock(&vq->mutex);
 		vhost_zerocopy_signal_used(n, vq);
 		mutex_unlock(&vq->mutex);
@@ -966,7 +971,7 @@ err_used:
 	rcu_assign_pointer(vq->private_data, oldsock);
 	vhost_net_enable_vq(n, vq);
 	if (ubufs)
-		vhost_net_ubuf_put_and_wait(ubufs);
+		vhost_net_ubuf_put_wait_and_free(ubufs);
 err_ubufs:
 	fput(sock->file);
 err_vq:
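
For readers skimming the diff: it moves the kfree() out of vhost_net_ubuf_put_and_wait() into a new wrapper, vhost_net_ubuf_put_wait_and_free(), and switches the two call sites in vhost_net_set_backend() to the wrapper. Presumably this lets any other caller of the plain helper (not shown in these hunks) keep using the structure after the wait. Below is a minimal user-space sketch of that split, not the kernel code: struct ubuf_ref, ubuf_alloc() and ubuf_flush() are hypothetical stand-ins, the refcount is a plain int rather than a kref, and the "wait" is reduced to an assert because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct vhost_net_ubuf_ref. */
struct ubuf_ref {
	int refcount;	/* plain int here; the kernel code uses a kref */
	int flushes;	/* how many times the object was reused after a wait */
};

static struct ubuf_ref *ubuf_alloc(void)
{
	struct ubuf_ref *u = malloc(sizeof(*u));

	if (!u)
		return NULL;
	u->refcount = 1;
	u->flushes = 0;
	return u;
}

/*
 * Drop the caller's reference and "wait" until nobody holds one.
 * Single-threaded sketch: there are no other holders, so the wait
 * collapses into an assert. This helper does NOT free the object.
 */
static void ubuf_put_and_wait(struct ubuf_ref *u)
{
	assert(u->refcount > 0);
	u->refcount--;
	assert(u->refcount == 0);
}

/* For callers that are really done with the object. */
static void ubuf_put_wait_and_free(struct ubuf_ref *u)
{
	ubuf_put_and_wait(u);
	free(u);
}

/*
 * Hypothetical caller that drains the object and then reuses it.
 * If ubuf_put_and_wait() also freed u, the two writes below would
 * touch freed memory.
 */
static void ubuf_flush(struct ubuf_ref *u)
{
	ubuf_put_and_wait(u);
	u->refcount = 1;
	u->flushes++;
}
```

A caller that wants to keep the object calls ubuf_flush() any number of times, and only the final teardown goes through ubuf_put_wait_and_free(). The point of the split is that the function name now says whether the argument is still valid on return.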

Comment 2 Josh Boyer 2013-07-03 13:11:28 UTC

For all of those having trouble with vhost and/or bridging in guests, please try the scratch build below when it completes.  It contains the patch from bug 880035 for the timer fix and the use-after-free fix for vhost-net backported to 3.9.8.

http://koji.fedoraproject.org/koji/taskinfo?taskID=5569247

Comment 3 Josh Boyer 2013-07-03 14:22:38 UTC

Sigh.  Of course, it would help if I didn't typo the patch.  Anyway, here is a scratch build that should actually finish building:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5569571

Comment 4 Josh Boyer 2013-07-03 16:37:10 UTC

Third time is a charm.  This one actually looks like it built.  Sigh, sorry about that.

http://koji.fedoraproject.org/koji/taskinfo?taskID=5569631

Comment 5 Michele Baldessari 2013-07-03 16:54:02 UTC

Thanks Josh :)

Just tested it and all is well here. 

Cheers,
Michele

Comment 6 Josh Boyer 2013-07-05 13:03:22 UTC

I've applied the patches to F17-F19 now.  Assuming the testing holds, this should be fixed with the next update.

Comment 7 Fedora Update System 2013-07-05 19:04:23 UTC

kernel-3.9.9-201.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.9.9-201.fc18

Comment 8 Fedora Update System 2013-07-07 01:39:04 UTC

Package kernel-3.9.9-201.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.9.9-201.fc18'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-12530/kernel-3.9.9-201.fc18
then log in and leave karma (feedback).

Comment 9 Fedora Update System 2013-07-11 22:15:51 UTC

kernel-3.9.9-302.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.9.9-302.fc19

Comment 10 Fedora Update System 2013-07-12 03:10:22 UTC

kernel-3.9.9-201.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 11 Fedora Update System 2013-07-14 03:29:59 UTC

kernel-3.9.9-302.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2013-07-14 11:22:08 UTC

kernel-3.9.10-100.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.9.10-100.fc17

Comment 13 Fedora Update System 2013-07-18 06:09:45 UTC

kernel-3.9.10-100.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.
