Bug 623735

Summary: hot unplug of vhost net virtio NIC causes qemu segfault
Product: Red Hat Enterprise Linux 6
Reporter: Alex Williamson <alex.williamson>
Component: qemu-kvm
Assignee: jason wang <jasowang>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 6.0
CC: akong, alex.williamson, berrange, chayang, clalance, gcosta, jasowang, khong, lihuang, llim, michen, mkenneth, szhou, tburke, virt-maint
Target Milestone: beta   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: qemu-kvm-0.12.1.2-2.128.el6
Doc Type: Bug Fix
Doc Text:
Cause: a bug in the vhost start/stop code inside qemu-kvm. Consequence: hot plugging a virtio NIC with vhost as its backend would crash qemu-kvm. Fix: corrected the vhost/virtio-net start/stop code. Result: hot plugging a virtio NIC works correctly.
Last Closed: 2011-05-19 11:26:19 UTC
Bug Blocks: 580951, 642595    

Description Alex Williamson 2010-08-12 16:18:21 UTC
Description of problem:
If a virtio NIC making use of vhost is hot unplugged from a guest using virt-manager, the VM segfaults.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.109.el6.x86_64
kernel-2.6.32-59.el6.bz615118.x86_64
libvirt-0.8.1-23.el6.x86_64
virt-manager-0.8.4-8.el6.noarch

How reproducible:
easy

Steps to Reproduce:
1. Create and run a RHEL 6 VM with a virtio NIC using virt-manager.
2. Use virt-manager to hot remove the NIC.

Actual results:
segfault

Expected results:
no segfault

Additional info:
This only happens when using libvirt because of a race between device_del and netdev_del. When automated through libvirt, device_del is issued first, firing an ACPI unplug interrupt to the guest. Before the guest has had time to confirm the device unload and eject it, libvirt issues netdev_del, which tears down backend state that the still-pending device_del later tries to use.
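
For illustration, this is the ordering libvirt drives over the monitor; the device and netdev IDs here are hypothetical. device_del returns as soon as the ACPI unplug request is raised, so the netdev_del lands while the unplug is still pending in the guest:

  (qemu) device_del net0
  (qemu) netdev_del hostnet0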

Comment 2 Alex Williamson 2010-08-12 16:25:02 UTC
Chris, Daniel - I can imagine the answer, but should libvirt be polling for the
device_del to have actually completed before we do the netdev_del?  I'm not
sure any of the drivers are expecting the backend to go away while the frontend
is still running.  Shouldn't an OS be able to NAK the hot unplug and still have
the device working?

Comment 3 Alex Williamson 2010-08-12 16:26:15 UTC
This time cc'd. Chris, Daniel - please see comment 2.

Comment 4 RHEL Program Management 2010-08-12 16:37:58 UTC
This issue has been proposed at a time when we are only considering
blocker issues for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 5 Daniel Berrangé 2010-08-12 16:42:17 UTC
QEMU offers no way to determine whether device_del has completed; it neither blocks on completion nor returns an error if the guest refuses the unplug.

Regardless, QEMU shouldn't crash if you remove the backend of a device. We explicitly want to be able to do that, to change the network backend on the fly without changing the frontend.

So yes, we should try to wait for completion, but that's not possible with QEMU today.

Comment 7 Alex Williamson 2010-08-30 18:15:17 UTC
Re-assigning back to Michael.  We either need some kind of synchronization with libvirt to delay the netdev_del until after the device_del completes, or vhost needs to be robust enough to handle the netdev going away before the device_del completes.

Comment 8 Shirley Zhou 2010-09-16 08:06:49 UTC
Reproduced this bug on qemu-kvm-0.12.1.2-2.113.el6_0.1.x86_64.
(gdb) bt
#0  tap_set_offload (nc=0x0, csum=1, tso4=1, tso6=1, ecn=1, ufo=1) at net/tap.c:252
#1  0x00000000004205a3 in virtio_net_set_features (vdev=0x37b2560, features=269484003)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-net.c:220
#2  0x000000000042102e in virtio_ioport_write (opaque=<value optimized out>, addr=<value optimized out>, val=269484003)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-pci.c:204
#3  0x000000000042ab48 in kvm_handle_io (env=0x2d75010) at /usr/src/debug/qemu-kvm-0.12.1.2/kvm-all.c:541
#4  kvm_run (env=0x2d75010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:975
#5  0x000000000042ac09 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1658
#6  0x000000000042b82f in kvm_main_loop_cpu (_env=0x2d75010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1900
#7  ap_main_loop (_env=0x2d75010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1950
#8  0x00000035896077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003588ee153d in clone () from /lib64/libc.so.6
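
The nc=0x0 in frame #0 shows the tap backend was already gone when the guest wrote the features register. As a minimal sketch of the kind of guard this implies, modeled on qemu-kvm 0.12-era hw/virtio-net.c with field names approximated; this is not the actual shipped patch:

static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
{
    VirtIONet *n = to_virtio_net(vdev);

    n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));

    /* netdev_del can tear the backend down while the unplug is still
     * pending (see the description); bail out instead of handing a
     * NULL peer to tap_set_offload(), which is frame #0 above. */
    if (!n->vc->peer) {
        return;
    }

    if (n->has_vnet_hdr) {
        tap_set_offload(n->vc->peer,
                        (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
                        (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
                        (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
                        (features >> VIRTIO_NET_F_GUEST_ECN) & 1,
                        (features >> VIRTIO_NET_F_GUEST_UFO) & 1);
    }
}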

Comment 9 lihuang 2010-09-27 02:55:17 UTC
*** Bug 637505 has been marked as a duplicate of this bug. ***

Comment 14 Shirley Zhou 2010-10-25 10:00:51 UTC
Tested hot unplug of a virtio NIC with vhost=on via virt-manager using the package from https://brewweb.devel.redhat.com/taskinfo?taskID=2841906; qemu-kvm dumped core again, but with a different backtrace from comment 8.
(gdb) bt
#0  0x00000034c8e75782 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00000034c8e78612 in _int_malloc () from /lib64/libc.so.6
#2  0x00000034c8e79a3d in malloc () from /lib64/libc.so.6
#3  0x0000000000475b35 in qemu_malloc (size=<value optimized out>) at qemu-malloc.c:59
#4  0x0000000000475c16 in qemu_mallocz (size=4120) at qemu-malloc.c:75
#5  0x0000000000494aae in qdict_new () at qdict.c:38
#6  0x0000000000495f95 in json_message_process_token (lexer=0x11cefa0, token=0x12052a0, type=JSON_OPERATOR, x=1, y=63) at json-streamer.c:45
#7  0x0000000000495da3 in json_lexer_feed_char (lexer=0x11cefa0, ch=123 '{') at json-lexer.c:299
#8  0x0000000000495ed7 in json_lexer_feed (lexer=0x11cefa0, buffer=0x7fffea30e020 "{", size=1) at json-lexer.c:322
#9  0x00000000004124d2 in monitor_control_read (opaque=<value optimized out>, buf=<value optimized out>, size=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/monitor.c:4478
#10 0x00000000004b6f2a in qemu_chr_read (opaque=0x10d6640) at qemu-char.c:154
#11 tcp_chr_read (opaque=0x10d6640) at qemu-char.c:2072
#12 0x000000000040b4af in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4234
#13 0x0000000000428cfa in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2133
#14 0x000000000040e5cb in main_loop (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4444
#15 main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6601
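
A crash inside malloc_consolidate() usually means the heap was corrupted earlier, for example by a double free or use-after-free, which is consistent with the eventual diagnosis of a bug in the vhost start/stop path (comment 25). A minimal, hypothetical sketch of the kind of idempotence guard that path needs; all names here are invented for illustration and this is not the actual qemu-kvm patch:

#include <stdbool.h>

/* Track whether vhost is running so teardown happens exactly once,
 * even if netdev_del races with a still-pending device_del. */
struct vhost_net_state {
    bool started;
    /* ... backend fds, mapped guest memory, virtqueue state ... */
};

static void vhost_net_stop_once(struct vhost_net_state *s)
{
    if (!s->started) {
        return;  /* already stopped; a second teardown would double-free */
    }
    s->started = false;
    /* ... release backend resources exactly once ... */
}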

Comment 24 Miya Chen 2011-03-10 03:23:20 UTC
Moving to VERIFIED based on comment #22.

Comment 25 jason wang 2011-05-05 09:40:30 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: a bug in the vhost start/stop code inside qemu-kvm.

Consequence: hot plugging a virtio NIC with vhost as its backend would crash qemu-kvm.

Fix: corrected the vhost/virtio-net start/stop code.

Result: hot plugging a virtio NIC works correctly.

Comment 26 errata-xmlrpc 2011-05-19 11:26:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0534.html
