Bug 1004608

Summary: qemu dumps core when the tap interface used by a vhost=on guest is deleted
Product: Red Hat Enterprise Linux 6
Reporter: Qian Guo <qiguo>
Component: qemu-kvm
Assignee: Vlad Yasevich <vyasevic>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 6.5
CC: acathrow, areis, bsarathy, chayang, juzhang, michen, mkenneth, qiguo, qzhang, rhod, virt-bugs, virt-maint
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-04-03 15:59:25 UTC

Description Qian Guo 2013-09-05 03:16:28 UTC
Description of problem:
Boot a guest with vhost=on. If the tap device used by the guest is deleted, qemu dumps core. With vhost disabled, the issue does not occur.

Version-Release number of selected component (if applicable):
# uname -r
2.6.32-416.el6.x86_64

# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.398.el6.x86_64

How reproducible:

Steps to Reproduce:
1. Boot the guest with vhost=on:
# /usr/libexec/qemu-kvm -cpu Penryn -enable-kvm -m 4096 -smp 4,sockets=1,cores=4,threads=1 -name rhel7base  -drive file=/home/RHEL-Server-6.5-64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0 -boot menu=on -monitor stdio -netdev tap,id=hostnet0,ifname=guest1,script=/etc/qemu-ifup,vhost=on -device virtio-net-pci,netdev=hostnet0,mac=54:52:1b:35:3c:16,id=test -nodefaults -nodefconfig -spice port=5930,seamless-migration=on,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864   -device virtio-balloon-pci,id=balloon1 -qmp tcp:0:4446,server,nowait -device intel-hda,id=hda1 -device hda-duplex -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/qiguo,server,nowait

2. Delete the tap device on the host:
# ip -d link delete guest1



Actual results:
qemu dumps core:
TUNSETOFFLOAD ioctl() failed: File descriptor in bad state
TUNSETVNETHDRSZ ioctl() failed: File descriptor in bad state. Exiting.
qemu-kvm: /builddir/build/BUILD/qemu-kvm-0.12.1.2/net/tap-linux.c:160: tap_fd_set_vnet_hdr_len: Assertion `0' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe75fe700 (LWP 3781)]
0x00007ffff4c9e925 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install alsa-lib-1.0.22-3.el6.x86_64 celt051-0.5.1.3-0.el6.x86_64 cyrus-sasl-gssapi-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-md5-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-plain-2.1.23-13.el6_3.1.x86_64 db4-4.7.25-17.el6.x86_64 dbus-libs-1.2.24-7.el6_3.x86_64 flac-1.2.1-6.1.el6.x86_64 glib2-2.26.0-3.el6.x86_64 glibc-2.12-1.128.el6.x86_64 glusterfs-api-3.4.0.21rhs-1.el6.x86_64 glusterfs-libs-3.4.0.21rhs-1.el6.x86_64 gnutls-2.8.5-10.el6_4.2.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.4.x86_64 libICE-1.0.6-1.el6.x86_64 libSM-1.2.1-2.el6.x86_64 libX11-1.5.0-4.el6.x86_64 libXau-1.0.6-4.el6.x86_64 libXext-1.3.1-2.el6.x86_64 libXi-1.6.1-3.el6.x86_64 libXtst-1.2.1-2.el6.x86_64 libaio-0.3.107-10.el6.x86_64 libasyncns-0.8-1.1.el6.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcrypt-1.4.5-9.el6_2.2.x86_64 libgpg-error-1.7-4.el6.x86_64 libjpeg-turbo-1.2.1-1.el6.x86_64 libogg-1.1.4-2.1.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libsndfile-1.0.20-5.el6.x86_64 libtasn1-2.3-3.el6_2.1.x86_64 libuuid-2.17.2-12.14.el6.x86_64 libvorbis-1.2.3-4.el6_2.1.x86_64 libxcb-1.8.1-1.el6.x86_64 nss-softokn-freebl-3.14.3-5.el6.x86_64 openssl-1.0.1e-8.el6.x86_64 pixman-0.26.2-5.el6_4.x86_64 pulseaudio-libs-0.9.21-14.el6_3.x86_64 spice-server-0.12.4-2.el6.x86_64 tcp_wrappers-libs-7.6-57.el6.x86_64 usbredir-0.5.1-1.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007ffff4c9e925 in raise () from /lib64/libc.so.6
#1  0x00007ffff4ca0105 in abort () from /lib64/libc.so.6
#2  0x00007ffff4c97a4e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff4c97b10 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff7e3ad1e in tap_fd_set_vnet_hdr_len (fd=<value optimized out>, len=12) at /usr/src/debug/qemu-kvm-0.12.1.2/net/tap-linux.c:160
#5  0x00007ffff7e3a9dd in tap_set_vnet_hdr_len (nc=0x7ffff86e5f00, len=12) at /usr/src/debug/qemu-kvm-0.12.1.2/net/tap.c:252
#6  0x00007ffff7de59d1 in vhost_net_start (net=0x7ffff86f75c0, dev=0x7ffff9ced6c0) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/vhost_net.c:151
#7  0x00007ffff7ddf727 in virtio_net_vhost_status (vdev=0x7ffff9ced6c0, status=7 '\a') at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-net.c:130
#8  virtio_net_set_status (vdev=0x7ffff9ced6c0, status=7 '\a') at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-net.c:147
#9  0x00007ffff7de2747 in virtio_set_status (opaque=0x7ffff88f3b90, addr=<value optimized out>, val=7) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio.h:138
#10 virtio_ioport_write (opaque=0x7ffff88f3b90, addr=<value optimized out>, val=7) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-pci.c:367
#11 0x00007ffff7deea7f in kvm_handle_io (env=0x7ffff88a9db0) at /usr/src/debug/qemu-kvm-0.12.1.2/kvm-all.c:145
#12 kvm_run (env=0x7ffff88a9db0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1049
#13 0x00007ffff7deecb9 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1744
#14 0x00007ffff7defb9d in kvm_main_loop_cpu (_env=0x7ffff88a9db0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2005
#15 ap_main_loop (_env=0x7ffff88a9db0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2061
#16 0x00007ffff77029d1 in start_thread () from /lib64/libpthread.so.0
#17 0x00007ffff4d54a8d in clone () from /lib64/libc.so.6
(gdb) 


Expected results:
No core dump occurs.

Additional info:

Comment 2 Qian Guo 2013-09-05 03:33:43 UTC
Tested with Open vSwitch; the same issue occurs.
# rpm -q openvswitch
openvswitch-1.11.0-1.el6.x86_64

Comment 3 Qunfang Zhang 2013-09-05 06:00:47 UTC
There is a similar scenario in bug 1004625. We are not sure whether they are the same bug. Please help confirm, and feel free to close one if they are the same.

Comment 4 Vlad Yasevich 2013-09-23 19:52:15 UTC
(In reply to Qunfang Zhang from comment #3)
> There's a similar scenario in bug 1004625. We are not sure whether they are
> same bug. Please help confirm and feel free to close one if they are the
> same.

The 2 bugs are not the same, but there is some relationship between them.

This particular bug depends heavily on when the interface is deleted. If the interface is deleted after qemu has finished initializing the tap/vhost devices, the problem is not observed. If it is deleted before qemu has had a chance to initialize them, the problem occurs.

This is because qemu calls assert() if the TUNSETVNETHDRSZ ioctl fails for any reason. This problem is also present in upstream qemu. To solve this, the
error recovery needs to handle this specific case.
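The sketch below shows what "handling this specific case" could look like at the ioctl level. It assumes a hypothetical helper, `tap_fd_set_vnet_hdr_len_checked()`. This is not QEMU's actual code or fix; it only illustrates the idea. Unlike the qemu-kvm 0.12.1.2 function in the backtrace, which hits `assert(0)` on failure, the helper reports the error and returns `-errno` to its caller (for example, `vhost_net_start()`). The log above shows the failure text "File descriptor in bad state", which is glibc's `strerror()` message for `EBADFD`.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if_tun.h>

/*
 * Hypothetical error-returning variant of tap_fd_set_vnet_hdr_len().
 * Instead of asserting, it prints the failure and returns -errno so the
 * caller can choose a recovery strategy.
 */
int tap_fd_set_vnet_hdr_len_checked(int fd, int len)
{
    if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1) {
        int err = errno; /* save errno before fprintf() can clobber it */
        fprintf(stderr, "TUNSETVNETHDRSZ ioctl() failed: %s\n", strerror(err));
        return -err;
    }
    return 0;
}
```

A caller would check for a negative return value and then apply one of the recovery options discussed in Comment 7, rather than aborting the whole process.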

Comment 7 Vlad Yasevich 2013-09-30 13:46:54 UTC
This bug is similar in its behaviour to Bug 1004275. There are certain errors
that qemu considers fatal because there is no good way to recover when
they occur.

Here are the options for a solution as I see them:
 1) Change the assert/abort into a clean exit that still reports the error.
 2) Remove the assert/abort and disable or delete the device that fails to initialize
    correctly. This may allow qemu to keep running, but the interface will not function
    properly. It will at least make the behaviour consistent between the
    initialization state and the already-running state. It does bring up
    some interesting issues, though.

-vlad
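The two options in Comment 7 can be contrasted in a minimal sketch. The `tap_err_policy` enum, the `hyp_net_dev` struct and `hyp_handle_tap_setup_error()` are hypothetical stand-ins, not QEMU types or functions. The sketch does not describe how the issue was resolved upstream or in RHEL 7.

```c
#include <errno.h>   /* negative errno values are passed in as err */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical switch between the two options from Comment 7. */
enum tap_err_policy {
    TAP_ERR_CLEAN_EXIT,     /* option 1: report the error and exit cleanly */
    TAP_ERR_DISABLE_DEVICE  /* option 2: keep running, disable the device */
};

/* Hypothetical, minimal stand-in for a NIC's runtime state. */
struct hyp_net_dev {
    const char *id;
    bool link_up;
};

/*
 * Handle a failed tap/vhost setup step; err is a negative errno value.
 * Option 1 ends the process with exit(), which, unlike abort(), raises no
 * SIGABRT and so produces no core dump. Option 2 leaves the process running
 * but marks the device's link down so it stops passing traffic.
 */
void hyp_handle_tap_setup_error(struct hyp_net_dev *dev, int err,
                                enum tap_err_policy policy)
{
    fprintf(stderr, "%s: tap setup failed: %s\n", dev->id, strerror(-err));
    if (policy == TAP_ERR_CLEAN_EXIT) {
        exit(EXIT_FAILURE);
    }
    dev->link_up = false;
}
```

Option 2 mirrors the already-running case from Comment 4, where deleting the interface after initialization does not crash qemu. That is the consistency argument made in option 2 above.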

Comment 10 Ronen Hod 2014-04-03 15:59:25 UTC
Closing.
Since deletion of a tap interface is somewhat malicious and not a standard workflow, we will only fix it in RHEL7 (if necessary).
QE, please test on RHEL7.

Comment 11 Qian Guo 2014-04-04 00:17:25 UTC
(In reply to Ronen Hod from comment #10)
> Closing.
> Since deletion of a tap interface is somewhat malicious and not a standard
> workflow, we will only fix it in RHEL7 (if necessary).
> QE, Please test on RHEL7.

Tested on a RHEL 7 host and did not hit this issue:
components:
# uname -r
3.10.0-118.el7.x86_64
[root@ibm-x3650m4-05 ~]# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-1.5.3-60.el7ev.x86_64

cli:
# /usr/libexec/qemu-kvm -S -name rhel7 -M pc -nodefaults -vga std -device ich9-usb-uhci1,id=usb1,bus=pci.0,addr=03 -drive id=drive_image1,if=none,cache=none,aio=native,file=/home/rhel70326cp1.qcow2_v3 -device virtio-blk-pci,id=image1,drive=drive_image1,bus=pci.0,addr=04 -device virtio-net-pci,mac=16:33:3f:09:12:78,id=vnet0,netdev=hostdev0,bus=pci.0,addr=05 -netdev tap,id=hostdev0,vhost=on,script=/etc/qemu-ifup -m 4G -smp 4,sockets=1,cores=4,threads=1 -cpu SandyBridge -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :3 -rtc base=localtime,clock=host,driftfix=slew -boot menu=on -enable-kvm -monitor stdio -qmp unix:/tmp/q1,server,nowait -device virtio-balloon-pci,id=b1 -monitor unix:/tmp/monitor-unix,nowait,server

After deleting the tap device, both guest and host keep working, so I won't file a bug against RHEL 7.

thanks,

Comment 12 Ronen Hod 2014-04-07 11:40:47 UTC
(In reply to Qian Guo from comment #11)
> after delete tap, guest and host work well both, so won't file bug against
> rhel7.

Thanks, so although we can probably find a fix to backport, let's leave it CLOSED WONTFIX for now.