1530002 – qemu hangs in vhost-net after kernel 4.14 update


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	qemu
Sub Component:
Version:	27
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Fedora Virtualization Maintainers
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-12-31 21:26 UTC by Cagney
Modified:	2018-08-01 14:20 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-08-01 14:20:52 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Cagney 2017-12-31 21:26:25 UTC

Note the 'D' (uninterruptible sleep) in the process state column of:

cagney   14175 36.9  2.1 3446808 355444 ?      D    12:18  88:09 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=l.east,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-489-l.east/master-key.aes -machine pc-0.15,accel=kvm,usb=off,dump-guest-core=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2b47a9c0-ab71-47e3-a818-b812e68fdd46 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-489-l.east/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0xa -drive file=/home/libreswan/pool/l.east.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xb,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -fsdev local,security_model=none,id=fsdev-fs0,path=/home/libreswan/wip-lswlog/testing -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=testing,bus=pci.0,addr=0x3 -fsdev local,security_model=none,id=fsdev-fs1,path=/home/libreswan/wip-lswlog -device virtio-9p-pci,id=fs1,fsdev=fsdev-fs1,mount_tag=swansource,bus=pci.0,addr=0x4 -fsdev local,security_model=none,id=fsdev-fs2,path=/tmp -device virtio-9p-pci,id=fs2,fsdev=fsdev-fs2,mount_tag=tmp,bus=pci.0,addr=0x5 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:00:00:dc:bc:ff,bus=pci.0,addr=0x6 -netdev tap,fd=31,id=hostnet1,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=12:00:00:64:64:23,bus=pci.0,addr=0x8 -netdev tap,fd=33,id=hostnet2,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=12:00:00:32:64:23,bus=pci.0,addr=0x9 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5901,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xc -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
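
For anyone wanting to spot such processes automatically during a long test run, here is a minimal sketch that scans `ps aux` style output for a STAT field beginning with 'D'. The function name `find_d_state` is made up for illustration, and the column layout is assumed to match the listing above.

```python
def find_d_state(ps_aux_output):
    """Return (pid, command) for every row whose STAT field starts with 'D'.

    Assumes the usual `ps aux` column order:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    hits = []
    for line in ps_aux_output.splitlines():
        # Split into at most 11 fields so the command line stays in one piece.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[1] == "PID":
            continue  # skip short lines and the header row
        if fields[7].startswith("D"):
            hits.append((int(fields[1]), fields[10]))
    return hits
```

Feeding it the text captured from `ps aux` gives the pids worth inspecting further.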

$ sudo virsh dominfo l.east
Id:             489
Name:           l.east
UUID:           2b47a9c0-ab71-47e3-a818-b812e68fdd46
OS Type:        hvm
State:          in shutdown
CPU(s):         1
CPU time:       5289.4s
Max memory:     524288 KiB
Used memory:    524288 KiB
Persistent:     yes
Autostart:      disable
Managed save:   no
Security model: none
Security DOI:   0

$ uname -a
Linux bernard 4.14.8-300.fc27.x86_64 #1 SMP Wed Dec 20 19:00:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -q kernel
kernel-4.14.5-300.fc27.x86_64
kernel-4.14.8-300.fc27.x86_64
$ rpm -q qemu-kvm
qemu-kvm-2.11.0-4.fc27.x86_64

This happens with the 4.14.8-300 kernel but not with the 4.14.5-300 kernel.

As usual, this can be reproduced by running libreswan's test suite, which reboots domains hundreds of times.  See: https://libreswan.org/wiki/Test_Suite

Comment 1 Richard W.M. Jones 2018-01-01 09:31:11 UTC

I guess it's hanging in a system call but without knowing what
syscall it's hard to say more.  The only thing which is going to
help us is if you can dump a stack trace of the qemu process when
it gets into this state.  See:
http://blog.kevac.org/2013/02/uninterruptible-sleep-d-state.html
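
As a rough illustration of one way to collect such a trace, the sketch below reads the kernel-side stack of every thread of a process from procfs. It assumes root privileges and a kernel that exposes `/proc/<pid>/task/<tid>/stack`; the helper name `thread_stacks` and the `proc_root` parameter are hypothetical, the latter added only so the function can be pointed at a test directory.

```python
from pathlib import Path


def thread_stacks(pid, proc_root="/proc"):
    """Map each thread id of `pid` to the lines of its kernel stack.

    Reading these files normally requires root; availability of the
    `stack` file depends on the kernel configuration (an assumption here).
    """
    task_dir = Path(proc_root) / str(pid) / "task"
    stacks = {}
    for task in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        stacks[int(task.name)] = (task / "stack").read_text().splitlines()
    return stacks
```

qemu runs several threads, so looking at every entry under `task/` rather than only the main thread seems the safer choice.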

Comment 2 Cagney 2018-01-02 21:16:36 UTC

Attempt #1

[21703.440546] sysrq: SysRq : This sysrq operation is disabled.

Comment 3 Cagney 2018-01-02 21:20:26 UTC

Attempt #2

[root@bernard ~]# echo "1" > /proc/sys/kernel/sysrq
[root@bernard ~]# cat !$
cat /proc/sys/kernel/sysrq
1
[root@bernard ~]# echo w > /proc/sysrq-trigger 
[root@bernard ~]# dmesg -c
[22052.776488] sysrq: SysRq : Show Blocked State
[22052.776499]   task                        PC stack   pid father
[22052.776849] qemu-system-x86 D    0 11335      1 0x00000004
[22052.776857] Call Trace:
[22052.776870]  __schedule+0x239/0x860
[22052.776876]  schedule+0x2c/0x80
[22052.776886]  vhost_net_ubuf_put_and_wait+0x60/0x90 [vhost_net]
[22052.776894]  ? finish_wait+0x80/0x80
[22052.776901]  vhost_net_ioctl+0x532/0x900 [vhost_net]
[22052.776909]  ? kmem_cache_free+0x1ba/0x1e0
[22052.776915]  ? __dentry_kill+0x115/0x150
[22052.776919]  ? dput.part.23+0x18d/0x1c0
[22052.776926]  do_vfs_ioctl+0xa5/0x600
[22052.776933]  ? ____fput+0xe/0x10
[22052.776939]  SyS_ioctl+0x79/0x90
[22052.776946]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[22052.776951] RIP: 0033:0x7f47b684f817
[22052.776954] RSP: 002b:00007ffdde355b48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[22052.776959] RAX: ffffffffffffffda RBX: 00007f47a542b0f0 RCX: 00007f47b684f817
[22052.776963] RDX: 00007ffdde355b50 RSI: 000000004008af30 RDI: 0000000000000020
[22052.776966] RBP: 0000000000000000 R08: 000055794a8b94b0 R09: 000055794a8b4b12
[22052.776969] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
[22052.776972] R13: 000055794ca622d0 R14: 00007f47a542b090 R15: 0000000000000001
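
The register dump can be read a little further. ORIG_RAX 0x10 is syscall 16, which is ioctl on x86_64 and agrees with SyS_ioctl in the trace; RDI (0x20, i.e. fd 32) is the first argument and RSI (0x4008af30) is the request number. The sketch below splits that number into the standard Linux `_IOC` fields. The function name `decode_ioctl` is made up, and the bit layout is the generic one used on x86.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its _IOC fields.

    Generic layout (as used on x86): bits 0-7 command number,
    bits 8-15 type ("magic"), bits 16-29 argument size, bits 30-31 direction.
    """
    directions = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    return {
        "nr": request & 0xFF,
        "type": (request >> 8) & 0xFF,
        "size": (request >> 16) & 0x3FFF,
        "dir": directions[(request >> 30) & 0x3],
    }
```

For 0x4008af30 this gives type 0xAF, number 0x30, an 8-byte argument, direction "write". If my reading of linux/vhost.h is right, that corresponds to VHOST_NET_SET_BACKEND; treat that identification as an assumption rather than something established in this report.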

Comment 4 Richard W.M. Jones 2018-01-02 21:55:58 UTC

Interesting, something related to vhost-net, and I notice that you're
using 3 x vhost-net network interfaces in your guest.

You could try turning vhost-net off and seeing if that makes a difference
but based on your stack trace that would seem to be the cause of the problem.
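
A possible way to do that for a libvirt-managed guest is to add `<driver name='qemu'/>` to each virtio interface, which the libvirt domain XML format documents as selecting the userspace backend instead of vhost. The sketch below edits XML obtained from `virsh dumpxml`; the function name `force_userspace_backend` is hypothetical, and feeding the result back through `virsh define` is an assumed workflow, not something tried in this report.

```python
import xml.etree.ElementTree as ET


def force_userspace_backend(domain_xml):
    """Return domain XML with <driver name='qemu'/> on every virtio interface."""
    root = ET.fromstring(domain_xml)
    for iface in root.iter("interface"):
        model = iface.find("model")
        if model is None or model.get("type") != "virtio":
            continue  # only virtio NICs can use vhost-net
        driver = iface.find("driver")
        if driver is None:
            driver = ET.SubElement(iface, "driver")
        driver.set("name", "qemu")
    return ET.tostring(root, encoding="unicode")
```

If the hang disappears with the userspace backend, that would support the vhost-net theory; if it persists, the problem lies elsewhere.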

Comment 5 Richard W.M. Jones 2018-01-02 21:58:36 UTC

BTW, the next step is to examine all the commits in the 4.14 stable branch
of the kernel and see whether any of them is likely to be causing a
regression in vhost-net.
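
A minimal sketch of how that search could be narrowed, assuming a local clone of the linux-stable tree and that the Fedora 4.14.5-300 and 4.14.8-300 packages track the upstream v4.14.5 and v4.14.8 tags. The helper names and the `repo` argument are hypothetical; `drivers/vhost` is the obvious starting path, and any further paths (the tun/tap drivers, say) would be guesses.

```python
import subprocess


def stable_log_command(good, bad, paths=("drivers/vhost",)):
    """Build a `git log` command listing commits after `good` up to `bad`
    that touch the given paths."""
    return ["git", "log", "--oneline", f"v{good}..v{bad}", "--", *paths]


def list_commits(repo, good, bad, paths=("drivers/vhost",)):
    """Run the command inside `repo` (a linux-stable clone) and return its lines."""
    result = subprocess.run(
        stable_log_command(good, bad, paths),
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

Calling `list_commits(repo, "4.14.5", "4.14.8")` would then yield the candidate commits to read through, or to feed into `git bisect` if the list is too long.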

Comment 6 Cagney 2018-01-04 02:01:56 UTC

(In reply to Richard W.M. Jones from comment #4)
> Interesting, something related to vhost-net, and I notice that you're
> using 3 x vhost-net network interfaces in your guest.

Yes, and multiple domains are also sharing these virtual networks. l.east just happens to be the first domain that was told to reboot; it had been rebooting "successfully" (but note bug 1374918) for several hours before locking up.

I'll look at my logs and see if anything jumps out.

> You could try turning vhost-net off and seeing if that makes a difference
> but based on your stack trace that would seem to be the cause of the problem.

Comment 7 Cagney 2018-01-05 23:13:55 UTC

Kernel kernel-4.14.11-300.fc27.x86_64 doesn't appear to be broken.