Bug 1090938
| Summary: | VM dies under huge traffic with virtio and vhost_net, and afterwards the qemu-kvm process can't be killed | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Thomas Heil <heil> |
| Component: | kernel | Assignee: | Vlad Yasevich <vyasevic> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.6 | CC: | acathrow, areis, bsarathy, bugzilla_rh, chayang, csieh, dgibson, ezzze, gchakkar, heil, juzhang, mario.ende, mazhang, michen, milan.zelenka, mkenneth, qzhang, sellis, soeren.grunewald, virt-bugs, virt-maint |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-05 16:05:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Thomas Heil
2014-04-24 12:32:29 UTC
On the host system we see messages like this:
--
INFO: task qemu-kvm:7153 blocked for more than 120 seconds.
Not tainted 2.6.32-431.11.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-kvm D 0000000000000000 0 7153 1 0x00000084
ffff8803dc5a7cf8 0000000000000086 ffff8803dc5a7e58 ffffffffa01aa9d8
0000000055436840 0000000000000000 ffffffffdc5a7c98 ffff8803dc5a7ca0
ffff880632a6f058 ffff8803dc5a7fd8 000000000000fbc8 ffff880632a6f058
Call Trace:
[<ffffffffa01aa9d8>] ? kvm_vm_ioctl+0x218/0x1030 [kvm]
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa022b3cd>] vhost_work_flush+0x9d/0x120 [vhost_net]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa022b469>] vhost_poll_flush+0x19/0x20 [vhost_net]
[<ffffffffa022d148>] vhost_net_flush_vq+0x38/0x60 [vhost_net]
[<ffffffffa022e3bf>] vhost_net_ioctl+0x31f/0x5d0 [vhost_net]
[<ffffffff8108b14e>] ? send_signal+0x3e/0x90
[<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
[<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
[<ffffffff810b1bfb>] ? sys_futex+0x7b/0x170
[<ffffffff8119e331>] sys_ioctl+0x81/0xa0
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
--
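One detail behind "can't be killed" in the summary: a task sleeping uninterruptibly ("D" state, as in the trace above) ignores even SIGKILL until the vhost flush completes. A minimal check, assuming stock procps, with the PID taken from the trace above:

```bash
# Show the state and wait channel of the hung process (PID from the trace).
ps -o pid,stat,wchan:32 -p 7153
# While hung, STAT shows "D" and WCHAN shows vhost_work_flush;
# "kill -9 7153" has no effect until the task leaves the D state.
```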
Thomas, thanks for taking the time to enter a bug report with us. We appreciate the feedback and look to use reports such as this to guide our efforts at improving our products. That being said, we're not able to guarantee the timeliness or suitability of a resolution for issues entered here, because this is not a mechanism for requesting support. If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain it receives the proper attention and prioritization to assure a timely resolution. For information on how to contact the Red Hat production support team, please visit: https://www.redhat.com/support/process/production/#howto

We noticed the same problem in a customer's RHEV infrastructure on a RHEL 6.5 host. The network freezes when heavy traffic occurs in a RHEL 6.5 virtual guest. Kernel log on the host:

Apr 30 16:03:36 rhelh11 kernel: INFO: task qemu-kvm:14796 blocked for more than 120 seconds.
Apr 30 16:03:36 rhelh11 kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Apr 30 16:03:36 rhelh11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 30 16:03:36 rhelh11 kernel: qemu-kvm D 0000000000000012 0 14796 1 0x00000084
Apr 30 16:03:36 rhelh11 kernel: ffff881817eb5cf8 0000000000000086 0000000000000000 0000000000000282
Apr 30 16:03:36 rhelh11 kernel: ffff881817eb5c88 0000000000000475 000000000000008d 000000000000000a
Apr 30 16:03:36 rhelh11 kernel: ffff881818606638 ffff881817eb5fd8 000000000000fbc8 ffff881818606638
Apr 30 16:03:36 rhelh11 kernel: Call Trace:
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa028f3cd>] vhost_work_flush+0x9d/0x120 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa028f469>] vhost_poll_flush+0x19/0x20 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa0291148>] vhost_net_flush_vq+0x38/0x60 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa02923bf>] vhost_net_ioctl+0x31f/0x5d0 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8122db91>] ? avc_has_perm+0x71/0x90
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8119e331>] sys_ioctl+0x81/0xa0
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

If you have made some progress with this issue, I will be happy to hear about it. I will collect sosreports and create an official Red Hat ticket if the problem occurs next time...

(In reply to Milan Zelenka from comment #4)
> If you have made some progress with this issue, I will be happy to hear
> about it. I will collect sosreports and create an official Red Hat ticket
> if the problem occurs next time...

Unfortunately we don't have any progress at the moment. If you have more information about the environment and the scenario where it happened, please let us know. An official support ticket would help us prioritize it. Thanks. Meanwhile, I'm asking our QE team to try to reproduce it, or to check whether they have any record of such a problem happening in the past.

Cannot reproduce this bug with either qemu-kvm-rhev-0.12.1.2-2.415.el6_5.8 or qemu-kvm-0.12.1.2-2.415.el6_5.8.
Host:
kernel-2.6.32-431.11.2.el6
qemu-kvm-0.12.1.2-2.415.el6_5.8/qemu-kvm-rhev-0.12.1.2-2.415.el6_5.8
Guest:
kernel-2.6.32-431.11.2.el6
Steps:
1. Boot two VMs with a virtio-net-pci device and vhost enabled:
-netdev tap,id=hostnet2,vhost=on \
-device virtio-net-pci,netdev=hostnet2,id=net2,mac=22:11:22:45:61:97 \
2. Start 16 netperf instances by running the script below in each guest (a UDP_STREAM variant is sketched after the result below):
#!/bin/bash
# Launch 16 parallel netperf TCP_STREAM sessions against the peer guest.
COUNT=0
while [ $COUNT -lt 16 ]
do
    netperf -H 10.66.83.169 -l 600 &
    COUNT=$((COUNT + 1))
done
Result:
qemu-kvm and the guest keep working; the problem was not observed.
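The next comment retests with UDP_STREAM; a hedged variant of the script's netperf call for that case, using netperf's `-t` test selector (same target and duration as above):

```bash
# UDP instead of the default TCP_STREAM; a drop-in replacement for the
# netperf line in the loop above.
netperf -H 10.66.83.169 -l 600 -t UDP_STREAM &
```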
Hi Thomas,

We could not reproduce this problem. Could you please provide more detailed steps?
1. Could you paste the qemu-kvm command line here?
2. How did you produce the network traffic?
3. How many VMs are running on the host?

Thanks, Mazhang.

Tested UDP_STREAM instead of TCP_STREAM, and tried rebooting and shutting down the guest during heavy traffic; still cannot reproduce this bug. Will run netperf for 12 hours and wait for the result.

We are also trying to reproduce it with the current kernel, 2.6.32-431.17.1.el6.x86_64. We have two Varnish VMs on this host, and the traffic is generated from outside, not from the same machine. Normally it takes about one day to reproduce it.

After the 12-hour netperf UDP_STREAM test, both the guest and host networks still work fine; the problem was not hit.
switch Link encap:Ethernet HWaddr 00:23:AE:77:09:7F
inet addr:10.66.83.169 Bcast:10.66.83.255 Mask:255.255.254.0
inet6 addr: fe80::223:aeff:fe77:97f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1896129319 errors:0 dropped:0 overruns:0 frame:0
TX packets:147362756 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:116150277766554 (105.6 TiB) TX bytes:9860516267 (9.1 GiB)
tap0 Link encap:Ethernet HWaddr 96:8C:09:11:5F:B2
inet6 addr: fe80::948c:9ff:fe11:5fb2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:863941905 errors:0 dropped:0 overruns:0 frame:0
TX packets:1520034 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:56630210481297 (51.5 TiB) TX bytes:140195103 (133.7 MiB)
tap1 Link encap:Ethernet HWaddr CA:DE:4C:17:66:EF
inet6 addr: fe80::c8de:4cff:fe17:66ef/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:865269257 errors:0 dropped:0 overruns:0 frame:0
TX packets:1520151 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:56717423460025 (51.5 TiB) TX bytes:140153587 (133.6 MiB)
Hi, I'm a colleague of Thomas Heil; here is my update after some tests: since the last kernel update, 2.6.32-431.17.1.el6.x86_64, I can't reproduce this issue. Maybe it was fixed in this update?

>> 1. Could you paste the qemu-kvm command line here?

qemu 8102 26.3 16.7 17476012 16635096 ? Sl May12 449:58 /usr/libexec/qemu-kvm -name varnish-41 -S -M rhel6.3.0 -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 16384 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid a5ad77e6-6b2f-7268-c23d-38462ae2a9d7 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/varnish-41.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/varnish-41-root.img,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/vg01/varnish-41-swap,if=none,id=drive-virtio-disk1,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c5:eb:b0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:2 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

>> 2. How did you produce the network traffic?

- rsync of some files, up- and downstream, to another server across the same NIC, plus some HTTP traffic to the Varnish web cache

>> 3. How many VMs are running on the host?

- from 3 up to 7 VMs are in use

We use primarily TCP traffic!

Hi Thommy,

Thanks for your reply. It seems your host is RHEL 6.5, but qemu-kvm was run with "-M rhel6.3.0", which is an old machine type. Could you please try with "-M rhel6.5.0"?

Thanks, Mazhang.

Same issue in our production environment. The issue disappeared when vhost-net was disabled for the VMs.
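For anyone needing the workaround mentioned above: disabling vhost-net means the tap backend falls back to QEMU's userspace virtio-net path. A minimal hypothetical sketch of the relevant command-line change (all other options elided; with libvirt the equivalent is a `<driver name='qemu'/>` element on the interface):

```bash
# vhost=off keeps virtio-net in the guest but moves packet processing from
# the vhost kernel thread into QEMU userspace (slower, but sidesteps the hang).
/usr/libexec/qemu-kvm -M rhel6.5.0 -enable-kvm -m 1024 \
    -netdev tap,id=hostnet0,vhost=off \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:12:34:56
```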
dmesg log:
INFO: task qemu-kvm:10092 blocked for more than 120 seconds.
Not tainted 2.6.32-431.11.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-kvm D 0000000000000007 0 10092 1 0x00000084
ffff881862bd3cf8 0000000000000086 ffff881862bd3e58 ffffffffa02099d8
ffff880c6a2b9728 0000000000000001 ffff881862bd3ca8 ffffffff81169339
ffff881869e0a5f8 ffff881862bd3fd8 000000000000fbc8 ffff881869e0a5f8
Call Trace:
[<ffffffffa02099d8>] ? kvm_vm_ioctl+0x218/0x1030 [kvm]
[<ffffffff81169339>] ? shmem_check_acl+0x29/0x80
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa01e43cd>] vhost_work_flush+0x9d/0x120 [vhost_net]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa01e4469>] vhost_poll_flush+0x19/0x20 [vhost_net]
[<ffffffffa01e6148>] vhost_net_flush_vq+0x38/0x60 [vhost_net]
[<ffffffffa01e73bf>] vhost_net_ioctl+0x31f/0x5d0 [vhost_net]
[<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
[<ffffffff8119a749>] ? do_unlinkat+0xb9/0x260
[<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119e331>] sys_ioctl+0x81/0xa0
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
process info:
[root@OE-VDI1 ~]# ps axopid,comm,wchan | grep 10092
10092 qemu-kvm vhost_work_flush
10116 vhost-10092 -
[root@OE-VDI1 ~]#
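The worker thread naming above (vhost-10092 for qemu-kvm PID 10092) makes it easy to scan a whole host for affected guests. A small sketch, assuming stock procps and awk:

```bash
#!/bin/bash
# List every qemu-kvm process blocked in vhost_work_flush together with its
# vhost worker thread (the worker is named vhost-<qemu pid>).
ps -eo pid,comm,wchan:32 --no-headers |
awk '$2 == "qemu-kvm" && $3 == "vhost_work_flush" { print $1 }' |
while read -r pid; do
    echo "qemu-kvm $pid blocked in vhost_work_flush; worker thread:"
    ps -eo pid,comm --no-headers | awk -v w="vhost-$pid" '$2 == w'
done
```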
qemu parameters:
/usr/libexec/qemu-kvm -name i-00000070 -S -M rhel6.5.0 -cpu SandyBridge,+erms,+smep,+fsgsbase,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 3072 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 7683c648-c5ac-4d15-9460-7a341ea3db0f -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/i-00000070.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/opt/storage/instances/i-00000070/disk_0,if=none,id=drive-virtio-disk2,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2,bootindex=2 -drive file=/opt/storage/instances/i-00000070/disk_1,if=none,id=drive-virtio-disk3,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk3,id=virtio-disk3 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:03:fe:65:50:43,bus=pci.0,addr=0x3 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5970,addr=0.0.0.0,image-compression=auto_glz,jpeg-wan-compression=auto,streaming-video=filter,seamless-migration=on -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
I'm working on a customer case with what looks like the same bug. Bumping priority accordingly.

I have a qemu gcore for that case - it looks like the vcpu thread is blocked uninterruptibly handling a CPU IO. I don't have kernel-side info, unfortunately, but there are also the vhost blocked-thread warnings like the ones above, so it's probably the same issue.

(In reply to David Gibson from comment #16)
> I'm working on a customer case with what looks like the same bug. Bumping
> priority accordingly.
>
> I have a qemu gcore for that case - it looks like the vcpu thread is blocked
> uninterruptibly handling a CPU IO. I don't have kernel-side info,
> unfortunately, but there are also the vhost blocked-thread warnings like
> the ones above, so it's probably the same issue.

David, would your customer be willing to try a test kernel? There was a very recent patch to vhost that prevents it from hogging the CPU when it is very busy, and it certainly looks like that is the scenario here. I'd need a desired kernel version to build a test kernel.

Thanks,
-vlad

*** This bug has been marked as a duplicate of bug 1095627 ***

Can bug 1095627 be made public, please? Thanks.

As a Red Hatter, I also can't access 1095627.