Bug 1090938
| Summary: | VM dies under huge traffic with virtio and vhost_net, and afterwards the qemu-kvm process can't be killed | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Thomas Heil <heil> |
| Component: | kernel | Assignee: | Vlad Yasevich <vyasevic> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.6 | CC: | acathrow, areis, bsarathy, bugzilla_rh, chayang, csieh, dgibson, ezzze, gchakkar, heil, juzhang, mario.ende, mazhang, michen, milan.zelenka, mkenneth, qzhang, sellis, soeren.grunewald, virt-bugs, virt-maint |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-05 16:05:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Thomas Heil
2014-04-24 12:32:29 UTC
On the host system we see messages like this:
--
INFO: task qemu-kvm:7153 blocked for more than 120 seconds.
Not tainted 2.6.32-431.11.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-kvm D 0000000000000000 0 7153 1 0x00000084
ffff8803dc5a7cf8 0000000000000086 ffff8803dc5a7e58 ffffffffa01aa9d8
0000000055436840 0000000000000000 ffffffffdc5a7c98 ffff8803dc5a7ca0
ffff880632a6f058 ffff8803dc5a7fd8 000000000000fbc8 ffff880632a6f058
Call Trace:
[<ffffffffa01aa9d8>] ? kvm_vm_ioctl+0x218/0x1030 [kvm]
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa022b3cd>] vhost_work_flush+0x9d/0x120 [vhost_net]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa022b469>] vhost_poll_flush+0x19/0x20 [vhost_net]
[<ffffffffa022d148>] vhost_net_flush_vq+0x38/0x60 [vhost_net]
[<ffffffffa022e3bf>] vhost_net_ioctl+0x31f/0x5d0 [vhost_net]
[<ffffffff8108b14e>] ? send_signal+0x3e/0x90
[<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
[<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
[<ffffffff810b1bfb>] ? sys_futex+0x7b/0x170
[<ffffffff8119e331>] sys_ioctl+0x81/0xa0
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
--
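One detail behind "can't be killed" in the summary: a task sleeping uninterruptibly ("D" state, as in the trace above) ignores even SIGKILL until the vhost flush completes. A minimal check, assuming stock procps, with the PID taken from the trace above:

```bash
# Show the state and wait channel of the hung process (PID from the trace).
ps -o pid,stat,wchan:32 -p 7153
# While hung, STAT shows "D" and WCHAN shows vhost_work_flush;
# "kill -9 7153" has no effect until the task leaves the D state.
```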
Thomas, thanks for taking the time to enter a bug report with us. We appreciate the feedback and look to use reports such as this to guide our efforts at improving our products. That being said, we're not able to guarantee the timeliness or suitability of a resolution for issues entered here, because this is not a mechanism for requesting support. If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain it receives the proper attention and prioritization to assure a timely resolution. For information on how to contact the Red Hat production support team, please visit: https://www.redhat.com/support/process/production/#howto

We noticed the same problem in a customer's RHEV infrastructure on a RHEL 6.5 host. The network freezes when heavy traffic occurs in a RHEL 6.5 virtual guest. Kernel log on the host:

Apr 30 16:03:36 rhelh11 kernel: INFO: task qemu-kvm:14796 blocked for more than 120 seconds.
Apr 30 16:03:36 rhelh11 kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Apr 30 16:03:36 rhelh11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 30 16:03:36 rhelh11 kernel: qemu-kvm D 0000000000000012 0 14796 1 0x00000084
Apr 30 16:03:36 rhelh11 kernel: ffff881817eb5cf8 0000000000000086 0000000000000000 0000000000000282
Apr 30 16:03:36 rhelh11 kernel: ffff881817eb5c88 0000000000000475 000000000000008d 000000000000000a
Apr 30 16:03:36 rhelh11 kernel: ffff881818606638 ffff881817eb5fd8 000000000000fbc8 ffff881818606638
Apr 30 16:03:36 rhelh11 kernel: Call Trace:
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa028f3cd>] vhost_work_flush+0x9d/0x120 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa028f469>] vhost_poll_flush+0x19/0x20 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa0291148>] vhost_net_flush_vq+0x38/0x60 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffffa02923bf>] vhost_net_ioctl+0x31f/0x5d0 [vhost_net]
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8122db91>] ? avc_has_perm+0x71/0x90
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8119e331>] sys_ioctl+0x81/0xa0
Apr 30 16:03:36 rhelh11 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

If you have made some progress with this issue, I will be happy to hear about it. I will collect sosreports and create an official Red Hat ticket if the problem occurs next time...

(In reply to Milan Zelenka from comment #4)
> If you have made some progress with this issue, I will be happy to hear
> about it. I will collect sosreports and create an official Red Hat ticket
> if the problem occurs next time...

Unfortunately we don't have any progress at the moment. If you have more information about the environment and the scenario where it happened, please let us know. An official support ticket would help us prioritize it. Thanks. Meanwhile, I'm asking our QE team to try to reproduce it, or to check whether they have any record of such a problem happening in the past.

Cannot reproduce this bug with either qemu-kvm-rhev-0.12.1.2-2.415.el6_5.8 or qemu-kvm-0.12.1.2-2.415.el6_5.8.
Host:
kernel-2.6.32-431.11.2.el6
qemu-kvm-0.12.1.2-2.415.el6_5.8/qemu-kvm-rhev-0.12.1.2-2.415.el6_5.8
Guest:
kernel-2.6.32-431.11.2.el6
Steps:
1. Boot two VMs with a virtio-net-pci device and vhost enabled:
-netdev tap,id=hostnet2,vhost=on \
-device virtio-net-pci,netdev=hostnet2,id=net2,mac=22:11:22:45:61:97 \
2. Start 16 netperf instances by running the script below in each guest (a UDP_STREAM variant is sketched after the result below):
#!/bin/bash
# Launch 16 parallel netperf TCP_STREAM sessions against the peer guest.
COUNT=0
while [ $COUNT -lt 16 ]
do
    netperf -H 10.66.83.169 -l 600 &
    COUNT=$((COUNT + 1))
done
Result:
qemu-kvm and the guest keep working; the problem was not observed.
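The next comment retests with UDP_STREAM; a hedged variant of the script's netperf call for that case, using netperf's `-t` test selector (same target and duration as above):

```bash
# UDP instead of the default TCP_STREAM; a drop-in replacement for the
# netperf line in the loop above.
netperf -H 10.66.83.169 -l 600 -t UDP_STREAM &
```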
Hi Thomas,

We could not reproduce this problem. Could you please provide more detailed steps?
1. Could you paste the qemu-kvm command line here?
2. How did you produce the network traffic?
3. How many VMs are running on the host?

Thanks, Mazhang.

Tested UDP_STREAM instead of TCP_STREAM, and tried rebooting and shutting down the guest during heavy traffic; still cannot reproduce this bug. Will run netperf for 12 hours and wait for the result.

We are also trying to reproduce it with the current kernel, 2.6.32-431.17.1.el6.x86_64. We have two Varnish VMs on this host, and the traffic is generated from outside, not from the same machine. Normally it takes about one day to reproduce it.

After the 12-hour netperf UDP_STREAM test, both the guest and host networks still work fine; the problem was not hit.
switch Link encap:Ethernet HWaddr 00:23:AE:77:09:7F
inet addr:10.66.83.169 Bcast:10.66.83.255 Mask:255.255.254.0
inet6 addr: fe80::223:aeff:fe77:97f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1896129319 errors:0 dropped:0 overruns:0 frame:0
TX packets:147362756 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:116150277766554 (105.6 TiB) TX bytes:9860516267 (9.1 GiB)
tap0 Link encap:Ethernet HWaddr 96:8C:09:11:5F:B2
inet6 addr: fe80::948c:9ff:fe11:5fb2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:863941905 errors:0 dropped:0 overruns:0 frame:0
TX packets:1520034 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:56630210481297 (51.5 TiB) TX bytes:140195103 (133.7 MiB)
tap1 Link encap:Ethernet HWaddr CA:DE:4C:17:66:EF
inet6 addr: fe80::c8de:4cff:fe17:66ef/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:865269257 errors:0 dropped:0 overruns:0 frame:0
TX packets:1520151 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:56717423460025 (51.5 TiB) TX bytes:140153587 (133.6 MiB)
Hi, I'm a colleague of Thomas Heil; here is my update after some tests: since the last kernel update, 2.6.32-431.17.1.el6.x86_64, I can't reproduce this issue. Maybe it was fixed in this update?

>> 1. Could you paste the qemu-kvm command line here?

qemu 8102 26.3 16.7 17476012 16635096 ? Sl May12 449:58 /usr/libexec/qemu-kvm -name varnish-41 -S -M rhel6.3.0 -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 16384 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid a5ad77e6-6b2f-7268-c23d-38462ae2a9d7 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/varnish-41.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/varnish-41-root.img,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/vg01/varnish-41-swap,if=none,id=drive-virtio-disk1,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c5:eb:b0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:2 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

>> 2. How did you produce the network traffic?

- rsync of some files, up- and downstream, to another server across the same NIC, plus some HTTP traffic to the Varnish web cache

>> 3. How many VMs are running on the host?

- from 3 up to 7 VMs are in use

We use primarily TCP traffic!

Hi Thommy,

Thanks for your reply. It seems your host is RHEL 6.5, but qemu-kvm was run with "-M rhel6.3.0", which is an old machine type. Could you please try with "-M rhel6.5.0"?

Thanks, Mazhang.

Same issue in our production environment. The issue disappeared when vhost-net was disabled for the VMs.
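For anyone needing the workaround mentioned above: disabling vhost-net means the tap backend falls back to QEMU's userspace virtio-net path. A minimal hypothetical sketch of the relevant command-line change (all other options elided; with libvirt the equivalent is a `<driver name='qemu'/>` element on the interface):

```bash
# vhost=off keeps virtio-net in the guest but moves packet processing from
# the vhost kernel thread into QEMU userspace (slower, but sidesteps the hang).
/usr/libexec/qemu-kvm -M rhel6.5.0 -enable-kvm -m 1024 \
    -netdev tap,id=hostnet0,vhost=off \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:12:34:56
```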
dmesg log:
INFO: task qemu-kvm:10092 blocked for more than 120 seconds.
Not tainted 2.6.32-431.11.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-kvm D 0000000000000007 0 10092 1 0x00000084
ffff881862bd3cf8 0000000000000086 ffff881862bd3e58 ffffffffa02099d8
ffff880c6a2b9728 0000000000000001 ffff881862bd3ca8 ffffffff81169339
ffff881869e0a5f8 ffff881862bd3fd8 000000000000fbc8 ffff881869e0a5f8
Call Trace:
[<ffffffffa02099d8>] ? kvm_vm_ioctl+0x218/0x1030 [kvm]
[<ffffffff81169339>] ? shmem_check_acl+0x29/0x80
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa01e43cd>] vhost_work_flush+0x9d/0x120 [vhost_net]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa01e4469>] vhost_poll_flush+0x19/0x20 [vhost_net]
[<ffffffffa01e6148>] vhost_net_flush_vq+0x38/0x60 [vhost_net]
[<ffffffffa01e73bf>] vhost_net_ioctl+0x31f/0x5d0 [vhost_net]
[<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
[<ffffffff8119a749>] ? do_unlinkat+0xb9/0x260
[<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119e331>] sys_ioctl+0x81/0xa0
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
process info:
[root@OE-VDI1 ~]# ps axopid,comm,wchan | grep 10092
10092 qemu-kvm vhost_work_flush
10116 vhost-10092 -
[root@OE-VDI1 ~]#
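The worker thread naming above (vhost-10092 for qemu-kvm PID 10092) makes it easy to scan a whole host for affected guests. A small sketch, assuming stock procps and awk:

```bash
#!/bin/bash
# List every qemu-kvm process blocked in vhost_work_flush together with its
# vhost worker thread (the worker is named vhost-<qemu pid>).
ps -eo pid,comm,wchan:32 --no-headers |
awk '$2 == "qemu-kvm" && $3 == "vhost_work_flush" { print $1 }' |
while read -r pid; do
    echo "qemu-kvm $pid blocked in vhost_work_flush; worker thread:"
    ps -eo pid,comm --no-headers | awk -v w="vhost-$pid" '$2 == w'
done
```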
qemu parameters:
/usr/libexec/qemu-kvm -name i-00000070 -S -M rhel6.5.0 -cpu SandyBridge,+erms,+smep,+fsgsbase,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 3072 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 7683c648-c5ac-4d15-9460-7a341ea3db0f -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/i-00000070.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/opt/storage/instances/i-00000070/disk_0,if=none,id=drive-virtio-disk2,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2,bootindex=2 -drive file=/opt/storage/instances/i-00000070/disk_1,if=none,id=drive-virtio-disk3,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk3,id=virtio-disk3 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:03:fe:65:50:43,bus=pci.0,addr=0x3 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5970,addr=0.0.0.0,image-compression=auto_glz,jpeg-wan-compression=auto,streaming-video=filter,seamless-migration=on -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
I'm working on a customer case with what looks like the same bug. Bumping priority accordingly.

I have a qemu gcore for that case - it looks like the vcpu thread is blocked uninterruptibly handling a CPU IO. I don't have kernel-side info, unfortunately, but there are also the vhost blocked-thread warnings like the ones above, so it's probably the same issue.

(In reply to David Gibson from comment #16)
> I'm working on a customer case with what looks like the same bug. Bumping
> priority accordingly.
>
> I have a qemu gcore for that case - it looks like the vcpu thread is blocked
> uninterruptibly handling a CPU IO. I don't have kernel-side info,
> unfortunately, but there are also the vhost blocked-thread warnings like
> the ones above, so it's probably the same issue.

David, would your customer be willing to try a test kernel? There was a very recent patch to vhost that prevents it from hogging the CPU when it is very busy, and it certainly looks like that is the scenario here. I'd need a desired kernel version to build a test kernel.

Thanks,
-vlad

*** This bug has been marked as a duplicate of bug 1095627 ***

Can bug 1095627 be made public, please? Thanks.

As a Red Hatter, I also can't access 1095627.