Description of problem:
The following was reported upstream:
http://www.spinics.net/lists/linux-virtualization/msg12361.html

Under harsh testing conditions, including low memory, the guest would stop receiving packets. With this patch applied we no longer see any problems in the driver while performing these tests for extended periods of time.

The bug is that if the ring is consumed before NAPI is enabled, we don't get another interrupt. The regular interrupt handler fixes this by processing packets in-place, but the OOM handler missed this check.

Version-Release number of selected component (if applicable):

How reproducible:
Not sure. This was reported upstream, and looking at the code makes it clear it applies to RHEL 6.0.

Steps to Reproduce:
1. Stress memory so atomic allocations start failing (how to do this? not sure).
2. At the same time, stress with large incoming packets.

Actual results:
At some point networking will stop and won't recover when

Expected results:
Keeps going slowly.

Additional info:
Stress with NFS reads might trigger this?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This bug can be reproduced through the following scenario:

+---------------+         +------------------------------+
|   Netserver   |   LAN   |                              |
|---------------|---------|Netperf(*2000+ tasks)         |
|   VM (512M)   |         |                              |
+---------------+         +------------------------------+

Run thousands of netperf clients in background to stress the netserver.

kernel-2.6.32-113.el6.x86_64
Patch(es) available on kernel-2.6.32-117.el6
Reproduced on kernel-2.6.32-116.el6.x86_64, and verified on kernel-2.6.32-117.el6.x86_64. PASS.

Steps:
1) boot guest with 512M mem and virtio net.
2) run netserver inside guest.
3) on host, launch 2000 netperf clients in background to stress netserver.
4) ping guest (network lost, need to restart guest network to restore)

Update guest kernel to 2.6.32-117 and test again, no network lost.

CLI:
/usr/libexec/qemu-kvm -S -M rhel6.1.0 -enable-kvm -m 512 \
  -smp 2,sockets=2,cores=1,threads=1 -name RHEL6.1-virtio_net_test \
  -uuid 362f0255-b6e4-2a75-9506-af9c2e5ceb5d -nodefconfig -nodefaults \
  -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/RHEL6.1-virtio_net_test.monitor,server,nowait \
  -mon chardev=monitor,mode=control -rtc base=utc -boot c \
  -drive file=/home/khong/RHEL6.1-virtio_net_test.img,if=none,id=drive-virtio-disk0,format=raw,cache=none \
  -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 \
  -netdev tap,fd=20,id=hostnet0,vhost=on,vhostfd=22 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1f:f2:62,bus=pci.0,addr=0x3 \
  -chardev pty,id=serial0 -device isa-serial,chardev=serial0 \
  -usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus \
  -device AC97,id=sound0,bus=pci.0,addr=0x4 \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
Script to run netperf clients:

#!/bin/sh
# Launch 2000 netperf clients in the background against the guest.
ip=$guest_ip
i=0
while [ "$i" -lt 2000 ]
do
    netperf -H "$ip" -l 300 &
    i=$((i + 1))
    echo "launch Client-No.$i"
done
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Intensive usage of resources on a guest led to a failure of networking on that guest: packets could no longer be received. The failure occurred when a DMA (Direct Memory Access) ring was consumed before NAPI (New API; an interface for networking devices which makes use of interrupt mitigation techniques) was enabled, which resulted in a failure to receive the next interrupt request. The regular interrupt handler was not affected in this situation (because it can process packets in-place); however, the OOM (Out Of Memory) handler did not detect the aforementioned situation and caused networking to fail. With this update, NAPI is subsequently scheduled for each napi_enable operation; thus, networking no longer fails under the aforementioned circumstances.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html