This bug has been copied from bug #574785 and has been proposed to be backported to 4.8 z-stream (EUS).
Committed in 89.0.25.
Steps:
1. Boot a guest with the CLI:
   /usr/libexec/qemu-kvm -smp 2 -m 2G -drive file=/root/rhel4.8-64-virtio.bak,if=virtio,boot=on -net nic,vlan=0,macaddr=00:1a:4a:91:00:37,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup -uuid `uuidgen` -no-hpet -usbdevice tablet -rtc-td-hack -startdate now -cpu qemu64,+sse2 -monitor stdio -vnc :1 -boot c
2. Ping baidu.com to check that the guest network works well.
3. Run ./srv in the guest.
4. Run "./stress.sh $guestip" from three hosts at the same time. Repeat several times.

Result:
- On the host: "connect errorpid=29193" is displayed after the guest loses its IP, and the guest can no longer be pinged from the host.
- On rhel4.8-32: the guest switches to the login page and cannot start the X server after it loses its IP.
- On rhel4.8-64: it only loses its IP; no other issue.
Created attachment 411174 [details]: script

Kernel: 89.0.25
Created attachment 411175 [details]: issue image
Summary: Guest loses network with virtio-net when under OOM conditions. The guest network can be restored with "service network restart".

Host info:
2.6.18-194.el5
kvm-83-164.el5

Guest info:
2.6.9-89.0.25.ELsmp

Steps: see https://bugzilla.redhat.com/show_bug.cgi?id=554078#c5
Did we apply the OOM virtio patch to RHEL4, i.e., bugzilla 554078? If not then this might be the cause of the test failure. The patch Vitaly referred to above is for TX only AFAICS. In these failure cases it is very important to determine which direction (i.e., guest=>host/TX or host=>guest/RX) is failing. You can do so by looking at the packet counters on the virtio_net interface in the guest and the tun interface on the host side.
(In reply to comment #10)
> Did we apply the OOM virtio patch to RHEL4, i.e., bugzilla 554078? If not then
> this might be the cause of the test failure.

No, that patch is not in RHEL4. Do we need to apply that patch when testing 580089, or is looking at the packet counters and the tun interface on the host side enough?

> The patch Vitaly referred to above is for TX only AFAICS. In these failure
> cases it is very important to determine which direction (i.e., guest=>host/TX
> or host=>guest/RX) is failing. You can do so by looking at the packet counters
> on the virtio_net interface in the guest and the tun interface on the host
> side.

Virt-qe has left for the day; we will ask them to test it tomorrow. Could you give us specific instructions on how to look at the packet counters and the tun interface on the host side? Thanks a lot.
In response to comment 10 - No, the patch for 554078 is not present in 4.9 or 4.8.z.
(In reply to comment #11)
> No, that patch is not in rhel4. need we apply that patch when testing 580089?

Of course you don't need to apply the patch yourself. That patch should be applied to RHEL4.

> or just looking at the package counters and tun interface on host side is
> enough?

As this bugzilla is about the TX direction, you should look at the counters to see whether the TX direction is still functioning. That is, if only the RX direction is broken, then you can consider this bugzilla to be fixed. We should open a different bugzilla for the RX direction.

> Virt-qe has now get off work, will ask them to test it tomorrow. could you give
> us some specific instructions on how to look at the packet counters and tun
> interface on the host side please?

You should look at the counters on vnetX in the host and on its corresponding ethX interface in the guest.

To see if TX is working, try to send packets from the guest to the host (e.g., a ping or anything that elicits an ARP would do). If the TX counter on ethX increases while the RX counter on vnetX does not, then you have a TX problem.

For RX, try to send packets from the host to the guest. If the TX counter on vnetX goes up while the RX counter on ethX does not, then you know that you have an RX problem.

Typically only one direction is stalled.
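The counter check described above can be scripted. A minimal sketch, assuming the standard Linux sysfs layout (/sys/class/net/&lt;iface&gt;/statistics); "lo" is used below only so the example is self-contained — in this scenario you would pass ethX inside the guest or vnetX on the host, and compare two readings taken before and after sending traffic:

```shell
#!/bin/sh
# Sketch: print the TX/RX packet counters for a network interface from sysfs.
# Run it once, send test traffic (e.g. a ping), then run it again and compare.
show_counters() {
    iface="${1:-lo}"
    stats="/sys/class/net/$iface/statistics"
    printf '%s tx_packets=%s rx_packets=%s\n' \
        "$iface" "$(cat "$stats/tx_packets")" "$(cat "$stats/rx_packets")"
}

show_counters lo
```

If the guest-side TX counter rises while the host-side vnetX RX counter stays flat, the TX path is stalled; the mirror-image pattern indicates an RX stall.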
(In reply to comment #13)
> To see if TX is working, try to send packets from the guest to the host (e.g.,
> a ping or anything that elicits an ARP would do). If the TX counter on ethX
> increases while the RX counter on vnetX does not then you have a TX problem.

Tested that TX was OK (both the TX counter on ethX and the RX counter on vnetX went up when pinging from inside the guest).

> For RX, try to send packets from the host to the guest. If the TX counter on
> vnetX goes up while the RX counter on ethX does not then you know that you have
> an RX problem.

RX failed as expected. BTW, neither the TX counter on vnetX nor the RX counter on ethX went up; they both stayed unchanged.

> Typically only one direction is stalled.
According to comment #14 this is fixed; setting it to VERIFIED.
(In reply to comment #14)
> > For RX, try to send packets from the host to the guest. If the TX counter on
> > vnetX goes up while the RX counter on ethX does not then you know that you have
> > an RX problem.
>
> RX failed as expected. BTW, neither TX counter on vnetX nor RX counter on ethX
> went up. They both kept unchanged.

Right, the TX counter would only go up during the early stages of the stall. Once the virtio queue is completely filled, you will observe what you saw.

Another way of diagnosing this is to run "tc -s qdisc". It should show a full backlog queue on the affected interface. Thanks!
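The "tc -s qdisc" check above can be automated by grepping the backlog line. A sketch, using a hard-coded sample in place of live output (the sample below is illustrative, not captured from this bug; on a real host you would pipe `tc -s qdisc show dev vnet0` into the same sed expression):

```shell
#!/bin/sh
# Sketch: detect a non-empty qdisc backlog, a sign of a stalled queue.
# Illustrative sample of "tc -s qdisc" output; replace the here-string with
# real output, e.g.: tc -s qdisc show dev vnet0
sample='qdisc pfifo_fast 0: dev vnet0 bands 3
 Sent 4120 bytes 42 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 98304b 256p requeues 0'

# Extract the packet count from the "backlog <bytes>b <packets>p" line.
backlog_pkts=$(printf '%s\n' "$sample" | sed -n 's/.*backlog [0-9]*b \([0-9]*\)p.*/\1/p')
if [ "${backlog_pkts:-0}" -gt 0 ]; then
    echo "queue stalled: $backlog_pkts packets backlogged"
fi
```

A persistently nonzero backlog on vnetX while the guest-side RX counter stays flat matches the stall described in this bug.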
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0394.html
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: A race condition caused TX to stop in a guest using the virtio_net driver.