Bug 580089 - virtio_net 'eth0' interface in a RHEL 4.8 KVM virtual machine becomes unresponsive due to stopped state [rhel-4.8.z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Vitaly Mayatskikh
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On: 574785
Blocks:
Reported: 2010-04-07 13:16 UTC by RHEL Program Management
Modified: 2011-01-30 22:17 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A race condition caused TX to stop in a guest using the virtio_net driver.
Clone Of:
Environment:
Last Closed: 2010-05-05 13:05:38 UTC
Target Upstream Version:
Embargoed:


Attachments
script (50.00 KB, application/x-tar)
2010-05-04 03:31 UTC, Suqin Huang
issue image (50.00 KB, application/x-tar)
2010-05-04 03:34 UTC, Suqin Huang


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2010:0394 | 0 | normal | SHIPPED_LIVE | Important: kernel security, bug fix, and enhancement update | 2010-05-05 13:05:05 UTC

Description RHEL Program Management 2010-04-07 13:16:49 UTC
This bug has been copied from bug #574785 and has been proposed
to be backported to 4.8 z-stream (EUS).

Comment 3 Vitaly Mayatskikh 2010-04-20 08:32:25 UTC
Committed in 89.0.25.

Comment 6 Suqin Huang 2010-05-04 03:30:08 UTC
1. Boot a guest with the following command line:
/usr/libexec/qemu-kvm -smp 2 -m 2G -drive file=/root/rhel4.8-64-virtio.bak,if=virtio,boot=on -net nic,vlan=0,macaddr=00:1a:4a:91:00:37,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup -uuid `uuidgen` -no-hpet -usbdevice tablet -rtc-td-hack -startdate now -cpu qemu64,+sse2 -monitor stdio -vnc :1 -boot c

2. Ping baidu.com to check that the guest network works.

3. Run ./srv in the guest.
4. Run "./stress.sh $guestip" from three hosts at the same time.
Repeat several times.

Result:
On the host: "connect errorpid=29193" is displayed after the guest loses its IP address.
The guest cannot be pinged from the host.

In rhel4.8-32:
The guest switches to the login page and cannot start the X server after it loses its IP address.

In rhel4.8-64:
The guest only loses its IP address; there is no other issue.

Comment 7 Suqin Huang 2010-05-04 03:31:24 UTC
Created attachment 411174
script

kernel: 89.0.25

Comment 8 Suqin Huang 2010-05-04 03:34:15 UTC
Created attachment 411175
issue image

Comment 9 Keqin Hong 2010-05-04 09:51:56 UTC
Summary: The guest loses its network connection with virtio net under OOM conditions. The guest network can be restarted with "service network restart".

Host info:
2.6.18-194.el5
kvm-83-164.el5

Guest info:
2.6.9-89.0.25.ELsmp

Steps: see https://bugzilla.redhat.com/show_bug.cgi?id=554078#c5

Comment 10 Herbert Xu 2010-05-04 13:13:59 UTC
Did we apply the OOM virtio patch to RHEL4, i.e., bugzilla 554078? If not then this might be the cause of the test failure.

The patch Vitaly referred to above is for TX only AFAICS.  In these failure cases it is very important to determine which direction (i.e., guest=>host/TX or host=>guest/RX) is failing.  You can do so by looking at the packet counters on the virtio_net interface in the guest and the tun interface on the host side.

Comment 11 Zhang Kexin 2010-05-04 14:49:58 UTC
(In reply to comment #10)
> Did we apply the OOM virtio patch to RHEL4, i.e., bugzilla 554078? If not then
> this might be the cause of the test failure.
No, that patch is not in RHEL4. Do we need to apply that patch when testing 580089, or is it enough to look at the packet counters and the tun interface on the host side?

> 
> The patch Vitaly referred to above is for TX only AFAICS.  In these failure
> cases it is very important to determine which direction (i.e., guest=>host/TX
> or host=>guest/RX) is failing.  You can do so by looking at the packet counters
> on the virtio_net interface in the guest and the tun interface on the host
> side.    
Virt-QE has already left for the day; we will ask them to test it tomorrow. Could you please give us specific instructions on how to look at the packet counters and the tun interface on the host side?

Thanks a lot.

Comment 12 Don Howard 2010-05-04 14:57:13 UTC
In response to comment 10 - 

No, the patch for 554078 is not present in 4.9 or 4.8.z.

Comment 13 Herbert Xu 2010-05-04 23:32:23 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > Did we apply the OOM virtio patch to RHEL4, i.e., bugzilla 554078? If not then
> > this might be the cause of the test failure.
> No, that patch is not in rhel4. need we apply that patch when testing 580089?

Of course you don't need to apply the patch yourself.  That patch should be applied to RHEL4.

> or just looking at the package counters and tun interface on host side is
> enough?

As this bugzilla is about the TX direction, you should look at the counters to see whether the TX direction is still functioning.  That is, if only the RX direction is broken then you can consider this bugzilla to be fixed.

We should open a different bugzilla for the RX direction.
    
> Virt-qe has now get off work, will ask them to test it tomorrow. could you give
> us some specific instructions on how to look at the packet counters and tun
> interface on the host side please?

You should look at the counters on vnetX in the host, and its corresponding ethX interface in the guest.

To see if TX is working, try to send packets from the guest to the host (e.g., a ping or anything that elicits an ARP would do).  If the TX counter on ethX increases while the RX counter on vnetX does not then you have a TX problem.

For RX, try to send packets from the host to the guest.  If the TX counter on vnetX goes up while the RX counter on ethX does not then you know that you have an RX problem.

Typically only one direction is stalled.
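
The counter comparison described above can be sketched as a small shell helper. This is only an illustration, assuming the packet counters are read from /proc/net/dev; the function names iface_packets and classify_direction are hypothetical and are not part of any tool mentioned in this bug.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Print "<rx_packets> <tx_packets>" for interface $1.
# $2 is the stats file; it defaults to /proc/net/dev.
iface_packets() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }            # "eth0:1234" -> "eth0 1234"
        $1 == ifc { print $3, $11 }  # $3 = RX packets, $11 = TX packets
    ' "${2:-/proc/net/dev}"
}

# Classify one direction from two packet-count deltas:
# $1 = increase of the sender's TX counter,
# $2 = increase of the receiver's RX counter.
classify_direction() {
    if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
        echo "stalled"
    elif [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
        echo "inconclusive"
    else
        echo "ok"
    fi
}
```

For the TX check, one would sample "iface_packets ethX" in the guest and "iface_packets vnetX" on the host before and after a ping from the guest, then pass the ethX TX delta and the vnetX RX delta to classify_direction. For the RX check the roles are swapped. The two samples come from different machines, so the deltas have to be carried over by hand.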

Comment 14 Keqin Hong 2010-05-05 02:37:21 UTC
(In reply to comment #13)

> To see if TX is working, try to send packets from the guest to the host (e.g.,
> a ping or anything that elicits an ARP would do).  If the TX counter on ethX
> increases while the RX counter on vnetX does not then you have a TX problem.
> 

Tested that TX was OK. (both TX counter on ethX and RX counter on vnetX went up when pinging from inside guest)

> For RX, try to send packets from the host to the guest.  If the TX counter on
> vnetX goes up while the RX counter on ethX does not then you know that you have
> an RX problem.

RX failed as expected. BTW, neither the TX counter on vnetX nor the RX counter on ethX went up. Both remained unchanged.

> 
> Typically only one direction is stalled.

Comment 15 Zhang Kexin 2010-05-05 02:49:50 UTC
According to comment #14, this is fixed; setting it to VERIFIED.

Comment 17 Herbert Xu 2010-05-05 08:08:50 UTC
(In reply to comment #14)
>
> > For RX, try to send packets from the host to the guest.  If the TX counter on
> > vnetX goes up while the RX counter on ethX does not then you know that you have
> > an RX problem.
> 
> RX failed as expected. BTW, neither TX counter on vnetX nor RX counter on ethX
> went up. They both kept unchanged.

Right, the TX counter would only go up during the early stages of the stall.  Once the virtio queue is completely filled, you will observe what you saw.  Another way of diagnosing this is to run "tc -s qdisc".  It should show a full backlog queue on the affected interface.
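
The "tc -s qdisc" check can be scripted along these lines. This is a rough sketch: the helper name max_backlog_pkts is hypothetical, and it assumes the statistics contain a "backlog" token followed by a packet count such as "12p", which may differ between iproute versions.

```shell
#!/bin/sh
# Hypothetical helper: read "tc -s qdisc show dev <iface>" output on stdin
# and print the largest packet backlog found (0 if none is reported).
# Assumes backlog lines look like "backlog 0b 12p ..." or "backlog 12p".
max_backlog_pkts() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i != "backlog") continue
                # Scan the tokens after "backlog" for the packet count.
                for (j = i + 1; j <= NF; j++) {
                    if ($j ~ /^[0-9]+p$/) {
                        n = $j + 0          # numeric prefix of "12p"
                        if (n > max) max = n
                        break
                    }
                }
            }
        }
        END { print max + 0 }
    '
}
```

It would be used as "tc -s qdisc show dev vnetX | max_backlog_pkts" on the affected interface. A value that stays non-zero and does not drain over repeated runs would be consistent with the full backlog queue described above.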

Thanks!

Comment 18 errata-xmlrpc 2010-05-05 13:05:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0394.html

Comment 19 Douglas Silas 2011-01-30 22:17:01 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A race condition caused TX to stop in a guest using the virtio_net driver.

