Bug 469859 - F10 kvm: network stall with qemu rtl8139 NIC emulation
Summary: F10 kvm: network stall with qemu rtl8139 NIC emulation
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kvm
Version: 10
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Mark McLoughlin
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 476452
Depends On:
Blocks:
 
Reported: 2008-11-04 14:29 UTC by James Laska
Modified: 2013-09-02 06:29 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-12-18 06:44:44 UTC
Type: ---
Embargoed:


Attachments
tcpdump.log -i br0 (virt host) (348.08 KB, text/plain)
2008-11-04 14:29 UTC, James Laska
dmesg (43.52 KB, text/plain)
2008-11-04 14:30 UTC, James Laska
/var/log/messages (183.08 KB, text/plain)
2008-11-04 14:30 UTC, James Laska
/var/log/libvirt/qemu/vguest1.log (i386 guest) (730 bytes, text/plain)
2008-11-04 17:15 UTC, James Laska
/var/log/libvirt/qemu/vguest2.log (x86_64 guest) (681 bytes, text/plain)
2008-11-04 17:15 UTC, James Laska
Screenshot showing the kernel panic when trying to install RHEL4.8 under kvm (185.06 KB, image/png)
2008-12-11 10:40 UTC, Hans de Goede
Screenshot showing slightly different kernel panic when trying to install RHEL4.8 under kvm (117.60 KB, image/png)
2008-12-11 10:41 UTC, Hans de Goede

Description James Laska 2008-11-04 14:29:31 UTC
Created attachment 322433
tcpdump.log -i br0 (virt host)

Description of problem:

Network traffic to/from my bridged KVM guests stalls during large file transfers.  This can be observed while scp'ing a DVD.iso to a guest, and often just by starting a network installation on a guest (stalls while transferring install.img).

I'm not clear on what component this should be assigned to.  Please advise.

Version-Release number of selected component (if applicable):

libvirt-0.4.6-3.fc10.x86_64
kernel-2.6.27.4-68.fc10.x86_64
bridge-utils-1.2-6.fc10.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install F10 x86_64
2. Install F10 KVM x86_64 or i386 guest
3. SCP a large file from the F10 host to the F10 KVM guest
  
Actual results:

$ scp ~guest/Download/Fedora-10-Preview-i386-DVD.iso root.34.91:/iso/
root.34.91's password: 
Fedora-10-Preview-i386-DVD.iso                                                                               0% 5088KB  26.2KB/s - stalled -

Expected results:

No network stall

Additional info:

 * `service network restart` is required to restore networking to the guests.

 * I've seen this occur while downloading install.img during installation of bridged guests.  Rebooting eventually works around the issue.

Comment 1 James Laska 2008-11-04 14:30:01 UTC
Created attachment 322434
dmesg

Comment 2 James Laska 2008-11-04 14:30:31 UTC
Created attachment 322435
/var/log/messages

Comment 3 Daniel Berrangé 2008-11-04 16:54:31 UTC
Stalled network traffic isn't a libvirt problem; it's almost certainly a KVM device emulation problem, so I'm changing the component to KVM.

Comment 4 James Laska 2008-11-04 17:15:26 UTC
Created attachment 322451
/var/log/libvirt/qemu/vguest1.log (i386 guest)

Comment 5 James Laska 2008-11-04 17:15:46 UTC
Created attachment 322452
/var/log/libvirt/qemu/vguest2.log (x86_64 guest)

Comment 6 Fabian Deutsch 2008-11-05 00:51:22 UTC
I experienced similar problems. This might be a regression; have a look at:

http://article.gmane.org/gmane.comp.emulators.kvm.devel/21423

Comment 7 Mark McLoughlin 2008-11-05 14:53:35 UTC
jlaska: could you try to reproduce this with the e1000 and virtio NICs? The default is rtl8139.

You can do that by using "virsh edit MyDomain" to edit the guest's definition and changing the <interface> element to add a <model> tag:

    <interface type='bridge'>
      ...
      <model type='virtio'/>
    </interface>

Also, note that if you tell virt-install (e.g. with --os-variant=fedora10) or virt-manager that you are installing F9/F10, it will use virtio automatically.

(It's not the timer issue I suggested in the link Fabian points to; that's fixed in F10)
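When several guests need the same change, the `virsh edit` step can be scripted. The sketch below is a minimal example and is not part of libvirt or virsh: it assumes you have already saved the domain XML (for instance with `virsh dumpxml MyDomain`), and `set_nic_model` is a hypothetical helper name. It adds or overwrites `<model type='...'/>` on every bridged `<interface>`; the modified XML could then be loaded back with `virsh define`.

```python
import xml.etree.ElementTree as ET


def set_nic_model(domain_xml: str, model: str = "virtio") -> str:
    """Return domain_xml with every bridged <interface> using the given NIC model.

    Hypothetical helper mirroring the manual `virsh edit` change described above.
    """
    root = ET.fromstring(domain_xml)
    for iface in root.iter("interface"):
        if iface.get("type") != "bridge":
            continue  # only touch bridged NICs, as in this bug report
        model_el = iface.find("model")
        if model_el is None:
            # No <model> yet: the hypervisor default (rtl8139 here) would apply.
            model_el = ET.SubElement(iface, "model")
        model_el.set("type", model)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = """<domain type='kvm'>
  <name>vguest1</name>
  <devices>
    <interface type='bridge'>
      <source bridge='br0'/>
    </interface>
  </devices>
</domain>"""
    print(set_nic_model(sample, "e1000"))
```

Editing the parsed tree rather than doing string substitution keeps any existing `<model>` element from being duplicated.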

Comment 8 Fabian Deutsch 2008-11-05 19:21:44 UTC
Is the VM guest on a host connected to the network via a 1 Gbit link?

If so, you might post the guest's /proc/net/softnet_stat after it stalls.
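A minimal Python sketch for reading that file, assuming the usual layout in which each row holds whitespace-separated hexadecimal counters and the first three columns are packets processed, packets dropped, and `time_squeeze` (the number of times receive processing ran out of budget with work still pending). Later columns differ between kernel versions, so they are ignored here, and `parse_softnet_stat` is a hypothetical helper name. Comparing snapshots taken before and after a stall would show whether the dropped or time_squeeze counters are climbing.

```python
from pathlib import Path


def parse_softnet_stat(text: str) -> list[dict[str, int]]:
    """Parse /proc/net/softnet_stat text into one dict per row.

    Only the first three hex columns are interpreted; the meaning of the
    remaining columns varies between kernel versions.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        processed, dropped, squeezed = (int(f, 16) for f in fields[:3])
        rows.append(
            {"processed": processed, "dropped": dropped, "time_squeeze": squeezed}
        )
    return rows


if __name__ == "__main__":
    path = Path("/proc/net/softnet_stat")
    if path.exists():
        for row_index, row in enumerate(parse_softnet_stat(path.read_text())):
            print(row_index, row)
```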

Comment 9 James Laska 2008-11-05 20:49:46 UTC
markmc: After making the suggested change noted in comment #7, I'm not able to reproduce this anymore.

Comment 10 Mark McLoughlin 2008-11-06 10:29:03 UTC
Okay, so we need to do further debugging to figure out whether this is a qemu rtl8139 emulation issue or a problem with the driver in the guest.

The fact that upstream 2.6.28-rc2 guest kernel worked for me suggests the latter, but it still needs further confirmation.

Comment 11 Daniel Berrangé 2008-11-06 11:24:27 UTC
I think I'd still bet on a bug in QEMU's emulation of the rtl8139; if the new kernel works, I'm more inclined to think it has merely changed in a way that avoids tickling a QEMU bug.

FYI, I checked Xen's QEMU tree, which traditionally has a lot of rtl8139 fixes, but there's now only one that isn't in upstream QEMU, and its description suggests it is only relevant for Windows:

changeset:   17420:40c0dda6eae6
user:        Keir Fraser <keir.fraser>
date:        Wed Apr 09 16:03:40 2008 +0100
files:       tools/ioemu/hw/rtl8139.c
description:
ioemu: Fix rtl8139 emulation so that reboot works correctly in 64-bit
Windows VMs. Return an error if the guest OS tries to transmit a
packet with the transmitter disabled, so that it doesn't spin forever
waiting for it to complete.

Signed-off-by: Steven Smith <Steven.Smith.com>

Comment 12 Bug Zapper 2008-11-26 04:44:21 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 13 Hans de Goede 2008-12-11 09:08:03 UTC
I've been seeing the network stall, causing network-based virt-installs to hang too, while doing RHEL-5.3 and F-10 test installs inside an F-10 kvm guest. Today I've been doing some RHEL-4.8 installs, and those outright oops as soon as the stage1 loader tries to download install.img.

I've also seen the virtual network sometimes being painfully slow (at least 10 times slower than normal). Strangely enough, passing --sound to the virt-install command fixes this slowness (this was observed with RHEL 5.3 test installs), so this might be IRQ-routing related.

Note that I've only seen the network stalls in the fast case, i.e. the case where I passed --sound (I've never bothered to finish a slow install).

I've done the following to work around the 100% reproducible RHEL 4 oops:
--- FullVirtGuest.py~   2008-12-11 09:55:39.000000000 +0100
+++ FullVirtGuest.py    2008-12-11 09:55:39.000000000 +0100
@@ -68,7 +68,10 @@
             "rhel3": { "label": "Red Hat Enterprise Linux 3",
                        "distro": "rhel" },
             "rhel4": { "label": "Red Hat Enterprise Linux 4",
-                       "distro": "rhel" },
+                       "distro": "rhel",
+                          "devices" : {
+                            "net"  : { "model" : [ (["kvm"], "e1000") ] }
+                          }},
             "rhel5": { "label": "Red Hat Enterprise Linux 5",
                        "distro": "rhel" },
             "fedora5": { "label": "Fedora Core 5", "distro": "fedora" },

So this is definitely an issue with the rtl8139 support. May I suggest changing the default to e1000 as a workaround until this is fixed?
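The patch above attaches a per-hypervisor NIC override to the rhel4 entry. The following sketch illustrates how such a table could be consulted; it is not virtinst's actual lookup code. The table is a trimmed copy of the patched entries, `pick_nic_model` is a hypothetical function, and the rtl8139 fallback reflects the default mentioned in comment 7.

```python
# Hypothetical table shaped like the patched FullVirtGuest.py entries:
# each "model" value is a list of (hypervisor_types, nic_model) pairs.
OS_VARIANTS = {
    "rhel4": {
        "label": "Red Hat Enterprise Linux 4",
        "distro": "rhel",
        "devices": {"net": {"model": [(["kvm"], "e1000")]}},
    },
    "rhel5": {"label": "Red Hat Enterprise Linux 5", "distro": "rhel"},
}


def pick_nic_model(os_variant: str, hv_type: str, default: str = "rtl8139") -> str:
    """Return the NIC model for an OS variant on a hypervisor, else the default."""
    entry = OS_VARIANTS.get(os_variant, {})
    overrides = entry.get("devices", {}).get("net", {}).get("model", [])
    for hv_types, model in overrides:
        if hv_type in hv_types:
            return model  # first matching hypervisor override wins
    return default


if __name__ == "__main__":
    print(pick_nic_model("rhel4", "kvm"))
    print(pick_nic_model("rhel5", "kvm"))
```

Keying the override on the hypervisor type means only kvm guests get e1000, while other hypervisors keep their existing default.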

Comment 14 Chris Lalancette 2008-12-11 09:27:28 UTC
(In reply to comment #13)
> I've been seeing the network stall, causing net based virt-install's to hang
> too, while doing RHEL-5.3 and F-10 test installs inside a F-10 kvm guest. Today
> I've been doing some RHEL-4.8 installs and those outright oops as soon as the
> stage1 loader tries to download install.img .

FYI, this last one is *probably* a bug in the 4.8 kernel; it's BZ 474479 (despite being described as an ia64 issue, it affects all RHEL-4 kernels).

Chris Lalancette

Comment 15 Hans de Goede 2008-12-11 09:38:43 UTC
(In reply to comment #14)
> FYI; this last one is *probably* a bug in the 4.8 kernel; it's BZ 474479
> (despite being called an ia64 issue, it affects all RHEL-4 kernels).
> 
> Chris Lalancette

If that is the case may I then advocate to apply my workaround from comment 13 to python-virtinst ?

Comment 16 Chris Lalancette 2008-12-11 09:45:33 UTC
Your workaround from Comment #13 will have no effect on the RHEL-4 bug; it's in the generic network stack, so it will affect all drivers.

Chris Lalancette

Comment 17 Hans de Goede 2008-12-11 09:53:12 UTC
(In reply to comment #16)
> Your workaround from Comment #13 will have no effect on the RHEL-4 bug; it's in
> the generic network stack, so it will effect all drivers.
> 
> Chris Lalancette

In that case, that is not the bug I'm hitting: I've just completed a RHEL-4.8 nightly install with my workaround, whereas without it the install wouldn't even start downloading install.img.

Comment 18 Chris Lalancette 2008-12-11 10:07:21 UTC
Please get a backtrace of the RHEL-4.8 crash, so we can compare it with the other BZ.

Chris Lalancette

Comment 19 Hans de Goede 2008-12-11 10:40:36 UTC
Created attachment 326600
Screenshot showing the kernel panic when trying to install RHEL4.8 under kvm

Ok, here is a screenshot of the kernel panic (not an oops but a panic; sorry I wasn't clear before) that I get when trying to install the latest RHEL 4.8 i386 nightly in kvm.

I also have a different dump; I believe the cause is the same, and I'll attach that too.

Comment 20 Hans de Goede 2008-12-11 10:41:18 UTC
Created attachment 326601
Screenshot showing slightly different kernel panic when trying to install RHEL4.8 under kvm

Comment 21 Chris Lalancette 2008-12-11 10:49:24 UTC
OK, yeah.  That is the same bug as BZ 474479.  The thing is, it doesn't necessarily happen all of the time, and certain things tickle it more than others.  In any case, we can't work around all guest bugs, especially ones in pre-released versions.  The above bug will be fixed for 4.8 (there's already a patch pending), so there is no real need for the patch in Comment #13.  Whether we change the default to e1000 to work around other slowness issues is up in the air.

Chris Lalancette

Comment 22 François Cami 2009-02-02 18:17:32 UTC
*** Bug 476452 has been marked as a duplicate of this bug. ***

Comment 23 James G. Brown III 2009-03-04 00:19:26 UTC
Right, so in Fedora 10 proper I have the same issue across the board, with RHEL5.x installs being extremely slow. Watching from vty3 during install, it takes approximately 3 minutes 10 seconds for url/images/updates.img to be reported as not found before the installer moves on to url/disc1/updates.img for another 3 minutes 10 seconds, then to url/images/product.img for yet another 3 minutes 10 seconds, and so on.

This is killing me...

# virt-install -n guest -r 512 -s 5 --os-type=linux --os-variant=rhel5 --accelerate -l http://url/RHEL-5-Server/U1/i386/os -x "text" -f /var/lib/libvirt/images/guest.img 


I am happy to provide whatever information is necessary.

Comment 24 Mark McLoughlin 2009-03-20 16:50:44 UTC
james: in order to work around the issue, try hacking virt-install to use e1000 for 5.x guests like Hans did in comment #13

If anyone can check whether this is reproducible with an F11 host, that would be very helpful.

Comment 25 Daniel Berrangé 2009-03-20 16:57:10 UTC
I still see stalls using QEMU from an F11 host with the rtl8139 NIC. Sometimes it gets stuck while anaconda is downloading the stage2 image; other times it gets stuck while downloading the RPMs. This seems flakier than before, because rtl8139 used to work reasonably reliably in the past.

Comment 26 Lance Albertson 2009-04-13 19:15:47 UTC
Just a note: I'm hitting the same problem but using the virtio driver instead of the rtl8139 driver while installing F10. I was able to work around it by switching it to e1000.

Comment 27 Mark McLoughlin 2009-04-19 14:47:11 UTC
(In reply to comment #26)
> Just a note: I'm hitting the same problem but using the virtio driver instead
> of the rtl8139 driver while installing F10. I was able to work around it by
> switching it to e1000.  

Please file a new bug about your virtio hang; they are quite likely to be different issues.

Comment 28 Bug Zapper 2009-11-18 08:45:02 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 29 Fabian Deutsch 2009-11-25 12:22:02 UTC
Well, it seems as if this issue is solved or a known restriction. 

Can we close this bug?

Comment 30 Bug Zapper 2009-12-18 06:44:44 UTC
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

