Bug 490266 - virtio_net tx stall with segmentation offload
Summary: virtio_net tx stall with segmentation offload
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F11VirtBlocker
 
Reported: 2009-03-14 13:50 UTC by George Iosif
Modified: 2009-03-30 09:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-03-30 09:05:12 UTC
Type: ---
Embargoed:


Attachments
[qemu server] hardware (2.16 KB, text/plain)
2009-03-14 13:50 UTC, George Iosif
qemu config for [qemu guest] (1.41 KB, text/plain)
2009-03-14 13:51 UTC, George Iosif
kvm guest strace output (381.31 KB, application/x-zip-compressed)
2009-03-26 23:25 UTC, George Iosif
gso: Fix support for linear packets (411 bytes, patch)
2009-03-29 02:02 UTC, Herbert Xu

Description George Iosif 2009-03-14 13:50:14 UTC
Created attachment 335205 [details]
[qemu server] hardware

Description of problem:
*setup: [laptop]---[switch]---[qemu server]==bridge==[qemu guest]
*problem: if I try to copy a file (e.g. 200 MB) - tried with scp and via Samba share - from [qemu guest] to [laptop], the transfer stalls. However, if I try to copy the file from [laptop] to [qemu guest] machine, it works without issues.


Version-Release number of selected component (if applicable):
* [qemu server] = FC rawhide
* [qemu guest] = Slackware 12.2 configured with virtio network


How reproducible:
Always

Steps to Reproduce:
1. copy a large file from [laptop] to [qemu guest] - it works
2. copy a large file from [qemu guest] to [laptop] - transfer stalls
  
Actual results:
The transfer at step 2 stalls.

Expected results:
Transfer should work both ways, including in the direction [qemu guest] -> [laptop].

Additional info:
*this problem happens only when [qemu server] is running kernel version 2.6.29 (currently at kernel-2.6.29-0.237.rc7.git4.fc11.x86_64). When running 2.6.27 (kernel-2.6.27.15-170.2.24.fc10.x86_64), this problem doesn't occur.
*copying to/from [laptop] from/to [qemu server] works without issues.
*copying to/from [qemu server] from/to [qemu guest] works without issues.

Comment 1 George Iosif 2009-03-14 13:51:35 UTC
Created attachment 335206 [details]
qemu config for [qemu guest]

Comment 2 George Iosif 2009-03-14 21:13:30 UTC
I forgot to mention I was seeing this since kvm was still included as a separate package in Fedora rawhide, but always only with kernel 2.6.29.

Back then (when kvm was a separate package) I tried to troubleshoot the problem and I noticed the following: if I configured qemu to use a different driver for the network adapter (instead of virtio), that seemed to improve the situation in the sense that, for a while, the copying wouldn't stall.
But it was only for a while, because after some time of copying a big file (2GB or so) network connectivity for [qemu guest] would be lost completely and I would only be able to restore that (network connectivity) by unloading and reloading the network adapter driver in the [qemu guest] operating system.

Please let me know if you need more information.

Regards,
George Iosif

Comment 3 George Iosif 2009-03-21 10:44:29 UTC
Does anyone have any insights?

Thanks,
George Iosif

Comment 4 Mark McLoughlin 2009-03-25 12:55:12 UTC
George: thanks much for the detailed report

Just to summarize:

  - This is a regression introduced between 2.6.27 and 2.6.29
  - Copying from remote machine to guest is what stalls
  - It may or may not be virtio specific

This sounds somewhat similar to:

  http://www.mail-archive.com/kvm@vger.kernel.org/msg07006.html

Needs further debugging.

Could you try some of the suggestions I made in that email thread? Very interested in an strace of the hang.

Comment 5 George Iosif 2009-03-26 23:25:40 UTC
Created attachment 336926 [details]
kvm guest strace output

Strace output for the kvm process corresponding to the guest machine

Comment 6 George Iosif 2009-03-26 23:32:30 UTC
Hi Mark,

Thanks for the reply !

Your summary is not entirely accurate (please see comments below):
"  - This is a regression introduced between 2.6.27 and 2.6.29"
Yes, this is correct.

"  - Copying from remote machine to guest is what stalls"
No. Copying in the direction you mention goes fine.
It is copying from the guest to the remote machine that stalls.
During this time, however, I can still reach the guest machine (ping, for instance, works).

"  - It may or may not be virtio specific"
The problem with the copying stalling only happens with virtio.

The other issue I mention in the previous messages is only remotely related to the initial problem: I wanted to point out that when not using virtio, the copying stalling problem is not present.
The fact that using something different than virtio is not a workaround (because after 5-10 mins of copying a large file, the guest loses network connectivity completely) is another problem.

Looking at the thread at the link you included, the problem I have is different for the following reasons:
1) For my case, every time I try to copy a medium size file (200MB) from the guest to a remote machine, the copy stalls. However, I can still reach the guest over the network (even from the same remote machine).
   vs
   For the case in the thread, network connectivity works fine for a while, but after some time, it is lost completely, requiring a module removal and reloading to restore it. This is indeed similar to the second issue I mention (when not using virtio), so that's probably where you saw the similarity.

2) When copying stalls, running
ip link set eth0 down
ip link set eth0 up
    on the guest machine doesn't make any difference.

3) Looking at the details of the vnet0 interface on my qemu server, I don't see anything unusual (like the overruns in the case in the thread).


Moving forward, I ran
watch -n 0 "cat /proc/net/dev"
on both the qemu server and the guest machine and tried to copy the 200 MB file from the guest to a remote machine (different than the qemu server).
When the copy stalled, I didn't notice anything unusual in the two watch outputs: all numbers had a value of 0, with the exception of the transmit/receive bytes and packets which were slowly increasing.
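
A minimal sketch of how the same check could be scripted instead of watched by eye, assuming the standard /proc/net/dev layout (two header lines, then one line per interface with eight receive and eight transmit counters). The function name netdev_errors is hypothetical and not part of any tool mentioned in this report.

```shell
# Print one line per interface whose error or drop counters are non-zero.
# Reads /proc/net/dev-formatted text on stdin.
netdev_errors() {
    awk 'NR > 2 {
        # Turn the first ":" into a space so that "eth0:123" and
        # "eth0: 123" both split into the same fields.
        sub(/:/, " ")
        # $4/$5 are receive errs/drop, $12/$13 are transmit errs/drop.
        if ($4 + $5 + $12 + $13 > 0)
            printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
    }'
}

# Usage on a live system:
#   netdev_errors < /proc/net/dev
```

No output means every interface reports zero errors and drops, which matches what was observed during the stall.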

Finally, I ran
strace -pKVM_PID -ttt -T
(where KVM_PID is the PID of the KVM instance for my guest machine)
and reproduced the copy stalling.
You can find the output attached to the bug.

Please let me know if you need additional information.

Thanks,
George Iosif

Comment 7 Mark McLoughlin 2009-03-27 18:12:29 UTC
Thanks George; I've reproduced this now

The problem appears to be GSO related. To work around the problem, you can do:

  $> ethtool -K eth0 tx off

on the guest

Summary:

  - scp from guest to remote machine stalls
  - happens with 2.6.29 host, but not 2.6.27
  - happens whether host is using NAT or bridging
  - only happens when GSO is enabled on the guest interface

Comment 8 George Iosif 2009-03-27 21:26:53 UTC
Hi Mark,

The summary looks good now (although it is not only scp that stalls; I tried via Samba and it was the same behavior).

Still, the workaround doesn't work for me. :-(

Thanks,
George

Comment 9 George Iosif 2009-03-27 21:48:58 UTC
Ok, I did something else on the guest machine:
  modprobe -r virtio_net
followed by
  modprobe virtio_net gso=0

This seems to have corrected the problem.
I'll do some more testing and will get back with updates.
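
A minimal sketch of how the gso=0 setting could be made to survive a reboot, assuming the guest reads module options from /etc/modprobe.d. The function name persist_gso_off and the file name virtio_net.conf are hypothetical choices, and this has not been tested on the Slackware 12.2 guest described here.

```shell
# Write a modprobe options file so virtio_net is loaded with gso=0 every time.
# $1 is the target directory; it defaults to /etc/modprobe.d (needs root).
persist_gso_off() {
    dir=${1:-/etc/modprobe.d}
    printf 'options virtio_net gso=0\n' > "$dir/virtio_net.conf"
}

# Usage (as root), followed by a module reload to apply it immediately:
#   persist_gso_off
#   modprobe -r virtio_net && modprobe virtio_net
```

Reloading the module drops the guest's network connection for a moment, so it is best done from the console rather than over ssh.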

Thanks Mark !

Regards,
George

Comment 10 George Iosif 2009-03-27 22:50:25 UTC
After additional testing, I can say the following:
1) running "ethtool -K eth0 tx off" when the virtio_net module was loaded without any parameters does not make the problem go away (although the command executes successfully);

2) loading the virtio_net module with gso=0 parameter definitely solves the problem;

3) running "ethtool -k eth0" when the virtio_net module was loaded without any parameters (and presents the stall problem) gives this output:
   Offload parameters for eth0:
   Cannot get device rx csum settings: Operation not supported
   rx-checksumming: off
   tx-checksumming: on
   scatter-gather: on
   tcp segmentation offload: on
   udp fragmentation offload: off
   generic segmentation offload: on

4) running "ethtool -k eth0" when virtio_net is loaded with the command "modprobe virtio_net gso=0" (and when the problem goes away) gives this output:
   Offload parameters for eth0:
   Cannot get device rx csum settings: Operation not supported
   rx-checksumming: off
   tx-checksumming: on
   scatter-gather: on
   tcp segmentation offload: off
   udp fragmentation offload: off
   generic segmentation offload: on

As you can see, the only difference between when the stall problem is present and when it is not is "tcp segmentation offload" being on or off, respectively.

5) trying to set tso (tcp segmentation offload) off via ethtool (so one doesn't have to remove the virtio_net module and load it with gso=0) doesn't work:
   ethtool -K eth0 tso off
   Cannot set device tcp segmentation offload settings: Operation not supported
The command doesn't complete successfully.
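
A minimal sketch of how the two "ethtool -k" outputs above could be compared mechanically, assuming each was saved to a file. The function name offload_diff and the file names in the usage note are hypothetical.

```shell
# Print every offload feature whose on/off state differs between two
# saved `ethtool -k` outputs ($1 = first capture, $2 = second capture).
offload_diff() {
    awk -F': ' '
        # Keep only "name: on" / "name: off" lines; headers and error
        # messages have some other text (or nothing) after the colon.
        { split($2, s, " ") }
        s[1] != "on" && s[1] != "off" { next }
        { name = $1; sub(/^[ \t]+/, "", name) }
        # First file: remember the state of each feature.
        NR == FNR { before[name] = s[1]; next }
        # Second file: report features whose state changed.
        (name in before) && before[name] != s[1] {
            print name ": " before[name] " -> " s[1]
        }
    ' "$1" "$2"
}

# Usage:
#   ethtool -k eth0 > default.txt      # module loaded without parameters
#   ethtool -k eth0 > gso0.txt         # after reloading with gso=0
#   offload_diff default.txt gso0.txt
```

With the two outputs listed above, the only feature reported would be "tcp segmentation offload".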

Many thanks, Mark, for your valuable support on this and please let me know if you need additional information.

Regards,
George

Comment 11 Mark McLoughlin 2009-03-28 16:34:35 UTC
Looks like the regression was introduced by this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=71d93b39e5

 Author: Herbert Xu <herbert.org.au>
 Date:   Mon Dec 15 23:26:06 2008 -0800
 net: Add frag_list support to skb_segment

Comment 12 Herbert Xu 2009-03-29 01:25:36 UTC
Mark, do you mean 89319d3801d1d3ac29c7df1f067038986f267d29? 71d93b39e5 adds GRO support so shouldn't really affect this.

Also, is it just GSO from the guest that is broken or is the host as well? That is, if you copy a file from the host itself to the outside, does that work? Thanks!

Comment 13 Herbert Xu 2009-03-29 01:40:54 UTC
OK, I see the problem.  skb_segment fails to deal with linear GSO packets, which happens because tun prefers (in 2.6.29 at least) to allocate linear packets.

Comment 14 Herbert Xu 2009-03-29 02:02:07 UTC
Created attachment 337137 [details]
gso: Fix support for linear packets

Comment 15 Mark McLoughlin 2009-03-30 07:57:18 UTC
Here's where Herbert submitted the patch upstream:

http://marc.info/?l=linux-netdev&m=123829249327577
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=2f181855a0b

Comment 16 Mark McLoughlin 2009-03-30 09:05:12 UTC
Okay, built kernel-2.6.29-21.fc11 in rawhide:

* Mon Mar 30 2009 Mark McLoughlin <markmc> 2.6.29-21
- Fix guest->remote network stall with virtio/GSO (#490266)

