Created attachment 335205 [details]
[qemu server] hardware

Description of problem:

* setup:
  [laptop]---[switch]---[qemu server]==bridge==[qemu guest]

* problem:
  if I try to copy a file (e.g. 200 MB) - tried with scp and via Samba share - from [qemu guest] to [laptop], the transfer stalls. However, if I try to copy the file from [laptop] to [qemu guest], it works without issues.

Version-Release number of selected component (if applicable):
* [qemu server] = FC rawhide
* [qemu guest] = Slackware 12.2 configured with virtio network

How reproducible:
Always

Steps to Reproduce:
1. copy a large file from [laptop] to [qemu guest] - it works
2. copy a large file from [qemu guest] to [laptop] - transfer stalls

Actual results:
The transfer at step 2 stalls.

Expected results:
Transfer should work both ways, including in the direction [qemu guest] -> [laptop].

Additional info:
* this problem happens only when [qemu server] is running kernel version 2.6.29 (currently at kernel-2.6.29-0.237.rc7.git4.fc11.x86_64). When running 2.6.27 (kernel-2.6.27.15-170.2.24.fc10.x86_64), this problem doesn't occur.
* copying to/from [laptop] from/to [qemu server] works without issues.
* copying to/from [qemu server] from/to [qemu guest] works without issues.
Created attachment 335206 [details] qemu config for [qemu guest]
I forgot to mention that I have been seeing this since kvm was still a separate package in Fedora rawhide, but always only with kernel 2.6.29.

Back then (when kvm was a separate package) I tried to troubleshoot the problem and noticed the following: if I configured qemu to use a different driver for the network adapter (instead of virtio), the situation seemed to improve, in the sense that the copying wouldn't stall for a while. But it was only for a while: after some time of copying a big file (2GB or so), network connectivity for [qemu guest] would be lost completely, and I could only restore it by unloading and reloading the network adapter driver in the [qemu guest] operating system.

Please let me know if you need more information.

Regards,
George Iosif
Anyone any insights ? Thanks, George Iosif
George: thanks much for the detailed report

Just to summarize:

 - This is a regression introduced between 2.6.27 and 2.6.29
 - Copying from remote machine to guest is what stalls
 - It may or may not be virtio specific

This sounds somewhat similar to:

http://www.mail-archive.com/kvm@vger.kernel.org/msg07006.html

Needs further debugging. Could you try some of the suggestions I made in that email thread? Very interested in an strace of the hang.
Created attachment 336926 [details] kvm guest strace output Strace output for the kvm process corresponding to the guest machine
Hi Mark,

Thanks for the reply! Your summary is not entirely accurate (please see comments below):

" - This is a regression introduced between 2.6.27 and 2.6.29"
Yes, this is correct.

" - Copying from remote machine to guest is what stalls"
No. Copying in the direction you mention goes fine. It is copying from the guest to the remote machine that stalls. During this time, however, I can still reach the guest machine (ping, for instance, works).

" - It may or may not be virtio specific"
The problem with the copying stalling only happens with virtio. The other issue I mention in the previous messages is only remotely related to the initial problem: I wanted to point out that when not using virtio, the copying stalling problem is not present. The fact that using something different than virtio is not a workaround (because after 5-10 mins of copying a large file, the guest loses network connectivity completely) is another problem.

Looking at the thread at the link you included, the problem I have is different for the following reasons:

1) For my case, every time I try to copy a medium size file (200MB) from the guest to a remote machine, the copy stalls. However, I can still reach the guest over the network (even from the same remote machine).
vs
For the case in the thread, network connectivity works fine for a while, but after some time, it is lost completely, requiring a module removal and reloading to restore it. This is indeed similar to the second issue I mention (when not using virtio), so that's probably where you saw the similarity.

2) When copying stalls, running
   ip link set eth0 down
   ip link set eth0 up
on the guest machine doesn't make any difference.

3) Looking at the details of the vnet0 interface on my qemu server, I don't see anything unusual (like the overruns in the case in the thread).
Moving forward, I ran
   watch -n 0 "cat /proc/net/dev"
on both the qemu server and the guest machine and tried to copy the 200 MB file from the guest to a remote machine (different than the qemu server). When the copy stalled, I didn't notice anything unusual in the two watch outputs: all numbers had a value of 0, with the exception of the transmit/receive bytes and packets, which were slowly increasing.

Finally, I ran
   strace -pKVM_PID -ttt -T
(where KVM_PID is the PID of the KVM instance for my guest machine) and reproduced the copy stalling. You can find the output attached to the bug.

Please let me know if you need additional information.

Thanks,
George Iosif
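The manual check described above (watching /proc/net/dev and looking for any counter other than bytes/packets that moves) can be sketched as a small script. This is only an illustration: the function names `parse_net_dev` and `changed_counters` are hypothetical, and the sketch assumes the usual /proc/net/dev layout of two header lines followed by one line per interface with 8 receive and 8 transmit counters.

```python
# Counter names as they appear in the /proc/net/dev header lines.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]
NAMES = ["rx_" + f for f in RX_FIELDS] + ["tx_" + f for f in TX_FIELDS]


def parse_net_dev(text):
    """Parse the text of /proc/net/dev into {iface: {counter: int}}."""
    stats = {}
    for line in text.splitlines():
        # Header lines contain no colon; interface lines look like "eth0: 1 2 ..."
        # (some kernels print "eth0:1 2 ..." with no space after the colon).
        name, sep, rest = line.partition(":")
        values = rest.split()
        if not sep or len(values) != len(NAMES):
            continue
        stats[name.strip()] = dict(zip(NAMES, (int(v) for v in values)))
    return stats


def changed_counters(before, after,
                     ignore=("rx_bytes", "rx_packets", "tx_bytes", "tx_packets")):
    """Return {iface: {counter: delta}} for counters that moved between two
    snapshots, leaving out the plain traffic counters listed in `ignore`."""
    result = {}
    for iface, now in after.items():
        then = before.get(iface, {})
        deltas = {key: value - then.get(key, 0)
                  for key, value in now.items()
                  if key not in ignore and value != then.get(key, 0)}
        if deltas:
            result[iface] = deltas
    return result
```

Taking one snapshot before the copy and one during the stall, an empty result from `changed_counters` would correspond to the observation above that only the byte and packet counters were changing.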
Thanks George; I've reproduced this now

The problem appears to be GSO related. To work around the problem, you can do:

  $> ethtool -K eth0 tx off

on the guest

Summary:

 - scp from guest to remote machine stalls
 - happens with 2.6.29 host, but not 2.6.27
 - happens whether host is using NAT or bridging
 - only happens when GSO is enabled on the guest interface
Hi Mark, The summary looks good now (although it is not only scp that stalls; I tried via Samba and it was the same behavior). Still, the workaround doesn't work for me. :-( Thanks, George
Ok, I did something else on the guest machine:

   modprobe -r virtio_net

followed by

   modprobe virtio_net gso=0

This seems to have corrected the problem. I'll do some more testing and will get back with updates.

Thanks Mark!

Regards,
George
After additional testing, I can say the following:

1) running "ethtool -K eth0 tx off" when the virtio_net module was loaded without any parameters does not make the problem go away (although the command executes successfully);

2) loading the virtio_net module with the gso=0 parameter definitely solves the problem;

3) running "ethtool -k eth0" when the virtio_net module was loaded without any parameters (and presents the stall problem) gives this output:

   Offload parameters for eth0:
   Cannot get device rx csum settings: Operation not supported
   rx-checksumming: off
   tx-checksumming: on
   scatter-gather: on
   tcp segmentation offload: on
   udp fragmentation offload: off
   generic segmentation offload: on

4) running "ethtool -k eth0" when virtio_net is loaded with the command "modprobe virtio_net gso=0" (and when the problem goes away) gives this output:

   Offload parameters for eth0:
   Cannot get device rx csum settings: Operation not supported
   rx-checksumming: off
   tx-checksumming: on
   scatter-gather: on
   tcp segmentation offload: off
   udp fragmentation offload: off
   generic segmentation offload: on

As you can see, the only difference between the two outputs is "tcp segmentation offload": it is on when the stall problem is present and off when it is not.

5) trying to set tso (tcp segmentation offload) off via ethtool (so one doesn't have to remove the virtio_net module and load it with gso=0) doesn't work:

   ethtool -K eth0 tso off
   Cannot set device tcp segmentation offload settings: Operation not supported

The command doesn't complete successfully.

Many thanks, Mark, for your valuable support on this and please let me know if you need additional information.

Regards,
George
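The comparison of the two "ethtool -k eth0" outputs above can also be done mechanically. The following is a minimal sketch, not part of ethtool: `parse_offloads` and `diff_offloads` are hypothetical helper names, and the parser assumes the plain "feature: on/off" line format quoted in this comment.

```python
def parse_offloads(text):
    """Parse `ethtool -k`-style text into {feature: "on" or "off"}.

    Lines whose value does not start with on/off (the "Offload parameters"
    heading, the "Cannot get device ..." message) are skipped.
    """
    features = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        words = value.split()
        if sep and words and words[0] in ("on", "off"):
            features[name.strip()] = words[0]
    return features


def diff_offloads(first, second):
    """Return {feature: (value_in_first, value_in_second)} where they differ."""
    return {key: (first.get(key), second.get(key))
            for key in sorted(set(first) | set(second))
            if first.get(key) != second.get(key)}
```

Feeding it the output captured with the module loaded without parameters and the output captured after reloading the module shows which offload settings changed between the stalling and the working configuration.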
Looks like the regression was introduced by this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=71d93b39e5

  Author: Herbert Xu <herbert.org.au>
  Date: Mon Dec 15 23:26:06 2008 -0800

  net: Add frag_list support to skb_segment
Mark, do you mean 89319d3801d1d3ac29c7df1f067038986f267d29? 71d93b39e5 adds GRO support so shouldn't really affect this. Also, is it just GSO from the guest that is broken or is the host as well? That is, if you copy a file from the host itself to the outside, does that work? Thanks!
OK, I see the problem. skb_segment fails to deal with linear GSO packets, which happens because tun prefers (in 29 at least) to allocate linear packets.
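For readers less familiar with the terms: with GSO the stack passes down one oversized packet and splits it into MSS-sized segments late in the transmit path, and a "linear" packet is one whose payload sits in a single contiguous buffer. The toy model below only illustrates that idea in Python. It is not the kernel's skb_segment, it does not reproduce the bug or the fix, and the function name `segment_linear` and the 1448-byte MSS used in the usage note are assumptions made for the example.

```python
def segment_linear(payload, mss):
    """Split one contiguous ("linear") payload into chunks of at most mss bytes.

    Toy illustration of segmentation only; real GSO also rebuilds headers,
    checksums and sequence numbers for every segment.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[offset:offset + mss]
            for offset in range(0, len(payload), mss)]
```

For instance, `segment_linear(bytes(3000), 1448)` yields three chunks whose lengths add up to 3000, and joining the chunks gives back the original payload.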
Created attachment 337137 [details] gso: Fix support for linear packets
Here's where Herbert submitted the patch upstream:

http://marc.info/?l=linux-netdev&m=123829249327577

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=2f181855a0b
Okay, built kernel-2.6.29-21.fc11 in rawhide:

* Mon Mar 30 2009 Mark McLoughlin <markmc> 2.6.29-21
- Fix guest->remote network stall with virtio/GSO (#490266)