Bug 474191

Summary: bad network tx checksum
Product: Red Hat Enterprise Linux 5
Reporter: Alain RICHARD <alain.richard>
Component: kernel-xen
Assignee: Herbert Xu <herbert.xu>
Status: CLOSED CANTFIX
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: low
Version: 5.2
CC: bbs2web, herbert.xu, jmh, rlerch, xen-maint
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Doc Text:
Due to technical problems with passing TX checksum offload information between paravirtual domains, the use of TX checksum offload in conjunction with NAT for traffic originating from another domain is not supported. TX checksum offload can be used together with NAT as long as the NAT rule is applied in the domain where the traffic originates. Note that this also applies to fully virtualised domains using paravirtual network drivers. Fully virtualised domains using fully virtualised drivers are not affected as they do not support TX checksum offload at all.
Last Closed: 2009-06-29 18:23:41 UTC
Bug Blocks: 492570    

Description Alain RICHARD 2008-12-02 18:03:24 UTC
Description of problem:

I have discovered a behavior that affects network communication between Xen DomUs. The bug is triggered by a combination of routing between two DomUs, bridging under Dom0 and having a network adapter that is able to offload TX checksumming.

It seems that in some cases the DomU leaves the TX checksum calculation to the network card, but the calculation is never actually done because the packet is routed by another DomU.

Version-Release number of selected component (if applicable):

current RHEL 5.2, with kernel-xen-2.6.18-92.1.18.el5 and xen-3.0.3-69.0.

The bug has been present for several kernel-xen releases, but I can't be sure it was there from the start (RHEL 5.0).

How reproducible:

This is 100% reproducible using the setup shown below.

Steps to Reproduce:

We have the following setup:

VLAN 1 = intranet
VLAN 10 = internet DMZ

VM1 = a firewall between VLAN 10 (eth0) and VLAN 1 (eth1)
VM2 = a host on the VLAN 10 DMZ (eth0)

PHYS 1 = a xen dom0 that bridges its eth0 to VLAN1 and its eth1 to VLAN10
PHYS 2 = a xen dom0 that bridges its eth0 to VLAN1 and its eth1 to VLAN10
HOST A = an intranet host on VLAN 1
HOST B = an internet host, somewhere in the world

PHYS 1 and PHYS 2 have the same setup (RHEL 5.2, with all the same software versions and the same brctl bridges vlan1 and vlan10). These two machines use an Intel 82571EB gigabit controller with the e1000e kernel module, version 0.2.0.

ethtool -k eth3
Offload parameters for eth3:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
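
When comparing hosts and guests it can help to extract just the TX checksum state from this kind of output. The following is a minimal sketch, assuming the `ethtool -k` text format shown above; `tx_csum_state` is a hypothetical helper name, not part of ethtool.

```shell
# Hypothetical helper: print the tx-checksumming state ("on" or "off")
# found in `ethtool -k` output supplied on stdin.
# Returns non-zero if no tx-checksumming line is present.
tx_csum_state() {
    awk -F': *' '
        { sub(/^[ \t]+/, "") }            # tolerate indented output
        $1 == "tx-checksumming" { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

Usage, for example inside a guest: `ethtool -k eth0 | tx_csum_state`.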


Actual results:

Setup 1: if VM1 is on PHYS1 and VM2 is on PHYS2, there is no communication problem

- from VM1 to VM2, HOST A and any HOST B
- from VM2 to VM1 and any HOST B

Setup 2: if VM1 and VM2 are on the same physical host (either PHYS1 or PHYS2), then

from VM1 to VM2 : OK
from VM1 to HOST A : OK
from VM1 to HOST B : OK
from VM2 to VM1 : OK
from VM2 to HOST B : OK
from HOST B to VM1 : OK
from HOST B to VM2 : OK
from HOST A to VM1 : OK
from HOST A to HOST B : OK
from HOST A to VM2 : ping ok, but no tcp sessions


The problem goes away when TX checksum offload is disabled on the Xen pseudo interface of VM2 (ethtool -K eth0 tx off).

I suspect the following:

step 0 - HOST A initiates a session to a TCP socket on VM2.
step 1 - TCP packets emitted by VM2 are not checksummed because the TX checksum is marked as offloaded.
step 2 - the packet is then forwarded to the vlan10 bridge in Dom0. The checksum is bad, but it is not verified, probably because the bridge is also marked as TX checksum offload.
step 3 - the packet is forwarded to the eth0 pseudo interface of VM1. The checksum is bad, but it is not verified because the interface is marked as TX checksum offload.
step 4 - VM1 routes the packet from eth0 (DMZ) to eth1 (intranet) (the firewall does not drop it as it is related to an established session; also, disabling the firewall completely has no impact on the problem).
step 5 - VM1 verifies the packet checksum before routing it and drops it because the checksum is bad.
step 6 - HOST A hangs waiting for VM2's reply and, after several attempts, times out.

This problem is only seen in setup 2. In setup 1, the problem is not triggered because the packet is actually emitted on the wire by the PHYS2 interface and the checksum is calculated by the adapter.
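
One way to check the suspicion above is to capture traffic further along the path (for example in the forwarding guest) and count the segments whose checksum tcpdump reports as wrong. The sketch below assumes tcpdump's verbose (`-vv`) text output, in which such packets carry an `(incorrect -> ...)` annotation; `count_bad_cksum` is a hypothetical helper name. A capture taken inside the sending guest shows incorrect checksums whenever TX offload is on, which is expected, so that capture point proves nothing by itself.

```shell
# Hypothetical helper: count lines of `tcpdump -vv` text output (on stdin)
# that tcpdump annotated as having an incorrect checksum.
count_bad_cksum() {
    # grep -c prints the count but exits non-zero when the count is 0,
    # so mask the status to keep the helper usable under `set -e`.
    grep -c '(incorrect' || true
}
```

Usage: `tcpdump -n -vv -i eth0 -c 100 tcp | count_bad_cksum`, where the interface name and the packet count are placeholders to adapt.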

Comment 2 Herbert Xu 2008-12-03 03:58:19 UTC
Are both VMs paravirt and running the same kernel version as dom0? Is your TCP traffic IPv4 (we only support IPv4 virtual forwarding in Xen in RHEL5)?

Also what processing does VM1 do to the packets, e.g., please list netfilter rules and other relevant configuration details in VM1.

Please also test if disabling tx checksums in VM1 fixes the problem.  If that doesn't, please test disabling tx checksums in dom0's netback interface to VM1.

Thanks!

Comment 3 Alain RICHARD 2008-12-24 11:06:43 UTC
Both VMs are paravirt and running the same kernel as dom0.

All the traffic is IPv4.

The netfilter rules involve filtering (iptables -t filter), marking (iptables -t mangle) and NAT (iptables -t nat).

If I remove all rules but the NAT one:

[root@sol ~]# iptables -L -n -v -t nat
Chain PREROUTING (policy ACCEPT 11M packets, 1925M bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 5296K packets, 381M bytes)
 pkts bytes target     prot opt in     out     source               destination         
  247 17199 SNAT       all  --  *      eth3    192.168.240.0/24     0.0.0.0/0           to:195.189.65.250 

Chain OUTPUT (policy ACCEPT 5257K packets, 378M bytes)
 pkts bytes target     prot opt in     out     source               destination         

The result is the same (my intranet host is on net 192.168.240.0/24, the VM1 has address 195.189.65.250 and the VM2 has address 195.189.65.253).


If I remove this last rule (so that there are no iptables rules at all), the problem disappears.

So the problem is related to TX offload AND iptables NAT. A simple solution is to issue "ethtool -K eth0 tx off" in VM2.

If I issue "ethtool -K eth3 tx off" in VM1, the problem is still present.

If I issue "ethtool -K vif4.3 tx off" on the physical host to disable the TX checksum in dom0's netback interface to VM1, the problem disappears.

Regards,

Comment 4 Herbert Xu 2009-01-08 06:18:10 UTC
Right, unfortunately we can't support NAT + TX offload in RHEL5 for packets originating from another guest because the underlying netfilter infrastructure for doing so simply isn't there.  Disabling checksum offload on the interface going into VM1 (vif4.3) is the best workaround.

Comment 6 Alain RICHARD 2009-01-09 10:06:17 UTC
Is there another bug tracking number in netfilter that we can follow to keep an eye on the resolution of this problem?

The vif interface (here vif4.3) is created during the Xen VM launch. In my case, the VM is in fact launched by cman's rgmanager on one of the physical servers of the cluster.

What is the best way to intercept the VM creation on the physical host in order to issue the required "ethtool -K vif4.3 tx off"?

Is there any hook in the xen package where we could, for example, interpret a vif = [ "mac=00:16:3e:72:4e:d2,bridge=vlan50,txoffload=off" ]?

Until such a solution is available, the best option for me was to disable it in VM2, so that it works with rgmanager.

Comment 8 Herbert Xu 2009-01-11 00:17:24 UTC
Alain, IIRC /etc/xen/scripts/vif-bridge is the one you can use to disable TX offload on vif4.3.
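
A minimal sketch of such a hook follows. It assumes that the script sees the backend interface name (for example vif4.3) in `$vif` and that the helper is called from the branch that runs when the interface comes online; both points should be checked against the installed script. The names `disable_vif_tx_csum` and `ETHTOOL` are hypothetical and only serve this illustration.

```shell
# Assumption: ethtool is on PATH in the hotplug environment.
# ETHTOOL is overridable so the helper can be dry-run (e.g. ETHTOOL=echo).
ETHTOOL="${ETHTOOL:-ethtool}"

# Hypothetical helper: turn TX checksum offload off on one backend interface.
disable_vif_tx_csum() {
    dev="$1"
    # Refuse to run without an interface name.
    [ -n "$dev" ] || return 1
    "$ETHTOOL" -K "$dev" tx off
}
```

Inside the script the call would then be `disable_vif_tx_csum "$vif"`, placed after the interface has been added to the bridge. This applies to every vif handled by the script; restricting it to a single guest would need an extra check on the interface or bridge name.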

Comment 11 Bill Burns 2009-01-12 17:44:45 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Due to technical problems with passing TX checksum offload information between paravirtual domains, we do not support the use of TX checksum offload in conjunction with NAT for traffic originating from another domain.  In other words, you may use TX checksum offload together with NAT as long as the NAT rule is applied in the domain where the traffic originates.

 If you have to apply NAT to traffic originating from another domain, then TX checksum should be disabled on the vif interface entering the domain with the rules.

 Note that this also applies to fully virtualised domains using paravirtual network drivers.  Fully virtualised domains using fully virtualised drivers are not affected as they do not support TX checksum offload at all.

Comment 14 Ryan Lerch 2009-01-16 03:32:17 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,5 +1,3 @@
-Due to technical problems with passing TX checksum offload information between paravirtual domains, we do not support the use of TX checksum offload in conjunction with NAT for traffic originating from another domain.  In other words, you may use TX checksum offload together with NAT as long as the NAT rule is applied in the domain where the traffic originates.
+Due to technical problems with passing TX checksum offload information between paravirtual domains, the use of TX checksum offload in conjunction with NAT for traffic originating from another domain is not supported. TX checksum offload can be used together with NAT as long as the NAT rule is applied in the domain where the traffic originates.
 
- If you have to apply NAT to traffic originating from another domain, then TX checksum should be disabled on the vif interface entering the domain with the rules.
+Note that this also applies to fully virtualised domains using paravirtual network drivers. Fully virtualised domains using fully virtualised drivers are not affected as they do not support TX checksum offload at all.
-
- Note that this also applies to fully virtualised domains using paravirtual network drivers.  Fully virtualised domains using fully virtualised drivers are not affected as they do not support TX checksum offload at all.

Comment 15 David Herselman 2009-05-15 21:45:58 UTC
The following perhaps duplicates the information above, but it would have helped me, and hopefully will help others, to isolate the problem and find a solution. Our setup is fairly convoluted, with load balancing, connection tracking, and traffic marking and shaping, so we stumbled onto this bug report almost by chance.

We experience the same problem with Dom0 when using PREROUTING nat rules. The problems are however not limited to communications to/from DomU guests.

Dom0 running Squid proxy server (transparent port configured on tcp:3129)
iptables -t nat -A PREROUTING -s 192.168.0.0/16 -d ! 192.168.1.0/29 -p tcp --dport 80 -j DNAT --to 192.168.1.1:3129

Clients are able to retrieve about 100KB without problems, but the connection then becomes extremely slow and eventually stops altogether. However, if we skip NAT and have workstations configure their browsers to connect directly to the Squid proxy server, everything works perfectly well.

Running Xen with a single domU (Windows 2003 with Xen paravirtualisation drivers).

Dom0 eth0: 192.168.1.1/29
DomU NIC1: 192.168.1.3/29

PS: Dom0 has routes for 192.168.0.0/16 to 192.168.1.2 (a WiFi router).

eth0 - xenbr0


Hardware: Confirmed to affect Intel Original S3210SH and S5000VSA motherboards with Intel 80003ES2LAN, 82566DM-2 and 82541GI embedded Gigabit Ethernet Controllers.

ethtool -k eth0
  Offload parameters for eth0:
  Cannot get device rx csum settings: Operation not supported
  Cannot get device udp large send offload settings: Operation not supported
  rx-checksumming: off
  tx-checksumming: on
  scatter-gather: off
  tcp segmentation offload: off
  udp fragmentation offload: off
  generic segmentation offload: off

Adding 'ethtool -K eth0 tx off' to /etc/rc.d/rc.local fixes the problem...
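
To make that change repeatable, the line can be appended only when it is missing. This is a minimal sketch; `ensure_line` is a hypothetical helper, and the file path in the usage example is the one mentioned above.

```shell
# Hypothetical helper: append a line to a file only if that exact line is
# not already present, so repeated runs do not create duplicates.
ensure_line() {
    file="$1"
    line="$2"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Usage: `ensure_line /etc/rc.d/rc.local 'ethtool -K eth0 tx off'`.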

Comment 18 Bill Burns 2009-06-29 18:23:41 UTC
This issue cannot be fixed and has been release noted. Closing. If the workaround is not sufficient, please reopen.