Description of problem:
Inbound bandwidth available to instances using an external gateway from a provider network is extremely low. The bug presents itself when running a network operation, such as yum update, from an instance on an internal fixed-IP network against a server located outside the tenant network. This appears to be caused by improper TCP checksums triggering TCP retransmissions. Outbound instance bandwidth is typically at or near line speed (near 900 Mbps on 1GbE).

Version-Release number of selected component (if applicable):
This problem is experienced on the latest RHEL 6.5 kernel (2.6.32-431) using both quantum from RHOS 3.0 and neutron from RHEL-OSP 4.0b.

Steps to Reproduce:
1. Spin up an instance on an internal fixed-IP network
2. Assign a floating IP to the instance
3. Run iperf against the floating IP from an external host

Actual results:
Between 14 kbps and 30 kbps on a 1GbE line

Expected results:
Near line speed (>900 Mbps)

Additional info:
Joe Talerico from performance engineering will comment with specific performance testing results.
From my latest email to RHOS-TECH on this:

Here is a packet capture of the checksum issue on my environment:

14:05:47.706700 IP (tos 0x0, ttl 64, id 11789, offset 0, flags [DF], proto GRE (47), length 1442)
    192.168.3.200 > 192.168.3.202: GREv0, Flags [key present], key=0x1, length 1422
    IP (tos 0x8, ttl 63, id 11789, offset 0, flags [DF], proto TCP (6), length 1400)
    22.16.1.9.ssh > 10.0.0.6.46691: Flags [.], cksum 0x2689 (incorrect -> 0x0444), seq 16401:17749, ack 320, win 188, options [nop,nop,TS val 2375193 ecr 2421340], length 1348

Netperf going from the Guest to a physical machine:

[root@mongo1 ~]# netperf -4 -l 60 -H 22.16.1.9 -T1,1 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.9 () port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    3067.70

Netperf going from a physical machine to the Guest:

[root@sandyone ~]# netperf -4 -l 60 -H 22.16.1.3 -T1,1 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.3 () port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    64.00       0.06

Going from the Neutron Host to the Guest (where BR-EX w/ eth5 attached is):

[root@athos ~]# netperf -4 -l 60 -H 22.16.1.3 -T1,1 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.3 () port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    1907.41
I linked an upstream launchpad bug that may be related. That bug hasn't been completely resolved yet, but it suggests that having GRO offloading enabled could be at least part of the problem. The email threads at http://lists.openstack.org/pipermail/openstack/2013-October/thread.html#1778 and http://lists.openstack.org/pipermail/openstack/2013-November/thread.html#2705 are also related to this upstream bug. On the chance that this issue is related to GRO offloading, please try running "ethtool -k eth5" on the network node, and if that shows "generic-receive-offload: on", try turning it off with "ethtool -K eth5 gro off", and see if that helps.
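The check-and-disable steps above can be sketched as a small shell snippet. This is only a sketch: eth5 is an example interface name, so substitute whatever NIC is attached to br-ex on your network node. Note that ethtool -K does not persist across a reboot.

```shell
# Hypothetical interface name; substitute your external NIC (e.g. eth5).
IFACE=eth5

# Extract the GRO state ("on" or "off") from "ethtool -k" output,
# which prints one "feature: on/off" line per offload setting.
gro_state() {
    ethtool -k "$1" 2>/dev/null | awk -F': ' '/generic-receive-offload/ {print $2}'
}

if [ "$(gro_state "$IFACE")" = "on" ]; then
    # Disable GRO for the running system (not persistent across reboot).
    ethtool -K "$IFACE" gro off
fi
```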
Brent indicated in email that disabling GRO resolved the issue. Brent, can you update the bug with any relevant details?
BobK, I also re-tested with GRO disabled; here are my results:

External machine to Guest (where the issue existed):

[root@sandyone ~]# netperf -4 -H 22.16.1.5 -l 60 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.5 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    2405.41
[root@sandyone ~]#

Guest to external machine:

-bash-4.1# netperf -4 -H 22.16.1.9 -l 60 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.9 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    3423.49
-bash-4.1#
Thanks Joe. Still not quite symmetric, but much better. It looks like the outbound bandwidth improved a bit too, though I'm not sure that is significant. Disabling GRO everywhere (not just on the network node) might provide further improvement; I'm not sure whether there are any disadvantages to doing so. Concluding that this is a kernel bug, and we need to document disabling GRO when using GRE with OpenStack until it is fixed.
Adding a documentation flag and leaving this bug on Neutron to capture and document the issue. Bob, did you report the relevant bug on the kernel?
NEEDINFO Bob Kukura
Could you please supply a command (or a reference to explanatory text) in the Doc Text's workaround, showing the user how to disable GRO offloading on the network node? Thanks
I've cloned this as BZ 1042507 against the kernel. Livnat, shouldn't we keep this open to track eventually verifying the kernel fix with OpenStack? I've updated the wording of the doc text slightly, and added the command to disable GRO.
This has been reproduced using both provider external networks and bridge-based (br-ex) external networks.
Updated doc text workaround to persistently turn off GRO by adding: ETHTOOL_OPTS="-K ethX gro off" to /etc/sysconfig/network-scripts/ifcfg-ethX.
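A sketch of applying that workaround from the shell, wrapped in a small helper so the file path and interface name are explicit (the path and the eth5 name follow the doc text above; the helper name is my own, and this should be run as root on the network node):

```shell
# Append an ETHTOOL_OPTS line to an ifcfg file unless one already exists,
# so the interface comes back up with GRO disabled after a reboot.
add_gro_off() {
    ifcfg="$1"   # e.g. /etc/sysconfig/network-scripts/ifcfg-eth5
    nic="$2"     # e.g. eth5
    grep -q '^ETHTOOL_OPTS=' "$ifcfg" 2>/dev/null || \
        printf 'ETHTOOL_OPTS="-K %s gro off"\n' "$nic" >> "$ifcfg"
}

# Example usage (on the network node):
# add_gro_off /etc/sysconfig/network-scripts/ifcfg-eth5 eth5
```

The grep guard keeps the helper idempotent, so re-running it will not stack duplicate ETHTOOL_OPTS lines in the ifcfg file.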
(In reply to Bob Kukura from comment #9) > I've cloned this as BZ 1042507 against the kernel. > > Livnat, shouldn't we keep this open to track eventually verifying the kernel > fix with OpenStack? > I don't think we need to, I added a comment on the Kernel bug to ask Joe or Ofer to verify also in the context of Neutron.