Bug 1033260 - Asymmetric instance bandwidth using Floating IP w/ tunneling (GRE/VXLAN) through Neutron
Summary: Asymmetric instance bandwidth using Floating IP w/ tunneling (GRE/VXLAN) through Neutron
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.0
Assignee: Bob Kukura
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-21 19:17 UTC by Brent Holden
Modified: 2016-04-26 13:57 UTC
CC: 10 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
When generic receive offload (GRO) is enabled while using GRE or VXLAN tunneling, the inbound bandwidth available to instances from an external network through an OpenStack Networking router is extremely low. Workaround: disable GRO on the network node where the l3-agent runs by adding the following line to /etc/sysconfig/network-scripts/ifcfg-ethX, where ethX is the network interface device used for the external network: ETHTOOL_OPTS="-K ethX gro off". Either reboot or run "ifdown ethX; ifup ethX" for the setting to take effect. This provides more symmetric bandwidth and much faster inbound data flow.
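The workaround amounts to a one-line addition to the interface's ifcfg file. A minimal sketch (ethX is a placeholder for the actual external network device, as in the text above):

```shell
# /etc/sysconfig/network-scripts/ifcfg-ethX  (ethX = external network device)
# Persistently disable generic receive offload on this interface; the
# initscripts pass ETHTOOL_OPTS to ethtool each time the interface comes up.
ETHTOOL_OPTS="-K ethX gro off"
```

Apply it with a reboot or with "ifdown ethX; ifup ethX".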
Clone Of:
Cloned As: 1042507
Environment:
Last Closed: 2013-12-08 20:39:48 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1252900 0 None None None Never

Description Brent Holden 2013-11-21 19:17:33 UTC
Description of problem:
Inbound bandwidth available to instances using an external gateway from a provider network is extremely low. The bug presents itself when running a network operation, such as yum update, from an instance on an internal fixed IP network against a server located outside the tenant network. This appears to be caused by incorrect TCP checksums triggering TCP retransmissions.

Outbound instance bandwidth is typically at or near line speed (near 900 Mbit/s on 1GbE).


Version-Release number of selected component (if applicable):
This problem is experienced on the latest RHEL 6.5 kernel (2.6.32-431) using both quantum from RHOS 3.0 and neutron from RHEL-OSP 4.0b.


Steps to Reproduce:
1. Spin up instance on internal fixed IP network
2. Assign floating IP to instance
3. Run iperf against floating IP from external host

Actual results:
Between 14 kb/s and 30 kb/s on a 1GbE line

Expected results:
Near line speed (>900 Mbit/s)

Additional info:
Joe Talerico from performance engineering will comment with specific performance testing results.

Comment 2 Joe Talerico 2013-11-21 20:24:04 UTC
From my latest email to RHOS-TECH on this:

Here is a packet capture of the checksum issue on my environment 
14:05:47.706700 IP (tos 0x0, ttl 64, id 11789, offset 0, flags [DF], proto GRE (47), length 1442)
    192.168.3.200 > 192.168.3.202: GREv0, Flags [key present], key=0x1, length 1422
	IP (tos 0x8, ttl 63, id 11789, offset 0, flags [DF], proto TCP (6), length 1400)
    22.16.1.9.ssh > 10.0.0.6.46691: Flags [.], cksum 0x2689 (incorrect -> 0x0444), seq 16401:17749, ack 320, win 188, options [nop,nop,TS val 2375193 ecr 2421340], length 1348

Netperf going from the Guest to a physical machine:
[root@mongo1 ~]# netperf -4 -l 60 -H 22.16.1.9 -T1,1 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.9 () port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    3067.70

Netperf going from a physical machine to the Guest:
[root@sandyone ~]# netperf -4 -l 60 -H 22.16.1.3  -T1,1 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.3 () port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    64.00       0.06

Going from the Neutron host (where br-ex, with eth5 attached, lives) to the Guest:
[root@athos ~]# netperf -4 -l 60 -H 22.16.1.3  -T1,1 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.3 () port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    1907.41

Comment 3 Bob Kukura 2013-12-03 22:28:10 UTC
I linked an upstream launchpad bug that may be related. That bug hasn't been completely resolved yet, but suggests that having GRO offloading enabled could be at least part of the problem. The email thread at http://lists.openstack.org/pipermail/openstack/2013-October/thread.html#1778 and http://lists.openstack.org/pipermail/openstack/2013-November/thread.html#2705 is also related to this upstream bug.

On the chance that this issue is related to GRO offloading, please try running "ethtool -k eth5" on the network node; if that shows "generic-receive-offload: on", try turning it off with "ethtool -K eth5 gro off", and see if that helps.
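As a sketch, the check-and-disable sequence looks like this (eth5 is the external interface in this particular environment; the ethtool call itself has to run on the network node, so the status line is stubbed here to keep the snippet self-contained):

```shell
# On the network node you would capture the feature line with:
#   gro_state=$(ethtool -k eth5 | grep generic-receive-offload)
# Stubbed below with the value reported on an affected node:
gro_state="generic-receive-offload: on"

case "$gro_state" in
  *": on")
    echo "GRO enabled - disable it with: ethtool -K eth5 gro off"
    ;;
  *": off")
    echo "GRO already disabled"
    ;;
esac
```

Note that `ethtool -K` alone does not survive a reboot; see the ifcfg-based workaround in the Doc Text for a persistent setting.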

Comment 4 Bob Kukura 2013-12-05 03:15:28 UTC
Brent indicated in email that disabling GRO resolved the issue. Brent, can you update the bug with any relevant details?

Comment 5 Joe Talerico 2013-12-05 13:35:39 UTC
BobK, I also re-tested with GRO disabled; here are my results:

External machine to Guest (where the issue existed):
[root@sandyone ~]# netperf -4 -H 22.16.1.5 -l 60 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.5 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    2405.41
[root@sandyone ~]#


Guest to external machine:
-bash-4.1# netperf -4 -H 22.16.1.9 -l 60 -- -m 1024
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.16.1.9 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384   1024    60.01    3423.49
-bash-4.1#

Comment 6 Bob Kukura 2013-12-05 14:58:56 UTC
Thanks Joe. Still not quite symmetric, but much better. Looks like the outbound bandwidth improved a bit too, but not sure if this is significant. Disabling GRO everywhere (not just the network node) might provide further improvement. Not sure if there are any disadvantages to this.

Concluding that this is a kernel bug, and we need to document disabling GRO when using GRE with OpenStack until it is fixed.

Comment 7 lpeer 2013-12-08 20:39:48 UTC
Adding a documentation flag and leaving this bug on Neutron to capture and document the issue.

Bob, did you report the relevant bug on the kernel?

Comment 8 Bruce Reeler 2013-12-12 06:05:22 UTC
NEEDINFO Bob Kukura

Could you please supply a command (or a reference to explanatory text) in the Doc Text's workaround, showing the user how to disable GRO offloading on the network node?
Thanks

Comment 9 Bob Kukura 2013-12-12 23:16:00 UTC
I've cloned this as BZ 1042507 against the kernel.

Livnat, shouldn't we keep this open to track eventually verifying the kernel fix with OpenStack?

I've updated the wording of the doc text slightly, and added the command to disable GRO.

Comment 10 Bob Kukura 2013-12-13 14:44:30 UTC
This has been reproduced using both provider external networks and bridge-based (br-ex) external networks.

Comment 11 Bob Kukura 2013-12-13 16:02:06 UTC
Updated doc text workaround to persistently turn off GRO by adding:

ETHTOOL_OPTS="-K ethX gro off"

to /etc/sysconfig/network-scripts/ifcfg-ethX.

Comment 12 lpeer 2013-12-15 07:35:01 UTC
(In reply to Bob Kukura from comment #9)
> I've cloned this as BZ 1042507 against the kernel.
> 
> Livnat, shouldn't we keep this open to track eventually verifying the kernel
> fix with OpenStack?
> 

I don't think we need to, I added a comment on the Kernel bug to ask Joe or Ofer to verify also in the context of Neutron.

