Bug 772806

Summary: Disable LRO for all NICs that have LRO enabled
Product: Red Hat Enterprise Linux 6
Reporter: Mike Burns <mburns>
Component: ovirt-node
Assignee: Mike Burns <mburns>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 6.3
CC: acathrow, agospoda, bsarathy, cpelland, cshao, djuran, dledford, dyasny, fyu, gouyang, jboggs, jturner, leiwang, llim, mburns, moli, mwagner, nhorman, ovirt-maint, pcao, plundin, plyons, sghosh, sgordon, thildred, tvvcox, vbian, ycui, yeylon
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: ovirt-node-2.2.1-1.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 772317
Environment:
Last Closed: 2012-07-19 14:17:25 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 772809, 773675

Description Mike Burns 2012-01-10 01:47:10 UTC
ovirt-node should disable LRO for all NICs that have it enabled by default. This workaround will be removed later, once bug 772317 is fixed.



+++ This bug was initially created as a clone of Bug #772317 +++

There are significant performance issues reported for NICs that use LRO. We need to disable LRO for all NICs that have it enabled.

--- Additional comment from mburns on 2012-01-09 13:58:08 EST ---

Original mail comments:

So, running RHEV 3 beta for a customer this week, we've been seeing
horrible performance on the RHEV-H hosts running the bnx2x driver. It
turns out this is a problem with LRO. So we have a fix, and it works
(ethtool -K eth$ lro off).

However, how do we make this change persistent across reboots? We want
to verify that the "normal" method of putting the appropriate option
(options bnx2x disable_tpa=1) in modprobe.conf is supported. (There is
no /etc/modprobe.conf and / is ro ...)

--- Additional comment from mburns on 2012-01-09 13:59:56 EST ---

Applying the workaround is doable by placing commands in /etc/rc.local and persisting that file.
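A minimal sketch of what those rc.local commands could look like (the interface names are placeholders, and the persist invocation assumes the ovirt-node persistence utility):

# /etc/rc.local additions: turn LRO off on the affected interfaces
ethtool -K eth0 lro off
ethtool -K eth1 lro off

# On RHEV-H / ovirt-node, keep the edited file across reboots
persist /etc/rc.local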

This issue originally came up in 5.6/5.7 but was supposed to be fixed in the kernel.  Can I get some help from the kernel team with debugging/triaging this problem?

--- Additional comment from nhorman on 2012-01-09 15:56:50 EST ---

Mike, can you tell me:

1) What does the environment look like? Specifically, what kind of network interfaces are in play here: which drivers are affected, are VLANs in use, bridges vs. SR-IOV or other offload technologies?

2) The specific nature of the failure. Are frames getting dropped, and if so, where? Specific netstat, ethtool, and /proc/net/dev|snmp stats are useful here.

3) History. You said this came up in 5.6/5.7. Was the problem fixed there, or does it persist there the same way it does in RHEL 6?
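For reference, a minimal sketch of the commands that would collect the stats requested in 2) (eth0 is a placeholder for the affected interface):

# Protocol-level counters and per-interface statistics
netstat -s
cat /proc/net/dev
cat /proc/net/snmp

# Driver statistics and current offload settings for the NIC under test
ethtool -S eth0
ethtool -k eth0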

--- Additional comment from mburns on 2012-01-09 16:08:40 EST ---

Paul, can you provide the information for 1 and 2 above?

Neil,

In 5.6/5.7, we explicitly disabled LRO on all NICs where it was enabled by default; that workaround was tracked in the rhev-hypervisor bug (bug 696374). I don't know whether this particular environment has VLANs or not, though. In the 5.7/5.8 branches we still have that workaround in place, but it was never ported forward to the RHEL 6 stream.

--- Additional comment from plundin on 2012-01-09 16:48:36 EST ---

In response to the above:

1. A single RHEV-M instance managing a cluster of 6 HP nodes running RHEV-H, all using the bnx2x driver (as is normal with HP kit). No tagging, STP, or SR-IOV in use. The interfaces were, however, mode 1 (active/failover) bonded pairs.

2. It appeared to mimic a bug I found online when debugging the issue (duplicate responses/ACKs), but truthfully we were under the gun and did not save the tcpdump output. No errors or collisions were shown on the interfaces, and everything else was at defaults (i.e., nothing fancy here).

Upon making the above LRO change, network speeds increased significantly. The specific test use case was kickstarting VMs over the network. A base RHEL install took over 4 hours (as the only VM running on the hypervisor) before disabling LRO. Once LRO was disabled in the hypervisor, the install took less than 5 minutes. (Not scientific, but it pointed us where we needed to go.)

--- Additional comment from nhorman on 2012-01-09 16:54:39 EST ---

Thank you, Mike. Could you also provide some details as to what exactly needed to be fixed in RHEL 5, so we can compare to RHEL 6? IIRC, the only thing that had to be done in RHEL 5 was disabling LRO automatically when a device was added to a bridge. That functionality should already be in RHEL 6. If you are using some offload technology like SR-IOV or some other PCI virtual function technology, manual LRO disabling (or some other per-device-driver automatic disabling) is still going to be required.

--- Additional comment from mburns on 2012-01-09 17:05:19 EST ---

The fix in RHEL 5 was to simply disable LRO in all instances, on all NICs that supported it. It was a hack and a workaround, but it was sufficient for our use.

There should be no SR-IOV or anything like that in this situation.

My recollection of the issue is the same. We needed to disable LRO when adding the NIC to a bridge. Based on what Risar is saying, this wasn't happening for them. The NIC was added to a bridge, but they were still seeing problems until they explicitly disabled LRO on that interface.

--- Additional comment from nhorman on 2012-01-09 17:05:50 EST ---

Paul, thank you. So it sounds like no VLANs are in use, which is good. That confirms that this has no relation to the VLAN LRO bug I fixed in RHEL 5. That said, if you're using bonding, then I think that's where the problem lies. I don't see any way that the bonding driver can disable slave LRO at the moment, or, for that matter, tell its slaves to do so. Can we test this theory? Does the problem go away if you stop using the bond? If you attach a single interface to your bridge, does LRO get disabled, and does your performance increase?

Mike, I can take this bug over if you like.
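A minimal sketch of how that theory could be checked from the shell (bond0, br0, eth0, and eth1 are placeholder names):

# With the bond in place: do the slaves still report LRO on?
ethtool -k eth0 | grep large-receive-offload
ethtool -k eth1 | grep large-receive-offload

# Then attach a single interface directly to the bridge and re-check
brctl addif br0 eth0
ethtool -k eth0 | grep large-receive-offload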

--- Additional comment from agospoda on 2012-01-09 17:15:58 EST ---

I suspect Neil is correct on this one. The bonding driver does not have a set_flags ethtool op, and that would be required to pass down the need to disable LRO on all slave devices.

--- Additional comment from plundin on 2012-01-09 17:19:23 EST ---

Neil, I can ask the customer if they are willing to test this (the problem was encountered during a RH Consulting engagement which ended last week), but it may be a few days until they get a chance to do so.

Comment 1 Mike Burns 2012-01-10 01:51:25 UTC
Patch is posted upstream:

http://gerrit.ovirt.org/927
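The gerrit change above is the authoritative fix; as a rough sketch of the approach the summary describes (turn LRO off on every NIC that reports it enabled), assuming the standard ethtool offload output format:

#!/bin/sh
# Disable LRO on every network interface that currently has it enabled.
for path in /sys/class/net/*; do
    nic=${path##*/}
    if ethtool -k "$nic" 2>/dev/null | grep -q 'large-receive-offload: on'; then
        ethtool -K "$nic" lro off
    fi
done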

Comment 3 Mike Burns 2012-01-12 17:41:39 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A serious performance problem occurs when running with a bond and a bridge on top of NICs that use LRO. LRO should be disabled automatically when the NIC is added to a bridge, but this does not work correctly when there is a bond in between. This patch disables LRO on all NICs.

Comment 5 Tim Hildred 2012-01-25 02:22:13 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-A serious performance problem occurs when running with a bond and a bridge on top of NICs that use LRO. LRO should be disabled automatically when the NIC is added to a bridge, but this does not work correctly when there is a bond in between. This patch disables LRO on all NICs.
+Previously, using network interfaces in a bond and a bridge prevented LRO from being disabled on LRO-enabled network interface cards, causing serious network performance issues. 
+
+Now, LRO is disabled on all hypervisor network interface cards, preventing any LRO related network performance issues from occurring.

Comment 6 cshao 2012-02-24 09:19:37 UTC
Test version: 
rhev-hypervisor6-6.3-20120215.0.el6

# cat mlx4_en.conf 
options mlx4_en num_lro=0

# cat enic.conf 
options enic lro_disable=1

# cat s2io.conf 
options s2io lro=0

# cat bnx2x.conf 
options bnx2x disable_tpa=1

The bug is fixed, so change bug status to VERIFIED.
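For reference, a minimal sketch of how the effect of these options could be double-checked at runtime (the sysfs path follows the standard module-parameter convention and is an assumption, not taken from this bug; bnx2x and eth0 are examples):

# Did the loaded module pick up the option?
cat /sys/module/bnx2x/parameters/disable_tpa

# Is LRO actually off on the interface?
ethtool -k eth0 | grep large-receive-offload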

Comment 7 cshao 2012-02-27 08:27:19 UTC
Per Mike's confirmation on bug 773675 (comments 11 and 12), I just checked the configuration files for this bug on the 6.3 build.

Hi Mike,
Is it sufficient to verify this bug on our side, or does z-stream need to verify it?

Comment 9 Stephen Gordon 2012-03-27 20:57:52 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,3 @@
 Previously, using network interfaces in a bond and a bridge prevented LRO from being disabled on LRO-enabled network interface cards, causing serious network performance issues. 
 
-Now, LRO is disabled on all hypervisor network interface cards, preventing any LRO related network performance issues from occurring.
+Now, LRO is disabled on all Hypervisor network interface cards, avoiding LRO related network performance issues.

Comment 10 Stephen Gordon 2012-05-28 16:27:34 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,3 @@
-Previously, using network interfaces in a bond and a bridge prevented LRO from being disabled on LRO-enabled network interface cards, causing serious network performance issues. 
+Previously, using network interfaces in both a bond and a bridge prevented LRO from being disabled on LRO-enabled network interface cards, causing serious network performance issues. 
 
 Now, LRO is disabled on all Hypervisor network interface cards, avoiding LRO related network performance issues.

Comment 11 Stephen Gordon 2012-05-28 16:29:00 UTC
Removing the technical note flag, given that the next bug on my list was bug 772809, which appears to revert this change...

Comment 12 Stephen Gordon 2012-06-15 13:58:15 UTC
Deleted Technical Notes Contents.

Old Contents:
Previously, using network interfaces in both a bond and a bridge prevented LRO from being disabled on LRO-enabled network interface cards, causing serious network performance issues. 

Now, LRO is disabled on all Hypervisor network interface cards, avoiding LRO related network performance issues.

Comment 14 errata-xmlrpc 2012-07-19 14:17:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0741.html