Bug 772809 - Revert Disable LRO for all NICs that have LRO enabled
Summary: Revert Disable LRO for all NICs that have LRO enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ovirt-node
Version: 6.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Mike Burns
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 772317 772806
Blocks:
Reported: 2012-01-10 01:49 UTC by Mike Burns
Modified: 2016-04-26 14:22 UTC
CC List: 27 users

Fixed In Version: ovirt-node-2.3.0-4.el6
Doc Type: Bug Fix
Doc Text:
Clone Of: 772317
Environment:
Last Closed: 2012-07-19 14:17:33 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2012:0741 | 0 | normal | SHIPPED_LIVE | ovirt-node bug fix and enhancement update | 2012-07-19 18:10:46 UTC

Description Mike Burns 2012-01-10 01:49:48 UTC
Revert workaround from bug 772806 once the kernel bug is fixed

+++ This bug was initially created as a clone of Bug #772317 +++

+++ This bug was initially created as a clone of Bug #692656 +++

There are significant performance issues reported for NICs that use LRO.  We need to disable LRO for all nics that have it enabled.

--- Additional comment from mburns on 2012-01-09 13:58:08 EST ---

Original mail comments:

So, running RHEV 3 beta for a customer this week and we've been seeing 
horrible performance on the RHEV-H hosts running the bnx2x driver. It 
turns out this is a problem with LRO. So we have a fix and it works 
(ethtool -K eth$ lro off).

However how do we make this change persistent across reboots ? We want 
to verify that the "normal" method of putting the appropriate option 
(options bnx2x disable_tpa=1) in modprobe.conf is supported. (There is 
no /etc/modprobe.conf and / is ro ... ).
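The state that the mail above toggles with `ethtool -K` can be read back from `ethtool -k`. The following is a minimal sketch, assuming the usual `large-receive-offload: on|off` line in that output; the helper name `lro_enabled` and the sample text are illustrative and do not come from the original report.

```python
def lro_enabled(ethtool_k_output):
    """Return True/False for the LRO state in `ethtool -k` output, or None if absent."""
    for line in ethtool_k_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "large-receive-offload":
            # The value may carry a suffix such as "[fixed]"; only the first word matters.
            words = value.split()
            return bool(words) and words[0] == "on"
    return None


# Hypothetical sample output, trimmed to two features.
sample = "Features for eth0:\ngeneric-receive-offload: on\nlarge-receive-offload: on\n"
```

On a real host the text could come from `subprocess.run(["ethtool", "-k", "eth0"], capture_output=True, text=True).stdout`, with `eth0` replaced by the interface under test.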

--- Additional comment from mburns on 2012-01-09 13:59:56 EST ---

Applying the workaround is doable by placing commands in /etc/rc.local and persisting.  

This issue originally came up in 5.6/5.7 but was supposed to be fixed in the kernel.  Can I get some help from the kernel team with debugging/triaging this problem?

--- Additional comment from nhorman on 2012-01-09 15:56:50 EST ---

Mike, can you tell me:

1) What the environment looks like?  Specifically what kind of network interfaces are in play here?  Specific affected drivers, vlans in use, bridges in use vs. sriov or other offload technologies?

2) The specific nature of the failure.  Are frames getting dropped, and if so, where?  Specific netstat, ethtool, and /proc/net/dev|snmp stats are useful here

3) History.  You said this came up in 5.6/5.7. Is the problem fixed there, or does it persist there the same way it does in RHEL6?

--- Additional comment from mburns on 2012-01-09 16:08:40 EST ---

Paul,  Can you provide the information for 1 and 2 above?? 

Neil,

In 5.6/5.7, we explicitly disabled LRO on all nics where it was enabled by default.  The rhev-hypervisor bug (bug 696374) mentioned bug 696374.  I don't know if this particular environment has vlans or not though.  In the 5.7/5.8 branches, we still have that workaround in place, but it was never ported forward to the RHEL 6 stream.

--- Additional comment from plundin on 2012-01-09 16:48:36 EST ---

In response to the above:

1. A single RHEV-M instance managing a cluster of 6 HP nodes running RHEV-H, all using the bnx2x driver (as is normal with HP kit). No tagging, STP or SRIOV in use. Interfaces were however mode 1 bonded (active/failover) pairs.

2. It appeared to mimic a bug I found online when debugging the issue (duplicate responses/acks), but truthfully we were under the gun and did not save the tcpdump output. No errors or collisions shown on the interfaces, and everything else was defaults (i.e. nothing fancy here). 

Upon making the above LRO change, network speeds increased significantly. The specific test use case was kickstarting VMs over the network. A base RHEL install took over 4 hours (as the only VM running on the hypervisor) before disabling LRO. Once LRO was disabled in the hypervisor, the install took less than 5 minutes. (Not scientific, but it pointed us where we needed to go.)

--- Additional comment from nhorman on 2012-01-09 16:54:39 EST ---

Thank you, Mike. Could you also provide some details as to what exactly needed to be fixed in RHEL5 so we can compare to RHEL6?  IIRC the only thing that had to be done in RHEL5 was the disabling of lro automatically when a device was added to a bridge.  That functionality should already be in RHEL6. If you are using some offload technology like sriov or some other pci virtual function technology, manual lro disabling (or some other per-device-driver automatic disabling) is still going to be required.

--- Additional comment from mburns on 2012-01-09 17:05:19 EST ---

The fix in RHEL5 was to simply disable LRO in all instances on all nics that supported it.  It was a hack and workaround, but was sufficient for our use.  

There should be no sr-iov or anything like that in this situation.  

My recollection of the issue was the same.  We needed to disable lro when adding the nic to a bridge.  Based on what Risar is saying, this wasn't happening for them.  The nic was added to a bridge, but they were still seeing problems until they explicitly disabled LRO on that interface.

--- Additional comment from nhorman on 2012-01-09 17:05:50 EST ---

Paul, thank you.  So it sounds like no vlans are in use, which is good.  That confirms that this has no relation to the vlan lro bug I fixed in RHEL5.  That said, if you're using bonding, then I think that's where the problem lies.  I don't see any way that the bonding driver can disable slave lro at the moment, or for that matter, tell its slaves to do so.  Can we test this theory?  Does the problem go away if you stop using the bond? If you attach a single interface to your bridge, does lro get disabled, and does your performance increase?

Mike, I can take this bug over if you like.

--- Additional comment from agospoda on 2012-01-09 17:15:58 EST ---

I suspect Neil is correct on this one.  The bonding driver does not have a set_flags ethtool op and this would be required to pass down the need to disable LRO on all slave devices.
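
Until the bonding driver can pass the setting down, LRO would have to be turned off on each slave individually. The following is a rough sketch of that idea, assuming the bonding driver's sysfs file `/sys/class/net/<bond>/bonding/slaves` (a space-separated list of slave names). The function names and the `bond0`/`eth0`/`eth1` names are hypothetical, and the commands are only built, not executed.

```python
from pathlib import Path


def parse_slaves(slaves_text):
    """Split the space-separated content of a bonding `slaves` file."""
    return slaves_text.split()


def lro_off_commands(slaves):
    """Build one `ethtool -K <nic> lro off` argument list per slave."""
    return [["ethtool", "-K", nic, "lro", "off"] for nic in slaves]


def read_bond_slaves(bond, sysfs_root="/sys/class/net"):
    """Read the slave list for `bond`; assumes the bonding sysfs layout."""
    slaves_file = Path(sysfs_root) / bond / "bonding" / "slaves"
    return parse_slaves(slaves_file.read_text())
```

For example, `lro_off_commands(parse_slaves("eth0 eth1\n"))` yields two argument lists that could be handed to `subprocess.run` on a host where changing the offload setting is acceptable.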

--- Additional comment from plundin on 2012-01-09 17:19:23 EST ---

Neil, I can ask the customer if they are willing to test this (The problem was encountered during a RH Consulting engagement which ended last week) but it may be a few days until they get a chance to do so.

--- Additional comment from mburns on 2012-01-09 17:24:12 EST ---

(In reply to comment #10)

> 
> Mike, I can take this bug over if you like.

Neil, go ahead.  I'll clone if I end up needing to put a workaround into RHEV-H directly.

--- Additional comment from mburns on 2012-01-09 20:45:02 EST ---

Moving to kernel

Comment 1 Mike Burns 2012-01-10 01:52:23 UTC
Blocked by kernel bug 772317

Comment 2 Mike Burns 2012-04-11 15:48:53 UTC
http://gerrit.ovirt.org/3474

Comment 4 Mike Burns 2012-04-20 15:02:08 UTC
Testing: 

no /etc/modprobe.d/*.conf files for disabling lro
check for enic.conf bnx2x.conf mlx4_en.conf s2io.conf
None should be there
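
The check above could be automated along these lines. This is a sketch only, assuming the files would live under `/etc/modprobe.d` (the directory shown in the verification listing in comment 7); the function names are hypothetical, and the file names are the four listed in the test steps.

```python
from pathlib import Path

# File names taken from the test steps above.
LRO_WORKAROUND_FILES = ("enic.conf", "bnx2x.conf", "mlx4_en.conf", "s2io.conf")


def leftover_lro_configs(names):
    """Return, sorted, the workaround file names that appear in `names`."""
    return sorted(set(names) & set(LRO_WORKAROUND_FILES))


def list_config_dir(directory="/etc/modprobe.d"):
    """List entry names in `directory`; the default path is an assumption."""
    return [entry.name for entry in Path(directory).iterdir()]
```

On a host, `leftover_lro_configs(list_config_dir())` should return an empty list if the workaround has been reverted.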

Comment 7 Guohua Ouyang 2012-04-28 05:21:55 UTC
Verified on rhevh-6.3-20120426.2, no /etc/modprobe.d/*.conf files for disabling lro.

[root@dhcp-8-209 modprobe.d]# ls
blacklist.conf      dist-alsa.conf  dist-oss.conf  ovirt-qla4xxx.conf
blacklist-kvm.conf  dist.conf       libmlx4.conf   vdsm.conf

Comment 8 Stephen Gordon 2012-05-28 16:30:48 UTC
Please advise whether a release note is required on this one. I had a release note on Bug # 772806, which this bug apparently reverts - but they are both attached to the same errata. Does this mean that the actual change from customer POV, at least as far as this component, is that there was no change?

Comment 9 Mike Burns 2012-06-13 17:19:15 UTC
(In reply to comment #8)
> Please advise whether a release note is required on this one. I had a
> release note on Bug # 772806, which this bug apparently reverts - but they
> are both attached to the same errata. Does this mean that the actual change
> from customer POV, at least as far as this component, is that there was no
> change?

Yes, from customer POV, there is no change.

Comment 10 Mike Burns 2012-06-13 17:19:16 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously a workaround was introduced in ovirt-node due to a bug in the kernel.  Now, since the kernel is fixed, we can revert that workaround from ovirt-node.

Comment 11 Stephen Gordon 2012-06-15 13:57:33 UTC
Deleted Technical Notes Contents.

Old Contents:
Previously a workaround was introduced in ovirt-node due to a bug in the kernel.  Now, since the kernel is fixed, we can revert that workaround from ovirt-node.

Comment 13 errata-xmlrpc 2012-07-19 14:17:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0741.html

