Bug 582367

Summary: implement dev_disable_lro for RHEL5
Product: Red Hat Enterprise Linux 5
Reporter: Neil Horman <nhorman>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA
QA Contact: Hangbin Liu <haliu>
Severity: high
Docs Contact:
Priority: high
Version: 5.6
CC: apevec, dhoward, djuran, haliu, hjia, jon.mason, jpirko, jrankin, jwest, liko, noc, pasik, sgruszka, sgzijl, tao
Target Milestone: rc
Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Large-Receive Offload (LRO) is a performance optimization that enables the kernel to fetch and process, as a unit, more than one received packet from a network device. It was previously not possible to dynamically disable LRO for devices in a forwarding mode. This has been fixed with this update so that the kernel is able to dynamically disable LRO for devices in a forwarding state, or which had had LRO turned on manually.
Story Points: ---
Clone Of:
: 584359 (view as bug list) Environment:
Last Closed: 2011-01-13 21:27:23 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514491, 584359, 596385, 612856    
Attachments:
Description                             Flags
omnibus patch for this bug              none
infrastrucuture dev_disable_lro patch   none
bnx2x patch                             none
enic patch                              none
mlx4 patch                              none
s2io patch                              none

Description Neil Horman 2010-04-14 18:26:28 UTC
Description of problem:
Nominally, LRO and devices in a forwarding mode are mutually exclusive.  Upstream has the ability to dynamically disable LRO for devices put into a forwarding state, while RHEL5 does not.  The current RHEL5 solution is that drivers supporting LRO disable it by default using driver-specific module options, but this doesn't cover the case where users explicitly enable LRO.  In those cases, forwarding with LRO on can result in various odd behavior.

Version-Release number of selected component (if applicable):



Additional info:

Currently these drivers require the lro_disable feature:
s2io
mlx4_en
bnx2x
enic
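
A minimal sketch of how the conflicting configuration described above could be spotted from user space: it reports whether an interface has LRO on while IPv4 forwarding is enabled. The helper names (lro_state, lro_conflict) are hypothetical and are not part of the patches attached to this bug. The label that `ethtool -k` prints for LRO differs between ethtool versions, so both spellings matched below are assumptions, and bridge membership is not checked here.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the patches attached to this bug.

# Print the LRO state ("on" or "off") given `ethtool -k <iface>` output on stdin.
# Assumption: the feature line reads "large-receive-offload: on|off"
# or "large receive offload: on|off", depending on the ethtool version.
lro_state() {
    sed -n 's/^[[:space:]]*large[- ]receive[- ]offload:[[:space:]]*\([a-z]*\).*$/\1/p' | head -n 1
}

# Succeed (exit 0) when forwarding is enabled and LRO is on.
#   $1: contents of /proc/sys/net/ipv4/ip_forward ("0" or "1")
#   $2: LRO state as printed by lro_state
lro_conflict() {
    [ "$1" = "1" ] && [ "$2" = "on" ]
}

# Example use (commented out because it needs a real device):
#   lro_conflict "$(cat /proc/sys/net/ipv4/ip_forward)" \
#                "$(ethtool -k eth0 | lro_state)" \
#       && echo "eth0: LRO is on while forwarding is enabled"
```

Keeping the parsing separate from the decision means the check can be exercised on captured output without touching a NIC.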

Comment 1 Neil Horman 2010-04-14 20:22:37 UTC
Created attachment 406634 [details]
omnibus patch for this bug

This is my first pass at this patch.  It should be split into a core component and individual driver components before posting.

I'm validating it on bnx2x currently.  Need to find mlx4, enic and s2io cards to test there prior to posting.

Comment 2 Neil Horman 2010-04-14 20:36:57 UTC
Note: we need to validate that the NIC is up prior to resetting the hardware, or we get weird behavior.  I just got a BUG on bnx2x in testing, which resulted from adding an interface to a bridge while the interface was down. This will likely be the case for all affected drivers, as their LRO configuration occurs during probe on module insert, rather than on device open/close.
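
When testing intermediate builds, the down-interface case described above can be avoided from user space by checking the IFF_UP bit before adding the interface to a bridge. This is only a sketch of a tester-side precaution, not the driver-side fix. The helper name iface_is_up is hypothetical, and it assumes the flags word is read from /sys/class/net/<iface>/flags.

```shell
#!/bin/sh
# Hypothetical helper; a user-space precaution, not the driver-side fix.

# Succeed when the IFF_UP bit (0x1) is set in an interface flags word,
# for example the hex value read from /sys/class/net/<iface>/flags.
iface_is_up() {
    [ $(( $1 & 0x1 )) -ne 0 ]
}

# Example use (commented out because it touches real devices):
#   if iface_is_up "$(cat /sys/class/net/eth0/flags)"; then
#       brctl addif bridge eth0
#   else
#       echo "eth0 is down; not adding it to the bridge" >&2
#   fi
```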

Comment 3 Neil Horman 2010-04-15 19:44:14 UTC
Created attachment 406909 [details]
infrastrucuture dev_disable_lro patch

This is the cleaned-up, broken-out patch series for this bug.

Comment 4 Neil Horman 2010-04-15 19:45:08 UTC
Created attachment 406910 [details]
bnx2x patch

Comment 5 Neil Horman 2010-04-15 19:45:45 UTC
Created attachment 406911 [details]
enic patch

Comment 6 Neil Horman 2010-04-15 19:46:41 UTC
Created attachment 406912 [details]
mlx4 patch

Comment 7 Neil Horman 2010-04-15 19:46:58 UTC
Created attachment 406913 [details]
s2io patch

Comment 8 Stanislaw Gruszka 2010-04-16 14:30:04 UTC
*** Bug 518531 has been marked as a duplicate of this bug. ***

Comment 11 Siert Z. 2010-05-13 11:25:03 UTC
I tested Stanislaw's xen kernel on RHEL5.5 x86_64 (http://people.redhat.com/sgruszka/rhel5/bz573114/). 

The hardware: HP BL460c + bnx2x (Virtual connect - Broadcom Corporation NetXtreme II BCM57711E 10-Gigabit PCIe).

[root@hsl0000 ~]# uname -a
Linux hsl0000.domain.local 2.6.18-197.el5.bnx2x_testxen #1 SMP Wed Apr 28 08:58:56 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Everything works like a charm now.

Comment 17 Jarod Wilson 2010-05-28 15:21:49 UTC
in kernel-2.6.18-201.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 19 Siert Z. 2010-05-28 19:40:53 UTC
I was unable to boot the Xen kernel with the append dom0_mem=512M:                                                                  

  title Red Hat Enterprise Linux Server (2.6.18-201.el5xen)
    root (hd0,0)
    kernel /xen.gz-2.6.18-201.el5 dom0_mem=512M
    module /vmlinuz-2.6.18-201.el5xen ro root=/dev/vgsan/root

Rebooted and noticed that all other kernels failed with the same error:
    "Starting udev: Kernel panic - not syncing: Out of memory and no killable processes..."

Decided to change the append to dom0_mem=2048M and the problem doesn't occur any more.

Testing bandwidth to the Dom0 itself (10*100MB) from my desktop:

zijls@htn-ws-1376:~$ for i in {0..10}; do scp hsl0000:100MB /dev/null ; done
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    


Testing bandwidth to DomU -guest- (10*100MB) from my desktop:

zijls@htn-ws-1376:~$ for i in {0..10}; do scp 10.10.4.7:100MB /dev/null ; done
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    
100MB                                               100%  100MB  11.1MB/s   00:09    

Also tested bandwidth from another physical server to the Xen guest. On the guest I started `nc -l 12345 >/dev/null` and on the physical machine a dd of 1GB over the wire:

[zijls@hsl2000 ~]$ dd if=/dev/zero bs=1024k count=1000 | nc 10.10.4.7 12345
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.7358 seconds, 71.2 MB/s

Everything still seems to be fine.

If any further testing is desired, don't hesitate to contact me.

PS: I created an SOS report and uploaded it to RHN, case 2022358.

Comment 20 Stanislaw Gruszka 2010-05-31 08:05:23 UTC
(In reply to comment #19)
> I was unable to boot the Xen kernel with the append dom0_mem=512M:              
> 
>   title Red Hat Enterprise Linux Server (2.6.18-201.el5xen)
>     root (hd0,0)
>     kernel /xen.gz-2.6.18-201.el5 dom0_mem=512M
>     module /vmlinuz-2.6.18-201.el5xen ro root=/dev/vgsan/root
> 
> Rebooted and noticed that all other kernels failed with the same error:
>     "Starting udev: Kernel panic - not syncing: Out of memory and no killable
> processes..."

IIRC this is a regression, but it seems to be unrelated to the network driver patches. Please open a new bug report against the xen component for that issue.

Comment 25 Douglas Silas 2010-06-28 20:26:57 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Large-Receive Offload (LRO) is a performance optimization that enables the kernel to fetch and process, as a unit, more than one received packet from a network device. It was previously not possible to dynamically disable LRO for devices in a forwanding mode. This has been fixed with this update so that the kernel is able to dynamically disable LRO for devices in a forwarding state, or which had had LRO turned on manually.

Comment 28 Stanislaw Gruszka 2010-08-02 09:24:56 UTC
*** Bug 586352 has been marked as a duplicate of this bug. ***

Comment 29 Alan Pevec 2010-11-19 08:44:07 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Large-Receive Offload (LRO) is a performance optimization that enables the kernel to fetch and process, as a unit, more than one received packet from a network device. It was previously not possible to dynamically disable LRO for devices in a forwanding mode. This has been fixed with this update so that the kernel is able to dynamically disable LRO for devices in a forwarding state, or which had had LRO turned on manually.
+Large-Receive Offload (LRO) is a performance optimization that enables the kernel to fetch and process, as a unit, more than one received packet from a network device. It was previously not possible to dynamically disable LRO for devices in a forwarding mode. This has been fixed with this update so that the kernel is able to dynamically disable LRO for devices in a forwarding state, or which had had LRO turned on manually.

Comment 33 Alan Pevec 2010-11-26 09:15:07 UTC
Comment on attachment 406910 [details]
bnx2x patch

Thanks for the clarification, Stanislaw!
Marking the patch in comment 4 obsolete to avoid further confusion.

Comment 34 Hangbin Liu 2010-12-03 07:43:34 UTC
Verified the bnx2x driver on kernel 2.6.18-235.el5.

# uname -a
Linux hp-bl685cg6-01.rhts.eng.bos.redhat.com 2.6.18-235.el5 #1 SMP Wed Dec 1 12:27:10 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
# ethtool -i eth0
driver: bnx2x
version: 1.52.53-4
firmware-version: bc 4.8.0 phy baa0:0105
bus-info: 0000:03:00.0
# service network stop
# modprobe -r bnx2x
# modprobe bnx2x disable_tpa=0     
PCI: Enabling device 0000:03:00.0 (0040 -> 0042)
PCI: Enabling device 0000:03:00.1 (0040 -> 0042)
# service network start
# brctl addbr bridge
Bridge firewalling registered
# brctl addif bridge eth0
# dmesg | tail
...
Disabled lro on eth0
...

On kernel 2.6.18-194.el5 there is no such message


Did not test mlx4_en and s2io, as those patches are not included.
enic has an updated driver version in kernel 2.6.18-227.el5 that replaces LRO with GRO, so it does not need testing either.
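
The check at the end of the session above can be scripted. The following is a minimal sketch that looks for the message in kernel log text. The helper name lro_disabled_in_log is hypothetical, and it assumes the log line has the form "Disabled lro on <iface>" as seen in the dmesg output above. Interface names containing regex metacharacters are not handled.

```shell
#!/bin/sh
# Hypothetical helper built around the message quoted in this comment.

# Succeed when kernel log text on stdin contains the LRO-disable message
# for interface $1. The trailing group keeps "eth0" from matching "eth01".
lro_disabled_in_log() {
    grep -Eq "Disabled lro on $1([^[:alnum:]_.-]|\$)"
}

# Example use (commented out because it depends on the running kernel):
#   brctl addif bridge eth0
#   dmesg | lro_disabled_in_log eth0 && echo "LRO was disabled on eth0"
```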

Comment 36 errata-xmlrpc 2011-01-13 21:27:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html