Bug 640803

Summary: [RHEL4.8.z] soft lockup on vlan with bonding in balance-alb mode
Product: Red Hat Enterprise Linux 4
Reporter: Flavio Leitner <fleitner>
Component: kernel
Assignee: Flavio Leitner <fleitner>
Status: CLOSED ERRATA
QA Contact: Network QE <network-qe>
Severity: high
Priority: urgent
Version: 4.8
CC: agospoda, benj, benlu, bshepher, dhoward, dmoore, fleitner, gianluca.cecchi, govind.rhul, haliu, herbert.xu, hjia, jolsa, jpirko, justdave, jwest, khorenko, kzhang, liko, nenad, nhorman, pep, peterm, plsmith, plyons, roy.keene, schlegel, sgruszka, shyam, Stuart.Kirk, tao, tgraf, villapla, YKonovalov
Target Milestone: rc
Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2011-02-16 15:39:47 UTC
Bug Depends On: 578531    
Bug Blocks: 641254    
Attachments:
patch from customer. (Flags: none)

Description Flavio Leitner 2010-10-06 19:52:51 UTC
+++ This bug was initially created as a clone of Bug #578531 +++

The customer has eth0 and eth1 in bond0, with the native VLAN on bond0 and
VLAN interfaces bond0.151 and bond0.167. After updating to
2.6.9-89.0.28.ELsmp, he experiences a kernel panic when bringing up
bond0.151 or bond0.167.

Creating a VLAN interface on a bond of two bnx2 adapters in a two-switch configuration results in a soft lockup after a few seconds.

Stable kernel: 2.6.9-89.0.19.ELsmp
Unstable kernel: 2.6.9-89.0.28.ELsmp

The customer applied the patch from RHBZ#578531 and reported that it fixed
the problem. He also provided a screenshot of the panic, which matches
the problem description.

... copied and pasted from the original bug ...
--- Additional comment from fleitner on 2010-07-19 10:28:36 EDT ---

Created attachment 432903
suggested patch

Hi Andy,

The problem was introduced by the following patch:
[net] bonding: allow arp_ip_targets on separate vlan from bond device

and was not fixed by the later patch:
[net] fixup problems with vlans and bonding

The problem happens because, in rlb_arp_recv(), the struct bonding *bond
pointer actually comes from a VLAN's net_device struct rather than the
bonding device's, so the kernel can either oops or hang on an invalid spinlock.

I can reproduce both failure modes by following the instructions in the
ticket's summary.

The upstream fix changes rlb_arp_recv() to check for the IFF_802_1Q_VLAN
flag and, if it is present, find the underlying bonding device.

I have backported the patch and it works in my tests.
Please review.
fbl

---
The patch was later reworked into the final version:

--- Additional comment from agospoda on 2010-07-26 17:42:28 EDT ---

OK, this looks correct:

http://git.engineering.redhat.com/?p=users/agospoda/rhel5-gtest.git;a=commitdiff;h=c35d16c57231b4700b6f4b27dbe088d7b187472a

--- Additional comment from agospoda on 2010-07-27 16:38:24 EDT ---

Created attachment 434836
bonding-fix-alb-mode-to-balance-traffic-on-vlans-updated.patch

Here is an updated patch.  Feedback is welcome.

Comment 1 Flavio Leitner 2010-10-06 19:59:21 UTC
Created attachment 451975
patch from customer.

Comment 8 Vivek Goyal 2010-10-12 18:40:21 UTC
Committed in 89.41.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 13 errata-xmlrpc 2011-02-16 15:39:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html