Bug 659594 - Kernel panic when restart network on vlan with bonding
Summary: Kernel panic when restart network on vlan with bonding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Neil Horman
QA Contact: Petr Beňas
URL:
Whiteboard:
Duplicates: 654600 679499 689759 725849
Depends On:
Blocks: 675664 689759 707606
Reported: 2010-12-03 05:40 UTC by Liang Zheng
Modified: 2018-12-01 14:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A bug in the bonding driver could be triggered when netpoll was in use and slaves were changed, added to, or removed from a bond. During these operations the driver used a per-CPU flag at the wrong time, which could lead it to detect an invalid state and trigger a kernel panic. With this update, the use of the per-CPU flag has been corrected and the kernel panic no longer occurs.
Clone Of:
Environment:
Last Closed: 2011-07-21 09:54:36 UTC
Target Upstream Version:
Embargoed:


Attachments
panic on shutdown rh el 5.6 cluster with 5 ip alias defined on 5 services (120.64 KB, image/png)
2011-02-11 10:32 UTC, Gianluca Cecchi


Links
System: Red Hat Product Errata
ID: RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Description Liang Zheng 2010-12-03 05:40:29 UTC
Description of problem:
Create a VLAN interface on a bond of two adapters.
Running "service network restart" then causes a kernel panic.

Version-Release number of selected component (if applicable):
kernel panic on 2.6.18-232.el5
but not on 2.6.18-194.el5

How reproducible:
often

Steps to Reproduce:
1. Configure the network as follows:
vim /etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding

/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
HOTPLUG=no

/etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
HOTPLUG=no

/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=6 miimon=300"
ONBOOT=yes
BOOTPROTO=none

/etc/sysconfig/network-scripts/ifcfg-bond0.10
DEVICE=bond0.10
ONBOOT=yes
REORDER_HDR=no
VLAN=yes
BOOTPROTO=static
IPADDR=192.168.18.18
NETMASK=255.255.255.0


2. service network start

3. Repeat the "service network restart" command several times (about 10 times)
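
Step 3 can be scripted. The following is a minimal sketch, assuming a POSIX shell run as root on a disposable test machine; the repeat_cmd helper name is illustrative and not part of the original report.

```shell
#!/bin/sh
# repeat_cmd COUNT CMD [ARGS...]
# Run CMD COUNT times, printing the iteration number before each run.
# (Hypothetical helper, not from the original report.)
repeat_cmd() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i"
        # Keep looping even if one run fails, but report the exit status.
        "$@" || echo "command exited with status $?"
        i=$((i + 1))
    done
}

# Usage (as root; an affected kernel may panic before the loop finishes):
#   repeat_cmd 10 service network restart
```

If the machine panics, the last "iteration N" line printed on the console shows how many restarts it took.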

Actual results:
kernel panic
[root@ibm-ls21-03 network-scripts]# service network restart
Shutting down interface bond0.10:  Removed VLAN -:bond0.10:-
[  OK  ]
Shutting down interface bond0:  bonding: bond0: Warning: the permanent HWaddr
of eth0 - 00:14:5E:6D:1C:B8 - is still in use by bond0. Set the HWaddr of eth0
to a different address to avoid conflicts.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/net/bonding/bonding.h:135
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/net/bond0/bonding/slaves
CPU 0 
Modules linked in: bonding 8021q autofs4 hidp rfcomm l2cap bluetooth lockd
sunrpc ipv6 xfrm_nalgo crypto_api loop dm_multipath scsi_dh video backlight sbs
power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac
parport_pc lp parport sg i2c_piix4 tpm_tis k8temp i2c_core k8_edac bnx2 tpm
hwmon edac_mc serio_raw tpm_bios pcspkr dm_raid45 dm_message dm_region_hash
dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp mptsas mptscsih
mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 12674, comm: ifdown-eth Not tainted 2.6.18-232.el5 #1
RIP: 0010:[<ffffffff884c2c0b>]  [<ffffffff884c2c0b>]
:bonding:bond_release+0x62/0x4f1
RSP: 0018:ffff810127609e28  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: 00000000000005dc RCX: ffffffff80318f28
RDX: ffffffff80318f28 RSI: ffff81022c488000 RDI: ffff8101281f2530
RBP: ffff8101281f2500 R08: ffffffff80318f28 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: ffff8101281f2000
R13: 0000000000000006 R14: ffff81022c488000 R15: ffff81012ea50ac0
FS:  00002addc8667f50(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000388ca69220 CR3: 000000012743a000 CR4: 00000000000006e0
Process ifdown-eth (pid: 12674, threadinfo ffff810127608000, task
ffff810128bc57a0)
Stack:  00000000000080d0 ffffffff8006456b ffff810128bc57a0 00000000000005dc
 ffff81022c488000 ffff8101281f2500 0000000000000006 0000000000000006
 ffff81012ea50ac0 ffffffff884cbb54 000000316874652d 0000000000000000
Call Trace:
 [<ffffffff8006456b>] __down_write_nested+0x12/0x92
 [<ffffffff884cbb54>] :bonding:bonding_store_slaves+0x25c/0x2f7
 [<ffffffff8010fdb5>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016af0>] vfs_write+0xce/0x174
 [<ffffffff800173a8>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 0f 0b 68 aa d4 4c 88 c2 87 00 4c 8b 6d 08 31 c0 eb 0c 4d 39 
RIP  [<ffffffff884c2c0b>] :bonding:bond_release+0x62/0x4f1
 RSP <ffff810127609e28>
 <0>Kernel panic - not syncing: Fatal exception
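
The call trace above passes through sysfs_write_file and bonding_store_slaves into bond_release. That is consistent with ifdown-eth releasing a slave by writing its name, prefixed with "-", to the bond's sysfs slaves file (the stack word 000000316874652d reads as the ASCII bytes "-eth1" in little-endian order). A minimal sketch of that write follows, assuming the bonding driver's sysfs interface; the release_slave helper and its optional file argument are hypothetical and exist only so the write can be tried against a scratch file first.

```shell
#!/bin/sh
# release_slave SLAVE [SLAVES_FILE]
# Ask the bonding driver to release SLAVE by writing "-SLAVE" to the
# bond's sysfs slaves file. SLAVES_FILE defaults to bond0's file.
# (Hypothetical helper, not from the original report.)
release_slave() {
    slave=$1
    slaves_file=${2:-/sys/class/net/bond0/bonding/slaves}
    echo "-$slave" > "$slaves_file"
}

# Usage (as root; on an affected kernel this write may hit the BUG above):
#   release_slave eth1
```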

Expected results:


Additional info:

Comment 1 Liang Zheng 2010-12-03 05:42:22 UTC
It does not reproduce on 2.6.18-194.el5,
so I think it is a regression.

Comment 2 Neil Horman 2010-12-06 15:28:29 UTC
*** Bug 659558 has been marked as a duplicate of this bug. ***

Comment 3 Neil Horman 2010-12-06 16:02:17 UTC
*** Bug 654600 has been marked as a duplicate of this bug. ***

Comment 4 Neil Horman 2010-12-06 19:10:18 UTC
http://marc.info/?l=linux-netdev&m=129166237512572&w=3

I've posted a patch for this upstream, and will backport it for RHEL once it's accepted.

Comment 6 Neil Horman 2010-12-10 18:39:49 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2966935

Test build with backport

Comment 7 RHEL Program Management 2011-02-01 16:51:10 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Jarod Wilson 2011-02-09 14:56:59 UTC
in kernel-2.6.18-243.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 16 Gianluca Cecchi 2011-02-11 10:31:11 UTC
Hello,
I have a case open for a similar problem: 00414963
You can check it for further details.
My problem doesn't involve a VLAN, but a Red Hat cluster where I have 7 services, 5 of them with an associated IP (so 5 IP aliases defined).
When I run
shutdown -r
with all the services running, I get a panic; I'm going to attach a screenshot.
If I manually stop the services and then shut down, all goes well without a panic.
It seems that kernel-2.6.18-243.el5 solves it for me too.

Comment 17 Gianluca Cecchi 2011-02-11 10:32:25 UTC
Created attachment 478208 [details]
panic on shutdown rh el 5.6 cluster with 5 ip alias defined on 5 services

Comment 29 Jiri Pirko 2011-02-25 22:56:21 UTC
*** Bug 679499 has been marked as a duplicate of this bug. ***

Comment 34 Gary Smith 2011-03-22 12:48:35 UTC
*** Bug 689759 has been marked as a duplicate of this bug. ***

Comment 37 Martin Wilck 2011-04-04 08:47:39 UTC
As this problem is understood and a fix is available, when can we expect a z-stream release?

Comment 38 Anton Arapov 2011-04-04 08:52:22 UTC
Martin, very soon: https://bugzilla.redhat.com/show_bug.cgi?id=675664

Comment 39 Martin Prpič 2011-04-14 10:15:11 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A bug in the bonding driver could be triggered when netpoll was in use and slaves were changed, added to, or removed from a bond. During these operations the driver used a per-CPU flag at the wrong time, which could lead it to detect an invalid state and trigger a kernel panic. With this update, the use of the per-CPU flag has been corrected and the kernel panic no longer occurs.

Comment 41 Petr Beňas 2011-04-19 14:07:30 UTC
Reproduced in 2.6.18-241.el5 and verified in 2.6.18-243.el5.
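
To check whether a running kernel is at or past the verified build, the build number can be compared against 243. This is a rough sketch, assuming a RHEL 5 release string of the form 2.6.18-N.el5; the kernel_build and has_fix_build helper names are illustrative. It compares only the main build number, so a z-stream kernel with a lower build number that carries a backport would be reported as unfixed.

```shell
#!/bin/sh
# kernel_build RELEASE
# Print the build number N from a string such as 2.6.18-N.el5.
# (Hypothetical helper, not from the original report.)
kernel_build() {
    rel=${1#*-}        # drop everything up to the first "-": "243.el5"
    echo "${rel%%.*}"  # keep what precedes the first ".": "243"
}

# has_fix_build RELEASE
# Succeed if RELEASE is a 2.6.18 kernel with build number >= 243.
# Returns 2 for release strings that are not 2.6.18-based.
has_fix_build() {
    case $1 in
        2.6.18-*) [ "$(kernel_build "$1")" -ge 243 ] ;;
        *) return 2 ;;
    esac
}

# Usage:
#   has_fix_build "$(uname -r)" && echo "build is 243 or later"
```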

Comment 42 errata-xmlrpc 2011-07-21 09:54:36 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

Comment 43 Neil Horman 2011-07-27 17:52:30 UTC
*** Bug 725849 has been marked as a duplicate of this bug. ***

