Bug 554706 - Kernel: network: bonding: scheduling while atomic: ifdown-eth/0x00000100/21775
Summary: Kernel: network: bonding: scheduling while atomic: ifdown-eth/0x00000100/21775
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Network QE
URL:
Whiteboard:
Depends On:
Blocks: 581148 631853 631854
Reported: 2010-01-12 13:11 UTC by Oded Ramraz
Modified: 2018-11-14 20:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Resetting the cnic part could cause a deadlock when the bnx2 device was enslaved in a bonding device and that bonding device had an associated VLAN.
Clone Of:
Environment:
Last Closed: 2011-01-13 20:59:13 UTC


Attachments
screen shots (2.03 MB, application/octet-stream)
2010-01-27 16:01 UTC, Oded Ramraz
Patch to fix the issue. (4.79 KB, patch)
2010-02-16 20:19 UTC, Michael Chan


Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Oded Ramraz 2010-01-12 13:11:09 UTC
Description of problem:

My host crashed while I was removing a bond interface (with VLAN tagging).
See "Additional info" for the call trace.

Version-Release number of selected component (if applicable):

Linux silver-vdsa.qa.lab.tlv.redhat.com 2.6.18-164.9.1.el5 #1 SMP Wed Dec 9 03:27:37 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:


2010-01-11 22:23:04,483 sw2: port 9(e1000_14_4) entering disabled state

2010-01-11 22:25:24,644 bonding: bond0: Removing slave eth2

2010-01-11 22:25:24,644 bonding: bond0: Warning: the permanent HWaddr of eth2 - 00:1D:09:68:71:4E - is still in use by bond0. Set the HWaddr of eth2 to a different address to avoid conflicts.

2010-01-11 22:25:24,644 bonding: bond0: releasing active interface eth2

2010-01-11 22:25:24,644 BUG: scheduling while atomic: ifdown-eth/0x00000100/21775

2010-01-11 22:25:24,644 

2010-01-11 22:25:24,644 Call Trace:

2010-01-11 22:25:24,644  [<ffffffff8006240d>] __sched_text_start+0x7d/0xbd6

2010-01-11 22:25:24,644  [<ffffffff8009fdd8>] autoremove_wake_function+0x9/0x2e

2010-01-11 22:25:24,644  [<ffffffff8008a9ae>] __wake_up_common+0x3e/0x68

2010-01-11 22:25:24,644  [<ffffffff80063137>] wait_for_completion+0x79/0xa2

2010-01-11 22:25:24,644  [<ffffffff8008c584>] default_wake_function+0x0/0xe

2010-01-11 22:25:24,644  [<ffffffff8027d66b>] klist_next+0xf/0x56

2010-01-11 22:25:24,644  [<ffffffff8009e13f>] synchronize_rcu+0x30/0x36

2010-01-11 22:25:24,644  [<ffffffff8009dc7b>] wakeme_after_rcu+0x0/0x9

2010-01-11 22:25:24,644  [<ffffffff8851beac>] :cnic:cnic_stop_hw+0x38/0xa6

2010-01-11 22:25:24,644  [<ffffffff8851da62>] :cnic:cnic_ctl+0x2b/0x50

2010-01-11 22:25:24,644  [<ffffffff881f988d>] :bnx2:bnx2_netif_stop+0x3a/0xdf

2010-01-11 22:25:24,644  [<ffffffff881fc1e2>] :bnx2:bnx2_vlan_rx_register+0x19/0x4e

2010-01-11 22:25:24,644  [<ffffffff88750942>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9

2010-01-11 22:25:24,644  [<ffffffff887527ae>] :bonding:bond_release+0x294/0x39a

2010-01-11 22:25:24,644  [<ffffffff8006457b>] __down_write_nested+0x12/0x92

2010-01-11 22:25:24,644  [<ffffffff8875a3f5>] :bonding:bonding_store_slaves+0x25c/0x2f7

2010-01-11 22:25:24,644  [<ffffffff8010ae88>] sysfs_write_file+0xb9/0xe8

2010-01-11 22:25:24,644  [<ffffffff80016942>] vfs_write+0xce/0x174

2010-01-11 22:25:24,644  [<ffffffff800171fa>] sys_write+0x45/0x6e

2010-01-11 22:25:24,644  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

2010-01-11 22:25:24,644 

2010-01-11 22:25:34,644 BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]

2010-01-11 22:25:34,644 CPU 0:
2010-01-11 22:25:34,644 

2010-01-11 22:25:34,644 Modules linked in:
2010-01-11 22:25:34,644  nfs
2010-01-11 22:25:34,644  fscache
2010-01-11 22:25:34,644  nfs_acl
2010-01-11 22:25:34,644  netconsole
2010-01-11 22:25:34,644  bonding
2010-01-11 22:25:34,644  tun
2010-01-11 22:25:34,644  autofs4
2010-01-11 22:25:34,644  hidp
2010-01-11 22:25:34,644  rfcomm
2010-01-11 22:25:34,644  l2cap
2010-01-11 22:25:34,644  bluetooth
2010-01-11 22:25:34,644  lockd
2010-01-11 22:25:34,644  sunrpc
2010-01-11 22:25:34,644  bridge
2010-01-11 22:25:34,644  ip_conntrack_netbios_ns
2010-01-11 22:25:34,644  ip_conntrack
2010-01-11 22:25:34,644  nfnetlink
2010-01-11 22:25:34,644  iptable_filter
2010-01-11 22:25:34,644  ip_tables
2010-01-11 22:25:34,644  ip6t_REJECT
2010-01-11 22:25:34,644  xt_tcpudp
2010-01-11 22:25:34,644  ip6table_filter
2010-01-11 22:25:34,644  ip6_tables
2010-01-11 22:25:34,644  x_tables
2010-01-11 22:25:34,644  ib_iser
2010-01-11 22:25:34,644  rdma_cm
2010-01-11 22:25:34,644  ib_cm
2010-01-11 22:25:34,644  iw_cm
2010-01-11 22:25:34,644  ib_sa
2010-01-11 22:25:34,644  ib_mad
2010-01-11 22:25:34,644  ib_core
2010-01-11 22:25:34,644  ib_addr
2010-01-11 22:25:34,644  iscsi_tcp
2010-01-11 22:25:34,644  bnx2i
2010-01-11 22:25:34,644  cnic
2010-01-11 22:25:34,644  ipv6
2010-01-11 22:25:34,644  xfrm_nalgo
2010-01-11 22:25:34,644  crypto_api
2010-01-11 22:25:34,644  uio
2010-01-11 22:25:34,644  cxgb3i
2010-01-11 22:25:34,644  cxgb3
2010-01-11 22:25:34,644  8021q
2010-01-11 22:25:34,644  libiscsi_tcp
2010-01-11 22:25:34,644  libiscsi2
2010-01-11 22:25:34,644  scsi_transport_iscsi2
2010-01-11 22:25:34,644  scsi_transport_iscsi
2010-01-11 22:25:34,644  dm_round_robin
2010-01-11 22:25:34,644  dm_multipath
2010-01-11 22:25:34,644  scsi_dh
2010-01-11 22:25:34,644  video
2010-01-11 22:25:34,644  hwmon
2010-01-11 22:25:34,644  backlight
2010-01-11 22:25:34,644  sbs
2010-01-11 22:25:34,644  i2c_ec
2010-01-11 22:25:34,644  i2c_core
2010-01-11 22:25:34,644  button
2010-01-11 22:25:34,644  battery
2010-01-11 22:25:34,644  asus_acpi
2010-01-11 22:25:34,644  acpi_memhotplug
2010-01-11 22:25:34,644  ac
2010-01-11 22:25:34,644  parport_pc
2010-01-11 22:25:34,644  lp
2010-01-11 22:25:34,644  parport
2010-01-11 22:25:34,644  ksm(U)
2010-01-11 22:25:34,644  kvm_intel(U)
2010-01-11 22:25:34,660  kvm(U)
2010-01-11 22:25:34,660  sg
2010-01-11 22:25:34,660  ide_cd
2010-01-11 22:25:34,660  i5000_edac
2010-01-11 22:25:34,660  serio_raw
2010-01-11 22:25:34,660  edac_mc
2010-01-11 22:25:34,660  e1000e
2010-01-11 22:25:34,660  cdrom
2010-01-11 22:25:34,660  bnx2
2010-01-11 22:25:34,660  pcspkr
2010-01-11 22:25:34,660  dm_raid45
2010-01-11 22:25:34,660  dm_message
2010-01-11 22:25:34,660  dm_region_hash
2010-01-11 22:25:34,660  dm_mem_cache
2010-01-11 22:25:34,660  dm_snapshot
2010-01-11 22:25:34,660  dm_zero
2010-01-11 22:25:34,660  dm_mirror
2010-01-11 22:25:34,660  dm_log
2010-01-11 22:25:34,660  dm_mod
2010-01-11 22:25:34,660  ata_piix
2010-01-11 22:25:34,660  libata
2010-01-11 22:25:34,660  shpchp
2010-01-11 22:25:34,660  mptsas
2010-01-11 22:25:34,660  mptscsih
2010-01-11 22:25:34,660  mptbase
2010-01-11 22:25:34,660  scsi_transport_sas
2010-01-11 22:25:34,660  sd_mod
2010-01-11 22:25:34,660  scsi_mod
2010-01-11 22:25:34,660  ext3
2010-01-11 22:25:34,660  jbd
2010-01-11 22:25:34,660  uhci_hcd
2010-01-11 22:25:34,660  ohci_hcd
2010-01-11 22:25:34,660  ehci_hcd
2010-01-11 22:25:34,660 

2010-01-11 22:25:34,660 Pid: 0, comm: swapper Tainted: G      2.6.18-164.9.1.el5 #1

2010-01-11 22:25:34,660 RIP: 0010:[<ffffffff8006216d>] 
2010-01-11 22:25:34,660  [<ffffffff8006216d>] __read_lock_failed+0x5/0x14

2010-01-11 22:25:34,660 RSP: 0018:ffffffff8043dc10  EFLAGS: 00000297

2010-01-11 22:25:34,660 RAX: ffffffff804efa50 RBX: 0000000000000001 RCX: ffff81022a938000

2010-01-11 22:25:34,660 RDX: ffff810225ee3678 RSI: ffff810225ee3000 RDI: ffff810225ee352c

2010-01-11 22:25:34,660 RBP: ffffffff8043db90 R08: 0000000000000000 R09: ffff81022f43e070

Comment 2 Oded Ramraz 2010-01-26 08:49:41 UTC
I managed to reproduce this issue on the RHEL 5.5 kernel:
[root@silver-vdsb ~]# uname -a
Linux silver-vdsb.qa.lab.tlv.redhat.com 2.6.18-183.el5 #1 SMP Mon Dec 21 18:37:42 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

Network adapters information:

[root@silver-vdsb ~]# ethtool -i eth2
driver: bnx2
version: 2.0.2
firmware-version: 3.5.12 UMP 1.1.8
bus-info: 0000:03:00.0
[root@silver-vdsb ~]# ethtool -i eth3
driver: bnx2
version: 2.0.2
firmware-version: 3.5.12 UMP 1.1.8
bus-info: 0000:07:00.0

Comment 4 Andy Gospodarek 2010-01-26 14:29:28 UTC
This looks to be specific to the use of the bnx2i and cnic drivers as well.

Can you provide some more details about your configuration so we can try and reproduce it?  Information like the bonding mode, vlan configuration, as well as any iscsi usage by the bnx2 devices.

It would also be nice to know if the use of vlans is important to reproduce this failure.

Comment 5 Andy Gospodarek 2010-01-26 14:38:25 UTC
Mike, do you have the equipment to set something like this up?

Comment 6 Oded Ramraz 2010-01-26 17:42:22 UTC
I just reproduced this bug with the following bnx2 adapters:

[root@silver-vdsa ~]# ethtool -i eth2
driver: bnx2
version: 2.0.2
firmware-version: 3.5.12 ipms 1.6.0
bus-info: 0000:03:00.0
[root@silver-vdsa ~]# ethtool -i eth3
driver: bnx2
version: 2.0.2
firmware-version: 3.5.12 ipms 1.6.0
bus-info: 0000:07:00.0
[root@silver-vdsa ~]#


Bonding mode: mode 4 (802.3ad)
Bridge configuration:

[root@silver-vdsa ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
rhevm           8000.001517a76a4c       no              eth0
sw1             8000.001517a76a4d       no              eth1
sw2             8000.001d09687150       no              bond0.162
[root@silver-vdsa ~]#  

No iSCSI usage by the bnx2 devices.
The error occurred when I tried to remove the bond interface (with the VLAN tag).

Comment 7 Oded Ramraz 2010-01-26 17:46:44 UTC
The bond interface is built on NICs eth2 and eth3 and sits on bridge sw2.
The VLAN tag is 162.

Comment 8 Andy Gospodarek 2010-01-26 18:27:37 UTC
I didn't see anything about a bridge configuration earlier. I will put the bond in a bridge and see if I can make this fail. Were you using initscripts to get it into the bridge, or doing that manually after the bond0.162 interface was up?

Comment 9 Andy Gospodarek 2010-01-26 20:49:45 UTC
I still cannot reproduce this without loading the cnic and bnx2i modules.  I'm not at all surprised based on the backtrace.  Here is the important info from my config.

::::::::::::::
ifcfg-bond0
::::::::::::::
DEVICE=bond0
BOOTPROTO=none
ONBOOT=no
BONDING_OPTS="mode=4 miimon=100"
::::::::::::::
ifcfg-bond0.100
::::::::::::::
DEVICE=bond0.100
BOOTPROTO=none
ONBOOT=yes
::::::::::::::
ifcfg-eth2
::::::::::::::
DEVICE=eth2
ONBOOT=yes
HWADDR=00:10:18:36:0a:d4
MASTER=bond0
SLAVE=yes
::::::::::::::
ifcfg-eth3
::::::::::::::
DEVICE=eth3
ONBOOT=yes
HWADDR=00:10:18:36:0a:d6
MASTER=bond0
SLAVE=yes


# ifup bond0
# ifup bond0.100 
Added VLAN with VID == 100 to IF -:bond0:-
# brctl addbr br0
# brctl addif br0 bond0.100
# brctl delif br0 bond0.100
# ifdown bond0 
bonding: bond0: Warning: the permanent HWaddr of eth2 - 00:10:18:36:0A:D4 - is still in use by bond0. Set the HWaddr of eth2 to a diffe.
bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond

# ifup bond0
# ifup bond0.100 
Added VLAN with VID == 100 to IF -:bond0:-
# brctl addbr br1 
# brctl addif br1 bond0.100 
# ifdown bond0 
bonding: bond0: Warning: the permanent HWaddr of eth2 - 00:10:18:36:0A:D4 - is still in use by bond0. Set the HWaddr of eth2 to a diffe.
bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
bonding: bond0: Warning: clearing HW address of bond0 while it still has VLANs.
bonding: bond0: When re-adding slaves, make sure the bond's HW address matches its VLANs'.
# ifdown bond0.100 
Removed VLAN -:bond0.100:-

# ifup bond0
# ifup bond0.100 
Added VLAN with VID == 100 to IF -:bond0:-
# brctl addbr br2 
# brctl addif br2 bond0.100
# rmmod bonding
bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
bonding: bond0: Warning: clearing HW address of bond0 while it still has VLANs.
bonding: bond0: When re-adding slaves, make sure the bond's HW address matches its VLANs'.

# uname -a 
Linux xw4400 2.6.18-185.el5 #1 SMP Thu Jan 14 16:44:40 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

# ethtool -i eth2
driver: bnx2
version: 2.0.2
firmware-version: 4.4.14
bus-info: 0000:10:00.0
# ethtool -i eth3
driver: bnx2
version: 2.0.2
firmware-version: 4.4.14
bus-info: 0000:10:00.1

None of these produced the hang or deadlock described in this bug until I loaded cnic and bnx2i with insmod.

I set everything up as I did before and then did this:

# rmmod bonding 
bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
BUG: scheduling while atomic: rmmod/0x00000100/7310

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff80093131>] vprintk+0x2cb/0x317
 [<ffffffff80064167>] wait_for_completion+0x79/0xa2
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff80283971>] klist_next+0xf/0x56
 [<ffffffff8009faa6>] synchronize_rcu+0x30/0x36
 [<ffffffff8009f5e2>] wakeme_after_rcu+0x0/0x9
 [<ffffffff88674d74>] :cnic:cnic_stop_hw+0x38/0xa6
 [<ffffffff88679a2e>] :cnic:cnic_ctl+0x35/0xac
 [<ffffffff8822588f>] :bnx2:bnx2_netif_stop+0x3a/0xea
 [<ffffffff88229be5>] :bnx2:bnx2_vlan_rx_register+0x20/0x61
 [<ffffffff887b4a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff887b6728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff887b68c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff887bff88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

BUG: scheduling while atomic: rmmod/0x00000100/7310

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff8003de05>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001cc14>] __mod_timer+0x100/0x10f
 [<ffffffff800648ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80098a2a>] process_timeout+0x0/0x5
 [<ffffffff800990f3>] msleep+0x21/0x2c
 [<ffffffff887a0332>] :bnx2i:bnx2i_start+0x1f/0x32
 [<ffffffff88674a18>] :cnic:cnic_ulp_start+0x6b/0x87
 [<ffffffff88679a48>] :cnic:cnic_ctl+0x4f/0xac
 [<ffffffff88221480>] :bnx2:bnx2_netif_start+0xab/0xbe
 [<ffffffff88221641>] :bnx2:bnx2_fw_sync+0x34/0xc8
 [<ffffffff88229c15>] :bnx2:bnx2_vlan_rx_register+0x50/0x61
 [<ffffffff887b4a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff887b6728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff887b68c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff887bff88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

BUG: scheduling while atomic: rmmod/0x00000100/7310

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff800a1746>] autoremove_wake_function+0x9/0x2e
 [<ffffffff8008c137>] __wake_up_common+0x3e/0x68
 [<ffffffff80064167>] wait_for_completion+0x79/0xa2
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff80283971>] klist_next+0xf/0x56
 [<ffffffff8009faa6>] synchronize_rcu+0x30/0x36
 [<ffffffff8009f5e2>] wakeme_after_rcu+0x0/0x9
 [<ffffffff88674d74>] :cnic:cnic_stop_hw+0x38/0xa6
 [<ffffffff88679a2e>] :cnic:cnic_ctl+0x35/0xac
 [<ffffffff8822588f>] :bnx2:bnx2_netif_stop+0x3a/0xea
 [<ffffffff88229be5>] :bnx2:bnx2_vlan_rx_register+0x20/0x61
 [<ffffffff887b4a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff887b6728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff887b68c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff887bff88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

BUG: scheduling while atomic: rmmod/0x00000100/7310

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff8003de05>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001cc14>] __mod_timer+0x100/0x10f
 [<ffffffff800648ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80098a2a>] process_timeout+0x0/0x5
 [<ffffffff800990f3>] msleep+0x21/0x2c
 [<ffffffff887a0332>] :bnx2i:bnx2i_start+0x1f/0x32
 [<ffffffff88674a18>] :cnic:cnic_ulp_start+0x6b/0x87
 [<ffffffff88679a48>] :cnic:cnic_ctl+0x4f/0xac
 [<ffffffff88221480>] :bnx2:bnx2_netif_start+0xab/0xbe
 [<ffffffff88221641>] :bnx2:bnx2_fw_sync+0x34/0xc8
 [<ffffffff88229c15>] :bnx2:bnx2_vlan_rx_register+0x50/0x61
 [<ffffffff887b4a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff887b6728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff887b68c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff887bff88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

bonding: bond0: Warning: clearing HW address of bond0 while it still has VLANs.
bonding: bond0: When re-adding slaves, make sure the bond's HW address matches its VLANs'.

It's not a panic though -- just ugly noise on the console.
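
To make the module involvement in these traces easier to see, here is a minimal sketch (not part of the original report) that pulls the loadable-module frames out of a RHEL 5 style call trace. It assumes the `:module:symbol+0xoff/0xsize` frame format shown above; the helper names `module_frames` and `modules_in_order` are hypothetical.

```python
import re

# RHEL 5 (2.6.18) trace format: " [<address>] :module:symbol+0xoff/0xsize".
# Frames without a ":module:" prefix belong to the core kernel.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def module_frames(trace_text):
    """Return (module, symbol) pairs for frames that live in a loadable module."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group("module"):
            frames.append((match.group("module"), match.group("symbol")))
    return frames


def modules_in_order(trace_text):
    """List each module once, innermost frame first."""
    seen = []
    for module, _symbol in module_frames(trace_text):
        if module not in seen:
            seen.append(module)
    return seen


# Frames copied from the first rmmod trace above.
SAMPLE = """\
 [<ffffffff8009faa6>] synchronize_rcu+0x30/0x36
 [<ffffffff88674d74>] :cnic:cnic_stop_hw+0x38/0xa6
 [<ffffffff88679a2e>] :cnic:cnic_ctl+0x35/0xac
 [<ffffffff8822588f>] :bnx2:bnx2_netif_stop+0x3a/0xea
 [<ffffffff88229be5>] :bnx2:bnx2_vlan_rx_register+0x20/0x61
 [<ffffffff887b4a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
"""

if __name__ == "__main__":
    print(modules_in_order(SAMPLE))
```

Read from the innermost frame outward, the sample shows cnic sleeping in `synchronize_rcu`, called from bnx2, called from bonding, which matches the observation that the problem needs cnic loaded.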

Comment 10 Andy Gospodarek 2010-01-26 21:08:57 UTC
Now that I can reproduce this I was able to scale it down:

- The bnx2i driver does not even need to be loaded (only cnic).
- The vlan interface doesn't need to be in the bridge.
- This reproduces with active-backup bonding too (no fancy switch needed).

The 'scheduling while atomic' messages are ugly, but not showstoppers.  The fact that I can deadlock the system is bad.
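
As a side note on reading the 'scheduling while atomic' lines: the hex value in each message is the task's preempt count. The sketch below (not part of the original report) decodes it under an assumed bit layout: preempt depth in bits 0-7, softirq/bottom-half count in bits 8-15, hardirq-related bits above that, and `PREEMPT_ACTIVE` at 0x10000000 on x86. The exact hardirq width differs between kernel versions, so that region is reported as one field, and the function names are hypothetical.

```python
import re

# Assumed preempt_count layout (see the note above; details vary by kernel).
PREEMPT_MASK = 0x000000FF     # nested preempt-disable depth
SOFTIRQ_MASK = 0x0000FF00     # softirq / bottom-half-disable count
SOFTIRQ_SHIFT = 8
IRQ_REGION_MASK = 0x0FFF0000  # hardirq (and NMI on newer kernels) bits
IRQ_REGION_SHIFT = 16
PREEMPT_ACTIVE = 0x10000000   # x86 value, assumed for both kernels here


def preempt_count_from_bug_line(line):
    """Extract the hex preempt count from a 'scheduling while atomic' line.

    2.6.18 prints comm/count/pid, 2.6.32 prints comm/pid/count, so the
    first '/0x...' token after the message text is taken in both cases.
    """
    match = re.search(r"scheduling while atomic: .*?/(0x[0-9a-fA-F]+)", line)
    if match is None:
        raise ValueError("not a 'scheduling while atomic' line")
    return int(match.group(1), 16)


def decode_preempt_count(value):
    """Split a preempt count into the fields of the assumed layout."""
    return {
        "preempt_depth": value & PREEMPT_MASK,
        "softirq_count": (value & SOFTIRQ_MASK) >> SOFTIRQ_SHIFT,
        "irq_region": (value & IRQ_REGION_MASK) >> IRQ_REGION_SHIFT,
        "preempt_active": bool(value & PREEMPT_ACTIVE),
    }


if __name__ == "__main__":
    for line in (
        "BUG: scheduling while atomic: ifdown-eth/0x00000100/21775",
        "BUG: scheduling while atomic: rmmod/4063/0x10000100",
    ):
        print(decode_preempt_count(preempt_count_from_bug_line(line)))
```

Under that layout, 0x00000100 decodes to a softirq count of 1 with everything else zero, which is consistent with the task sleeping while bottom halves are disabled (for example, inside a `_bh` lock section). That is an interpretation, not something the messages state directly.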

Use config files like these:

::::::::::::::
ifcfg-bond0
::::::::::::::
DEVICE=bond0
BOOTPROTO=none
ONBOOT=no
BONDING_OPTS="mode=1 miimon=100"
::::::::::::::
ifcfg-bond0.100
::::::::::::::
DEVICE=bond0.100
BOOTPROTO=none
ONBOOT=yes
::::::::::::::
ifcfg-eth2
::::::::::::::
DEVICE=eth2
ONBOOT=yes
HWADDR=00:10:18:36:0a:d4
MASTER=bond0
SLAVE=yes
::::::::::::::
ifcfg-eth3
::::::::::::::
DEVICE=eth3
ONBOOT=yes
HWADDR=00:10:18:36:0a:d6
MASTER=bond0
SLAVE=yes

(obviously with different MAC addresses) and type these commands:

# ifup bond0
# ifup bond0.100
# rmmod bonding 
BUG: scheduling while atomic: rmmod/0x00000100/8590

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff80150c87>] __next_cpu+0x19/0x28
 [<ffffffff8008c850>] find_busiest_group+0x20d/0x621
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff80064167>] wait_for_completion+0x79/0xa2
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff8009faa6>] synchronize_rcu+0x30/0x36
 [<ffffffff8009f5e2>] wakeme_after_rcu+0x0/0x9
 [<ffffffff88674d74>] :cnic:cnic_stop_hw+0x38/0xa6
 [<ffffffff88679a2e>] :cnic:cnic_ctl+0x35/0xac
 [<ffffffff8822588f>] :bnx2:bnx2_netif_stop+0x3a/0xea
 [<ffffffff88229be5>] :bnx2:bnx2_vlan_rx_register+0x20/0x61
 [<ffffffff887a0a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff887a2728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff887a28c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff887abf88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

BUG: soft lockup - CPU#0 stuck for 10s! [avahi-daemon:3183]
CPU 0:
Modules linked in: bonding libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic uio ipt_MASQUERADE iptable_nat ip_nat bridge autofd
Pid: 3183, comm: avahi-daemon Not tainted 2.6.18-185.el5 #1
RIP: 0010:[<ffffffff8006319d>]  [<ffffffff8006319d>] __read_lock_failed+0x5/0x14
RSP: 0018:ffff81002f99b918  EFLAGS: 00000297
RAX: 0000000000000056 RBX: ffff81002cea2280 RCX: ffff81002cae7980
RDX: ffffffff80350500 RSI: ffff81002d03d000 RDI: ffff81002d03d530
RBP: ffff81002f99b860 R08: ffff810038ddb2e0 R09: ffff81002cea2080
R10: ffff8100380c1ea8 R11: 000000408009f78d R12: ffffffff8027edd2
R13: d40a36feff181002 R14: 00000000000080fe R15: fb00000000000000
FS:  00002b74b9604ff0(0000) GS:ffffffff803c9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003a7eed3280 CR3: 000000002f993000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80065b75>] _read_lock+0xb/0xc
 [<ffffffff887a4bb1>] :bonding:bond_xmit_activebackup+0x19/0x6c
 [<ffffffff88556acd>] :ipv6:ip6_output_finish+0x0/0xf8
 [<ffffffff8022f5e2>] dev_hard_start_xmit+0x1b7/0x28a
 [<ffffffff8002fcbd>] dev_queue_xmit+0x1c5/0x271
 [<ffffffff88556f28>] :ipv6:ip6_output2+0x2cb/0x33d
 [<ffffffff88557dd1>] :ipv6:ip6_output+0xbbe/0xbe2
 [<ffffffff80056f3d>] nf_hook_slow+0x58/0xbc
 [<ffffffff88556730>] :ipv6:dst_output+0x0/0xe
 [<ffffffff8855827b>] :ipv6:ip6_push_pending_frames+0x486/0x55f
 [<ffffffff8856b26f>] :ipv6:udp_v6_push_pending_frames+0x123/0x145
 [<ffffffff8856ca42>] :ipv6:udpv6_sendmsg+0x68c/0x8e0
 [<ffffffff8012a6c8>] avc_has_perm+0x46/0x58
 [<ffffffff800556ac>] sock_sendmsg+0xf8/0x14a
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff800a173d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff8012a6c8>] avc_has_perm+0x46/0x58
 [<ffffffff802266a2>] sys_sendmsg+0x217/0x28a
 [<ffffffff8000e2e9>] current_fs_time+0x3b/0x40
 [<ffffffff8002e586>] __wake_up+0x38/0x4f
 [<ffffffff8002a2ac>] file_update_time+0x30/0xdb
 [<ffffffff80029f36>] pipe_writev+0x448/0x4b0
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

BUG: soft lockup - CPU#1 stuck for 10s! [swapper:0]
CPU 1:
Modules linked in: bonding libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic uio ipt_MASQUERADE iptable_nat ip_nat bridge autofd
Pid: 0, comm: swapper Not tainted 2.6.18-185.el5 #1
RIP: 0010:[<ffffffff8006319d>]  [<ffffffff8006319d>] __read_lock_failed+0x5/0x14
RSP: 0018:ffff81000176fcf0  EFLAGS: 00000297
RAX: 0000000000000056 RBX: ffff81002c988dc0 RCX: ffff81002fb7c280
RDX: ffffffff80350500 RSI: ffff81002d03d000 RDI: ffff81002d03d530
RBP: ffff81000176fc70 R08: ffff81002d03d000 R09: 0000000000000038
R10: 0000000080000000 R11: ffffffff8002faf8 R12: ffffffff8005ec8e
R13: ffff81002d03d500 R14: ffffffff80078f50 R15: ffff81000176fc70
FS:  0000000000000000(0000) GS:ffff8100017437c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b64f6d430a0 CR3: 000000002fb44000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff80065b75>] _read_lock+0xb/0xc
 [<ffffffff887a4bb1>] :bonding:bond_xmit_activebackup+0x19/0x6c
 [<ffffffff8022f5e2>] dev_hard_start_xmit+0x1b7/0x28a
 [<ffffffff8002fcbd>] dev_queue_xmit+0x1c5/0x271
 [<ffffffff88556f3c>] :ipv6:ip6_output2+0x2df/0x33d
 [<ffffffff88557dd1>] :ipv6:ip6_output+0xbbe/0xbe2
 [<ffffffff80056f3d>] nf_hook_slow+0x58/0xbc
 [<ffffffff88567af0>] :ipv6:dst_output+0x0/0xe
 [<ffffffff8856a790>] :ipv6:ndisc_send_rs+0x3de/0x505
 [<ffffffff8855f4b4>] :ipv6:addrconf_rs_timer+0x0/0xe2
 [<ffffffff8855f55f>] :ipv6:addrconf_rs_timer+0xab/0xe2
 [<ffffffff8009880f>] run_timer_softirq+0x193/0x241
 [<ffffffff80012388>] __do_softirq+0x89/0x133
 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006dba8>] do_softirq+0x2c/0x85
 [<ffffffff800575ff>] mwait_idle+0x0/0x4a
 [<ffffffff8005ec8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80057635>] mwait_idle+0x36/0x4a
 [<ffffffff800497ef>] cpu_idle+0x95/0xb8
 [<ffffffff800786bc>] start_secondary+0x495/0x4a4

Comment 11 Andy Gospodarek 2010-01-26 21:50:07 UTC
Problem exists on at least 2.6.32-rc8 as well:

[root@xw4400 ~]# modprobe cnic 
[root@xw4400 ~]# rmmod bonding 
BUG: sleeping function called from invalid context at kernel/mutex.c:280
in_atomic(): 1, irqs_disabled(): 0, pid: 4063, name: rmmod
2 locks held by rmmod/4063:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff812aac4b>] rtnl_lock+0x12/0x14
 #1:  (&bond->lock){++.?..}, at: [<ffffffffa049397a>] bond_del_vlans_from_slave+0x2e/0x109 [bonding]
Pid: 4063, comm: rmmod Not tainted 2.6.32-rc8 #204
Call Trace:
 [<ffffffff810690bf>] ? __debug_show_held_locks+0x22/0x24
 [<ffffffff81036908>] __might_sleep+0xe9/0xee
 [<ffffffff81330044>] mutex_lock_nested+0x32/0x2b8
 [<ffffffff810608bf>] ? sched_clock_cpu+0xbc/0xc7
 [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
 [<ffffffffa049397a>] ? bond_del_vlans_from_slave+0x2e/0x109 [bonding]
 [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
 [<ffffffffa049397a>] ? bond_del_vlans_from_slave+0x2e/0x109 [bonding]
 [<ffffffffa0493a3d>] bond_del_vlans_from_slave+0xf1/0x109 [bonding]
 [<ffffffffa0494d04>] bond_release_all+0xc6/0x214 [bonding]
 [<ffffffff8104f457>] ? del_timer_sync+0x0/0x84
 [<ffffffffa0494e80>] bond_free_all+0x2e/0x84 [bonding]
 [<ffffffffa049e864>] bonding_exit+0x30/0x37 [bonding]
 [<ffffffff81076105>] sys_delete_module+0x1b3/0x222
 [<ffffffff81069cd5>] ? trace_hardirqs_on_caller+0x113/0x13e
 [<ffffffff81084a52>] ? audit_syscall_entry+0x1bb/0x1ee
 [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
BUG: scheduling while atomic: rmmod/4063/0x10000100
2 locks held by rmmod/4063:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff812aac4b>] rtnl_lock+0x12/0x14
 #1:  (&bond->lock){++.?..}, at: [<ffffffffa049397a>] bond_del_vlans_from_slave+0x2e/0x109 [bonding]
Modules linked in: cnic uio ipt_REJECT bridge stp autofs4 i2c_dev i2c_core hidp rfcomm l2cap crc16 bluetooth rfkill sunrpc bonding(-) 8]
Pid: 4063, comm: rmmod Not tainted 2.6.32-rc8 #204
Call Trace:
 [<ffffffff810690bf>] ? __debug_show_held_locks+0x22/0x24
 [<ffffffff8103b4a1>] __schedule_bug+0x6d/0x72
 [<ffffffff8132ece2>] schedule+0x86/0x91e
 [<ffffffff8100f247>] ? show_trace+0x10/0x12
 [<ffffffff81040a2d>] __cond_resched+0x25/0x30
 [<ffffffff8132f77a>] _cond_resched+0x24/0x2f
 [<ffffffff81330049>] mutex_lock_nested+0x37/0x2b8
 [<ffffffff810608bf>] ? sched_clock_cpu+0xbc/0xc7
 [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
 [<ffffffffa049397a>] ? bond_del_vlans_from_slave+0x2e/0x109 [bonding]
 [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
 [<ffffffffa049397a>] ? bond_del_vlans_from_slave+0x2e/0x109 [bonding]
 [<ffffffffa0493a3d>] bond_del_vlans_from_slave+0xf1/0x109 [bonding]
 [<ffffffffa0494d04>] bond_release_all+0xc6/0x214 [bonding]
 [<ffffffff8104f457>] ? del_timer_sync+0x0/0x84
 [<ffffffffa0494e80>] bond_free_all+0x2e/0x84 [bonding]
 [<ffffffffa049e864>] bonding_exit+0x30/0x37 [bonding]
 [<ffffffff81076105>] sys_delete_module+0x1b3/0x222
 [<ffffffff81069cd5>] ? trace_hardirqs_on_caller+0x113/0x13e
 [<ffffffff81084a52>] ? audit_syscall_entry+0x1bb/0x1ee
 [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b

======================================================
[ INFO: SOFTIRQ-READ-safe -> SOFTIRQ-READ-unsafe lock order detected ]
2.6.32-rc8 #204
------------------------------------------------------
rmmod/4063 [HC0[0]:SC0[1]:HE1:SE0] is trying to acquire:
 (&bp->cnic_lock){+.+...}, at: [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]

and this task is already holding:
 (&bond->lock){++.?..}, at: [<ffffffffa049397a>] bond_del_vlans_from_slave+0x2e/0x109 [bonding]
which would create a new lock dependency:
 (&bond->lock){++.?..} -> (&bp->cnic_lock){+.+...}

but this new dependency connects a SOFTIRQ-READ-irq-safe lock:
 (&bond->lock){++.?..}
... which became SOFTIRQ-READ-irq-safe at:
  [<ffffffff8106d378>] __lock_acquire+0x5fa/0x816
  [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
  [<ffffffff81331f66>] _read_lock+0x34/0x69
  [<ffffffffa0497437>] bond_start_xmit+0xed/0x37c [bonding]
  [<ffffffff812a050c>] dev_hard_start_xmit+0x260/0x316
  [<ffffffff812a37f7>] dev_queue_xmit+0x2e0/0x3e9
  [<ffffffff812a93ca>] neigh_resolve_output+0x2b7/0x2ec
  [<ffffffffa03fb40d>] ip6_output_finish+0x6f/0xd6 [ipv6]
  [<ffffffffa03fb9ab>] ip6_output2+0x271/0x27c [ipv6]
  [<ffffffffa03fc8aa>] ip6_output+0xd20/0xd45 [ipv6]
  [<ffffffffa04160a2>] mld_sendpack+0x29d/0x495 [ipv6]
  [<ffffffffa0417361>] mld_ifc_timer_expire+0x1d6/0x20f [ipv6]
  [<ffffffff8104f18b>] run_timer_softirq+0x1d0/0x284
  [<ffffffff81049961>] __do_softirq+0xdb/0x1ab
  [<ffffffff8100cb5c>] call_softirq+0x1c/0x34
  [<ffffffff8100e1b3>] do_softirq+0x38/0x85
  [<ffffffff81049884>] irq_exit+0x45/0x47
  [<ffffffff81020ed4>] smp_apic_timer_interrupt+0x89/0x99
  [<ffffffff8100c533>] apic_timer_interrupt+0x13/0x20

to a SOFTIRQ-READ-irq-unsafe lock:
 (&bp->cnic_lock){+.+...}
... which became SOFTIRQ-READ-irq-unsafe at:
...  [<ffffffff8106d3e6>] __lock_acquire+0x668/0x816
  [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
  [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
  [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
  [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
  [<ffffffffa0496a30>] bond_vlan_rx_register+0x48/0x5f [bonding]
  [<ffffffffa04874e1>] register_vlan_dev+0x216/0x295 [8021q]
  [<ffffffffa0487df8>] vlan_ioctl_handler+0x36b/0x403 [8021q]
  [<ffffffff81292b53>] sock_ioctl+0x198/0x231
  [<ffffffff810ebb08>] vfs_ioctl+0x2a/0x77
  [<ffffffff810ec051>] do_vfs_ioctl+0x484/0x4d5
  [<ffffffff810ec0f9>] sys_ioctl+0x57/0x7a
  [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

2 locks held by rmmod/4063:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff812aac4b>] rtnl_lock+0x12/0x14
 #1:  (&bond->lock){++.?..}, at: [<ffffffffa049397a>] bond_del_vlans_from_slave+0x2e/0x109 [bonding]

the dependencies between SOFTIRQ-READ-irq-safe lock and the holding lock:
-> (&bond->lock){++.?..} ops: 146 {
   HARDIRQ-ON-W at:
                        [<ffffffff8106d3c4>] __lock_acquire+0x646/0x816
                        [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                        [<ffffffff81331c84>] _write_lock_bh+0x36/0x6b
                        [<ffffffffa04982ab>] bond_close+0x50/0x131 [bonding]
                        [<ffffffff812a10ed>] dev_close+0x81/0x9c
                        [<ffffffff812a0ace>] dev_change_flags+0xa8/0x168
                        [<ffffffff812ebaac>] devinet_ioctl+0x269/0x5da
                        [<ffffffff812ecc22>] inet_ioctl+0x8a/0xa2
                        [<ffffffff81292bc3>] sock_ioctl+0x208/0x231
                        [<ffffffff810ebb08>] vfs_ioctl+0x2a/0x77
                        [<ffffffff810ec051>] do_vfs_ioctl+0x484/0x4d5
                        [<ffffffff810ec0f9>] sys_ioctl+0x57/0x7a
                        [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
   HARDIRQ-ON-R at:
                        [<ffffffff8106d39b>] __lock_acquire+0x61d/0x816
                        [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                        [<ffffffff81331e18>] _read_lock_bh+0x39/0x6c
                        [<ffffffffa0496d7f>] bond_get_stats+0x4a/0x17d [bonding]
                        [<ffffffff8129d56e>] dev_get_stats+0x19/0x7d
                        [<ffffffff812aa556>] rtnl_fill_ifinfo+0x302/0x553
                        [<ffffffff812aaa61>] rtmsg_ifinfo+0x66/0xca
                        [<ffffffff812aab05>] rtnetlink_event+0x40/0x44
                        [<ffffffff813343ad>] notifier_call_chain+0x33/0x5b
                        [<ffffffff8105fcfd>] __raw_notifier_call_chain+0x9/0xb
                        [<ffffffff8105fd0e>] raw_notifier_call_chain+0xf/0x11
                        [<ffffffff812a0797>] call_netdevice_notifiers+0x16/0x18
                        [<ffffffff812a198a>] register_netdevice+0x2a9/0x2f5
                        [<ffffffffa0494065>] bond_create+0xa8/0xf1 [bonding]
                        [<ffffffffa04aa7c0>] 0xffffffffa04aa7c0
                        [<ffffffff81009060>] do_one_initcall+0x5a/0x14f
                        [<ffffffff81078dfa>] sys_init_module+0xcd/0x22b
                        [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
   IN-SOFTIRQ-R at:
                        [<ffffffff8106d378>] __lock_acquire+0x5fa/0x816
                        [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                        [<ffffffff81331f66>] _read_lock+0x34/0x69
                        [<ffffffffa0497437>] bond_start_xmit+0xed/0x37c [bonding]
                        [<ffffffff812a050c>] dev_hard_start_xmit+0x260/0x316
                        [<ffffffff812a37f7>] dev_queue_xmit+0x2e0/0x3e9
                        [<ffffffff812a93ca>] neigh_resolve_output+0x2b7/0x2ec
                        [<ffffffffa03fb40d>] ip6_output_finish+0x6f/0xd6 [ipv6]
                        [<ffffffffa03fb9ab>] ip6_output2+0x271/0x27c [ipv6]
                        [<ffffffffa03fc8aa>] ip6_output+0xd20/0xd45 [ipv6]
                        [<ffffffffa04160a2>] mld_sendpack+0x29d/0x495 [ipv6]
                        [<ffffffffa0417361>] mld_ifc_timer_expire+0x1d6/0x20f [ipv6]
                        [<ffffffff8104f18b>] run_timer_softirq+0x1d0/0x284
                        [<ffffffff81049961>] __do_softirq+0xdb/0x1ab
                        [<ffffffff8100cb5c>] call_softirq+0x1c/0x34
                        [<ffffffff8100e1b3>] do_softirq+0x38/0x85
                        [<ffffffff81049884>] irq_exit+0x45/0x47
                        [<ffffffff81020ed4>] smp_apic_timer_interrupt+0x89/0x99
                        [<ffffffff8100c533>] apic_timer_interrupt+0x13/0x20
   SOFTIRQ-ON-R at:
                        [<ffffffff8106d3e6>] __lock_acquire+0x668/0x816
                        [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                        [<ffffffff81331f66>] _read_lock+0x34/0x69
                        [<ffffffffa0495570>] bond_mii_monitor+0x27/0x4d8 [bonding]
                        [<ffffffff81058dbe>] worker_thread+0x1af/0x2ae
                        [<ffffffff8105c5a4>] kthread+0x7d/0x85
                        [<ffffffff8100ca5a>] child_rip+0xa/0x20
   INITIAL USE at:
                       [<ffffffff8106d431>] __lock_acquire+0x6b3/0x816
                       [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                       [<ffffffff81331e18>] _read_lock_bh+0x39/0x6c
                       [<ffffffffa0496d7f>] bond_get_stats+0x4a/0x17d [bonding]
                       [<ffffffff8129d56e>] dev_get_stats+0x19/0x7d
                       [<ffffffff812aa556>] rtnl_fill_ifinfo+0x302/0x553
                       [<ffffffff812aaa61>] rtmsg_ifinfo+0x66/0xca
                       [<ffffffff812aab05>] rtnetlink_event+0x40/0x44
                       [<ffffffff813343ad>] notifier_call_chain+0x33/0x5b
                       [<ffffffff8105fcfd>] __raw_notifier_call_chain+0x9/0xb
                       [<ffffffff8105fd0e>] raw_notifier_call_chain+0xf/0x11
                       [<ffffffff812a0797>] call_netdevice_notifiers+0x16/0x18
                       [<ffffffff812a198a>] register_netdevice+0x2a9/0x2f5
                       [<ffffffffa0494065>] bond_create+0xa8/0xf1 [bonding]
                       [<ffffffffa04aa7c0>] 0xffffffffa04aa7c0
                       [<ffffffff81009060>] do_one_initcall+0x5a/0x14f
                       [<ffffffff81078dfa>] sys_init_module+0xcd/0x22b
                       [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
 }
 ... key      at: [<ffffffffa04a3a98>] __key.43926+0x0/0xffffffffffffadd3 [bonding]
 ... acquired at:
   [<ffffffff8106c02e>] check_irq_usage+0xb3/0xc5
   [<ffffffff8106c7b6>] validate_chain+0x776/0xd3e
   [<ffffffff8106d52e>] __lock_acquire+0x7b0/0x816
   [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
   [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
   [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
   [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
   [<ffffffffa0493a3d>] bond_del_vlans_from_slave+0xf1/0x109 [bonding]
   [<ffffffffa0494d04>] bond_release_all+0xc6/0x214 [bonding]
   [<ffffffffa0494e80>] bond_free_all+0x2e/0x84 [bonding]
   [<ffffffffa049e864>] bonding_exit+0x30/0x37 [bonding]
   [<ffffffff81076105>] sys_delete_module+0x1b3/0x222
   [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b


the dependencies between the lock to be acquired and SOFTIRQ-READ-irq-unsafe lock:
-> (&bp->cnic_lock){+.+...} ops: 5 {
   HARDIRQ-ON-W at:
                        [<ffffffff8106d3c4>] __lock_acquire+0x646/0x816
                        [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                        [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
                        [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
                        [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
                        [<ffffffffa0496a30>] bond_vlan_rx_register+0x48/0x5f [bonding]
                        [<ffffffffa04874e1>] register_vlan_dev+0x216/0x295 [8021q]
                        [<ffffffffa0487df8>] vlan_ioctl_handler+0x36b/0x403 [8021q]
                        [<ffffffff81292b53>] sock_ioctl+0x198/0x231
                        [<ffffffff810ebb08>] vfs_ioctl+0x2a/0x77
                        [<ffffffff810ec051>] do_vfs_ioctl+0x484/0x4d5
                        [<ffffffff810ec0f9>] sys_ioctl+0x57/0x7a
                        [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
   SOFTIRQ-ON-W at:
                        [<ffffffff8106d3e6>] __lock_acquire+0x668/0x816
                        [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                        [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
                        [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
                        [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
                        [<ffffffffa0496a30>] bond_vlan_rx_register+0x48/0x5f [bonding]
                        [<ffffffffa04874e1>] register_vlan_dev+0x216/0x295 [8021q]
                        [<ffffffffa0487df8>] vlan_ioctl_handler+0x36b/0x403 [8021q]
                        [<ffffffff81292b53>] sock_ioctl+0x198/0x231
                        [<ffffffff810ebb08>] vfs_ioctl+0x2a/0x77
                        [<ffffffff810ec051>] do_vfs_ioctl+0x484/0x4d5
                        [<ffffffff810ec0f9>] sys_ioctl+0x57/0x7a
                        [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
   INITIAL USE at:
                       [<ffffffff8106d431>] __lock_acquire+0x6b3/0x816
                       [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
                       [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
                       [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
                       [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
                       [<ffffffffa0496a30>] bond_vlan_rx_register+0x48/0x5f [bonding]
                       [<ffffffffa04874e1>] register_vlan_dev+0x216/0x295 [8021q]
                       [<ffffffffa0487df8>] vlan_ioctl_handler+0x36b/0x403 [8021q]
                       [<ffffffff81292b53>] sock_ioctl+0x198/0x231
                       [<ffffffff810ebb08>] vfs_ioctl+0x2a/0x77
                       [<ffffffff810ec051>] do_vfs_ioctl+0x484/0x4d5
                       [<ffffffff810ec0f9>] sys_ioctl+0x57/0x7a
                       [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b
 }
 ... key      at: [<ffffffffa02c4790>] __key.50251+0x0/0xffffffffffffd7b5 [bnx2]
 ... acquired at:
   [<ffffffff8106c02e>] check_irq_usage+0xb3/0xc5
   [<ffffffff8106c7b6>] validate_chain+0x776/0xd3e
   [<ffffffff8106d52e>] __lock_acquire+0x7b0/0x816
   [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
   [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
   [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
   [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
   [<ffffffffa0493a3d>] bond_del_vlans_from_slave+0xf1/0x109 [bonding]
   [<ffffffffa0494d04>] bond_release_all+0xc6/0x214 [bonding]
   [<ffffffffa0494e80>] bond_free_all+0x2e/0x84 [bonding]
   [<ffffffffa049e864>] bonding_exit+0x30/0x37 [bonding]
   [<ffffffff81076105>] sys_delete_module+0x1b3/0x222
   [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b


stack backtrace:
Pid: 4063, comm: rmmod Not tainted 2.6.32-rc8 #204
Call Trace:
 [<ffffffff8106bf67>] check_usage+0x453/0x467
 [<ffffffff8100f345>] ? print_context_stack+0x91/0xa9
 [<ffffffff8106c02e>] check_irq_usage+0xb3/0xc5
 [<ffffffff8106c7b6>] validate_chain+0x776/0xd3e
 [<ffffffff81069d0d>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81331841>] ? _spin_unlock_irq+0x2b/0x30
 [<ffffffff8106079e>] ? sched_clock_local+0x11/0x76
 [<ffffffff810608bf>] ? sched_clock_cpu+0xbc/0xc7
 [<ffffffff8106d52e>] __lock_acquire+0x7b0/0x816
 [<ffffffff8106d65b>] lock_acquire+0xc7/0xe4
 [<ffffffffa02bbbf0>] ? bnx2_netif_stop+0x25/0xfc [bnx2]
 [<ffffffffa02bbbf0>] ? bnx2_netif_stop+0x25/0xfc [bnx2]
 [<ffffffff8133006f>] mutex_lock_nested+0x5d/0x2b8
 [<ffffffffa02bbbf0>] ? bnx2_netif_stop+0x25/0xfc [bnx2]
 [<ffffffff810608bf>] ? sched_clock_cpu+0xbc/0xc7
 [<ffffffffa02bbbf0>] bnx2_netif_stop+0x25/0xfc [bnx2]
 [<ffffffffa049397a>] ? bond_del_vlans_from_slave+0x2e/0x109 [bonding]
 [<ffffffffa02bf752>] bnx2_vlan_rx_register+0x28/0x6a [bnx2]
 [<ffffffffa049397a>] ? bond_del_vlans_from_slave+0x2e/0x109 [bonding]
 [<ffffffffa0493a3d>] bond_del_vlans_from_slave+0xf1/0x109 [bonding]
 [<ffffffffa0494d04>] bond_release_all+0xc6/0x214 [bonding]
 [<ffffffff8104f457>] ? del_timer_sync+0x0/0x84
 [<ffffffffa0494e80>] bond_free_all+0x2e/0x84 [bonding]
 [<ffffffffa049e864>] bonding_exit+0x30/0x37 [bonding]
 [<ffffffff81076105>] sys_delete_module+0x1b3/0x222
 [<ffffffff81069cd5>] ? trace_hardirqs_on_caller+0x113/0x13e
 [<ffffffff81084a52>] ? audit_syscall_entry+0x1bb/0x1ee
 [<ffffffff8100b9ab>] system_call_fastpath+0x16/0x1b

Comment 12 Michael Chan 2010-01-26 22:56:54 UTC
bond_del_vlans_from_slave() holds bond->lock and calls ndo_vlan_rx_register().  We then call bnx2_netif_stop() -> bnx2_cnic_stop() which has many sleeping functions.

Even without cnic loaded, bnx2_netif_stop() -> bnx2_disable_int_sync() -> synchronize_irq() -> wait_event() is potentially a problem.  I think most of the time wait_event() never sleeps because the IRQ is not pending, so we don't see the problem.
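
To make the failure mode concrete, here is a minimal user-space model of the call chain described above. It is only a sketch: the function names are borrowed from the traces in this bug, but the `Cpu` class, the `preempt_count` bookkeeping, the `try_release()` helper, and the assumption that bond->lock is taken with a `_bh` variant (inferred from the 0x00000100 in the bug summary) are illustrative, not the actual driver code. The comm and pid values are just examples taken from the lockdep backtrace.

```python
from contextlib import contextmanager

SOFTIRQ_OFFSET = 0x100  # assumed: one level of bottom-half disabling


class SchedulingWhileAtomic(RuntimeError):
    """Raised where the real kernel would print the BUG message."""


class Cpu:
    """Hypothetical stand-in for per-CPU state; not a kernel structure."""

    def __init__(self):
        self.preempt_count = 0


@contextmanager
def write_lock_bh(cpu):
    # Model of holding bond->lock with bottom halves disabled.
    cpu.preempt_count += SOFTIRQ_OFFSET
    try:
        yield
    finally:
        cpu.preempt_count -= SOFTIRQ_OFFSET


def schedule(cpu, comm, pid):
    # Sleeping is only legal when no atomic section is active.
    if cpu.preempt_count:
        raise SchedulingWhileAtomic(
            "BUG: scheduling while atomic: %s/0x%08x/%d"
            % (comm, cpu.preempt_count, pid)
        )


def bnx2_netif_stop(cpu, cnic_loaded):
    # Simplification: only the cnic path sleeps in this model. The
    # synchronize_irq() path mentioned above is not modelled.
    if cnic_loaded:
        schedule(cpu, "rmmod", 4063)


def bond_del_vlans_from_slave(cpu, cnic_loaded):
    with write_lock_bh(cpu):
        # ndo_vlan_rx_register -> bnx2_vlan_rx_register -> bnx2_netif_stop
        bnx2_netif_stop(cpu, cnic_loaded)


def try_release(cnic_loaded):
    """Return the BUG text, or None if the release completed cleanly."""
    cpu = Cpu()
    try:
        bond_del_vlans_from_slave(cpu, cnic_loaded)
    except SchedulingWhileAtomic as exc:
        return str(exc)
    return None


print(try_release(cnic_loaded=True))
print(try_release(cnic_loaded=False))
```

In this model the first call reports the BUG line and the second returns None, which matches the observation that the problem is normally only visible with cnic loaded.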

Assuming we cannot remove the spinlock in the bonding driver, I need to think about how to fix this.

Thanks.

Comment 13 Andy Gospodarek 2010-01-26 23:02:56 UTC
I'm still thinking about this as well, Michael.  Just wanted to be sure you were aware of the problem.

Comment 14 Oded Ramraz 2010-01-27 12:49:09 UTC
I encountered errors while trying to add a bond interface to a bridge (without VLAN tagging).
Whenever I try to add the bond interface, the host becomes unresponsive.

This is what I found in the netconsole log:

2010-01-27 15:03:10,822 bonding: bond0: enslaving eth2 as a backup interface with a down link.

2010-01-27 15:03:10,822 bnx2i: iSCSI not supported, dev=eth2

2010-01-27 15:03:10,869 ADDRCONF(NETDEV_UP): bond0: link is not ready

2010-01-27 15:03:10,869 bonding: bond0: link status definitely up for interface eth3.

2010-01-27 15:03:10,947 device eth3 entered promiscuous mode

2010-01-27 15:03:10,947 device eth2 entered promiscuous mode

2010-01-27 15:03:10,947 device bond0 entered promiscuous mode

2010-01-27 15:03:11,980 bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex

2010-01-27 15:03:12,058 bonding: bond0: link status definitely up for interface eth2.


On the host console I saw this error message:
Unable to handle kernel null pointer dereference.

Host and network adapters information: 

Linux silver-vdsa.qa.lab.tlv.redhat.com 2.6.18-164.9.1.el5 #1 SMP Wed Dec 9
03:27:37 EST 2009 x86_64 x86_64 x86_64 GNU/Linux


[root@silver-vdsa ~]# ethtool -i eth2
driver: bnx2
version: 2.0.2
firmware-version: 3.5.12 ipms 1.6.0
bus-info: 0000:03:00.0
[root@silver-vdsa ~]# ethtool -i eth3
driver: bnx2
version: 2.0.2
firmware-version: 3.5.12 ipms 1.6.0
bus-info: 0000:07:00.0
[root@silver-vdsa ~]#

Comment 15 Andy Gospodarek 2010-01-27 15:32:41 UTC
(In reply to comment #14)
> 
> 
> On the host console i saw this error message:
> Unable to handle kernel null pointer dereference.
> 

Without the rest of the message, unfortunately this isn't helpful.

Comment 16 Oded Ramraz 2010-01-27 16:01:59 UTC
Created attachment 387114 [details]
screen shots

Comment 17 Michael Chan 2010-02-16 20:19:09 UTC
Created attachment 394626 [details]
Patch to fix the issue.

This patch should fix the issue.  Please review and test to confirm.  We'll do more testing before sending upstream.  Thanks.

Comment 18 Andy Gospodarek 2010-02-16 21:28:22 UTC
This patch seems fine to me as long as you do not feel that the cnic part actually needs to be reset when adding and removing VLANs.

I don't have a way to test cnic functionality with this patch, but I will verify that it will resolve the problem (I suspect it will).

Comment 19 Andy Gospodarek 2010-02-22 22:12:00 UTC
Michael, I tested the patch in comment #17 on 2.6.33-rc8 and it seems to work.

I still get some lockdep warnings (below), but these are not new. 

How is it looking based on your testing?



=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.33-rc8 #2
-------------------------------------------------------
rmmod/4410 is trying to acquire lock:
 ((bond_dev->name)){+.+...}, at: [<ffffffff810514ce>] cleanup_workqueue_thread+0x1e/0xb8

but task is already holding lock:
 (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8103c7c4>] cpu_maps_update_begin+0x12/0x14

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (cpu_add_remove_lock){+.+.+.}:
       [<ffffffff81065a09>] validate_chain+0xa40/0xd38
       [<ffffffff810664ae>] __lock_acquire+0x7ad/0x813
       [<ffffffff810665db>] lock_acquire+0xc7/0xe4
       [<ffffffff8133443b>] mutex_lock_nested+0x5d/0x2d0
       [<ffffffff8103c7c4>] cpu_maps_update_begin+0x12/0x14
       [<ffffffff810515c8>] destroy_workqueue+0x2b/0x9f
       [<ffffffffa04bc5de>] bond_uninit+0x30a/0x338 [bonding]
       [<ffffffff812a3e11>] rollback_registered_many+0xeb/0x16b
       [<ffffffff812a3ea5>] unregister_netdevice_many+0x14/0x3f
       [<ffffffff812adca0>] __rtnl_kill_links+0x5f/0x6a
       [<ffffffff812adcc9>] __rtnl_link_unregister+0x1e/0x46
       [<ffffffff812adf92>] rtnl_link_unregister+0x19/0x22
       [<ffffffffa04c3e36>] bonding_exit+0x32/0x40 [bonding]
       [<ffffffff8106fc9d>] sys_delete_module+0x1c5/0x236
       [<ffffffff8100296b>] system_call_fastpath+0x16/0x1b

-> #2 (rtnl_mutex){+.+.+.}:
       [<ffffffff81065a09>] validate_chain+0xa40/0xd38
       [<ffffffff810664ae>] __lock_acquire+0x7ad/0x813
       [<ffffffff810665db>] lock_acquire+0xc7/0xe4
       [<ffffffff8133443b>] mutex_lock_nested+0x5d/0x2d0
       [<ffffffff812add5c>] rtnl_lock+0x12/0x14
       [<ffffffffa04ba463>] bond_mii_monitor+0x27e/0x4d9 [bonding]
       [<ffffffff810520cd>] worker_thread+0x1af/0x2ae
       [<ffffffff81054d41>] kthread+0x7d/0x85
       [<ffffffff81003794>] kernel_thread_helper+0x4/0x10

-> #1 ((&(&bond->mii_work)->work)){+.+...}:
       [<ffffffff81065a09>] validate_chain+0xa40/0xd38
       [<ffffffff810664ae>] __lock_acquire+0x7ad/0x813
       [<ffffffff810665db>] lock_acquire+0xc7/0xe4
       [<ffffffff810520c7>] worker_thread+0x1a9/0x2ae
       [<ffffffff81054d41>] kthread+0x7d/0x85
       [<ffffffff81003794>] kernel_thread_helper+0x4/0x10

-> #0 ((bond_dev->name)){+.+...}:
       [<ffffffff810656f5>] validate_chain+0x72c/0xd38
       [<ffffffff810664ae>] __lock_acquire+0x7ad/0x813
       [<ffffffff810665db>] lock_acquire+0xc7/0xe4
       [<ffffffff810514f5>] cleanup_workqueue_thread+0x45/0xb8
       [<ffffffff81051600>] destroy_workqueue+0x63/0x9f
       [<ffffffffa04bc5de>] bond_uninit+0x30a/0x338 [bonding]
       [<ffffffff812a3e11>] rollback_registered_many+0xeb/0x16b
       [<ffffffff812a3ea5>] unregister_netdevice_many+0x14/0x3f
       [<ffffffff812adca0>] __rtnl_kill_links+0x5f/0x6a
       [<ffffffff812adcc9>] __rtnl_link_unregister+0x1e/0x46
       [<ffffffff812adf92>] rtnl_link_unregister+0x19/0x22
       [<ffffffffa04c3e36>] bonding_exit+0x32/0x40 [bonding]
       [<ffffffff8106fc9d>] sys_delete_module+0x1c5/0x236
       [<ffffffff8100296b>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

2 locks held by rmmod/4410:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff812add5c>] rtnl_lock+0x12/0x14
 #1:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8103c7c4>] cpu_maps_update_begin+0x12/0x14

stack backtrace:
Pid: 4410, comm: rmmod Not tainted 2.6.33-rc8 #2
Call Trace:
 [<ffffffff810648c1>] print_circular_bug+0xb3/0xc1
 [<ffffffff810656f5>] validate_chain+0x72c/0xd38
 [<ffffffff810664ae>] __lock_acquire+0x7ad/0x813
 [<ffffffff810665db>] lock_acquire+0xc7/0xe4
 [<ffffffff810514ce>] ? cleanup_workqueue_thread+0x1e/0xb8
 [<ffffffff810514f5>] cleanup_workqueue_thread+0x45/0xb8
 [<ffffffff810514ce>] ? cleanup_workqueue_thread+0x1e/0xb8
 [<ffffffff81051600>] destroy_workqueue+0x63/0x9f
 [<ffffffffa04bc5de>] bond_uninit+0x30a/0x338 [bonding]
 [<ffffffff812a3e11>] rollback_registered_many+0xeb/0x16b
 [<ffffffff812a3ea5>] unregister_netdevice_many+0x14/0x3f
 [<ffffffff812adca0>] __rtnl_kill_links+0x5f/0x6a
 [<ffffffff812adcc9>] __rtnl_link_unregister+0x1e/0x46
 [<ffffffff812adf92>] rtnl_link_unregister+0x19/0x22
 [<ffffffffa04c3e36>] bonding_exit+0x32/0x40 [bonding]
 [<ffffffff8106fc9d>] sys_delete_module+0x1c5/0x236
 [<ffffffff813366e9>] ? retint_swapgs+0xe/0x13
 [<ffffffff8107db7a>] ? audit_syscall_entry+0x1d0/0x203
 [<ffffffff8100296b>] system_call_fastpath+0x16/0x1b

Comment 20 RHEL Product and Program Management 2010-02-22 22:16:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 21 Peter Martuccelli 2010-02-23 14:07:07 UTC
Patch is still in progress, targeting RHEL 5.6 for submission.

Comment 23 Andy Gospodarek 2010-03-25 00:38:17 UTC
New test kernels available here:

http://people.redhat.com/agospoda/#rhel5

Any feedback you can provide is greatly appreciated.

Comment 24 Oded Ramraz 2010-04-08 07:13:46 UTC
I installed the test kernel on my hosts and tried to reproduce this problem.
To do so, I changed the interface removal order in the scripts back to the original order in which I found this issue: NICs, bond, VLAN, bridge.
I used to encounter this issue whenever I tried to remove the bond interface (with or without VLAN tagging). I tried several times and could not reproduce it again.

Comment 25 Andy Gospodarek 2010-04-08 14:11:36 UTC
Michael, It looks like your patch works well.  What is the target for upstream inclusion?

Comment 28 Zhongqiang Dou 2010-04-12 04:36:29 UTC
Andy, I can't reproduce this bug on kernels 2.6.18-183 and 2.6.18-185. Here are
our test steps and configuration files.
In my test, the bnx2 firmware version is "4.0.3 ipms 1.6.0", which
differs from the one in comment #9.
Could you please help me clarify the verification steps?

# uname -a
Linux amd-8356-32-3 2.6.18-185.el5 #1 SMP Thu Jan 14 16:44:40 EST 2010 x86_64
x86_64 x86_64 GNU/Linux

# ethtool -i eth4
driver: bnx2
version: 2.0.2
firmware-version: 4.0.3 ipms 1.6.0
bus-info: 0000:0d:00.0

# ethtool -i eth5
driver: bnx2
version: 2.0.2
firmware-version: 4.0.3 ipms 1.6.0
bus-info: 0000:18:00.0

##The configuration file##
==/etc/sysconfig/network-scripts/ifcfg-bond0==
DEVICE=bond0
BOOTPROTO=none
ONBOOT=no
BONDING_OPTS="mode=1 miimon=100"
=============================
==/etc/sysconfig/network-scripts/ifcfg-bond0.100==
DEVICE=bond0.100
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
=============================
==/etc/sysconfig/network-scripts/ifcfg-eth4==
# Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet
DEVICE=eth4
#HWADDR=00:14:5E:F4:95:F2
ONBOOT=no
MASTER=bond0
SLAVE=yes
=============================
==/etc/sysconfig/network-scripts/ifcfg-eth5==
# Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet
DEVICE=eth5
#HWADDR=00:14:5E:F4:95:F4
ONBOOT=no
MASTER=bond0
SLAVE=yes
=============================
==/etc/modprobe.conf==
alias eth0 e1000e
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias eth4 bnx2
alias eth5 bnx2
alias scsi_hostadapter aacraid
alias scsi_hostadapter1 lpfc
alias bond0 bonding
=============================

####Steps#########
# ifup bond0

# ifup bond0.100
Added VLAN with VID == 100 to IF -:bond0:-

# brctl addbr br0

# brctl addif br0 bond0.100

# brctl show
bridge name bridge id  STP enabled interfaces
br0  8000.00145ef495f2 no  bond0.100

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth4
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth4
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:14:5e:f4:95:f2

Slave Interface: eth5
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:14:5e:f4:95:f4

# rmmod bonding

# cat /proc/net/bonding/bond0
cat: /proc/net/bonding/bond0: No such file or directory

#####dmesg########
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth4.
bnx2: eth4: using MSI
bonding: bond0: enslaving eth4 as a backup interface with a down link.
bonding: bond0: Adding slave eth5.
bnx2: eth5: using MSI
bonding: bond0: enslaving eth5 as a backup interface with a down link.
bnx2: eth4 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit
flow control ON
bonding: bond0: link status definitely up for interface eth4.
bonding: bond0: making interface eth4 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth5 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit
flow control ON
bonding: bond0: link status definitely up for interface eth5.
802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
All bugs added by David S. Miller <davem@redhat.com>
bond0: no IPv6 routers present
bond0.100: no IPv6 routers present
Bridge firewalling registered
bond0.100: dev_set_promiscuity(master, 1)
device eth4 entered promiscuous mode
device bond0 entered promiscuous mode
device bond0.100 entered promiscuous mode
device eth4 left promiscuous mode
bonding: bond0: Warning: clearing HW address of bond0 while it still has VLANs.
bonding: bond0: When re-adding slaves, make sure the bond's HW address matches
its VLANs'.
bonding: bond0: released all slaves
br0: port 1(bond0.100) entering disabled state

Comment 29 Andy Gospodarek 2010-04-12 14:01:09 UTC
Comment#9 and comment#10 describe how to reproduce this.  All you need to do on your setup is:

# modprobe cnic
# ifup bond0
# ifup bond0.100
# rmmod bonding

I suspect your previous tests were not run with cnic loaded.  It will not be loaded by default so it must be manually inserted.
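
The four steps above can be wrapped in a small helper for repeated verification runs. This is a hedged sketch, not part of any Red Hat test suite: `REPRO_STEPS`, `run_steps()`, `hit_bug()`, and the `--run` flag are names invented here, and actually executing the steps requires root, a bnx2-backed bond0, and the ifcfg files shown earlier in this bug.

```python
import subprocess
import sys

# Commands quoted from the steps above; they need root and real hardware.
REPRO_STEPS = [
    ["modprobe", "cnic"],
    ["ifup", "bond0"],
    ["ifup", "bond0.100"],
    ["rmmod", "bonding"],
]

# Text that identifies the failure in the kernel log.
BUG_MARKER = "scheduling while atomic"


def run_steps(run=subprocess.check_call):
    """Run each step in order; `run` can be swapped out for a dry run."""
    for argv in REPRO_STEPS:
        run(argv)


def hit_bug(dmesg_text):
    """Return True if any line of the kernel log contains the BUG marker."""
    return any(BUG_MARKER in line for line in dmesg_text.splitlines())


if __name__ == "__main__" and "--run" in sys.argv:
    # Only touches the system when explicitly asked to.
    run_steps()
    log = subprocess.check_output(["dmesg"]).decode(errors="replace")
    print("reproduced" if hit_bug(log) else "not reproduced")
```

Passing a different `run` callable (for example `print`) gives a dry run that only lists the commands.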

Comment 31 Zhongqiang Dou 2010-04-16 09:41:26 UTC
Hi Andy, thanks for your help! I have now reproduced this bug by following the four steps.

Comment 32 Zhongqiang Dou 2010-04-16 10:25:48 UTC
I verified this bug on kernel 2.6.18-185.el5 and kernel 2.6.18-196.el5.
No BUG messages are output on kernel 2.6.18-196.el5.
==On kernel 2.6.18-185.el5==
# modprobe cnic
# ifup bond0
# ifup bond0.100
# rmmod bonding

# uname -a
Linux amd-8356-32-3 2.6.18-185.el5 #1 SMP Thu Jan 14 16:44:40 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

I then got the following messages in `dmesg`:
#####dmesg########
cnic: Added CNIC device: eth4
cnic: Added CNIC device: eth5
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth4.
bnx2: eth4: using MSI
bonding: bond0: enslaving eth4 as a backup interface with a down link.
bonding: bond0: Adding slave eth5.
bnx2: eth5: using MSI
bonding: bond0: enslaving eth5 as a backup interface with a down link.
bnx2: eth4 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
bonding: bond0: link status definitely up for interface eth4.
bonding: bond0: making interface eth4 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth5 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
bonding: bond0: link status definitely up for interface eth5.
bond0: no IPv6 routers present
bond0.100: no IPv6 routers present
BUG: scheduling while atomic: rmmod/0x00000100/6486

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff80150c87>] __next_cpu+0x19/0x28
 [<ffffffff8008c850>] find_busiest_group+0x20d/0x621
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff80064167>] wait_for_completion+0x79/0xa2
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff8009faa6>] synchronize_rcu+0x30/0x36
 [<ffffffff8009f5e2>] wakeme_after_rcu+0x0/0x9
 [<ffffffff88587d74>] :cnic:cnic_stop_hw+0x38/0xa6
 [<ffffffff8858ca2e>] :cnic:cnic_ctl+0x35/0xac
 [<ffffffff8823b88f>] :bnx2:bnx2_netif_stop+0x3a/0xea
 [<ffffffff8823fbe5>] :bnx2:bnx2_vlan_rx_register+0x20/0x61
 [<ffffffff88483a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff88485728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff884858c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff8848ef88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

BUG: scheduling while atomic: rmmod/0x00000100/6486

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff80064167>] wait_for_completion+0x79/0xa2
 [<ffffffff8008dd0d>] default_wake_function+0x0/0xe
 [<ffffffff8009faa6>] synchronize_rcu+0x30/0x36
 [<ffffffff8009f5e2>] wakeme_after_rcu+0x0/0x9
 [<ffffffff88587d74>] :cnic:cnic_stop_hw+0x38/0xa6
 [<ffffffff8858ca2e>] :cnic:cnic_ctl+0x35/0xac
 [<ffffffff8823b88f>] :bnx2:bnx2_netif_stop+0x3a/0xea
 [<ffffffff88426f78>] :ipv6:fib6_age+0x0/0x65
 [<ffffffff8823fbe5>] :bnx2:bnx2_vlan_rx_register+0x20/0x61
 [<ffffffff88483a37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff88485728>] :bonding:bond_release_all+0xb3/0x21c
 [<ffffffff884858c0>] :bonding:bond_free_all+0x2f/0xb5
 [<ffffffff8848ef88>] :bonding:bonding_exit+0x30/0x36
 [<ffffffff800a7394>] sys_delete_module+0x196/0x1c5
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

bonding: bond0: Warning: clearing HW address of bond0 while it still has VLANs.
bonding: bond0: When re-adding slaves, make sure the bond's HW address matches its VLANs'.
bonding: bond0: released all slaves

==On kernel 2.6.18-196.el5==
# modprobe cnic
# ifup bond0
# ifup bond0.100
# rmmod bonding

# uname -a
Linux amd-8356-32-3 2.6.18-196.el5 #1 SMP Tue Apr 13 12:36:38 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

I then got the following messages in `dmesg`:
==dmesg====
cnic: Added CNIC device: eth4
cnic: Added CNIC device: eth5
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth4.
bnx2: eth4: using MSI
bonding: bond0: enslaving eth4 as a backup interface with a down link.
bonding: bond0: Adding slave eth5.
bnx2: eth5: using MSI
bonding: bond0: enslaving eth5 as a backup interface with a down link.
bnx2: eth4 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
bonding: bond0: link status definitely up for interface eth4.
bonding: bond0: making interface eth4 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth5 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
bonding: bond0: link status definitely up for interface eth5.
bond0: no IPv6 routers present
bond0.100: no IPv6 routers present
bonding: bond0: Warning: clearing HW address of bond0 while it still has VLANs.
bonding: bond0: When re-adding slaves, make sure the bond's HW address matches its VLANs'.
bonding: bond0: released all slaves
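
As a reading aid for the 2.6.18-185.el5 output, the `BUG: scheduling while atomic` line carries the task name, the preempt count, and the pid. The sketch below decodes it. The bit layout used here (8 preemption bits, 8 softirq bits, 12 hardirq bits) is my understanding of kernels from this era and should be treated as an assumption, and `decode_atomic_bug()` is a hypothetical helper, not an existing tool.

```python
import re

# Assumed preempt_count layout for kernels of this era:
# bits 0-7 preemption depth, bits 8-15 softirq depth, bits 16-27 hardirq depth.
PREEMPT_MASK = 0x000000FF
SOFTIRQ_MASK = 0x0000FF00
HARDIRQ_MASK = 0x0FFF0000

# Matches "<comm>/0x<preempt_count>/<pid>" after the fixed BUG text.
BUG_RE = re.compile(r"scheduling while atomic: (.+)/0x([0-9a-fA-F]+)/(\d+)")


def decode_atomic_bug(line):
    """Split a BUG line into comm, pid and per-field nesting depths."""
    match = BUG_RE.search(line)
    if match is None:
        return None
    count = int(match.group(2), 16)
    return {
        "comm": match.group(1),
        "pid": int(match.group(3)),
        "preempt": count & PREEMPT_MASK,
        "softirq": (count & SOFTIRQ_MASK) >> 8,
        "hardirq": (count & HARDIRQ_MASK) >> 16,
    }


info = decode_atomic_bug("BUG: scheduling while atomic: rmmod/0x00000100/6486")
print(info)
```

Under that assumed layout, 0x00000100 decodes to a softirq depth of 1 with no preemption or hardirq nesting, which would be consistent with rmmod sleeping while bottom halves were disabled.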

Comment 33 Jarod Wilson 2010-04-21 19:41:07 UTC
in kernel-2.6.18-197.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 34 Michael Chan 2010-04-28 21:52:05 UTC
(In reply to comment #25)
> Michael, It looks like your patch works well.  What is the target for upstream
> inclusion?    

Sorry for taking so long.  It was merged upstream yesterday.

Comment 36 RHEL Product and Program Management 2010-06-02 05:13:00 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 47 Jaromir Hradilek 2010-10-12 22:47:14 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The cnic parts resets could cause a deadlock when the bnx2 device was enslaved in a bonding device and that device had an associated VLAN.

Comment 50 errata-xmlrpc 2011-01-13 20:59:13 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

