Description of problem:
System interruption may occur in bond_release while doing ifdown-eth, or when removing an IO CRU.

Version-Release number of selected component (if applicable):

How reproducible:
Easily reproducible during tests.

Steps to Reproduce:
1. Run the tests

Actual results:
System crashed.

Expected results:
No crash

Additional info:
PID: 19350  TASK: ffff81087f9ca860  CPU: 7  COMMAND: "ifdown-eth"
 #0 [ffff810789617b80] crash_kexec at ffffffff800af85a
 #1 [ffff810789617c40] __die at ffffffff80065117
 #2 [ffff810789617c80] die at ffffffff8006c73a
 #3 [ffff810789617cb0] do_invalid_op at ffffffff8006ccfa
 #4 [ffff810789617d70] error_exit at ffffffff8005dde9
    [exception RIP: bond_release+98]
    RIP: ffffffff887f6c0b  RSP: ffff810789617e28  RFLAGS: 00010286
    RAX: 00000000ffffffff  RBX: 00000000000005dc  RCX: 0000000000000282
    RDX: 00000000ffffffff  RSI: ffff81080fb7a000  RDI: ffff810872cbf530
    RBP: ffff810872cbf500   R8: 0000000000000008   R9: 0000000000000000
    R10: 0000000000000004  R11: ffff81087cea3fd0  R12: ffff810872cbf000
    R13: 000000000000000b  R14: ffff81080fb7a000  R15: ffff81087ce892c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff810789617e20] bond_release at ffffffff887f6bf5
 #6 [ffff810789617e70] bonding_store_slaves at ffffffff887ffb54
 #7 [ffff810789617ed0] sysfs_write_file at ffffffff8010fee2
 #8 [ffff810789617f10] vfs_write at ffffffff80016aa3
 #9 [ffff810789617f40] sys_write at ffffffff8001735b
#10 [ffff810789617f80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 0000003b56cc6420  RSP: 00007fffc74c64d8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 000000000000000b  RSI: 00002ba20009d000  RDI: 0000000000000001
    RBP: 000000000000000b   R8: 00000000ffffffff   R9: 00002ba1fcac3f50
    R10: 0000000000000022  R11: 0000000000000246  R12: 0000003b56f52780
    R13: 00002ba20009d000  R14: 000000000000000b  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

PID: 5569  TASK: ffff81087c1c1040  CPU: 5  COMMAND: "bond1"
 #0 [ffff810872777b70] crash_kexec at ffffffff800af85a
 #1 [ffff810872777c30] __die at ffffffff80065117
 #2 [ffff810872777c70] die at ffffffff8006c73a
 #3 [ffff810872777ca0] do_invalid_op at ffffffff8006ccfa
 #4 [ffff810872777d60] error_exit at ffffffff8005dde9
    [exception RIP: bond_mii_monitor+1054]
    RIP: ffffffff887f8e95  RSP: ffff810872777e10  RFLAGS: 00010286
    RAX: 00000000ffffffff  RBX: ffff810872692530  RCX: 0000000000000286
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffffffff80358ac0
    RBP: ffff810872692500   R8: ffff810872776000   R9: 0000000000000039
    R10: 00000000ffffffff  R11: 0a64657265747369  R12: ffff810399a57800
    R13: 0000000000000000  R14: 0000000000000002  R15: ffffffff887f8a77
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff810872777e38] run_workqueue at ffffffff8004d7d0
 #6 [ffff810872777e78] worker_thread at ffffffff8004a108
 #7 [ffff810872777ee8] kthread at ffffffff80032996
 #8 [ffff810872777f48] kernel_thread at ffffffff8005dfb1

block_netpoll_tx and unblock_netpoll_tx are protected in bond_release but not in other places.

int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
{
	....
	write_lock_bh(&bond->lock);
	block_netpoll_tx();

static void bond_miimon_commit(struct bonding *bond)
{
	....
	block_netpoll_tx();
	write_lock_bh(&bond->curr_slave_lock);
	bond_select_active_slave(bond);
	write_unlock_bh(&bond->curr_slave_lock);
	unblock_netpoll_tx();

Kernel BUG at drivers/net/bonding/bonding.h:135
invalid opcode: 0000 [1] SMP
last sysfs file: /block/dm-0/dev
CPU 5
Modules linked in: ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic slhc autofs4 hidp nfs fscache nfs_acl rfcomm l2cap bluetooth lockd sunrpc bonding be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport ipmi_devintf ftmod(PU) ipmi_msghandler button vtm(FU) sr_mod cdrom(U) radeonfb(FU) fosil(U) sg(U) pcspkr e1000(U) tpm_tis tpm tpm_bios dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_round_robin(U) dm_multipath(U) scsi_dh dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) lpfc(U) scsi_transport_fc ata_piix(U) aic94xx(U) libsas(U) libata(U) scsi_transport_sas sd_mod(U) scsi_mod(U) ext3 jbd uhci_hcd(U) ohci_hcd ehci_hcd
Pid: 5569, comm: bond1 Tainted: PF 2.6.18-238.1.1.el5 #1
RIP: 0010:[<ffffffff887f8e95>]  [<ffffffff887f8e95>] :bonding:bond_mii_monitor+0x41e/0x4c0
RSP: 0018:ffff810872777e10  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810872692530 RCX: 0000000000000286
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff80358ac0
RBP: ffff810872692500 R08: ffff810872776000 R09: 0000000000000039
R10: 00000000ffffffff R11: 0a64657265747369 R12: ffff810399a57800
R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff887f8a77
FS:  0000000000000000(0000) GS:ffff81011ddbe4c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000ebfaf58 CR3: 0000000000201000 CR4: 00000000000006e0
Process bond1 (pid: 5569, threadinfo ffff810872776000, task ffff81087c1c1040)
Stack:  ffff810872692878 ffff810872692880 ffff81087b192940 0000000000000282
 ffff810872692500 ffffffff8004d7d0 ffff810872777e80 ffff81087b192940
 ffffffff8004a018 ffff8108727fdd68 0000000000000282 ffff8108727fdd58
Call Trace:
 [<ffffffff8004d7d0>] run_workqueue+0x99/0xf6
 [<ffffffff8004a018>] worker_thread+0x0/0x122
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a108>] worker_thread+0xf0/0x122
 [<ffffffff8008e40a>] default_wake_function+0x0/0xe
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032996>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032898>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 aa 14 80 88 c2 87 00 48 8d 5d 34 48 89 df e8 c9 bc
RIP  [<ffffffff887f8e95>] :bonding:bond_mii_monitor+0x41e/0x4c0
 RSP <ffff810872777e10>
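
For anyone following along without the source handy: this report does not show the code at bonding.h:135, so the following is only a user-space sketch of how a block/unblock pairing check can fire. It assumes (this is an assumption, not something confirmed here) that block_netpoll_tx() and unblock_netpoll_tx() keep a per-CPU flag and hit BUG when a call is not matched on the same CPU. Every model_* name and MODEL_NR_CPUS is hypothetical and does not exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8 /* hypothetical CPU count for this model */

/* Hypothetical per-CPU "netpoll tx is blocked" flags; not the driver's state. */
static bool model_blocked[MODEL_NR_CPUS];

/*
 * Set the flag for `cpu`. Returns false if the flag was already set,
 * which is where a BUG_ON-style pairing check would fire.
 */
static bool model_block_netpoll_tx(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (model_blocked[cpu])
		return false;
	model_blocked[cpu] = true;
	return true;
}

/*
 * Clear the flag for `cpu`. Returns false if the flag was not set,
 * i.e. there was no matching block on this CPU.
 */
static bool model_unblock_netpoll_tx(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!model_blocked[cpu])
		return false;
	model_blocked[cpu] = false;
	return true;
}

/* Clear all flags so each scenario starts from a known state. */
static void model_reset(void)
{
	for (int i = 0; i < MODEL_NR_CPUS; i++)
		model_blocked[i] = false;
}
```

In this model, a block followed by an unblock on the same CPU succeeds. An unblock on a different CPU than the block, or a second block on a CPU that is already blocked, returns false, which is the kind of unbalanced pairing that a check in the real helpers could turn into the BUG seen above.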
Oonkwee, I know what an IO CRU is, but I have a feeling that gospo (assigned to this BZ) does not know what one is. Could you elaborate on the design of the IO CRU so he has a better understanding of what the problem is? Thanks, P.
A "CRU" (Customer Replaceable Unit) is essentially a hot-pluggable box containing processor, memory, PCI bus, and associated I/O devices. A Stratus system consists of two CRUs plugged into Stratus' proprietary backplane with the CPUs running in lockstep. From an I/O standpoint, unplugging a CRU is essentially the same as unplugging a whole bunch of devices at once. Note that this problem seems to be occurring both when we bring down the bonded interface cleanly (ifdown-eth) and uncleanly (unplugging a CRU).
Created attachment 481088
bond-netpoll-block-tx-fix.patch

The attached patch should resolve this. Feedback, as always, is appreciated.
*** This bug has been marked as a duplicate of bug 659594 ***