| Summary: | System interruption may occur in bond_release while doing ifdown-eth, or when removing an IO CRU. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Oonkwee Lim <oonkwee.lim> | ||||
| Component: | kernel | Assignee: | Jiri Pirko <jpirko> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 5.6 | CC: | agospoda, chas.horvath, dan.duval, jarod, jparadis, jpirko, kevin.paetzold, rkhan, robert.evans, rpacheco | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-02-25 22:56:20 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
Description
Oonkwee Lim
2011-02-22 18:15:15 UTC
PID: 19350 TASK: ffff81087f9ca860 CPU: 7 COMMAND: "ifdown-eth"
#0 [ffff810789617b80] crash_kexec at ffffffff800af85a
#1 [ffff810789617c40] __die at ffffffff80065117
#2 [ffff810789617c80] die at ffffffff8006c73a
#3 [ffff810789617cb0] do_invalid_op at ffffffff8006ccfa
#4 [ffff810789617d70] error_exit at ffffffff8005dde9
[exception RIP: bond_release+98]
RIP: ffffffff887f6c0b RSP: ffff810789617e28 RFLAGS: 00010286
RAX: 00000000ffffffff RBX: 00000000000005dc RCX: 0000000000000282
RDX: 00000000ffffffff RSI: ffff81080fb7a000 RDI: ffff810872cbf530
RBP: ffff810872cbf500 R8: 0000000000000008 R9: 0000000000000000
R10: 0000000000000004 R11: ffff81087cea3fd0 R12: ffff810872cbf000
R13: 000000000000000b R14: ffff81080fb7a000 R15: ffff81087ce892c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff810789617e20] bond_release at ffffffff887f6bf5
#6 [ffff810789617e70] bonding_store_slaves at ffffffff887ffb54
#7 [ffff810789617ed0] sysfs_write_file at ffffffff8010fee2
#8 [ffff810789617f10] vfs_write at ffffffff80016aa3
#9 [ffff810789617f40] sys_write at ffffffff8001735b
#10 [ffff810789617f80] tracesys at ffffffff8005d28d (via system_call)
RIP: 0000003b56cc6420 RSP: 00007fffc74c64d8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
RDX: 000000000000000b RSI: 00002ba20009d000 RDI: 0000000000000001
RBP: 000000000000000b R8: 00000000ffffffff R9: 00002ba1fcac3f50
R10: 0000000000000022 R11: 0000000000000246 R12: 0000003b56f52780
R13: 00002ba20009d000 R14: 000000000000000b R15: 0000000000000000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
PID: 5569 TASK: ffff81087c1c1040 CPU: 5 COMMAND: "bond1"
#0 [ffff810872777b70] crash_kexec at ffffffff800af85a
#1 [ffff810872777c30] __die at ffffffff80065117
#2 [ffff810872777c70] die at ffffffff8006c73a
#3 [ffff810872777ca0] do_invalid_op at ffffffff8006ccfa
#4 [ffff810872777d60] error_exit at ffffffff8005dde9
[exception RIP: bond_mii_monitor+1054]
RIP: ffffffff887f8e95 RSP: ffff810872777e10 RFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810872692530 RCX: 0000000000000286
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff80358ac0
RBP: ffff810872692500 R8: ffff810872776000 R9: 0000000000000039
R10: 00000000ffffffff R11: 0a64657265747369 R12: ffff810399a57800
R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff887f8a77
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff810872777e38] run_workqueue at ffffffff8004d7d0
#6 [ffff810872777e78] worker_thread at ffffffff8004a108
#7 [ffff810872777ee8] kthread at ffffffff80032996
#8 [ffff810872777f48] kernel_thread at ffffffff8005dfb1
block_netpoll_tx() and unblock_netpoll_tx() are called with bond->lock held in
bond_release(), but not in other places such as bond_miimon_commit():
int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
{
	....
	write_lock_bh(&bond->lock);
	block_netpoll_tx();

static void bond_miimon_commit(struct bonding *bond)
{
	....
	block_netpoll_tx();
	write_lock_bh(&bond->curr_slave_lock);
	bond_select_active_slave(bond);
	write_unlock_bh(&bond->curr_slave_lock);
	unblock_netpoll_tx();
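The following is a minimal user-space sketch of why unbalanced block/unblock calls can end in a BUG. It assumes the helpers keep one "TX blocked" flag per CPU and assert on it, which this report does not quote from bonding.h, so treat that as an assumption. All `model_*` names and `MODEL_NR_CPUS` are hypothetical, and the functions return `false` at the points where the kernel helper would presumably hit its BUG check. The sketch does not establish which path triggered the BUG in the traces above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8 /* hypothetical CPU count for the model */

/* One flag per CPU, standing in for a kernel per-CPU mask (assumption). */
static bool model_tx_blocked[MODEL_NR_CPUS];

/*
 * Model of block_netpoll_tx(): mark netpoll TX as blocked on this CPU.
 * Returns false if the flag was already set, i.e. a second block on the
 * same CPU without an intervening unblock.
 */
static bool model_block_netpoll_tx(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (model_tx_blocked[cpu])
		return false;
	model_tx_blocked[cpu] = true;
	return true;
}

/*
 * Model of unblock_netpoll_tx(): clear the flag on this CPU.
 * Returns false if the flag was not set, e.g. the matching block
 * ran on a different CPU.
 */
static bool model_unblock_netpoll_tx(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!model_tx_blocked[cpu])
		return false;
	model_tx_blocked[cpu] = false;
	return true;
}
```

In this model a block on CPU 5 followed by a second block on CPU 5 fails, and a block on CPU 7 followed by an unblock on CPU 5 also fails. Either pattern could arise if a caller can sleep or migrate between the paired calls, or if two paths nest them on one CPU.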
Kernel BUG at drivers/net/bonding/bonding.h:135
invalid opcode: 0000 [1] SMP
last sysfs file: /block/dm-0/dev
CPU 5
Modules linked in: ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic slhc autofs4 hidp nfs fscache
nfs_acl rfcomm l2cap bluetooth lockd sunrpc bonding be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
ib_addr iscsi_tcp bnx2i cnic uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2
scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi
wmi battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport ipmi_devintf
ftmod(PU) ipmi_msghandler button vtm(FU) sr_mod cdrom(U) radeonfb(FU) fosil(U) sg(U) pcspkr e1000(U)
tpm_tis tpm tpm_bios dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_round_robin(U)
dm_multipath(U) scsi_dh dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) lpfc(U)
scsi_transport_fc ata_piix(U) aic94xx(U) libsas(U) libata(U) scsi_transport_sas sd_mod(U) scsi_mod(U)
ext3 jbd uhci_hcd(U) ohci_hcd ehci_hcd
Pid: 5569, comm: bond1 Tainted: PF 2.6.18-238.1.1.el5 #1
RIP: 0010:[<ffffffff887f8e95>] [<ffffffff887f8e95>] :bonding:bond_mii_monitor+0x41e/0x4c0
RSP: 0018:ffff810872777e10 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810872692530 RCX: 0000000000000286
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff80358ac0
RBP: ffff810872692500 R08: ffff810872776000 R09: 0000000000000039
R10: 00000000ffffffff R11: 0a64657265747369 R12: ffff810399a57800
R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff887f8a77
FS: 0000000000000000(0000) GS:ffff81011ddbe4c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000ebfaf58 CR3: 0000000000201000 CR4: 00000000000006e0
Process bond1 (pid: 5569, threadinfo ffff810872776000, task ffff81087c1c1040)
Stack: ffff810872692878 ffff810872692880 ffff81087b192940 0000000000000282
 ffff810872692500 ffffffff8004d7d0 ffff810872777e80 ffff81087b192940
 ffffffff8004a018 ffff8108727fdd68 0000000000000282 ffff8108727fdd58
Call Trace:
 [<ffffffff8004d7d0>] run_workqueue+0x99/0xf6
 [<ffffffff8004a018>] worker_thread+0x0/0x122
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a108>] worker_thread+0xf0/0x122
 [<ffffffff8008e40a>] default_wake_function+0x0/0xe
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032996>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032898>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 aa 14 80 88 c2 87 00 48 8d 5d 34 48 89 df e8 c9 bc
RIP [<ffffffff887f8e95>] :bonding:bond_mii_monitor+0x41e/0x4c0
 RSP <ffff810872777e10>

Oonkwee,

I know what an IO CRU is, but I have a feeling that gospo (assigned to this BZ) does not know what one is. Could you elaborate on the design of the IO CRU so he has a better understanding of what the problem is?

Thanks,
P.

A "CRU" (Customer Replaceable Unit) is essentially a hot-pluggable box containing processor, memory, PCI bus, and associated I/O devices. A Stratus system consists of two CRUs plugged into Stratus' proprietary backplane with the CPUs running in lockstep. From an I/O standpoint, unplugging a CRU is essentially the same as unplugging a whole bunch of devices at once.

Note that this problem seems to be occurring both when we bring down the bonded interface cleanly (ifdown-eth) and uncleanly (unplugging a CRU).

Created attachment 481088 [details]
bond-netpoll-block-tx-fix.patch
The attached patch should resolve this.
Feedback, as always, is appreciated.
*** This bug has been marked as a duplicate of bug 659594 ***