| Summary: | bnx2 panic kernel when load/unload module | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Hushan Jia <hjia> |
| Component: | kernel | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED DUPLICATE | QA Contact: | Network QE <network-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.1 | CC: | nhorman |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-06 17:19:08 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Hushan Jia
2011-04-08 07:51:20 UTC
It's a BUG halt in __run_timers that stems from the preempt count being different before and after a timer is run. Unfortunately, the way it's coded makes it all but impossible to tell which timer is changing the preempt count. Ostensibly it would be bnx2, but that's not guaranteed, as there are lots of in-flight timers, and the bnx2 timer should be cancelled before it's removed. The fact that we BUG halt in the calling function makes it very tough to determine who is doing this. For that reason upstream changed the BUG halt to a WARN_ON so we could try to determine what else might be going on. I'll backport that patch in a sec, and hopefully it will give us more insight into the problem.
Created attachment 491240 [details]
patch to survive preempt count leaks
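To illustrate the check described above, here is a minimal user-space sketch in Python. It is not kernel code and not the attached patch: `Cpu`, `run_timers`, `balanced_timer`, and `leaky_timer` are hypothetical names, and restoring the saved count after a mismatch is an assumption made so the loop can keep going. It only shows why warning per callback, rather than halting in the caller, identifies the offending timer.

```python
import warnings


class Cpu:
    """Toy stand-in for per-CPU state (hypothetical, not a kernel structure)."""

    def __init__(self):
        self.preempt_count = 0


def run_timers(cpu, timers):
    """Run each (name, callback) pair and report callbacks that leak the count.

    Instead of halting on a mismatch, warn with the timer's name and restore
    the saved count so the remaining timers still run.
    """
    leaked = []
    for name, callback in timers:
        before = cpu.preempt_count
        callback(cpu)
        if cpu.preempt_count != before:
            warnings.warn(
                f"timer {name}: preempt leak {before:#x} -> {cpu.preempt_count:#x}"
            )
            leaked.append(name)
            cpu.preempt_count = before  # assumption: recover rather than halt
    return leaked


def balanced_timer(cpu):
    cpu.preempt_count += 1  # e.g. take a lock
    cpu.preempt_count -= 1  # ... and release it


def leaky_timer(cpu):
    cpu.preempt_count += 1  # lock taken, never released


if __name__ == "__main__":
    cpu = Cpu()
    print(run_timers(cpu, [("balanced", balanced_timer), ("leaky", leaky_timer)]))
```

Because the comparison happens around each individual callback, the warning carries the name of the timer that changed the count, which is the information the original BUG halt did not expose.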
Hey, here's a backport of 802702e0c2618465b813242d4dfee6a233ba0beb and 576da126a6c7364d70dfd58d0bbe43d05cf5859f. It will prevent the oops from being fatal and give us a chance to see where the preempt leak is really coming from. Could you please build a kernel with it and run your test again? Thanks!
Also, I'm going to move this to 6.2. Given the reproducer, it appears it takes a significant amount of stress by the local root user in a very specific sort of way to cause this to happen. I think it can wait for 6.2.
Hi Neil, on the patched kernel, it oops, but with a different trace:
# uname -r
2.6.32-130.el6.bz694737.x86_64
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:0d.0/0000:02:04.0/device
CPU 3
Modules linked in: bnx2(+) sunrpc ipv6 dm_mirror dm_region_hash dm_log sg microcode serio_raw k8temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm hwmon i2c_algo_bit
bnx2 0000:02:04.0: firmware: requesting bnx2/bnx2-mips-06-6.2.1.fw
bnx2 0000:02:04.0: firmware: requesting bnx2/bnx2-rv2p-06-6.0.15.fw
bnx2 0000:02:04.0: eth0: Broadcom NetXtreme II BCM5706 1000Base-SX (A2) PCI-X 64-bit 100MHz found at mem e2000000, IRQ 17, node addr 00:14:5e:6d:30:fc
bnx2 0000:02:05.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
i2c_core dm_mod [last unloaded: bnx2]
Modules linked in: bnx2(+) sunrpc ipv6 dm_mirror dm_region_hash dm_log sg microcode serio_raw k8temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm hwmon i2c_algo_bit
bnx2 0000:02:05.0: firmware: requesting bnx2/bnx2-mips-06-6.2.1.fw
bnx2 0000:02:05.0: firmware: requesting bnx2/bnx2-rv2p-06-6.0.15.fw
bnx2 0000:02:05.0: eth1: Broadcom NetXtreme II BCM5706 1000Base-SX (A2) PCI-X 64-bit 100MHz found at mem e4000000, IRQ 18, node addr 00:14:5e:b3:00:fc
i2c_core dm_mod [last unloaded: bnx2]
Pid: 0, comm: swapper Not tainted 2.6.32-130.el6.bz694737.x86_64 #1 BladeCenter LS21 -[797251Z]-
RIP: 0010:[<ffffffffa0dca009>] [<ffffffffa0dca009>] 0xffffffffa0dca009
RSP: 0018:ffff880002183e38 EFLAGS: 00010a07
RAX: ffff880002183e90 RBX: ffff88007cd18000 RCX: 0000000000000001
RDX: ffff880002183e90 RSI: 0000000000000000 RDI: ffff880029648700
RBP: ffff880002183ed0 R08: 0000000000000000 R09: 000005a32eb44818
R10: ffff880029648700 R11: 0000000000000000 R12: ffff880029649bc8
R13: ffff880002183e90 R14: 0000000000000100 R15: ffff88007cd41fd8
FS: 00007f2e18813700(0000) GS:ffff880002180000(0000) knlGS:0000000000000000
CS: 0010 DS: DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f09d4608140 CR3: 000000007ace5000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff814e0a66>] ? notifier_call_chain+0x16/0x80
[<ffffffff810142fd>] ? default_idle+0x4d/0xb0
[<ffffffff810143c3>] ? c1e_idle+0x63/0x120
[<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
[<ffffffff814d4664>] ? start_secondary+0x202/0x245
BUG: scheduling while atomic: swapper/0/0x10000100
Modules linked in: bnx2(-) sunrpc ipv6 dm_mirror dm_region_hash dm_log sg microcode serio_raw k8temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: bnx2]
CPU 3:
Modules linked in: bnx2(-) sunrpc ipv6 dm_mirror dm_region_hash dm_log sg microcode serio_raw k8temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: bnx2]
Pid: 0, comm: swapper Not tainted 2.6.32-130.el6.bz694737.x86_64 #1 BladeCenter LS21 -[797251Z]-
RIP: 0010:[<ffffffff810362ab>] [<ffffffff810362ab>] native_safe_halt+0xb/0x10
RSP: 0018:ffff88007cd41ea8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88007cd41ea8 RCX: 0000000003000000
RDX: 0000000000000000 RSI: ffff88007cd41ee4 RDI: 0000000003000000
RBP: ffffffff8100bc8e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8160b060
R13: 0000000000000000 R14: ffff880002195f80 R15: ffff88007cd41e38
FS: 00007f2e18813700(0000) GS:ffff880002180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f09d4560448 CR3: 000000007ace5000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff814e0a66>] ? notifier_call_chain+0x16/0x80
[<ffffffff810142fd>] ? default_idle+0x4d/0xb0
[<ffffffff810143c3>] ? c1e_idle+0x63/0x120
[<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
[<ffffffff814d4664>] ? start_secondary+0x202/0x245
---[ end trace 996e3d14436bb5c5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G D ---------------- 2.6.32-130.el6.bz694737.x86_64 #1
Call Trace:
<IRQ> [<ffffffff814da9e1>] ? panic+0x78/0x143
[<ffffffff814dea32>] ? oops_end+0xf2/0x100
[<ffffffff8100f2fb>] ? die+0x5b/0x90
[<ffffffff814de592>] ? do_general_protection+0x152/0x160
[<ffffffff814ddd65>] ? general_protection+0x25/0x30
[<ffffffff81079f53>] ? run_timer_softirq+0x173/0x3a0
[<ffffffff8109e070>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102a00d>] ? lapic_next_event+0x1d/0x30
[<ffffffff8106f737>] ? __do_softirq+0xb7/0x1e0
[<ffffffff81092d20>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
[<ffffffff8100df05>] ? do_softirq+0x65/0xa0
[<ffffffff8106f525>] ? irq_exit+0x85/0x90
[<ffffffff814e33a0>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bc93>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff810362ab>] ? native_safe_halt+0xb/0x10
[<ffffffff814e0a66>] ? notifier_call_chain+0x16/0x80
[<ffffffff810142fd>] ? default_idle+0x4d/0xb0
[<ffffffff810143c3>] ? c1e_idle+0x63/0x120
[<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
[<ffffffff814d4664>] ? start_secondary+0x202/0x245
panic occurred, switching back to text console
BUG: scheduling while atomic: swapper/0/0x10000100
Modules linked in: bnx2(-) sunrpc ipv6 dm_mirror dm_region_hash dm_log sg microcode serio_raw k8temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: bnx2]
CPU 3:
Modules linked in: bnx2(-) sunrpc ipv6 dm_mirror dm_region_hash dm_log sg microcode serio_raw k8temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: bnx2]
Pid: 0, comm: swapper Tainted: G D ---------------- 2.6.32-130.el6.bz694737.x86_64 #1 BladeCenter LS21 -[797251Z]-
RIP: 0010:[<ffffffff810362ab>] [<ffffffff810362ab>] native_safe_halt+0xb/0x10
RSP: 0018:ffff88007cd41ea8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88007cd41ea8 RCX: 0000000003000000
RDX: 0000000000000000 RSI: ffff88007cd41ee4 RDI: 0000000003000000
RBP: ffffffff8100bc8e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8160b060
R13: 0000000000000000 R14: ffff880002195f80 R15: ffff88007cd41e38
FS: 00007f09d2f1c7a0(0000) GS:ffff880002180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f09d4652bb0 CR3: 0000000001a25000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff814e0a66>] ? notifier_call_chain+0x16/0x80
[<ffffffff810142fd>] ? default_idle+0x4d/0xb0
[<ffffffff810143c3>] ? c1e_idle+0x63/0x120
[<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
[<ffffffff814d4664>] ? start_secondary+0x202/0x245
Created attachment 492150 [details]
patch to prevent arming timer in bnx2
It's a bit difficult to be sure, but I think we have a situation here in which the bnx2_timer function is running after the module has been removed. Please test the above patch (in addition to the previous patch) on your reproducer. It's not a final fix, but it will confirm for us whether this is the problem. Thanks!
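The suspected failure mode, a self-rearming timer that fires after its owner is gone, can be modelled in a few lines of user-space Python. This is a toy simulation under stated assumptions, not the bnx2 driver and not the attached patch: `Driver`, `tick`, the `shutting_down` flag, and the `cancel_timer` parameter are all hypothetical names introduced for illustration.

```python
class Driver:
    """Hypothetical driver whose timer callback re-arms itself every tick."""

    def __init__(self, queue):
        self.queue = queue          # toy list of pending timer callbacks
        self.loaded = True          # False once the "module" is removed
        self.shutting_down = False  # set to refuse any further re-arming

    def arm(self):
        if not self.shutting_down:
            self.queue.append(self.timer_fn)

    def timer_fn(self):
        if not self.loaded:
            # In a real kernel this would be a jump into freed module text.
            raise RuntimeError("timer ran after module unload")
        self.arm()  # periodic timer: schedule the next run

    def remove(self, cancel_timer):
        if cancel_timer:
            self.shutting_down = True
            # Drop any pending instance of our callback before unloading.
            self.queue[:] = [cb for cb in self.queue if cb != self.timer_fn]
        self.loaded = False


def tick(queue):
    """Fire every callback that was pending at the start of this tick."""
    pending = list(queue)
    queue.clear()
    for callback in pending:
        callback()


if __name__ == "__main__":
    queue = []
    drv = Driver(queue)
    drv.arm()
    tick(queue)                     # timer runs and re-arms itself
    drv.remove(cancel_timer=False)  # unload without cancelling the timer
    try:
        tick(queue)
    except RuntimeError as err:
        print("caught:", err)
```

With `cancel_timer=True` the pending callback is dropped and re-arming is refused, so a later `tick` finds nothing to run. Blocking the re-arm and cancelling the pending instance have to happen together; doing only one leaves a window in which the callback can still be queued.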
With the second patch, the kernel does not panic, but the interface is down and the driver can't detect its link, so there is no network.
Settings for eth1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: Unknown!
Duplex: Unknown! (255)
Port: FIBRE
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Link detected: no
Ok, I've sent a patch for this upstream:
http://marc.info/?l=linux-netdev&m=130384983429440&w=2
Looks like it was accepted upstream. This will get pulled back during the 6.1 update cycle.
I just merged this commit in with my branch for the bnx2 6.2 update, so I'm going to close this as a dup of the omnibus update tracking bug for bnx2.
*** This bug has been marked as a duplicate of bug 696756 ***