Description of problem:
A soft lockup is detected when the HBA cable is unplugged while I/O is running. The stack trace follows:

Mar 9 16:44:39 linux3 kernel: BUG: soft lockup detected on CPU#0!
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b
Mar 9 16:44:39 linux3 kernel: [<c044a0b7>] softlockup_tick+0x98/0xa6
Mar 9 16:44:39 linux3 kernel: [<c042cc98>] update_process_times+0x39/0x5c
Mar 9 16:44:39 linux3 kernel: [<c04176ec>] smp_apic_timer_interrupt+0x5c/0x64
Mar 9 16:44:39 linux3 kernel: [<c04049bf>] apic_timer_interrupt+0x1f/0x24
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b
Mar 9 16:44:39 linux3 kernel: [<c0415921>] smp_call_function+0x99/0xc3
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b
Mar 9 16:44:39 linux3 kernel: [<c0427b46>] on_each_cpu+0x10/0x1f
Mar 9 16:44:39 linux3 kernel: [<c046c9c3>] invalidate_bdev+0x23/0x2e
Mar 9 16:44:39 linux3 kernel: [<c0470bf8>] kill_bdev+0xd/0x20
Mar 9 16:44:39 linux3 kernel: [<c0471097>] __blkdev_put+0x3b/0x123
Mar 9 16:44:39 linux3 kernel: [<c046b623>] __fput+0x9c/0x167
Mar 9 16:44:39 linux3 kernel: [<c046910f>] filp_close+0x4e/0x54
Mar 9 16:44:39 linux3 kernel: [<c0403eff>] syscall_call+0x7/0xb

Version-Release number of selected component (if applicable):
2.6.18-8

How reproducible:
We saw this only once in the Emulex lab. We are trying to reproduce the issue.

Steps to Reproduce:
1. Storage configuration: LP11002 -- Brocade 4 gig ----- Hitachi array
2. Start I/O using dt on the sd devices of the Hitachi array.
3. While I/O is running, unplug the FC cable connected to the HBA.
4. Leave the cable unplugged for 30 seconds (dev_loss_tmo).
5. Reattach the cable.
Check the log file for the soft lockup bug report.

Actual results:
A soft lockup is detected when the HBA cable is unplugged while I/O is running. The stack trace is identical to the one given in the description above.

Expected results:
Rediscovery of all FC LUNs with no soft lockup.

Additional info:
Added to the RHEL 5.1 prioritization list.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
From Bino via email:

Andrius,
The stack trace of bug 234600 is also present in the log file of bug 240473, but other stacks are present in the bug 240473 log file as well. This could be due to the multipathing driver running in the bug 240473 test environment. There is a good probability that the root cause of these two issues is the same.
-bino
Out of runway for 5.1 - deferring to 5.2.
Bino - are you still experiencing this with 5.1 Beta? We've heard from other partners and customers that this may have gone away between 5.0 and 5.1.
We are seeing a similar issue on a PPC system with RHEL 5.1. See also Bugzilla 234600; that bugzilla might be related to this issue. The stack trace from the Power PC system follows:

smp_call_function on cpu 1: other cpus not responding (2)
1:mon> t
[c0000000c8ad3a10] c000000000070900 .on_each_cpu+0x24/0x88
[c0000000c8ad3ab0] c0000000000ee128 .invalidate_bh_lrus+0x28/0x40
[c0000000c8ad3b30] c0000000000f64b4 .kill_bdev+0x34/0x60
[c0000000c8ad3bb0] c0000000000f6e8c .__blkdev_put+0x88/0x220
[c0000000c8ad3c50] c0000000000eca1c .__fput+0x108/0x25c
[c0000000c8ad3d00] c0000000000e8fa4 .filp_close+0xac/0xd4
[c0000000c8ad3d90] c0000000000eacf4 .sys_close+0xc4/0x110
[c0000000c8ad3e30] c0000000000086a4 syscall_exit+0x0/0x40
Bugzilla number in the above comment is wrong. Please also see Bugzilla 408541.
There are enough differences that these may not be related, but note that we're seeing a soft lockup in 5.1 when trying to unmount a volume that has gone read-only after an error. I reported it in bug #429054; see that for more info. In our situation, the device (/dev/md7) is a RAID1 made up of two volumes, each coming off a different SAN array. We're using QLogic cards, though, not Emulex.
Emulex - have you tested this on the latest 5.2 Beta RC bits to see if this still persists?
Emulex has apparently found a workaround, described as follows:

"The default syslog.conf file for RHEL makes all the printk statements sync with /var/log/messages file. After I changed the syslog.conf file to not to sync the printk with /var/log/messages file, the issue was not re-produceable."

If this continues to be a problem, please re-open this bugzilla.
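
For anyone wanting to try the same workaround, here is a minimal sketch of what such a change might look like, assuming the stock sysklogd daemon on RHEL 5. In sysklogd's syslog.conf, prefixing a log file path with "-" tells syslogd to omit syncing the file after every message. The selector shown is an assumption based on a typical default configuration; the exact line Emulex edited is not recorded in this bug.

```
# /etc/syslog.conf (sketch; the selector below is an assumed typical default)

# Before: syslogd syncs /var/log/messages after every message.
#*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# After: the leading "-" omits the sync after every message.
*.info;mail.none;authpriv.none;cron.none                -/var/log/messages
```

syslogd has to re-read its configuration before the change takes effect (for example, by restarting the syslog service). The trade-off is that messages written just before a crash may not reach the disk.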