Description of Problem:

The SCSI mid layer in the 2.4.9 kernel contains an evil stack-smashing infinite recursion bug that is triggered when SCpnt->init_command() fails (which happens on Fibre-Channel controllers if the link happens to be down). The stack looks something like this:

[<f8981f29>] __scsi_end_request [scsi_mod] 0xc9
[<f8982843>] scsi_request_fn [scsi_mod] 0x2a3
[<f899d3e0>] sd_template [sd_mod] 0x0
[<f8981d47>] scsi_queue_next_request [scsi_mod] 0x67
[<f897a524>] scsi_release_command_Rsmp_f12100cc [scsi_mod] 0x124
[<f8981fcd>] __scsi_end_request [scsi_mod] 0x16d
[<f8982843>] scsi_request_fn [scsi_mod] 0x2a3
... repeated dozens of times ...
[<f897a524>] scsi_release_command_Rsmp_f12100cc [scsi_mod] 0x124
[<f8981fcd>] __scsi_end_request [scsi_mod] 0x16d
[<f8982008>] scsi_end_request_Rsmp_8a9b35ba [scsi_mod] 0x18
[<f8982496>] scsi_io_completion_Rsmp_a96c15bb [scsi_mod] 0x3b6
[<f899ada4>] rw_intr [sd_mod] 0x204
[<f8963340>] lpfc_dpc [lpfcdd] 0x0
[<f897b17d>] scsi_finish_command [scsi_mod] 0xad
[<f897aeec>] scsi_bottom_half_handler [scsi_mod] 0xbc
[<c012143d>] bh_action [kernel] 0x4d
[<c01212df>] tasklet_hi_action [kernel] 0x7f
[<c012100b>] do_softirq [kernel] 0x7b
[<c0121635>] ksoftirqd [kernel] 0xf5
[<c0105886>] kernel_thread [kernel] 0x26
[<c0121540>] ksoftirqd [kernel] 0x0

The problem is that __scsi_end_request() calls scsi_release_command(). As the trace shows, that call goes on to scsi_queue_next_request() and back into scsi_request_fn(), where the next init_command() failure calls __scsi_end_request() again, closing the loop. I note that in 2.4.18, it calls __scsi_release_command() instead, preventing the recursion.

Version-Release number of selected component (if applicable):
2.4.9-e.3, 2.4.9-e.5 and 2.4.9-e.8 are all susceptible.

How Reproducible:
100%

Steps to Reproduce:
1. Install Advanced Server. Install Emulex LP9000 cards (or LP8000), or QLogic fibre cards. Connect to a switched fabric.
2. Start a load.
3. Disable the port on the FC switch to which the server is connected.
4. Watch the fireworks :-)

Actual Results:
Infinite recursion, several task structures destroyed, eventual oops/panic/hang/crash when these data structures are accessed.

Expected Results:
The system should continue to operate normally and the I/Os to the downed link should fail.

Additional Information:
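The call cycle described above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions, not kernel code: every toy_* name is hypothetical, the command setup is assumed to fail on every request (as with the link down), and the request function is assumed to loop over the pending requests. It only shows why a release that restarts the queue nests once per failed request, while a release that merely frees the command keeps the stack flat.

```c
#include <assert.h>

/* Toy model only: all toy_* names are hypothetical, none of this is kernel code. */
struct toy_queue {
    int pending;            /* requests still waiting on the queue */
    int recursive_release;  /* 1: release restarts the queue (2.4.9-like) */
                            /* 0: release only frees the command (2.4.18-like) */
    int depth;              /* current nesting of toy_request_fn() */
    int max_depth;          /* deepest nesting observed */
};

static void toy_request_fn(struct toy_queue *q);

/* Stand-in for init_command() with the link down: always fails. */
static int toy_init_command(void) { return 0; }

/* Stand-in for __scsi_release_command(): frees the command, nothing else. */
static void toy_release_only(struct toy_queue *q) { (void)q; }

/* Stand-in for scsi_release_command(): frees, then restarts the queue. */
static void toy_release_and_requeue(struct toy_queue *q)
{
    toy_release_only(q);
    toy_request_fn(q);      /* re-enters the request function */
}

/* Stand-in for __scsi_end_request(): completes the failed request. */
static void toy_end_request(struct toy_queue *q)
{
    if (q->recursive_release)
        toy_release_and_requeue(q);
    else
        toy_release_only(q);
}

/* Stand-in for scsi_request_fn(): pulls requests until the queue is empty. */
static void toy_request_fn(struct toy_queue *q)
{
    q->depth++;
    if (q->depth > q->max_depth)
        q->max_depth = q->depth;
    while (q->pending > 0) {
        q->pending--;
        if (!toy_init_command())
            toy_end_request(q);   /* failure path taken for every request */
    }
    q->depth--;
}

/* Returns the deepest nesting reached while draining `pending` requests. */
int toy_max_depth(int pending, int recursive_release)
{
    struct toy_queue q = { pending, recursive_release, 0, 0 };
    assert(pending >= 0);
    toy_request_fn(&q);
    return q.max_depth;
}
```

Calling toy_max_depth(n, 1) nests one level per queued request, which matches the repeated frames in the trace, whereas toy_max_depth(n, 0) never nests beyond the initial call. On a small fixed-size kernel stack, the first behaviour overruns the stack once enough I/O is queued.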
ME TOO: I have Red Hat Advanced Server 2.1 with the latest errata kernel, 2.4.9-e.10, and am trying to use the "multipath" functionality with the following setup:

Red Hat Advanced Server 2.1 in a SAN
Proliant DL380 G2
245299-B21 PCI 2Gb FC adapter
A7346A HP FC 1Gb/2Gb Entry Switch 8B
Emulex LPFC (LP950) SCSI on PCI bus 01 device 18 irq 15 scsi1
Brocade switches (16 ports)
VA7410

Oops dump
---------
multipath: IO failure on sdf1, disabling IO path.
Unable to handle kernel paging request at virtual address a7fa4070
*pde = 00000000
Oops: 0002
Kernel 2.4.9-e.10custom
CPU: 0
EIP: 0010:[<c1c6a423>] Tainted: P
EFLAGS: 00010082
EIP is at ___strtok_R29805c13 [] 0x18e1d0f
eax: c12d7dfe ebx: 00000000 ecx: 000000ac edx: c1c6a418
esi: 00000018 edi: c1c6a418 ebp: c12d7630 esp: c12d7608
ds: 0018 es: 0018 ss: 0018
Process bdflush (pid: 6, stackpage=c12d7000)
Stack: c6808174 c1c60ba0 c1a38c00 c68223e0 c26401a0 c1c6a400 00000000 00000256
       00000000 00000296 c1c6a418 c68076fd c1c6a418 c1a38c00 c1c6a400 00000296
       c1c6a400 c68004a5 c1c6a418 00000000 c1c6a400 c1a38cb4 00000000 00000000
Call Trace:
[<c6808174>] scsi_request_fn [scsi_mod] 0x264
[<c68223e0>] sd_template [sd_mod] 0x0
[<c68076fd>] scsi_queue_next_request [scsi_mod] 0x3d
[<c68004a5>] scsi_release_command_Ra9b69956 [scsi_mod] 0x105
[<c6807939>] __scsi_end_request [scsi_mod] 0x179
[<c6808195>] scsi_request_fn [scsi_mod] 0x285
[<c68223e0>] sd_template [sd_mod] 0x0
[<c68076fd>] scsi_queue_next_request [scsi_mod] 0x3d
[<c68004a5>] scsi_release_command_Ra9b69956 [scsi_mod] 0x105
[<c6807939>] __scsi_end_request [scsi_mod] 0x179
[<c6808195>] scsi_request_fn [scsi_mod] 0x285
[<c68223e0>] sd_template [sd_mod] 0x0
[<c68076fd>] scsi_queue_next_request [scsi_mod] 0x3d
[<c68004a5>] scsi_release_command_Ra9b69956 [scsi_mod] 0x105
[<c6807939>] __scsi_end_request [scsi_mod] 0x179
...
We will investigate, with a goal of fixing this in the next AS2.1 kernel update.
The fix for this bug is in the pensacola tree.
An erratum has been issued which should address the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2003-368.html