Description of problem:
Reboot or shutdown fails if a device attached to an MPT controller suffers a timeout.

Version-Release number of selected component (if applicable):
Originally reported against RHEL4 U2.

How reproducible:
Always

Steps to Reproduce:
1. Ensure a broken device exists.
2. Execute reboot or shutdown.

Actual results:
Shutdown hangs:

remote(pam_unix)[6995]: session closed for user guest
Sending all processes the KILL signal...
Saving random seed:
Syncing hardware clock to system time
Turning off swap:
Turning off quotas:
Unmounting pipe file systems:
Unmounting file systems:
Please stand by while rebooting the system...
md: stopping all md devices.
md: md0 switched to read-only mode.
Synchronizing SCSI cache for disk sda:

Expected results:
Successful reboot/shutdown.

Additional info:
The Fusion MPT driver issued a SYNCHRONIZE CACHE command to the SES chip, but the chip did not return a response. In this case, the Fusion MPT driver set a timer (10 sec) and then detected the timeout, but mptscsih_timer_expired() did nothing. As a result, the driver waited forever for a response from the device.

The original patch was against Update 2, but the mpt code was shuffled in a later update and now lives in a different file. Since the original patch duplicated the same action in the else clause, the attached patch simply makes the reset code unconditional.
Created attachment 155083
Amended customer patch now applicable to 4.5 kernel.
I agree that the "if" statement is unnecessary, since the same code is executed on both paths. The HardReset approach is rather dramatic, since it resets the chip. If this works for everyone, that's fine with me. But I wonder about the other devices, if any, with a higher ID than the SES chip. With the HardReset approach, will those devices still have a SYNC CACHE command executed?

What driver version is being used? The code I've looked at doesn't issue a SYNC CACHE command unless the device is DISK_TYPE. A SES device should not have a SYNC CACHE command sent to it. Shouldn't the SES device be a Processor type, and be indicated as such in its Inquiry data?

If the timer expires, it might be better to issue a Bus Reset to the device and allow the driver to proceed to the next device, rather than use the "Big Hammer" of a chip Diagnostic Reset. Please advise what works for everyone.
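The DISK_TYPE check mentioned above comes down to reading the peripheral device type from byte 0 of the standard INQUIRY data. The following C sketch shows one way to make that decision. The helper name should_sync_cache() and the PDT_* constant names are hypothetical, and this is not the Fusion MPT driver's code; the type codes (0x00 direct-access block device, 0x03 processor, 0x0D enclosure services) and the bit layout (peripheral qualifier in bits 7-5, device type in bits 4-0) are taken from the SCSI Primary Commands (SPC) standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Peripheral device type codes from the SCSI Primary Commands (SPC) standard.
 * The PDT_* names are illustrative, not taken from the driver. */
enum {
    PDT_DISK      = 0x00, /* direct-access block device */
    PDT_PROCESSOR = 0x03, /* processor device */
    PDT_ENCLOSURE = 0x0D  /* enclosure services (SES) device */
};

/* Hypothetical helper: decide from standard INQUIRY data whether the
 * shutdown path should send SYNCHRONIZE CACHE to this device. */
static bool should_sync_cache(const uint8_t *inquiry, size_t len)
{
    assert(inquiry != NULL || len == 0);
    if (len < 1)
        return false;                    /* no INQUIRY data: be conservative */
    uint8_t qualifier = inquiry[0] >> 5; /* bits 7-5: peripheral qualifier */
    uint8_t type = inquiry[0] & 0x1F;    /* bits 4-0: peripheral device type */
    /* Qualifier 0 means a device of the reported type is connected. */
    return qualifier == 0 && type == PDT_DISK;
}
```

An SES device reporting type 0x0D is skipped by this check, but a broken disk still passes it, which is why the timeout handling itself also needs fixing.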
Hello Larry,
We detected this problem in RHEL4U2 (Fusion MPT driver 3.02.18). As you say, the same problem doesn't occur in RHEL4U3 (3.02.62.01rh) or later, because the SYNC CACHE command isn't issued to the SES chip. However, a timeout may still occur on a disk if the disk is broken, so we need to fix this problem. I think a hard reset is the easiest approach, but a device reset or bus reset would be better if possible.
Masao Fukuchi
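The reset options discussed above (device reset, bus reset, chip hard reset) can be viewed as an escalation ladder, tried from least to most disruptive. The C sketch below illustrates that policy under stated assumptions: the callbacks in struct reset_ops, the function recover_timed_out_target(), and the two stub routines are all hypothetical, not real mptscsih functions, and the stubs exist only so the example can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Recovery levels, ordered from least to most disruptive. */
enum reset_level { RESET_NONE = 0, RESET_DEVICE, RESET_BUS, RESET_HARD };

/* Hypothetical callback: returns true if the target responds again after the reset. */
typedef bool (*reset_fn)(int target_id);

struct reset_ops {
    reset_fn device_reset; /* target reset: affects only the hung device */
    reset_fn bus_reset;    /* bus reset: affects every device on that bus */
    reset_fn hard_reset;   /* chip diagnostic reset: the "big hammer" */
};

/* Try each configured reset in order and stop at the first one that works.
 * Returns the level that recovered the target, or RESET_NONE if none did. */
static enum reset_level recover_timed_out_target(const struct reset_ops *ops, int target_id)
{
    assert(ops != NULL);
    if (ops->device_reset != NULL && ops->device_reset(target_id))
        return RESET_DEVICE;
    if (ops->bus_reset != NULL && ops->bus_reset(target_id))
        return RESET_BUS;
    if (ops->hard_reset != NULL && ops->hard_reset(target_id))
        return RESET_HARD;
    return RESET_NONE;
}

/* Demo stubs standing in for real reset routines. */
static bool stub_reset_fails(int target_id) { (void)target_id; return false; }
static bool stub_reset_succeeds(int target_id) { (void)target_id; return true; }
```

The point of the ordering is that a successful device or bus reset lets the driver move on to the next device without resetting the whole chip; the hard reset remains only as a fallback.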
Created attachment 160016
Compressed tar file of LSI driver source code.
Hello Larry,
I looked at the source code you provided. I think there is no problem, because the driver will issue a hard reset for all internal commands. However, I couldn't find all the parts you revised, because I don't have the source code of 3.02.11. Could you make a patch against Fusion MPT driver 3.02.73rh (the driver for RHEL4.5) or Fusion MPT driver 3.02.99.00 (the candidate for RHEL4.6)?
Thank you.
M.Fukuchi
This event sent from IssueTracker by mmatsuya issue 121161
Fujitsu confirmed that this issue is gone in RHEL4.6, so we can change the status to MODIFIED.
-----------------------------------------
I confirmed that Fusion MPT driver 3.02.99.00rh in RHEL4.6 Beta fixes this problem. In former versions of the Fusion MPT driver, no action was taken when an internal command timed out, which caused the system to hang. Fusion MPT driver 3.02.99.00rh instead calls wake_up(), which forcibly wakes up mptscsih_do_cmd(), so the response is returned to the upper layer promptly.
Thank you.
M.Fukuchi
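The wake_up() fix described above follows a common pattern: whichever path finishes first, completion or timer expiry, records the outcome and wakes the sleeping waiter. The C sketch below models that pattern in user space with POSIX threads, where a condition variable stands in for the kernel wait queue. All names (struct internal_cmd, issue_internal_cmd(), cmd_timer_expired(), cmd_completed()) are hypothetical, and this is not the driver's actual code. In the pre-fix behaviour, the timer handler returned without setting the status or signalling, so the wait loop never exited.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical user-space model of an internal command, not the real
 * Fusion MPT structures. done_cv stands in for the kernel wait queue. */
enum cmd_status { CMD_PENDING, CMD_COMPLETED, CMD_TIMED_OUT };

struct internal_cmd {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    enum cmd_status status;
    long            timeout_ms; /* the report mentions a 10 s timer */
};

static void internal_cmd_init(struct internal_cmd *c, long timeout_ms)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->done_cv, NULL);
    c->status = CMD_PENDING;
    c->timeout_ms = timeout_ms;
}

/* Completion path: runs when the device answers. A broken device never
 * triggers it, which is the situation in this bug. */
static void cmd_completed(struct internal_cmd *c)
{
    pthread_mutex_lock(&c->lock);
    if (c->status == CMD_PENDING) {
        c->status = CMD_COMPLETED;
        pthread_cond_broadcast(&c->done_cv);
    }
    pthread_mutex_unlock(&c->lock);
}

/* Timer path (fixed behaviour): after the timeout, record the failure and
 * wake the waiter, analogous to the wake_up() call. */
static void *cmd_timer_expired(void *arg)
{
    struct internal_cmd *c = arg;
    struct timespec ts = { c->timeout_ms / 1000, (c->timeout_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
    pthread_mutex_lock(&c->lock);
    if (c->status == CMD_PENDING) { /* only if the device never answered */
        c->status = CMD_TIMED_OUT;
        pthread_cond_broadcast(&c->done_cv);
    }
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Waiter, analogous to mptscsih_do_cmd(): arm the timer, issue the command,
 * then sleep until completion or the timer wakes it. */
static enum cmd_status issue_internal_cmd(struct internal_cmd *c, bool device_responds)
{
    pthread_t timer;
    int rc = pthread_create(&timer, NULL, cmd_timer_expired, c);
    assert(rc == 0);
    (void)rc;
    if (device_responds)
        cmd_completed(c); /* healthy device: immediate reply */
    pthread_mutex_lock(&c->lock);
    while (c->status == CMD_PENDING)
        pthread_cond_wait(&c->done_cv, &c->lock);
    enum cmd_status result = c->status;
    pthread_mutex_unlock(&c->lock);
    pthread_join(timer, NULL); /* model only: kernel code would cancel the timer */
    return result;
}
```

Calling issue_internal_cmd() with device_responds set to false returns CMD_TIMED_OUT after roughly timeout_ms instead of blocking forever. On Linux, build with -pthread.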
Hi there,
As I wrote before, this issue was fixed in the 4.6 beta. Please set the proper flags and status.
Regards,