Created attachment 350884: Don't reset the device if it is healthy. Description of problem: When running kdump on a server with an aacraid device, the kdump kernel stops with: AAC0: adapter kernel failed to start, init status = 0. RHEL 5 GA works fine; 5.1 and later do not. I also tried the latest driver from 2.6.18-151 (driver version 1.1-5[2461]), but it failed too. 1. Checking the driver, I found the root cause: the aacraid driver's soft reset of the device fails. 2. When the kdump kernel starts, before the soft reset, the device status is KERNEL_UP_AND_RUNNING (0x00000080), but the MUnit.OIMR register reads 0xf7 (expected: 0x0c). 3. After aac_rx_restart_adapter() is called to soft reset the device, the device status becomes 0x0. 4. The call to aac_rx_restart_adapter() returns 0. Device info: 9005:0285:9005:0286. One way to avoid the issue is to check whether the device is working before doing a soft reset. The attached patch does this.
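The idea in the description (check the device before soft resetting it) can be sketched in user-space C as follows. This is only an illustration and not the attached patch: `struct fake_adapter`, `adapter_is_healthy()` and `should_soft_reset()` are hypothetical names, and the status word is a plain variable rather than a hardware register read. Only the KERNEL_UP_AND_RUNNING value (0x00000080) comes from the report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status bit quoted in the report: the adapter firmware kernel is running. */
#define KERNEL_UP_AND_RUNNING 0x00000080u

/* Hypothetical stand-in for the adapter. The real driver reads the
 * status from a device register; here it is just a field. */
struct fake_adapter {
    uint32_t status;
};

/* True when the adapter reports a running firmware kernel. */
bool adapter_is_healthy(const struct fake_adapter *a)
{
    return (a->status & KERNEL_UP_AND_RUNNING) != 0;
}

/* Proposed policy: only soft reset an adapter that is NOT healthy,
 * so a working device is left alone. */
bool should_soft_reset(const struct fake_adapter *a)
{
    return !adapter_is_healthy(a);
}
```

With a status of 0x00000080 the sketch skips the reset; with the 0x0 status seen after the failed reset it would request one.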
Hi, Joe. Your patch looks fine to me. I just modified it slightly: we can skip the 'if' around your 'noreset' label, because on that path 'restart' is 0, so that 'if' always fails. I have already sent the patch out with you CC'ed. Please let me know what you think. Could you also help me test it? I'm sorry, but I currently don't have a machine with a storage adapter card. Thanks.
Your patch looks good and works fine, thanks. I do not know whether changing the parameters passed to rx_sync_cmd() would let the driver reset the device successfully. I have to say that even with the patch, if kdump is configured with the "reset_devices" parameter, the kdump kernel still fails to boot, so to make it work, "reset_devices" has to be removed from the kdump config file.
(In reply to comment #2) > Your patch looks good and works fine, thanks. Thanks for testing. > I do not know whether changing the parameters passed to rx_sync_cmd() would let the driver reset the device successfully. > > I have to say that even with the patch, if kdump is configured with the "reset_devices" parameter, the kdump kernel still fails to boot, so to make it work, "reset_devices" has to be removed from the kdump config file. Hmm, I don't know how the aacraid hardware works, but it looks like aac_rx_restart_adapter() can't work when !aac_rx_check_health()?
> Hmm, I don't know how the aacraid hardware works, but it looks like > aac_rx_restart_adapter() can't work when !aac_rx_check_health()? Even when the device is healthy, aac_rx_restart_adapter() cannot reset the device, yet the function returns as if the reset succeeded. The patch combined with removing "reset_devices" is just a workaround, but it works.
(In reply to comment #4) > > Hmm, I don't know how the aacraid hardware works, but it looks like > > aac_rx_restart_adapter() can't work when !aac_rx_check_health()? > > Even when the device is healthy, aac_rx_restart_adapter() cannot reset the device, > yet the function returns as if the reset succeeded. > I know; I meant that '!aac_rx_check_health()' is true when the device is healthy. :) Hmm, I need a data sheet for the aacraid hardware to continue, since I don't understand how it works...
Joe, Can you provide any finer resolution on when the problem in this bugzilla was introduced? Thanks, Rob
(In reply to comment #6) > Joe, > > Can you provide any finer resolution on when the problem in this bugzilla was > introduced? > > Thanks, Rob According to our tests, RHEL 5 works fine, while 5.1 and later kernels do not. In RHEL 5, the driver just checks whether the device is healthy: if it is, initialization continues; otherwise the driver returns. The driver in later kernels handles the kernel parameter "reset_devices": if the kdump kernel is booted with this parameter, the driver always restarts the device. The problem is that aac_rx_restart_adapter() cannot actually restart the device, which leaves the device's control register with an incorrect value, so the driver fails and the kdump kernel hangs. Lacking further knowledge of the device, I don't know why the function leaves it in a bad state.
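To make the behaviour change described above easier to follow, here is a small user-space sketch contrasting the two init policies. It is a model of the description only, not driver code: `enum init_action`, `policy_rhel50()` and `policy_later()` are hypothetical names, and what the later driver does when "reset_devices" is absent is an assumption (the sketch falls back to the health check).

```c
#include <stdbool.h>
#include <stdint.h>

/* Status bit quoted in the original report. */
#define KERNEL_UP_AND_RUNNING 0x00000080u

/* Hypothetical outcomes of the driver's init-time decision. */
enum init_action {
    INIT_CONTINUE,      /* device looks healthy, carry on */
    INIT_ABORT,         /* device not healthy, give up */
    INIT_RESTART_FIRST  /* soft reset the device before continuing */
};

/* RHEL 5.0 behaviour as described: health check only. */
enum init_action policy_rhel50(uint32_t status)
{
    return (status & KERNEL_UP_AND_RUNNING) ? INIT_CONTINUE : INIT_ABORT;
}

/* 5.1+ behaviour as described: "reset_devices" forces a restart
 * regardless of health. The fallback branch is an assumption. */
enum init_action policy_later(uint32_t status, bool reset_devices)
{
    if (reset_devices)
        return INIT_RESTART_FIRST;
    return policy_rhel50(status);
}
```

In this model a healthy adapter (status 0x00000080) booted with "reset_devices" still takes the restart path, which is the path reported to fail on this hardware.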
What type of aacraid adapter was this? What is the host model? If the aacraid adapter is not embedded, what is the adapter type?
(In reply to comment #8) > What type of aacraid adapter was this? What is the host model? If the aacraid > adapter is not embedded, what is the adapter type? I could not get further details of the device, but lspci reports these IDs: 9005:0285:9005:0286
A modified patch was posted to linux-scsi mailing list: http://marc.info/?l=linux-kernel&m=124650190624659&w=2 This patch is reported to address the issue. Can someone at HCL determine whether the workaround in the linux-scsi email is correct, or offer a preferred solution and acknowledge this on the linux-scsi mailing list?
*** Bug 605714 has been marked as a duplicate of this bug. ***