Bug 510167 - Kdump kernel failed to reset aacraid device
Summary: Kdump kernel failed to reset aacraid device
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Rob Evers
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 605714
Depends On:
Blocks:
Reported: 2009-07-08 01:46 UTC by Joe Jin
Modified: 2011-06-01 04:52 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-01 04:51:42 UTC
Target Upstream Version:
Embargoed:


Attachments
Dont reset device if it healthy. (800 bytes, patch)
2009-07-08 01:46 UTC, Joe Jin

Description Joe Jin 2009-07-08 01:46:21 UTC
Created attachment 350884
Dont reset device if it healthy.

Description of problem:
  When trying kdump on a server with an aacraid device, the kdump kernel stops with:
  AAC0: adapter kernel failed to start, init status = 0. 

  EL5 GA worked fine; 5.1 and later do not work.
  I tried the latest driver from 2.6.18-151 (driver version 1.1-5[2461]), but it failed too.
  1. Checking the driver, I found the root cause: the aacraid driver's soft reset of the device fails.
  2. When the kdump kernel boots, before the soft reset, the device status
     is KERNEL_UP_AND_RUNNING (0x00000080), but register MUnit.OIMR
     reads 0xf7 (expected: 0x0c).
  3. After aac_rx_restart_adapter() is called to soft reset the device, the device's status
     becomes 0x0.
  4. The call to aac_rx_restart_adapter() returns 0.
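
To make the register mismatch in step 2 concrete, here is a small sketch of the arithmetic. It assumes the expected value 0x0c is tested as a bit mask; the report only states the expected value, so the mask interpretation, the macro name and the helper name are hypothetical and not taken from the driver.

```c
#include <stdint.h>

/* "Expect: 0x0c" from the report. Treating it as a mask is an assumption. */
#define OIMR_EXPECTED_BITS 0x0cu

/* Hypothetical helper: returns 1 if every expected bit is set in the value
 * read from the register, 0 otherwise. */
int oimr_has_expected_bits(uint8_t oimr)
{
    return (oimr & OIMR_EXPECTED_BITS) == OIMR_EXPECTED_BITS;
}
```

With the value from the report, 0xf7 & 0x0c is 0x04, so only one of the two expected bits is set. A plain equality test (0xf7 against 0x0c) would report a mismatch as well.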

  device info: 9005:0285:9005:0286

  One idea to avoid the issue: before soft resetting the device, check whether it is already working.

  The patch is attached.
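
The idea can be illustrated with a standalone mock. This is not the attached patch and not driver code: struct mock_adapter and all mock_* functions are hypothetical, and only the KERNEL_UP_AND_RUNNING value and the "restart returns 0 but status becomes 0x0" behaviour come from the report above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status bit quoted in the report. */
#define KERNEL_UP_AND_RUNNING 0x00000080u

/* Hypothetical stand-in for the adapter; not the driver's real structure. */
struct mock_adapter {
    uint32_t status;      /* firmware status word */
    int      reset_works; /* 0 models the failure described in this bug */
};

/* Returns 0 when the adapter reports itself up and running, -1 otherwise. */
int mock_check_health(const struct mock_adapter *dev)
{
    assert(dev != NULL);
    return (dev->status & KERNEL_UP_AND_RUNNING) ? 0 : -1;
}

/* Models the reported behaviour: the call returns 0 (success), but on the
 * affected hardware the status word ends up as 0x0. */
int mock_restart_adapter(struct mock_adapter *dev)
{
    assert(dev != NULL);
    dev->status = dev->reset_works ? KERNEL_UP_AND_RUNNING : 0x0u;
    return 0;
}

/* Proposed idea: soft reset only when the adapter is not already healthy.
 * Returns 0 if the adapter is running afterwards, -1 otherwise. */
int mock_init_skip_reset_if_healthy(struct mock_adapter *dev)
{
    if (mock_check_health(dev) != 0)
        mock_restart_adapter(dev);
    return mock_check_health(dev);
}

/* For comparison: always soft reset before continuing. */
int mock_init_always_reset(struct mock_adapter *dev)
{
    mock_restart_adapter(dev);
    return mock_check_health(dev);
}
```

Starting from a healthy adapter whose reset does not work, the first init path leaves the status untouched and succeeds, while the second ends with status 0x0 and fails, which matches the symptom described above.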

Comment 1 Cong Wang 2009-07-17 04:47:10 UTC
Hi, Joe.

Your patch looks fine to me; I just modified it slightly:

We can skip the 'if' around your 'noreset' label, because on that path 'restart' is 0, so the 'if' always fails.

I already sent the patch out with you CC'ed. Please let me know what you think.

And can you help me test it? I am sorry, but I currently don't have a machine with a storage adapter card.

Thanks.

Comment 2 Joe Jin 2009-07-17 05:47:24 UTC
Your patch looked good and worked fine, thanks.

I don't know whether changing some of rx_sync_cmd()'s parameters would let the
driver reset the device successfully.

I have to say that even with the patch, if kdump is configured with the "reset_devices"
parameter, the kdump kernel still fails to boot, so to make it work, "reset_devices"
has to be removed from the kdump config file.

Comment 3 Cong Wang 2009-07-17 08:31:01 UTC
(In reply to comment #2)
> Your patch looked good and worked fine, thanks.
> 

Thanks for testing.

> I don't know whether changing some of rx_sync_cmd()'s parameters would let the
> driver reset the device successfully.
> 
> I have to say that even with the patch, if kdump is configured with the
> "reset_devices" parameter, the kdump kernel still fails to boot, so to make
> it work, "reset_devices" has to be removed from the kdump config file.

Hmm, I don't know how the aacraid hardware works, but it looks like aac_rx_restart_adapter() can't work when !aac_rx_check_health()?

Comment 4 Joe Jin 2009-07-17 14:18:49 UTC
> Hmm, I don't know how the aacraid hardware works, but it looks like
> aac_rx_restart_adapter() can't work when !aac_rx_check_health()?  

Even when the device is healthy, aac_rx_restart_adapter() cannot reset the device,
yet the function returns success.

The patch combined with removing "reset_devices" is just a workaround, but it works.

Comment 5 Cong Wang 2009-07-20 05:31:27 UTC
(In reply to comment #4)
> > Hmm, I don't know how the aacraid hardware works, but it looks like
> > aac_rx_restart_adapter() can't work when !aac_rx_check_health()?  
> 
> Even when the device is healthy, aac_rx_restart_adapter() cannot reset the device,
> yet the function returns success.
> 

I know; I meant that '!aac_rx_check_health()' is true when the device is healthy. :)

Hmm, I need a data sheet for the aacraid hardware to continue, since I don't understand how it works...

Comment 6 Rob Evers 2009-08-20 17:32:00 UTC
Joe,

Can you provide any finer resolution on when the problem in this bugzilla was introduced?

Thanks, Rob

Comment 7 Joe Jin 2009-08-21 00:05:39 UTC
(In reply to comment #6)
> Joe,
> 
> Can you provide any finer resolution on when the problem in this bugzilla was
> introduced?
> 
> Thanks, Rob  

According to our tests, RHEL5 works fine; 5.1 and later kernels do not.
In RHEL5, the driver just checked whether the device was healthy: if so, it went
on; otherwise it returned. The driver in later kernels handles the kernel
parameter "reset_devices": if the kdump kernel has that parameter, the driver
always restarts the device. The problem is that aac_rx_restart_adapter() cannot
restart the device at all, which leaves the device's control register with an
incorrect value, so the driver fails and the kdump kernel hangs.

Since I have no further knowledge of the device, I don't know why the function
leaves the device in a bad state.
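
The difference between the two behaviours described in this comment can be written down as a small decision sketch. All names here are hypothetical and the functions model only what the comment says. The branch without "reset_devices" in later kernels is an assumption, because the comment does not cover it, and the MUnit.OIMR condition from the description is not modelled.

```c
#include <assert.h>

enum init_result { INIT_OK = 0, INIT_FAILED = -1 };

/* RHEL5 GA behaviour as described: health check only, never a reset. */
enum init_result ga_style_init(int healthy)
{
    return healthy ? INIT_OK : INIT_FAILED;
}

/*
 * Later behaviour as described: with reset_devices set, the adapter is
 * always restarted. reset_leaves_adapter_usable models whether the restart
 * actually works; 0 corresponds to the failure reported in this bug.
 * Without reset_devices the sketch falls back to the health check, which is
 * an assumption and a simplification.
 */
enum init_result later_style_init(int healthy, int reset_devices,
                                  int reset_leaves_adapter_usable)
{
    assert(healthy == 0 || healthy == 1);
    if (reset_devices)
        healthy = reset_leaves_adapter_usable;
    return healthy ? INIT_OK : INIT_FAILED;
}
```

For a healthy adapter, ga_style_init(1) succeeds, while later_style_init(1, 1, 0) fails because the forced restart leaves the adapter unusable.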

Comment 8 Rob Evers 2009-08-31 11:30:31 UTC
What type of aacraid adapter was this?  What is the host model?  If the aacraid adapter is not embedded, what is the adapter type?

Comment 9 Joe Jin 2009-08-31 13:57:04 UTC
(In reply to comment #8)
> What type of aacraid adapter was this?  What is the host model?  If the aacraid
> adapter is not embedded, what is the adapter type?  


I could not get the details of the device, but I got the device info via lspci:
  device info: 9005:0285:9005:0286

Comment 10 Rob Evers 2009-11-25 16:06:21 UTC
A modified patch was posted to the linux-scsi mailing list:

http://marc.info/?l=linux-kernel&m=124650190624659&w=2

This patch is reported to address the issue.

Can someone at HCL determine whether the workaround in the linux-scsi email is correct, or offer a preferred solution and acknowledge this on the linux-scsi mailing list?

Comment 12 Rob Evers 2010-11-16 14:25:51 UTC
*** Bug 605714 has been marked as a duplicate of this bug. ***

