Bug 453472

Summary: [aacraid] aac_srb: aac_fib_send failed with status 8195
Product: Red Hat Enterprise Linux 5
Reporter: M Smith <msmith>
Component: kernel
Assignee: Tomas Henzl <thenzl>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: urgent
Priority: urgent
Version: 5.2
CC: aacraid, achim_leubner, andriusb, bmr, cfairchild, coughlan, dhoward, dhuff, drees76, emcnabb, jpirko, jtluka, karen.skweres, kevin, pep, sandy.garza, tao
Target Milestone: rc
Keywords: OtherQA, Regression, ZStream
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 19:41:13 UTC
Bug Blocks: 391501, 429709, 466885    
Attachments:
Boot log 5.2 PAE kernel
SOS report for 5.1
Boot log 2.6.18-109.el5
Boot log 2.6.18-109.el5PAE
patch to 1.1.5.2453
patch to 1.1.5.2455
boot log 2453 patch
boot log 2455 patch
patch 2453 divided into 5 parts
this patch removes the AAC_QUIRK_SCSI_32

Description M Smith 2008-06-30 20:36:47 UTC
Description of problem:

   After upgrading from 5.1 to 5.2, the system no longer boots with the PAE kernel.

Version-Release number of selected component (if applicable):

 kernel-PAE-2.6.18-92.1.1.el5.i686
 kernel-PAE-2.6.18-92.1.6.el5.i686

How reproducible:

Every boot into the PAE kernel


Steps to Reproduce:
1. Reboot the system.
2. Boot the PAE kernel.

  
Actual results:

1. The system hangs, with the error message "aac_srb: aac_fib_send failed with status 8195" scrolling constantly.


Expected results:

1. normal boot process

Additional info:

The symptoms appear only with the PAE kernel.

Adaptec 2210 RAID

Comment 1 Prarit Bhargava 2008-07-02 12:51:52 UTC
Hi "M Smith" -- could you attach a sysreport from 5.1, and a boot log of the 5.2
kernel to this issue?

Thanks,

P.

Comment 2 M Smith 2008-07-08 17:58:06 UTC
Created attachment 311301 [details]
Boot log 5.2 PAE kernel

Comment 3 M Smith 2008-07-08 17:58:57 UTC
Created attachment 311302 [details]
SOS report for 5.1

Comment 4 Guy Streeter 2008-07-31 18:15:02 UTC
My customer reports that kernel-PAE-2.6.18-53.1.13.el5 boots fine, but
kernel-PAE-2.6.18-92.el5 fails this way.

Comment 5 RHEL Program Management 2008-07-31 18:15:43 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 6 M Smith 2008-08-06 13:04:19 UTC
Updating to kernel-PAE-2.6.18-92.1.10.el5 does not help; it still exhibits the same error condition.

The non-PAE kernel-2.6.18-92.1.10.el5 does not.

Comment 7 Sandy Garza 2008-09-05 19:24:00 UTC
HP uses this driver in some of its servers and will be affected.

Comment 8 Tomas Henzl 2008-09-10 14:27:28 UTC
Hi All,
I wasn't able to reproduce the behaviour on my test box.
Please test the patched kernel here -
http://people.redhat.com/thenzl/bz453472/2455/
The patch applied here updates the Adaptec aacraid driver from version 1.1-5[2453]
to version 1.1-5[2455].

Comment 9 M Smith 2008-09-10 15:34:27 UTC
Greetings all.

The system halts with a kernel panic with each of these kernels.

Attaching the boot log for each.

Comment 10 M Smith 2008-09-10 15:35:28 UTC
Created attachment 316322 [details]
Boot log 2.6.18-109.el5

Comment 11 M Smith 2008-09-10 15:36:25 UTC
Created attachment 316323 [details]
Boot log  2.6.18-109.el5PAE

Comment 12 Tomas Henzl 2008-09-11 13:27:53 UTC
Created attachment 316437 [details]
patch to 1.1.5.2453

Hi "M Smith",
I'm sorry; I tested it on two different boxes before posting and saw no issues. That kernel also includes many other patches that have not been tested yet, so a failure like this can happen; it looks to me like it is unrelated to this bz.
I still could not reproduce the problem, so I need further help from you.
RHEL5.1 has driver version 1.1.5.2437, RHEL5.2 has 1.1.5.2453, and RHEL5.3 is planned to get version 1.1.5.2455. I'll post both patches here so you can test the patch planned for RHEL5.3 and, if that does not help, test again without the 1.1.5.2453 patch.
Thanks

Comment 13 Tomas Henzl 2008-09-11 13:28:39 UTC
Created attachment 316438 [details]
patch to 1.1.5.2455

Comment 14 M Smith 2008-09-11 18:30:34 UTC
Hi Tomas,

I'm sorry but I can't quite figure out what you want me to do.

What should I test now?

Are you asking that I compile a kernel with each version of the driver and test each one?

Do I take that patch to 1.1.5.2455 in Comment #13 and use that with the kernel that I downloaded in Comment #8?


Thanks

Comment 15 Tomas Henzl 2008-09-12 09:13:06 UTC
(In reply to comment #14)
> Hi Tomas,
> 
> I'm sorry but I can't quite figure out what you want me to do.
> 
> What should I test now?
> 
> As you asking that I compile a kernel with each version of the driver and test
> each one?

Yes, if you can; I thought it would be more flexible this way.
Please start with the kernel sources for 5.1 and patch them with 1.1.5.2455.
If this does not help, then please remove the 1.1.5.2453 patch and test.
If you have difficulties compiling the kernel, tell me and I'll do it.
Thanks

Comment 16 M Smith 2008-09-19 19:50:11 UTC
Created two test kernels as follows:

- Used source from pub/redhat/linux/enterprise/5Server/en/os/SRPMS:

kernel-2.6.18-53.1.21.el5.src.rpm

- rpm -ivh kernel*.src.rpm

- Copied off the two patch files from the bugzilla ticket: 
aacraid-1.1.5.2453.patch
aacraid-1.1.5.2455.patch

- edit /usr/src/redhat/SPECS/kernel-2.6.spec:

change the buildid macro: 
%define buildid .2455

Add in first just the 2453 patch, then both that and the 2455 patch. 

- Build kernel rpms: 

rpmbuild -bb --target=i686 --with pae kernel-2.6.spec


Both kernels fail with the same aacraid error as before.

Bootup logs to be attached.


Thanks

Comment 17 M Smith 2008-09-19 19:51:20 UTC
Created attachment 317232 [details]
boot log 2453 patch

Comment 18 M Smith 2008-09-19 19:51:53 UTC
Created attachment 317233 [details]
boot log 2455 patch

Comment 23 Tomas Henzl 2008-09-29 09:04:04 UTC
(In reply to comment #16)
> Created two test kernels as follows:
> Bootup logs to be attached.

Thanks. I've divided the 2453 patch into four parts, but before we begin with this, could you please update the firmware in the RAID controller?
We have a report from a customer with a similar problem where this fixed the issue.
Tomas

Comment 24 M Smith 2008-09-29 21:31:25 UTC
Hello Tomas.

I have flashed the BIOS in the 2200S RAID Controller to version 4.2-0[8205].

The same symptoms present themselves when restarting the system with this update applied.


Michael

Comment 25 Tomas Henzl 2008-09-30 15:26:51 UTC
Created attachment 318091 [details]
patch 2453 divided into 5 parts

Michael,
I divided the 2453 patch into a few parts. Please start by compiling the kernel without any of them and without the 2453 and 2455 patches. This should bring us back to the 5.1 kernel, and it should work fine.
Then add the p(x).patch files one at a time, in order, to see which patch causes the problem.
Thanks,
Tomas

Comment 26 David Rees 2008-09-30 23:27:39 UTC
Michael, I would expect that removing the AAC_QUIRK_SCSI_32 quirk, which was added in the latest driver, for your card would also fix the issue. In the patch series Tomas has posted, that change corresponds to patch 5, so I'd expect patch 5 to be the one that breaks your system.

See this post on LKML for another similar issue:

http://marc.info/?l=linux-kernel&m=122166454808377&w=2

FWIW, I'm seeing this on one of two nearly identical machines running CentOS 4 and kernel-smp-2.6.9-78.0.1.EL.i686. The difference between the two machines is that one has 2GB of RAM and the other 4GB; the one with 4GB does not boot.

Both machines have an Adaptec 2120S (Crusader), which also has AAC_QUIRK_SCSI_32 turned on in the patch Tomas posted in Comment #25.

The same bug is also filed for Fedora 9 under Bug #450444.

Comment 27 M Smith 2008-10-02 20:27:21 UTC
Tomas, David

We have done patches 1-4 so far and the system starts as expected.

Awaiting completion of patch 5 to test it.

Michael

Comment 28 Tomas Henzl 2008-10-03 12:02:46 UTC
Created attachment 319352 [details]
this patch removes the AAC_QUIRK_SCSI_32

(In reply to comment #27)
> Tomas, David
> 
> We have done patches 1-4 so far and the system starts as expected.
> 
> Awaiting for completion of patch 5 to test that.
> 
> Michael
If everything goes as expected, the kernel with the fifth patch will fail.

Then please try the attached patch on top of the 2453 patch (RHEL5.2). It removes AAC_QUIRK_SCSI_32 for the 2120S and 2200S controllers, as suggested by David.
Thanks, Tomas

Comment 29 RHEL Program Management 2008-10-03 12:15:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 30 M Smith 2008-10-03 21:15:43 UTC
(In reply to comment #28)
> Created an attachment (id=319352) [details]
> this patch removes the AAC_QUIRK_SCSI_32
> (In reply to comment #27)
> > Tomas, David
> > 
> > We have done patches 1-4 so far and the system starts as expected.
> > 
> > Awaiting for completion of patch 5 to test that.
> > 
> > Michael
> If everything is going as expected kernel with fifth patch will fail.
> Try then please the attached patch on top of the 2453 patch (RHEL5.2). It
> removes the AAC_QUIRK_SCSI_32 for the 2120S and 2200S controller as suggested
> by David.
> Thanks, Tomas

Patch five does indeed fail.

Applying the patch that removes AAC_QUIRK_SCSI_32 on top of the 2453 patch produces the desired result.

We will leave the system running this patched kernel for the time being.

Thanks,

Michael

Comment 33 Tomas Henzl 2008-10-13 14:49:54 UTC
Posted today on rhkl.

Comment 35 Don Zickus 2008-10-20 15:12:47 UTC
in kernel-2.6.18-120.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 38 M Smith 2008-10-20 20:21:09 UTC
(In reply to comment #35)
> in kernel-2.6.18-120.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

I have installed both of these test kernels:

2.6.18-120.el5
2.6.18-120.el5PAE

The system starts normally with both and is currently running the latter; I plan to leave this as the active kernel for now.

Comment 41 Issue Tracker 2008-12-09 15:49:53 UTC
Hello streeter,

The customer tried the test kernel and no longer gets the "aac_srb: aac_fib_send
failed with status: 8195" errors. However, dmesg still shows the
following messages:

Dec  4 10:27:32 web kernel: scsi 0:2:13:0: timing out command, waited 22s
Dec  4 10:27:32 web kernel: scsi 0:2:14:0: timing out command, waited 22s
Dec  4 16:21:21 web kernel: scsi 0:2:12:0: timing out command, waited 22s
Dec  4 16:21:21 web kernel: scsi 0:2:13:0: timing out command, waited 22s
Dec  4 16:21:21 web kernel: scsi 0:2:14:0: timing out command, waited 22s

Thanks.

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by streeter 
 issue 227174

Comment 44 errata-xmlrpc 2009-01-20 19:41:13 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html