Bug 624713

Summary: [RHEL4] Problems with aacraid - File system going into read-only.
Product: Red Hat Enterprise Linux 4 Reporter: Bryn M. Reeves <bmr>
Component: kernel Assignee: Rob Evers <revers>
Status: CLOSED ERRATA QA Contact: Storage QE <storage-qe>
Severity: high Docs Contact:
Priority: high    
Version: 4.8 CC: andriusb, coughlan, cward, djeffery, fbijlsma, jwest, revers, ServeRAIDDriver, sschaefer, syeghiay, tao
Target Milestone: rc Keywords: OtherQA
Target Release: 4.9   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 523920 Environment:
Last Closed: 2011-02-16 15:31:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 626414    
Attachments:
Description Flags
aacraid 24551 patch for RHEL4U8 none

Description Bryn M. Reeves 2010-08-17 14:42:27 UTC
+++ This bug was initially created as a clone of Bug #523920 +++

Description of problem:
File system is going into read-only mode.
Version-Release number of selected component (if applicable):


How reproducible:

There are no specific steps for reproducing this issue. It depends on the IBM server type and on how frequently aacraid management commands exit without getting a response from the aacraid firmware.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
File system is going into read-only

Expected results:
File system should not go into read-only

Additional info:

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 17713050
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 21904786
Buffer I/O error on device dm-1, logical block 428037
lost page write due to I/O error on dm-1
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 19059346
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 20331882
Buffer I/O error on device dm-1, logical block 231424
lost page write due to I/O error on dm-1
aacraid: Host adapter reset request. SCSI hang ?
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1768954
Buffer I/O error on device dm-0, logical block 8210
lost page write due to I/O error on dm-0
ReiserFS: dm-0: warning: journal-837: IO error during journal replay
REISERFS: abort (device dm-0): Write error while updating journal header in flush_journal_list
REISERFS: Aborting journal for filesystem on dm-0
REISERFS: abort (device dm-1): Journal write error in flush_commit_list
REISERFS: Aborting journal for filesystem on dm-1

[0:0:0:0]    disk    ServeRA  A                V1.0  /dev/sda

and

Jun 10 04:15:14 ahost kernel: end_request: I/O error, dev sda, sector 8884755
Jun 10 04:15:14 ahost kernel: SCSI error : <0 0 0 0> return code = 0x70000
Jun 10 04:15:14 ahost kernel: end_request: I/O error, dev sda, sector 8884603
Jun 10 04:15:14 ahost kernel: SCSI error : <0 0 0 0> return code = 0x70000
Jun 10 04:15:14 ahost kernel: end_request: I/O error, dev sda, sector 8884787
Jun 10 04:15:14 ahost kernel: REISERFS: abort (device dm-0): Write error while pushing transaction to disk in flush_journal_list

[0:0:0:0]    disk    ServeRA  ARRAYA           V1.0  /dev/sda
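
The two return codes in the logs above (0x6000000 and 0x70000) are easier to compare once split into their component bytes. The following is a minimal sketch, assuming the usual Linux SCSI mid-layer layout of the result word (driver byte in bits 24-31, host byte in bits 16-23, message byte in bits 8-15, status in bits 0-7). The struct and function names are made up for illustration and are not kernel APIs. Only the few code names relevant here are listed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The four bytes packed into a SCSI mid-layer result word (assumed layout). */
struct scsi_result {
    uint8_t status;  /* bits 0-7:   raw SCSI status     */
    uint8_t msg;     /* bits 8-15:  message byte        */
    uint8_t host;    /* bits 16-23: host (adapter) byte */
    uint8_t driver;  /* bits 24-31: driver byte         */
};

/* Split a 32-bit result word into its four bytes. */
void scsi_result_decode(uint32_t result, struct scsi_result *out)
{
    assert(out != NULL);
    out->status = (uint8_t)(result & 0xff);
    out->msg    = (uint8_t)((result >> 8) & 0xff);
    out->host   = (uint8_t)((result >> 16) & 0xff);
    out->driver = (uint8_t)((result >> 24) & 0xff);
}

/* Names for the host-byte values of interest; everything else is "other". */
const char *host_byte_name(uint8_t host)
{
    switch (host) {
    case 0x00: return "DID_OK";
    case 0x03: return "DID_TIME_OUT";
    case 0x07: return "DID_ERROR";
    default:   return "other";
    }
}

/* Names for the driver-byte values of interest; everything else is "other". */
const char *driver_byte_name(uint8_t driver)
{
    switch (driver) {
    case 0x00: return "DRIVER_OK";
    case 0x06: return "DRIVER_TIMEOUT";
    default:   return "other";
    }
}

/* Print one decoded result word on a single line. */
void scsi_result_print(uint32_t result)
{
    struct scsi_result r;

    scsi_result_decode(result, &r);
    printf("0x%08x: driver=0x%02x (%s) host=0x%02x (%s) msg=0x%02x status=0x%02x\n",
           (unsigned)result,
           (unsigned)r.driver, driver_byte_name(r.driver),
           (unsigned)r.host, host_byte_name(r.host),
           (unsigned)r.msg, (unsigned)r.status);
}
```

Under that layout, scsi_result_print(0x6000000) reports a driver byte of 0x06 with the other bytes zero, and scsi_result_print(0x70000) reports a host byte of 0x07 with the other bytes zero. The first set of logs therefore looks like command timeouts, while the second is an error reported by the host adapter driver itself.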

Comment 1 Bryn M. Reeves 2010-08-17 14:44:08 UTC
This is the RHEL4 version of bug 523920. Only the memory leak as discussed in the RHEL5 bug is relevant here:

Issue:3
--------
       The driver tends not to free the memory (FIB) when a management
request exits prematurely. The accumulation of such unfreed memory causes the
driver to fail to allocate any more memory (FIBs), so it returns the value
0x70000 to the upper layer, which puts the file system into read-only mode.

Fix details:
-------------
     The fix makes sure to free the memory (FIB) even if the request exits
prematurely, ensuring the driver won't run out of memory (FIBs).
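
The leak and the fix described above can be illustrated with a small model. This is only a sketch under stated assumptions: the pool size, the struct layout and every function name here (fib_alloc, fib_free, mgmt_request) are hypothetical and are not taken from the aacraid source. It also does not model the firmware completing a request after the driver has given up on it, which a real fix has to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4  /* hypothetical; chosen small so exhaustion is easy to see */

struct fib {
    bool in_use;
};

struct fib_pool {
    struct fib fibs[POOL_SIZE];
};

/* Hand out the first free FIB, or NULL when the pool is exhausted. */
struct fib *fib_alloc(struct fib_pool *pool)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool->fibs[i].in_use) {
            pool->fibs[i].in_use = true;
            return &pool->fibs[i];
        }
    }
    return NULL;
}

/* Return a FIB to the pool. */
void fib_free(struct fib *fib)
{
    assert(fib != NULL);
    fib->in_use = false;
}

/* Count FIBs currently held. */
size_t fibs_in_use(const struct fib_pool *pool)
{
    size_t n = 0;

    for (size_t i = 0; i < POOL_SIZE; i++)
        if (pool->fibs[i].in_use)
            n++;
    return n;
}

/*
 * Simulated management request.
 *   answered           - whether the firmware responded in time
 *   free_on_early_exit - false models the leaky behaviour, true models the fix
 * Returns 0 on success, -1 on premature exit, -2 if no FIB could be allocated.
 */
int mgmt_request(struct fib_pool *pool, bool answered, bool free_on_early_exit)
{
    struct fib *fib = fib_alloc(pool);

    if (fib == NULL)
        return -2;               /* pool exhausted: caller sees an error */

    if (!answered) {
        if (free_on_early_exit)
            fib_free(fib);       /* the fix: release on the early-exit path */
        return -1;               /* without the fix the FIB is never returned */
    }

    fib_free(fib);               /* normal completion always releases */
    return 0;
}
```

With free_on_early_exit set to false, POOL_SIZE unanswered requests are enough to make every later request fail with -2, which corresponds to the allocation failure described in the issue text. With it set to true, any number of unanswered requests leaves the pool empty and later requests still succeed.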

Comment 2 Bryn M. Reeves 2010-08-17 14:49:32 UTC
This was accepted upstream in 2.6.33:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=cacb6dc3d7fea751879a225c15e48228415e6359

The patch doesn't apply directly to the current EL4 aacraid:

$ diffstat /tmp/aacraid-fix-leak.patch
aachba.c   |   52 +++++++++++++++++++++++++++++++++-----------
aacraid.h  |    5 +++-
commctrl.c |   28 +++++++++++------------
comminit.c |    6 ++++-
commsup.c  |   72 +++++++++++++++++++++++++++++++++++++++++++++++++++----------
dpcsup.c   |   36 +++++++++++++++++++++++++-----
6 files changed, 154 insertions(+), 45 deletions(-)

It does not apply cleanly to the current RHEL4 aacraid, either forward or reversed:

$ patch -p1 < /tmp/aacraid-fix-leak.patch
patching file drivers/scsi/aacraid/aachba.c
Hunk #1 succeeded at 266 (offset -27 lines).
Hunk #2 succeeded at 286 (offset -27 lines).
Hunk #3 succeeded at 336 (offset -27 lines).
Hunk #6 succeeded at 1480 (offset -10 lines).
Hunk #7 succeeded at 1639 (offset -16 lines).
Hunk #8 succeeded at 1719 (offset -16 lines).
patching file drivers/scsi/aacraid/aacraid.h
Hunk #1 FAILED at 12.
Hunk #2 FAILED at 1036.
2 out of 2 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/aacraid.h.rej
patching file drivers/scsi/aacraid/commctrl.c
Hunk #1 succeeded at 142 (offset -11 lines).
Hunk #2 succeeded at 309 (offset -13 lines).
Hunk #3 FAILED at 593.
Hunk #4 FAILED at 645.
Hunk #5 FAILED at 695.
Hunk #6 FAILED at 734.
Hunk #7 succeeded at 727 (offset -45 lines).
Hunk #8 succeeded at 765 (offset -45 lines).
Hunk #9 succeeded at 803 (offset -45 lines).
4 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commctrl.c.rej
patching file drivers/scsi/aacraid/comminit.c
Hunk #1 succeeded at 202 (offset 8 lines).
Hunk #2 succeeded at 314 (offset 8 lines).
patching file drivers/scsi/aacraid/commsup.c
Hunk #1 succeeded at 192 (offset 3 lines).
Hunk #2 succeeded at 400 (offset 3 lines).
Hunk #3 succeeded at 483 (offset 3 lines).
Hunk #4 FAILED at 547.
Hunk #5 succeeded at 721 (offset 1 line).
Hunk #6 succeeded at 742 (offset 1 line).
Hunk #7 succeeded at 1393 (offset -1 lines).
Hunk #8 succeeded at 1793 (offset -8 lines).
Hunk #9 succeeded at 1804 (offset -8 lines).
1 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commsup.c.rej
patching file drivers/scsi/aacraid/dpcsup.c
[breeves@breeves rhel4]$ patch -R -p1 < /tmp/aacraid-fix-leak.patch
patching file drivers/scsi/aacraid/aachba.c
Hunk #1 succeeded at 266 (offset -27 lines).
Hunk #2 succeeded at 283 (offset -27 lines).
Hunk #3 succeeded at 328 (offset -27 lines).
Hunk #6 succeeded at 1460 (offset -10 lines).
Hunk #7 succeeded at 1617 (offset -16 lines).
Hunk #8 succeeded at 1696 (offset -16 lines).
patching file drivers/scsi/aacraid/aacraid.h
Hunk #1 FAILED at 12.
Hunk #2 FAILED at 1036.
2 out of 2 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/aacraid.h.rej
patching file drivers/scsi/aacraid/commctrl.c
Hunk #1 succeeded at 142 (offset -11 lines).
Hunk #2 succeeded at 309 (offset -13 lines).
Hunk #3 FAILED at 593.
Hunk #4 FAILED at 645.
Hunk #5 FAILED at 695.
Hunk #6 FAILED at 734.
Hunk #7 succeeded at 727 (offset -45 lines).
Hunk #8 succeeded at 765 (offset -45 lines).
Hunk #9 succeeded at 803 (offset -45 lines).
4 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commctrl.c.rej
patching file drivers/scsi/aacraid/comminit.c
Hunk #1 succeeded at 202 (offset 8 lines).
Hunk #2 succeeded at 312 (offset 8 lines).
patching file drivers/scsi/aacraid/commsup.c
Hunk #1 succeeded at 192 (offset 3 lines).
Hunk #2 succeeded at 393 (offset 3 lines).
Hunk #3 succeeded at 474 (offset 3 lines).
Hunk #4 FAILED at 516.
Hunk #7 succeeded at 1354 (offset -2 lines).
Hunk #8 succeeded at 1751 (offset -9 lines).
Hunk #9 succeeded at 1761 (offset -9 lines).
1 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commsup.c.rej
patching file drivers/scsi/aacraid/dpcsup.c

Comment 3 serveraid 2010-08-18 14:23:25 UTC
The patch submitted earlier was for RHEL-5 base kernels 

>>> Regarding Patch for RHEL-4 base kernels 

 According to the RHEL4U8 aacraid driver source, the version of the aacraid driver is 2455.

Earlier we submitted patch-2461 and, on top of that, patch-24702 to the RHEL-5 base kernels, but we haven't submitted patch-2461 or patch-24702 to the RHEL-4 base kernels.

We plan to submit a fresh patch for the RHEL-4 base kernels which includes both patch-2461 and patch-24702.

Could you please let us know whether we should merge patch-2461 and patch-24702 or submit them as two separate patches?

Comment 4 Rob Evers 2010-08-18 14:51:05 UTC
(In reply to comment #3)
> The patch submitted earlier was for RHEL-5 base kernels 
> 
> >>> Regarding Patch for RHEL-4 base kernels 
> 
>  As per the RHEL4U8 aacraid driver source, the version of the aacraid driver
> is- 2455. 
> 
> Earlier we have submitted patch-2461 and on top of that we have submitted
> patch-24702 to RHEL-5 base kernels, but we haven’t submitted patch-2461 and
> patch-24702 to RHEL-4 base kernels.
> 
> We have planned to submit a fresh patch for RHEL-4 base kernels which includes
> both patch-2461 and patch-24702. 
> 
> Could you please let us know whether we need to merge patch-2461 and
> patch-24702 or should be submitted as two different patches?

Ideally we want one patch that only addresses the read-only filesystem issue.  Is this possible?

Comment 6 serveraid 2010-08-19 15:34:20 UTC
>Ideally we want one patch that only addresses the read-only filesystem issue. 
>Is this possible?

  Based on your suggestion, we will submit a new patch for RHEL 4 U8 which addresses the read-only file system issue alone.
We are not sure which driver version to use for this patch, since it doesn't contain the 2461 changes. We have submitted the version 24702 patch for the RHEL-5 base kernels.

Please advise which version number we should use for the upcoming RHEL-4 base kernels.

Comment 7 Rob Evers 2010-08-19 17:58:44 UTC
(In reply to comment #6)
> >Ideally we want one patch that only addresses the read-only filesystem issue. 
> >Is this possible?
> 
>   Based on your suggestion we will be submitting a new patch for RHEL 4 U8
> which addresses read-only file system issue alone.
> We are not sure on the driver version for this patch which we are going to
> submit since it doesn’t contain 2461 changes. We have submitted the version
> 24702 patch for RHEL-5 base kernels. 
> 
> Please guide us for which version number we need to maintain for upcoming
> RHEL-4 base kernels.

This is really up to you.  Can you append something like -rh4-1 to the end of the version to indicate that it branched?

Comment 8 serveraid 2010-08-20 12:59:40 UTC
   
  The new patch for RHEL 4.8 will address both the file system read-only issue and the False RAID Alert issue, which are customer-critical issues. The patch submitted for the RHEL 5 base kernel contains both of these fixes.

For RHEL 4.8 we are planning to change the version number from 2455 to 24551 to indicate that it is branched. We will release the patch for RHEL 4.8 once HCL QA has qualified it.

Comment 9 Rob Evers 2010-08-20 13:23:05 UTC
(In reply to comment #8)

> For RHEL 4.8 we are planning to change the version number from 2455 to 24551 to
> indicate that it is branched. We will release the patch for RHEL 4.8 once, HCL
> QA has qualified it.

Please attach details of what HCL did to qualify this patch when the quality effort is complete.

Thanks, Rob

Comment 12 serveraid 2010-09-07 07:01:39 UTC
Hi Rob,

We have answered the above query in comment 31 of the bug linked below:
https://bugzilla.redhat.com/show_bug.cgi?id=523920

Comment 13 serveraid 2010-09-07 07:07:04 UTC
Created attachment 443412
aacraid 24551 patch for RHEL4U8

I am attaching the aacraid_24551 patch.
This patch is generated against RHEL-4U8 and addresses the file system
read-only and False RAID alert issues.

Comment 14 Rob Evers 2010-09-07 15:09:27 UTC
See potential hang/data corruption issue with equivalent patch in rhel5.6:

https://bugzilla.redhat.com/show_bug.cgi?id=523920#c34

Comment 17 RHEL Program Management 2010-09-29 18:51:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Vivek Goyal 2010-10-14 14:43:21 UTC
Committed in 89.43.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 24 Chris Ward 2011-01-31 10:59:54 UTC
Test Results?

Comment 25 errata-xmlrpc 2011-02-16 15:31:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html