Bug 624713 - [RHEL4] Problems with aacraid - File system going into read-only.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 4.9
Assigned To: Rob Evers
QA Contact: Storage QE
Keywords: OtherQA
Depends On:
Blocks: 626414
Reported: 2010-08-17 10:42 EDT by Bryn M. Reeves
Modified: 2011-02-16 10:31 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 523920
Environment:
Last Closed: 2011-02-16 10:31:03 EST


Attachments
aacraid 24551 patch for RHEL4U8 (12.29 KB, patch)
2010-09-07 03:07 EDT, serveraid


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0263
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update
Last Updated: 2011-02-16 10:14:55 EST

Description Bryn M. Reeves 2010-08-17 10:42:27 EDT
+++ This bug was initially created as a clone of Bug #523920 +++

Description of problem:
File system is going into read-only mode.
Version-Release number of selected component (if applicable):


How reproducible:

There are no specific steps for reproducing this issue; it depends on the IBM server type and on how frequently aacraid management commands exit without getting a response from the aacraid firmware.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
File system is going into read-only

Expected results:
File system should not go into read-only

Additional info:

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 17713050
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 21904786
Buffer I/O error on device dm-1, logical block 428037
lost page write due to I/O error on dm-1
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 19059346
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 20331882
Buffer I/O error on device dm-1, logical block 231424
lost page write due to I/O error on dm-1
aacraid: Host adapter reset request. SCSI hang ?
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1768954
Buffer I/O error on device dm-0, logical block 8210
lost page write due to I/O error on dm-0
ReiserFS: dm-0: warning: journal-837: IO error during journal replay
REISERFS: abort (device dm-0): Write error while updating journal header in flush_journal_list
REISERFS: Aborting journal for filesystem on dm-0
REISERFS: abort (device dm-1): Journal write error in flush_commit_list
REISERFS: Aborting journal for filesystem on dm-1

[0:0:0:0]    disk    ServeRA  A                V1.0  /dev/sda

and

Jun 10 04:15:14 ahost kernel: end_request: I/O error, dev sda, sector 8884755
Jun 10 04:15:14 ahost kernel: SCSI error : <0 0 0 0> return code = 0x70000
Jun 10 04:15:14 ahost kernel: end_request: I/O error, dev sda, sector 8884603
Jun 10 04:15:14 ahost kernel: SCSI error : <0 0 0 0> return code = 0x70000
Jun 10 04:15:14 ahost kernel: end_request: I/O error, dev sda, sector 8884787
Jun 10 04:15:14 ahost kernel: REISERFS: abort (device dm-0): Write error while pushing transaction to disk in flush_journal_list

[0:0:0:0]    disk    ServeRA  ARRAYA           V1.0  /dev/sda
Comment 1 Bryn M. Reeves 2010-08-17 10:44:08 EDT
This is the RHEL4 version of bug 523920. Only the memory leak as discussed in the RHEL5 bug is relevant here:

Issue 3:
--------
The driver tends not to free the memory (FIB) when a management request exits
prematurely. The accumulation of such unfreed FIBs eventually leaves the
driver unable to allocate any more, so it returns 0x70000 to the upper
layer, which puts the file system into read-only mode.

Fix details:
-------------
The fix makes sure to free the memory (FIB) even if the request exits
prematurely, ensuring that the driver does not run out of FIBs.
Comment 2 Bryn M. Reeves 2010-08-17 10:49:32 EDT
This was accepted upstream in 2.6.33:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=cacb6dc3d7fea751879a225c15e48228415e6359

Patch doesn't apply directly to current EL4 aacraid:

$ diffstat /tmp/aacraid-fix-leak.patch
aachba.c   |   52 +++++++++++++++++++++++++++++++++-----------
aacraid.h  |    5 +++-
commctrl.c |   28 +++++++++++------------
comminit.c |    6 ++++-
commsup.c  |   72 +++++++++++++++++++++++++++++++++++++++++++++++++++----------
dpcsup.c   |   36 +++++++++++++++++++++++++-----
6 files changed, 154 insertions(+), 45 deletions(-)

And does not apply cleanly to the current RHEL4 aacraid:

$ patch -p1 < /tmp/aacraid-fix-leak.patch
patching file drivers/scsi/aacraid/aachba.c
Hunk #1 succeeded at 266 (offset -27 lines).
Hunk #2 succeeded at 286 (offset -27 lines).
Hunk #3 succeeded at 336 (offset -27 lines).
Hunk #6 succeeded at 1480 (offset -10 lines).
Hunk #7 succeeded at 1639 (offset -16 lines).
Hunk #8 succeeded at 1719 (offset -16 lines).
patching file drivers/scsi/aacraid/aacraid.h
Hunk #1 FAILED at 12.
Hunk #2 FAILED at 1036.
2 out of 2 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/aacraid.h.rej
patching file drivers/scsi/aacraid/commctrl.c
Hunk #1 succeeded at 142 (offset -11 lines).
Hunk #2 succeeded at 309 (offset -13 lines).
Hunk #3 FAILED at 593.
Hunk #4 FAILED at 645.
Hunk #5 FAILED at 695.
Hunk #6 FAILED at 734.
Hunk #7 succeeded at 727 (offset -45 lines).
Hunk #8 succeeded at 765 (offset -45 lines).
Hunk #9 succeeded at 803 (offset -45 lines).
4 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commctrl.c.rej
patching file drivers/scsi/aacraid/comminit.c
Hunk #1 succeeded at 202 (offset 8 lines).
Hunk #2 succeeded at 314 (offset 8 lines).
patching file drivers/scsi/aacraid/commsup.c
Hunk #1 succeeded at 192 (offset 3 lines).
Hunk #2 succeeded at 400 (offset 3 lines).
Hunk #3 succeeded at 483 (offset 3 lines).
Hunk #4 FAILED at 547.
Hunk #5 succeeded at 721 (offset 1 line).
Hunk #6 succeeded at 742 (offset 1 line).
Hunk #7 succeeded at 1393 (offset -1 lines).
Hunk #8 succeeded at 1793 (offset -8 lines).
Hunk #9 succeeded at 1804 (offset -8 lines).
1 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commsup.c.rej
patching file drivers/scsi/aacraid/dpcsup.c
[breeves@breeves rhel4]$ patch -R -p1 < /tmp/aacraid-fix-leak.patch
patching file drivers/scsi/aacraid/aachba.c
Hunk #1 succeeded at 266 (offset -27 lines).
Hunk #2 succeeded at 283 (offset -27 lines).
Hunk #3 succeeded at 328 (offset -27 lines).
Hunk #6 succeeded at 1460 (offset -10 lines).
Hunk #7 succeeded at 1617 (offset -16 lines).
Hunk #8 succeeded at 1696 (offset -16 lines).
patching file drivers/scsi/aacraid/aacraid.h
Hunk #1 FAILED at 12.
Hunk #2 FAILED at 1036.
2 out of 2 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/aacraid.h.rej
patching file drivers/scsi/aacraid/commctrl.c
Hunk #1 succeeded at 142 (offset -11 lines).
Hunk #2 succeeded at 309 (offset -13 lines).
Hunk #3 FAILED at 593.
Hunk #4 FAILED at 645.
Hunk #5 FAILED at 695.
Hunk #6 FAILED at 734.
Hunk #7 succeeded at 727 (offset -45 lines).
Hunk #8 succeeded at 765 (offset -45 lines).
Hunk #9 succeeded at 803 (offset -45 lines).
4 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commctrl.c.rej
patching file drivers/scsi/aacraid/comminit.c
Hunk #1 succeeded at 202 (offset 8 lines).
Hunk #2 succeeded at 312 (offset 8 lines).
patching file drivers/scsi/aacraid/commsup.c
Hunk #1 succeeded at 192 (offset 3 lines).
Hunk #2 succeeded at 393 (offset 3 lines).
Hunk #3 succeeded at 474 (offset 3 lines).
Hunk #4 FAILED at 516.
Hunk #7 succeeded at 1354 (offset -2 lines).
Hunk #8 succeeded at 1751 (offset -9 lines).
Hunk #9 succeeded at 1761 (offset -9 lines).
1 out of 9 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/commsup.c.rej
patching file drivers/scsi/aacraid/dpcsup.c
Comment 3 serveraid 2010-08-18 10:23:25 EDT
The patch submitted earlier was for RHEL-5 base kernels.

>>> Regarding the patch for RHEL-4 base kernels

As per the RHEL4U8 aacraid driver source, the version of the aacraid driver is 2455.

We previously submitted patch-2461 and, on top of that, patch-24702 for RHEL-5 base kernels, but we haven't submitted patch-2461 and patch-24702 for RHEL-4 base kernels.

We plan to submit a fresh patch for RHEL-4 base kernels that includes both patch-2461 and patch-24702.

Could you please let us know whether we should merge patch-2461 and patch-24702 or submit them as two separate patches?
Comment 4 Rob Evers 2010-08-18 10:51:05 EDT
(In reply to comment #3)
> The patch submitted earlier was for RHEL-5 base kernels.
> 
> >>> Regarding the patch for RHEL-4 base kernels
> 
> As per the RHEL4U8 aacraid driver source, the version of the aacraid driver
> is 2455.
> 
> We previously submitted patch-2461 and, on top of that, patch-24702 for
> RHEL-5 base kernels, but we haven't submitted patch-2461 and patch-24702
> for RHEL-4 base kernels.
> 
> We plan to submit a fresh patch for RHEL-4 base kernels that includes
> both patch-2461 and patch-24702.
> 
> Could you please let us know whether we should merge patch-2461 and
> patch-24702 or submit them as two separate patches?

Ideally we want one patch that only addresses the read-only filesystem issue.  Is this possible?
Comment 6 serveraid 2010-08-19 11:34:20 EDT
>Ideally we want one patch that only addresses the read-only filesystem issue. 
>Is this possible?

Based on your suggestion, we will submit a new patch for RHEL 4 U8 that addresses the read-only file system issue alone.
We are not sure which driver version to use for this patch, since it doesn't contain the 2461 changes. We have submitted the version 24702 patch for RHEL-5 base kernels.

Please advise which version number we should use for upcoming RHEL-4 base kernels.
Comment 7 Rob Evers 2010-08-19 13:58:44 EDT
(In reply to comment #6)
> >Ideally we want one patch that only addresses the read-only filesystem issue. 
> >Is this possible?
> 
> Based on your suggestion, we will submit a new patch for RHEL 4 U8 that
> addresses the read-only file system issue alone.
> We are not sure which driver version to use for this patch, since it
> doesn't contain the 2461 changes. We have submitted the version 24702 patch
> for RHEL-5 base kernels.
> 
> Please advise which version number we should use for upcoming RHEL-4 base
> kernels.

This is really up to you.  Can you append something like -rh4-1 to the end of the version to indicate that it branched?
Comment 8 serveraid 2010-08-20 08:59:40 EDT
   
The new patch for RHEL 4.8 will address both the file system read-only issue and the false RAID alert issue, which are customer-critical. The patch submitted for the RHEL 5 base kernel contains the above-mentioned fixes.

For RHEL 4.8 we are planning to change the version number from 2455 to 24551 to indicate that it is branched. We will release the patch for RHEL 4.8 once HCL QA has qualified it.
Comment 9 Rob Evers 2010-08-20 09:23:05 EDT
(In reply to comment #8)

> For RHEL 4.8 we are planning to change the version number from 2455 to 24551 to
> indicate that it is branched. We will release the patch for RHEL 4.8 once HCL
> QA has qualified it.

Please attach details of what HCL did to qualify this patch when the quality effort is complete.

Thanks, Rob
Comment 12 serveraid 2010-09-07 03:01:39 EDT
Hi Rob,

We have answered the above query in comment 31 of the bug linked below:
https://bugzilla.redhat.com/show_bug.cgi?id=523920
Comment 13 serveraid 2010-09-07 03:07:04 EDT
Created attachment 443412
aacraid 24551 patch for RHEL4U8

I am attaching the aacraid_24551 patch.
This patch is generated against RHEL-4U8 and addresses the file system
read-only and false RAID alert issues.
Comment 14 Rob Evers 2010-09-07 11:09:27 EDT
See potential hang/data corruption issue with equivalent patch in rhel5.6:

https://bugzilla.redhat.com/show_bug.cgi?id=523920#c34
Comment 17 RHEL Product and Program Management 2010-09-29 14:51:12 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 19 Vivek Goyal 2010-10-14 10:43:21 EDT
Committed in 89.43.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Comment 24 Chris Ward 2011-01-31 05:59:54 EST
Test Results?
Comment 25 errata-xmlrpc 2011-02-16 10:31:03 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
