Bug 523920 - [Adaptec/HCL 5.6 bug] Problems with aacraid - File system going into read-only.
Summary: [Adaptec/HCL 5.6 bug] Problems with aacraid - File system going into read-only.
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Rob Evers
QA Contact: Storage QE
URL:
Whiteboard:
Keywords: OtherQA
Depends On:
Blocks:
Reported: 2009-09-17 08:13 UTC by serveraid
Modified: 2018-10-27 15:09 UTC (History)
Issue1: File system going into read-only mode
---------

Root cause:
-----------
       The driver does not free the memory (FIB) when a management request exits prematurely. As these un-freed FIBs accumulate, the driver can no longer allocate a FIB and returns 0x70000 to the upper layer, which puts the file system into read-only mode.

Fix details:
------------
     The fix frees the memory (FIB) even if the request exits prematurely, so the driver does not run out of FIBs.


Issue2:
------- 
	A false RAID alert occurs when physical drives and logical drives are reported as deleted or added even though nothing has changed on the system.

Root cause:
-----------
        Driver IOCTLs are signaled with EINTR while waiting on a response from the lower layers. Returning “EINTR” never initiates an internal retry.

Fix details:
------------
        The issue was fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.
Clone Of:
: 624713 (view as bug list)
Last Closed: 2011-01-13 20:53:34 UTC


Attachments
The attached patch for RHEL5 U5 only (13.13 KB, patch)
2009-09-17 08:30 UTC, serveraid
The attached patch for RHEL5 U4 and above (14.32 KB, patch)
2009-09-23 14:06 UTC, serveraid
aac-24702 patch for RHEL5U5 (14.37 KB, patch)
2010-07-27 09:01 UTC, serveraid


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0017
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description serveraid 2009-09-17 08:13:55 UTC
Description of problem:
File system is going into read-only mode.
Version-Release number of selected component (if applicable):


How reproducible:

There are no specific steps for reproducing this issue; it depends on the IBM server type and on how frequently aacraid management commands exit without getting a response from the aacraid firmware.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
File system is going into read-only

Expected results:
File system should not go into read-only mode.

Additional info:

The customer is SAP Hosting. They are regularly getting errors from the aacraid
driver. I have two dmesg outputs:

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 17713050
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 21904786
Buffer I/O error on device dm-1, logical block 428037
lost page write due to I/O error on dm-1
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 19059346
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 20331882
Buffer I/O error on device dm-1, logical block 231424
lost page write due to I/O error on dm-1
aacraid: Host adapter reset request. SCSI hang ?
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1768954
Buffer I/O error on device dm-0, logical block 8210
lost page write due to I/O error on dm-0
ReiserFS: dm-0: warning: journal-837: IO error during journal replay
REISERFS: abort (device dm-0): Write error while updating journal header in
flush_journal_list
REISERFS: Aborting journal for filesystem on dm-0
REISERFS: abort (device dm-1): Journal write error in flush_commit_list
REISERFS: Aborting journal for filesystem on dm-1

[0:0:0:0]    disk    ServeRA  A                V1.0  /dev/sda

and

Jun 10 04:15:14 hg22719 kernel: end_request: I/O error, dev sda, sector 8884755
Jun 10 04:15:14 hg22719 kernel: SCSI error : <0 0 0 0> return code = 0x70000
Jun 10 04:15:14 hg22719 kernel: end_request: I/O error, dev sda, sector 8884603
Jun 10 04:15:14 hg22719 kernel: SCSI error : <0 0 0 0> return code = 0x70000
Jun 10 04:15:14 hg22719 kernel: end_request: I/O error, dev sda, sector 8884787
Jun 10 04:15:14 hg22719 kernel: REISERFS: abort (device dm-0): Write error
while pushing transaction to disk in flush_journal_list

[0:0:0:0]    disk    ServeRA  ARRAYA           V1.0  /dev/sda
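
The two return codes in the logs above differ in which byte is set. The following is a minimal sketch for pulling them apart, assuming the 2.6-era SCSI mid-layer convention that a result word packs four one-byte fields (status, message, host, driver, from low to high). The struct and function names are hypothetical and are not kernel identifiers. Under that assumption 0x70000 carries host byte 0x07 and 0x6000000 carries driver byte 0x06; the symbolic names behind those values should be confirmed against include/scsi/scsi.h in the kernel tree being debugged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four fields of a SCSI result word.
 * The byte positions are an assumption based on 2.6-era kernels. */
struct scsi_result_fields {
	uint8_t status_byte;  /* bits 0-7:   SCSI status from the target */
	uint8_t msg_byte;     /* bits 8-15:  message byte */
	uint8_t host_byte;    /* bits 16-23: set by the low-level driver */
	uint8_t driver_byte;  /* bits 24-31: set by the mid-layer */
};

/* Split a 32-bit result word into its four fields. */
static void decode_scsi_result(uint32_t result, struct scsi_result_fields *out)
{
	assert(out != NULL);
	out->status_byte = (uint8_t)(result & 0xff);
	out->msg_byte    = (uint8_t)((result >> 8) & 0xff);
	out->host_byte   = (uint8_t)((result >> 16) & 0xff);
	out->driver_byte = (uint8_t)((result >> 24) & 0xff);
}
```

Seen this way, the first log shows failures flagged in the driver byte after adapter resets, while the second shows failures flagged in the host byte, which is the value the root-cause notes in this report associate with FIB exhaustion.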

Comment 1 serveraid 2009-09-17 08:19:28 UTC
Rob - This is the first time I have created a Bugzilla entry for Red Hat, so please let me know if it is OK for you.

Comment 2 serveraid 2009-09-17 08:30:24 UTC
Created attachment 361457 [details]
The attached patch for RHEL5 U5 only

Rob - Please review the patch and let me know if it is OK for you. If it is, I will go ahead and create Bugzilla entries for the other releases as well.

The attached patch was generated for the following issues:

Issue:1
--------
         The ternary operation in function aac_send_raw_srb () was observed to
behave incorrectly in the 64-bit version. This was caused by missing
parentheses in the condition that checks the sg count.

Fix details:
-------------
          Fixed by adding parentheses.
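
The patch hunk is not reproduced in this report, so the following is only a sketch of the general class of bug being described: in C the relational operator binds tighter than `?:`, so `a > b ? x : y` parses as `(a > b) ? x : y`. All names and limit values below are hypothetical, and the sketch does not try to explain why the symptom was noticed on 64-bit builds specifically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-adapter limits for one scatter-gather element. */
struct sg_limits {
	size_t new_comm_bytes;  /* limit when the newer interface is in use */
	size_t legacy_bytes;    /* limit otherwise */
};

/* Without parentheses: parsed as (count > new_comm) ? A : B.
 * The expression yields one of the limits, never a comparison result,
 * so any nonzero limit makes the check report "too large". */
static bool sg_too_large_buggy(size_t count, bool new_comm,
			       const struct sg_limits *lim)
{
	assert(lim != NULL);
	size_t picked = count > new_comm ? lim->new_comm_bytes
					 : lim->legacy_bytes;
	return picked != 0;
}

/* With parentheses: select the limit first, then compare against it. */
static bool sg_too_large_fixed(size_t count, bool new_comm,
			       const struct sg_limits *lim)
{
	assert(lim != NULL);
	return count > (new_comm ? lim->new_comm_bytes : lim->legacy_bytes);
}
```

With limits of 512 KiB and 64 KiB, a 4096-byte element is rejected by the unparenthesized form and accepted by the parenthesized one.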

Issue:2
--------
        Driver IOCTLs are signaled with EINTR while waiting on a response from
the lower layers. Returning “EINTR” never initiates an internal retry.

Fix details:
-------------
        Fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.
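
To illustrate why the choice of error code matters, here is a small user-space model of what happens when a signal interrupts a sleeping system call. It is a sketch under assumptions: it covers only the case where a signal handler is installed, it reduces the kernel's exit path to one decision, and every `model_` / `MODEL_` name is hypothetical. The rule it encodes is the documented one: a handler return of -ERESTARTSYS lets the kernel restart the call when the signal was installed with SA_RESTART, while -EINTR is always passed to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal restart code; user space never sees it. Redefined here
 * for the model because <errno.h> does not export it. */
#define MODEL_ERESTARTSYS 512

enum model_outcome {
	MODEL_COMPLETED,      /* call finished normally */
	MODEL_RESTARTED,      /* kernel transparently re-issues the call */
	MODEL_EINTR_TO_USER,  /* caller sees errno == EINTR */
	MODEL_OTHER_ERROR     /* any other negative errno */
};

/* handler_ret: value returned by the driver's ioctl path (0 or -errno).
 * sa_restart:  whether the interrupting signal was installed with SA_RESTART. */
static enum model_outcome model_syscall_exit(int handler_ret, bool sa_restart)
{
	assert(handler_ret <= 0);
	if (handler_ret == 0)
		return MODEL_COMPLETED;
	if (handler_ret == -MODEL_ERESTARTSYS)
		return sa_restart ? MODEL_RESTARTED : MODEL_EINTR_TO_USER;
	if (handler_ret == -EINTR)
		return MODEL_EINTR_TO_USER;
	return MODEL_OTHER_ERROR;
}
```

In this model a management tool that does not retry on EINTR only gets a transparent retry when the driver returns the restart code.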

Issue:3
--------
       The driver does not free the memory (FIB) when a management request
exits prematurely. As these un-freed FIBs accumulate, the driver can no longer
allocate a FIB and returns 0x70000 to the upper layer, which puts the file
system into read-only mode.

Fix details:
-------------
     The fix frees the memory (FIB) even if the request exits prematurely, so
the driver does not run out of FIBs.

Comment 3 serveraid 2009-09-17 08:31:03 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.


Comment 4 Rob Evers 2009-09-17 17:57:19 UTC
Release notes are targeted at customers so you need to indicate how a patch, or parts of a patch in this case, fix(es) a particular symptom that the customer would be observing.

Guidelines for patches:

Any changes reflected in patches need to be upstream, or at least posted to linux-scsi mailing list and generally reviewed and ready to be accepted by James Bottomley.  Something to keep in mind when submitting code upstream is that individual problems should be solved by individual patches, as this makes review much easier.  Patch sets are ok as well (even preferred) where a description of each problem accompanies each patch, but large patches that bundle multiple problem fixes together are much more difficult to review, and therefore, more difficult to get accepted upstream into scsi-misc (submitted via email to linux-scsi).

You are not required to break this patch into multiple patches here, though that would make it easier to review and to get accepted.  It may also be easier to describe the changes and how they relate to the code changes in the patch.  Detailed explanations of particular bug fixes and the accompanying code changes are helpful when submitting patches here.

I applied the patch to a pretty recent rhel5.4-ish kernel tree and it failed:

[revers@revers-desk kernel]$ patch -p1 --dry-run < ../patches/bz523920/rhel5_u5_aac24701.patch 
patching file drivers/scsi/aacraid/aachba.c
patching file drivers/scsi/aacraid/aacraid.h
patching file drivers/scsi/aacraid/commctrl.c
patching file drivers/scsi/aacraid/comminit.c
patching file drivers/scsi/aacraid/commsup.c
patching file drivers/scsi/aacraid/dpcsup.c
Hunk #2 FAILED at 125.
Hunk #3 succeeded at 240 (offset -2 lines).
1 out of 4 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/dpcsup.c.rej

What tree did you generate the patch from?

Also, use 'diff -purN' or something equivalent to generate patches.  Nice to see stuff like c labels to reference where patches apply in the patch itself.

Some explanation of how the patch was tested against the problems reported along with regression testing, is important to get patches accepted into rhel releases.

There are white space problems in lines 238, 426, and 432.  No trailing tabs.  You might want to dig up a program called patch-check that can flag these issues.  Problems like this will most likely be rejected upstream.

Other upstream style:  80+ character lines.
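
As an illustration of the two mechanical checks mentioned above (trailing whitespace and lines longer than 80 columns), here is a minimal sketch of per-line tests. It is not the patch-check program referred to in this comment; the function names and the 8-column tab stop are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of the line without its trailing CR/LF characters. */
static size_t content_length(const char *line)
{
	assert(line != NULL);
	size_t n = strlen(line);
	while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
		n--;
	return n;
}

/* True if the line ends in a space or tab before the line terminator. */
static bool has_trailing_whitespace(const char *line)
{
	size_t n = content_length(line);
	return n > 0 && (line[n - 1] == ' ' || line[n - 1] == '\t');
}

/* True if the displayed width exceeds limit, with tabs advancing to the
 * next multiple of 8 columns (assumed tab stop). */
static bool exceeds_columns(const char *line, size_t limit)
{
	size_t n = content_length(line);
	size_t col = 0;
	for (size_t i = 0; i < n; i++) {
		if (line[i] == '\t')
			col = (col / 8 + 1) * 8;
		else
			col++;
	}
	return col > limit;
}
```

When checking a patch rather than a source file, the leading `+` on added lines would need to be skipped before measuring the width.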

Comment 5 serveraid 2009-09-18 11:41:37 UTC
Rob - Thank you very much for your feedback. Can you please let me know the code tree locations of the latest RHEL4.9, RHEL5.4, and RHEL5.5?

Comment 6 serveraid 2009-09-23 14:06:18 UTC
Created attachment 362301 [details]
The attached patch for RHEL5 U4 and above

The attached patch was generated against "kernel-2.6.18-164.el5.src.rpm". Please apply it on RHEL5 U4 and RHEL5 U5 and let us know if you face any issues while applying the patch.

Regarding testing:
------------------

This particular problem was reported by Cisco and SAP. Cisco reported it on RHEL4 U6, and SAP reported it on SLES9 SP4 and SLES10 SP2. We added these fixes to RHEL4 U6 and gave a private build to IBM and Cisco. Cisco and IBM tested it for more than 15 days and reported that they had not seen the issue so far. Before the fix, Cisco used to see the issue within 5 days. We generated a patch for SLES9 SP4 and SLES10 SP2 and submitted it to Novell. Novell applied the patch and gave a test build to SAP. SAP tested it and reported that it is working properly.

We also tested in our lab using the tool "dishogsync", an IO stress tool provided by Cisco.

Please let me know if you need more information; I am ready to provide it.

Thank you very much for your humble/good support.

Comment 7 serveraid 2009-09-23 14:18:31 UTC
Here is some explanation on the issue:
----------------------------------------

1. What is the root cause? I personally would like to get the details on how
you arrived at this root cause since we have come a long way from where you
started. 

Response:

 The problem results from an accumulation of a specific series of events
that depletes the resources available for IO requests.  The issue revolves
around the handling of management requests.  For clarity, management requests
are not confined to arconf commands.  A management request is defined in this
environment as any "non-IO" command.  This includes arconf commands, but also
any internal status requests made by the FW, real time clock synchronization
of the RAID FW to the server, and commands tied to communications arbitration
in the stack.  Essentially, anything other than blocks of data is considered a
management command in this description.

 When the code stream is handling one of these management requests, it waits
for the FW to respond.  If an interrupt occurs while the management request is
being serviced, the code stream jumps away from waiting on the management
command and services the interrupt.  When the interrupt has been serviced and
the code jumps back to finish the original management command, the FIB
(Firmware Interface Block) associated with the command is left in an
unassociated state instead of being completed or cleared.  A new FIB is
generated to finish the original command, but the original FIB is left in a
hung state.

 The SCSI middle layer is assigned 238 FIB resources out of a subsystem total
of 400.  As this scenario happens multiple times, the resources for IO become
limited because the accumulated, essentially hung FIBs take up resources needed
by the IO FIBs.  This happens faster on a very busy system.  It can also happen
on a system running lower levels of stress, but statistically it takes longer
because the "bullet hitting a bullet" scenario with the interrupt is less
likely to occur.

 When the FIB resources become limited because a large number of FIBs are
essentially hung, the driver is unable to assign a FIB to an IO.  When a write
IO fails as a result, the driver produces a 0x70000 error and then retries the
command 5 times.  If it manages to get resources in the course of the 5
attempts, the system continues to run.  If it does not, the OS write has failed
and the OS goes into "read only" mode.

 To fix this we identified that there was a FIB resource issue, discovered the
hung FIBs, and modified the code to allow the FIB resources to be released
properly in the event of an interrupt displacing a management command.
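
The leak described in this response can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the driver's code: the pool size is an arbitrary small number, all `model_` names are hypothetical, and the "fixed" path frees the FIB immediately on early exit. A real driver may not be able to do that, because the firmware can still own the FIB until its response arrives, so the actual fix may defer the release.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_POOL_SIZE 8  /* illustrative; far smaller than the real pool */

struct model_fib_pool {
	bool in_use[MODEL_POOL_SIZE];
	int free_count;
};

static void model_pool_init(struct model_fib_pool *p)
{
	for (int i = 0; i < MODEL_POOL_SIZE; i++)
		p->in_use[i] = false;
	p->free_count = MODEL_POOL_SIZE;
}

/* Returns a FIB index, or -1 when the pool is exhausted. */
static int model_fib_alloc(struct model_fib_pool *p)
{
	for (int i = 0; i < MODEL_POOL_SIZE; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			p->free_count--;
			return i;
		}
	}
	return -1;
}

static void model_fib_free(struct model_fib_pool *p, int idx)
{
	assert(idx >= 0 && idx < MODEL_POOL_SIZE && p->in_use[idx]);
	p->in_use[idx] = false;
	p->free_count++;
}

/* One management request. If the wait is interrupted, the unpatched
 * behaviour (free_on_early_exit == false) returns without releasing
 * the FIB; the patched behaviour releases it first.
 * Returns 0 on completion, -1 if no FIB was available, -2 on early exit. */
static int model_mgmt_request(struct model_fib_pool *p, bool interrupted,
			      bool free_on_early_exit)
{
	int idx = model_fib_alloc(p);
	if (idx < 0)
		return -1;
	if (interrupted) {
		if (free_on_early_exit)
			model_fib_free(p, idx);
		return -2;
	}
	model_fib_free(p, idx);
	return 0;
}
```

Calling `model_mgmt_request(&pool, true, false)` MODEL_POOL_SIZE times leaves no FIB for the next allocation, which is the condition this response ties to the 0x70000 error; with `free_on_early_exit` set to true the free count stays at MODEL_POOL_SIZE however many requests are interrupted.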

2. I need to see some evidence that this IS the root cause and not an
accidental symptomatic match to the problems VTG’s customers saw. As part of
this, please provide the details of other suspects and how they were ruled out. 

Response:  IBM created a script to demonstrate a "proof of concept" for the
problem scenario we identified as causing the "read only" situation.  IBM also
created a new driver with the issue fixed in it.  The new driver also had debug
code embedded in it that would print messages to a log to inform development
that the scenario above in which a management command was interrupted by an
interrupt did in fact happen.  The presence of multiple log entries of this
type on a system demonstrates that the scenario is actually happening.

Let's discuss the script in detail first.  The script is not representative of
customer data flow but instead issues many arconf "getstatus" commands.
Dozens, up to hundreds, of these commands are generated to add so
many management commands in the data stream that the issue has to happen from a
statistical point of view.  Using an unpatched driver the system will fall into
"read only" mode pretty quickly.  Running the patched driver should and does
allow FIBs to be cleared properly and even running the script should not fail
the system.  In the note above Cisco made the statement that no difference was
seen in failure rates between FW levels 418 and 427.  When running the script,
this is true because the script is designed to overwhelm the resources in such
a way that the likelihood of failure is so great it will almost certainly
happen.  In the real world environments, this doesn't appear to be true based
on the history we have seen.  421 seems much less likely to fail based on what
we have seen, and the problem seems to have been more likely on 427.  As you
move from one code level to another, subtle timings and code efficiencies
appear to be just enough so that the likelihood of encountering the timed issue
will increase or decrease but the script essentially bypasses all that and
bombards the code path with management commands. 

The patched driver has been run under standard stress tools at IBM and at
Cisco.  Logs have been gathered from our test beds that confirm via debug
messages embedded in the test drivers that the scenario of an interrupt
occurring in the process of a management command servicing has occurred
multiple times on a system.  All unused FIBs were freed by the change added to
the driver, and the system did not go into read only.  No system has
encountered an issue of this type using the new driver.

Comment 8 Rob Evers 2009-09-23 20:09:07 UTC
The patch applies cleanly to today's rhel5.5, though haven't built it yet.

A few followup items from comment 4:

Have the changes in this patch been accepted into James Bottomley's scsi-misc-2.6 git repository or has it been posted onto the linux-scsi mailing list as a step towards being accepted?

Regarding testing described in comments 6 & 7, the testing reported only seems to apply to one item but three are listed in comment 3.

The release notes need to be updated such that a customer will be able to digest the update.

Comment 9 serveraid 2009-09-29 12:31:13 UTC
Rob - The changes have been posted to the linux-scsi mailing list for acceptance, as the first step you described.

Regarding testing, all the issues are interlinked. All of them were caused by down_interruptible. We generated a patch with all these changes and gave it to Novell. Novell applied the patch on SLES9 SP4 and SLES10 SP2 and gave it to SAP. SAP reported that the patch is working properly. Our QA is also testing on the OSes below:

1. RHEL4.6, RHEL4.7, and  RHEL4.8 (on 32 and 64 bits)
2. RHEL5.2, RHEL5.3, and RHEL5.4 (on 32 and 64 bits)
3. SLES9 SP2, SP3, and SP4 (on 32 and 64 bits)
4. SLES10, SLES10 SP1 and SLES10 SP2 (on 32 and 64 bits)
5. SLES11 (on 32 and 64 bits).


Please let me know if any additional info is required.

Comment 10 serveraid 2009-09-29 12:31:13 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,30 +1,23 @@
-Issue:1
---------
-         Behavior of the ternary operation in function aac_send_raw_srb () was
-observed incorrect in 64-bit version. This issue was because of missing
-parenthesis in the condition to check the sg count.
+Issue1: File System going into read-only mode
+---------
 
+Root cause:
+-----------
+       The driver tends to not free the memory (FIB) when the management request exits prematurely. The accumulation of such un-freed memory causes the driver to fail to allocate anymore memory (FIB) and hence return 0x70000 value to the upper layer, which puts the file system into read only mode.
+
 Fix details:
--------------
-          Fixed by adding parentheses.
+------------
+     The fix makes sure to free the memory (FIB) even if the request exits prematurely hence ensuring the driver wouldn’t run out of memory (FIBs).
 
-Issue:2
---------
-        Driver IOCTLs is signaled with EINTR while waiting on response from the
-lower layers. Returning “EINTR” will never initiate internal retry. 
 
-Fix details:
--------------
-        Fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.
+Issue2:
+------- 
+	False Raid Alert occurs- when the Physical Drives and Logical drives are reported as deleted or added, even though there is no change done on the system
 
-Issue:3
---------
-       The driver tends to not free the memory (FIB)  when the management
-request exits prematurely. The accumulation of such un-freed memory causes the
-driver to fail to allocate anymore memory (FIB) and hence return 0x70000 value
-to the upper layer, which puts the file system into read only mode.
+Root cause:
+-----------
+        Driver IOCTLs is signaled with EINTR while waiting on response from the lower layers. Returning “EINTR” will never initiate internal retry. 
 
 Fix details:
--------------
+------------
-     The fix makes sure to free the memory(FIB) even if the request exits
-prematurely hence ensuring the driver wouldn’t run out of memory(FIBs)
+        The issue was fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.

Comment 11 serveraid 2009-10-06 06:41:14 UTC
Rob - Can you please let us know the kernel version and source code location for RHEL6 so that I can generate and submit a patch for the issues we fixed?

Comment 12 Rob Evers 2009-10-06 14:28:09 UTC
ServerRAIDDriver@hcl.in -

Please wait until alpha 2 is available for rhel6. Andrius Benokraitis <andriusb@redhat.com> will inform when that is ready, and provide pointers directly.

Comment 13 Rob Evers 2009-10-06 16:01:39 UTC
note that a different bz should be opened for rhel6.0 patches

Comment 14 Rob Evers 2009-10-06 17:58:16 UTC
Is another patch supposed to be posted to linux-scsi for these fixes in response to the email below?

http://marc.info/?l=linux-scsi&m=125431052313241&w=2

Currently this patch is not upstream and appears to need some work.  Once this work is done, please repost upstream.  Once accepted upstream, please repost a backport of the accepted upstream patch(es) here and obsolete the currently attached patch.

Thanks.

Comment 17 David Jeffery 2010-02-04 21:40:18 UTC
A version of the changes has been merged for 2.6.33:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=cacb6dc3d7fea751879a225c15e48228415e6359

Comment 20 Rob Evers 2010-06-02 15:56:23 UTC
Changes need to be accepted upstream and tested in that form against a rhel 
release before they are eligible for acceptance. 

Differences exist between commit cacb6dc3d7fea751879a225c15e48228415e6359 
and the 2nd patch posted in bz523920.  See one example below.

(Assuming the first patch in the bz is obsolete, please mark as obsolete).

Was testing done against the commit that is upstream
or the 2nd patch posted in the bugzilla?

Please test the upstream version of the patch against rhel5.5 
under the failure scenario and then attach a patch generated from 
rhel5.5 that matches upstream.

Rob

@@ -842,13 +842,22 @@ static int aac_get_pci_info(struct aac_d
 int aac_do_ioctl(struct aac_dev * dev, int cmd, void __user *arg)
 {
 	int status;
+	unsigned long mflags;
 
 	/*
 	 *	HBA gets first crack
 	 */
 
+	spin_lock_irqsave(&dev->manage_lock, mflags);
+	if (dev->management_fib_count > AAC_NUM_MGT_FIB) {
+		printk(KERN_INFO "No management Fibs Available:%d\n",
+						dev->management_fib_count);
+		spin_unlock_irqrestore(&dev->manage_lock, mflags);
+		return -EBUSY;
+	}
+	spin_unlock_irqrestore(&dev->manage_lock, mflags);
 	status = aac_dev_ioctl(dev, cmd, arg);
-	if(status != -ENOTTY)
+	if (status != -ENOTTY)
 		return status;
 
 	switch (cmd) {
diff -purN a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
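
The added lines in the hunk above amount to an admission check: refuse a new ioctl while too many management FIBs are outstanding. A stripped-down, single-threaded sketch of that check follows. It is an illustration only: the `model_` names are hypothetical, the limit value is made up because the value of AAC_NUM_MGT_FIB is not shown in this report, and the spinlock that protects the counter in the hunk is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical limit; the real AAC_NUM_MGT_FIB value is not shown here. */
#define MODEL_NUM_MGT_FIB 8

struct model_dev {
	int management_fib_count;  /* management FIBs currently outstanding */
};

/* Mirrors the comparison in the hunk: only a count strictly greater
 * than the limit is rejected. The driver reads the counter while
 * holding dev->manage_lock; this model has no concurrency. */
static int model_do_ioctl(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->management_fib_count > MODEL_NUM_MGT_FIB)
		return -EBUSY;
	return 0;  /* the real function would go on to dispatch the command */
}
```

Because the comparison is `>` rather than `>=`, a count equal to the limit is still admitted.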

Comment 21 serveraid 2010-07-27 08:46:07 UTC
We submitted the aacraid-24701 patch to the kernel community, and they identified a remote scenario in which a race condition might occur.
So we modified the 24701 source code following the suggestions from the SCSI community and resubmitted it to them as the aacraid-24702 patch.

James has accepted this patch and pushed it into the 2.6.33 kernel.
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=cacb6dc3d7fea751879a225c15e48228415e6359

Our QA has qualified the 24702 patch against the upstream kernel and RHEL 5U5.

The attached patch is generated against RHEL 5U5. We have marked the previously submitted patches as obsolete.
This patch addresses the following issues.
This patch addresses the following issues. 
Issue1: File System going into read-only mode
---------
Root cause:
The driver does not free the memory (FIB) when a management request exits prematurely. As these un-freed FIBs accumulate, the driver can no longer allocate a FIB and returns 0x70000 to the upper layer, which puts the file system into read-only mode.
Fix details:
The fix frees the memory (FIB) even if the request exits prematurely, so the driver does not run out of FIBs.

Issue2: False RAID Alert
-------
A false RAID alert occurs when physical drives and logical drives are reported as deleted or added even though nothing has changed on the system.
Root cause:
Driver IOCTLs are signaled with EINTR while waiting on a response from the lower layers. Returning "EINTR" never initiates an internal retry.
Fix details:
The issue was fixed by replacing "EINTR" with "ERESTARTSYS" for mid-layer retries.

Please do let us know if we need to provide any more information.

Comment 22 serveraid 2010-07-27 08:51:15 UTC
Comment on attachment 361457 [details]
The attached patch for RHEL5 U5 only

Marking this patch obsolete as we are submitting a new patch.

Comment 23 serveraid 2010-07-27 08:54:50 UTC
Comment on attachment 362301 [details]
The attached patch for RHEL5 U4 and above

Obsolete, as a new patch is available.

Comment 24 serveraid 2010-07-27 09:01:59 UTC
Created attachment 434629 [details]
aac-24702 patch for RHEL5U5

This patch is generated against RHEL-5U5 and addresses the file system read-only problem and the False RAID alert.

Comment 25 Rob Evers 2010-07-27 13:37:01 UTC
Red Hat is only currently accepting critical bug fixes for the aacraid driver.  This bug report only requests a fix for one problem.

Thanks for your enthusiasm to include another fix, but the status of the 2nd problem and fix are not known at this point.  Additionally, a 2nd bugzilla report should be opened up to address that issue.

Please obsolete the patch you attached and re-attach a patch with just the verified upstream fix for the read only file system problem.

Thanks, Rob

Comment 26 serveraid 2010-07-27 14:18:55 UTC
Rob,

Thanks for your quick response. 

We submitted both fixes, for the file system problem (Issue:1) and the false RAID alert problem (Issue:2), in one patch to the SCSI community. It was well tested, accepted by James, and pushed into 2.6.33.

We haven't included any new fixes in this patch other than the fixes accepted by the SCSI community.

We copied only the code changes for those two fixes from the upstream kernel and generated a new patch against RHEL 5U5. This patch, which is already attached to this bz, has been qualified by HCL-QA.

Please let us know if we need to provide any more information. 

Thanks,
Srinivas.

Comment 27 Rob Evers 2010-07-27 15:16:38 UTC
(In reply to comment #26)
> Rob,
> 
> Thanks for your quick response. 
> 
> We have submitted both fixes, which are file system problem(Issue:1) and false
> RAID alert problem (Issue:2) in one patch to SCSI community and it was well
> tested, accepted by James and pushed into 2.6.33. 
> 
> We haven't included any new fixes in this patch other than the fixes accepted
> by the SCSI community. 
> 
> We have copied only those two fixes code changes from upstream kernel and
> generated new patch against RHEL 5U5 and this patch has qualified by HCL-QA,
> which was already attached to this bz. 
> 
> Please let us know if we need to provide any more information. 
> 
> Thanks,
> Srinivas.    

Ok, I might have lost context on this and I think I recall that you are correct that the patch that was accepted addressed 2 issues.

Did you verify that the 2nd issue was actually fixed by the patch?

Rhel5.6 deadlines are a ways out so it might be a bit before I get back to this.

Thanks for the update.

Rob

Comment 28 serveraid 2010-07-28 12:21:17 UTC
Yes, HCL-QA has qualified the patch for the 2nd issue (FRA issue) with RHEL 5U5

Comment 29 serveraid 2010-08-06 14:47:13 UTC
Hi ROB,
 
Good Morning... 

Could you please let us know whether the patch can be pushed into RHEL-5.6 or not?   

Thanks,
Abhilash

Comment 30 Rob Evers 2010-08-06 20:07:05 UTC
(In reply to comment #29)
> Hi ROB,
> 
> Good Morning... 
> 
> Could you please let us know whether the patch can be pushed into RHEL-5.6 or
> not?   
> 
> Thanks,
> Abhilash    

Not yet, but I plan to before rhel5.6 is released, provided there is time for me to get this done.

Rob

Comment 31 Rob Evers 2010-08-20 14:28:00 UTC
(In reply to comment #29)
> Hi ROB,
> 
> Good Morning... 
> 
> Could you please let us know whether the patch can be pushed into RHEL-5.6 or
> not?   
> 
> Thanks,
> Abhilash

Please comment on the testing done with this patch in rhel5.6 (or rhel5.5) and/or attach a test plan that was executed with this patch in place.

Thank you, Rob

Comment 32 serveraid 2010-08-24 14:29:35 UTC

As we will not be able to share the test plan, we are listing the test methodologies QA followed.

HCL QA has tested this patch with the scenarios below:

Read Only Issue:
1.	Run heavy I/O using an I/O tool such as DiskStress.
2.	In parallel, invoke test scripts that issue management commands continuously.

Running the above setup for a week leads to a FIB leak, and the file system hits the read-only issue. After the fix, the issue no longer occurs.


FRA issue:
	This problem is reproduced when IBM management tools have been running on the system for almost a week. The application reports false RAID events, such as a logical array being deleted or created. We tested this by running the setup for several weeks.

Please let us know if any more details are required.

Comment 33 Rob Evers 2010-08-30 15:58:19 UTC
(In reply to comment #32)

Thanks for the update.  The more info you provide regarding your testing, the more confidence I and others will have that the patches you are providing address the issue at hand, and have not introduced any regressions.

The information you have provided is sufficient for this case.

Comment 34 Rob Evers 2010-09-07 15:06:55 UTC
(In reply to comment #24)
> Created attachment 434629 [details]
> aac-24702 patch for RHEL5U5
> 
> This patch is generated against the RHEL-5U5 which will address the file system
> read only problem and False RAID alert.

After applying this patch to a recent rhel5.6 kernel (214), the first time I tested it using dt on the aacraid root filesystem, I saw a system hang and 2 files I checked, the dt-log, and one of the dt-test-files, had corruption at the end of the files.

I have not been able to reproduce this problem after running the same test for 3.5 days.

The dt options used in this test:

  ./dt log=dt.log of=./test limit=1M bs=256k procs=4 flags=direct disable=pstats runtime=2h

The system used:

  dell-pe700-01.rhts.eng.bos.redhat.com - an ia32 system.

Still attempting to reproduce the problem and capture a kdump.

Comment 35 Rob Evers 2010-09-08 13:17:13 UTC
(In reply to comment #34)

Still not able to reproduce this problem.  Posting and calling attention to QE team to prioritize aacraid quality effort on rhel5.6.

Comment 36 serveraid 2010-09-08 15:31:40 UTC
Hi Rob,

Based on your update, we were trying to recreate the issue in our lab.
We are yet to see the problem. It will be helpful if you could give more info on the recreation steps and the setup you had used.

Comment 38 RHEL Product and Program Management 2010-09-08 17:59:17 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 39 Rob Evers 2010-09-08 18:00:00 UTC
(In reply to comment #36)
> Hi Rob,
> 
> Based on your update, we were trying to recreate the issue in our lab.
> We are yet to see the problem. It will be helpful if you could give more info
> on the recreation steps and the setup you had used.

Steps before problem occurred:

Installed system

built kernel rpm w/ patch and rebooted

Ran dt w/ options:
./dt log=dt.log of=./test limit=1M bs=256k procs=4 flags=direct disable=pstats runtime=2h

Host/adapter info follows.  Let me know if you need more info and specifically what.

[root@dell-pe700-01 ~]# lspci | grep AAC
02:01.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
[root@dell-pe700-01 ~]# man lspci
[root@dell-pe700-01 ~]# lspci -n | grep '02:01.0'
02:01.0 0104: 9005:0285 (rev 01)
[root@dell-pe700-01 ~]# 

from /var/log/messages:

Sep  7 11:36:53 dell-pe700-01 kernel: Adaptec aacraid driver 1.1-5[24702]
Sep  7 11:36:53 dell-pe700-01 kernel: ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 185
Sep  7 11:36:53 dell-pe700-01 kernel: AAC0: kernel 4.1-0[7417]
Sep  7 11:36:53 dell-pe700-01 kernel: AAC0: monitor 4.1-0[7417]
Sep  7 11:36:53 dell-pe700-01 kernel: AAC0: bios 4.1-0[7417]
Sep  7 11:36:53 dell-pe700-01 kernel: AAC0: serial BAA946
Sep  7 11:36:53 dell-pe700-01 kernel: scsi0 : aacraid
Sep  7 11:36:53 dell-pe700-01 kernel:   Vendor: CERC      Model: r5d3              Rev: V1.0
Sep  7 11:36:53 dell-pe700-01 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
Sep  7 11:36:54 dell-pe700-01 kernel: SCSI device sda: 468614400 512-byte hdwr sectors (239931 MB)
Sep  7 11:36:54 dell-pe700-01 kernel: sda: Write Protect is off
Sep  7 11:36:54 dell-pe700-01 kernel: SCSI device sda: drive cache: write through
Sep  7 11:36:54 dell-pe700-01 kernel:  sda: sda1 sda2
Sep  7 11:36:54 dell-pe700-01 kernel: sd 0:0:0:0: Attached scsi removable disk sda
Sep  7 11:36:54 dell-pe700-01 kernel:   Vendor: CERC      Model: d1ro              Rev: V1.0
Sep  7 11:36:54 dell-pe700-01 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
Sep  7 11:36:54 dell-pe700-01 kernel: SCSI device sdb: 234307200 512-byte hdwr sectors (119965 MB)
Sep  7 11:36:54 dell-pe700-01 kernel: sdb: Write Protect is off
Sep  7 11:36:54 dell-pe700-01 kernel: SCSI device sdb: drive cache: write through
Sep  7 11:36:54 dell-pe700-01 kernel: SCSI device sdb: 234307200 512-byte hdwr sectors (119965 MB)


[root@dell-pe700-01 host2]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.40GHz
stepping        : 9
cpu MHz         : 3391.725
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 6783.45

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.40GHz
stepping        : 9
cpu MHz         : 3391.725
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 6782.02

[root@dell-pe700-01 host2]#

Comment 40 serveraid 2010-09-10 13:23:14 UTC
We have been running the dt tool for the past 2.5 days in our lab with the options specified in your comment. We are using an IBM x3650 and an IBM x3550 machine for testing. We have not observed any problems on either machine yet. We will keep you updated.

Comment 41 Rob Evers 2010-09-13 20:17:36 UTC
FYI - The system I saw the problem on is a dell poweredge 700 (i386)

Comment 42 serveraid 2010-09-20 06:33:42 UTC
From the logs you have posted, it looks like you are using very old firmware (AAC0: kernel 4.1-0[7417]). Could you please update to the latest firmware and try to reproduce this issue? Also, please provide details about the RAID controller you were using when the problem occurred (e.g. 8k, 8k-l or 8s).

Comment 43 Rob Evers 2010-09-20 15:37:11 UTC
Vendor ID 9005
Device ID 0285
Subsys Vendor ID 1028
Subsys Device ID 0291

Since I couldn't reproduce the problem after trying for many days, I am not going to try again.  I am depending on external testing to see if the patch in this bugzilla has introduced any regressions, such as the single case I observed.

I have requested that the firmware be updated on the raid adapter.

Comment 45 Jarod Wilson 2010-09-21 20:59:09 UTC
in kernel-2.6.18-223.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 47 serveraid 2010-09-22 13:50:51 UTC
The HCL QA team has started testing kernel-2.6.18-223.el5. We will keep you updated on this.

Comment 48 serveraid 2010-09-22 14:24:57 UTC
We have created an aacraid 24702 patch for RHEL6 (kernel version 2.6.32-44.2).
Could you please open a new bug ID for this, so that we can submit the aacraid 24702 patch?

Comment 49 Rob Evers 2010-09-22 14:42:48 UTC
(In reply to comment #48)
> We have created an aacraid 24702 patch for RHEL6 (kernel version 2.6.32-44.2).
> Could you please open a new bug ID for this, so that we can submit the aacraid
> 24702 patch?

As far as I know, the update that addresses the read-only filesystem problem is already in rhel6.

What specific problems are you addressing?  Also, I think you can open your own bug(s).

Rob

Comment 50 serveraid 2010-09-23 14:45:04 UTC
Can you please provide the download link for Rhel6 source rpm, so that we can verify that fixes are included?

Comment 51 Andrius Benokraitis 2010-09-23 14:49:30 UTC
(In reply to comment #50)
> Can you please provide the download link for Rhel6 source rpm, so that we can
> verify that fixes are included?

Please contact me directly for this.

Comment 52 serveraid 2010-09-23 14:58:28 UTC
We have created an aacraid 24702 patch for RHEL6 (kernel version 2.6.32-44.2).
Could you please open a new bug ID for this, so that we can submit the aacraid 24702 patch?

Comment 53 serveraid 2010-09-23 15:01:26 UTC
Sorry for the last comment (Comment 52). Please consider it as duplicate.

Comment 54 serveraid 2010-10-01 14:15:56 UTC
The HCL QA team has tested the aacraid 24702 driver in kernel-2.6.18-223.el5 and has not found any problems.

Comment 55 coschult 2010-10-26 20:49:19 UTC
I was unable to reproduce this issue on an x3650 running a freshly installed 64-bit rhel 5.5.

I ran arcconf getstatus in an infinite loop, while running dt at the same time, for a week.

04:00.0 RAID bus controller: Adaptec AAC-RAID (Rocket) (rev 02)
Adaptec aacraid driver 1.1-5[2461]
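
For anyone wanting to repeat this kind of test, the following is a minimal sketch of such a loop, not the script actually used. It assumes the dt options quoted in comment 39 and assumes controller number 1 for arcconf; the function name `run_stress` and the `STATUS_CMD`, `LOAD_CMD` and `POLL_INTERVAL` variables are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Sketch only: poll the RAID controller while an I/O workload is running.
# The controller number "1" passed to arcconf and the POLL_INTERVAL variable
# are assumptions for illustration; adjust both for the system under test.

run_stress() {
    status_cmd=${STATUS_CMD:-"arcconf getstatus 1"}
    load_cmd=${LOAD_CMD:-"./dt log=dt.log of=./test limit=1M bs=256k procs=4 flags=direct disable=pstats runtime=2h"}
    interval=${POLL_INTERVAL:-5}

    # Start the workload in the background (unquoted on purpose so the
    # command string is split into words).
    $load_cmd &
    load_pid=$!

    polls=0
    # Keep polling for as long as the workload process is alive.
    while kill -0 "$load_pid" 2>/dev/null; do
        $status_cmd >/dev/null 2>&1 ||
            echo "status command failed on poll $polls" >&2
        polls=$((polls + 1))
        sleep "$interval"
    done

    # Collect the workload's exit status and report it.
    wait "$load_pid"
    load_rc=$?
    echo "workload exit=$load_rc polls=$polls"
    return "$load_rc"
}
```

Calling `run_stress` from a directory on the aacraid-backed filesystem runs one dt pass with the status command polled alongside it; wrapping the call in an outer loop would extend the run to days or weeks.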

Comment 56 Chris Ward 2010-11-09 13:40:39 UTC
~~ Attention Customers and Partners - RHEL 5.6 Public Beta is now available on RHN ~~

A fix for this 'OtherQA' BZ should be present and testable in the release. 

If this Bugzilla is verified as resolved, please update the Verified field above with an appropriate value and include a summary of the testing executed and the results obtained.

If you encounter any issues or have questions while testing, please describe them and set this bug into NEED_INFO. 

If you encounter new defects or have additional patches to request for inclusion, promptly escalate the new issues through your support representative.

Finally, future Beta kernels can be found here:
 http://people.redhat.com/jwilson/el5/

Note: Bugs with the 'OtherQA' keyword require Third-Party testing to confirm the request has been properly addressed. See: https://bugzilla.redhat.com/describekeywords.cgi#OtherQA

Comment 59 errata-xmlrpc 2011-01-13 20:53:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

Comment 60 coschult 2011-01-31 18:58:10 UTC
I tested kernel 2.6.18-238 (from people.redhat.com/jwilson) and found no regressions. 

I should note here, that I never was able to reproduce the original problem, so I can't verify the bugfix itself; I can only verify that this newer kernel seems to work ok.

Comment 61 Rob Evers 2011-02-01 13:42:25 UTC
(In reply to comment #60)
> I tested kernel 2.6.18-238 (from people.redhat.com/jwilson) and found no
> regressions. 
> 
> I should note here, that I never was able to reproduce the original problem, so
> I can't verify the bugfix itself; I can only verify that this newer kernel
> seems to work ok.

It should be noted that HCL could reproduce this problem and should follow up to determine whether the fix is complete.

