Bug 523920
Summary: | [Adaptec/HCL 5.6 bug] Problems with aacraid - File system going into read-only. | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | serveraid <ServeRAIDDriver> | ||||||||
Component: | kernel | Assignee: | Rob Evers <revers> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Storage QE <storage-qe> | ||||||||
Severity: | urgent | Docs Contact: | |||||||||
Priority: | high | ||||||||||
Version: | 5.6 | CC: | andriusb, bdonahue, coschult, coughlan, cward, djeffery, fbijlsma, jjarvis, jwest, jwsanta, revers, sbest, ServeRAIDDriver, syeghiay, tao | ||||||||
Target Milestone: | rc | Keywords: | OtherQA | ||||||||
Target Release: | --- | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: |
Issue1: File System going into read-only mode
---------
Root cause:
-----------
The driver does not always free the memory (FIB) when a management request exits prematurely. The accumulation of such unfreed FIBs eventually causes the driver to fail to allocate any more memory (FIB) and to return the value 0x70000 to the upper layer, which puts the file system into read-only mode.
Fix details:
------------
The fix frees the memory (FIB) even if the request exits prematurely, ensuring that the driver does not run out of memory (FIBs).
Issue2:
-------
A false RAID alert occurs when physical drives and logical drives are reported as deleted or added even though nothing has changed on the system.
Root cause:
-----------
Driver IOCTLs are signaled with EINTR while waiting on a response from the lower layers. Returning “EINTR” never initiates an internal retry.
Fix details:
------------
The issue was fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.
|
Story Points: | --- | ||||||||
Clone Of: | |||||||||||
: | 624713 (view as bug list) | Environment: | |||||||||
Last Closed: | 2011-01-13 20:53:34 UTC | Type: | --- | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Description
serveraid
2009-09-17 08:13:55 UTC
Rob - This is the first Bugzilla entry I have created for Red Hat, so please let me know if it is OK.
Created attachment 361457 [details]
The attached patch for RHEL5 U5 only
Rob - Please review the patch and let me know if it is OK. If it is, I will go ahead and create Bugzilla entries for the other releases as well.
The attached patch was generated for the following issues:
Issue:1
--------
The ternary operation in function aac_send_raw_srb() was observed to behave
incorrectly in the 64-bit version. The cause was a missing pair of
parentheses in the condition that checks the sg count.
Fix details:
-------------
Fixed by adding parentheses.
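The class of bug described in Issue:1 can be illustrated with a small stand-alone sketch. This is not the code from aac_send_raw_srb(); the function names and the sizes HDR_SIZE and SG_ENTRY_SIZE are hypothetical and exist only to show how a ternary without parentheses silently changes a size calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration; not taken from the aacraid sources. */
enum { HDR_SIZE = 64, SG_ENTRY_SIZE = 8 };

/*
 * Buggy form: '+' binds tighter than '?:', so the expression parses as
 *   (HDR_SIZE + sg_count) ? sg_count * SG_ENTRY_SIZE : 0
 * The header size ends up in the condition and is lost from the result.
 */
static size_t fib_size_buggy(size_t sg_count)
{
    return HDR_SIZE + sg_count ? sg_count * SG_ENTRY_SIZE : 0;
}

/* Fixed form: parentheses keep the ternary as one operand of the sum. */
static size_t fib_size_fixed(size_t sg_count)
{
    return HDR_SIZE + (sg_count ? sg_count * SG_ENTRY_SIZE : 0);
}
```

With four scatter-gather entries the two forms disagree (32 versus 96 bytes in this sketch), which is the kind of mismatch a missing parenthesis produces. Compilers can flag such expressions with precedence warnings, so a warning-enabled build is a cheap guard.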
Issue:2
--------
Driver IOCTLs are signaled with EINTR while waiting on a response from the
lower layers. Returning “EINTR” never initiates an internal retry.
Fix details:
-------------
Fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.
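The difference between the two return codes in Issue:2 can be modelled in user space. The sketch below is an assumption-laden simulation, not driver code: sim_driver_ioctl() and sim_syscall() are hypothetical names, and the model always restarts on ERESTARTSYS, whereas a real kernel restarts only when the signal disposition allows it (for example SA_RESTART) and otherwise hands EINTR to user space. The value 512 matches the kernel-internal ERESTARTSYS constant.

```c
#include <assert.h>
#include <errno.h>

#define SIM_ERESTARTSYS 512   /* same value as the kernel-internal ERESTARTSYS */

struct sim_ioctl {
    int signals_pending;   /* how many waits will be interrupted by a signal */
    int interrupted_code;  /* code the driver returns when interrupted */
    int attempts;          /* how many times the driver entry point ran */
};

/* Hypothetical driver entry point: fails while signals are pending. */
static int sim_driver_ioctl(struct sim_ioctl *s)
{
    assert(s);
    s->attempts++;
    if (s->signals_pending > 0) {
        s->signals_pending--;
        return -s->interrupted_code;
    }
    return 0;
}

/*
 * Simplified syscall layer: -ERESTARTSYS means "handle the signal, then
 * call the driver again"; any other code goes straight back to the caller.
 */
static int sim_syscall(struct sim_ioctl *s)
{
    for (;;) {
        int rc = sim_driver_ioctl(s);
        if (rc != -SIM_ERESTARTSYS)
            return rc;
    }
}
```

In this model a driver that returns EINTR fails on the first interrupted wait, while one that returns ERESTARTSYS is re-entered until the request completes, which is the retry behaviour the fix relies on.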
Issue:3
--------
The driver does not always free the memory (FIB) when a management
request exits prematurely. The accumulation of such unfreed FIBs eventually
causes the driver to fail to allocate any more memory (FIB) and to return the
value 0x70000 to the upper layer, which puts the file system into read-only mode.
Fix details:
-------------
The fix frees the memory (FIB) even if the request exits
prematurely, ensuring that the driver does not run out of memory (FIBs).
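The leak in Issue:3 can be sketched as a fixed-size pool whose entries are never returned on the early-exit path. This is a simplified model rather than the actual patch: POOL_SIZE, fib_alloc(), fib_free() and mgmt_request() are hypothetical names, and the real driver manages FIBs with considerably more state.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE 8   /* hypothetical pool size, kept small for the demo */

struct fib_pool {
    int in_use;       /* FIBs currently handed out */
};

static bool fib_alloc(struct fib_pool *pool)
{
    if (pool->in_use >= POOL_SIZE)
        return false;             /* pool exhausted */
    pool->in_use++;
    return true;
}

static void fib_free(struct fib_pool *pool)
{
    assert(pool->in_use > 0);
    pool->in_use--;
}

/*
 * Model of one management request.
 * Returns 0 on success, -1 if the wait was interrupted, -2 if no FIB
 * could be allocated. free_on_early_exit selects the fixed behaviour.
 */
static int mgmt_request(struct fib_pool *pool, bool interrupted,
                        bool free_on_early_exit)
{
    if (!fib_alloc(pool))
        return -2;
    if (interrupted) {
        if (free_on_early_exit)
            fib_free(pool);       /* fixed path: release before leaving */
        return -1;                /* buggy path leaks the FIB here */
    }
    fib_free(pool);
    return 0;
}
```

After POOL_SIZE interrupted requests on the leaky path, every later allocation fails, which corresponds to the point where the driver starts returning errors to the upper layer. On the fixed path the pool stays empty no matter how many requests are interrupted.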
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Issue:1
--------
Behavior of the ternary operation in function aac_send_raw_srb () was observed incorrect in 64-bit version. This issue was because of missing parenthesis in the condition to check the sg count.
Fix details:
-------------
Fixed by adding parentheses.
Issue:2
--------
Driver IOCTLs is signaled with EINTR while waiting on response from the lower layers. Returning “EINTR” will never initiate internal retry.
Fix details:
-------------
Fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.
Issue:3
--------
The driver tends to not free the memory (FIB) when the management request exits prematurely. The accumulation of such un-freed memory causes the driver to fail to allocate anymore memory (FIB) and hence return 0x70000 value to the upper layer, which puts the file system into read only mode.
Fix details:
-------------
The fix makes sure to free the memory(FIB) even if the request exits prematurely hence ensuring the driver wouldn’t run out of memory(FIBs)
Release notes are targeted at customers so you need to indicate how a patch, or parts of a patch in this case, fix(es) a particular symptom that the customer would be observing.
Guidelines for patches:
Any changes reflected in patches need to be upstream, or at least posted to linux-scsi mailing list and generally reviewed and ready to be accepted by James Bottomley.
Something to keep in mind when submitting code upstream is that individual problems should be solved by individual patches, as this makes review much easier. Patch sets are ok as well (even preferred) where a description of each problem accompanies each patch, but large patches that bundle multiple problem fixes together are much more difficult to review, and therefore, more difficult to get accepted upstream into scsi-misc (submitted via email to linux-scsi).
It is not required that you need to break this patch into multiple patches here, though that would be easier to review, and get accepted. It may also be easier to describe the changes and how they relate to the code changes in the patch. Detailed explanations of particular bug fixes and the accompanying code changes are helpful when submitting patches here.
I applied the patch to a pretty recent rhel5.4-ish kernel tree and it failed:
[revers@revers-desk kernel]$ patch -p1 --dry-run < ../patches/bz523920/rhel5_u5_aac24701.patch
patching file drivers/scsi/aacraid/aachba.c
patching file drivers/scsi/aacraid/aacraid.h
patching file drivers/scsi/aacraid/commctrl.c
patching file drivers/scsi/aacraid/comminit.c
patching file drivers/scsi/aacraid/commsup.c
patching file drivers/scsi/aacraid/dpcsup.c
Hunk #2 FAILED at 125.
Hunk #3 succeeded at 240 (offset -2 lines).
1 out of 4 hunks FAILED -- saving rejects to file drivers/scsi/aacraid/dpcsup.c.rej
What tree did you generate the patch from?
Also, use 'diff -purN' or something equivalent to generate patches. Nice to see stuff like c labels to reference where patches apply in the patch itself.
Some explanation of how the patch was tested against the problems reported along with regression testing, is important to get patches accepted into rhel releases.
There are white space problems in lines 238, 426, and 432. No trailing tabs. You might want to dig up a program called patch-check that can flag these issues. Problems like this will most likely be rejected upstream. Other upstream style: 80+ character lines.
Rob - Thank you very much for your feedback.
Can you please let me know the code tree locations of the latest RHEL4.9, RHEL5.4, RHEL5.5?
Created attachment 362301 [details]
The attached patch for RHEL5 U4 and above
The attached patch was generated against "kernel-2.6.18-164.el5.src.rpm". Please apply it on RHEL5 U4 and RHEL5 U5 and let us know if you face any issues while applying the patch.
Regarding testing:
------------------
This particular problem was reported by Cisco and SAP. Cisco reported on RHEL4 U6 and SAP reported on SLES9 SP4 and SLES10 SP2. We added these fixes on RHEL4 U6 and gave a private build to IBM and Cisco. Cisco and IBM tested it for more than 15 days and they reported that they did not see the issue so far. Before the fix, Cisco used to see the issue within 5 days. We generated a patch for SLES9 SP4 and SLES10 SP2 and submitted to Novell. Novell applied the patch and gave a test build to SAP. SAP tested and reported that it is working properly.
We also tested in our lab using "dishogsync", an IO stress tool provided by Cisco.
Please let me know if you need more information; I am ready to provide it.
Thank you very much for your humble/good support.
Here is some explanation on the issue:
----------------------------------------
1. What is the root cause? I personally would like to get the details on how you arrived at this root cause since we have come a long way from where you started.
Response: The problem is the result of an accumulation of a "set" series of events that deplete the resources available for IO requests. The issue revolves around the handling of management requests. For clarity, management requests are not confined to arcconf commands. A management request is defined in this environment as any "non-IO" command. This includes arcconf commands, but also any internal status requests made by the FW, real-time clock synchronization of the RAID FW to the server, and commands tied to communications arbitration in the stack. Essentially, anything other than blocks of data is considered a management command in this description.
When the code stream is handling one of these management requests, it waits for the FW to respond. If an interrupt occurs while the management request is being serviced, the code stream jumps away from waiting on the management command and services the interrupt. When the interrupt has been serviced and the code jumps back to finish the original management command, the FIB (Firmware Interface Block) associated with the command is left in an unassociated state instead of being completed or cleared. A new FIB is generated to finish the original command, but the one instance of the FIB is left in a hung state.
The SCSI middle layer is assigned 238 FIB resources out of a subsystem total of 400. As this scenario happens multiple times, the resources for IO become limited as these essentially hung FIBs accumulate and take up resources needed by the IO FIBs. This will happen faster on a very busy system, but it can also happen on a system running lower levels of stress; statistically it will take longer, and the "bullet hitting a bullet" scenario with the interrupt is less likely to occur as well.
When the FIB resources get limited due to a large number of essentially hung FIBs, the driver becomes unable to assign a FIB to an IO. When a write IO fails as a result, the driver produces a 0x70000 error and then retries the command 5 times. If in the course of the 5 attempts it manages to get resources, the system continues to run. If it does not, the OS write fails and the OS goes into "read only" mode.
To fix this, we identified that there was a FIB resource issue, discovered the hung FIBs, and modified the code to allow the FIB resources to be released properly in the event of an interrupt displacing a management command.
2. I need to see some evidence that this IS the root cause and not an accidental symptomatic match to the problems VTG’s customers saw. As part of this, please provide the details of other suspects and how they were ruled out.
Response: IBM created a script to demonstrate a "proof of concept" for the problem scenario we identified as causing the "read only" situation. IBM also created a new driver with the issue fixed in it. The new driver also had debug code embedded in it that prints messages to a log to inform development that the scenario above, in which a management command was interrupted by an interrupt, did in fact happen. Systems with multiple log entries of this type demonstrate that the scenario is actually happening.
Let's discuss the script in detail first. The script is not representative of customer data flow; instead it issues arcconf "getstatus" commands repeatedly. Dozens, up to hundreds, of these commands are generated to add so many management commands to the data stream that the issue has to happen from a statistical point of view. Using an unpatched driver, the system falls into "read only" mode pretty quickly. Running the patched driver should, and does, allow FIBs to be cleared properly, and even running the script should not fail the system.
In the note above, Cisco made the statement that no difference was seen in failure rates between FW levels 418 and 427. When running the script, this is true because the script is designed to overwhelm the resources in such a way that failure will almost certainly happen. In real-world environments, this doesn't appear to be true based on the history we have seen. 421 seems much less likely to fail based on what we have seen, and the problem seems to have been more likely on 427. As you move from one code level to another, subtle timings and code efficiencies appear to be just enough that the likelihood of encountering the timing issue increases or decreases, but the script essentially bypasses all that and bombards the code path with management commands.
The patched driver has been run under standard stress tools at IBM and at Cisco. Logs gathered from our test beds confirm, via debug messages embedded in the test drivers, that the scenario of an interrupt occurring while a management command is being serviced has occurred multiple times on a system. All unused FIBs were freed by the change added to the driver, and the system did not go into read only. No system has encountered an issue of this type using the new driver.
The patch applies cleanly to today's rhel5.5, though haven't built it yet.
A few followup items from comment 4:
Have the changes in this patch been accepted into James Bottomley's scsi-misc-2.6 git repository or has it been posted onto the linux-scsi mailing list as a step towards being accepted?
Regarding testing described in comments 6 & 7, the testing reported only seems to apply to one item but three are listed in comment 3. The release notes need to be updated such that a customer will be able to digest the update. Rob - The changes have been posted onto the linux-scsi mailing list for acceptance as a first step as you said. Regarding testing, all issues are interlinked each other. All the issues were happended because of down_interruptable. We have generated and given a patch with all these changes to Novell. Novell applied the patch on SLES9 SP4 and SLES10 sp2 and gave to SAP. SAP reported the patch is working properly. Our QA is also testing things on the below OSes: 1. RHEL4.6, RHEL4.7, and RHEL4.8 (on 32 and 64 bits) 2. RHEL5.2, RHEL5.3, and RHEL5.4 (on 32 and 64 bits) 3. SLES9 SP2, SP3, and SP4 (on 32 and 64 bits) 4. SLES10, SLES10 SP1 and SLES10 SP2 (on 32 and 64 bits) 5. SLES11 (on 32 and 64 bits). Please let me know if any additional info is required? Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1,30 +1,23 @@ -Issue:1 --------- - Behavior of the ternary operation in function aac_send_raw_srb () was -observed incorrect in 64-bit version. This issue was because of missing -parenthesis in the condition to check the sg count. +Issue1: File System going into read-only mode +--------- +Root cause: +----------- + The driver tends to not free the memory (FIB) when the management request exits prematurely. The accumulation of such un-freed memory causes the driver to fail to allocate anymore memory (FIB) and hence return 0x70000 value to the upper layer, which puts the file system into read only mode. + Fix details: -------------- - Fixed by adding parentheses. 
+------------ + The fix makes sure to free the memory (FIB) even if the request exits prematurely hence ensuring the driver wouldn’t run out of memory (FIBs). -Issue:2 --------- - Driver IOCTLs is signaled with EINTR while waiting on response from the -lower layers. Returning “EINTR” will never initiate internal retry. -Fix details: -------------- - Fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries. +Issue2: +------- + False Raid Alert occurs- when the Physical Drives and Logical drives are reported as deleted or added, even though there is no change done on the system -Issue:3 --------- - The driver tends to not free the memory (FIB) when the management -request exits prematurely. The accumulation of such un-freed memory causes the -driver to fail to allocate anymore memory (FIB) and hence return 0x70000 value -to the upper layer, which puts the file system into read only mode. +Root cause: +----------- + Driver IOCTLs is signaled with EINTR while waiting on response from the lower layers. Returning “EINTR” will never initiate internal retry. Fix details: -------------- +------------ - The fix makes sure to free the memory(FIB) even if the request exits + The issue was fixed by replacing “EINTR” with “ERESTARTSYS” for mid-layer retries.-prematurely hence ensuring the driver wouldn’t run out of memory(FIBs) Rob - Can you please let us know the kernel version and location source code of RHEL6 so that I would be able to generate and submit a patch for the issues we fixed? ServerRAIDDriver - Please wait until alpha 2 is available for rhel6. Andrius Benokraitis <andriusb> will inform when that is ready, and provide pointers directly. note that a different bz should be opened for rhel6.0 patches Is another patch supposed to be posted to linux-scsi for these fixes in response to the email below? http://marc.info/?l=linux-scsi&m=125431052313241&w=2 Currently this patch is not upstream and appears to need some work. 
Once this work is done, please repost upstream. Once accepted upstream, please repost a backport of the accepted upstream patch(es) here and obsolete the currently attached patch. Thanks.
A version of the changes has been merged for 2.6.33:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=cacb6dc3d7fea751879a225c15e48228415e6359
Changes need to be accepted upstream and tested in that form against a rhel release before they are eligible for acceptance. Differences exist between commit cacb6dc3d7fea751879a225c15e48228415e6359 and the 2nd patch posted in bz523920. See one example below. (Assuming the first patch in the bz is obsolete, please mark as obsolete).
Was testing done against the commit that is upstream or the 2nd patch posted in the bugzilla?
Please test the upstream version of the patch against rhel5.5 under the failure scenario and then attach a patch generated from rhel5.5 that matches upstream.
Rob
@@ -842,13 +842,22 @@ static int aac_get_pci_info(struct aac_d
 int aac_do_ioctl(struct aac_dev * dev, int cmd, void __user *arg)
 {
 	int status;
+	unsigned long mflags;
 	/*
 	 *	HBA gets first crack
 	 */
+	spin_lock_irqsave(&dev->manage_lock, mflags);
+	if (dev->management_fib_count > AAC_NUM_MGT_FIB) {
+		printk(KERN_INFO "No management Fibs Available:%d\n",
+			dev->management_fib_count);
+		spin_unlock_irqrestore(&dev->manage_lock, mflags);
+		return -EBUSY;
+	}
+	spin_unlock_irqrestore(&dev->manage_lock, mflags);
 	status = aac_dev_ioctl(dev, cmd, arg);
-	if(status != -ENOTTY)
+	if (status != -ENOTTY)
 		return status;
 	switch (cmd) {
diff -purN a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
We have submitted the aacraid-24701 patch to kernel community and they identified a remote scenario in which race condition might occur. So we have modified 24701 source code with suggestions given by SCSI community and resubmitted the aacraid-24702 patch to the SCSI community. James has accepted this patch and pushed into 2.6.33 kernel.
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=cacb6dc3d7fea751879a225c15e48228415e6359
Our QA has qualified 24702 patch against the upstream kernel and RHEL 5U5. The attached patch is generated against RHEL 5U5. We have moved the patches which were submitted earlier to obsolete.
This patch addresses the following issues.
Issue1: File System going into read-only mode
---------
Root cause: The driver tends to not free the memory (FIB) when the management request exits prematurely. The accumulation of such un-freed memory causes the driver to fail to allocate anymore memory (FIB) and hence return 0x70000 value to the upper layer, which puts the file system into read only mode.
Fix details: The fix makes sure to free the memory (FIB) even if the request exits prematurely hence ensuring the driver wouldn't run out of memory (FIBs).
Issue2: False Raid Alert
-------
False Raid Alert occurs- when the Physical Drives and Logical drives are reported as deleted or added, even though there is no change done on the system
Root cause: Driver IOCTLs is signaled with EINTR while waiting on response from the lower layers. Returning "EINTR" will never initiate internal retry.
Fix details: The issue was fixed by replacing "EINTR" with "ERESTARTSYS" for mid-layer retries.
Please do let us know if we need to provide any more information.
Comment on attachment 361457 [details]
The attached patch for RHEL5 U5 only
Obsolete patch as we are submitting new patch.
Comment on attachment 362301 [details]
The attached patch for RHEL5 U4 and above
Obsolete as new patch is available.
Created attachment 434629 [details]
aac-24702 patch for RHEL5U5
This patch is generated against the RHEL-5U5 which will address the file system read only problem and False RAID alert.
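The aac_do_ioctl() hunk quoted earlier in this thread rejects a new management ioctl with -EBUSY once too many management FIBs are outstanding. A minimal user-space sketch of that admission check follows. The names sim_dev and sim_ioctl_gate() are hypothetical, the limit of 8 is an assumed value for illustration, and the locking from the hunk (manage_lock) is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <errno.h>

#define SIM_NUM_MGT_FIB 8   /* assumed limit for this sketch */

struct sim_dev {
    int management_fib_count;   /* management FIBs currently outstanding */
};

/*
 * Admission check modelled on the quoted hunk: refuse the request with
 * -EBUSY when the outstanding count is already above the limit.
 * The driver performs this check while holding dev->manage_lock.
 */
static int sim_ioctl_gate(const struct sim_dev *dev)
{
    assert(dev);
    if (dev->management_fib_count > SIM_NUM_MGT_FIB)
        return -EBUSY;
    return 0;
}
```

The comparison is strict, as in the hunk, so a count equal to the limit is still admitted. Bounding management requests this way keeps them from consuming the FIBs that regular IO needs.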
Red Hat is only currently accepting critical bug fixes for the aacraid driver. This bug report only requests a fix for one problem. Thanks for you enthusiasm to include another fix, but the status of the 2nd problem and fix are not known at this point. Additionally, a 2nd bugzilla report should be opened up to address that issue. Please obsolete the patch you attached and re-attach a patch with just the verified upstream fix for the read only file system problem. Thanks, Rob Rob, Thanks for your quick response. We have submitted both fixes, which are file system problem(Issue:1) and false RAID alert problem (Issue:2) in one patch to SCSI community and it was well tested, accepted by James and pushed into 2.6.33. We haven't included any new fixes in this patch other than the fixes accepted by the SCSI community. We have copied only those two fixes code changes from upstream kernel and generated new patch against RHEL 5U5 and this patch has qualified by HCL-QA, which was already attached to this bz. Please let us know if we need to provide any more information. Thanks, Srinivas. (In reply to comment #26) > Rob, > > Thanks for your quick response. > > We have submitted both fixes, which are file system problem(Issue:1) and false > RAID alert problem (Issue:2) in one patch to SCSI community and it was well > tested, accepted by James and pushed into 2.6.33. > > We haven't included any new fixes in this patch other than the fixes accepted > by the SCSI community. > > We have copied only those two fixes code changes from upstream kernel and > generated new patch against RHEL 5U5 and this patch has qualified by HCL-QA, > which was already attached to this bz. > > Please let us know if we need to provide any more information. > > Thanks, > Srinivas. Ok, I might have lost context on this and I think I recall that you are correct that the patch that was accepted addressed 2 issues. Did you verify that the 2nd issue was actually fixed by the patch? 
Rhel5.6 deadlines are a ways out so it might be a bit before I get back to this. Thanks for the update. Rob Yes, HCL-QA has qualified the patch for the 2nd issue (FRA issue) with RHEL 5U5 Hi ROB, Good Morning... Could you please let us know whether the patch can be pushed into RHEL-5.6 or not? Thanks, Abhilash (In reply to comment #29) > Hi ROB, > > Good Morning... > > Could you please let us know whether the patch can be pushed into RHEL-5.6 or > not? > > Thanks, > Abhilash Not yet but I plan to before rhel5.6 is release provided there is time for me to get this done. Rob (In reply to comment #29) > Hi ROB, > > Good Morning... > > Could you please let us know whether the patch can be pushed into RHEL-5.6 or > not? > > Thanks, > Abhilash Please comment on the testing done with this patch in rhel5.6 (or rhel5.5) and/or attach a test plan that was executed with this patch in place. Thankyou, Rob As we will not be able to share the test plan we are listing the test methodologies followed for testing by QA. HCL QA has tested this patch with below scenario: Read Only Issue: 1. Running heavy I/O using I/O tool like DiskStress 2. In parallel test scripts were invoked to pump management command continuously. Running the above setup for a week leads to FIB leak and the file systems hits the read only issue. After the fix, the above issue is not occurring FRA issue: This problem is reproduced with IBM management tools are running in the system for almost a week. The application report for false raid events like a logical array is deleted / created. We tested the issue by running the setup for several weeks. Please let us know if any more details is required. (In reply to comment #32) Thanks for the update. The more info you provide regarding your testing, the more confidence I and others will have that the patches you are providing address the issue at hand, and have not introduced any regressions. The information you have provided is sufficient for this case. 
(In reply to comment #24) > Created attachment 434629 [details] > aac-24702 patch for RHEL5U5 > > This patch is generated against the RHEL-5U5 which will address the file system > read only problem and False RAID alert. After applying this patch to a recent rhel5.6 kernel (214), the first time I tested it using dt on the aacraid root filesystem, I saw a system hang and 2 files I checked, the dt-log, and one of the dt-test-files, had corruption at the end of the files. I have not been able to reproduce this problem after running the same test for 3.5 days. The dt options used in this test: ./dt log=dt.log of=./test limit=1M bs=256k procs=4 flags=direct disable=pstats runtime=2h The system used: dell-pe700-01.rhts.eng.bos.redhat.com - an ia32 system. Still attempting to reproduce the problem and capture a kdump. (In reply to comment #34) Still not able to reproduce this problem. Posting and calling attention to QE team to prioritize aacraid quality effort on rhel5.6. Hi Rob, Based on your update, we were trying to recreate the issue in our lab. We are yet to see the problem. It will be helpful if you could give more info on the recreation steps and the setup you had used. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. (In reply to comment #36) > Hi Rob, > > Based on your update, we were trying to recreate the issue in our lab. > We are yet to see the problem. It will be helpful if you could give more info > on the recreation steps and the setup you had used. 
Steps before problem occurred: Installed system built kernel rpm w/ patch and rebooted Ran dt w/ options: ./dt log=dt.log of=./test limit=1M bs=256k procs=4 flags=direct disable=pstats runtime=2h Host/adatper info follows. Let me know if you need more info and specifically what. [root@dell-pe700-01 ~]# lspci | grep AAC 02:01.0 RAID bus controller: Adaptec AAC-RAID (rev 01) [root@dell-pe700-01 ~]# man lspci [root@dell-pe700-01 ~]# lspci -n | grep '02:01.0' 02:01.0 0104: 9005:0285 (rev 01) [root@dell-pe700-01 ~]# from /var/log/messages: Sep 7 11:36:53 dell-pe700-01 kernel: Adaptec aacraid driver 1.1-5[24702] Sep 7 11:36:53 dell-pe700-01 kernel: ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 185 Sep 7 11:36:53 dell-pe700-01 kernel: AAC0: kernel 4.1-0[7417] Sep 7 11:36:53 dell-pe700-01 kernel: AAC0: monitor 4.1-0[7417] Sep 7 11:36:53 dell-pe700-01 kernel: AAC0: bios 4.1-0[7417] Sep 7 11:36:53 dell-pe700-01 kernel: AAC0: serial BAA946 Sep 7 11:36:53 dell-pe700-01 kernel: scsi0 : aacraid Sep 7 11:36:53 dell-pe700-01 kernel: Vendor: CERC Model: r5d3 Rev: V1.0 Sep 7 11:36:53 dell-pe700-01 kernel: Type: Direct-Access ANSI SCSI revision: 02 Sep 7 11:36:54 dell-pe700-01 kernel: SCSI device sda: 468614400 512-byte hdwr sectors (239931 MB) Sep 7 11:36:54 dell-pe700-01 kernel: sda: Write Protect is off Sep 7 11:36:54 dell-pe700-01 kernel: SCSI device sda: drive cache: write through Sep 7 11:36:54 dell-pe700-01 kernel: sda: sda1 sda2 Sep 7 11:36:54 dell-pe700-01 kernel: sd 0:0:0:0: Attached scsi removable disk sda Sep 7 11:36:54 dell-pe700-01 kernel: Vendor: CERC Model: d1ro Rev: V1.0 Sep 7 11:36:54 dell-pe700-01 kernel: Type: Direct-Access ANSI SCSI revision: 02 Sep 7 11:36:54 dell-pe700-01 kernel: SCSI device sdb: 234307200 512-byte hdwr sectors (119965 MB) Sep 7 11:36:54 dell-pe700-01 kernel: sdb: Write Protect is off Sep 7 11:36:54 dell-pe700-01 kernel: SCSI device sdb: drive cache: write through Sep 7 11:36:54 dell-pe700-01 kernel: SCSI device sdb: 
234307200 512-byte hdwr sectors (119965 MB) [root@dell-pe700-01 host2]# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 CPU 3.40GHz stepping : 9 cpu MHz : 3391.725 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr bogomips : 6783.45 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 CPU 3.40GHz stepping : 9 cpu MHz : 3391.725 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 apicid : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr bogomips : 6782.02 [root@dell-pe700-01 host2]# We were running the dt tool for past 2.5 days in our lab with the options specified in your comment. We are using an ibm x3650 and an ibm x3550 machine for testing. In both the machines we haven’t observed any problems yet. We will keep you updated. FYI - The system I saw the problem on is a dell poweredge 700 (i386) From the logs you have posted looks like you are using a very old firmware (AAC0: kernel 4.1-0[7417]). Could you please update to the latest firmware and try to reproduce this issue. Also please provide as details about the RAID controller you used while the problem occurred (eg 8k, 8k-l or 8s). Vendor ID 9005 Device ID 0285 Subsys Vendor ID 1028 Subsys Device ID 0291 Since I couldn't reproduce the problem after trying for many days, I am not going to try again. 
I am depending on external testing to see if the patch in this bugzilla has introduced any regressions as the single case I observed. I have requested that the firmware be updated on the raid adapter. in kernel-2.6.18-223.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed. HCL QA team has started the testing for kernel-2.6.18-223.el5. We will keep you update for this. We have created aacraid 24702 patch for RHEL6 (kernel version 2.6.32-44.2). Could you please open a new BUD id for this, so that we can submit aacraid 24702 patch. (In reply to comment #48) > We have created aacraid 24702 patch for RHEL6 (kernel version 2.6.32-44.2). > Could you please open a new BUD id for this, so that we can submit aacraid > 24702 patch. As far as I know, the update that addresses the read-only filesystem problem is already in rhel6. What specific problems are you addressing? Also, I think you can open your own bug(s). Rob Can you please provide the download link for Rhel6 source rpm, so that we can verify that fixes are included? (In reply to comment #50) > Can you please provide the download link for Rhel6 source rpm, so that we can > verify that fixes are included? Please contact me directly for this. We have created aacraid 24702 patch for RHEL6 (kernel version 2.6.32-44.2). Could you please open a new BUD id for this, so that we can submit aacraid 24702 patch. Sorry for the last comment (Comment 52). Please consider it as duplicate. HCL QA team has tested aacraid 24702 driver in kernel-2.6.18-223.el5 and team has not found any problem. I was unable to reproduce this issue on a x3650, running 64-bit freshly installed rhel 5.5. I ran arcconf getstatus in an infinite loop, while running dt at the same time, for a week. 
04:00.0 RAID bus controller: Adaptec AAC-RAID (Rocket) (rev 02)
Adaptec aacraid driver 1.1-5[2461]
~~ Attention Customers and Partners - RHEL 5.6 Public Beta is now available on RHN ~~
A fix for this 'OtherQA' BZ should be present and testable in the release. If this Bugzilla is verified as resolved, please update the Verified field above with an appropriate value and include a summary of the testing executed and the results obtained.
If you encounter any issues or have questions while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patches to request for inclusion, promptly escalate the new issues through your support representative.
Finally, future Beta kernels can be found here: http://people.redhat.com/jwilson/el5/
Note: Bugs with the 'OtherQA' keyword require Third-Party testing to confirm the request has been properly addressed. See: https://bugzilla.redhat.com/describekeywords.cgi#OtherQA
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0017.html
I tested kernel 2.6.18-238 (from people.redhat.com/jwilson) and found no regressions.
I should note here that I was never able to reproduce the original problem, so I can't verify the bugfix itself; I can only verify that this newer kernel seems to work ok.
(In reply to comment #60)
> I tested kernel 2.6.18-238 (from people.redhat.com/jwilson) and found no
> regressions.
>
> I should note here, that I never was able to reproduce the original problem, so
> I can't verify the bugfix itself; I can only verify that this newer kernel
> seems to work ok.
It should be noted that HCL could reproduce this problem and should follow up determining that the fix is complete. |