Bug 438895
Summary: dell percraid adapter dead issue
Product: Red Hat Enterprise Linux 4
Reporter: Vivek Goyal <vgoyal>
Component: kernel
Assignee: Tomas Henzl <thenzl>
Status: CLOSED NOTABUG
QA Contact: Martin Jenner <mjenner>
Severity: high
Docs Contact:
Priority: high
Version: 4.7
CC: achim_leubner, andriusb, coughlan, duck, jburke, ltroan, mgahagan, qcai, rlerch
Target Milestone: rc
Target Release: 4.8
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The aacraid driver update that was first introduced in Red Hat Enterprise Linux 4.7 requires up-to-date Adaptec PERC3/Di firmware. Subsequent updates of Red Hat Enterprise Linux 4 (including this 4.8 update) require that the PERC3/Di firmware be at version 2.8.1.7692, A13 or newer. The firmware may be obtained at the following location:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R168387&SystemID=PWE_PNT_PIII_1650&servicetag=&os=WNET&osl=en&deviceid=1375&devlib=0&typecnt=0&vercnt=9&catid=-1&impid=-1&formatcnt=4&libid=35&fileid=228550
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-03-16 18:30:42 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 450901, 461297
Attachments:
Description
Vivek Goyal
2008-03-25 20:36:37 UTC
I am seeing the same failure signature on the 68.28.vgoyal.test1 rhts build. The /kernel/memory/nullmap test is failing on i386 dell machines with the percraid adapter. The logs don't say that percraid is dead, but the symptoms of the failure are the same as in the previous failures.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/memory/nullmap&result=Fail&rwhiteboard=kernel%202.6.9-68.28.EL.vgoyal.test1%20hugemem&arch=i386&jobids=18420
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/memory/nullmap&result=Fail&rwhiteboard=kernel%202.6.9-68.28.EL.vgoyal.test1%20smp&arch=i386&jobids=18420

(In reply to comment #0)
> http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2279798

The log says, "aacraid driver (1.1-5[2441]". That is prior to Chip's latest 4.7 posting.

The relevant part of the log appears to be:

Starting rhts: 03/18/08 10:30:51 recipeID:62648 start:
...
03/18/08 11:12:06 JobID:17889 Test:/kernel/drivers/modules
03/18/08 11:12:07 testID:529442 start:
percraid:Fatal Error: See system event log
aacraid: Host adapter reset request. SCSI hang ?
percraid: Host adapter dead -3
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1173
Buffer I/O error on device sda1, logical block 555
lost page write due to I/O error on sda1
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1241
. . .

I reserved this system and did some testing. The same thing happens in RHEL5 with this box:

http://rhts.redhat.com/testlogs/18428/64505/548019/2391992-test_log--kernel-security-selinux-ltp-selinux-20080229-EXTERNALWATCHDOG

I believe this may be a hardware issue, not software.
I opened a rhts-admin RT ticket: https://engineering.redhat.com/rt3/Ticket/Display.html?id=21835 aacraid: Host adapter abort request (0,0,0,0) aacraid: Host adapter abort request (0,0,0,0) aacraid: Host adapter abort request (0,0,0,0) aacraid: Host adapter abort request (0,0,0,0) aacraid: Host adapter abort request (0,0,0,0) aacraid: Host adapter abort request (0,0,0,0) aacraid: Host adapter reset request. SCSI hang ? AAC: Host adapter dead -3 end_request: I/O error, dev sda, sector 1163 end_request: I/O error, dev sda, sector 1231 end_request: I/O error, dev sda, sector 4895437 Buffer I/O error on device dm-0, logical block 585776 sd 0:0:0:0: rejecting I/O to offline device Buffer I/O error on device sda1, logical block 601 lost page write due to I/O error on sda1 Aborting journal on device sda1. journal commit I/O error lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 209229 Buffer I/O error on device dm-0, logical block 0 Aborting journal on device dm-0. journal commit I/O error ext3_abort called. 
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only lost page write due to I/O error on dm-0 Buffer I/O error on device dm-0, logical block 1 lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 2307365 Buffer I/O error on device dm-0, logical block 262267 lost page write due to I/O error on dm-0 Buffer I/O error on device dm-0, logical block 262268 lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 2307397 Buffer I/O error on device dm-0, logical block 262271 lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 2307453 Buffer I/O error on device dm-0, logical block 262278 lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 2307597 Buffer I/O error on device dm-0, logical block 262296 lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 2307653 Buffer I/O error on device dm-0, logical block 262303 lost page write due to I/O error on dm-0 end_request: I/O error, dev sda, sector 2308837 end_request: I/O error, dev sda, sector 2309837 end_request: I/O error, dev sda, sector 2310157 end_request: I/O error, dev sda, sector 4403589 end_request: I/O error, dev sda, sector 4403613 Little more data on this issue. I reserved a different machine with the same adapter and a similar issue happened: percraid:PANIC: length of sg list is too big percraid:Fatal Error: See system event log percraid:NO CORE DUMP, Trace not started. aacraid: Host adapter reset request. SCSI hang ? 
percraid: Host adapter dead -3 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 1679 Buffer I/O error on device sda1, logical block 808 lost page write due to I/O error on sda1 scsi0 (0:0): rejecting I/O to offline device SCSI error : <0 0 0 0> return code = 0x6000000 scsi0 (0:0): rejecting I/O to offline device Buffer I/O error on device sda1, logical block 841 lost page write due to I/O error on sda1 Aborting journal on device sda1. 
end_request: I/O error, dev sda, sector 2521685 Buffer I/O error on device dm-0, logical block 289057 lost page write due to I/O error on dm-0 scsi0 (0:0): rejecting I/O to offline device Buffer I/O error on device dm-0, logical block 289058 lost page write due to I/O error on dm-0 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 287309 Buffer I/O error on device dm-0, logical block 9760 lost page write due to I/O error on dm-0 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 2521469 Buffer I/O error on device dm-0, logical block 289030 lost page write due to I/O error on dm-0 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 2521845 Buffer I/O error on device dm-0, logical block 289077 lost page write due to I/O error on dm-0 SCSI error : <0 0 0 0> return code = 0x6000000 Aborting journal on device dm-0. journal commit I/O error ext3_abort called. EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only EXT3-fs error (device dm-0) in start_transaction: Journal has aborted end_request: I/O error, dev sda, sector 1087 Buffer I/O error on device sda1, logical block 512 lost page write due to I/O error on sda1 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 209237 Buffer I/O error on device dm-0, logical block 1 lost page write due to I/O error on dm-0 scsi0 (0:0): rejecting I/O to offline device Buffer I/O error on device dm-0, logical block 2 lost page write due to I/O error on dm-0 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 217461 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 2306381 SCSI error : <0 0 0 0> return code = 0x6000000 end_request: I/O error, dev sda, sector 2306389 scsi0 (0:0): rejecting I/O to offline device EXT3-fs error (device dm-0) in ext3_reserve_inode_write: 
Journal has aborted
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2306421
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2306853
SCSI error : <0 0 0 0> return code = 0x6000000
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_dirty_inode: Journal has aborted
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device

The above test was done with the 2.6.9-68.19.ELsmp kernel. It seems as if the RHTS /kernel/drivers/modules test triggers the unwanted behavior.

Link to the test RPM:
http://rhts.redhat.com/rpms/development/noarch/noarch/rh-tests-kernel-drivers-modules-3.0-8.noarch.rpm

(In reply to comment #6)
> The above test was done with the 2.6.9-68.19.ELsmp kernel.

That kernel has driver version 1.1-5[2441], the same version as RHEL 4.6. The aacraid update to 1.1.5-2453 for 4.7 is in 2.6.9-68.27.EL.

Jeff, would it be too much trouble to re-test this system with a stock 4.6 install? If it passes, it points to a 4.7 change outside the driver. If it fails, it is not a regression in 4.7.

Eventually, we will want to re-test with 2.6.9-68.27.EL or later, to see if that has a fix.

If this is too much trouble, please turn the system over to us so we can test it.

Tom,
I have the test running now. I will update with the details when it finishes.
Thanks,
Jeff

Tom,
This was with kernel-smp-2.6.9-67.EL running. Unfortunately something is wrong with the serial configuration on those systems in RDU.
rpmdb: fsync Input/output error error: db4 error(5) from db->sync: Input/output error rpmdb: write: 0xb7d0f9b4, 4096: Read-only file system rpmdb: /var/lib/rpm/Basenames: write failed for page 0 rpmdb: write: 0xb7d085fc, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 0 rpmdb: write: 0xb7c8c63c, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 1 rpmdb: write: 0xb7ccd7b4, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4153 rpmdb: write: 0xb7d022cc, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4154 Message from syslogd@dell-pe1650-2 at Fri Apr 4 16:48:41 2008 ... dell-pe1650-2 kernel: journal commit I/O error rpmdb: write: 0xb7c6e6d4, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4155 rpmdb: write: 0xb7d45554, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4156 rpmdb: write: 0xb7d5c104, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4157 rpmdb: write: 0xb7d4d994, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4158 rpmdb: /var/lib/rpm/Basenames: write failed for page 174 rpmdb: read: 0xb7c53904, 4096: Input/output error error: db4 error(5) from dbcursor->c_get: Input/output error error: error(5) getting "ib_addr.ko" records from Basenames index rpmdb: read: 0xb7d0b794, 4096: Input/output error error: db4 error(5) from dbcursor->c_get: Input/output error error: error(5) getting "ib_local_sa.ko" records from Basenames index rpmdb: write: 0xb7d1d09c, 4096: Read-only file system rpmdb: /var/lib/rpm/Packages: write failed for page 4162 I also offlined this machine and could reproduce the issue with RHEL4 U6 kernel (67.EL) by running /kernel/drivers/modules rhts test. Interestingly I saw it happening only with hugmem kernel and not smp kernel (1 try). 
I think I will try again to see if it is a kernel-flavor-related issue.

Jeff, and Chip,

I have been trying to find a similar system in Westford where we can try to reproduce this. I checked clu1, edge2, p750 with no luck. According to some old notes I have, clug (and presumably cluh) have aacraid. If you can find one of them, that would be a good candidate. Otherwise, we should have an aacraid board in the cabinet, or maybe in Chip's cube, that we can put in a system and try to reproduce this.

Tom

The system "clug" in the Westford lab is a pe1650, with the perc HBA. I installed RHEL 4.6 (or .7-beta?, I'm not sure now) and Jeff ran the failing test from RHTS. The problem did not occur. Some additional testing on this box might be the best next step.

I am moving this from 4.7 to 4.8 at this point. This does not appear to be a RHEL 4.7 regression, and we are out of time.

Tom

Saw it again during an RHTS run on both dell-pe1650-1.test.redhat.com and dell-pe1650-2.test.redhat.com:

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/memory/nullmap&result=Fail&rwhiteboard=kernel%202.6.9-75.EL&arch=i386&jobids=24300
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/memory/nullmap&result=Fail&rwhiteboard=kernel%202.6.9-75.EL%20smp&arch=i386&jobids=24300

Created attachment 311203 [details]
dell-pe1650-1 messages & dmidecode
Created attachment 311204 [details]
dell-pe1650-2 messages & dmidecode
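The failure signature quoted in the logs above ("Host adapter dead" followed by I/O errors on sda) can be searched for in a saved messages file. The following is a minimal sketch, not part of the RHTS tooling; the function names and the sample input are illustrative assumptions, and the sample lines are copied from the log excerpt in this report.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of RHTS.

# Count log lines (read from stdin) carrying the adapter-dead signature.
# The driver prints it as either "percraid: Host adapter dead -3" or
# "AAC: Host adapter dead -3", so only the common text is matched.
count_adapter_dead() {
    grep -c 'Host adapter dead'
}

# Count block-layer I/O errors reported against sda.
count_sda_io_errors() {
    grep -c 'end_request: I/O error, dev sda'
}

# Sample lines copied from the log excerpt quoted in this report.
sample='percraid:Fatal Error: See system event log
aacraid: Host adapter reset request. SCSI hang ?
percraid: Host adapter dead -3
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1173
end_request: I/O error, dev sda, sector 1241'

dead=$(printf '%s\n' "$sample" | count_adapter_dead)
ioerr=$(printf '%s\n' "$sample" | count_sda_io_errors)
echo "adapter-dead lines: $dead, sda I/O errors: $ioerr"
```

On a real system the same functions could be fed from a log file, for example `count_adapter_dead < /var/log/messages`.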
(In reply to comment #21)
> Info for dell-pe1650-1:
>
> BIOS (from dmidecode)
> BIOS Information
> Vendor: Dell Computer Corporation
> Version: A11
> Release Date: 10/08/2003
...
> Info for Dell-pe1650-2:
> BIOS (from dmidecode)
> BIOS Information
> Vendor: Dell Computer Corporation
> Version: A05
> Release Date: 03/29/2002

Info for clug, in Westford:

BIOS Information
Vendor: Dell Computer Corporation
Version: A05
Release Date: 03/29/2002

So, Dell-pe1650-2 and clug have the same (old) BIOS firmware.

Again, (In reply to comment #21)
...
> Info for Dell-pe1650-2:
...
> Jul 7 15:29:55 dell-pe1650-2 kernel: Adaptec aacraid driver 1.1-5[2453]

I see from the log that this system is running RHEL 5: kernel 2.6.18-92.el5. That may not matter, Jeff says that the problem is seen with RHEL 5.

Out of curiosity, how often does this RHTS test run on these systems, and with which o.s. version? That is, does the test run frequently on these two systems, on both RHEL 4 and 5, and it only fails very occasionally, and always on RHEL 4?

> Jul 7 15:29:55 dell-pe1650-2 kernel: ACPI: PCI Interrupt 0000:01:08.1[A] -> GSI 18 (level, low) -> IRQ 177
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC0: kernel 2.7-0[3153]
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC0: monitor 2.7-0[3153]
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC0: bios 2.7-0[3153]
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC0: serial CA3021D3
> Jul 7 15:29:55 dell-pe1650-2 kernel: scsi0 : percraid
> Jul 7 15:29:55 dell-pe1650-2 kernel: Vendor: DELL Model: jmo Rev: V1.0
> Jul 7 15:29:55 dell-pe1650-2 kernel: Type: Direct-Access ANSI SCSI revision: 02
> Jul 7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: 106633728 512-byte hdwr sectors (54596 MB)
> Jul 7 15:29:55 dell-pe1650-2 kernel: sda: Write Protect is off
> Jul 7 15:29:55 dell-pe1650-2 rpc.statd[1937]: statd running as root. chown /var/lib/nfs/statd/sm to choose different user
> Jul 7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: drive cache: write back
> Jul 7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: 106633728 512-byte hdwr sectors (54596 MB)
> Jul 7 15:29:55 dell-pe1650-2 kernel: sda: Write Protect is off
> Jul 7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: drive cache: write back
> Jul 7 15:29:55 dell-pe1650-2 kernel: sda: sda1 sda2
> Jul 7 15:29:55 dell-pe1650-2 kernel: sd 0:0:0:0: Attached scsi removable disk sda
> Jul 7 15:29:55 dell-pe1650-2 kernel: Vendor: HITACHI Model: DK32DJ-18MC Rev: D4D4
> Jul 7 15:29:55 dell-pe1650-2 kernel: Type: Direct-Access ANSI SCSI revision: 03
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC:ID(0:00:0); Error Event [command:0xa0]
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
> Jul 7 15:29:55 dell-pe1650-2 kernel: AAC:ID(0:00:0); Invalid Command Operation Code

I do not see these on clug. They are indeed present in the log of the failed RHEL 4 RHTS test run referenced above. I'm having trouble finding these messages in the code, but if "command:0xa0" refers to a SCSI opcode, then that is a Report LUNs. That makes sense at this point, but I do not know why it is failing.

For the record, the test that is running when the adapter appears to die is:

function TestModule ()
{
    MODLIST='lp'
    KPNAME=$kernname-$kernver-$1
    for t in $MODLIST ; do
        # Test the insertion of previous kernel modules
        /sbin/modprobe -r $t
        IMOD=$(rpm -ql $KPNAME | grep /$t.ko)
        insmod $IMOD
        if [ "$?" -ne "0" ]; then
            echo "$IMOD insertion Failed:" | tee -a $OUTPUTFILE
        else
            echo "$IMOD insertion Passed:" | tee -a $OUTPUTFILE
        fi
    done
}

Is the $OUTPUTFILE preserved in RHTS?

Bottom line, more work is needed to reproduce this on clug, or to understand why it is not failing.

Updating PM score.

*** Bug 455268 has been marked as a duplicate of this bug. ***

Observed this issue again during my rhts run.
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4529790

Vivek,
I'd like to test the patch from BZ#453472 even if the chances that it will help are small. Both dell-pe1650-2 and dell-pe1650-1 are used for testing now; could you help me with the reservation?

Vivek,
Do you still see the problem, or has it vanished in the latest kernel?

I have created an RHTS xml script. Using this script I tried the test against RHEL4-U6 and RHEL4-U8-re20090128.1. They both failed the exact same way.

Here is a link to the actual job: http://tinyurl.com/BZ438895

Here are the links to the results:
RHEL4-U6 http://tinyurl.com/aeh9qo
RHEL4-U8-re20090128.1 http://tinyurl.com/bq3wyr

-------- snip --------
01/28/09 21:08:09 JobID:43834 Test:/kernel/drivers/modules Response:1
01/28/09 21:08:10 testID:1247692 start:
percraid:Fatal Error: See system event log
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
-------- snip --------

(In reply to comment #33)
> I have created a RHTS xml script. Using this script I tried the test against
> RHEL4-U6 and RHEL4-U8-re20090128.1. They both failed the exact same way.
>

Jeff, thanks for the testing. Which test have you run with your script? And if it is publicly usable, could you make it public?

Created attachment 330371 [details]
XML file used with RHTS to duplicate issue
Tomas,
No Problem. I will attach the xml script so you have a list of the tests.
I believe the testcase is covered under GPL but it really will not work out side of Red Hat building. It goes to internal servers to grab some packages to test older modules. You could modify it to work externally if you want.
FYI the specific test is /kernel/drivers/modules

> FYI the specific test is /kernel/drivers/modules

That test comes from rh-tests-kernel-drivers-modules-1.1-5.noarch.rpm and is as shown in Comment #25 above.

I just logged in to the system and did:

lsmod
# confirmed that "lp" is loaded.
/sbin/modprobe -r lp
# no problem
lsmod
# it is unloaded
rpm -ql kernel-2.6.9-80.EL | grep lp.ko
/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
/lib/modules/2.6.9-80.EL/kernel/drivers/usb/class/usblp.ko
# Humm. Maybe having two returns here is causing a problem for the script?
insmod /lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
insmod: error inserting '/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko': -1 Invalid module format
# I do not know why that is happening. It seems to happen for all the
# modules I tried.

So, there may be some issues with the test script, and there may be something causing an error on insmod. None of that, of course, explains why the aacraid would appear to go offline. That is just bizarre, considering how simple this test is.

Tomas, please take a closer look at that script, then try to reproduce the problem without RHTS. If it does not reproduce, then we'll need to learn more about setting up and debugging in the RHTS environment.

In the runtest.sh it has a grep command that will only return a single module.

----- snip ------
rpm -ql $KPNAME | grep /$t.ko
----- /snip ------

Looks like when you manually ran it you missed the / before the module. It should have been:

rpm -ql kernel-2.6.9-80.EL | grep /lp.ko

Also, we run this test on every RHEL4 kernel we build. If it was a test issue we would have seen it way before this. Or at least we would have seen it on other systems.

Looking at the actual results from the testing.
I see this:

Results from "2.6.9-80.EL"
Starting ./runtest.sh
Current Test Version = rh-tests-kernel-drivers-modules-3.0
Current Running Kernel Package = kernel-smp-2.6.9-80.EL
Download/Install kernel-smp-2.6.9-5.EL.i686.rpm kernel
Download kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Install kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Download/Install kernel-smp-2.6.9-11.EL.i686.rpm kernel
Download kernel-smp-2.6.9-11.EL.i686.rpm Passed:

Results from "2.6.9-67.EL"
Starting ./runtest.sh
Current Test Version = rh-tests-kernel-drivers-modules-3.0
Current Running Kernel Package = kernel-smp-2.6.9-67.EL
Download/Install kernel-smp-2.6.9-5.EL.i686.rpm kernel
Download kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Install kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Download/Install kernel-smp-2.6.9-11.EL.i686.rpm kernel
Download kernel-smp-2.6.9-11.EL.i686.rpm Passed:

What is interesting is they both seem to fail in the exact same spot: just after installing the lp.ko from 2.6.9-11.EL. Next in the list would have been -22.EL.

(In reply to comment #37)
> insmod /lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
>
> insmod: error inserting '/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko':
> -1 Invalid module format

My mistake. That should be "2.6.9-80.ELsmp", not "2.6.9-80.EL". insmod works fine now.

And yes, Jeff is right about the missing / in the grep command.

I've installed kernel-smp-2.6.9-22.EL.i686.rpm and kernel-smp-2.6.9-11.EL.i686.rpm. On both I'm able to do insmod /lib/modules/2.6.9-11.ELsmp/kernel/drivers/char/lp.ko and insmod /lib/modules/2.6.9-22.ELsmp/kernel/drivers/char/lp.ko.

With 2.6.9-11 I can see in /var/log/messages:

Jan 30 11:40:07 dell-pe1650-1 kernel: ksign: module signed with unknown public key
Jan 30 11:40:07 dell-pe1650-1 kernel: - signature keyid: 975e6fa84e049d8a ver=3

Could that be a problem when running automatic tests?
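The grep mismatch discussed above (pattern `lp.ko` versus `/lp.ko`) can be illustrated without rpm. This is a small sketch using the two paths quoted in the thread as fixed input; the `anchored` variant at the end is an illustrative assumption and is not what runtest.sh does.

```shell
#!/bin/sh
# The two paths returned by "rpm -ql kernel-2.6.9-80.EL | grep lp.ko" in the thread.
paths='/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
/lib/modules/2.6.9-80.EL/kernel/drivers/usb/class/usblp.ko'

t=lp

# Without the leading slash the pattern also matches usblp.ko.
loose_count=$(printf '%s\n' "$paths" | grep -c "$t.ko")

# With the leading slash, as in runtest.sh, only .../char/lp.ko matches.
strict=$(printf '%s\n' "$paths" | grep "/$t.ko")

# Hypothetical stricter pattern (not from the test): escape the dot and
# anchor at end of line so that "." cannot match an arbitrary character.
anchored=$(printf '%s\n' "$paths" | grep "/$t\.ko\$")

echo "loose matches: $loose_count"
echo "strict match:  $strict"
```

With a single module in MODLIST the slash alone is enough to keep `$IMOD` to one path, which is why the manual run saw two results and the test does not.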
Nope that is normal output for the test, see here This is from the log in Comment#32 http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=6184389 ------snip------ Console Log: ksign: module signed with unknown public key - signature keyid: e07bc3e85be30cfd ver=3 lp: driver loaded but no devices found ksign: module signed with unknown public key - signature keyid: d67b3e6b1ed6fec7 ver=3 lp: driver loaded but no devices found ksign: module signed with unknown public key - signature keyid: 975e6fa84e049d8a ver=3 lp: driver loaded but no devices found ksign: module signed with unknown public key - signature keyid: 063efdf11ad6baa3 ver=3 lp: driver loaded but no devices found ksign: module signed with unknown public key - signature keyid: 8301cd821788a86b ver=3 lp: driver loaded but no devices found ksign: module signed with unknown public key - signature keyid: 03629c5f482105a7 ver=3 lp: driver loaded but no devices found lp: driver loaded but no devices found ksign: module signed with unknown public key - signature keyid: 9ada2b4b1ec241de ver=3 lp: driver loaded but no devices found ------snip------ Just out of curiosity. Why are you running it by hand? You should be able to cd into the /mnt/test/kernel/drivers/modules directory and do either a make run or runtest.sh It looks to me that the issue here is not related to module loading/unloading, simply writing some amount of data to /dev/sda1 which happens for example during kernel install (and this is part of module test). There are some errors messages when the system starts : Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0 Vendor: HITACHI Model: DK32DJ-18MC Rev: D4D4 Type: Direct-Access ANSI SCSI revision: 03 percraid:ID(0:00:0); Error Event [command:0xa0] percraid:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0] percraid:ID(0:00:0); Invalid Command Operation Code I'm trying to reinstall it with RHEL5.3, to see if also fails. The 5.3 suffers from the same issue. 
The 2.6.9-67 also has similar problem even if the symptoms are not exactly the same. Kernel 2.6.9-55 seems to be working well, so I'm going to see what differences are between them. Created attachment 331826 [details]
limit sg list length
The issue has been probably caused by the patch "Update aacraid driver to 1.1.5-2453". Further I've found that limiting the size of the sg list helps. See proposed patch.
(In reply to comment #45)
> Further I've found that limiting the size of the sg list helps.

So, have you found a simple way to reproduce this?

As I understand it, RHTS does a complete install, and then runs a certain number of tests with no problem. Then, on one simple "rmmod/insmod lp" test, all I/O to the root disk suddenly fails. An attempt to run the equivalent test outside RHTS does not fail. It would be helpful to find a more direct reproducer.

(In reply to comment #43)
> percraid:ID(0:00:0); Error Event [command:0xa0]
> percraid:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
> percraid:ID(0:00:0); Invalid Command Operation Code

As I mentioned earlier, if "command:0xa0" refers to a SCSI opcode, then that is a Report LUNs. That makes sense at this point (scanning to find devices at module load time). I do not know why it is failing, but the system seems to recover and continue on okay. It might be related to the eventual I/O failure, but it is not clear how.

(In reply to comment #45)
> Created an attachment (id=331826) [details]
> limit sg list length
>
> The issue has been probably caused by the patch "Update aacraid driver to
> 1.1.5-2453".

It will be interesting to see what happens when you back that out.

> Further I've found that limiting the size of the sg list helps.
> See proposed patch.

That causes the RHTS test to succeed? Or it has some other good effect?

(In reply to comment #46)
> (In reply to comment #45)
>
> > Further I've found that limiting the size of the sg list helps.
>
> So, have you found a simple way to reproduce this?

Yes, writing a somewhat larger amount of data to sda1 brings it down immediately:
'dd if=/dev/zero of=/boot/asd/ts.bin count=2k bs=10k'

>
> As I understand it, RHTS does a complete install, and then runs a certain
> number of tests with no problem. Then, on one simple "rmmod/insmod lp" test,
> all I/O to the root disk suddenly fails. An attempt to run the equivalent test
> outside RHTS does not fail. It would be helpful to find a more direct
> reproducer.
>

This "rmmod/insmod lp" test installs several older kernels; the system fails while writing them to /boot/.

>
> (In reply to comment #43)
>
> > percraid:ID(0:00:0); Error Event [command:0xa0]
> > percraid:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
> > percraid:ID(0:00:0); Invalid Command Operation Code
>
> As I mentioned earlier, if "command:0xa0" refers to a SCSI opcode, then that is
> a Report LUNs. That makes sense at this point (scanning to find devices at
> module load time). I do not know why it is failing, but the system seems to
> recover and continue on okay. It might be related to the eventual I/O failure,
> but it is not clear how.
>

After the mentioned patch was applied, the error report vanished.

> That causes the RHTS test to succeed? Or it has some other good effect?

In fact I haven't tested the rhts test yet, but the problem with writing to /boot/ is solved.

> > That causes the RHTS test to succeed? Or it has some other good effect?
> In fact I haven't tested the rhts test yet, but the problem with writing to
> /boot/ is solved.
Even without the patch there is no problem with "rmmod/insmod lp".
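For reference, the dd reproducer mentioned above writes 2k blocks of 10k bytes, which is 2048 x 10240 = 20971520 bytes (20 MiB). The following sketch wraps that arithmetic and the write itself in shell functions. The function names and the target-directory argument are illustrative assumptions; only the dd invocation and the /boot/asd location come from the thread.

```shell
#!/bin/sh
# Sketch only: function names and parameters are hypothetical.

# Expand a dd-style size such as "10k" into a plain number.
# POSIX dd treats a trailing "k" as a multiplier of 1024.
expand_k() {
    case $1 in
        *k) echo $(( ${1%k} * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Total bytes written by "dd count=$1 bs=$2".
dd_total_bytes() {
    echo $(( $(expand_k "$1") * $(expand_k "$2") ))
}

# Write zeros into ts.bin under the given directory, as in the thread.
# $1: target directory (the thread used /boot/asd on sda1)
# $2: count (default 2k), $3: block size (default 10k)
run_write_test() {
    target_dir=$1
    count=${2:-2k}
    bs=${3:-10k}
    mkdir -p "$target_dir" || return 1
    dd if=/dev/zero of="$target_dir/ts.bin" count="$count" bs="$bs" && sync
}

echo "count=2k bs=10k writes $(dd_total_bytes 2k 10k) bytes"
```

On an affected test machine this would be invoked as root as `run_write_test /boot/asd`. Since the thread reports that the write takes the adapter offline and leaves the filesystem read-only, it should only be run on a machine that can be reinstalled.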
Achim, maybe we have only old firmware on the box, could you please check it (Don't know what and if the information is in the log below - if you need some additional info I'll provide it) Thanks, Tomas Adaptec aacraid driver 1.1-5[2456] ACPI: PCI Interrupt 0000:01:08.1[A] -> GSI 18 (level, low) -> IRQ 193 percraid0: kernel 2.7-0[3153] percraid0: monitor 2.7-0[3153] percraid0: bios 2.7-0[3153] percraid0: serial CA3021D3 scsi0 : percraid Vendor: DELL Model: jmo Rev: V1.0 Type: Direct-Access ANSI SCSI revision: 02 SCSI device sda: 71089152 512-byte hdwr sectors (36398 MB) sda: Write Protect is off sda: Mode Sense: 06 00 00 00 SCSI device sda: drive cache: write back SCSI device sda: 71089152 512-byte hdwr sectors (36398 MB) sda: Write Protect is off sda: Mode Sense: 06 00 00 00 SCSI device sda: drive cache: write back sda: sda1 sda2 Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0 Vendor: HITACHI Model: DK32DJ-18MC Rev: D4D4 Type: Direct-Access ANSI SCSI revision: 03 Vendor: HITACHI Model: DK32DJ-18MC Rev: D4D4 Type: Direct-Access ANSI SCSI revision: 03 Vendor: HITACHI Model: DK32DJ-18MC Rev: D4D4 Type: Direct-Access ANSI SCSI revision: 03 Vendor: PE/PV Model: 1x3 SCSI BP Rev: 0.26 Type: Processor ANSI SCSI revision: 02 Tomas, looks good so far. Is there a dedicated procedure to reproduce the issue? Thanks, Achim (In reply to comment #51) > Tomas, > looks good so far. You meant the firmware we have is the newest ? > Is there a dedicated procedure to reproduce the issue? - install RHEL4.7 - mkdir /boot/asd (test directory on sda1) - dd if=/dev/zero of=/boot/asd/ts.bin count=10k bs=10k The system then fails while writing. > > Thanks, > Achim Achim, have you been able to reproduce this on your hardware ? Tomas, unfortunately I couldn't reproduce it yet. Does it depend on a special test machine, system BIOS etc.? Does it also occur with other Adaptec RAID controllers or did you see this with the DELL/Perc controller only? 
Thanks, Achim

(In reply to comment #54)
> unfortunately I couldn't reproduce it yet. Does it depend on a special test
> machine, system BIOS etc.? Does it also occur with other Adaptec RAID
> controllers or did you see this with the DELL/Perc controller only?
I've seen this only on this one machine, so only with the DELL/Perc controller. Have you also tested with this (Dagger/PERC3DiD) controller? We'd like to know whether we are using the latest controller firmware, and what the latest version is. Can you point me to a firmware upload tool? (I wasn't able to find one on Adaptec's website.)

I have been able to reproduce this on a similar system in Westford. This is the system "clug" that I mentioned earlier:

pe1650
BIOS Information
  Vendor: Dell Computer Corporation
  Version: A05
  Release Date: 03/29/2002

This system could not reproduce the problem previously. I changed it from a single disk per logical unit (no RAID) to a two-disk RAID1 and I can now reproduce the problem. I have not gone back to confirm this yet, but the RAID1 configuration appears to be necessary to cause the problem.
I will look into updating the fw next.
Tom

I updated the BIOS on clug with PE1650-BIOS-LX-A11.bin and the PERC fw with PE1650_RAID_FRMW_LX_R168387.BIN from the Dell web site. This fixed the problem.
I will try dell-pe1650-1 and dell-pe1650-2 next.

dell-pe1650-1 already had an up-to-date BIOS (PE1650-BIOS-LX-A11.bin). It did not have the latest PERC fw (PE1650_RAID_FRMW_LX_R168387.BIN). In this state, it failed Tomas' simple test:
dd if=/dev/zero of=/boot/asd/ts.bin count=10k bs=10k
I updated the PERC fw. Now this dd test passes, as it did on clug. The PERC fw update is apparently a prerequisite for the driver update that went into RHEL 4.7.
I am trying to get a good RHTS run now.

This system has passed RHTS tests with RHEL 4.7 and 4.8. Closing.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The aacraid driver update introduced in RHEL 4.7, and contained in subsequent RHEL 4 versions, requires up-to-date Adaptec PERC3/Di firmware. The minimum version required of the PERC3/Di firmware is 2.8.1.7692, A13. This firmware may be obtained at this site:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R168387&SystemID=PWE_PNT_PIII_1650&servicetag=&os=WNET&osl=en&deviceid=1375&devlib=0&typecnt=0&vercnt=9&catid=-1&impid=-1&formatcnt=4&libid=35&fileid=228550

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1,3 @@
-The aacraid driver update introduced in RHEL 4.7, and contained in subsequent RHEL 4 versions, requires up-to-date Adaptec PERC3/Di firmware. The minimum version required of the PERC3/Di firmware is 2.8.1.7692, A13. This firmware may be obtained at this site:
+The aacraid driver update that was first introduced in Red Hat Enterprise Linux 4.7 requires up to date Adaptec PERC3/Di firmware. Subsequent updates of Red Hat Enterprise Linux 4 (including this 4.8 update) require, that the PERC3/Di firmware is at version 2.8.1.7692, A13 or newer. The firmware may be obtained at the following location:
 http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R168387&SystemID=PWE_PNT_PIII_1650&servicetag=&os=WNET&osl=en&deviceid=1375&devlib=0&typecnt=0&vercnt=9&catid=-1&impid=-1&formatcnt=4&libid=35&fileid=228550
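
For anyone checking other PE1650 boxes against the release note, here is a minimal sketch of how the firmware line in the boot log could be compared with the required version. It is not part of the fix. It assumes that the driver's "percraid0: kernel 2.7-0[3153]" line corresponds to the dotted form 2.7.0.3153, so that the required 2.8.1.7692 would appear as 2.8-1[7692]; that mapping is an assumption and has not been confirmed in this bug. The names MIN_FIRMWARE, firmware_versions and firmware_is_current are hypothetical.

```python
import re

# Minimum PERC3/Di firmware named in the release note: 2.8.1.7692 (A13).
MIN_FIRMWARE = (2, 8, 1, 7692)

# Matches boot log lines such as "percraid0: kernel 2.7-0[3153]".
# Assumption: "2.7-0[3153]" means major 2, minor 7, patch 0, build 3153.
FW_LINE = re.compile(r"percraid\d+: kernel (\d+)\.(\d+)-(\d+)\[(\d+)\]")


def firmware_versions(log_text):
    """Return every (major, minor, patch, build) tuple found in log_text."""
    return [tuple(int(part) for part in match.groups())
            for match in FW_LINE.finditer(log_text)]


def firmware_is_current(version, minimum=MIN_FIRMWARE):
    """Return True when version is at or above the minimum (tuple comparison)."""
    return version >= minimum


if __name__ == "__main__":
    # Sample taken from the log posted earlier in this bug.
    sample = "percraid0: kernel 2.7-0[3153]\npercraid0: monitor 2.7-0[3153]\n"
    for version in firmware_versions(sample):
        verdict = "ok" if firmware_is_current(version) else "older than required"
        print(version, verdict)
```

The sketch only reads text that has already been captured, for example the output of dmesg saved to a file. It does not query the controller itself.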