Bug 438895

Summary: dell percraid adapter dead issue
Product: Red Hat Enterprise Linux 4 Reporter: Vivek Goyal <vgoyal>
Component: kernel Assignee: Tomas Henzl <thenzl>
Status: CLOSED NOTABUG QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.7 CC: achim_leubner, andriusb, coughlan, duck, jburke, ltroan, mgahagan, qcai, rlerch
Target Milestone: rc   
Target Release: 4.8   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
The aacraid driver update that was first introduced in Red Hat Enterprise Linux 4.7 requires up-to-date Adaptec PERC3/Di firmware. Subsequent updates of Red Hat Enterprise Linux 4 (including this 4.8 update) require that the PERC3/Di firmware is at version 2.8.1.7692, A13 or newer. The firmware may be obtained at the following location: http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R168387&SystemID=PWE_PNT_PIII_1650&servicetag=&os=WNET&osl=en&deviceid=1375&devlib=0&typecnt=0&vercnt=9&catid=-1&impid=-1&formatcnt=4&libid=35&fileid=228550
Last Closed: 2009-03-16 18:30:42 UTC Type: ---
Bug Blocks: 450901, 461297    
Attachments:
Description                                   Flags
dell-pe1650-1 messages & dmidecode            none
dell-pe1650-2 messages & dmidecode            none
XML file used with RHTS to duplicate issue    none
limit sg list length                          none

Description Vivek Goyal 2008-03-25 20:36:37 UTC
Description of problem:

During RHTS runs I noticed failures on a Dell machine with a percraid adapter. I
have now seen this failure 2-3 times, so something appears to be wrong
specifically with the percraid adapter.

Version-Release number of selected component (if applicable):

Noticed in 68.25

How reproducible:

Noticed it twice in RHTS. I don't know how reproducible it is. I think it might
be reproducible on the same Dell machine.

Additional info:

Error messages:

EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
percraid: Host adapter dead -3

Following are the links to error messages.

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2324721
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2279798

Comment 2 Vivek Goyal 2008-03-26 15:13:35 UTC
I am seeing the same failure signature on the 68.28.vgoyal.test1 RHTS build.

The /kernel/memory/nullmap test is failing on i386 Dell machines with the percraid adapter.

The logs don't say that percraid is dead, but the symptoms of the failure are the
same as in the previous failures.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/memory/nullmap&result=Fail&rwhiteboard=kernel%202.6.9-68.28.EL.vgoyal.test1%20hugemem&arch=i386&jobids=18420
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/memory/nullmap&result=Fail&rwhiteboard=kernel%202.6.9-68.28.EL.vgoyal.test1%20smp&arch=i386&jobids=18420

Comment 3 Tom Coughlan 2008-03-26 15:35:28 UTC
(In reply to comment #0)

> http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2279798

The log says,  "aacraid driver (1.1-5[2441]"
That is prior to Chip's latest 4.7 posting. 

The relevant part of the log appears to be:

Starting rhts:  03/18/08 10:30:51  recipeID:62648 start:
...
03/18/08 11:12:06  JobID:17889 Test:/kernel/drivers/modules
03/18/08 11:12:07  testID:529442 start:

percraid:Fatal Error: See system event log
aacraid: Host adapter reset request. SCSI hang ?
percraid: Host adapter dead -3
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1173
Buffer I/O error on device sda1, logical block 555
lost page write due to I/O error on sda1
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1241
.
.
.
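
The signature in the log above (a fatal adapter error, a reset request, "Host adapter dead", then a stream of I/O errors) is easy to check for mechanically when triaging many RHTS logs. The following is a minimal sketch, assuming the console log has already been saved to a local file; the function names are hypothetical and are not part of RHTS.

```shell
#!/bin/sh
# Hypothetical triage helpers; not part of RHTS or the aacraid driver.

# Succeeds if the log contains the "Host adapter dead" message that both the
# percraid- and AAC-prefixed logs in this report show.
has_adapter_dead() {
    grep -q 'Host adapter dead' "$1"
}

# Prints the number of block-layer I/O errors recorded in the log.
count_io_errors() {
    grep -c 'end_request: I/O error' "$1"
}

# Demo on a few lines copied from the log above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
percraid:Fatal Error: See system event log
aacraid: Host adapter reset request. SCSI hang ?
percraid: Host adapter dead -3
end_request: I/O error, dev sda, sector 1173
end_request: I/O error, dev sda, sector 1241
EOF

if has_adapter_dead "$sample"; then
    echo "adapter dead; I/O errors: $(count_io_errors "$sample")"
fi
rm -f "$sample"
```

A log that shows only the I/O errors, as in Comment 2, would fail the first check but still report a non-zero error count.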


Comment 4 Jeff Burke 2008-03-26 20:23:22 UTC
I reserved this system and did some testing. The same thing happens in RHEL5
with this box:
http://rhts.redhat.com/testlogs/18428/64505/548019/2391992-test_log--kernel-security-selinux-ltp-selinux-20080229-EXTERNALWATCHDOG

I believe this may be a hardware issue, not a software one. I opened an rhts-admin RT ticket:

https://engineering.redhat.com/rt3/Ticket/Display.html?id=21835

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?
AAC: Host adapter dead -3
end_request: I/O error, dev sda, sector 1163
end_request: I/O error, dev sda, sector 1231
end_request: I/O error, dev sda, sector 4895437
Buffer I/O error on device dm-0, logical block 585776
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 601
lost page write due to I/O error on sda1
Aborting journal on device sda1.
journal commit I/O error
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 209229
Buffer I/O error on device dm-0, logical block 0
Aborting journal on device dm-0.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 2307365
Buffer I/O error on device dm-0, logical block 262267
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 262268
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 2307397
Buffer I/O error on device dm-0, logical block 262271
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 2307453
Buffer I/O error on device dm-0, logical block 262278
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 2307597
Buffer I/O error on device dm-0, logical block 262296
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 2307653
Buffer I/O error on device dm-0, logical block 262303
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 2308837
end_request: I/O error, dev sda, sector 2309837
end_request: I/O error, dev sda, sector 2310157
end_request: I/O error, dev sda, sector 4403589
end_request: I/O error, dev sda, sector 4403613


Comment 5 Jeff Burke 2008-04-03 20:14:37 UTC
A little more data on this issue. I reserved a different machine with the same
adapter and a similar issue happened:

percraid:PANIC:  length of sg list is too big 

percraid:Fatal Error: See system event log
percraid:NO CORE DUMP, Trace not started.
aacraid: Host adapter reset request. SCSI hang ?
percraid: Host adapter dead -3
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 1679
Buffer I/O error on device sda1, logical block 808
lost page write due to I/O error on sda1
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 0 0 0> return code = 0x6000000
scsi0 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 841
lost page write due to I/O error on sda1
Aborting journal on device sda1.
end_request: I/O error, dev sda, sector 2521685
Buffer I/O error on device dm-0, logical block 289057
lost page write due to I/O error on dm-0
scsi0 (0:0): rejecting I/O to offline device
Buffer I/O error on device dm-0, logical block 289058
lost page write due to I/O error on dm-0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 287309
Buffer I/O error on device dm-0, logical block 9760
lost page write due to I/O error on dm-0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2521469
Buffer I/O error on device dm-0, logical block 289030
lost page write due to I/O error on dm-0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2521845
Buffer I/O error on device dm-0, logical block 289077
lost page write due to I/O error on dm-0
SCSI error : <0 0 0 0> return code = 0x6000000
Aborting journal on device dm-0.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
end_request: I/O error, dev sda, sector 1087
Buffer I/O error on device sda1, logical block 512
lost page write due to I/O error on sda1
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 209237
Buffer I/O error on device dm-0, logical block 1
lost page write due to I/O error on dm-0
scsi0 (0:0): rejecting I/O to offline device
Buffer I/O error on device dm-0, logical block 2
lost page write due to I/O error on dm-0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 217461
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2306381
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2306389
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2306421
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 2306853
SCSI error : <0 0 0 0> return code = 0x6000000
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_dirty_inode: Journal has aborted
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device


Comment 6 Jeff Burke 2008-04-03 20:18:29 UTC
The above test was done with the 2.6.9-68.19.ELsmp kernel. It seems as if the
RHTS /kernel/drivers/modules test triggers the unwanted behavior.

Link to the test RPM:
http://rhts.redhat.com/rpms/development/noarch/noarch/rh-tests-kernel-drivers-modules-3.0-8.noarch.rpm


Comment 7 Tom Coughlan 2008-04-04 19:13:56 UTC
(In reply to comment #6)
> The above test was done with the 2.6.9-68.19.ELsmp kernel. 

That kernel has driver version 1.1-5[2441]. The same version as RHEL 4.6.

The aacraid update to 1.1.5-2453 for 4.7 is in 2.6.9-68.27.EL. 

Jeff, would it be too much trouble to re-test this system with a stock 4.6
install? If it passes it points to a 4.7 change outside the driver. If it fails,
it is not a regression in 4.7. 

Eventually, we will want to re-test with 2.6.9-68.27.EL or later, to see if that
has a fix. 

If this is too much trouble, please turn the system over to us so we can test it. 






Comment 8 Jeff Burke 2008-04-04 20:47:36 UTC
Tom,
   I have the test running now. I will update with the details when it finishes.
Thanks,
Jeff

Comment 9 Jeff Burke 2008-04-04 20:54:04 UTC
Tom,
  This was with kernel-smp-2.6.9-67.EL running. Unfortunately something is wrong
with the serial configuration on those systems in RDU.

rpmdb: fsync Input/output error
error: db4 error(5) from db->sync: Input/output error
rpmdb: write: 0xb7d0f9b4, 4096: Read-only file system
rpmdb: /var/lib/rpm/Basenames: write failed for page 0
rpmdb: write: 0xb7d085fc, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 0
rpmdb: write: 0xb7c8c63c, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 1
rpmdb: write: 0xb7ccd7b4, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4153
rpmdb: write: 0xb7d022cc, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4154

Message from syslogd@dell-pe1650-2 at Fri Apr  4 16:48:41 2008 ...
dell-pe1650-2 kernel: journal commit I/O error
rpmdb: write: 0xb7c6e6d4, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4155
rpmdb: write: 0xb7d45554, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4156
rpmdb: write: 0xb7d5c104, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4157
rpmdb: write: 0xb7d4d994, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4158
rpmdb: /var/lib/rpm/Basenames: write failed for page 174
rpmdb: read: 0xb7c53904, 4096: Input/output error
error: db4 error(5) from dbcursor->c_get: Input/output error
error: error(5) getting "ib_addr.ko" records from Basenames index
rpmdb: read: 0xb7d0b794, 4096: Input/output error
error: db4 error(5) from dbcursor->c_get: Input/output error
error: error(5) getting "ib_local_sa.ko" records from Basenames index
rpmdb: write: 0xb7d1d09c, 4096: Read-only file system
rpmdb: /var/lib/rpm/Packages: write failed for page 4162


Comment 10 Vivek Goyal 2008-04-16 13:53:26 UTC
I also offlined this machine and could reproduce the issue with the RHEL4 U6 kernel
(67.EL) by running the /kernel/drivers/modules RHTS test.

Interestingly, I saw it happen only with the hugemem kernel and not with the smp
kernel (1 try). I think I will try again to see whether it is a kernel-flavor-related issue.



Comment 11 Tom Coughlan 2008-04-19 03:27:01 UTC
Jeff, and Chip,

I have been trying to find a similar system in Westford where we can try to
reproduce this. I checked clu1, edge2, p750 with no luck. According to some old
notes I have, clug (and presumably cluh) have aacraid. If you can find one of
them, that would be a good candidate. Otherwise, we should have an aacraid board
in the cabinet, or maybe in Chip's cube, that we can put in a system and try to
reproduce this.  

Tom

Comment 18 Tom Coughlan 2008-06-25 00:50:23 UTC
The system "clug" in the Westford lab is a pe1650, with the perc HBA. I
installed RHEL 4.6 (or .7-beta?, I'm not sure now) and Jeff ran the failing test
from RHTS. The problem did not occur. Some additional testing on this box might
be the best next step. 

I am moving this from 4.7 to 4.8 at this point. This does not appear to be a
RHEL 4.7 regression, and we are out of time.

Tom

Comment 23 Mike Gahagan 2008-07-07 21:21:34 UTC
Created attachment 311203 [details]
dell-pe1650-1 messages & dmidecode

Comment 24 Mike Gahagan 2008-07-07 21:22:29 UTC
Created attachment 311204 [details]
dell-pe1650-2 messages & dmidecode

Comment 25 Tom Coughlan 2008-07-24 22:30:29 UTC
(In reply to comment #21)

> Info for dell-pe1650-1:
> 
> BIOS (from dmidecode)
> BIOS Information
>         Vendor: Dell Computer Corporation
>         Version: A11
>         Release Date: 10/08/2003

...

> Info for Dell-pe1650-2:
> BIOS (from dmidecode)
> BIOS Information
>         Vendor: Dell Computer Corporation
>         Version: A05
>         Release Date: 03/29/2002

Info for clug, in Westford:

        BIOS Information
                Vendor: Dell Computer Corporation
                Version: A05
                Release Date: 03/29/2002

So, Dell-pe1650-2 and clug have the same (old) BIOS firmware. 

Again, (In reply to comment #21)

...

> Info for Dell-pe1650-2:
...
> Jul  7 15:29:55 dell-pe1650-2 kernel: Adaptec aacraid driver 1.1-5[2453]

I see from the log that this system is running RHEL 5: kernel 2.6.18-92.el5.
That may not matter, Jeff says that the problem is seen with RHEL 5. 

Out of curiosity, how often does this RHTS test run on these systems, and with
which o.s. version? That is, does the test run frequently on these two systems,
on both RHEL 4 and 5, and it only fails very occasionally, and always on RHEL 4? 

> Jul  7 15:29:55 dell-pe1650-2 kernel: ACPI: PCI Interrupt 0000:01:08.1[A] -> GSI 18 (level, low) -> IRQ 177
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC0: kernel 2.7-0[3153]
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC0: monitor 2.7-0[3153]
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC0: bios 2.7-0[3153]
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC0: serial CA3021D3
> Jul  7 15:29:55 dell-pe1650-2 kernel: scsi0 : percraid
> Jul  7 15:29:55 dell-pe1650-2 kernel:   Vendor: DELL      Model: jmo               Rev: V1.0
> Jul  7 15:29:55 dell-pe1650-2 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
> Jul  7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: 106633728 512-byte hdwr sectors (54596 MB)
> Jul  7 15:29:55 dell-pe1650-2 kernel: sda: Write Protect is off
> Jul  7 15:29:55 dell-pe1650-2 rpc.statd[1937]: statd running as root. chown /var/lib/nfs/statd/sm to choose different user
> Jul  7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: drive cache: write back
> Jul  7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: 106633728 512-byte hdwr sectors (54596 MB)
> Jul  7 15:29:55 dell-pe1650-2 kernel: sda: Write Protect is off
> Jul  7 15:29:55 dell-pe1650-2 kernel: SCSI device sda: drive cache: write back
> Jul  7 15:29:55 dell-pe1650-2 kernel:  sda: sda1 sda2
> Jul  7 15:29:55 dell-pe1650-2 kernel: sd 0:0:0:0: Attached scsi removable disk sda
> Jul  7 15:29:55 dell-pe1650-2 kernel:   Vendor: HITACHI   Model: DK32DJ-18MC       Rev: D4D4
> Jul  7 15:29:55 dell-pe1650-2 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 03
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC:ID(0:00:0); Error Event [command:0xa0]
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
> Jul  7 15:29:55 dell-pe1650-2 kernel: AAC:ID(0:00:0); Invalid Command Operation Code

I do not see these on clug. They are indeed present in the log of the failed
RHEL 4 RHTS test run referenced above. 

I'm having trouble finding these messages in the code, but if "command:0xa0"
refers to a SCSI opcode, then that is a Report LUNs. That makes sense at this
point, but I do not know why it is failing. 
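
For reference, the opcode and sense data in those messages can be decoded with a small lookup. This is only a sketch: it assumes "command" is the SCSI opcode and that k/c/q are the sense key, ASC and ASCQ, which the driver source would need to confirm. The table covers just the values seen here plus two common opcodes, and the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical decoders for the values printed in the AAC/percraid messages.

# Map a SCSI opcode (hex string) to its command name.
decode_scsi_opcode() {
    case "$1" in
        0x00)      echo "TEST UNIT READY" ;;
        0x12)      echo "INQUIRY" ;;
        0xa0|0xA0) echo "REPORT LUNS" ;;
        *)         echo "unknown opcode $1" ;;
    esac
}

# Map sense key / ASC / ASCQ (hex strings) to a description.
decode_sense() {
    case "$1/$2/$3" in
        0x5/0x20/0x0) echo "ILLEGAL REQUEST: INVALID COMMAND OPERATION CODE" ;;
        *)            echo "not in this table: $1/$2/$3" ;;
    esac
}

# Values from "Error Event [command:0xa0]" and "[k:0x5,c:0x20,q:0x0]".
decode_scsi_opcode 0xa0
decode_sense 0x5 0x20 0x0
```

Under those assumptions the decoded text agrees with the driver's own "Illegal Request" and "Invalid Command Operation Code" lines, which suggests the target simply does not implement the command.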

For the record, the test that is running when the adapter appears to die is:

function TestModule ()
{
    MODLIST='lp'
    KPNAME=$kernname-$kernver-$1
    for t in $MODLIST ; do
        # Test the insertion of previous kernel modules
        /sbin/modprobe -r $t
        IMOD=$(rpm -ql $KPNAME | grep /$t.ko)
        insmod $IMOD
        if [ "$?" -ne "0" ]; then
            echo "$IMOD insertion Failed:" | tee -a $OUTPUTFILE
        else
            echo "$IMOD insertion Passed:" | tee -a $OUTPUTFILE
        fi
    done
}

Is the $OUTPUTFILE preserved in RHTS? 

Bottom line, more work is needed to reproduce this on clug, or to understand why
it is not failing.

Comment 26 RHEL Program Management 2008-09-03 13:04:18 UTC
Updating PM score.

Comment 27 Qian Cai 2008-09-25 09:02:13 UTC
*** Bug 455268 has been marked as a duplicate of this bug. ***

Comment 28 Vivek Goyal 2008-10-03 13:39:46 UTC
Observed this issue again during my rhts run.

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4529790

Comment 30 Tomas Henzl 2008-10-06 14:09:05 UTC
Vivek,
I'd like to test the patch from BZ#453472, even if the chances that it will help are small.
Both dell-pe1650-2 and dell-pe1650-1 are in use for testing now; could you help me with the reservation?

Comment 31 Tomas Henzl 2008-11-03 12:36:55 UTC
Vivek,
Do you still see the problem, or has it vanished in the latest kernel?

Comment 33 Jeff Burke 2009-01-29 14:16:10 UTC
I have created a RHTS xml script. Using this script I tried the test against RHEL4-U6 and RHEL4-U8-re20090128.1. They both failed the exact same way.

Here is a link to the actual job:
http://tinyurl.com/BZ438895

Here are the links to the results:
 RHEL4-U6
http://tinyurl.com/aeh9qo

 RHEL4-U8-re20090128.1
http://tinyurl.com/bq3wyr

-------- snip --------
01/28/09 21:08:09  JobID:43834 Test:/kernel/drivers/modules Response:1
01/28/09 21:08:10  testID:1247692 start:

percraid:Fatal Error: See system event log
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
-------- snip --------

Comment 34 Tomas Henzl 2009-01-29 16:07:03 UTC
(In reply to comment #33)
> I have created a RHTS xml script. Using this script I tried the test against
> RHEL4-U6 and RHEL4-U8-re20090128.1. They both failed the exact same way.
> 
Jeff,
thanks for the testing. Which test did you run with your script? And if it is publicly usable, could you make it public?

Comment 35 Jeff Burke 2009-01-29 16:16:58 UTC
Created attachment 330371 [details]
XML file used with RHTS to duplicate issue

Tomas,
   No Problem. I will attach the xml script so you have a list of the tests.

   I believe the test case is covered under the GPL, but it really will not work outside of the Red Hat building. It goes to internal servers to grab some packages to test older modules. You could modify it to work externally if you want.

   FYI the specific test is /kernel/drivers/modules

Comment 37 Tom Coughlan 2009-01-29 18:27:45 UTC
>    FYI the specific test is /kernel/drivers/modules

That test comes from 

rh-tests-kernel-drivers-modules-1.1-5.noarch.rpm

and is as shown in Comment #25 above. 

I just logged in to the system and did:

lsmod                  # confirmed that "lp" is loaded.
/sbin/modprobe -r lp   # no problem
lsmod                  # it is unloaded
rpm -ql kernel-2.6.9-80.EL | grep lp.ko

/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
/lib/modules/2.6.9-80.EL/kernel/drivers/usb/class/usblp.ko

# Humm. Maybe having two returns here is causing a problem for the script?

insmod /lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko

insmod: error inserting '/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko': -1 Invalid module format

# I do not know why that is happening. It seems to happen for all the 
# modules I tried. 


So, there may be some issues with the test script, and there may be something causing an error on insmod. None of that, of course, explains why the aacraid would appear to go offline. That is just bizarre, considering how simple this test is. Tomas, please take a closer look at that script, then try to reproduce the problem without RHTS. If it does not reproduce, then we'll need to learn more about setting up and debugging in the RHTS environment.

Comment 38 Jeff Burke 2009-01-29 18:50:06 UTC
The runtest.sh script has a grep command that will only return a single module.

----- snip ------
rpm -ql $KPNAME | grep /$t.ko
----- /snip ------

Looks like when you manually ran it you missed the / before the module. It should have been rpm -ql kernel-2.6.9-80.EL | grep /lp.ko

Also, we run this test on every RHEL4 kernel we build. If it were a test issue, we would have seen it way before this. Or at least we would have seen it on other systems.
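
The difference the leading slash makes can be seen without touching the RPM database. A minimal sketch, using the two paths that "rpm -ql" returned in Comment 37 as fixed input (the helper name is hypothetical; the pattern mirrors the one in runtest.sh, where "." is still a regex wildcard):

```shell
#!/bin/sh
# The two paths listed by "rpm -ql kernel-2.6.9-80.EL" in Comment 37.
paths='/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
/lib/modules/2.6.9-80.EL/kernel/drivers/usb/class/usblp.ko'

# Hypothetical helper: filter the fixed path list with a grep pattern.
match_paths() {
    printf '%s\n' "$paths" | grep "$1"
}

echo "pattern lp.ko:"
match_paths 'lp.ko'     # matches both lp.ko and usblp.ko
echo "pattern /lp.ko:"
match_paths '/lp.ko'    # matches only .../char/lp.ko
```

With the slash, the pattern is anchored to the start of the file name, so usblp.ko no longer matches and insmod receives a single path.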

Comment 39 Jeff Burke 2009-01-29 18:56:14 UTC
Looking at the actual results from the testing, I see this:

Results from "2.6.9-80.EL"
Starting ./runtest.sh
Current Test Version = rh-tests-kernel-drivers-modules-3.0
Current Running Kernel Package = kernel-smp-2.6.9-80.EL
Download/Install kernel-smp-2.6.9-5.EL.i686.rpm kernel
Download kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Install kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Download/Install kernel-smp-2.6.9-11.EL.i686.rpm kernel
Download kernel-smp-2.6.9-11.EL.i686.rpm Passed:

Results from "2.6.9-67.EL"
Starting ./runtest.sh
Current Test Version = rh-tests-kernel-drivers-modules-3.0
Current Running Kernel Package = kernel-smp-2.6.9-67.EL
Download/Install kernel-smp-2.6.9-5.EL.i686.rpm kernel
Download kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Install kernel-smp-2.6.9-5.EL.i686.rpm Passed:
Download/Install kernel-smp-2.6.9-11.EL.i686.rpm kernel
Download kernel-smp-2.6.9-11.EL.i686.rpm Passed:

What is interesting is they both seem to fail in the exact same spot: Just after installing the lp.ko from 2.6.9-11.EL. Next in the list would have been -22.EL

Comment 40 Tom Coughlan 2009-01-29 18:57:31 UTC
(In reply to comment #37)

> insmod /lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
> 
> insmod: error inserting '/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko':
> -1 Invalid module format

My mistake. That should be "2.6.9-80.ELsmp", not "2.6.9-80.EL".

insmod works fine now. 

And yes, Jeff is right about the missing / in the grep command.
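
A mismatch like this can be caught before calling insmod by comparing the release directory in the module path with the running kernel. A rough sketch, with hypothetical helper names; it only compares strings and does not inspect the module's vermagic:

```shell
#!/bin/sh
# Hypothetical helpers; they look only at the path, not at the module itself.

# Print the kernel release encoded in a /lib/modules/<release>/... path.
module_release() {
    rest=${1#/lib/modules/}
    printf '%s\n' "${rest%%/*}"
}

# Succeed if the module path belongs to the given release
# (defaults to the running kernel as reported by uname -r).
matches_release() {
    [ "$(module_release "$1")" = "${2:-$(uname -r)}" ]
}

# The case from Comment 37: a non-smp module path on an smp kernel.
path=/lib/modules/2.6.9-80.EL/kernel/drivers/char/lp.ko
if matches_release "$path" "2.6.9-80.ELsmp"; then
    echo "match"
else
    echo "mismatch: module is for $(module_release "$path")"
fi
```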

Comment 41 Tomas Henzl 2009-01-30 16:47:45 UTC
I've installed kernel-smp-2.6.9-22.EL.i686.rpm and kernel-smp-2.6.9-11.EL.i686.rpm. On both I'm able to do 
insmod /lib/modules/2.6.9-11.ELsmp/kernel/drivers/char/lp.ko and
insmod /lib/modules/2.6.9-22.ELsmp/kernel/drivers/char/lp.ko.

With 2.6.9-11 I can see the following in /var/log/messages:
Jan 30 11:40:07 dell-pe1650-1 kernel: ksign: module signed with unknown public key
Jan 30 11:40:07 dell-pe1650-1 kernel: - signature keyid: 975e6fa84e049d8a ver=3
Could that be a problem when running automated tests?

Comment 42 Jeff Burke 2009-01-30 17:10:47 UTC
Nope, that is normal output for the test; see here:

This is from the log in Comment#32
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=6184389


------snip------
Console Log:  	

ksign: module signed with unknown public key
- signature keyid: e07bc3e85be30cfd ver=3
lp: driver loaded but no devices found
ksign: module signed with unknown public key
- signature keyid: d67b3e6b1ed6fec7 ver=3
lp: driver loaded but no devices found
ksign: module signed with unknown public key
- signature keyid: 975e6fa84e049d8a ver=3
lp: driver loaded but no devices found
ksign: module signed with unknown public key
- signature keyid: 063efdf11ad6baa3 ver=3
lp: driver loaded but no devices found
ksign: module signed with unknown public key
- signature keyid: 8301cd821788a86b ver=3
lp: driver loaded but no devices found
ksign: module signed with unknown public key
- signature keyid: 03629c5f482105a7 ver=3
lp: driver loaded but no devices found
lp: driver loaded but no devices found
ksign: module signed with unknown public key
- signature keyid: 9ada2b4b1ec241de ver=3
lp: driver loaded but no devices found
------snip------


Just out of curiosity, why are you running it by hand? You should be able to cd into the /mnt/test/kernel/drivers/modules directory and run either "make run" or runtest.sh.

Comment 43 Tomas Henzl 2009-02-02 17:10:40 UTC
It looks to me as if the issue here is not related to module loading/unloading;
simply writing some amount of data to /dev/sda1 triggers it, which happens, for
example, during a kernel install (which is part of the module test).
There are some error messages when the system starts:
Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0
  Vendor: HITACHI   Model: DK32DJ-18MC       Rev: D4D4
  Type:   Direct-Access                      ANSI SCSI revision: 03
percraid:ID(0:00:0); Error Event [command:0xa0]
percraid:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
percraid:ID(0:00:0); Invalid Command Operation Code

I'm trying to reinstall it with RHEL 5.3 to see if it also fails.

Comment 44 Tomas Henzl 2009-02-04 14:10:03 UTC
RHEL 5.3 suffers from the same issue. Kernel 2.6.9-67 also has a similar problem, even if the symptoms are not exactly the same. Kernel 2.6.9-55 seems to be working well, so I'm going to see what the differences between them are.

Comment 45 Tomas Henzl 2009-02-13 13:09:04 UTC
Created attachment 331826 [details]
limit sg list length

The issue was probably caused by the patch "Update aacraid driver to 1.1.5-2453". Further, I've found that limiting the size of the sg list helps. See the proposed patch.

Comment 46 Tom Coughlan 2009-02-13 16:00:16 UTC
(In reply to comment #45)

> Further I've found that limiting the size of the sg list helps.

So, have you found a simple way to reproduce this?

As I understand it, RHTS does a complete install, and then runs a certain number of tests with no problem. Then, on one simple "rmmod/insmod lp" test, all I/O to the root disk suddenly fails. An attempt to run the equivalent test outside RHTS does not fail. It would be helpful to find a more direct reproducer. 


(In reply to comment #43)

> percraid:ID(0:00:0); Error Event [command:0xa0]
> percraid:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
> percraid:ID(0:00:0); Invalid Command Operation Code

As I mentioned earlier, if "command:0xa0" refers to a SCSI opcode, then that is a Report LUNs. That makes sense at this point (scanning to find devices at module load time).  I do not know why it is failing, but the system seems to recover and continue on okay. It might be related to the eventual I/O failure, but it is not clear how. 

(In reply to comment #45)
> Created an attachment (id=331826) [details]
> limit sg list length
> 
> The issue has been probably caused by the patch "Update aacraid driver to
> 1.1.5-2453". 

It will be interesting to see what happens when you back that out. 

> Further I've found that limiting the size of the sg list helps.
> See proposed patch.

That causes the RHTS test to succeed? Or it has some other good effect?

Comment 47 Tomas Henzl 2009-02-13 16:43:51 UTC
(In reply to comment #46)
> (In reply to comment #45)
> 
> > Further I've found that limiting the size of the sg list helps.
> 
> So, have you found a simple way to reproduce this?
Yes, writing a somewhat larger amount of data to sda1 brings it down immediately: 'dd if=/dev/zero of=/boot/asd/ts.bin count=2k bs=10k'
> 
> As I understand it, RHTS does a complete install, and then runs a certain
> number of tests with no problem. Then, on one simple "rmmod/insmod lp" test,
> all I/O to the root disk suddenly fails. An attempt to run the equivalent test
> outside RHTS does not fail. It would be helpful to find a more direct
> reproducer. 
> 
This "rmmod/insmod lp" test installs several older kernels - the system fails while writing them to /boot/.
> 
> (In reply to comment #43)
> 
> > percraid:ID(0:00:0); Error Event [command:0xa0]
> > percraid:ID(0:00:0); Illegal Request [k:0x5,c:0x20,q:0x0]
> > percraid:ID(0:00:0); Invalid Command Operation Code
> 
> As I mentioned earlier, if "command:0xa0" refers to a SCSI opcode, then that is
> a Report LUNs. That makes sense at this point (scanning to find devices at
> module load time).  I do not know why it is failing, but the system seems to
> recover and continue on okay. It might be related to the eventual I/O failure,
> but it is not clear how. 
> 
After the mentioned patch was applied, the error report vanished.

> That causes the RHTS test to succeed? Or it has some other good effect?
In fact I haven't run the RHTS test yet, but the problem with writing to /boot/ is solved.

Comment 48 Tomas Henzl 2009-02-13 16:45:29 UTC
> > That causes the RHTS test to succeed? Or it has some other good effect?
> In fact I haven't tested the rhts test yet, but the problem with writing to
> /boot/ is solved.
Even without the patch there is no problem with "rmmod/insmod lp".

Comment 49 Tomas Henzl 2009-02-13 17:18:14 UTC
Achim,
maybe we only have old firmware on the box. Could you please check?
(I don't know whether the information you need is in the log below. If you need additional info, I'll provide it.)
Thanks,
Tomas

Adaptec aacraid driver 1.1-5[2456]
ACPI: PCI Interrupt 0000:01:08.1[A] -> GSI 18 (level, low) -> IRQ 193
percraid0: kernel 2.7-0[3153] 
percraid0: monitor 2.7-0[3153]
percraid0: bios 2.7-0[3153]
percraid0: serial CA3021D3
scsi0 : percraid
  Vendor: DELL      Model: jmo               Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 71089152 512-byte hdwr sectors (36398 MB)
sda: Write Protect is off
sda: Mode Sense: 06 00 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 71089152 512-byte hdwr sectors (36398 MB)
sda: Write Protect is off
sda: Mode Sense: 06 00 00 00
SCSI device sda: drive cache: write back
 sda: sda1 sda2
Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0
  Vendor: HITACHI   Model: DK32DJ-18MC       Rev: D4D4
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: HITACHI   Model: DK32DJ-18MC       Rev: D4D4
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: HITACHI   Model: DK32DJ-18MC       Rev: D4D4
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: PE/PV     Model: 1x3 SCSI BP       Rev: 0.26
  Type:   Processor                          ANSI SCSI revision: 02

Comment 51 Achim Leubner 2009-02-18 16:11:09 UTC
Tomas,
looks good so far. Is there a dedicated procedure to reproduce the issue?

Thanks,
Achim

Comment 52 Tomas Henzl 2009-02-18 16:52:23 UTC
(In reply to comment #51)
> Tomas,
> looks good so far. 
Do you mean that the firmware we have is the newest?

> Is there a dedicated procedure to reproduce the issue?
- install RHEL4.7
- mkdir /boot/asd (test directory on sda1)
- dd if=/dev/zero of=/boot/asd/ts.bin count=10k bs=10k
The system then fails while writing.
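The steps above could be wrapped in a small script along these lines. This is a sketch only: the function names are hypothetical, and only the dd parameters come from this report. With GNU dd the "k" suffix means 1024, so count=10k bs=10k writes 10240 blocks of 10240 bytes (100 MiB), and the earlier count=2k variant writes 20 MiB.

```shell
#!/bin/sh
# Sketch of the dd reproducer described above. Function names are
# hypothetical; only the dd parameters are taken from the report.

# Total bytes written by dd for count=<N>k bs=<M>k (GNU dd: k = 1024).
dd_total_bytes() {
    echo $(( $1 * 1024 * $2 * 1024 ))
}

# Create the test directory and write count_k * 1024 blocks of 10 KiB
# of zeros to <dir>/ts.bin.
run_repro() {
    dir=$1
    count_k=$2
    mkdir -p "$dir" || return 1
    dd if=/dev/zero of="$dir/ts.bin" count="${count_k}k" bs=10k
}

# Example (scratch machine only; an affected system loses I/O to the disk):
#   dd_total_bytes 10 10
#   run_repro /boot/asd 10
```

Nothing runs until run_repro is called explicitly, so the file can be sourced safely on a machine that should not be brought down.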

> 
> Thanks,
> Achim

Comment 53 Tomas Henzl 2009-03-02 10:26:24 UTC
Achim,
have you been able to reproduce this on your hardware?

Comment 54 Achim Leubner 2009-03-09 10:58:30 UTC
Tomas,

unfortunately I couldn't reproduce it yet. Does it depend on a special test machine, system BIOS etc.? Does it also occur with other Adaptec RAID controllers or did you see this with the DELL/Perc controller only? 

Thanks,
Achim

Comment 55 Tomas Henzl 2009-03-09 12:36:16 UTC
(In reply to comment #54)
> unfortunately I couldn't reproduce it yet. Does it depend on a special test
> machine, system BIOS etc.? Does it also occur with other Adaptec RAID
> controllers or did you see this with the DELL/Perc controller only? 

I've seen this only on this one machine, so only with the DELL/Perc controller.
Have you also tested with this controller (Dagger/PERC3DiD)?
We'd like to know whether we are using the latest controller firmware, and what the latest version is.
Can you point me to a firmware upload tool? (I couldn't find one on Adaptec's website.)

Comment 56 Tom Coughlan 2009-03-10 15:43:03 UTC
I have been able to reproduce this on a similar system in Westford. This is the system "clug" that I mentioned earlier:

        pe1650
        BIOS Information
                Vendor: Dell Computer Corporation
                Version: A05
                Release Date: 03/29/2002

This system could not reproduce the problem previously. I changed it from a single disk per logical unit (no RAID) to a two-disk RAID1 and I can now reproduce the problem. I have not gone back to confirm this yet, but the RAID1 configuration appears to be necessary to cause the problem.

I will look into updating the fw next.

Tom

Comment 57 Tom Coughlan 2009-03-11 20:55:45 UTC
I updated the BIOS on clug with 
PE1650-BIOS-LX-A11.bin 
and the PERC fw with 
PE1650_RAID_FRMW_LX_R168387.BIN
from the Dell web site. This fixed the problem. I will try dell-pe1650-1 and dell-pe1650-2 next.

Comment 58 Tom Coughlan 2009-03-13 14:04:35 UTC
dell-pe1650-1 already had up-to-date BIOS (PE1650-BIOS-LX-A11.bin).
It did not have the latest PERC fw (PE1650_RAID_FRMW_LX_R168387.BIN).
In this state, it failed Tomas' simple test:
dd if=/dev/zero of=/boot/asd/ts.bin count=10k bs=10k

I updated the PERC fw. Now this dd test passes, as it did on clug. The PERC fw update is apparently a prerequisite for the driver update that went into RHEL 4.7.

I am trying to get a good RHTS run now.
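
A quick way to spot machines that still need the update might be to read the firmware build from the driver's boot message (for example "percraid0: kernel 2.7-0[3153]" in comment 49). The sketch below rests on an assumption this thread does not confirm: that the bracketed build number corresponds to the last component of the Dell version string 2.8.1.7692. The function names and the MIN_BUILD variable are hypothetical.

```shell
#!/bin/sh
# Sketch: extract the PERC firmware build from an aacraid boot message and
# compare it with the build of the required firmware.
# ASSUMPTION: the bracketed number (e.g. 3153) is comparable to the last
# component of "2.8.1.7692". Names below are hypothetical.

MIN_BUILD=7692

# Print the bracketed build number from a "percraidN: kernel X.Y-Z[BUILD]" line.
fw_build_from_line() {
    printf '%s\n' "$1" |
        sed -n 's/.*percraid[0-9]*: kernel [0-9.]*-[0-9]*\[\([0-9][0-9]*\)].*/\1/p'
}

# Succeed (exit status 0) if the given build is at least MIN_BUILD.
fw_is_sufficient() {
    [ "$1" -ge "$MIN_BUILD" ]
}

# Example:
#   build=$(fw_build_from_line "$(dmesg | grep 'percraid[0-9]*: kernel')")
#   fw_is_sufficient "$build" || echo "PERC firmware update needed"
```

If the assumption about the build number does not hold for a given controller, the Dell update package itself remains the reliable way to check the installed version.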

Comment 59 Tom Coughlan 2009-03-16 18:30:42 UTC
This system has passed RHTS tests with RHEL 4.7 and 4.8. Closing.

Comment 60 Tom Coughlan 2009-03-16 18:30:42 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The aacraid driver update introduced in RHEL 4.7, and contained in subsequent RHEL 4 versions, requires up-to-date Adaptec PERC3/Di firmware. The minimum version required of the PERC3/Di firmware is 2.8.1.7692, A13. This firmware may be obtained at this site:

http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R168387&SystemID=PWE_PNT_PIII_1650&servicetag=&os=WNET&osl=en&deviceid=1375&devlib=0&typecnt=0&vercnt=9&catid=-1&impid=-1&formatcnt=4&libid=35&fileid=228550

Comment 62 Ryan Lerch 2009-04-06 22:42:47 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1,3 @@
-The aacraid driver update introduced in RHEL 4.7, and contained in subsequent RHEL 4 versions, requires up-to-date Adaptec PERC3/Di firmware. The minimum version required of the PERC3/Di firmware is 2.8.1.7692, A13. This firmware may be obtained at this site:
+The aacraid driver update that was first introduced in Red Hat Enterprise Linux 4.7 requires up to date Adaptec PERC3/Di firmware. Subsequent updates of Red Hat Enterprise Linux 4 (including this 4.8 update) require, that the PERC3/Di firmware is at version 2.8.1.7692, A13 or newer. The firmware may be obtained at the following location:
 
 http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R168387&SystemID=PWE_PNT_PIII_1650&servicetag=&os=WNET&osl=en&deviceid=1375&devlib=0&typecnt=0&vercnt=9&catid=-1&impid=-1&formatcnt=4&libid=35&fileid=228550