Testing the BusLogic BA80C32 SCSI controller with the Cerberus test suite on x86. The system was built from qa0307.0, which included kernel 2.4.2-0.1.22. After 11h48m15s, the following error messages were displayed on the console:

EXT2-fs error (device sd(8,5)): ext2_readdir: bad entry in directory #89775: inode out of bounds - offset=0, inode=50462976
EXT2-fs error (device sd(8,5)): ext2_readdir: bad entry in directory #78116: inode out of bounds - offset=0, inode=50462976
EXT2-fs error (device sd(8,5)): ext2_readdir: bad entry in directory #33317: inode out of bounds - offset=0, inode=50462976

Keyboard input was echoed on screen, and the NumLock and CapsLock keys were immediately responsive. Ctrl-C appeared to stop Cerberus, but no shell prompt was displayed; a second Ctrl-C did not display the typical error message. There were no problems changing VCs, and all other VCs were in the same state.
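For reference, the "inode out of bounds" message comes from ext2's directory-entry sanity checking. A minimal sketch of that check, modeled loosely on the 2.4 ext2_check_dir_entry() logic (simplified, not the exact kernel source):

#include <stdint.h>
#include <stddef.h>

/* On-disk ext2 directory entry header (fields shown in host byte
 * order for simplicity; on disk they are little-endian). */
struct ext2_dirent {
    uint32_t inode;     /* inode number; 0 means unused entry */
    uint16_t rec_len;   /* length of this record, in bytes */
    uint8_t  name_len;
    uint8_t  file_type;
    /* name[] follows */
};

/* Returns an error string, or NULL if the entry looks sane. */
static const char *check_dir_entry(const struct ext2_dirent *de,
                                   uint32_t inodes_count)
{
    if (de->rec_len < 8)
        return "rec_len is smaller than minimal";
    if (de->rec_len % 4 != 0)
        return "rec_len % 4 != 0";
    /* This is the check that fired here: the on-disk inode number
     * (50462976 == 0x03020100, itself a suspicious byte pattern)
     * is larger than the filesystem has inodes. */
    if (de->inode > inodes_count)
        return "inode out of bounds";
    return NULL;
}

Note that the offending inode value 50462976 decodes to 0x03020100, i.e. the byte sequence 00 01 02 03, which already suggests foreign data rather than a real directory entry.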
Another BusLogic data-corruption report. Do you have access to a different SCSI card so we can pin this down to a driver problem?
Yes, I've got access to a few (Durham test lab). Should I pick another BusLogic that uses a different driver, or will most SCSI cards work well enough for comparison?
Try something else reliable: say, an Adaptec. If you can reproduce the corruption with the BusLogic you've got and don't see any with an Adaptec, that would give us a clear indication of where the problem lies.
Retrying... this might take a bit longer than expected (or the problem may turn out not to be reproducible): I just discovered that the motherboard used had a beta chipset, which is apparently too buggy for reliable testing with Cerberus. I will start over on another system, repeat the run for a while, and post new information back here as it arises.
OK, thanks --- let me know what happens either way so we can either lay this to rest as a false alarm or resurrect it and dig further.
This resembles the generic SCSI corruption we are seeing (bug #31519).
Tested another system with the BusLogic card, using Cerberus and the 2.4.2-0.1.25smp kernel. The memory tests in Cerberus failed, with severe memory corruption. The contents of the Cerberus MEMORY0 log include:

Writing block size 16640 (65K) with alignment 3...Verifying...Done.
Writing block size 33024 (129K) with alignment 3...Verifying...Done.
Writing block size 49408 (193K) with alignment 3...Verifying...Done.
Writing block size 65792 (257K) with alignment 3...Verifying...Done.
Writing block size 82176 (321K) with alignment 3...Verifying...Done.
Writing block size 2973696 (11616K) with alignment 3...Verifying...Done.
Writing block size 5947392 (23232K) with alignment 3...Verifying...Done.
Writing block size 8921088 (34848K) with alignment 3...Verifying...Done.
Writing block size 11894784 (46464K) with alignment 3...Verifying...Done.
Writing block size 14868480 (58080K) with alignment 3...Verifying...Done.
Writing block size 17842176 (69696K) with alignment 3...Verifying...Done.
Writing block size 20815872 (81312K) with alignment 3...Verifying...Done.
Writing block size 23789568 (92928K) with alignment 3...Verifying...Memory error at offset 158459 of 23789568 : expected aaad15a5, got 39c39c39
Local process address: 0x401e6c00
Scanning /proc/kcore. This is dangerous, take cover.
Possible location of memory failure: 0x5584bff (85M) on page 21892
System RAM fault likely. Check System RAM first, then motherboard/CPU.
Failure Context:
offset   expected  got
158456   aaad15a2  aaad15a2
158457   aaad15a3  aaad15a3
158458   aaad15a4  aaad15a4
158459   aaad15a5  39c39c39  *** fail location
158460   aaad15a6  39c39c39
158461   aaad15a7  39c39c39
158462   aaad15a8  39c39c39
158463   aaad15a9  39c39c39
158464   aaad15aa  39c39c39
158465   aaad15ab  39c39c39
Mon Mar 12 21:54:11 EST 2001: MEMORY0 FAILED: on 2/64 after 5h9m1s

Each MEMORY0 failure is very similar to this one. The string '39c39c39' looks *identical* to the pattern that Cerberus' FIFO_MMAP test is using. From the Cerberus FIFO_MMAP log:

Total Statistics (1474):
Output device/file name: /tmp/FIFO.tmp.1046 (device type=fifo)
Type of I/O's performed: sequential
Data pattern written: 0x39c39c39 (read verify disabled)

After Cerberus exited with the failure, many commands (such as uptime) failed.
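For context, the MEMORY test is essentially a write-pattern/verify sweep over a large block. A minimal sketch of that technique in C (hypothetical code, not the actual Cerberus source; the incrementing pattern is an assumption based on the expected values in the failure context above):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Minimal write/verify memory test in the style of the Cerberus
 * MEMORY test: fill a block with a deterministic pattern, then read
 * it back and report the first mismatch.  Hypothetical sketch. */
int test_block(size_t nwords, uint32_t seed)
{
    uint32_t *buf = malloc(nwords * sizeof(uint32_t));
    size_t i;

    if (!buf)
        return -1;
    for (i = 0; i < nwords; i++)       /* write phase */
        buf[i] = seed + (uint32_t)i;
    for (i = 0; i < nwords; i++) {     /* verify phase */
        if (buf[i] != seed + (uint32_t)i) {
            printf("Memory error at offset %zu of %zu : "
                   "expected %08x, got %08x\n",
                   i, nwords, seed + (uint32_t)i, buf[i]);
            free(buf);
            return 1;
        }
    }
    free(buf);
    return 0;
}

The failure context fits this model: the expected values increment word by word, while ten consecutive words came back wholesale as 0x39c39c39, the FIFO_MMAP data pattern, which is why a stray page from another process is the prime suspect rather than a single flipped bit.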
strace revealed that each of these commands died with SIGSEGV on old_mmap. An strace of uptime:

execve("/usr/bin/uptime", ["uptime"], [/* 22 vars */]) = 0
uname({sys="Linux", node="test93.test.redhat.com", ...}) = 0
brk(0) = 0x8049744
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=19708, ...}) = 0
old_mmap(NULL, 19708, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
close(3) = 0
open("/lib/libproc.so.2.0.7", O_RDONLY) = 3
read(3, "", 1024) = 0
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001c000
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

(Note that the read() of /lib/libproc.so.2.0.7 returned 0 bytes instead of the 1024 requested, so the dynamic loader never saw valid ELF data for the library.)

/var/log/messages also contained the following:

Mar 12 21:42:25 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:25 test93 kernel: Info fld=0x2892ba, Current sd08:06: sense key Hardware Error
Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:25 test93 kernel: I/O error: dev 08:06, sector 506176
Mar 12 21:42:25 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:25 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:25 test93 kernel: I/O error: dev 08:06, sector 506184
Mar 12 21:42:25 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:25 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:25 test93 kernel: I/O error: dev 08:06, sector 506192
Mar 12 21:42:26 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:37 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:38 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:43 test93 kernel: I/O error: dev 08:06, sector 506200
Mar 12 21:42:44 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:44 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:44 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:46 test93 kernel: I/O error: dev 08:06, sector 506208
Mar 12 21:42:46 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:52 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:52 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:55 test93 kernel: I/O error: dev 08:06, sector 506216
<... continues for ~1400 lines >
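For reference, "return code = 28000002" unpacks with the long-standing Linux SCSI result encoding (driver_byte<<24 | host_byte<<16 | msg_byte<<8 | status_byte). A small illustrative decode; the flag interpretations in the comments are my reading of the 2.4-era definitions, so treat them as hedged:

#include <stdio.h>
#include <stdint.h>

/* Decode a 2.4-era Linux SCSI command result word.  Illustrative
 * sketch only; flag names follow <scsi/scsi.h> as I recall it. */
int main(void)
{
    uint32_t result = 0x28000002;
    uint8_t driver = result >> 24;   /* 0x28 */
    uint8_t host   = result >> 16;   /* 0x00: no HBA-level error */
    uint8_t msg    = result >> 8;    /* 0x00 */
    uint8_t status = result;         /* 0x02: CHECK CONDITION */

    printf("driver=%02x host=%02x msg=%02x status=%02x\n",
           driver, host, msg, status);
    if (driver & 0x08)   /* DRIVER_SENSE: sense data is valid */
        printf("  sense data valid (hence the sense-key lines)\n");
    /* the remaining 0x20 bit in the driver byte is one of the
     * 2.4 SUGGEST_* hint bits */
    if (status == 0x02)  /* SCSI status CHECK CONDITION */
        printf("  target reported CHECK CONDITION\n");
    return 0;
}

A zero host byte plus CHECK CONDITION with valid sense data means the target device itself reported the failure, which is consistent with the dead-disk diagnosis further down in this report.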
Could you please try doing a run with the MMAP_FIFO test disabled?
Attempting to do that now, and trying to replicate the current errors on other machines. Reinstalled qa0312.0 (the same qa tree used before, with the 2.4.2-0.1.25 kernel) and reformatted the partitions (including a bad-block check; verified on the console during installation that no errors occurred). Installed stress-kernel-0.9-11 and started a Cerberus run with the MEMORY, CRASHME, FS, and P3-FPU tests. Within 20 seconds, the machine was failing P3-FPU tests, with additional error messages dumped to the console:

/usr/bin/ctcs/runin/runtest: line 170: 5512 Segmentation fault (core dumped) ./$ttrun $params 2>&1

SCSI disk errors were also occurring, identical to the error reported above except for Info fld=0x21bd6f. Checking the Cerberus logs for P3-FPU revealed:

Tue Mar 13 18:18:26 EST 2001: P3-FPU success: on 1/256 after 19s
/bin/sh: error while loading shared libraries: libtermcap.so.2: cannot load shared object file: Input/output error
Tue Mar 13 18:18:27 EST 2001: P3-FPU FAILED: on 2/256 after 20s
/bin/sh: error while loading shared libraries: libtermcap.so.2: cannot load shared object file: Input/output error
Tue Mar 13 18:18:27 EST 2001: P3-FPU FAILED: on 3/256 after 20s
/bin/sh: error while loading shared libraries: libtermcap.so.2: cannot load shared object file: Input/output error
<repeated for all remaining iterations (152) of P3-FPU>

No other test had yet run to completion. /bin/ls dies with SIGSEGV after Cerberus exited with those errors; strace shows:

execve("/bin/ls", ["ls"], [/* 22 vars */]) = 0
uname({sys="Linux", node="test93.test.redhat.com", ...}) = 0
brk(0) = 0x80540c4
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=19708, ...}) = 0
old_mmap(NULL, 19708, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
close(3) = 0
open("/lib/libtermcap.so.2", O_RDONLY) = 3
read(3, "", 1024) = 0
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001c000
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++
Any more info on the non-MMAP_FIFO results? That one will be important: it will help us to decide whether the 0x39c39c39 pattern is just being picked up from the MMAP_FIFO processes accidentally, or whether the mmap code is actually causing the trouble.
No more info available (yet); we're working to sort this out, somewhat limited by hardware. I should have some different results in about 18-24 hours (possibly earlier; I'll post preliminary results here to help with the time crunch).
On the same hardware, installing qa0314.0 (kernel 2.4.2-0.1.28-BOOT): anaconda reports that lilo did not run properly (kindly urging the user to make a boot disk), and the machine encounters SCSI errors (displayed on VC4). Here's a screenful of them, manually copied. It's a four-line repeating sequence; the raw sense data and sectors differ slightly between messages.

<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0x92 0xa4 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2906232
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2e 0x0e 0xb1 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2938024
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2e 0x0e 0xb1 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2938024
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0xbc 0x4c 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2916936
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0xbc 0x4c 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2916936
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0xbc 0x4c 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2916936

The machine might still be in the same condition when you read this, so if you want more information on it, let me know.
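For reference, the raw sense data above is standard SCSI fixed-format sense and can be unpacked mechanically. A small illustrative decoder (hypothetical helper, not driver code), fed the bytes from one of the sector 2916936 messages:

#include <stdio.h>
#include <stdint.h>

/* Unpack SCSI fixed-format sense data (response code 0x70/0x71;
 * here 0xf0 = current error with the Information field valid). */
static void decode_sense(const uint8_t *s)
{
    uint8_t  key  = s[2] & 0x0f;                 /* 0x4 = Hardware Error */
    uint32_t info = (uint32_t)s[3] << 24 | (uint32_t)s[4] << 16 |
                    (uint32_t)s[5] << 8  | s[6]; /* failing block */
    uint8_t  asc  = s[12];                       /* 0x32 */
    uint8_t  ascq = s[13];                       /* 0x00 */

    /* ASC/ASCQ 0x32/0x00 is "No defect spare location available":
     * the drive has run out of spare sectors for remapping. */
    printf("sense key=%x info=0x%x ASC=%02x ASCQ=%02x\n",
           key, info, asc, ascq);
}

int main(void)
{
    const uint8_t sense[18] = {
        0xf0, 0x00, 0x04, 0x00, 0x2d, 0xbc, 0x4c, 0x0a, 0x00,
        0x00, 0x00, 0x00, 0x32, 0x00, 0x03, 0x00, 0x00, 0x00 };
    decode_sense(sense); /* prints info=0x2dbc4c, matching "Info fld" */
    return 0;
}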
Could you open the scsi error case as a separate bugzilla report? It doesn't look as if it is likely to be related to the core memory corruptions in the rest of this bug report.
done, opened as bug #31863.
Do any of the Cerberus tests use shared memory? I've just been chasing a tmpfs problem reported by Arjan, and swapoff after that test run is producing a slew of "VM: Undead swap entry 00256300" type of errors. tmpfs and shm use the same core VM functionality, so I wonder if these can be related.
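For reference, a minimal SysV shared-memory exerciser looks like the sketch below (hypothetical code, not an actual Cerberus test). Pages of such a segment are managed by the same shm core that backs tmpfs, and they acquire swap entries through the same code paths, which is why a shm-using test could plausibly tickle the same bug:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a SysV shared-memory segment, dirty every page, then
 * detach and remove it.  Under memory pressure these pages get
 * swap entries via the shm/tmpfs layer. */
int main(void)
{
    const size_t len = 4 * 1024 * 1024;
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    char *p = shmat(id, NULL, 0);
    if (p == (char *)-1) { perror("shmat"); return 1; }

    memset(p, 0x5a, len);        /* dirty every page */
    shmdt(p);
    shmctl(id, IPC_RMID, NULL);  /* mark segment for removal */
    return 0;
}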
One more thing --- would it be possible to try a test run with a uniprocessor kernel? If we can determine whether or not the VM memory corruption bug is an SMP race, that will enormously reduce the places we're searching in the code.

Btw, the

Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available

errors mentioned at one point above indicate a dead disk, and definitely don't point to a kernel problem. The disk has simply got too many bad sectors and has no spare space left to remap duff blocks any more.
Here's another case of page cache corruption, hitting on a 4-way Bear during a database build:

obj-$(CONFIG_NLS_CODEPAGE_775) += nls_cp775.o
obj-$(CONFIG_NLS_CODEPAGE_8p^QBEp^QBE^P^@^@D0^^3@.o
obj-$(CONFIG_NLS_CODEPAGE_852) += nls_cp852.o

In hex:

00000f0 5f47 4c4e 5f53 4f43 4544 4150 4547 385f
0000100 1170 c542 1170 c542 0010 c400 1eb0 c033
0000110 6f2e 6f0a 6a62 242d 4328 4e4f 4946 5f47

which are definitely kernel pointers (read as little-endian 32-bit values: 0xc5421170, 0xc5421170, 0xc4000010, 0xc0331eb0; all at or above the i386 kernel's 0xc0000000 PAGE_OFFSET).

-ben
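One quick way to spot this kind of contamination is to scan a file for aligned 32-bit words that fall in the i386 kernel's virtual address range. A small hypothetical helper, not part of any test suite:

#include <stdio.h>
#include <stdint.h>

/* Scan a file for aligned little-endian 32-bit words that look like
 * i386 kernel virtual addresses (>= PAGE_OFFSET, 0xc0000000).  Such
 * words inside what should be source text are a strong hint that a
 * page of kernel data landed in the page cache. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 2;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t b[4];
    long off = 0;
    while (fread(b, 1, 4, f) == 4) {
        uint32_t w = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
                     (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
        if (w >= 0xc0000000)
            printf("%08lx: %08x looks like a kernel pointer\n", off, w);
        off += 4;
    }
    fclose(f);
    return 0;
}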
This defect is considered MUST-FIX for Florence Gold.
OK, the generic corruption appears fixed and the bootmem alloc bug is fixed; time to re-test the BusLogic card.
BusLogic was tested before release, but someone forgot to update the bug report. :-)