Testing the BusLogic BA80C32 SCSI controller with the Cerberus test suite on x86. The system was built from qa0307.0, which included kernel 2.4.2-0.1.22. After 11h48m15s, the following error messages were displayed on the console:

EXT2-fs error (device sd(8,5)): ext2_readdir: bad entry in directory #89775: inode out of bounds - offset=0, inode=50462976
EXT2-fs error (device sd(8,5)): ext2_readdir: bad entry in directory #78116: inode out of bounds - offset=0, inode=50462976
EXT2-fs error (device sd(8,5)): ext2_readdir: bad entry in directory #33317: inode out of bounds - offset=0, inode=50462976

Keyboard input was echoed on screen, and the NumLock and CapsLock keys were immediately responsive. Ctrl-C appeared to stop Cerberus, but no shell prompt was displayed; a second Ctrl-C did not display the typical error message. There were no problems changing VCs, and all other VCs were in the same state.
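For reference, the "inode out of bounds" message comes from ext2's directory-entry sanity checking. A minimal sketch of that check, modeled loosely on the 2.4 ext2_check_dir_entry() logic (simplified, not the exact kernel source):

#include <stdint.h>
#include <stddef.h>

/* On-disk ext2 directory entry header (fields shown in host byte
 * order for simplicity; on disk they are little-endian). */
struct ext2_dirent {
    uint32_t inode;     /* inode number; 0 means unused entry */
    uint16_t rec_len;   /* length of this record, in bytes */
    uint8_t  name_len;
    uint8_t  file_type;
    /* name[] follows */
};

/* Returns an error string, or NULL if the entry looks sane. */
static const char *check_dir_entry(const struct ext2_dirent *de,
                                   uint32_t inodes_count)
{
    if (de->rec_len < 8)
        return "rec_len is smaller than minimal";
    if (de->rec_len % 4 != 0)
        return "rec_len % 4 != 0";
    /* This is the check that fired here: the on-disk inode number
     * (50462976 == 0x03020100, itself a suspicious byte pattern)
     * is larger than the filesystem has inodes. */
    if (de->inode > inodes_count)
        return "inode out of bounds";
    return NULL;
}

Note that the offending inode value 50462976 decodes to 0x03020100, i.e. the byte sequence 00 01 02 03, which already suggests foreign data rather than a real directory entry.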
Another BusLogic data-corruption report. Do you have access to a different SCSI card so we can pin this down to a driver problem?
Yes, I've got access to a few (Durham test lab). Should I pick another BusLogic that uses a different driver, or will most SCSI cards work well enough for comparison?
Try something else reliable: say, an Adaptec. If you can reproduce the corruption with the BusLogic you've got and don't see any with an Adaptec, that would give us a clear indication of where the problem lies.
Retrying... this might take a bit longer than expected (or the problem may turn out not to be reproducible): I just discovered that the motherboard used had a beta chipset, which is apparently too buggy for reliable testing with Cerberus. I will start over on another system, repeat the run for a while, and post new information back here as it arises.
OK, thanks --- let me know what happens either way so we can either lay this to rest as a false alarm or resurrect it and dig further.
This resembles the generic SCSI corruption we are seeing (bug #31519).
Tested another system with the BusLogic card, using Cerberus and the 2.4.2-0.1.25smp kernel. The memory tests in Cerberus failed, with severe memory corruption. The contents of the Cerberus MEMORY0 log include:

Writing block size 16640 (65K) with alignment 3...Verifying...Done.
Writing block size 33024 (129K) with alignment 3...Verifying...Done.
Writing block size 49408 (193K) with alignment 3...Verifying...Done.
Writing block size 65792 (257K) with alignment 3...Verifying...Done.
Writing block size 82176 (321K) with alignment 3...Verifying...Done.
Writing block size 2973696 (11616K) with alignment 3...Verifying...Done.
Writing block size 5947392 (23232K) with alignment 3...Verifying...Done.
Writing block size 8921088 (34848K) with alignment 3...Verifying...Done.
Writing block size 11894784 (46464K) with alignment 3...Verifying...Done.
Writing block size 14868480 (58080K) with alignment 3...Verifying...Done.
Writing block size 17842176 (69696K) with alignment 3...Verifying...Done.
Writing block size 20815872 (81312K) with alignment 3...Verifying...Done.
Writing block size 23789568 (92928K) with alignment 3...Verifying...Memory error at offset 158459 of 23789568 : expected aaad15a5, got 39c39c39
Local process address: 0x401e6c00
Scanning /proc/kcore. This is dangerous, take cover.
Possible location of memory failure: 0x5584bff (85M) on page 21892
System RAM fault likely. Check System RAM first, then motherboard/CPU.
Failure Context:
offset   expected  got
158456   aaad15a2  aaad15a2
158457   aaad15a3  aaad15a3
158458   aaad15a4  aaad15a4
158459   aaad15a5  39c39c39  *** fail location
158460   aaad15a6  39c39c39
158461   aaad15a7  39c39c39
158462   aaad15a8  39c39c39
158463   aaad15a9  39c39c39
158464   aaad15aa  39c39c39
158465   aaad15ab  39c39c39
Mon Mar 12 21:54:11 EST 2001: MEMORY0 FAILED: on 2/64 after 5h9m1s

Each MEMORY0 failure is very similar to this one. The string '39c39c39' looks *identical* to the pattern that Cerberus' FIFO_MMAP test is using. From the Cerberus FIFO_MMAP log:

Total Statistics (1474):
Output device/file name: /tmp/FIFO.tmp.1046 (device type=fifo)
Type of I/O's performed: sequential
Data pattern written: 0x39c39c39 (read verify disabled)

After Cerberus exited with the failure, many commands (such as uptime) failed.
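For context, the MEMORY test is essentially a write-pattern/verify sweep over a large block. A minimal sketch of that technique in C (hypothetical code, not the actual Cerberus source; the incrementing pattern is an assumption based on the expected values in the failure context above):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Minimal write/verify memory test in the style of the Cerberus
 * MEMORY test: fill a block with a deterministic pattern, then read
 * it back and report the first mismatch.  Hypothetical sketch. */
int test_block(size_t nwords, uint32_t seed)
{
    uint32_t *buf = malloc(nwords * sizeof(uint32_t));
    size_t i;

    if (!buf)
        return -1;
    for (i = 0; i < nwords; i++)       /* write phase */
        buf[i] = seed + (uint32_t)i;
    for (i = 0; i < nwords; i++) {     /* verify phase */
        if (buf[i] != seed + (uint32_t)i) {
            printf("Memory error at offset %zu of %zu : "
                   "expected %08x, got %08x\n",
                   i, nwords, seed + (uint32_t)i, buf[i]);
            free(buf);
            return 1;
        }
    }
    free(buf);
    return 0;
}

The failure context fits this model: the expected values increment word by word, while ten consecutive words came back wholesale as 0x39c39c39, the FIFO_MMAP data pattern, which is why a stray page from another process is the prime suspect rather than a single flipped bit.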
strace revealed that each of these commands died with SIGSEGV on old_mmap. An strace of uptime:

execve("/usr/bin/uptime", ["uptime"], [/* 22 vars */]) = 0
uname({sys="Linux", node="test93.test.redhat.com", ...}) = 0
brk(0) = 0x8049744
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=19708, ...}) = 0
old_mmap(NULL, 19708, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
close(3) = 0
open("/lib/libproc.so.2.0.7", O_RDONLY) = 3
read(3, "", 1024) = 0
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001c000
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

(Note that the read() of /lib/libproc.so.2.0.7 returned 0 bytes instead of the 1024 requested, so the dynamic loader never saw valid ELF data for the library.)

/var/log/messages also contained the following:

Mar 12 21:42:25 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:25 test93 kernel: Info fld=0x2892ba, Current sd08:06: sense key Hardware Error
Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:25 test93 kernel: I/O error: dev 08:06, sector 506176
Mar 12 21:42:25 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:25 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:25 test93 kernel: I/O error: dev 08:06, sector 506184
Mar 12 21:42:25 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:25 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:25 test93 kernel: I/O error: dev 08:06, sector 506192
Mar 12 21:42:26 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:37 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:38 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:43 test93 kernel: I/O error: dev 08:06, sector 506200
Mar 12 21:42:44 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:44 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:44 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:46 test93 kernel: I/O error: dev 08:06, sector 506208
Mar 12 21:42:46 test93 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
Mar 12 21:42:52 test93 kernel: Info fld=0x2892b7, Current sd08:06: sense key Hardware Error
Mar 12 21:42:52 test93 kernel: Additional sense indicates No defect spare location available
Mar 12 21:42:55 test93 kernel: I/O error: dev 08:06, sector 506216
<... continues for ~1400 lines >
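For reference, "return code = 28000002" unpacks with the long-standing Linux SCSI result encoding (driver_byte<<24 | host_byte<<16 | msg_byte<<8 | status_byte). A small illustrative decode; the flag interpretations in the comments are my reading of the 2.4-era definitions, so treat them as hedged:

#include <stdio.h>
#include <stdint.h>

/* Decode a 2.4-era Linux SCSI command result word.  Illustrative
 * sketch only; flag names follow <scsi/scsi.h> as I recall it. */
int main(void)
{
    uint32_t result = 0x28000002;
    uint8_t driver = result >> 24;   /* 0x28 */
    uint8_t host   = result >> 16;   /* 0x00: no HBA-level error */
    uint8_t msg    = result >> 8;    /* 0x00 */
    uint8_t status = result;         /* 0x02: CHECK CONDITION */

    printf("driver=%02x host=%02x msg=%02x status=%02x\n",
           driver, host, msg, status);
    if (driver & 0x08)   /* DRIVER_SENSE: sense data is valid */
        printf("  sense data valid (hence the sense-key lines)\n");
    /* the remaining 0x20 bit in the driver byte is one of the
     * 2.4 SUGGEST_* hint bits */
    if (status == 0x02)  /* SCSI status CHECK CONDITION */
        printf("  target reported CHECK CONDITION\n");
    return 0;
}

A zero host byte plus CHECK CONDITION with valid sense data means the target device itself reported the failure, which is consistent with the dead-disk diagnosis further down in this report.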
Could you please try doing a run with the MMAP_FIFO test disabled?
Attempting to do that now, and trying to replicate the current errors on other machines. Reinstalled qa0312.0 (the same qa tree used before, with the 2.4.2-0.1.25 kernel) and reformatted the partitions (including a bad-block check; verified on the console during installation that no errors occurred). Installed stress-kernel-0.9-11 and started a Cerberus run with the MEMORY, CRASHME, FS, and P3-FPU tests. Within 20 seconds, the machine was failing P3-FPU tests, with additional error messages dumped to the console:

/usr/bin/ctcs/runin/runtest: line 170: 5512 Segmentation fault (core dumped) ./$ttrun $params 2>&1

SCSI disk errors were also occurring, identical to the error reported above except for Info fld=0x21bd6f. Checking the Cerberus logs for P3-FPU revealed:

Tue Mar 13 18:18:26 EST 2001: P3-FPU success: on 1/256 after 19s
/bin/sh: error while loading shared libraries: libtermcap.so.2: cannot load shared object file: Input/output error
Tue Mar 13 18:18:27 EST 2001: P3-FPU FAILED: on 2/256 after 20s
/bin/sh: error while loading shared libraries: libtermcap.so.2: cannot load shared object file: Input/output error
Tue Mar 13 18:18:27 EST 2001: P3-FPU FAILED: on 3/256 after 20s
/bin/sh: error while loading shared libraries: libtermcap.so.2: cannot load shared object file: Input/output error
<repeated for all remaining iterations (152) of P3-FPU>

No other test had yet run to completion. /bin/ls dies with SIGSEGV after Cerberus exited with those errors; strace shows:

execve("/bin/ls", ["ls"], [/* 22 vars */]) = 0
uname({sys="Linux", node="test93.test.redhat.com", ...}) = 0
brk(0) = 0x80540c4
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=19708, ...}) = 0
old_mmap(NULL, 19708, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
close(3) = 0
open("/lib/libtermcap.so.2", O_RDONLY) = 3
read(3, "", 1024) = 0
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001c000
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++
Any more info on the non-MMAP_FIFO results? That one will be important: it will help us to decide whether the 0x39c39c39 pattern is just being picked up from the MMAP_FIFO processes accidentally, or whether the mmap code is actually causing the trouble.
No more info available (yet); we're working to sort this out, somewhat limited by hardware. I should have some different results in about 18-24 hours (possibly earlier; I'll post preliminary results here to help with the time crunch).
On the same hardware, installing qa0314.0 (kernel 2.4.2-0.1.28-BOOT): anaconda reports that lilo did not run properly (kindly urging the user to make a boot disk), and the machine encounters SCSI errors (displayed on VC4). Here's a screenful of them, manually copied. It's a four-line repeating sequence; the raw sense data and sectors differ slightly between messages.

<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0x92 0xa4 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2906232
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2e 0x0e 0xb1 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2938024
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2e 0x0e 0xb1 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2938024
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0xbc 0x4c 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2916936
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0xbc 0x4c 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2916936
<4>SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
<4>Info fld=0x2dbc4c, current sd08:05: sns = f0 4
<4>ASC=32 ASCQ= 0
<4>Raw sense data: 0xf0 0x00 0x04 0x00 0x2d 0xbc 0x4c 0x0a 0x00 0x00 0x00 0x00 0x32 0x00 0x03 0x00 0x00 0x00
<4> I/O error: dev 08:05, sector 2916936

The machine might still be in the same condition when you read this, so if you want more information on it, let me know.
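For reference, the raw sense data above is standard SCSI fixed-format sense and can be unpacked mechanically. A small illustrative decoder (hypothetical helper, not driver code), fed the bytes from one of the sector 2916936 messages:

#include <stdio.h>
#include <stdint.h>

/* Unpack SCSI fixed-format sense data (response code 0x70/0x71;
 * here 0xf0 = current error with the Information field valid). */
static void decode_sense(const uint8_t *s)
{
    uint8_t  key  = s[2] & 0x0f;                 /* 0x4 = Hardware Error */
    uint32_t info = (uint32_t)s[3] << 24 | (uint32_t)s[4] << 16 |
                    (uint32_t)s[5] << 8  | s[6]; /* failing block */
    uint8_t  asc  = s[12];                       /* 0x32 */
    uint8_t  ascq = s[13];                       /* 0x00 */

    /* ASC/ASCQ 0x32/0x00 is "No defect spare location available":
     * the drive has run out of spare sectors for remapping. */
    printf("sense key=%x info=0x%x ASC=%02x ASCQ=%02x\n",
           key, info, asc, ascq);
}

int main(void)
{
    const uint8_t sense[18] = {
        0xf0, 0x00, 0x04, 0x00, 0x2d, 0xbc, 0x4c, 0x0a, 0x00,
        0x00, 0x00, 0x00, 0x32, 0x00, 0x03, 0x00, 0x00, 0x00 };
    decode_sense(sense); /* prints info=0x2dbc4c, matching "Info fld" */
    return 0;
}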
Could you open the scsi error case as a separate bugzilla report? It doesn't look as if it is likely to be related to the core memory corruptions in the rest of this bug report.
done, opened as bug #31863.
Do any of the Cerberus tests use shared memory? I've just been chasing a tmpfs problem reported by Arjan, and swapoff after that test run is producing a slew of "VM: Undead swap entry 00256300" type of errors. tmpfs and shm use the same core VM functionality, so I wonder if these can be related.
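For reference, a minimal SysV shared-memory exerciser looks like the sketch below (hypothetical code, not an actual Cerberus test). Pages of such a segment are managed by the same shm core that backs tmpfs, and they acquire swap entries through the same code paths, which is why a shm-using test could plausibly tickle the same bug:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a SysV shared-memory segment, dirty every page, then
 * detach and remove it.  Under memory pressure these pages get
 * swap entries via the shm/tmpfs layer. */
int main(void)
{
    const size_t len = 4 * 1024 * 1024;
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    char *p = shmat(id, NULL, 0);
    if (p == (char *)-1) { perror("shmat"); return 1; }

    memset(p, 0x5a, len);        /* dirty every page */
    shmdt(p);
    shmctl(id, IPC_RMID, NULL);  /* mark segment for removal */
    return 0;
}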
One more thing --- would it be possible to try a test run with a uniprocessor kernel? If we can determine whether or not the VM memory corruption bug is an SMP race, that will enormously reduce the places we're searching in the code.

Btw, the

Mar 12 21:42:25 test93 kernel: Additional sense indicates No defect spare location available

errors mentioned at one point above indicate a dead disk, and definitely don't point to a kernel problem. The disk has simply got too many bad sectors and has no spare space left to remap duff blocks any more.
Here's another case of page cache corruption, hitting on a 4-way Bear during a database build:

obj-$(CONFIG_NLS_CODEPAGE_775) += nls_cp775.o
obj-$(CONFIG_NLS_CODEPAGE_8p^QBEp^QBE^P^@^@D0^^3@.o
obj-$(CONFIG_NLS_CODEPAGE_852) += nls_cp852.o

In hex:

00000f0 5f47 4c4e 5f53 4f43 4544 4150 4547 385f
0000100 1170 c542 1170 c542 0010 c400 1eb0 c033
0000110 6f2e 6f0a 6a62 242d 4328 4e4f 4946 5f47

which are definitely kernel pointers (read as little-endian 32-bit values: 0xc5421170, 0xc5421170, 0xc4000010, 0xc0331eb0; all at or above the i386 kernel's 0xc0000000 PAGE_OFFSET).

-ben
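One quick way to spot this kind of contamination is to scan a file for aligned 32-bit words that fall in the i386 kernel's virtual address range. A small hypothetical helper, not part of any test suite:

#include <stdio.h>
#include <stdint.h>

/* Scan a file for aligned little-endian 32-bit words that look like
 * i386 kernel virtual addresses (>= PAGE_OFFSET, 0xc0000000).  Such
 * words inside what should be source text are a strong hint that a
 * page of kernel data landed in the page cache. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 2;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t b[4];
    long off = 0;
    while (fread(b, 1, 4, f) == 4) {
        uint32_t w = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
                     (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
        if (w >= 0xc0000000)
            printf("%08lx: %08x looks like a kernel pointer\n", off, w);
        off += 4;
    }
    fclose(f);
    return 0;
}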
This defect is considered MUST-FIX for Florence Gold.
OK, the generic corruption appears fixed and the bootmem alloc bug is fixed; time to re-test the BusLogic card.
BusLogic was tested before release, but someone forgot to update the bug report. :-)