Filesystem errors consistently appear on machines that ran Cerberus this weekend. All machines tested had SCSI hardware and were performing the destructive SCSI test against an unused (and empty) /home partition. I am re-running tests against IDE and SCSI machines, and with 2.4.2-0.1.25, to eliminate extra variables in testing. Output from at least one machine will be logged via serial console to a (relatively) stable machine.

The basic error pattern looks like this (note that the logged messages themselves arrive garbled and interleaved):

Mar 10 02:12:31 test93 kernel: of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want= limit=265041
Mar 10 02:12:32 test93 kernel: of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit= of device
Mar 10 02:12:32 test93 kernel: <6 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want= limit=26 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wa limit= of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit=26 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want= limit=265 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wa limit= of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=969 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=96 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=969 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=9691 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, wan limi of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=96 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=9 limit=26 of device
Mar 10 02:12:34 test93 kernel: 08:08: rw=0, want=9691208 limit of device
<similar for ~100 lines>
Mar 10 02:12:46 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:46 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:46 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:46 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:46 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:46 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:47 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:47 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:47 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
<for ~100 lines>
Mar 10 02:12:52 test93 kernel: EXT2-fs error (device sd(8,8)): ext2_free_branches: Read failure, inode=22220, block=969120825
Mar 10 02:12:52 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:52 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
<for ~800 lines>
<after this, no new patterns or messages discernible>

------

Errors occurred on the following machines:

Compaq ML530 with sym53c8xx (2.4.2-0.1.25, compiled by mingo for a special case)
Acer with aic_7xxx driver running stock 2.4.2-0.1.24smp
HP NetServer with megaraid and unused aic_7xxx running 2.4.2-0.1.23smp
HP Kayak with sym53c8xx running 2.4.2-0.1.24 (uniprocessor)

Cerberus reported that all tests passed without errors, and an EXT2-fs error was displayed on the console immediately after Cerberus exited.

This looks like three separate errors:

attempt to access beyond end of device
08:08: rw=0, want=969120826, limit=265041
EXT2-fs error in ext2_free_branches

Coincident with these errors, there is some sort of corruption causing the klogd logs to include permutations of the original error messages. On all machines the errors began suddenly and continued from that point forward (the tests had started several hours earlier, with no error messages until the onset of these).

Any suggestions for looking further at these machines, or additional data that should be recorded now?
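For reference, the numbers in the repeated message decode as follows: block device 08:08 is major 8 (SCSI disk), minor 8, i.e. /dev/sda8 under the classic 16-minors-per-disk layout (the same scheme that makes sd(8,5) sda5 and sd(8,18) sdb2); rw=0 is a read; want and limit are counted in filesystem blocks. A minimal sketch of the decoding (the helper name is mine, and the 1 KiB block size is an assumption based on the ext2 default):

```python
def decode_sd_minor(minor):
    """Map a SCSI-disk minor number to a device name, assuming the
    classic layout of 16 minors per disk (sda = 0-15, sdb = 16-31, ...).
    Partition 0 means the whole disk."""
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else "%s%d" % (name, part)

# "08:08: rw=0, want=969120826, limit=265041"
rw, want, limit = 0, 969120826, 265041
print(decode_sd_minor(8))              # sda8
print("read" if rw == 0 else "write")  # rw=0 is a read
# Assuming 1 KiB blocks (the ext2 default), the partition size is:
print("%.0f MiB" % (limit / 1024.0))   # ~259 MiB
```

Under that assumption, limit=265041 is a roughly 259 MiB partition, while want=969120826 would be nearly a terabyte in, far beyond any disk in these machines.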
By chance, did all these boxes have an OHCI USB controller?
If it proves to be SCSI only, could you try the stock Wolverine kernel (0.1.9)?
At least one (the HP Kayak) reports uhci. Still checking on the other boxes; they're in various states that make it hard to check for about another hour. I'll test with the stock Wolverine kernel if IDE doesn't cause the same problems, or if another SCSI-only machine that elicits the problem on 2.4.2-0.1.24 becomes free first.
See also bug #30174 - 2.4.2-0.1.22 (qa0307.0) fails cerberus with fs errors
There is something strange in here which I've seen before in other 2.4.* corruption reports. The illegal block we have got is hex 0x39C39C39; in other words, it's a pattern repeating every 12 bits.

Ah, found it: go to http://devserv.devel.redhat.com/~bmatthews/kernel_regression.html and look up the 2.4.1-0.1.10 results. The 2.4.1-0.1.9smp (stock Wolverine config) result also shows a corruption with a similar pattern, this time 1728999656 == 0x670E70E8. 2.4.2-ac7 - enterprise.html has the same pattern, as does 2.4.2-ac7 - HIGHMEM-4GB.

This looks a lot more like a driver overwriting memory with a specific pattern than a filesystem or VM fault. I can't ever remember seeing this sort of pattern from a VM/VFS bug in the past. Does anybody recognise the pattern?
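The 12-bit periodicity in both values can be checked mechanically. A minimal sketch (the function name and the negative-control value 0x12345678 are my own, not from the report) that scans a 32-bit value's bit string for an immediately repeated 12-bit unit:

```python
def has_12bit_repeat(x, width=32):
    """Return True if x's binary representation contains a 12-bit
    unit immediately repeated (a 24-bit run with period 12)."""
    bits = format(x, "0%db" % width)
    for i in range(width - 24 + 1):
        if bits[i:i + 12] == bits[i + 12:i + 24]:
            return True
    return False

for val in (0x39C39C39, 0x670E70E8, 0x12345678):
    print(hex(val), has_12bit_repeat(val))
# 0x39C39C39 and 0x670E70E8 both contain a repeated 12-bit unit
# (001110011100 and 011100001110 respectively); 0x12345678 does not.
```

In 0x39C39C39 the unit is aligned to bit 0, while in 0x670E70E8 the repeated 0x70E unit is offset by a nibble, which is why a plain rotate-and-compare test would miss the second value.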
Brian, please re-run these tests twice: once with current 586-smp kernel and once with 686-enterprise kernel. That will show whether/how this is related to HIGHMEM; 586 kernels are not HIGHMEM kernels at all, and the enterprise kernel is HIGHMEM64G.
Starting re-runs with the kernels suggested by johnsonm. Last night's runs had the following results:

Most machines are still actively running with apparent success on 2.4.2-0.1.25. It is too early to presume that the earlier problems aren't occurring on them.

The machine with a BusLogic SCSI interface running 2.4.2-0.1.25smp (i686) has severe memory corruption, and failed the MEMORY0 test in stress-kernel (Cerberus) on 28 of 64 attempts. (The BusLogic card was being tested for bug #31074.)

The machine with sym53c8xx running 2.4.2-0.1.26smp (with 'noapic') is still running at 14h33m; last progress was a successful completion of a test at 11h45m. The machine is barely using swap, and RAM is almost completely used (according to free). It also shows the following error on the console:

EXT2-fs error (device sd(8,5)): ext2_free_branches: Read failure, inode=22607, block=969120825

sda5, a.k.a. sd(8,5), is the mounted / filesystem; the destructive SCSI tests were being run on sdb2, sd(8,18).
969120825 = 0x39c39c39 = 00111001110000111001110000111001 spooky
take a look at bug #31074 for another instance of that pattern
Appears fixed in 2.4.2-0.1.51 and later