Filesystem errors consistently appear on machines that ran Cerberus this weekend. All machines tested had SCSI hardware and were performing the destructive SCSI test against an unused (and empty) /home partition. I am re-running tests against IDE and SCSI machines, and with 2.4.2-0.1.25, to eliminate extra variables in testing. Output from at least one machine will be logged via serial console to a (relatively) stable machine.

The basic error pattern looks like this (note that the logged messages themselves arrive garbled and interleaved):

Mar 10 02:12:31 test93 kernel: of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want= limit=265041
Mar 10 02:12:32 test93 kernel: of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit= of device
Mar 10 02:12:32 test93 kernel: <6 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want= limit=26 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wa limit= of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit=26 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want= limit=265 of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wa limit= of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, wan limit of device
Mar 10 02:12:32 test93 kernel: 08:08: rw=0, want limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=969 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=96 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=969 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=9691 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, wan limi of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=96 limit of device
Mar 10 02:12:33 test93 kernel: 08:08: rw=0, want=9 limit=26 of device
Mar 10 02:12:34 test93 kernel: 08:08: rw=0, want=9691208 limit of device
<similar for ~100 lines>
Mar 10 02:12:46 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:46 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:46 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:46 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:46 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:46 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:47 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
Mar 10 02:12:47 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:47 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
<for ~100 lines>
Mar 10 02:12:52 test93 kernel: EXT2-fs error (device sd(8,8)): ext2_free_branches: Read failure, inode=22220, block=969120825
Mar 10 02:12:52 test93 kernel: attempt to access beyond end of device
Mar 10 02:12:52 test93 kernel: 08:08: rw=0, want=969120826, limit=265041
<for ~800 lines>
<after this, no new patterns or messages discernible>

------

Errors occurred on the following machines:

Compaq ML530 with sym53c8xx (2.4.2-0.1.25, compiled by mingo for a special case)
Acer with aic_7xxx driver running stock 2.4.2-0.1.24smp
HP NetServer with megaraid and unused aic_7xxx running 2.4.2-0.1.23smp
HP Kayak with sym53c8xx running 2.4.2-0.1.24 (uniprocessor)

Cerberus reported that all tests passed without errors, and an EXT2-fs error was displayed on the console immediately after Cerberus exited.

This looks like three separate errors:

attempt to access beyond end of device
08:08: rw=0, want=969120826, limit=265041
EXT2-fs error in ext2_free_branches

Coincident with these errors, there is some sort of corruption causing the klogd logs to include permutations of the original error messages. On all machines the errors began suddenly and continued from that point forward (the tests had started several hours earlier, with no error messages until the onset of these).

Any suggestions for looking further at these machines, or additional data that should be recorded now?
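For reference, the numbers in the repeated message decode as follows: block device 08:08 is major 8 (SCSI disk), minor 8, i.e. /dev/sda8 under the classic 16-minors-per-disk layout (the same scheme that makes sd(8,5) sda5 and sd(8,18) sdb2); rw=0 is a read; want and limit are counted in filesystem blocks. A minimal sketch of the decoding (the helper name is mine, and the 1 KiB block size is an assumption based on the ext2 default):

```python
def decode_sd_minor(minor):
    """Map a SCSI-disk minor number to a device name, assuming the
    classic layout of 16 minors per disk (sda = 0-15, sdb = 16-31, ...).
    Partition 0 means the whole disk."""
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else "%s%d" % (name, part)

# "08:08: rw=0, want=969120826, limit=265041"
rw, want, limit = 0, 969120826, 265041
print(decode_sd_minor(8))              # sda8
print("read" if rw == 0 else "write")  # rw=0 is a read
# Assuming 1 KiB blocks (the ext2 default), the partition size is:
print("%.0f MiB" % (limit / 1024.0))   # ~259 MiB
```

Under that assumption, limit=265041 is a roughly 259 MiB partition, while want=969120826 would be nearly a terabyte in, far beyond any disk in these machines.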
By chance, did all these boxes have an OHCI USB controller?
If it proves to be SCSI only, could you try the stock Wolverine kernel (0.1.9)?
At least one (the HP Kayak) reports uhci. Still checking on the other boxes; they're in various states that make it hard to check for about another hour. I'll test with the stock Wolverine kernel if IDE doesn't cause the same problems, or if another SCSI-only machine that elicits the problem on 2.4.2-0.1.24 becomes free first.
See also bug #30174 - 2.4.2-0.1.22 (qa0307.0) fails cerberus with fs errors
There is something strange in here which I've seen before in other 2.4.* corruption reports. The illegal block we have got is hex 0x39C39C39; in other words, it's a pattern repeating every 12 bits.

Ah, found it: go to http://devserv.devel.redhat.com/~bmatthews/kernel_regression.html and look up the 2.4.1-0.1.10 results. The 2.4.1-0.1.9smp (stock Wolverine config) result also shows a corruption with a similar pattern, this time 1728999656 == 0x670E70E8. 2.4.2-ac7 - enterprise.html has the same pattern, as does 2.4.2-ac7 - HIGHMEM-4GB.

This looks a lot more like a driver overwriting memory with a specific pattern than a filesystem or VM fault. I can't ever remember seeing this sort of pattern from a VM/VFS bug in the past. Does anybody recognise the pattern?
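The 12-bit periodicity in both values can be checked mechanically. A minimal sketch (the function name and the negative-control value 0x12345678 are my own, not from the report) that scans a 32-bit value's bit string for an immediately repeated 12-bit unit:

```python
def has_12bit_repeat(x, width=32):
    """Return True if x's binary representation contains a 12-bit
    unit immediately repeated (a 24-bit run with period 12)."""
    bits = format(x, "0%db" % width)
    for i in range(width - 24 + 1):
        if bits[i:i + 12] == bits[i + 12:i + 24]:
            return True
    return False

for val in (0x39C39C39, 0x670E70E8, 0x12345678):
    print(hex(val), has_12bit_repeat(val))
# 0x39C39C39 and 0x670E70E8 both contain a repeated 12-bit unit
# (001110011100 and 011100001110 respectively); 0x12345678 does not.
```

In 0x39C39C39 the unit is aligned to bit 0, while in 0x670E70E8 the repeated 0x70E unit is offset by a nibble, which is why a plain rotate-and-compare test would miss the second value.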
Brian, please re-run these tests twice: once with current 586-smp kernel and once with 686-enterprise kernel. That will show whether/how this is related to HIGHMEM; 586 kernels are not HIGHMEM kernels at all, and the enterprise kernel is HIGHMEM64G.
Starting re-runs with the kernels suggested by johnsonm. Last night's runs had the following results:

Most machines are still actively running with apparent success on 2.4.2-0.1.25. It is too early to presume that the earlier problems aren't occurring on them.

The machine with a BusLogic SCSI interface running 2.4.2-0.1.25smp (i686) has severe memory corruption, and failed the MEMORY0 test in stress-kernel (Cerberus) on 28 of 64 attempts. (The BusLogic card was being tested for bug #31074.)

The machine with sym53c8xx running 2.4.2-0.1.26smp (with 'noapic') is still running at 14h33m; last progress was a successful completion of a test at 11h45m. The machine is barely using swap, and RAM is almost completely used (according to free). It also shows the following error on the console:

EXT2-fs error (device sd(8,5)): ext2_free_branches: Read failure, inode=22607, block=969120825

sda5, a.k.a. sd(8,5), is the mounted / filesystem; the destructive SCSI tests were being run on sdb2, sd(8,18).
969120825 = 0x39c39c39 = 00111001110000111001110000111001 spooky
take a look at bug #31074 for another instance of that pattern
Appears fixed in 2.4.2-0.1.51 and later