I'm using RHL73 with 2.4.18-5 (i586) on the production box I'm writing from now. Previously this was RHL72, before that RHL62, before that RHL52. This also happened with 2.4.18-3 and earlier (2.4.9, and probably once or twice over 2-3 years in the 2.2 series), but my gut feeling is that it has become worse of late.

The problem is that the kernel keeps crashing mysteriously, without any kind of oops on the console or any message in any log. The crash is only partial: the system answers pings, but no service such as www or ssh responds (all existing ssh sessions hang). It therefore seems that kernel and userspace have "split". On the console you can press Enter or any other key, but apart from the characters being printed on screen, nothing happens.

When the system has crashed, the HDD LED is always on; therefore I believe the block system is somehow at fault here. The crashes have only ever happened during the night, sometime between 02:30 and 05:00. This is when backups are made to another hard disk (mke2fs a partition, mount it, copy everything with tar, unmount). This is another reason to believe the block system is the culprit.

I'm not using Ultra DMA or any other hdparm tweaks; I can get the system to crash in a similar fashion if I enable some DMA settings, so I guess the motherboard/HDD chipset is a bit buggy anyway.

A bit about the system: every partition is mirrored with raid1, all ext3 (the mirrored partitions are on the two Quantum Fireballs; the backup drive is hdd, the IBM, and it is non-raid ext2).

--8<--
00:00.0 Host bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3] (rev 04)
00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C586/A/B PCI-to-ISA [Apollo VP] (rev 47)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. USB (rev 02)
00:07.3 Bridge: VIA Technologies, Inc. VT82C586B ACPI (rev 10)
00:08.0 VGA compatible controller: S3 Inc. 86c764/765 [Trio32/64/64V+]
00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
--8<--

--8<--
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller on PCI bus 00 dev 39
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c586b (rev 47) IDE UDMA33 controller on pci00:07.1
ide0: BM-DMA at 0xe000-0xe007, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0xe008-0xe00f, BIOS settings: hdc:DMA, hdd:DMA
hda: QUANTUM FIREBALLP AS20.5, ATA DISK drive
hdb: TOSHIBA CD-ROM XM-6102D, ATAPI CD/DVD-ROM drive
hdc: QUANTUM FIREBALLP AS20.5, ATA DISK drive
hdd: IBM-DTTA-350840, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 40132503 sectors (20548 MB) w/1902KiB Cache, CHS=2498/255/63, UDMA(33)
hdc: 40132503 sectors (20548 MB) w/1902KiB Cache, CHS=39813/16/63, UDMA(33)
hdd: 16514064 sectors (8455 MB) w/467KiB Cache, CHS=16383/16/63, UDMA(33)
ide-floppy driver 0.99.newide
Partition check:
hda: hda1 hda2 hda3 < hda5 hda6 hda7 hda8 >
hdc: [PTBL] [2498/255/63] hdc1 hdc2 hdc3 < hdc5 hdc6 hdc7 hdc8 >
hdd: [PTBL] [1027/255/63] hdd1 < hdd5 hdd6 hdd7 >
--8<--

--8<--
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 300.691
--8<--

I'd _really_ appreciate pointers on how to move on from this, either to debug it further or to get it over and done with (which computer parts to replace, for example) -- as this is a production system, this is really eating the credibility of Linux... :-(

In the meantime, I have checked all the drives for bad blocks and removed unmaskirq.
The settings (all of them default) are now like:

root: /home/pekkas$ /sbin/hdparm -v /dev/hda /dev/hdb /dev/hdc /dev/hdd

/dev/hda:
 multcount = 16 (on)
 I/O support = 1 (32-bit)
 unmaskirq = 0 (off)
 using_dma = 1 (on)
 keepsettings = 0 (off)
 nowerr = 0 (off)
 readonly = 0 (off)
 readahead = 8 (on)
 geometry = 2498/255/63, sectors = 40132503, start = 0
 busstate = 1 (on)

/dev/hdb:
 HDIO_GET_MULTCOUNT failed: Invalid argument
 I/O support = 1 (32-bit)
 unmaskirq = 0 (off)
 using_dma = 0 (off)
 keepsettings = 0 (off)
 HDIO_GET_NOWERR failed: Invalid argument
 readonly = 1 (on)
 readahead = 8 (on)
 HDIO_GETGEO failed: Invalid argument
 busstate = 1 (on)

/dev/hdc:
 multcount = 16 (on)
 I/O support = 1 (32-bit)
 unmaskirq = 0 (off)
 using_dma = 1 (on)
 keepsettings = 0 (off)
 nowerr = 0 (off)
 readonly = 0 (off)
 readahead = 8 (on)
 geometry = 2498/255/63, sectors = 40132503, start = 0
 busstate = 1 (on)

/dev/hdd:
 multcount = 16 (on)
 I/O support = 1 (32-bit)
 unmaskirq = 0 (off)
 using_dma = 1 (on)
 keepsettings = 0 (off)
 nowerr = 0 (off)
 readonly = 0 (off)
 readahead = 8 (on)
 geometry = 1027/255/63, sectors = 16514064, start = 0
 busstate = 1 (on)

Note: except for the unmaskirq, these are the default settings.

When doing 'dd if=/dev/hdd of=/dev/zero bs=10M' I get errors like:

hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x53 { DriveReady SeekComplete Index Error }

These errors cease immediately if I turn off DMA on /dev/hdd (the backup drive). So should I assume that there is some bug in the /dev/hdd drive, and that this is what causes the kernel freezes?
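For anyone reading the dma_intr lines above, here is a minimal bash sketch that splits the status= and error= bytes into the bit names the driver prints. The function names (decode_bits, decode_ide_status, decode_ide_error) are made up for this illustration, the status table follows the standard ATA status register layout, and the error table is deliberately partial: it covers only the two bits that appear in this report.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not part of any package).

decode_bits() {
    # $1 = register value such as 0x51; remaining args = "mask:name" pairs
    local value=$(( $1 )); shift
    local out="" pair mask name
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        # Append the name when the corresponding bit is set
        if (( value & mask )); then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_ide_status() {
    # ATA status register bits, labelled as in the log lines above
    decode_bits "$1" 0x80:Busy 0x40:DriveReady 0x20:DeviceFault \
        0x10:SeekComplete 0x08:DataRequest 0x04:CorrectedError \
        0x02:Index 0x01:Error
}

decode_ide_error() {
    # Partial table (assumption): only the two bits seen in this report
    decode_bits "$1" 0x04:DriveStatusError 0x80:BadCRC
}

decode_ide_status 0x51   # DriveReady SeekComplete Error
decode_ide_status 0x53   # DriveReady SeekComplete Index Error
decode_ide_error  0x84   # DriveStatusError BadCRC
```

Bit 0x80 of the error register is the interface CRC flag, which in UDMA modes usually suggests a transfer problem between controller and drive (cabling, chipset, mode negotiation) rather than a media defect; that would be consistent with the errors disappearing once DMA is turned off, for example with 'hdparm -d0 /dev/hdd', but it is not proof of the cause.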
My problem seems somewhat similar. Reproducible on two systems.

Hardware: single Pentium MMX Kontron/Jumptec ETX-MGX board on an in-house carrier card, single IBM Travelstar laptop drive.

Kernel mods: IRQ0 is sharable, HZ=500.

Stress test:

cd /usr/src/linux-2.4.18-3;while(`true`);do make dep clean bzImage;done

Runs between 9 and 18 hours, then a no-symptom, no-message, dead-to-the-world hang. ssh sessions hang. The system is headless (no vga console), so a pushbutton reset is required.

Attempting work-around: migrate to 2.4.20 kernel; change fstab ext3->ext2; set nmi_watchdog=1.

Will happily add whatever other instrumentation might help, I just need instructions for dummies.

The same board runs full-speed memtest86 for hours, and DOS-based ISA device tests for *days*.
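One low-effort piece of instrumentation for a stress test like the one above is a heartbeat log, so that after a reset the last line on disk narrows down when the hang happened. The following is only a sketch under assumptions: the function names and the log path are hypothetical, and there is no guarantee that the final line survives a hard hang even with the sync.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; names and paths are assumptions.

heartbeat() {
    # $1 = log file, $2 = free-form label
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$2" >> "$1"
    sync    # push the line to disk before the next pass starts
}

stress_loop() {
    # $1 = log file, $2 = number of passes (0 = run forever), rest = command
    local logfile=$1 passes=$2
    shift 2
    local n=0
    while (( passes == 0 || n < passes )); do
        n=$(( n + 1 ))
        heartbeat "$logfile" "pass $n start"
        # Record failures instead of stopping, so the loop keeps loading the box
        "$@" || heartbeat "$logfile" "pass $n failed with status $?"
    done
}
```

Under these assumptions, the original loop would become something like 'cd /usr/src/linux-2.4.18-3 && stress_loop /var/log/stress.log 0 make dep clean bzImage'. Comparing the last heartbeat with the wall-clock time of the reset gives a rough upper bound on how far into a pass the machine died.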
Follow-up:

The ETX-format CPU card (Kontron/Jumptec ETX-MGX) is on a carrier that has a debug port at 0x228, similar to the (traditional) port 80.

Downloaded 2.4.20 ("latest stable"). menuconfig'd, built, ran lilo, booted fine.

Kernel changes: HZ=500; irq0 sharable; mod to timer_interrupt:

unsigned char intCount;
++intCount;
outb_p(intCount,0x228);

Started the stress test (n.b. run as user root):

cd /usr/src/linux-2.4.20;while(`true`);do make clean bzImage;done

and the kernel hung.

top(1) (run as user root) last said:

9:56pm up 1:33, 3 users, load average: 1.20, 1.26, 1.20
31 processes: 28 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 95.4% user, 4.5% system, 0.0% nice, 0.0% idle
Mem: 60696K av, 58460K used, 2236K free, 0K shrd, 4916K buff
Swap: 152608K av, 116K used, 152492K free 29480K cached

tail -f /var/log/messages (run as user root) last said, about 1 hour earlier:

May 17 20:27:04 mcut16 ntpd[519]: kernel time discipline status change 41

The debug port shows the timer-interrupt counter NOT incrementing; ergo the kernel is not running at any level.

hdparm /dev/hda shows:

/dev/hda:
 multcount = 16 (on)
 I/O support = 0 (default 16-bit)
 unmaskirq = 0 (off)
 using_dma = 1 (on)
 keepsettings = 0 (off)
 nowerr = 0 (off)
 readonly = 0 (off)
 readahead = 8 (on)
 geometry = 730/255/63, sectors = 11733120, start = 0
 busstate = 1 (on)

Kernel .config available upon request (776 lines). dmesg available upon request (101 lines).

Will perform whatever tests might help find this problem's source and cure it.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/