Bug 67737 - kernel hangs without oopses (block system?)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386 Linux
Priority: high
Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2002-07-01 07:33 EDT by Pekka Savola
Modified: 2008-08-01 12:22 EDT (History)
Last Closed: 2004-09-30 11:39:43 EDT

Description Pekka Savola 2002-07-01 07:33:59 EDT
I'm using RHL73 with 2.4.18-5 (i586) on this production box I'm writing 
with now.  Previously this was RHL72, before that RHL62, before that 
RHL52.

This also happened with 2.4.18-3 and earlier kernels (2.4.9, and probably 
once or twice in 2-3 years with the 2.2 series), but my gut feeling is that 
it has become worse of late.

The problem is that the kernel keeps crashing mysteriously, without any kind 
of oops on the console or any message in any logs.  

The crash is only partial: the system answers pings, but no service such as
www or ssh responds (all existing ssh sessions hang).  It seems the kernel
and userspace have "split".  On the console, you can press enter or any
other key, but apart from the keys being echoed on screen, nothing happens.

When the system has crashed, the HDD LED is always on; therefore I believe 
the block system is somehow at fault here.

The crashes have only ever happened during the night, sometime between 
02:30 and 05:00.  This is when backups are made to another hard disk 
(mke2fs a partition, mount it, copy everything with tar, unmount).  This is 
another reason to believe the block system is somehow the culprit.
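
For reference, the backup sequence described above could look roughly like the sketch below. The partition /dev/hdd5, the mount point /mnt/backup and the archive name are assumptions (the report names only the drive, hdd), and writing a single tar archive of the root filesystem is just one reading of "copy everything with tar". With DRY_RUN=1, the default here, the commands are only printed.

```shell
#!/bin/sh
# Sketch of a nightly "mke2fs, mount, tar, unmount" backup.
# BACKUP_DEV, BACKUP_MNT and the archive name are hypothetical.
BACKUP_DEV=${BACKUP_DEV:-/dev/hdd5}
BACKUP_MNT=${BACKUP_MNT:-/mnt/backup}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

nightly_backup() {
    # Recreate the filesystem; this destroys the previous backup.
    run mke2fs -q "$BACKUP_DEV" || return 1
    run mount -t ext2 "$BACKUP_DEV" "$BACKUP_MNT" || return 1
    # Archive the root filesystem only; --one-file-system keeps tar from
    # descending into other mounts, including the backup mount itself.
    status=0
    run tar -C / --one-file-system -cf "$BACKUP_MNT/backup.tar" . || status=$?
    # Always unmount, even if tar failed.
    run umount "$BACKUP_MNT"
    return "$status"
}

nightly_backup
```

Running a sequence like this by hand during the day, with DRY_RUN=0, is one way to check whether the hang follows the backup job itself or only the time of night.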

I'm not using Ultra DMA or any other hdparm tweaks; I can get the system 
to crash in a similar fashion if I enable some DMA settings, so I guess 
the motherboard/HDD chipset is a bit buggy anyway.

A bit of the system:

Every partition is mirrored with RAID1, all ext3 (the mirrored partitions are 
on the two Quantum Fireballs; the backup drive is the IBM on hdd, which is 
non-RAID ext2).

--8<--
00:00.0 Host bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3] (rev 04)
00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C586/A/B PCI-to-ISA [Apollo VP] (rev 47)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. USB (rev 02)
00:07.3 Bridge: VIA Technologies, Inc. VT82C586B ACPI (rev 10)
00:08.0 VGA compatible controller: S3 Inc. 86c764/765 [Trio32/64/64V+]
00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
--8<--
--8<--
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller on PCI bus 00 dev 39
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c586b (rev 47) IDE UDMA33 controller on pci00:07.1
    ide0: BM-DMA at 0xe000-0xe007, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0xe008-0xe00f, BIOS settings: hdc:DMA, hdd:DMA
hda: QUANTUM FIREBALLP AS20.5, ATA DISK drive
hdb: TOSHIBA CD-ROM XM-6102D, ATAPI CD/DVD-ROM drive
hdc: QUANTUM FIREBALLP AS20.5, ATA DISK drive
hdd: IBM-DTTA-350840, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 40132503 sectors (20548 MB) w/1902KiB Cache, CHS=2498/255/63, UDMA(33)
hdc: 40132503 sectors (20548 MB) w/1902KiB Cache, CHS=39813/16/63, UDMA(33)
hdd: 16514064 sectors (8455 MB) w/467KiB Cache, CHS=16383/16/63, UDMA(33)
ide-floppy driver 0.99.newide
Partition check:
 hda: hda1 hda2 hda3 < hda5 hda6 hda7 hda8 >
 hdc: [PTBL] [2498/255/63] hdc1 hdc2 hdc3 < hdc5 hdc6 hdc7 hdc8 >
 hdd: [PTBL] [1027/255/63] hdd1 < hdd5 hdd6 hdd7 >
--8<--
--8<--
cpu family      : 5
model           : 8
model name      : AMD-K6(tm) 3D processor
stepping        : 12
cpu MHz         : 300.691
--8<--

I'd _really_ appreciate pointers on how to move on from this, to debug it 
further or to get it over and done with (which computer parts to replace, 
for example) -- as this is a production system, this is really eating the 
credibility of Linux... :-(

In the meantime, I have checked all the drives for bad blocks and turned off unmaskirq. The settings (all of them default) are now:

root: /home/pekkas$ /sbin/hdparm -v /dev/hda /dev/hdb /dev/hdc /dev/hdd

/dev/hda:
 multcount    = 16 (on)
 I/O support  =  1 (32-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 2498/255/63, sectors = 40132503, start = 0
 busstate     =  1 (on)

/dev/hdb:
 HDIO_GET_MULTCOUNT failed: Invalid argument
 I/O support  =  1 (32-bit)
 unmaskirq    =  0 (off)
 using_dma    =  0 (off)
 keepsettings =  0 (off)
 HDIO_GET_NOWERR failed: Invalid argument
 readonly     =  1 (on)
 readahead    =  8 (on)
 HDIO_GETGEO failed: Invalid argument
 busstate     =  1 (on)

/dev/hdc:
 multcount    = 16 (on)
 I/O support  =  1 (32-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 2498/255/63, sectors = 40132503, start = 0
 busstate     =  1 (on)

/dev/hdd:
 multcount    = 16 (on)
 I/O support  =  1 (32-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 1027/255/63, sectors = 16514064, start = 0
 busstate     =  1 (on)

Note: except for the unmaskirq, these are the default settings.

When doing 'dd if=/dev/hdd of=/dev/zero bs=10M' I get errors like:

hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdd: dma_intr: status=0x53 { DriveReady SeekComplete Index Error }
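
The register values in these messages can be decoded bit by bit. The sketch below maps the bits to the names printed in the log above; the bit table reflects the names as they appear in 2.4 IDE driver messages and is an assumption rather than a copy of the driver source. Bit 0x80 of the error register is labelled BadCRC as in the DMA messages here; the driver may label it differently in other contexts. The function names are introduced for illustration.

```shell
#!/bin/sh
# decode_bits VALUE MASK:Name ... prints the names of all set bits,
# in the order the MASK:Name pairs are given.
decode_bits() {
    value=$(( $1 ))
    shift
    names=""
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( value & $mask )) -ne 0 ]; then
            names="$names $name"
        fi
    done
    echo "${names# }"
}

# Status register bits (assumed mapping, see the note above).
decode_status() {
    decode_bits "$1" 0x80:Busy 0x40:DriveReady 0x20:DeviceFault \
        0x10:SeekComplete 0x08:DataRequest 0x04:CorrectedError \
        0x02:Index 0x01:Error
}

# Error register bits (assumed mapping, see the note above).
decode_error() {
    decode_bits "$1" 0x04:DriveStatusError 0x80:BadCRC \
        0x40:UncorrectableError 0x10:SectorIdNotFound \
        0x02:TrackZeroNotFound 0x01:AddrMarkNotFound
}

echo "status=0x51 { $(decode_status 0x51) }"
echo "status=0x53 { $(decode_status 0x53) }"
echo "error=0x84 { $(decode_error 0x84) }"
```

This reproduces the bracketed text of the log lines: 0x51 is DriveReady SeekComplete Error, 0x53 adds Index, and 0x84 is DriveStatusError BadCRC. A CRC error on a DMA transfer usually points at the cable or the transfer mode rather than at bad sectors, although that alone does not prove where the fault lies.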

These errors cease immediately if I turn off DMA on /dev/hdd (the backup drive).

So should I assume that some fault in the /dev/hdd drive is what is causing the kernel freezes?
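
Since the errors stop when DMA is off, one possible workaround is to switch DMA off on the backup drive before the nightly run. A minimal sketch, assuming hdparm lives in /sbin as in the prompt above; DRIVE and DRY_RUN are names introduced here for illustration, while -d0 and -d are standard hdparm options. With DRY_RUN=1, the default, the commands are only printed.

```shell
#!/bin/sh
DRIVE=${DRIVE:-/dev/hdd}   # the backup drive from the report
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

disable_dma() {
    run /sbin/hdparm -d0 "$1"   # -d0 switches using_dma off
    run /sbin/hdparm -d "$1"    # -d alone reports the current using_dma value
}

disable_dma "$DRIVE"
```

The setting is normally lost at reboot, so it would have to be reapplied from a boot script or at the start of the backup job. Transfers without DMA are slower and use more CPU, so this is a diagnostic step more than a fix.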
Comment 1 Brian Brunner 2003-05-17 20:19:03 EDT
My problem seems somewhat similar.  
Reproducible on two systems.
Single Pentium MMX Kontron/Jumptec ETX-MGX board on an in-house carrier card, 
single IBM Travelstar laptop drive. Kernel mods: IRQ0 is sharable, HZ=500.
Stress test: cd /usr/src/linux-2.4.18-3; while true; do make dep clean bzImage; done
Runs between 9 and 18 hours, then a no-symptom, no-message, dead-to-the-world 
hang.  ssh sessions hang up. System is headless (no vga console), pushbutton 
reset required.
Attempting work-around: migrate to the 2.4.20 kernel, change fstab ext3->ext2, 
and boot with nmi_watchdog=1.
Will happily add whatever other instrumentation might help; I just need 
instructions for dummies.
Same board runs full-speed memtest86 for hours, DOS-based ISA device tests for 
*days*.
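
Because the hang leaves no messages, it may help to have the stress loop record its own progress. The sketch below is a variant of the loop above that writes a timestamped line before and after every build and calls sync so the line reaches the disk before a hard reset. BUILD_CMD, LOG and MAX_RUNS are hypothetical names, not part of the original test.

```shell
#!/bin/sh
# Run from the kernel source directory, e.g. /usr/src/linux-2.4.18-3.
BUILD_CMD=${BUILD_CMD:-make dep clean bzImage}
LOG=${LOG:-/var/tmp/stress.log}
MAX_RUNS=${MAX_RUNS:-0}    # 0 = loop until the machine hangs or is stopped

stress_loop() {
    n=0
    while [ "$MAX_RUNS" -eq 0 ] || [ "$n" -lt "$MAX_RUNS" ]; do
        n=$((n + 1))
        echo "$(date '+%Y-%m-%d %H:%M:%S') run $n start" >> "$LOG"
        sync    # push the log line to disk before the build starts
        rc=0
        $BUILD_CMD > /dev/null 2>&1 || rc=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') run $n exit $rc" >> "$LOG"
    done
}

# Start the test with: stress_loop
```

After a reset, the last line of the log shows which run was in progress and when it started, which narrows down the time of the hang. Note that the extra sync calls add disk activity and could change the timing of the problem.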
Comment 2 Brian Brunner 2003-05-18 11:46:42 EDT
Follow-up:

The ETX-format CPU card (Kontron/Jumptec ETX-MGx) is on a carrier that has a 
debug port at 0x228, similar to the traditional port 0x80.

Downloaded 2.4.20 ("latest stable"). menuconfig'd, built, lilo, booted fine.

Kernel changes: HZ=500; irq0 sharable; mod to timer_interrupt: unsigned char 
intCount; ++intCount; outb_p(intCount,0x228);

Started the stress test, run as user root:
cd /usr/src/linux-2.4.20; while true; do make clean bzImage; done

and the kernel hung.

top(1) (run as user root) last said:
  9:56pm  up  1:33,  3 users,  load average: 1.20, 1.26, 1.20
31 processes: 28 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 95.4% user,  4.5% system,  0.0% nice,  0.0% idle
Mem:    60696K av,   58460K used,    2236K free,       0K shrd,    4916K buff
Swap:  152608K av,     116K used,  152492K free                   29480K cached

tail -f /var/log/messages last said (about 1 hour earlier, run as user root)
May 17 20:27:04 mcut16 ntpd[519]: kernel time discipline status change 41

debug port shows timer interrupt NOT incrementing ergo kernel not running at 
any level.


hdparm /dev/hda shows:

/dev/hda:
 multcount    = 16 (on)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 730/255/63, sectors = 11733120, start = 0
 busstate     =  1 (on)

kernel .config available upon request (776 lines).
dmesg available upon request (101 lines).

Will perform whatever tests might help find this problem's source and cure it.
Comment 3 Bugzilla owner 2004-09-30 11:39:43 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
