178983 – hard drive dma timeout after moderate load

Summary: hard drive dma timeout after moderate load

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	4
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-01-25 23:19 UTC by Jack Levin
Modified:	2015-01-04 22:24 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-05-05 12:59:47 UTC
Type:	---
Embargoed:
Dependent Products:

Description Jack Levin 2006-01-25 23:19:07 UTC

Description of problem:

I have 45 machines, each with two WD 320 GB drives.  As soon as any of the
machines comes under moderate load (such as accessing a large number of files
while acting as a web server at 90 queries per second, where each query is a
file download request), either the hda or the hdc drive gets its DMA turned off
and the machine becomes very slow.  I have tried various kernels, tried turning
off ACPI in the BIOS, and passed acpi=off and noacpi to the kernel (in
grub.conf); none of it makes any difference.

Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1824_FC4.i686.rpm

How reproducible:
very


Steps to Reproduce:
Run the machine at 15 Mbps with 80 to 90 web queries per second.
  
Actual results:
Jan 25 13:17:21 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
Jan 25 13:17:31 localhost kernel: hda: DMA timeout error
Jan 25 13:17:31 localhost kernel: hda: dma timeout error: status=0x51 { DriveReady SeekComplete Error }
Jan 25 13:17:31 localhost kernel: hda: dma timeout error: error=0x04 { DriveStatusError }
Jan 25 13:17:31 localhost kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 25 13:17:31 localhost kernel: hda: multwrite_intr: error=0x04 { DriveStatusError }
Jan 25 13:17:31 localhost kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 25 13:17:31 localhost kernel: hda: multwrite_intr: error=0x04 { DriveStatusError }
Jan 25 13:19:02 localhost kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 25 13:19:03 localhost kernel: hda: multwrite_intr: error=0x04 { DriveStatusError }
Jan 25 13:19:03 localhost kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 25 13:19:03 localhost kernel: hda: multwrite_intr: error=0x04 { DriveStatusError }
Jan 25 13:19:04 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
Jan 25 13:19:04 localhost kernel: hda: DMA timeout error
Jan 25 13:19:04 localhost kernel: hda: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Jan 25 13:19:04 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
Jan 25 13:19:05 localhost kernel: hda: DMA timeout error
Jan 25 13:19:05 localhost kernel: hda: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Jan 25 13:19:06 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
Jan 25 13:19:06 localhost kernel: hda: DMA timeout error
Jan 25 13:19:06 localhost kernel: hda: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
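
For anyone triaging the log above, the status and error values can be decoded
bit by bit. The following is only a sketch: decode() is a hypothetical helper,
not part of the kernel or of any tool. The names DriveReady, SeekComplete,
DataRequest, Error and DriveStatusError are taken from the log itself; the
remaining status bit names are assumed from the old IDE driver's conventions
and should be checked against the driver source. In the ATA error register,
bit 0x04 is the "command aborted" bit, which the log prints as
DriveStatusError.

```python
# Bit names for the ATA status register. Names not seen in the log above
# (Busy, DeviceFault, CorrectedError, Index) are assumptions.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Only the error bit that actually appears in the log is listed here.
ERROR_BITS = {
    0x04: "DriveStatusError",  # ATA "command aborted" bit
}


def decode(value, names):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]


if __name__ == "__main__":
    for label, value, names in (
        ("status", 0x51, STATUS_BITS),
        ("status", 0x58, STATUS_BITS),
        ("error", 0x04, ERROR_BITS),
    ):
        print(f"{label}=0x{value:02x}", "{", " ".join(decode(value, names)), "}")
```

Decoded this way, 0x51 differs from 0x58 in that the former has the Error bit
set, while the latter has DataRequest set and no Error bit.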

 multcount    =  0 (off)
 IO_support   =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  0 (off)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 38913/255/63, sectors = 320072933376, start = 0
img43:root ~ $ hdparm /dev/hdc

/dev/hdc:
 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 38913/255/63, sectors = 320072933376, start = 0

img43:root ~ $ hdparm -t /dev/hda

/dev/hda:
 Timing buffered disk reads:    6 MB in  3.18 seconds =   1.89 MB/sec

/dev/hdc:
 Timing buffered disk reads:   20 MB in  3.01 seconds =   6.65 MB/sec
img43:root ~ $
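
With 45 machines, checking each drive by hand is tedious. A minimal sketch of
how the hdparm settings output above could be scanned for drives whose DMA has
been switched off follows. parse_hdparm() and dma_disabled() are hypothetical
helper names, and the sketch assumes the output was captured together with its
"/dev/...:" header lines.

```python
import re


def parse_hdparm(text):
    """Map each '/dev/...:' header in hdparm output to {setting: int}."""
    drives = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(/dev/\w+):", line)
        if header:
            current = drives.setdefault(header.group(1), {})
            continue
        # Setting lines look like "using_dma    =  0 (off)".
        setting = re.match(r"(\w+)\s*=\s*(\d+)", line)
        if setting and current is not None:
            current[setting.group(1)] = int(setting.group(2))
    return drives


def dma_disabled(drives):
    """Return the devices whose using_dma setting is 0, sorted by name."""
    return sorted(dev for dev, s in drives.items() if s.get("using_dma") == 0)


SAMPLE = """
/dev/hda:
 multcount    =  0 (off)
 using_dma    =  0 (off)

/dev/hdc:
 multcount    = 16 (on)
 using_dma    =  1 (on)
"""

if __name__ == "__main__":
    print(dma_disabled(parse_hdparm(SAMPLE)))
```

Run against the sample, which mirrors the two drives shown above, this flags
only /dev/hda.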

Expected results:
No DMA timeouts

Additional info:
A reboot returns the machine to normal, but only for a few hours, depending on
the load.  Here is some more data:

img43:root ~ $ hdparm -i /dev/hda

/dev/hda:

 Model=WDC WD3200JB-22KFA0, FwRev=08.05J08, SerialNo=WD-WCAMR1936348
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=65
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=65535/1/63, CurSects=4128705, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 *udma4 udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: device does not report version:

 * signifies the current active mode


img43:root ~ $ lspci
00:00.0 Host bridge: VIA Technologies, Inc. P4M266 Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8633 [Apollo Pro266 AGP]
00:0f.0 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 60)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 78)
01:00.0 VGA compatible controller: S3 Inc. VT8375 [ProSavage8 KM266/KL266]

Jan 13 05:56:45 localhost kernel:     ide0: BM-DMA at 0xd000-0xd007, BIOS settings: hda:DMA, hdb:pio
Jan 13 05:56:45 localhost kernel: hda: WDC WD3200JB-22KFA0, ATA DISK drive
Jan 13 05:56:45 localhost kernel: hda: max request size: 1024KiB
Jan 13 05:56:45 localhost kernel: hda: 625142448 sectors (320072 MB) w/8192KiB Cache, CHS=38913/255/63, UDMA(100)
Jan 13 05:56:45 localhost kernel:  hda: hda1 hda2
Jan 13 05:56:46 localhost kernel: EXT3 FS on hda1, internal journal
Jan 14 01:23:20 localhost kernel:     ide0: BM-DMA at 0xd000-0xd007, BIOS settings: hda:DMA, hdb:pio
Jan 14 01:23:20 localhost kernel: hda: WDC WD3200JB-22KFA0, ATA DISK drive
Jan 14 01:23:20 localhost kernel: hda: max request size: 1024KiB
Jan 14 01:23:20 localhost kernel: hda: 625142448 sectors (320072 MB) w/8192KiB Cache, CHS=38913/255/63, UDMA(100)
Jan 14 01:23:20 localhost kernel:  hda: hda1 hda2
Jan 14 01:23:20 localhost kernel: EXT3-fs: hda1: orphan cleanup on readonly fs
Jan 14 01:23:20 localhost kernel: EXT3-fs: hda1: 40 orphan inodes deleted
Jan 14 01:23:21 localhost kernel: EXT3 FS on hda1, internal journal


Some machines report 
EXT3-fs error (device hda1) in start_transaction: Journal has aborted
EXT3-fs error (device hda1) in start_transaction: Journal has aborted
EXT3-fs error (device hda1) in start_transaction: Journal has aborted 

Shortly afterwards, these machines lock up and require a manual fsck to bring
them back into operation (we end up losing some data in the process).

img43:root ~ $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1    188   2468 106592 247044    0    0    28     3   13    30  7 14 34 45
 0  1    188   1540 106744 247644    0    0   724  1308 4698   147  2 47  4 47
 0  1    188   3328 105540 246792    0    0  1344     0 4621   210  3 43 13 42
 2  3    188   1928 105600 248088    0    0  1348     0 5122   181  3 53 20 24
 1  3    188  11440 104208 239900    0    0  1800     0 5685    98  1 68  0 31
 1  2    188   7860 104276 242160    0    0  2324     0 5900    87  2 81  0 17
 0  1    188   8824 104452 242780    0    0   752  1148 4649   129  3 56  0 41
 0  1    188   7484 104696 243568    0    0  1020     0 4265   189  3 36  0 61
 0  1    188   6220 104976 244384    0    0  1076     0 3360   226  6 23  0 71
 0  1    188   4812 105248 245788    0    0  1672    44 4165   184  2 39  0 59
 0  0    188   2764 105480 246920    0    0  1344     0 3798   255  5 24  5 66
 0  0    188   2316 105552 247408    0    0   516  1632 5331   205  2 53 27 18
 0  1    188   2064 104996 248196    0    0  1116     0 3697   201  2 25  7 66
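
To put a number on the I/O wait visible in the vmstat samples above, a column
can be averaged with a short script. This is a sketch only: mean_column() is a
hypothetical helper, and it assumes the standard vmstat layout of one header
row of column names followed by all-numeric data rows. Note that the first
data row printed by vmstat reports averages since boot rather than a
one-second sample, so it may be worth dropping before averaging.

```python
def mean_column(vmstat_text, column):
    """Average one named column over the numeric rows of vmstat output."""
    header = None
    values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if column in fields:
            # The row of column names (r, b, swpd, ..., wa).
            header = fields
        elif (header and len(fields) == len(header)
              and all(f.isdigit() for f in fields)):
            values.append(int(fields[header.index(column)]))
    if not values:
        raise ValueError(f"no data rows found for column {column!r}")
    return sum(values) / len(values)


SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1    188   2468 106592 247044    0    0    28     3   13    30  7 14 34 45
 0  1    188   1540 106744 247644    0    0   724  1308 4698   147  2 47  4 47
"""

if __name__ == "__main__":
    print("mean wa:", mean_column(SAMPLE, "wa"))
    print("mean sy:", mean_column(SAMPLE, "sy"))
```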

Comment 1 Dave Jones 2006-02-03 06:41:01 UTC

This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.

Comment 2 John Thacker 2006-05-05 12:59:47 UTC

Closing due to lack of response.
