Bug 58984

Summary:	Deadlock on DMA with certain Seagate IDE drives
Product:	[Retired] Red Hat Linux
Reporter:	Kevin Range <range006>
Component:	kernel
Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED WONTFIX
QA Contact:	Brian Brock <bbrock>
Severity:	high
Priority:	high
Version:	7.2
Hardware:	i686
OS:	Linux
Doc Type:	Bug Fix
Last Closed:	2003-04-12 04:31:40 UTC

Description Kevin Range 2002-01-28 22:00:51 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
We have several Seagate IDE drives that deadlock the machine when used in DMA
mode.  We also have two other drives of the same model that work flawlessly.
The deadlocking drives have firmware revision 3.01.  The working drives have
firmware revision 2.13.  They are all model ST380021A.  The deadlock occurs with
both the SMP and UP versions of the 2.4.9-13 and 2.4.7-10 kernels.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install a Seagate ST380021A2 IDE drive, firmware version 3.01, in a machine.
2. Hammer on the drive by copying a couple of gigabytes of data onto it.  It also
helps to generate a lot of interrupts, for example by scp'ing a whole bunch of
stuff at the same time.
3. Wait a few minutes and bammo, deadlock.
	

Actual Results:  Deadlock

Expected Results:  Normal operation.

Additional info:

The machine in question:

Dual PII 400 MHz
512 MB RAM
Asus P2B-DS motherboard
2 SCSI disks
1 IDE disk (the Seagate)
SMC Etherpower II
Matrox G200
IDE CD-ROM


Workaround for this bug: disable DMA, but then disk I/O is horribly slow.

Comment 1 Arjan van de Ven 2002-01-28 22:33:12 UTC

Can you try hdparm -X34 /dev/hdXX ?
That's one step lower in DMA, but still with DMA...
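
For readers unfamiliar with the numeric argument, the sketch below decodes an hdparm -X value using the encoding described in the hdparm man page (8 + n for PIO mode n, 32 + n for multiword DMA mode n, 64 + n for UltraDMA mode n). The function name is made up for illustration, and 66 is shown only for comparison; the thread does not say which mode the failing drives had negotiated.

```python
def describe_xfer_mode(value: int) -> str:
    """Describe an hdparm -X transfer-mode value (hypothetical helper)."""
    if 64 <= value <= 71:
        return f"UltraDMA mode {value - 64}"
    if 32 <= value <= 39:
        return f"multiword DMA mode {value - 32}"
    if 8 <= value <= 15:
        return f"PIO mode {value - 8}"
    # Other values (e.g. 0 and 1) have special meanings not covered here.
    return f"other/special value {value}"


print(describe_xfer_mode(34))  # the value suggested in this comment
print(describe_xfer_mode(66))  # comparison only, not taken from the thread
```

So -X34 asks the drive for multiword DMA mode 2, which is still a DMA mode but not an UltraDMA one.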

Comment 2 Kevin Range 2002-02-01 21:33:26 UTC

-X34 does improve things. No more deadlocks.  However, operation is still not
problem free.  /var/log/messages says:

Feb  1 12:57:44 muscat kernel: hda: timeout waiting for DMA
Feb  1 12:57:44 muscat kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Feb  1 12:57:44 muscat kernel: blk: queue c03bab80, I/O limit 4095Mb (mask 0xffffffff)
Feb  1 12:57:49 muscat kernel: hda: status timeout: status=0xd0 { Busy }
Feb  1 12:57:49 muscat kernel: hda: drive not ready for command
Feb  1 12:57:49 muscat kernel: ide0: reset: success
Feb  1 12:57:49 muscat kernel: blk: queue c03bab80, I/O limit 4095Mb (mask 0xffffffff)

when we run our "torture test" above.

Andre Hedrick says that this is an "Intel PIIX4 erratium, NO FIX", but then why
do the machines with the 2.13 firmware work fine in UltraDMA mode?
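
As an aside, the status=0xd0 and status=0x80 values in logs like the one above can be decoded against the standard ATA status register layout. This is a minimal sketch, not the kernel's own code; the table and function names are made up for illustration, and the bit names follow the ATA specification (bit 5 is called DF in later revisions and DWF in older ones).

```python
# ATA status register bits, most significant first.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault (write fault in older specs)
    (0x10, "DSC"),   # seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data
    (0x02, "IDX"),   # index
    (0x01, "ERR"),   # error
]


def decode_status(status: int) -> list[str]:
    """Return the names of all bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_status(0xD0))
print(decode_status(0x80))
```

Per the ATA specification the other bits are not meaningful while BSY is set, which is presumably why the kernel message shows only { Busy } for both values.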

Comment 3 Need Real Name 2002-02-20 15:33:47 UTC

Similar problem with a Seagate ST380021A2 IDE drive, firmware version 3.01,
on an Asus CUV4x-DLS: 1 GB memory, 2x 1 GHz PIII, no sound card, G400 video card.
The IDE drive is primary master (used as a scratch directory); there are no other
IDE disks in the system. There is one 36 GB SCSI disk (used for everything else)
and one SCSI DVD-RAM drive.

hdparm -d0 /dev/hda doesn't fix the problem.
When copying files to the IDE drive, or essentially using the IDE drive at all,
the system momentarily freezes, followed by a clicking noise from the IDE drive.
The system then becomes usable for the next few minutes.

Looking in /var/log/messages shows:

Feb 19 08:03:36 gewurztraminer kernel: hda: DMA disabled
Feb 19 08:15:00 gewurztraminer kernel: hda: status timeout: status=0xd0 { Busy }
Feb 19 08:15:00 gewurztraminer kernel: hda: no DRQ after issuing WRITE
Feb 19 08:15:30 gewurztraminer kernel: ide0: reset timed-out, status=0x80
Feb 19 08:15:35 gewurztraminer kernel: hda: status timeout: status=0x80 { Busy }
Feb 19 08:15:35 gewurztraminer kernel: hda: drive not ready for command
Feb 19 08:15:35 gewurztraminer kernel: ide0: reset: success
Feb 19 08:32:55 gewurztraminer kernel: hda: status timeout: status=0xd0 { Busy }
Feb 19 08:32:55 gewurztraminer kernel: hda: no DRQ after issuing WRITE
Feb 19 08:33:25 gewurztraminer kernel: ide0: reset timed-out, status=0x80
Feb 19 08:33:30 gewurztraminer kernel: hda: status timeout: status=0x80 { Busy }
Feb 19 08:33:30 gewurztraminer kernel: hda: drive not ready for command
Feb 19 08:33:30 gewurztraminer kernel: ide0: reset: success
Feb 19 08:39:20 gewurztraminer kernel: hda: status timeout: status=0xd0 { Busy }
Feb 19 08:39:20 gewurztraminer kernel: hda: no DRQ after issuing WRITE
Feb 19 08:39:50 gewurztraminer kernel: ide0: reset timed-out, status=0x80
Feb 19 08:39:55 gewurztraminer kernel:
Feb 19 08:39:55 gewurztraminer kernel: wait_on_irq, CPU 0:
Feb 19 08:39:55 gewurztraminer kernel: irq:  0 [ 0 0 ]
Feb 19 08:39:55 gewurztraminer kernel: bh:   1 [ 0 1 ]
Feb 19 08:39:55 gewurztraminer kernel: Stack dumps:
Feb 19 08:39:55 gewurztraminer kernel: CPU 1:4004dd52 4004dd62 4004dd72 4004dd82 4004dd92 400b05f0 4004ddb2 4004ddc2
Feb 19 08:39:55 gewurztraminer kernel:        4004ddd2 4004dde2 4009a140 4004de02 4004de12 4004de22 400b3040 4004de42
Feb 19 08:39:55 gewurztraminer kernel:        40113700 400b0940 4004de72 4004de82 4004de92 4004dea2 4004deb2 4004dec2
Feb 19 08:39:55 gewurztraminer kernel: Call Trace:
Feb 19 08:39:55 gewurztraminer kernel:
Feb 19 08:39:55 gewurztraminer kernel: CPU 0:f7ff3f28 c024aed0 00000000 00000000 ffffffff 00000000 c0108842 c024aee5
Feb 19 08:39:55 gewurztraminer kernel:        00000000 db088000 00000001 c0174b78 db088168 c02deee4 f7ff3f74 f7ff2658
Feb 19 08:39:55 gewurztraminer kernel:        f7ff2000 c011fb4d db088000 db088130 c02deee4 f7ff2000 00000000 c0128095
Feb 19 08:39:56 gewurztraminer kernel: Call Trace: [call_spurious_interrupt+119259/153515] .rodata.str1.1 [kernel] 0x74b
Feb 19 08:39:56 gewurztraminer kernel: Call Trace: [<c024aed0>] .rodata.str1.1 [kernel] 0x74b
Feb 19 08:39:56 gewurztraminer kernel: [__global_cli+226/368] __global_cli [kernel] 0xe2
Feb 19 08:39:56 gewurztraminer kernel: [<c0108842>] __global_cli [kernel] 0xe2
Feb 19 08:39:56 gewurztraminer kernel: [call_spurious_interrupt+119280/153515] .rodata.str1.1 [kernel] 0x760
Feb 19 08:39:56 gewurztraminer kernel: [<c024aee5>] .rodata.str1.1 [kernel] 0x760
Feb 19 08:39:56 gewurztraminer kernel: [flush_to_ldisc+216/288] flush_to_ldisc [kernel] 0xd8
Feb 19 08:39:56 gewurztraminer kernel: [<c0174b78>] flush_to_ldisc [kernel] 0xd8
Feb 19 08:39:56 gewurztraminer kernel: [__run_task_queue+93/112] __run_task_queue [kernel] 0x5d
Feb 19 08:39:56 gewurztraminer kernel: [<c011fb4d>] __run_task_queue [kernel] 0x5d
Feb 19 08:39:56 gewurztraminer kernel: [context_thread+325/512] context_thread [kernel] 0x145
Feb 19 08:39:56 gewurztraminer kernel: [<c0128095>] context_thread [kernel] 0x145
Feb 19 08:39:56 gewurztraminer kernel: [context_thread+0/512] context_thread [kernel] 0x0
Feb 19 08:39:56 gewurztraminer kernel: [<c0127f50>] context_thread [kernel] 0x0
Feb 19 08:39:56 gewurztraminer kernel: [_stext+0/96] stext [kernel] 0x0
Feb 19 08:39:56 gewurztraminer kernel: [<c0105000>] stext [kernel] 0x0
Feb 19 08:39:56 gewurztraminer kernel: [kernel_thread+38/48] kernel_thread [kernel] 0x26
Feb 19 08:39:56 gewurztraminer kernel: [<c0105866>] kernel_thread [kernel] 0x26
Feb 19 08:39:56 gewurztraminer kernel: [context_thread+0/512] context_thread [kernel] 0x0
Feb 19 08:39:56 gewurztraminer kernel: [<c0127f50>] context_thread [kernel] 0x0
Feb 19 08:39:56 gewurztraminer kernel:
Feb 19 08:39:56 gewurztraminer kernel:
Feb 19 08:39:56 gewurztraminer kernel: hda: status timeout: status=0x80 { Busy }
Feb 19 08:39:56 gewurztraminer kernel: hda: drive not ready for command
Feb 19 08:39:56 gewurztraminer kernel: ide0: reset: success

Comment 4 Kevin Range 2002-02-20 17:36:29 UTC

New information.  Running with no DMA is not just "not problem free".  We are
experiencing data loss.  Apparently data being written during the "ide reset
dance" is lost or corrupted.  Even stranger is the regularity of the resets,
approx. every four minutes (while scp'ing files to it):

Feb 20 10:56:36 tokay kernel: hda: status timeout: status=0x80 { Busy }
Feb 20 10:56:36 tokay kernel: hda: drive not ready for command
Feb 20 10:56:36 tokay kernel: ide0: reset: success
Feb 20 11:00:06 tokay kernel: hda: status timeout: status=0xd0 { Busy }
Feb 20 11:00:06 tokay kernel: hda: no DRQ after issuing WRITE
Feb 20 11:00:36 tokay kernel: ide0: reset timed-out, status=0x80
Feb 20 11:00:41 tokay kernel: hda: status timeout: status=0x80 { Busy }
Feb 20 11:00:41 tokay kernel: hda: drive not ready for command
Feb 20 11:00:41 tokay kernel: ide0: reset: success
Feb 20 11:04:21 tokay kernel: hda: status timeout: status=0xd0 { Busy }
Feb 20 11:04:21 tokay kernel: hda: no DRQ after issuing WRITE
Feb 20 11:04:51 tokay kernel: ide0: reset timed-out, status=0x80
Feb 20 11:04:56 tokay kernel: hda: status timeout: status=0x80 { Busy }
Feb 20 11:04:56 tokay kernel: hda: drive not ready for command
Feb 20 11:04:56 tokay kernel: ide0: reset: success
Feb 20 11:06:43 tokay sshd(pam_unix)[7335]: session opened for user root by (uid=0)
Feb 20 11:08:36 tokay kernel: hda: status timeout: status=0xd0 { Busy }
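
The "approx. every four minutes" spacing can be checked from the timestamps of the three "reset: success" lines in the log above. This is a minimal sketch: the function name is made up for illustration, and because syslog omits the year, 2002 is assumed from the date of this comment.

```python
import re
from datetime import datetime

# The three "reset: success" lines copied from the log excerpt.
LOG = """\
Feb 20 10:56:36 tokay kernel: ide0: reset: success
Feb 20 11:00:41 tokay kernel: ide0: reset: success
Feb 20 11:04:56 tokay kernel: ide0: reset: success
"""

STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d) .*reset: success$")


def reset_intervals(log_text: str, year: int = 2002) -> list[int]:
    """Seconds between consecutive successful resets (year is an assumption)."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(reset_intervals(LOG))  # [245, 255]
```

The two gaps are 245 and 255 seconds, that is, a little over four minutes each.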

I have four machines here, all with the specs I gave in my first report.  Two are
running Model=ST380021A, FwRev=2.13 with no problems.  Two are running
Model=ST380021A, FwRev=3.01 and having data loss.   All proposed solutions
(hdparm whatever) have failed.  All proposed explanations have been shown to be
false ("Intel PIIX4 erratium, NO FIX" makes no sense if two machines work, two
don't, and someone else with a non-Intel chipset is having similar problems).
Seagate says the drives are all good, so this is looking like a Linux kernel
problem to me.

Any ideas on how to get these drives working?

Comment 5 Kevin Range 2003-04-12 04:31:40 UTC

I got rid of all of these drives, so I can't test this bug anymore.