Bug 66143 - System hang after 5-12 h IO stress - flushtlb problem?
Summary: System hang after 5-12 h IO stress - flushtlb problem?
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2002-06-05 12:20 UTC by Martin Wilck
Modified: 2007-04-18 16:42 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2002-06-11 16:03:28 UTC
Embargoed:


Attachments
One-line patch for ServerWorks CSB5 IDE DMA (413 bytes, patch)
2002-06-06 11:25 UTC, Martin Wilck
4-line patch that fixes DMA address calculation >4GB (important!!) (442 bytes, patch)
2002-06-06 11:27 UTC, Martin Wilck


Links
System: Red Hat Product Errata
ID: RHBA-2002:110
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel with bugfixes available
Last Updated: 2002-06-10 04:00:00 UTC

Description Martin Wilck 2002-06-05 12:20:42 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.0 (X11; Linux i686; U;) Gecko/20020408

Description of problem:
The problem occurs with all recent Red Hat kernels. The symptoms seem to become less
severe with increasing kernel version (2.4.7: complete freeze; 2.4.9-31:
applications freeze or die; 2.4.18-4: same symptoms, but the system runs longer
before it happens).


Version-Release number of selected component (if applicable):
2.4.7-10, 2.4.9-31, 2.4.18-4

How reproducible:
Always

Steps to Reproduce:
1. Start a high-load, IO-bound stress test.

Actual Results:  
(2.4.7) System appears "dead", with the exception of ping.
(2.4.9/2.4.18) Applications freeze or die. Alt-Sysrq still works.


Expected Results:  Test completes successfully.

Additional info:

We have applied Sunil Saxena's flushtlb patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102208353523931&w=2

and the problem appears to be gone.
Although that doesn't prove the flushtlb problem is actually causing our
problem, we strongly recommend that this patch be integrated
into the next Red Hat update kernels.

Comment 1 Arjan van de Ven 2002-06-05 12:40:16 UTC
That patch only affects SMP Pentium IV systems. Is the system in question also a
pentium IV one ?

Comment 2 Need Real Name 2002-06-06 08:32:15 UTC
Yes, it is a system with the GC LE chipset and dual Prestonia CPUs.

Comment 3 Need Real Name 2002-06-06 08:33:21 UTC
Yes, it is a system with the GC LE chipset and dual Prestonia CPUs.

Comment 4 Arjan van de Ven 2002-06-06 08:35:40 UTC
In that case, the 2.4.9-34 kernel released yesterday should fix this for 7.1/7.2;
for 7.3 a fix is in the works.

Comment 5 Martin Wilck 2002-06-06 11:23:40 UTC
So you think that this is related to the PGE handling you mention 
in the advisory?

In any case, I have looked at the new kernel source: it is missing two small
patches that were included in the 7.3 2.4.18 series. Both are very important for
our newer machines. I'll attach them here, although they are not directly
related to the problem itself.

Arjan, please have a look at them.

Concerning the original problem - we'll test the new kernel and see what happens.

Martin


Comment 6 Martin Wilck 2002-06-06 11:25:14 UTC
Created attachment 59844 [details]
One-line patch for ServerWorks CSB5 IDE DMA

Comment 7 Martin Wilck 2002-06-06 11:27:05 UTC
Created attachment 59845 [details]
4-line patch that fixes DMA address calculation >4GB (important!!)

Comment 8 Arjan van de Ven 2002-06-06 12:06:46 UTC
Yes, the PGE fix is the fix for the problem Sunil found.

As for the patches: I've added the 4GB one to the tree in case we ever do a
2.4.9 erratum again.

Comment 9 Martin Wilck 2002-06-06 12:13:47 UTC
I hope you do because that is really a nasty one if you have >4GB machines (we
had an Adaptec SCSI controller happily DMA'ing to and from the kernel core
memory). Meanwhile, we'll advise our >4GB customers to upgrade to the 7.3 kernel.
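
For readers unfamiliar with this class of bug, the following is a minimal sketch of how an address above 4GB can end up pointing into low memory. It is not the attached patch, whose contents are not reproduced here. It only assumes that somewhere in the DMA path a 64-bit bus address passes through a 32-bit field; the function names and the example address are hypothetical.

```python
MASK_32 = 0xFFFFFFFF  # a 32-bit register or variable keeps only these bits


def truncate_to_32_bits(bus_address):
    """Return what is left of an address after it passes through a 32-bit field."""
    return bus_address & MASK_32


def split_for_64_bit_dma(bus_address):
    """Split an address into (high, low) 32-bit halves so no bits are lost."""
    return (bus_address >> 32) & MASK_32, bus_address & MASK_32


# Hypothetical buffer located at 4 GiB + 1 MiB.
buffer_above_4gb = 0x100100000

# Truncation silently drops the upper bits: the device would be told to
# transfer to or from 1 MiB instead of 4 GiB + 1 MiB.
print(hex(truncate_to_32_bits(buffer_above_4gb)))  # 0x100000

# Keeping both halves preserves the full address.
high, low = split_for_64_bit_dma(buffer_above_4gb)
print(hex(high), hex(low))
```

The truncated address lands in low physical memory, which is consistent with the controller touching kernel memory as described above, although the actual code path fixed by the patch may differ.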

The CSB5 patch may seem ridiculous. However, IDE load may cause such a heavy
interrupt load that timer and local APIC interrupts don't get through, causing
the LOC interrupt counts to differ heavily, and (if the same CPU serves timer
and IDE IRQs) causing system time to go awry.
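
One way to watch for this symptom is to compare the per-CPU LOC counters in /proc/interrupts. The sketch below assumes that "differ heavily" refers to the counts differing between CPUs, and that the LOC row has the usual layout of a label followed by one number per CPU. The helper names and the sample text are made up for illustration and are not measurements from the affected machine.

```python
def loc_counts(interrupts_text):
    """Return the per-CPU local APIC timer counts from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "LOC:":
            # Keep only the numeric columns; some kernels append a description.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def loc_spread(counts):
    """Relative gap between the highest and lowest count; 0.0 means identical."""
    if not counts or max(counts) == 0:
        return 0.0
    return (max(counts) - min(counts)) / max(counts)


# Invented sample in the /proc/interrupts layout, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:     500000     498000    IO-APIC-edge  timer
 15:     900000          0    IO-APIC-edge  ide1
LOC:    1000000     750000
"""

counts = loc_counts(SAMPLE)
print(counts, loc_spread(counts))
```

On a live system the same functions can be fed the contents of /proc/interrupts. What spread counts as "heavy" is a judgment call; on a healthy SMP system the LOC counts normally stay close to each other.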



Comment 10 Martin Wilck 2002-06-11 16:03:22 UTC
I am taking back what I said about the CSB5 patch. It should *not* be applied,
and probably even reverted in 2.4.18, until bug 66054 is resolved.

Comment 11 Arjan van de Ven 2002-06-20 13:45:54 UTC
The P4 bug is fixed in the 2.4.18-5 kernel.
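
To summarize which kernels in this report carry the fix, here is a small sketch that checks a kernel release string against the two fixed releases named in the comments (2.4.9-34 for 7.1/7.2 and 2.4.18-5 for 7.3). The function name is hypothetical. The sketch assumes that later releases in the same series keep the fix and that a suffix such as "smp" may follow the release number. It returns None for any series the report does not cover.

```python
import re

# Fixed releases named in this report, keyed by kernel base version.
FIXED_RELEASE = {"2.4.9": 34, "2.4.18": 5}


def has_p4_fix(kernel_release):
    """Return True or False for a covered series, None when the report is silent."""
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", kernel_release)
    if match is None:
        return None
    base, release = match.group(1), int(match.group(2))
    if base not in FIXED_RELEASE:
        return None
    # Assumption: releases at or after the fixed one still contain the fix.
    return release >= FIXED_RELEASE[base]


for candidate in ("2.4.7-10", "2.4.9-31", "2.4.9-34", "2.4.18-4", "2.4.18-5smp"):
    print(candidate, has_p4_fix(candidate))
```

Only SMP Pentium 4 systems are affected according to Comment 1, so a False result matters only on such hardware.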

