69552 – lots of disk writes lock out interactive processes

Summary: lots of disk writes lock out interactive processes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-07-23 01:59 UTC by Tom Karzes
Modified:	2007-04-18 16:44 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-06-07 19:13:54 UTC
Embargoed:

Attachments
example program dw.c (720 bytes, text/plain) 2002-07-23 02:13 UTC, Tom Karzes
here's a log of my "hdparm" attempts (5.51 KB, text/plain) 2002-07-23 09:19 UTC, Tom Karzes

Description Tom Karzes 2002-07-23 01:59:43 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
When doing lots of disk writes (e.g., writing a
large file), keyboard and mouse events tend to
be ignored for long periods of time (many seconds,
or even minutes).  There is an existing tech support
ticket number for this, ticket #209989.  I was advised
by the tech support staff to open a bugzilla report
for the problem.  After experimenting with the problem
a fair amount, I found that the best way to reproduce it
was to run a program I wrote called "dw.c", which is
attached to one of the entries in the ticket # mentioned
above.  All this program does is a bunch of "write" system
calls.  If I run it with a buffer size of 8192, and a
repeat count of 40000 (i.e., it calls write with a size
of 8192, 40000 times in a row, for a total of 327680000
bytes of output), it absolutely locks up my machine, for
minutes on end.  Even at boot level 3, if I kick this
off in background, then type some simple command like
"date", it will be a minute or so before the word "date"
even appears on my screen.

The tech support folks noticed that my disks were running
slower than they should, and suggested some hdparm commands
to speed them up, but this doesn't really seem to address
the problem, which I suspect is a kernel bug.  My hardware
configuation is also given in the problem ticket mentioned
above, but basically it's a 2.26GHz P4 processor, Intel
motherboard using the 845e chipset, 1GB DDR266 ecc unbuffered
memory on two 512MB memory cards, two 80GB maxtor IDE hard drives,
and my video card is an ABIT Siluro using the NVIDIA GeForce 4
Ti4600 chipset (for which I had to download and install the
NVIDIA video drivers).  I'm running Red Hat Professional, version
7.3, and I clicked "everything" when I did the install.  I then
downloaded, built, and installed the NVIDIA video driver, which
worked perfectly.  When I noticed the performance problem, I
used the RedHat update agent to upgrade to the latest kernel
patch level (-5), and I also downloaded the latest glibc
packages.  I then rebooted, rebuilt, and reinstalled the
NVIDIA video driver.  Everything works fine, except for
this performance problem, which I'd really like to track
down and fix.  The problem is completely reproducible and
is extremely annoying -- I should be able to do large file
copies etc. in background without being aware of them (except
of course disk i/o will slow down), but instead I am seeing
my interactive response (mouse & keyboard events, and probably
other things as well) getting almost completely locked out.

Please contact me if I can provide any more needed information
on this problem.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run a program that does lots of back-to-back disk writes,
   preferably direct, unbuffered writes.  This can be done
   at boot level 3 or 5.
2. Try to do anything else interactively, e.g. type or move
   the mouse -- there are enormous delays in the response.
   These delays should not be occurring.
	

Actual Results:  Response slowed to a virtual standstill.


Expected Results:  Response should be minimally affected.  In the case of
typing or moving the mouse, no noticeable slowdown should
be seen.


Additional info:

I'm listing the priority as "high" because I consider this
to be a reliability problem:  I am concerned that trying to
do things like write a CD ROM etc. will fail due to the lack
of reasonable response time, plus it is delaying my ability
to bring up the rest of the functionality in my system (e.g.,
CD ROM burner, a firewall box that I bought, etc.) -- much
of this equipment is on 30-day warranty, and I would really
like to get a patch for this problem soon so that I can not
only use my system normally but also move on to bringing up
the rest of the functionality and making sure everything's
ok.  The technical support team has suggested some commands
that will speed up my hard drives, but as mentioned, I don't
believe this really addresses the problem I've been seeing.
In fact, if anything I'd expect running the hard drives faster
to make the problem worse, not better.

Comment 1 Tom Karzes 2002-07-23 02:13:19 UTC

Created attachment 66424
example program dw.c

Comment 2 Tom Karzes 2002-07-23 02:17:46 UTC

I've attached the source of the program I mentioned
which illustrates this problem very strongly.  It
takes two arguments, a block size and a count.
It then creates a buffer with the specified
block size, fills it with 0xff bytes, and
writes it to standard output "count" times,
using direct, unbuffered "write" system calls.
The total output size is block_size*count.

To compile, simply:

    % gcc dw.c -o dw

Then try running it with a buffer size of 8192 and
a count of 40000 (for a total of 327680000 bytes).
Run it in background, at boot level 3.  When I do
this, it completely locks up my machine:

    % dw 8192 40000 > dw.out &

This should make it much easier to track down this
problem.

Comment 3 Arjan van de Ven 2002-07-23 07:53:35 UTC

question 1: if this is IDE, is DMA enabled?
question 2: does changing the "elvtune" parameters to lower values (see man
elvtune for syntax) improve things?

Comment 4 Tom Karzes 2002-07-23 09:17:27 UTC

In answer to your questions:

    1.  I have two 80GB IDE drives.  I had assumed
        that DMA was enabled, but now it appears that
        it is not.  I tried executing the "hdparm"
        commands I was instructed to use, and there
        was a problem enabling DMA.  I will attach
        the log and additional comments in the next entry.

    2.  I have not used "elvtune" before.  I'm willing
        to give it a try, but I have no idea what values
        to use.  If I just run it with no arguments, I
        get the following:

        /dev/hda elevator ID            0
                read_latency:           8192
                write_latency:          16384
                max_bomb_segments:      6

        The results for /dev/hdb are identical.  What
        values should I try?  Is this safe, or is there
        a risk of losing my disk if I use bad values here?

Comment 5 Tom Karzes 2002-07-23 09:19:04 UTC

Created attachment 66486
here's a log of my "hdparm" attempts

Comment 6 Tom Karzes 2002-07-23 09:21:33 UTC

Here's a copy of the update I made to ticket #209989,
regarding my attempts to execute the "hdparm" commands
I was given (I attached the log in the previous update):

Ok, I tried the "hdparm" commands that were outlined.  Please
note that the syntax given in the ticket entry was incorrect:
all of the arguments prior to the device name are options and
must be preceded by "-".  So instead of:

    hdparm -u1 d1 c1 X66 /dev/hda

one must type:

    hdparm -u1 -d1 -c1 -X66 /dev/hda

Anyway, once I figured this out, I tried it out.  Unfortunately,
I had some problems.  It is refusing to enable DMA (I had assumed
that DMA had been enabled right from the start, so this was a bit
disturbing).  Anyway when it tried to enable it, it got an
"HDIO_SET_DMA failed: Operation not permitted" error.  This doesn't
sound right, does it???  This is a very new motherboard and very
new hard disks -- how can DMA not be supported?  When it changed
the speed to -X66, it appears to have worked, although at the time
it was changed I got some "ide0: unexpected interrupt, status=0x58,
count=1" errors -- are they expected when the speed is changed?

Second, when I tried to increase the speed to -X68, I got a new
error: "ide0: Speed warnings UDMA 3/4/5 is not functional."  Should
it be?  The motherboard is an Intel Desktop D845EBG2L, and purports
to support UltraDMA 100/66/33, and the disks are ATA133 Maxtor drives.
Anyway, I'm attaching a log of the hdparm attempts (with the errors
that were logged to the console inserted at the points where they
occurred).  I also tried my lock-up example after these changes, and
the problem still exists.

After this I rebooted my machine, without making any changes to
/etc/rc.d/rc.local.  I figured I'd rather find out what's going
on first, and in the meantime it seemed safest to run at the
original settings.

Comment 7 Greg Bailey 2002-07-23 16:10:37 UTC

I've seen this on MANY of our Compaq 1850R's running RedHat 7.1, 7.2, and 7.3. 
All of them have hardware RAID.  I noticed Bugzilla #33309, but didn't see what
the resolution was, as this problem still seems to be occurring, even with the
latest kernel (2.4.18-5).

Comment 8 Tom Karzes 2002-07-24 07:54:36 UTC

It turns out the problem is that the RedHat 7.3 kernel doesn't
support the Intel Desktop D845EBG2L motherboard, due to a
BIOS problem which prevents Linux from enabling DMA.  This
results in a *MASSIVE* slowdown in disk performance, and in
addition heavy disk traffic causes other processes to lock up.
Until RedHat comes out with an official, complete patch for
Intel's BIOS bug (I'm not holding my breath for Intel to
release a BIOS that fixes this; the Intel guy I talked to
told me flat out that Linux is not supported), I strongly
advise people to avoid Intel motherboards.

In the meantime, I've built a patched kernel that bypasses
some resource checks and gets me to mode 2 (but not to
mode 5, which is what it should be using).  I would *really*
like a proper patch for this bug which eliminates the need
for this complete workaround.

Comment 9 Arjan van de Ven 2002-07-24 08:16:49 UTC

We have a patch in testing in the rawhide kernel for this btw

Comment 10 Arjan van de Ven 2002-07-24 08:41:58 UTC

and re elvtune: elvtune cannot cause data loss; it just changes the parameters the
kernel uses to reorder requests. Low values mean "almost no sorting" while high
values mean aggressive sorting. Sorting is good for throughput (disks
have shorter head moves -> shorter seek times) but can cause some starvation,
hence it being limited. I've played with values of 64/256; they seem to improve
response a bit, but they do cost throughput noticeably.

Comment 11 Tom Karzes 2002-07-29 22:04:45 UTC

I've tried some of the 2.4.19 kernels from www.kernel.org.
I first tried 2.4.19-rc3, which does *not* fix this
problem.  I then tried 2.4.19-rc3-ac3, which *does* fix
the problem, so for now I'm using this kernel.  Hopefully
Alan's fixes for this will make it into 2.4.19-rc4 and
eventually into an official RedHat kernel, but for now
this is the only kernel I can use without experiencing
severe performance penalties.  So far I haven't had any
problems with it.  I also noticed that 2.4.19-rc3-ac4 is
now available, although I haven't tried it yet, and
probably won't unless I experience any problems.

Comment 12 Arjan van de Ven 2002-07-29 22:08:17 UTC

as I said, the rawhide kernel (and the Limbo beta kernel) have the needed patch.
If you prefer RPM kernels you could choose to use one of those

Comment 13 Tom Karzes 2002-07-29 23:29:42 UTC

I didn't realize the rawhide kernel was currently available
from RedHat.  I wouldn't mind trying it, but I find the RPM
download interface somewhat cumbersome to use (and at the
moment, the search mechanism appears to be somewhat broken).
What I really want is a list of available kernels, but the
search mechanism doesn't appear to be working properly.
Here's what I did:

    o   Open www.redhat.com

    o   Click on "download"

    o   Under "Find latest RPM's"

        o   Under "By Category"

            o   Select "-Kernel" under "Base" and click "Search"

                (No hits)

            o   Select "Base" directly and click "Search"

                Lots of stuff comes up, most of it
                non-kernel.

        o   Under "By Keyword"

            o   enter "kernel" and click "Search"

                This one seems to work.  A bunch of kernel
                releases pop up.  However, when I try to
                access the second page, I get "Proxy Error"
                (from the RedHat site):

                    Proxy Error

                    We're sorry! We are currently experiencing
                    technical difficulties. Please try again later.

                Presumably this is a temporary problem that will be
                fixed soon.  In any case, none of the visible pages
                mentioned "rawhide", nor did a search on "rawhide"
                yield anything.

Is there a simpler, more reliable way to get a list of the
available kernel RPMs?

Comment 14 Tom Karzes 2002-07-31 04:30:54 UTC

It was suggested that I try the "rawhide" or "Limbo"
kernels from the RPM database, but the RPM
search/download mechanism appears to be broken,
at least for kernel searches.  I've therefore
created a new ticket for this problem, ticket 210823,
which contains the details of the RPM search/download
problems.

Comment 15 Arjan van de Ven 2003-06-07 19:13:54 UTC

current erratum has elevator tuning patches to fix this
