Bug 85231

Summary: extremely slow response during heavy I/O (dd, cp's, etc.) on compaq dl380 RHAS
Product: Red Hat Enterprise Linux 2.1
Version: 2.1
Component: kernel
Hardware: i686
OS: Linux
Severity: high
Priority: medium
Status: CLOSED WONTFIX
Reporter: rob lojek <rob>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: nphilipp, tao
Doc Type: Bug Fix
Last Closed: 2005-09-28 13:11:54 UTC

Description rob lojek 2003-02-26 20:45:29 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3a)
Gecko/20021207 Phoenix/0.5

Description of problem:
After starting a dd or cp of big files, we can't log in on the console, can't ssh to
the machine, and basic commands like "ls" and "free" take > 1 minute to return.




Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. create a large file:

dd if=/dev/zero of=/data1/bkp/2gB-testfile bs=1024 count=2M &

2. Within a few hundred MB, the RHAS machine slows to a crawl. Remote SSH logins take
minutes or are refused; simple commands like "ls" and "free" take minutes to return
(see the timing sketch below).
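
To see the slowdown while the copy is running, it is enough to time a trivial
command from a second shell (a quick sketch, reusing the same file path as above):

# shell 1: the large write from step 1
dd if=/dev/zero of=/data1/bkp/2gB-testfile bs=1024 count=2M &
# shell 2: time trivial commands while the write is in flight; on the affected
# kernels these take minutes instead of milliseconds
time free > /dev/null
time ls > /dev/null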

Actual Results:  [root@f-db8 root]# time free > /dev/null

real    1m8.882s
user    0m0.000s
sys     0m0.010s

Expected Results:  [root@admin00 /root]# time free > /dev/null

real    0m0.005s
user    0m0.000s
sys     0m0.000s


Additional info:

The machine is a DL380 G2, dual 1400 MHz P-III with 4 GB RAM.

Running 2.4.9-e.12 kernel (all the other kernels show same problem, though).
Installing a 2.4.20 kernel (mainline from kernel.org) fixes the problem.

This problem is preventing us from running Oracle in production on this machine.

Tools like vmstat, sar, uptime are affected by the problem as well, making it
hard to gather "profiling" data.

Here's a vmstat from the machine--I started the "dd" after the third line
appeared, and it finished by the line before last:

[root@f-db8 root]# vmstat 10
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0 118084  57396 3166504   0   0    35   226  109    75   1   2  97
 0  0  0      0 117992  57396 3166504   0   0     0    17  103    70   0   1  99
 0  0  0      0 117992  57396 3166504   0   0     0    11  103    28   0   0 100
 0  3  1      0 1211144  57396 2067576   0   0     0 56419  179   461   2  36  62
 0  5  1      0 1075796  57396 2202812   0   0     0 13553  213    74   1   9  90
 0  4  1      0 946492  57396 2332416   0   0     0 12975  208    48   0   6  93
 1  2  2      0 819304  57396 2459412   0   0     0 12940  209    71   0   6  94
 0  5  2      0 689232  57396 2589472   0   0     0 12816  203    41   0   5  94
 0  5  2      0 565500  57396 2713696   0   0     0 12437  206    73   1   6  93
 0  6  2      0 440956  57396 2844520   0   0     0 12728  207    66   1   6  94
 0  6  2      0 313012  57396 2972464   0   0     0 12815  204    40   0   5  95
 0  6  2      0 183544  57396 3101952   0   0     0 12966  205    77   1   7  92
 0  5  2      0 119024  57396 3166512   0   0     0 13347  208    39   0   2  98
 0  4  2      0 118540  57396 3166508   0   0     0 13792  212    77   0   4  96

Comment 1 Chris Kloiber 2003-02-26 23:22:34 UTC
This customer has told me they are running 2.4.9-e.12enterprise, not 2.4.9-e.12.

Comment 2 rob lojek 2003-02-27 00:05:06 UTC
That's correct, it's 2.4.9-e.12enterprise. These are the kernels we've observed
this problem on:

2.4.9-e.12enterprise
2.4.9-e.8smp
2.4.9-e.8enterprise
2.4.9-e.3smp

I'm fairly certain it'll be there for the others in between as well.

Comment 3 Larry Woodman 2003-02-27 17:14:37 UTC
OK, at the risk of opening a really big can of worms here: we think we know
why this is happening, and we added a couple of tuning knobs to 2.4.9-e.12
to work around this problem.

What is likely happening here is that the dd program ends up adding ~512000 dirty
pages to the pagecache.  When either bdflush or kupdate or the dd program
itself starts queueing those pages up for physical IO, the system can end up
with up to several hundred thousand IO buffers queued to the device.  At
best, a single device can only process a few thousand IO buffers per second.
This means that the IO queue can remain very long for 100 seconds or more.
If another process needs to perform IO to that device (ls, free, etc.), it might
very well wait behind half of the IO queue depth on average, thereby
stalling it for up to minutes at a time.

The way this can be worked around is to limit the maximum depth of the IO
queue.  Currently the max IO queue depth (known as high_io_sectors) is ~4500000
512-byte sectors (~2.5GB), based on that 4GB of memory.  Once the IO queue
grows to that value, all additional IO queueing is suspended until the device
catches up and the IO queue drops below a low watermark (known as low_io_sectors)
of ~4000000 512-byte sectors (~2GB).
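
As a rough sanity check on those figures (just shell arithmetic; the GB values
above are ballpark):

# high watermark: ~4,500,000 sectors x 512 bytes each, in (decimal) MB
echo $((4500000 * 512 / 1000000))   # prints 2304, i.e. about 2.3GB
# low watermark: ~4,000,000 sectors x 512 bytes each, in (decimal) MB
echo $((4000000 * 512 / 1000000))   # prints 2048, i.e. about 2GB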

What we have done in 2.4.9-e.12 is to allow tuning of the low and high
watermarks.  They are currently in /proc/sys/vm/low_io_sectors and
/proc/sys/vm/high_io_sectors respectively.  Before getting into tuning
suggestions for these two parameters, let me talk about the effects, side
effects and warnings: lowering the max IO queue results in better interactive
response time at the cost of throughput, while increasing the max IO queue results
in increased throughput at the cost of interactive response time.  Currently
the system defaults to high throughput; in fact the defaults are so high they
are never reached!!!

Can you experiment with these values for us???

Tuning:
1.) let the system become idle, vmstat showing no bo/bi.
2.) set the low_io_sectors (example: echo 32000 > /proc/sys/vm/low_io_sectors)
3.) set the high_io_sectors (example: echo 64000 > /proc/sys/vm/high_io_sectors)

WARNINGS:
1.) make sure the system is idle
2.) whenever changing these two parameters ALWAYS make sure low_io_sectors
    is less than high_io_sectors.


Side Effects:
1.) lowering these values typically results in more overall time to complete
    the program.
2.) lowering these values typically results in more context switching, as
    shown by vmstat.
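
Putting the tuning steps and warnings above together, here is a quick sketch of
the whole sequence (using the low values discussed below, 2000/4000; adjust to
taste):

# 1. make sure the system is idle first -- bi/bo in vmstat should be ~0
vmstat 5 3
# 2. set the low watermark, then the high watermark, keeping low < high throughout
echo 2000 > /proc/sys/vm/low_io_sectors
echo 4000 > /proc/sys/vm/high_io_sectors
# 3. verify the current values
cat /proc/sys/vm/low_io_sectors /proc/sys/vm/high_io_sectors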


We have done limited performance evaluation with these parameters set to
very low values (low_io_sectors = 2000, high_io_sectors = 4000), but we do
notice a much more interactive system at the cost of slightly lower
throughput.


Thanks, Larry Woodman & Dave Anderson



Comment 4 rob lojek 2003-02-27 17:27:44 UTC
Larry & Dave--
thanks for your suggestions, we're jumping on this immediately. One question:
since we're so happy with the general performance of the 2.4.20 kernel we've
tried on this machine, would you happen to know what the default
(low|high)_io_sectors settings are for that kernel, since they're not in the
/proc interface? Or, for any other non-RHAS kernel you might know about?

I'll post our test results here shortly.

--Rob Lojek

Comment 5 Larry Woodman 2003-02-27 19:26:09 UTC
The 2.4.20 kernel is so different in the areas of memory reclamation
and writing out dirty pages and buffers that these parameters aren't even
used any more in that kernel.

Larry


Comment 6 rob lojek 2003-02-27 21:24:23 UTC
I added the tweak above, which improved console responsiveness.

[root@f-db8 root]# time "ls" > /dev/null

real    0m9.178s
user    0m0.000s
sys     0m0.020s

However, it's still long enough for applications connecting to the machine to
time out. This never happens with 2.4.20. I'm concerned that tweaking the
io_sectors numbers may improve responsiveness a little bit, but that it won't
fix the underlying problem.

If you can tell me what the io_sectors size is for 2.4.20 (or where to find it
in the source), then we can at least do a "mano a mano" shootout between the two
and figure out whether io_sectors is even the problem.


A little more background information:
The goal of this whole thing is to figure out whether we can use RHAS on these
machines to run Oracle. Right now, we can't--at least not for the types of
operations we need to do with Oracle. Oracle works beautifully with RH
7.1/2.4.20 or even RHAS/2.4.20, but neither one of those is a "supported
configuration" from Oracle. Therefore, when we discover Oracle bugs (which has
happened in the past), we aren't eligible for support.


--Rob Lojek

Comment 7 Larry Woodman 2003-02-28 16:51:03 UTC
Rob, first question: what values did you set low_io_sectors and
high_io_sectors to???  If you didn't try 2000 and 4000 respectively,
please do that.

Also, where did you get the 2.4.20 kernel you are comparing this to,
Red Hat???

BTW, is there a "queued_sectors" variable in the 2.4.20 System.map???
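
A quick way to check (just a sketch; adjust the path to wherever that kernel's
System.map was installed):

grep -w queued_sectors /boot/System.map-2.4.20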

Thanks, Larry


Comment 8 rob lojek 2003-02-28 17:04:54 UTC
1. I used your examples from the earlier comment:

   2.) set the low_io_sectors (example: echo 32000 > /proc/sys/vm/low_io_sectors)
   3.) set the high_io_sectors (example: echo 64000 > /proc/sys/vm/high_io_sectors)

2. 2.4.20 is from kernel.org's public download area, compiled & installed.

Comment 9 Larry Woodman 2003-02-28 17:21:12 UTC
Once again, please try 4000 and 2000, and after that please try
2000 and 1000, for high_io_sectors and low_io_sectors respectively.  Very low
values should yield much better interactive response with only a
minimal decrease in throughput.

Larry


Comment 10 rob lojek 2003-03-01 00:07:11 UTC
OK, we're still doing some testing with the lower io_sectors limits. While I'm
doing that, could you guys comment on this piece of information that was posted
to this bug's "companion" RH service request (229211) by Christopher K:

"I received a suggestion from our kernel team-

After installing the latest errata kernel, modify the third parameter in
/proc/sys/vm/pagecache.
The easiest way to do that is to edit the file /etc/sysctl.conf, and add a line
that says:

vm.pagecache = 2 50 75

Then run the command 'sysctl -p' as root to make it effective immediately."


Is this something we should try in combination with altering io_sectors limits?
What's your take on this?
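
For reference, here is how I'd put it into /etc/sysctl.conf if we go that route
(a sketch; the two io_sectors entries are my own addition, assuming the usual
/proc/sys-to-sysctl name mapping, which I haven't verified works for these knobs):

# /etc/sysctl.conf additions (apply with: sysctl -p)
vm.pagecache = 2 50 75
vm.low_io_sectors = 2000
vm.high_io_sectors = 4000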

Thanks for your help, I think we're making progress!

--Rob Lojek

Comment 11 rob lojek 2003-03-01 01:07:11 UTC
Here are the quick & dirty test results for throughput. We're seeing what looks like
acceptably snappy system response from the 2000/4000 low/hi limits--almost as
good as 2.4.20. The ultimate judgment will fall to our db-qa guys, though, who
will give this machine a thorough pounding shortly.

As far as throughput goes, I just timed the "dd" command from earlier with different
settings:

1. 5079076/4816932 (hi/lo) -- the default
2. 4000/2000
3. 2000/1000
4. 2.4.20

Results (2 samples each, min:sec):

1. 1:32.12, 1:34.67
2. 2:33.82, 2:36.47
3. 2:51.16, 3:07.78
4. 2:15.18, 2:12.09
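
For the record, each sample was just the earlier dd timed against an otherwise
idle box, roughly like this (a sketch; the watermark values change per run and
are skipped for the stock defaults and the 2.4.20 kernel):

# set the watermarks for this run
echo 2000 > /proc/sys/vm/low_io_sectors
echo 4000 > /proc/sys/vm/high_io_sectors
# time the 2GB write
time dd if=/dev/zero of=/data1/bkp/2gB-testfile bs=1024 count=2M
# remove the file before the next run
rm -f /data1/bkp/2gB-testfile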

It looks like you guys were right about the base kernel settings--they do blaze
for throughput. However, the machine's unusable in that state--at least for us.
It's not just shell commands that are slow; processes listening on ports
time out too (ntp, snmp, etc.).

Next, I'm going to turn the box over to db-qa for db testing. In the meantime,
please let me know if you think that tuning vm.pagecache will be of benefit to
us in addition to/instead of the io_sectors settings.

--Rob Lojek

Comment 12 rob lojek 2003-03-03 23:35:48 UTC
OK, our qa team has been evaluating the box. Oracle seems to run o.k., but
they've all commented on this behavior, which I wanted to post here:

1. Initially, a certain command (e.g., "ls", "mount", etc.) will take 10-60
seconds, but then it will run quickly--almost as if its output were cached.

2. The 1-minute load avg. on the box seems to stay pegged at 1.00, even after
all load-intensive processes (oracle, file copies, etc.) have finished. Then,
after 5-10 minutes, it'll slowly creep back to 0. It's as if an I/O buffer is
being cleared.

Both of these are difficult to reproduce, but any
comments/explanations/suggestions are appreciated.

--Rob Lojek

Comment 13 rob lojek 2003-03-03 23:38:46 UTC
PS--
these are the current io_sectors settings:

[root@f-db8 root]# cat /proc/sys/vm/low_io_sectors
2000
[root@f-db8 root]# cat /proc/sys/vm/high_io_sectors
4000

These look like they give us 2.4.20-like throughput, with (usually) acceptable
snappiness.

Comment 14 Larry Woodman 2005-09-28 13:11:54 UTC
We can't make changes to this area of the AS2.1 kernel any more.  Please upgrade
to RHEL3 or RHEL4.

Larry Woodman