Bug 85231
| Summary: | extremely slow response during heavy I/O (dd, cp's, etc.) on compaq dl380 RHAS | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 2.1 | Reporter: | rob lojek <rob> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.1 | CC: | nphilipp, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-09-28 13:11:54 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
rob lojek
2003-02-26 20:45:29 UTC
This customer has told me they are running 2.4.9-e.12enterprise, not 2.4.9-e.12. That's correct, it's 2.4.9-e.12enterprise. These are the kernels we've observed this problem on: 2.4.9-e.12enterprise, 2.4.9-e.8smp, 2.4.9-e.8enterprise, 2.4.9-e.3smp. I'm fairly certain it'll be there for the others in between as well.

OK, at the risk of opening a really big can of worms: we think we know why this is happening, and we added a couple of tuning knobs to 2.4.9-e.12 to work around this problem. What is likely happening here is that the dd program ends up adding ~512000 dirty pages to the pagecache. When bdflush, kupdate, or the dd program itself starts queueing those pages up for physical I/O, the system can end up with up to several hundred thousand I/O buffers queued to the device. At best, a single device can only process a few thousand I/O buffers per second. This means the I/O queue can remain very long for 100 seconds or more. If another process needs to perform I/O to that device (ls, free, etc.), it might very well wait behind half of the I/O queue depth on average, thereby stalling it for up to minutes at a time.

The way this can be worked around is to limit the maximum depth of the I/O queue. Currently the max I/O queue depth (known as high_io_sectors) is ~4500000 512-byte sectors (~2.5GB), based on the machine's 4GB of memory. Once the I/O queue grows to that value, all additional I/O queueing is suspended until the device catches up and the I/O queue drops below a low watermark (known as low_io_sectors) of ~4000000 512-byte sectors (~2GB). What we have done in 2.4.9-e.12 is to allow tuning of the low and high watermarks. They live in /proc/sys/vm/low_io_sectors and /proc/sys/vm/high_io_sectors respectively.
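[Editorial note: the suspend/resume behavior described above is a two-watermark hysteresis. A toy shell sketch of that logic, using the tunables' names as variables, may make it clearer; the real accounting of course lives inside the AS2.1 kernel in C, and this model is purely illustrative.]

```shell
# Toy model of the two-watermark I/O throttle described above (illustrative only).
low_io_sectors=2000
high_io_sectors=4000
queued=0      # sectors currently sitting in the device's I/O queue
throttled=0   # 1 = all new queueing is suspended

queue_io() {  # $1 = sectors a writer wants to queue; fails while throttled
  if [ "$throttled" -eq 1 ]; then
    return 1  # caller must wait for the device to drain the queue
  fi
  queued=$((queued + $1))
  if [ "$queued" -ge "$high_io_sectors" ]; then
    throttled=1             # hit the high watermark: stop queueing
  fi
  return 0
}

complete_io() {  # $1 = sectors the device finished writing
  queued=$((queued - $1))
  if [ "$queued" -lt "$low_io_sectors" ]; then
    throttled=0             # drained below the low watermark: resume queueing
  fi
  return 0
}
```

The gap between the two watermarks is what prevents the throttle from flapping on and off on every completed request.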
Before getting into tuning suggestions for these two parameters, let me talk about the effects, side effects, and warnings. Lowering the max I/O queue results in better interactive response time at the cost of throughput; increasing the max I/O queue results in increased throughput at the cost of interactive response time. Currently the system defaults to high throughput; in fact the defaults are so high they are never reached! Can you experiment with these values for us?

Tuning:
1. Let the system become idle (vmstat showing no bo/bi).
2. Set low_io_sectors (example: echo 32000 > /proc/sys/vm/low_io_sectors).
3. Set high_io_sectors (example: echo 64000 > /proc/sys/vm/high_io_sectors).

WARNINGS:
1. Make sure the system is idle.
2. Whenever changing these two parameters, ALWAYS make sure low_io_sectors is less than high_io_sectors.

Side effects:
1. Lowering these values typically results in more overall time to complete the program.
2. Lowering these values typically results in more context switching, as shown by vmstat.

We have done limited performance evaluation with these parameters set to very low values (low_io_sectors = 2000, high_io_sectors = 4000), but we do notice a much more interactive system at the cost of a slightly lower throughput. Thanks, Larry Woodman & Dave Anderson

Larry & Dave -- thanks for your suggestions, we're jumping on this immediately. One question: since we're so happy with the general performance of the 2.4.20 kernel we've tried on this machine, would you happen to know what the default (low|high)_io_sectors settings are for that kernel, since they're not in the /proc interface? Or for any other non-RHAS kernel you might know about? I'll post our test results here shortly. --Rob Lojek

The 2.4.20 kernel is so different in the areas of memory reclamation and writing out dirty pages and buffers that these parameters aren't even used any more in that kernel. Larry

I added the tweak above, which improved console responsiveness.
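[Editorial note: the tuning steps and the low-must-stay-below-high warning above can be wrapped in a small helper. This is a hypothetical sketch, not something shipped with the kernel; the optional `dir` parameter exists only so the function can be exercised against ordinary files instead of /proc. It orders the two writes so that low < high holds at every intermediate step.]

```shell
# Hypothetical helper for the tuning procedure above.
# Usage: set_io_sectors NEW_LOW NEW_HIGH [DIR]   (DIR defaults to /proc/sys/vm)
set_io_sectors() {
  low=$1; high=$2; dir=${3:-/proc/sys/vm}
  if [ "$low" -ge "$high" ]; then
    echo "refusing: low_io_sectors ($low) must be less than high_io_sectors ($high)" >&2
    return 1
  fi
  cur_high=$(cat "$dir/high_io_sectors")
  if [ "$low" -lt "$cur_high" ]; then
    # New low fits under the current ceiling: drop the low watermark first.
    echo "$low"  > "$dir/low_io_sectors"
    echo "$high" > "$dir/high_io_sectors"
  else
    # Raising past the current ceiling: raise the high watermark first.
    echo "$high" > "$dir/high_io_sectors"
    echo "$low"  > "$dir/low_io_sectors"
  fi
  return 0
}

# Per the warnings above, run this only on an idle system, e.g.:
#   set_io_sectors 2000 4000
```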
[root@f-db8 root]# time "ls" > /dev/null

real    0m9.178s
user    0m0.000s
sys     0m0.020s

However, it's still long enough for applications connecting to the machine to time out. This never happens with 2.4.20. I'm concerned that tweaking the io_sectors numbers may improve responsiveness a little bit, but that it won't fix the underlying problem. If you can tell me what the io_sectors size is for 2.4.20 (or where to find it in the source), then we can at least do a "mano a mano" shootout between the two and figure out whether io_sectors is even the problem.

A little more background information: the goal of this whole thing is to figure out whether we can use RHAS on these machines to run Oracle. Right now, we can't -- at least not for the types of operations we need to do with Oracle. Oracle works beautifully with RH 7.1/2.4.20 or even RHAS/2.4.20, but neither one of those is a "supported configuration" from Oracle. Therefore, when we discover Oracle bugs (which has happened in the past), we aren't eligible for support. --Rob Lojek

Rob, first question: what values did you set low_io_sectors and high_io_sectors to? If you didn't try 2000 and 4000 respectively, please do that. Also, where did you get the 2.4.20 kernel you are comparing this to -- Red Hat? BTW, is there a "queued_sectors" variable in the 2.4.20 System.map? Thanks, Larry

1. I used your examples from the earlier comment:
   echo 32000 > /proc/sys/vm/low_io_sectors
   echo 64000 > /proc/sys/vm/high_io_sectors
2. 2.4.20 is from kernel.org's public download area, compiled & installed.

Once again, please try 4000 and 2000, and after that please try 2000 and 1000, for high_io_sectors and low_io_sectors. Very low values should yield much better interactive response with only a minimal decrease in throughput. Larry

OK, we're still doing some testing with the lower io_sectors limits.
While I'm doing that, could you guys comment on this piece of information that was posted to this bug's "companion" RH service request (229211) by Christopher K:

"I received a suggestion from our kernel team: after installing the latest errata kernel, modify the third parameter in /proc/sys/vm/pagecache. The easiest way to do that is to edit the file /etc/sysctl.conf and add a line that says:

vm.pagecache = 2 50 75

Then run the command 'sysctl -p' as root to make it effective immediately."

Is this something we should try in combination with altering the io_sectors limits? What's your take on this? Thanks for your help, I think we're making progress! --Rob Lojek

Here are the quick & dirty test results for throughput. We're seeing what looks like acceptably snappy system response from the 2000/4000 low/high limits -- almost as good as 2.4.20. The ultimate judgement will fall to our db-qa guys, though, who will give this machine a thorough pounding shortly. As far as throughput goes, I just timed the "dd" command from earlier with different settings. Results (2 samples each, min:sec):

1. 5079076/4816932 (hi/lo, the default): 1:32.12, 1:34.67
2. 4000/2000: 2:33.82, 2:36.47
3. 2000/1000: 2:51.16, 3:07.78
4. 2.4.20: 2:15.18, 2:12.09

It looks like you guys were right about the base kernel settings -- they do blaze for throughput. However, the machine's unusable in that state, at least for us. It's not just shell commands that are slow; processes listening on ports time out too (ntp, snmp, etc.). Next, I'm going to turn the box over to db-qa for db testing. In the meantime, please let me know if you think that tuning vm.pagecache will be of benefit to us in addition to, or instead of, the io_sectors settings. --Rob Lojek

OK, our qa team has been evaluating the box. Oracle seems to run OK, but they've all commented on this behavior, which I wanted to post here:

1. Initially, a certain command (e.g., "ls", "mount", etc.)
will take 10-60 seconds, but then it will run quickly -- almost as if its output were cached.
2. The 1-minute load average on the box seems to stay pegged at 1.00, even after all load-intensive processes (oracle, file copies, etc.) have finished. Then, after 5-10 minutes, it'll slowly creep back to 0. It's as if an I/O buffer is being cleared.

Both of these are difficult to reproduce, but any comments/explanations/suggestions are appreciated. --Rob Lojek

PS -- these are the current io_sectors settings:

[root@f-db8 root]# cat /proc/sys/vm/low_io_sectors
2000
[root@f-db8 root]# cat /proc/sys/vm/high_io_sectors
4000

These look like they give us 2.4.20-like throughput, with (usually) acceptable snappiness.

We can't make changes to this area of the AS2.1 kernel any more. Please upgrade to RHEL3 or RHEL4. Larry Woodman
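[Editorial note: for readers landing on this closed bug, the vm.pagecache suggestion quoted earlier and the io_sectors values that tested best in this thread could be persisted together in /etc/sysctl.conf. This combination was not verified in the thread, applies only to AS2.1 errata kernels that expose the io_sectors tunables, and assumes sysctl's usual mapping of /proc/sys/vm/* paths to vm.* keys; apply with `sysctl -p` on an idle system.]

```
# /etc/sysctl.conf fragment (AS2.1 with an io_sectors-capable errata kernel)
vm.pagecache = 2 50 75        # kernel team suggestion from service request 229211
vm.low_io_sectors = 2000      # values reported above as acceptably snappy
vm.high_io_sectors = 4000
```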