From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3a) Gecko/20021207 Phoenix/0.5

Description of problem:
After starting a dd or cp of big files, it is impossible to log in on the console or ssh to the machine, and basic commands like "ls" and "free" take more than a minute to return.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a large file:
   dd if=/dev/zero of=/data1/bkp/2gB-testfile bs=1024 count=2M &
2. Within a few hundred MB, the RHAS machine slows to a crawl. Remote ssh logins take minutes or are refused, and simple commands like "ls" and "free" take minutes to return.

Actual Results:
[root@f-db8 root]# time free > /dev/null

real    1m8.882s
user    0m0.000s
sys     0m0.010s

Expected Results:
[root@admin00 /root]# time free > /dev/null

real    0m0.005s
user    0m0.000s
sys     0m0.000s

Additional info:
The machine is a DL380 G2, dual 1400 MHz P-III with 4 GB RAM, running the 2.4.9-e.12 kernel (all the other kernels show the same problem, though). Installing a 2.4.20 kernel (mainline from kernel.org) fixes the problem.

This problem is preventing us from running Oracle in production on this machine. Tools like vmstat, sar, and uptime are affected by the problem as well, making it hard to gather "profiling" data.

Here's a vmstat from the machine -- I started the "dd" after the third line appeared, and it finished by the line before last:

[root@f-db8 root]# vmstat 10
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd    free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  118084  57396 3166504   0   0    35   226  109    75   1   2  97
 0  0  0      0  117992  57396 3166504   0   0     0    17  103    70   0   1  99
 0  0  0      0  117992  57396 3166504   0   0     0    11  103    28   0   0 100
 0  3  1      0 1211144  57396 2067576   0   0     0 56419  179   461   2  36  62
 0  5  1      0 1075796  57396 2202812   0   0     0 13553  213    74   1   9  90
 0  4  1      0  946492  57396 2332416   0   0     0 12975  208    48   0   6  93
 1  2  2      0  819304  57396 2459412   0   0     0 12940  209    71   0   6  94
 0  5  2      0  689232  57396 2589472   0   0     0 12816  203    41   0   5  94
 0  5  2      0  565500  57396 2713696   0   0     0 12437  206    73   1   6  93
 0  6  2      0  440956  57396 2844520   0   0     0 12728  207    66   1   6  94
 0  6  2      0  313012  57396 2972464   0   0     0 12815  204    40   0   5  95
 0  6  2      0  183544  57396 3101952   0   0     0 12966  205    77   1   7  92
 0  5  2      0  119024  57396 3166512   0   0     0 13347  208    39   0   2  98
 0  4  2      0  118540  57396 3166508   0   0     0 13792  212    77   0   4  96
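For anyone trying to reproduce this, here is a minimal sketch built from the steps above: it starts the big write in the background and samples how long "free" takes while the write is in flight. The /data1/bkp path and dd arguments come from the report; the loop and the ten-second sampling interval are illustrative additions.

#!/bin/sh
# repro.sh -- sketch: reproduce the stall and measure command latency.
# Writes a 2 GB file in the background, then times "free" during the write.
dd if=/dev/zero of=/data1/bkp/2gB-testfile bs=1024 count=2M &
DD_PID=$!

while kill -0 "$DD_PID" 2>/dev/null; do
    # On an affected kernel, "real" climbs from milliseconds to minutes.
    time free > /dev/null
    sleep 10
done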
This customer has told me they are running 2.4.9-e.12enterprise, not 2.4.9-e.12.
That's correct, it's 2.4.9-e.12enterprise. These are the kernels we've observed this problem on:

2.4.9-e.12enterprise
2.4.9-e.8smp
2.4.9-e.8enterprise
2.4.9-e.3smp

I'm fairly certain it'll be there for the others in between as well.
OK, at the risk of opening a really big can of worms here: we think we know why this is happening, and we added a couple of tuning knobs to 2.4.9-e.12 to work around this problem.

What is likely happening is that the dd program ends up adding ~512,000 dirty pages to the pagecache. When bdflush, kupdate, or the dd program itself starts queueing those pages up for physical IO, the system can end up with up to several hundred thousand IO buffers queued to the device. At best, a single device can only process a few thousand IO buffers per second, which means the IO queue can remain very long for 100 seconds or more. If another process needs to perform IO to that device (ls, free, etc.), on average it may wait behind half of the IO queue depth, stalling it for up to minutes at a time.

The way this can be worked around is to limit the maximum depth of the IO queue. Currently the max IO queue depth (known as high_io_sectors) is ~4,500,000 512-byte sectors (~2.5 GB), based on that 4 GB of memory. Once the IO queue grows to that value, all additional IO queueing is suspended until the device catches up and the IO queue drops below a low watermark (known as low_io_sectors) of ~4,000,000 512-byte sectors (~2 GB).

What we have done in 2.4.9-e.12 is to allow tuning of the low and high watermarks. They are in /proc/sys/vm/low_io_sectors and /proc/sys/vm/high_io_sectors respectively.

Before getting into tuning suggestions for these two parameters, let me talk about the effects, side effects, and warnings. Lowering the max IO queue results in better interactive response time at the cost of throughput; increasing the max IO queue results in increased throughput at the cost of interactive response time. Currently the system defaults to high throughput -- in fact the defaults are so high they are never reached! Can you experiment with these values for us?

Tuning:
1. Let the system become idle, with vmstat showing no bo/bi.
2. Set low_io_sectors (example: echo 32000 > /proc/sys/vm/low_io_sectors).
3. Set high_io_sectors (example: echo 64000 > /proc/sys/vm/high_io_sectors).

Warnings:
1. Make sure the system is idle.
2. Whenever changing these two parameters, ALWAYS make sure low_io_sectors is less than high_io_sectors.

Side effects:
1. Lowering these values typically results in more overall time to complete the program.
2. Lowering these values typically results in more context switching, as shown by vmstat.

We have done limited performance evaluation with these parameters set to very low values (low_io_sectors = 2000, high_io_sectors = 4000), but we do notice a much more interactive system at the cost of slightly lower throughput.

Thanks, Larry Woodman & Dave Anderson
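To make the ordering constraint in warning 2 concrete, here is a minimal shell sketch of the tuning procedure above. The /proc/sys/vm/low_io_sectors and /proc/sys/vm/high_io_sectors paths come from the comment; the script name and the raise-versus-lower write ordering are illustrative. It keeps low_io_sectors below high_io_sectors at every intermediate step.

#!/bin/sh
# set_io_watermarks.sh -- sketch: apply new IO queue watermarks safely.
# Usage: ./set_io_watermarks.sh <low_io_sectors> <high_io_sectors>
# Per the warnings above, run this only while the system is idle.
NEW_LOW=$1
NEW_HIGH=$2

# Sanity check: low must stay below high (see warning 2 above).
if [ "$NEW_LOW" -ge "$NEW_HIGH" ]; then
    echo "error: low_io_sectors must be less than high_io_sectors" >&2
    exit 1
fi

CUR_HIGH=`cat /proc/sys/vm/high_io_sectors`

if [ "$NEW_HIGH" -lt "$CUR_HIGH" ]; then
    # Lowering the watermarks: shrink low first so low < high holds throughout.
    echo "$NEW_LOW"  > /proc/sys/vm/low_io_sectors
    echo "$NEW_HIGH" > /proc/sys/vm/high_io_sectors
else
    # Raising the watermarks: grow high first for the same reason.
    echo "$NEW_HIGH" > /proc/sys/vm/high_io_sectors
    echo "$NEW_LOW"  > /proc/sys/vm/low_io_sectors
fi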
Larry & Dave-- thanks for your suggestions, we're jumping on this immediately. One question: since we're so happy with the general performance of the 2.4.20 kernel we've tried on this machine, would you happen to know what the default (low|high)_io_sectors settings are for that kernel, since they're not in the /proc interface? Or, for any other non-RHAS kernel you might know about? I'll post our test results here shortly. --Rob Lojek
The 2.4.20 kernel is so different in the areas of memory reclamation and writing out dirty pages and buffers that these parameters aren't even used any more in that kernel. Larry
I added the tweak above, which improved console responsiveness:

[root@f-db8 root]# time "ls" > /dev/null

real    0m9.178s
user    0m0.000s
sys     0m0.020s

However, it's still long enough for applications connecting to the machine to time out. This never happens with 2.4.20. I'm concerned that tweaking the io_sectors numbers may improve responsiveness a little bit, but that it won't fix the underlying problem. If you can tell me what the io_sectors size is for 2.4.20 (or where to find it in the source), then we can at least do a "mano a mano" shootout between the two and figure out whether io_sectors is even the problem.

A little more background information: the goal of this whole exercise is to figure out whether we can use RHAS on these machines to run Oracle. Right now, we can't -- at least not for the types of operations we need to do with Oracle. Oracle works beautifully with RH 7.1/2.4.20 or even RHAS/2.4.20, but neither one of those is a "supported configuration" from Oracle. Therefore, when we discover Oracle bugs (which has happened in the past), we aren't eligible for support. --Rob Lojek
Rob, first question: what values did you set low_io_sectors and high_io_sectors to? If you didn't try 2000 and 4000 respectively, please do that. Also, where did you get the 2.4.20 kernel you are comparing this to -- Red Hat? BTW, is there a "queued_sectors" variable in the 2.4.20 System.map? Thanks, Larry
1. I used your examples from the earlier comment:

   echo 32000 > /proc/sys/vm/low_io_sectors
   echo 64000 > /proc/sys/vm/high_io_sectors

2. 2.4.20 is from kernel.org's public download area, compiled & installed.
Once again, please try 4000 and 2000, and after that please try 2000 and 1000, for high_io_sectors and low_io_sectors respectively. Very low values should yield much better interactive response with only a minimal decrease in throughput. Larry
OK, we're still doing some testing with the lower io_sectors limits. While I'm doing that, could you guys comment on this piece of information, which was posted to this bug's "companion" RH service request (229211) by Christopher K:

"I received a suggestion from our kernel team: after installing the latest errata kernel, modify the third parameter in /proc/sys/vm/pagecache. The easiest way to do that is to edit the file /etc/sysctl.conf and add a line that says:

vm.pagecache = 2 50 75

Then run the command 'sysctl -p' as root to make it effective immediately."

Is this something we should try in combination with altering the io_sectors limits? What's your take on this? Thanks for your help, I think we're making progress! --Rob Lojek
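If both tunings end up being used together, a sketch of the combined /etc/sysctl.conf entries might look like the following. The vm.pagecache line is taken from the suggestion quoted above; the vm.low_io_sectors and vm.high_io_sectors names are an assumption, on the basis that sysctl normally maps /proc/sys/vm entries to vm.* keys -- verify they appear in "sysctl -a" before relying on this.

# /etc/sysctl.conf -- sketch: persist the suggested tunings across reboots.
# vm.pagecache is the three-field pagecache tunable from the quoted advice.
vm.pagecache = 2 50 75
# Assumed sysctl names for /proc/sys/vm/{low,high}_io_sectors; keep low < high.
vm.low_io_sectors = 2000
vm.high_io_sectors = 4000

Apply with "sysctl -p" as root, as in the quoted suggestion.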
Here are the quick & dirty throughput test results. We're seeing what looks like acceptably snappy system response from the 2000/4000 low/high limits -- almost as good as 2.4.20. The ultimate judgement will fall to our db-qa guys, though, who will give this machine a thorough pounding shortly.

As far as throughput goes, I just timed the "dd" command from earlier with different settings (2 samples each, min:sec):

1. 5079076/4816932 (hi/lo, the default):  1:32.12, 1:34.67
2. 4000/2000:                             2:33.82, 2:36.47
3. 2000/1000:                             2:51.16, 3:07.78
4. 2.4.20:                                2:15.18, 2:12.09

It looks like you guys were right about the base kernel settings -- they do blaze for throughput. However, the machine's unusable in that state, at least for us. It's not just shell commands that are slow; processes listening on ports time out too (ntp, snmp, etc.).

Next, I'm going to turn the box over to db-qa for db testing. In the meantime, please let me know if you think that tuning vm.pagecache will benefit us in addition to, or instead of, the io_sectors settings. --Rob Lojek
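For reference, a minimal sketch of a harness for collecting numbers like these, one watermark pair per run. It reuses the hypothetical set_io_watermarks.sh helper sketched earlier (so the watermarks are written in a safe order) and the dd command from the original report; the script name and the sync step are illustrative.

#!/bin/sh
# time_dd.sh -- sketch: time the test write under one pair of IO watermarks.
# Usage: ./time_dd.sh <low_io_sectors> <high_io_sectors>
# Run on an idle system, per the warnings earlier in this bug.
./set_io_watermarks.sh "$1" "$2" || exit 1

sync    # flush leftovers from a prior run so each sample starts clean
time dd if=/dev/zero of=/data1/bkp/2gB-testfile bs=1024 count=2M
rm -f /data1/bkp/2gB-testfile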
OK, our qa team has been evaluating the box. Oracle seems to run OK, but they've all commented on this behavior, which I wanted to post here:

1. Initially, a certain command (e.g., "ls", "mount", etc.) will take 10-60 seconds, but then it will run quickly -- almost as if its output were cached.

2. The 1-minute load avg. on the box seems to stay pegged at 1.00, even after all load-intensive processes (oracle, file copies, etc.) have finished. Then, after 5-10 minutes, it'll slowly creep back to 0. It's as if an I/O buffer is being cleared.

Both of these are difficult to reproduce, but any comments/explanations/suggestions are appreciated. --Rob Lojek
PS -- these are the current io_sectors settings:

[root@f-db8 root]# cat /proc/sys/vm/low_io_sectors
2000
[root@f-db8 root]# cat /proc/sys/vm/high_io_sectors
4000

These look like they give us 2.4.20-like throughput, with (usually) acceptable snappiness.
We can't make changes to this area of the AS2.1 kernel any more. Please upgrade to RHEL3 or RHEL4. Larry Woodman