Red Hat Bugzilla – Bug 115438
strange load - kswapd/IO ?
Last modified: 2007-11-30 17:07:00 EST
Description of problem:
On one of our servers we often experience a load of 1 even when
nothing special is running.
vmstat 1 says:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy  id wa
 0  1 212736  10196 111044 558868    0    0     0     0  110   171   0  0 100  0
 0  1 212736  10196 111044 558868    0    0     0     0  106   166   0  0 100  0
 0  1 212736  10196 111044 558868    0    0     0     0  105   162   0  0 100  0
So, for me, this seems to be an IO problem.
Checking the ps output I found that kswapd is in an uninterruptable
sleep (the ps man page says that this is usually caused by IO):
$ ps lax | grep kswap
1     0     4     1  15   0     0     0 schedu DW   ?        0:01 kswapd
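Tasks in uninterruptible (D) sleep can also be listed system-wide; on Linux each such task counts toward the load average even while using no CPU. This is a generic ps invocation, nothing specific to this machine:

```shell
# Show every process currently in uninterruptible sleep; each one
# contributes 1 to the load average even with zero CPU use.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print $2, $3 }'

# The 1/5/15-minute load averages themselves:
cat /proc/loadavg
```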
This high load is not reproducible. There are times when everything
seems to work fine.
We have some other RHEL 3 AS servers, too. This problem only occurs
on this machine. The only difference I can see is that on this machine
we use the uniprocessor kernel (whereas all our other servers run
the SMP kernel).
I think we can rule out a hardware problem: we migrated the whole
setup to another server (created partitions, rsync etc.), but the
problem persisted there as well.
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux AS 3 (Taroon Update 1)
1 GB memory
1.5 GB swap
1 1GHz PIII CPU
Sorry, forgot one thing:
"not reproduceable" means that we cannot say when (or why) this
happens. However, it happens repeatedly (the load is 1 for maybe
50-60% of the time).
Could you please add some vmstat output to this bug report?
Is the kernel busy swapping things out?
Are they being swapped in again, too?
No, there is no swap in or swap out.
Some more vmstat lines:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy  id wa
 0  1 217460   9000 112752 554064    0    0     0     0  106   168   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  128   194   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  105   166   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  106   163   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0   240  198   198   0  0  81 19
 0  1 217460   8984 112752 554076    0    0     0     0  108   167   0  0 100  0
 0  1 217460   8988 112752 554076    0    0     0     0  105   163   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0     0  105   164   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0     0  104   162   0  1  99  0
Enable sysrq if not already done, by either setting the
"kernel.sysrq" value to 1 in /etc/sysctl.conf, or by
manually echo'ing a 1 into /proc/sys/kernel/sysrq.
Then during the time when the system is in this state,
give us the output of several Alt-Sysrq-m entries.
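For reference, the same steps can be performed entirely from a shell. This is a sketch: the /proc/sysrq-trigger interface may not be present on all 2.4 kernels, in which case Alt-Sysrq-m on the console is the way to go.

```shell
# Enable the magic SysRq key (equivalent to kernel.sysrq = 1
# in /etc/sysctl.conf)
echo 1 > /proc/sys/kernel/sysrq

# On kernels that provide /proc/sysrq-trigger, writing 'm' dumps
# memory information just like Alt-Sysrq-m on the console
echo m > /proc/sysrq-trigger

# The dump goes to the kernel ring buffer; capture it with dmesg
dmesg | tail -n 40
```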
Created attachments 97792, 97793, 97794, 97795, 97796, 97797 (Alt-Sysrq-m output)
As you saw, there actually is no I/O going on, but kswapd is
sleeping in an uninterruptible state, which counts as a load of 1.
Please try this:
# echo 100 > /proc/sys/vm/inactive_clean_percent
Yes, this seems to help.
What exactly does this do? I did not find any documentation about this
proc file ...
It basically restores a part of the page reclamation task
to mimic its behaviour in the RHEL3 initial release.
The original (pre-U1) behaviour was far more aggressive
in moving pages (writing out) from the inactive dirty list to the
inactive laundry list. However, it was found to be unacceptably
detrimental to database performance due to unnecessary I/O.
Basically, pages will be moved from the inactive dirty list
(written out) when the sum of the inactive laundry list and
inactive clean pages falls to less than 5 percent (the default
inactive_clean_percent value) of inactive dirty list pages.
In other words, it tries to keep the combined inactive laundry and
inactive clean page counts at 5% of the inactive dirty list count.
By increasing the inactive_clean_percent, pages will be written
out (moved from the inactive dirty list) more aggressively.
By setting it to 100 percent, it mimics the original behavior
which kept the combined count of inactive laundry and inactive-clean
pages equal to the number of pages on the inactive dirty list.
If desired, you could experiment with the percentage, i.e., it
isn't necessary to keep it at 5 or 100.
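The threshold logic described above can be sketched with a little shell arithmetic. The variable names and page counts here are purely illustrative, not actual kernel symbols:

```shell
# Hypothetical page counts (illustrative numbers, not from a real system)
inactive_dirty=10000
inactive_laundry=300
inactive_clean=100
inactive_clean_percent=5   # the U1 default

# kswapd's target: laundry + clean should total at least this many pages
target=$(( inactive_dirty * inactive_clean_percent / 100 ))

if [ $(( inactive_laundry + inactive_clean )) -lt "$target" ]; then
    echo "write out pages from the inactive dirty list"
else
    echo "enough laundered/clean pages; do nothing"
fi
```

With the default of 5, the target is 500 pages, so 400 laundry+clean pages triggers write-out; at 100, the target becomes 10000, mimicking the pre-U1 behaviour.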
That being said, there is a corner case problem with the new
scheme such that kswapd is fooled into running far more often, but
not doing anything. It normally sleeps for a second before
being woken up to check for the need to do some work. However,
in your case, it sees that the page count on one or more zones
is between "min" and "low", sleeps uninterruptibly for a very
short amount of time, wakes up to do some page write-outs, but
the inactive_clean_percent value of 5 keeps that from happening.
Hence it goes back to sleep again in the same manner. Because it
blocks in an uninterruptible state, it is counted as a load of 1.
We are looking into a better way of handling such cases.
But for now, the inactive_clean_percent tuneable is the best way
to deal with this.
We are planning to use 30 as the default inactive_clean_percent
value (instead of 5) in the next RHEL3 release.
If possible, could you please try setting inactive_clean_percent
to 30 (as opposed to 100)? This can be done at any time by echo'ing
30 into /proc/sys/vm/inactive_clean_percent. Then let us know
if you ever see kswapd get into the same "D" state when nothing
special is running, although we believe that should no longer happen.
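Applying and persisting the suggested value would look something like the following; the sysctl.conf line assumes the tunable follows the usual /proc/sys-to-sysctl naming convention:

```shell
# Apply immediately (requires root)
echo 30 > /proc/sys/vm/inactive_clean_percent

# Verify the new value
cat /proc/sys/vm/inactive_clean_percent

# Persist across reboots via /etc/sysctl.conf
# (assumed name, derived from the /proc/sys/vm path)
echo "vm.inactive_clean_percent = 30" >> /etc/sysctl.conf
```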
Just as an additional voice we have a small development database
server that is exhibiting this same issue. It's a dual processor
system with 4GB of memory (Dell 2550). The systems have limited CPU
load but experience significant memory and I/O pressure at times.
We had already figured out that setting inactive_clean_percent to 25
seemed to mostly eliminate the problem. After finding this note we
have increased the value to 30, and so far that seems to have
eliminated it as well.
Just thought another data point might be of interest.
Well, I now have a case where the problem occurs even with
inactive_clean_percent set to 30. We reconfigured the Dell 2550
servers to use hugetlb support and to keep the SGA locked in memory.
These systems have 4GB of memory and run two Oracle instances with
1GB SGA's each.
Leaving everything as normal, this system performs horribly (it always
performed great under RHEL 2.1) because the system is way too
swappy. After reconfiguring the kernel and Oracle to support hugetlb
and ramfs (even though this is not a VLM system), the performance is
great, since the system is not always trying to swap out the Oracle SGA.
Unfortunately, having 2GB of memory that is unswappable seems to
trigger the kswapd issue again. We ended up bumping the
inactive_clean_percent back to 100 to get rid of this.
The systems perform great when configured this way, so it's not really
a big deal, but I wanted to mention that it is possible to trigger
the kswapd issue even when it is set to 30. Would this be expected?
As it stands now, there is no inactive_clean_percent value that is
going to satisfy all implementations all the time. The use of 100%,
which mimics the way page reclamation worked in RHEL3 pre-U1, was
clearly unacceptable for large database instances; for your situation
it's desirable. That is why inactive_clean_percent was made a tuneable.
OK, like I said, no big issue, although changing the way something
works like that should certainly be mentioned in the release notes,
perhaps with a link to exactly what this tuneable does. Introducing
changes in default behaviour via an update, and without clear
notification and explanation, is very "Microsoftish".
An errata has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen
this bug report if the solution does not work for you.