Red Hat Bugzilla – Bug 115438
strange load - kswapd/IO ?
Last modified: 2007-11-30 17:07:00 EST
Description of problem:
On one of our servers we often experience a load of 1 even when
nothing special is running.
vmstat 1 says:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy  id wa
 0  1 212736  10196 111044 558868    0    0     0     0  110   171   0  0 100  0
 0  1 212736  10196 111044 558868    0    0     0     0  106   166   0  0 100  0
 0  1 212736  10196 111044 558868    0    0     0     0  105   162   0  0 100  0
So, for me, this seems to be an IO problem.
Checking the ps output I found that kswapd is in an uninterruptable
sleep (the ps man page says that this is usually caused by IO):
$ ps lax | grep kswap
1     0     4     1  15   0     0     0 schedu DW   ?        0:01 kswapd
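Tasks in uninterruptible (D) sleep can also be listed system-wide; on Linux each such task counts toward the load average even while using no CPU. This is a generic ps invocation, nothing specific to this machine:

```shell
# Show every process currently in uninterruptible sleep; each one
# contributes 1 to the load average even with zero CPU use.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print $2, $3 }'

# The 1/5/15-minute load averages themselves:
cat /proc/loadavg
```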
This high load is not reproducible. There are times when everything
seems to work fine.
We have some other RHEL 3 AS servers, too. This problem only occurs
on this machine. The only difference I can see is that on this machine
we use the uniprocessor kernel (whereas all our other servers run
the SMP kernel).
I think we can rule out a hardware problem: we migrated the whole
setup to another server (created partitions, rsync etc.), but the
problem persisted there as well.
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux AS 3 (Taroon Update 1)
1 GB memory
1.5 GB swap
1 1GHz PIII CPU
Sorry, forgot one thing:
"not reproduceable" means that we cannot say when (or why) this
happens. However, it happens repeatedly (the load is 1 for maybe
50-60% of the time).
Could you please add some vmstat output to this bug report?
Is the kernel busy swapping things out?
Are they being swapped in again, too?
No, there is no swap in or swap out.
Some more vmstat lines:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy  id wa
 0  1 217460   9000 112752 554064    0    0     0     0  106   168   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  128   194   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  105   166   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  106   163   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0   240  198   198   0  0  81 19
 0  1 217460   8984 112752 554076    0    0     0     0  108   167   0  0 100  0
 0  1 217460   8988 112752 554076    0    0     0     0  105   163   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0     0  105   164   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0     0  104   162   0  1  99  0
Enable sysrq if not already done, by either setting the
"kernel.sysrq" value to 1 in /etc/sysctl.conf, or by
manually echo'ing a 1 into /proc/sys/kernel/sysrq.
Then during the time when the system is in this state,
give us the output of several Alt-Sysrq-m entries.
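For reference, the same steps can be performed entirely from a shell. This is a sketch: the /proc/sysrq-trigger interface may not be present on all 2.4 kernels, in which case Alt-Sysrq-m on the console is the way to go.

```shell
# Enable the magic SysRq key (equivalent to kernel.sysrq = 1
# in /etc/sysctl.conf)
echo 1 > /proc/sys/kernel/sysrq

# On kernels that provide /proc/sysrq-trigger, writing 'm' dumps
# memory information just like Alt-Sysrq-m on the console
echo m > /proc/sysrq-trigger

# The dump goes to the kernel ring buffer; capture it with dmesg
dmesg | tail -n 40
```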
Created attachments 97792, 97793, 97794, 97795, 97796, 97797 (Alt-Sysrq-m output)
As you saw, there actually is no I/O going on, but kswapd is
sleeping in an uninterruptible state, which counts as a load of 1.
Please try this:
# echo 100 > /proc/sys/vm/inactive_clean_percent
Yes, this seems to help.
What exactly does this do? I did not find any documentation about this
proc file ...
It basically restores a part of the page reclamation task
to mimic its behaviour in the RHEL3 initial release.
The original (pre-U1) behaviour was far more aggressive
in moving pages (writing out) from the inactive dirty list to the
inactive laundry list. However, it was found to be unacceptably
detrimental to database performance due to unnecessary I/O.
Basically, pages will be moved from the inactive dirty list
(written out) when the sum of the inactive laundry list and
inactive clean pages falls to less than 5 percent (the default
inactive_clean_percent value) of inactive dirty list pages.
In other words, it tries to keep the combined inactive laundry and
inactive clean page counts at 5% of the inactive dirty list count.
By increasing the inactive_clean_percent, pages will be written
out (moved from the inactive dirty list) more aggressively.
By setting it to 100 percent, it mimics the original behavior
which kept the combined count of inactive laundry and inactive-clean
pages equal to the number of pages on the inactive dirty list.
If desired, you could experiment with the percentage, i.e., it
isn't necessary to keep it at 5 or 100.
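The threshold logic described above can be sketched with a little shell arithmetic. The variable names and page counts here are purely illustrative, not actual kernel symbols:

```shell
# Hypothetical page counts (illustrative numbers, not from a real system)
inactive_dirty=10000
inactive_laundry=300
inactive_clean=100
inactive_clean_percent=5   # the U1 default

# kswapd's target: laundry + clean should total at least this many pages
target=$(( inactive_dirty * inactive_clean_percent / 100 ))

if [ $(( inactive_laundry + inactive_clean )) -lt "$target" ]; then
    echo "write out pages from the inactive dirty list"
else
    echo "enough laundered/clean pages; do nothing"
fi
```

With the default of 5, the target is 500 pages, so 400 laundry+clean pages triggers write-out; at 100, the target becomes 10000, mimicking the pre-U1 behaviour.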
That being said, there is a corner case problem with the new
scheme such that kswapd is fooled into running far more often, but
not doing anything. It normally sleeps for a second before
being woken up to check for the need to do some work. However,
in your case, it sees that the page count on one or more zones
is between "min" and "low", sleeps uninterruptibly for a very
short amount of time, wakes up to do some page write-outs, but
the inactive_clean_percent value of 5 keeps that from happening.
Hence it goes back to sleep again in the same manner. Because it
blocks in an uninterruptible state, it is counted as a load of 1.
We are looking into a better way of handling such cases.
But for now, the inactive_clean_percent tuneable is the best way
to deal with this.
We are planning to use 30 as the default inactive_clean_percent
value (instead of 5) in the next RHEL3 release.
If possible, could you please try setting inactive_clean_percent
to 30 (as opposed to 100)? This can be done at any time by echo'ing
30 into /proc/sys/vm/inactive_clean_percent. Then let us know
if you ever see kswapd get into the same "D" state when nothing
special is running, although we believe that should no longer happen.
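Applying and persisting the suggested value would look something like the following; the sysctl.conf line assumes the tunable follows the usual /proc/sys-to-sysctl naming convention:

```shell
# Apply immediately (requires root)
echo 30 > /proc/sys/vm/inactive_clean_percent

# Verify the new value
cat /proc/sys/vm/inactive_clean_percent

# Persist across reboots via /etc/sysctl.conf
# (assumed name, derived from the /proc/sys/vm path)
echo "vm.inactive_clean_percent = 30" >> /etc/sysctl.conf
```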
Just as an additional voice we have a small development database
server that is exhibiting this same issue. It's a dual processor
system with 4GB of memory (Dell 2550). The systems have limited CPU
load but experience significant memory and I/O pressure at times.
We had already figured out that setting inactive_clean_percent to 25
seemed to mostly eliminate the problem. After finding this note we
have increased the value to 30, and so far that seems to have
eliminated it as well.
Just thought another data point might be of interest.
Well, I now have a case where the problem occurs even with
inactive_clean_percent set to 30. We reconfigured the Dell 2550
servers to use hugetlb support and to keep the SGA locked in memory.
These systems have 4GB of memory and run two Oracle instances with
1GB SGA's each.
Leaving everything as normal, this system performs horribly (it always
performed great under RHEL 2.1) because the system is way too
swappy. After reconfiguring the kernel and Oracle to support hugetlb
and ramfs (even though this is not a VLM system), the performance is
great, since the system is not always trying to swap out the Oracle SGA.
Unfortunately, having 2GB of memory that is unswappable seems to
trigger the kswapd issue again. We ended up bumping the
inactive_clean_percent back to 100 to get rid of this.
The systems perform great when configured this way, so it's not really
a big deal, but I wanted to mention that it is possible to trigger
the kswapd issue even when it is set to 30. Would this be expected?
As it stands now, there is no inactive_clean_percent value that is
going to satisfy all implementations all the time. The use of 100%,
which mimics the way page reclamation worked in RHEL3 pre-U1, was
clearly unacceptable for large database instances; for your situation
it's desirable. That is why inactive_clean_percent was made a tuneable.
OK, like I said, no big issue, although changing the way something
works like that should certainly be mentioned in the release notes,
perhaps with a link to exactly what this tuneable does. Introducing
changes in default behaviour via an update, and without clear
notification and explanation, is very "Microsoftish".
An errata has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen
this bug report if the solution does not work for you.