Red Hat Bugzilla – Bug 115438
strange load - kswapd/IO ?
Last modified: 2007-11-30 17:07:00 EST
Description of problem:
On one of our servers we often experience a load of 1 even when
nothing special is running.
vmstat 1 says:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy  id wa
 0  1 212736  10196 111044 558868    0    0     0     0  110   171   0  0 100  0
 0  1 212736  10196 111044 558868    0    0     0     0  106   166   0  0 100  0
 0  1 212736  10196 111044 558868    0    0     0     0  105   162   0  0 100  0
So, for me, this seems to be an IO problem.
Checking the ps output I found that kswapd is in an uninterruptable
sleep (the ps man page says that this is usually caused by IO):
$ ps lax | grep kswap
1     0     4     1  15   0     0     0 schedu DW   ?        0:01 kswapd
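Tasks in uninterruptible (D) sleep can also be listed system-wide; on Linux each such task counts toward the load average even while using no CPU. This is a generic ps invocation, nothing specific to this machine:

```shell
# Show every process currently in uninterruptible sleep; each one
# contributes 1 to the load average even with zero CPU use.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print $2, $3 }'

# The 1/5/15-minute load averages themselves:
cat /proc/loadavg
```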
This high load is not reproducible. There are times when everything
seems to work fine.
We have some other RHEL 3 AS servers, too. This problem only occurs
on this machine. The only difference I can see is that on this machine
we use the uniprocessor kernel (whereas all our other servers run
the SMP kernel).
I think we can rule out a hardware problem: we migrated the whole
setup to another server (created partitions, rsync etc.), but the
problem persisted there as well.
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux AS 3 (Taroon Update 1)
1 GB memory
1.5 GB swap
1 1GHz PIII CPU
Sorry, forgot one thing:
"not reproduceable" means that we cannot say when (or why) this
happens. However, it happens repeatedly (the load is 1 for maybe
50-60% of the time).
Could you please add some vmstat output to this bug report?
Is the kernel busy swapping things out?
Are they being swapped in again, too?
No, there is no swap in or swap out.
Some more vmstat lines:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy  id wa
 0  1 217460   9000 112752 554064    0    0     0     0  106   168   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  128   194   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  105   166   0  0 100  0
 0  1 217460   8988 112752 554068    0    0     0     0  106   163   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0   240  198   198   0  0  81 19
 0  1 217460   8984 112752 554076    0    0     0     0  108   167   0  0 100  0
 0  1 217460   8988 112752 554076    0    0     0     0  105   163   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0     0  105   164   0  0 100  0
 0  1 217460   8984 112752 554076    0    0     0     0  104   162   0  1  99  0
Enable sysrq if not already done, by either setting the
"kernel.sysrq" value to 1 in /etc/sysctl.conf, or by
manually echo'ing a 1 into /proc/sys/kernel/sysrq.
Then during the time when the system is in this state,
give us the output of several Alt-Sysrq-m entries.
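For reference, the same steps can be performed entirely from a shell. This is a sketch: the /proc/sysrq-trigger interface may not be present on all 2.4 kernels, in which case Alt-Sysrq-m on the console is the way to go.

```shell
# Enable the magic SysRq key (equivalent to kernel.sysrq = 1
# in /etc/sysctl.conf)
echo 1 > /proc/sys/kernel/sysrq

# On kernels that provide /proc/sysrq-trigger, writing 'm' dumps
# memory information just like Alt-Sysrq-m on the console
echo m > /proc/sysrq-trigger

# The dump goes to the kernel ring buffer; capture it with dmesg
dmesg | tail -n 40
```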
Created attachments 97792, 97793, 97794, 97795, 97796, 97797 (Alt-Sysrq-m output)
As you saw, there actually is no I/O going on, but kswapd is
sleeping in an uninterruptible state, which counts as a load of 1.
Please try this:
# echo 100 > /proc/sys/vm/inactive_clean_percent
Yes, this seems to help.
What exactly does this do? I did not find any documentation about this
proc file ...
It basically restores a part of the page reclamation task
to mimic its behaviour in the RHEL3 initial release.
The original (pre-U1) behaviour was far more aggressive
in moving pages (writing out) from the inactive dirty list to the
inactive laundry list. However, it was found to be unacceptably
detrimental to database performance due to unnecessary I/O.
Basically, pages will be moved from the inactive dirty list
(written out) when the sum of the inactive laundry list and
inactive clean pages falls to less than 5 percent (the default
inactive_clean_percent value) of inactive dirty list pages.
In other words, it tries to keep the combined inactive laundry and
inactive clean page counts at 5% of the inactive dirty list count.
By increasing the inactive_clean_percent, pages will be written
out (moved from the inactive dirty list) more aggressively.
By setting it to 100 percent, it mimics the original behavior
which kept the combined count of inactive laundry and inactive-clean
pages equal to the number of pages on the inactive dirty list.
If desired, you could experiment with the percentage, i.e., it
isn't necessary to keep it at 5 or 100.
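The threshold logic described above can be sketched with a little shell arithmetic. The variable names and page counts here are purely illustrative, not actual kernel symbols:

```shell
# Hypothetical page counts (illustrative numbers, not from a real system)
inactive_dirty=10000
inactive_laundry=300
inactive_clean=100
inactive_clean_percent=5   # the U1 default

# kswapd's target: laundry + clean should total at least this many pages
target=$(( inactive_dirty * inactive_clean_percent / 100 ))

if [ $(( inactive_laundry + inactive_clean )) -lt "$target" ]; then
    echo "write out pages from the inactive dirty list"
else
    echo "enough laundered/clean pages; do nothing"
fi
```

With the default of 5, the target is 500 pages, so 400 laundry+clean pages triggers write-out; at 100, the target becomes 10000, mimicking the pre-U1 behaviour.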
That being said, there is a corner case problem with the new
scheme such that kswapd is fooled into running far more often, but
not doing anything. It normally sleeps for a second before
being woken up to check for the need to do some work. However,
in your case, it sees that the page count on one or more zones
is between "min" and "low", sleeps uninterruptibly for a very
short amount of time, wakes up to do some page write-outs, but
the inactive_clean_percent value of 5 keeps that from happening.
Hence it goes back to sleep again in the same manner. Because it
blocks in an uninterruptible state, it is counted as a load of 1.
We are looking into a better way of handling such cases.
But for now, the inactive_clean_percent tuneable is the best way
to deal with this.
We are planning to use 30 as the default inactive_clean_percent
value (instead of 5) in the next RHEL3 release.
If possible, could you please try setting inactive_clean_percent
to 30 (as opposed to 100)? This can be done at any time by echo'ing
30 into /proc/sys/vm/inactive_clean_percent. Then let us know
if you ever see kswapd get into the same "D" state when nothing
special is running, although we believe that should no longer happen.
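Applying and persisting the suggested value would look something like the following; the sysctl.conf line assumes the tunable follows the usual /proc/sys-to-sysctl naming convention:

```shell
# Apply immediately (requires root)
echo 30 > /proc/sys/vm/inactive_clean_percent

# Verify the new value
cat /proc/sys/vm/inactive_clean_percent

# Persist across reboots via /etc/sysctl.conf
# (assumed name, derived from the /proc/sys/vm path)
echo "vm.inactive_clean_percent = 30" >> /etc/sysctl.conf
```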
Just as an additional voice we have a small development database
server that is exhibiting this same issue. It's a dual processor
system with 4GB of memory (Dell 2550). The systems have limited CPU
load but experience significant memory and I/O pressure at times.
We had already figured out that setting inactive_clean_percent to 25
seemed to mostly eliminate the problem. After finding this note we
have increased the value to 30, and so far that seems to have
eliminated it as well.
Just thought another data point might be of interest.
Well, I now have a case where the problem occurs even with
inactive_clean_percent set to 30. We reconfigured the Dell 2550
servers to use hugetlb support and to keep the SGA locked in memory.
These systems have 4GB of memory and run two Oracle instances with
1GB SGA's each.
Leaving everything as normal, this system performs horribly (it always
performed great under RHEL 2.1) because the system is way too
swappy. After reconfiguring the kernel and Oracle to support hugetlb
and ramfs (even though this is not a VLM system), the performance is
great, since the system is not always trying to swap out the Oracle SGA.
Unfortunately, having 2GB of memory that is unswappable seems to
trigger the kswapd issue again. We ended up bumping the
inactive_clean_percent back to 100 to get rid of this.
The systems perform great when configured this way, so it's not really
a big deal, but I wanted to mention that it is possible to trigger
the kswapd issue even when it is set to 30. Would this be expected?
As it stands now, there is no inactive_clean_percent value that is
going to satisfy all implementations all the time. The use of 100%,
which mimics the way page reclamation worked in RHEL3 pre-U1, was
clearly unacceptable for large database instances; for your situation
it's desirable. That is why inactive_clean_percent was made a tuneable.
OK, like I said, no big issue, although changing the way something
works like that should certainly be mentioned in the release notes,
perhaps with a link to exactly what this tuneable does. Introducing
changes in default behaviour via an update, and without clear
notification and explanation, is very "Microsoftish".
An errata has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen
this bug report if the solution does not work for you.