135312 – kswapd has trouble with high I/O

Summary: kswapd has trouble with high I/O

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	athlon
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	FC3Target

Reported:	2004-10-11 21:05 UTC by eruin
Modified:	2015-01-04 22:10 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-10-29 23:54:14 UTC
Type:	---
Embargoed:
Dependent Products:

Description eruin 2004-10-11 21:05:49 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041008
Firefox/0.10.1

Description of problem:
When running an I/O-intensive process, kswapd starts to constantly use
a lot of CPU (50-60% according to top), causing all other processes to
grind to a halt.

Version-Release number of selected component (if applicable):
2.6.8-1.603

How reproducible:
Always

Steps to Reproduce:
1. Run gdm -> gnome (minimal setup)
2. Open terminal
3. Use cedega to start BfVietnam.exe

(Or basically any other process that eats RAM like it was candy)
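
For anyone without cedega at hand, the "eats RAM" part of the steps above could be approximated with a small allocator like the sketch below. This is only an assumed stand-in and has not been confirmed to trigger the kswapd behaviour; the function name eat_memory, the 4096-byte page size and the chunk sizes are all hypothetical choices. The example call holds only 8 MiB, so the total would need to be raised to exceed physical RAM on a test machine.

```python
import time

PAGE_SIZE = 4096  # assumed page size; not taken from the report


def eat_memory(total_mib, chunk_mib=16, pause_s=0.0):
    """Allocate total_mib MiB in chunks and touch every page.

    Returns the list of chunks so the caller keeps the memory alive.
    """
    chunks = []
    allocated = 0
    while allocated < total_mib:
        size_mib = min(chunk_mib, total_mib - allocated)
        chunk = bytearray(size_mib * 1024 * 1024)
        # Write one byte per page so the kernel has to back it with real memory.
        for offset in range(0, len(chunk), PAGE_SIZE):
            chunk[offset] = 1
        chunks.append(chunk)
        allocated += size_mib
        if pause_s:
            time.sleep(pause_s)
    return chunks


if __name__ == "__main__":
    # Deliberately small; raise total_mib only on a machine you can reboot.
    held = eat_memory(total_mib=8, chunk_mib=4)
    print(len(held), "chunks held")
```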

Actual Results:  Processes other than kswapd die or freeze.

Unless the offending I/O-intensive process is killed completely, the
system will eventually stop responding and a hard reboot is required.

Expected Results:  kswapd should complete whatever it was doing and
let the rest of the system continue as normal. 

Additional info:

riel> it's a known bug from the upstream kernel, fixed in akpm's
latest pile of patches

Comment 1 eruin 2004-10-12 17:40:57 UTC

Tested using kernel-2.6.8-1.607 now. As far as I can tell this is
supposed to be based on
http://kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.9-rc4
and the patches to kswapd included there. Still experiencing the same
behaviour. System dies.

Comment 2 Anssi Johansson 2004-10-22 22:18:49 UTC

Bug #126234 looks just like this bug.

Comment 3 Anssi Johansson 2004-10-23 08:53:34 UTC

I'm having problems which might be related to this bug. I mentioned
this first in bug #131251 comment #34:

--clip--
Now that I finally got my memory sticks I tried upgrading the memory
to 4GB. And boy did I open another can of worms by doing that. After
booting up I ran "free" and saw that I had 4GB of usable memory in
Linux. I loaded the [680MB] SQL dump file with nano and did the usual
text search trick. It worked fine, but that's probably because the
system had enough RAM this time (one text editor instance uses only
about 1GB of memory). I loaded another copy in another session, it
loaded nicely as well. I started loading the third file when the
system seemed to get stuck, nano just stopped at "Reading File". On
another channel I tried running "free" to see if the system was
already swapping at this point (it shouldn't). Surprise surprise, the
"free" command itself got stuck, it refused to output anything and
also refused to return back to command prompt. Running "ls" and
"vmstat 2" on other channels resulted in the same, nothing was printed
on the screen and those commands didn't return to the command prompt.
The editors on channels 1 and 2 were responsive and worked, until I
tried to move to the next page. This move made both of them get stuck.
This happened with 2.6.9 and elevator=as. 
--clip--

Another experience, this time with 2.6.9-bk6 with the grab_swap_token
function call commented out to work around the OOM issue in bug
#131251. I started "vmstat 2" on one channel, and managed to load 3
instances of nano on three different channels. The fourth got stuck in
Reading File phase. Meanwhile the vmstat command on another channel
kept on running with the following information:

r  b  swpd   free  buff  cache   si  so  bi  bo  in    cs   us  sy  id  wa
0  6  23324  5780  1112  290616  0   0   0   0   1001  4-6  0   0   0   100

I left the computer running for a few hours, during which the free and
cache values changed only very slightly. I had an open bash prompt on
one channel and tried running "free" on it, but it got stuck as in my
previous experiment. The nano sessions also got stuck once I tried
moving to the next page of text.

To make the FC developers happy I tried the Fedora Core version of the
kernel, build 640. This time I had "top" running on one channel and
"vmstat" on another. The vmstat values before trying to start nano:

r  b  swpd  free     buff   cache  si  so  bi  bo  in         cs    us  sy  id   wa
0  0  0     3906088  14708  44488  0   0   0   0   1001-1003  8-14  0   0   100  0

Three instances of nano loaded fine, the fourth stopped again in the
Reading File phase. vmstat kept running, with these values:

r  b  swpd   free  buff   cache   si  so  bi  bo  in    cs   us  sy   id  wa
1  2  23360  6956  12548  679088  0   0   0   0   1001  6-9  0   100  0   0

Notice that the 100 value was now indeed in the System (sy) column,
unlike in 2.6.9-bk6 where the 100 value was in the Wait (wa) column. I
don't know if this was just a coincidence or repeatable behaviour. The
"top" session which I had running on another channel got stuck, the
screen no longer got updated. On that frozen screen I can see that
kswapd0 uses 66.9% of CPU (state:R) and nano uses the remaining 33.0%
(state:D).
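
The sy-versus-wa distinction noted above can be checked mechanically. The following is a minimal sketch that parses one vmstat sample row and reports which CPU column dominates; the two rows are the values quoted in this comment, while the helper names parse_vmstat_row and dominant_cpu_state are made up for illustration. Ranges such as "6-9" are kept as strings because they were reported as ranges here.

```python
VMSTAT_HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa"


def parse_vmstat_row(row, header=VMSTAT_HEADER):
    """Map one whitespace-separated vmstat row onto the column names."""
    names = header.split()
    fields = row.split()
    if len(fields) != len(names):
        raise ValueError("expected %d fields, got %d" % (len(names), len(fields)))
    sample = {}
    for name, field in zip(names, fields):
        # Plain numbers become ints; anything else (e.g. "6-9") stays a string.
        sample[name] = int(field) if field.isdigit() else field
    return sample


def dominant_cpu_state(sample):
    """Return whichever of us/sy/id/wa has the largest percentage."""
    cpu = {key: sample[key] for key in ("us", "sy", "id", "wa")}
    return max(cpu, key=cpu.get)


# Rows as reported above: Fedora build 640, then 2.6.9-bk6.
fc640_row = "1 2 23360 6956 12548 679088 0 0 0 0 1001 6-9 0 100 0 0"
bk6_row = "0 6 23324 5780 1112 290616 0 0 0 0 1001 4-6 0 0 0 100"

print(dominant_cpu_state(parse_vmstat_row(fc640_row)))  # sy
print(dominant_cpu_state(parse_vmstat_row(bk6_row)))    # wa
```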

I'm running the latest development branch of Fedora on an Abit AV8
motherboard, with AMD Athlon 64 3500+ and 4GB of memory. No X running
on this server. The hard disk is a parallel ATA Seagate 120GB.

Comment 4 eruin 2004-10-24 14:23:47 UTC

Also testing .640 here. I'm having mixed results, and will report back
when I've done more extensive testing...

Comment 5 Anssi Johansson 2004-10-25 20:18:51 UTC

I tried to reproduce the problem on a vanilla 2.6.10-rc1 kernel.
I wasn't able to lock up the system, but there's still something weird
going on in kswapd. I managed to load 4 copies of my text file, and
after that "top" shows that kswapd0 is using 99% of CPU, although
nothing visible seems to be happening. According to vmstat, no
swapping is currently being done (si=so=0).
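
One way to double-check the si=so=0 observation without vmstat is to compare two snapshots of /proc/vmstat, whose pswpin and pswpout counters count pages swapped in and out. The sketch below assumes that approach; the helper names are hypothetical and the snapshot numbers are made up for illustration, not taken from this report.

```python
def parse_proc_vmstat(text):
    """Parse "name value" lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two parsed snapshots."""
    return {name: after[name] - before[name] for name in ("pswpin", "pswpout")}


def is_swapping(before, after):
    """True if any page was swapped in or out between the snapshots."""
    return any(value > 0 for value in swap_delta(before, after).values())


# Illustrative snapshots (made-up numbers). On a live system each would be
# open("/proc/vmstat").read(), taken a couple of seconds apart.
snap_a = "pgpgin 1200\npswpin 10\npswpout 8289\n"
snap_b = "pgpgin 1350\npswpin 10\npswpout 8289\n"

print(is_swapping(parse_proc_vmstat(snap_a), parse_proc_vmstat(snap_b)))  # False
```

If this reports no swap traffic while kswapd0 is still at the top of the CPU list, that would match the behaviour described here.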

During the loading of those files kswapd seemed to be acting sanely,
but once the kernel had to touch swap kswapd went crazy. Currently
there's 33156 KB of used swap space, 7152 KB of free memory, 2564 KB
of buffers and 97080 KB of cache. The system worked fine at this
point, even though kswapd was consuming lots of CPU time.

Addendum: When I tried copying that 680MB test file to another name
for further testing, the system locked up solidly as in my previous
experiments. The cp process never finished. The frozen "top" screen
shows that kswapd is using 99% of CPU. vmstat keeps running; this time
the values are 33156 KB for swap, 89852 KB free, 3324 KB buffers,
12804 KB cache. The editor on the first channel froze when I tried to
move to the next page. I tried exiting the remaining 3 nano text
editors, but they froze too. At this point the vmstat command that was
still running on another channel displayed 6 processes being in
uninterruptible sleep (vmstat column "b").
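
The count of blocked processes (vmstat column "b") can also be cross-checked by reading the state field of each /proc/<pid>/stat file. This is a rough sketch under that assumption; the function names are hypothetical and the sample stat lines are shortened, invented examples rather than data from this system.

```python
import glob
from collections import Counter


def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name may contain spaces or parentheses, so split on the
    # last ")" and take the first field after it.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def count_states(stat_lines):
    """Count processes per state letter (R, S, D, ...)."""
    return Counter(state_from_stat(line) for line in stat_lines)


def read_live_stat_lines():
    """Collect stat lines from a running Linux system, skipping races."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                lines.append(handle.read())
        except OSError:
            pass  # the process exited between glob and open
    return lines


# Shortened, invented sample lines for illustration only.
sample_lines = [
    "40 (kswapd0) R 1 1 1 0 -1 0",
    "2301 (nano) D 2290 2301 2290 34816 2301 0",
    "2302 (cp) D 2291 2302 2291 34817 2302 0",
]

counts = count_states(sample_lines)
print(counts["D"], "process(es) in uninterruptible sleep")
```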

Is there something useful that I could test?

Comment 6 eruin 2004-10-27 04:38:13 UTC

Still the same with .640 and .643: kswapd goes crazy as soon as swap
is touched. It settles (i.e. down to about 6% CPU usage) when switching
to a text terminal, and goes back up to eating CPU (~60% constant) when
switching back to X.

Don't know if there's any point in this, but I'm going to test with
the initial fc3t2 kernel for reference.

Comment 7 eruin 2004-10-27 04:40:23 UTC

I failed to mention I am no longer able to kill the system this way.
Kswapd now seems to only affect the performance of said I/O-intensive
task...

Comment 8 eruin 2004-10-30 23:42:52 UTC

Thanks, I can confirm it's sorted in the .649
linux-2.6.9-vm-tame-oomkiller.patch
