Description of problem:
The kernel oom-killer kicks in and starts killing processes even though there is a lot of free swap space.

I have an FC3 i386 machine that I use as a staging server for my backups. I run rsync every night from several servers to this server. I also use lvm2 snapshots. After a couple of hours I get errors from rsync, and the staging server reports out of memory in the messages log (see below). There is a lot of swap free, yet it still runs out of memory.

My gut feeling is that this is a combination of using lvm2 snapshots (which make disk accesses slower) and extreme disk usage. Since the disks are heavily loaded and accesses are even slower with lvm2 snapshots, flushing dirty pages takes a long time, perhaps long enough for the oom-killer to conclude the system is out of memory and start killing.

Version-Release number of selected component (if applicable):
2.6.10-1.766_FC3smp (and also rsync-2.6.3, but it does not matter)

How reproducible:
Every time, but never at exactly the same time/point. I think it occurs more often when the disk system is more heavily loaded (for example, when I am simultaneously scanning for viruses).

Steps to Reproduce:
1. Start rsync of several hundred megabytes from multiple clients (a sketch of the kind of invocation is given after the log below).

Actual results:
After a couple of hours rsync reports errors on the client, and the server (the one I have problems with) logs that the oom-killer has started.

Expected results:

Additional info:
Feb 21 01:31:43 xellb1 kernel: oom-killer: gfp_mask=0xd2
Feb 21 01:31:44 xellb1 kernel: Mem-info:
Feb 21 01:31:44 xellb1 kernel: DMA per-cpu:
Feb 21 01:31:44 xellb1 kernel: cpu 0 hot: low 2, high 6, batch 1
Feb 21 01:31:44 xellb1 kernel: cpu 0 cold: low 0, high 2, batch 1
Feb 21 01:31:44 xellb1 kernel: cpu 1 hot: low 2, high 6, batch 1
Feb 21 01:31:44 xellb1 kernel: cpu 1 cold: low 0, high 2, batch 1
Feb 21 01:31:44 xellb1 kernel: Normal per-cpu:
Feb 21 01:31:44 xellb1 kernel: cpu 0 hot: low 32, high 96, batch 16
Feb 21 01:31:44 xellb1 kernel: cpu 0 cold: low 0, high 32, batch 16
Feb 21 01:31:44 xellb1 kernel: cpu 1 hot: low 32, high 96, batch 16
Feb 21 01:31:44 xellb1 kernel: cpu 1 cold: low 0, high 32, batch 16
Feb 21 01:31:44 xellb1 kernel: HighMem per-cpu: empty
Feb 21 01:31:45 xellb1 kernel:
Feb 21 01:31:51 xellb1 kernel: Free pages: 3408kB (0kB HighMem)
Feb 21 01:31:54 xellb1 kernel: Active:1119 inactive:46737 dirty:3664 writeback:41995 unstable:0 free:852 slab:76717 mapped:1323 pagetables:870
Feb 21 01:31:56 xellb1 kernel: DMA free:88kB min:88kB low:108kB high:132kB active:68kB inactive:9872kB present:16384kB pages_scanned:102 all_unreclaimable? no
Feb 21 01:31:57 xellb1 kernel: protections[]: 0 0 0
Feb 21 01:31:58 xellb1 kernel: Normal free:3320kB min:2800kB low:3500kB high:4200kB active:4408kB inactive:177204kB present:507648kB pages_scanned:921 all_unreclaimable? no
Feb 21 01:31:59 xellb1 kernel: protections[]: 0 0 0
Feb 21 01:32:01 xellb1 kernel: HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Feb 21 01:32:03 xellb1 kernel: protections[]: 0 0 0
Feb 21 01:32:04 xellb1 kernel: DMA: 0*4kB 3*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 88kB
Feb 21 01:32:05 xellb1 kernel: Normal: 140*4kB 5*8kB 40*16kB 3*32kB 1*64kB 15*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3320kB
Feb 21 01:32:06 xellb1 kernel: HighMem: empty
Feb 21 01:32:07 xellb1 kernel: Swap cache: add 239734, delete 239282, find 1176768/1203329, race 2+24
Feb 21 01:32:09 xellb1 kernel: Free swap: 910360kB
Feb 21 01:32:10 xellb1 kernel: 131008 pages of RAM
Feb 21 01:32:11 xellb1 kernel: 0 pages of HIGHMEM
Feb 21 01:32:12 xellb1 kernel: 2376 reserved pages
Feb 21 01:32:13 xellb1 kernel: 50693 pages shared
Feb 21 01:32:14 xellb1 kernel: 452 pages swap cached
Feb 21 01:32:16 xellb1 kernel: Out of Memory: Killed process 23734 (sendmail).
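For reference, a minimal sketch of the kind of nightly transfer that triggers this, using hypothetical source and destination paths (the real job pushes several hundred megabytes from each of several clients at once):

  # Run on each client machine at roughly the same time;
  # /data/ and the staging path are hypothetical examples.
  # -a preserves attributes, -z compresses over the wire.
  rsync -az /data/ xellb1:/backup/staging/client1/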
Can you try the kernel at http://people.redhat.com/davej/kernels/Fedora/FC3/? It has some leaks fixed, which could explain this problem.
The 770 kernel seems to have done the trick. The oom-killer no longer starts. I will let it run over the weekend to make sure I do not get the error any more. Thanks for your good work.
Nope, still out of memory with the 770 kernel. See below for a new extract from my messages file. One thing I forgot to mention earlier, and it is what makes me think the disk accesses are too slow and thereby cause the oom-killer to kick in: when I run rsync with a bandwidth limit (half the network speed, i.e. 50 Mbit/s), the crashes do not occur. A sketch of such an invocation follows after the log.

New messages log:
Mar 1 02:36:17 xellb1 kernel: oom-killer: gfp_mask=0x1d2
Mar 1 02:36:17 xellb1 kernel: Mem-info:
Mar 1 02:36:17 xellb1 kernel: DMA per-cpu:
Mar 1 02:36:17 xellb1 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 1 02:36:17 xellb1 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 1 02:36:17 xellb1 kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 1 02:36:17 xellb1 kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 1 02:36:17 xellb1 kernel: Normal per-cpu:
Mar 1 02:36:17 xellb1 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 1 02:36:17 xellb1 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 1 02:36:17 xellb1 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 1 02:36:17 xellb1 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 1 02:36:17 xellb1 kernel: HighMem per-cpu: empty
Mar 1 02:36:17 xellb1 kernel:
Mar 1 02:36:17 xellb1 kernel: Free pages: 4068kB (0kB HighMem)
Mar 1 02:36:17 xellb1 kernel: Active:3243 inactive:34969 dirty:1361 writeback:33479 unstable:0 free:1017 slab:86767 mapped:0 pagetables:433
Mar 1 02:36:17 xellb1 kernel: DMA free:92kB min:88kB low:108kB high:132kB active:320kB inactive:8888kB present:16384kB pages_scanned:739 all_unreclaimable? no
Mar 1 02:36:17 xellb1 kernel: protections[]: 0 0 0
Mar 1 02:36:17 xellb1 kernel: Normal free:3976kB min:2800kB low:3500kB high:4200kB active:12652kB inactive:130988kB present:507648kB pages_scanned:7106 all_unreclaimable? no
Mar 1 02:36:17 xellb1 kernel: protections[]: 0 0 0
Mar 1 02:36:17 xellb1 kernel: HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 02:36:17 xellb1 kernel: protections[]: 0 0 0
Mar 1 02:36:17 xellb1 kernel: DMA: 1*4kB 3*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 92kB
Mar 1 02:36:17 xellb1 kernel: Normal: 236*4kB 31*8kB 48*16kB 35*32kB 4*64kB 1*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3976kB
Mar 1 02:36:17 xellb1 kernel: HighMem: empty
Mar 1 02:36:17 xellb1 kernel: Swap cache: add 358711, delete 358670, find 1369042/1420012, race 0+29
Mar 1 02:36:17 xellb1 kernel: Free swap: 924448kB
Mar 1 02:36:17 xellb1 kernel: 131008 pages of RAM
Mar 1 02:36:17 xellb1 kernel: 0 pages of HIGHMEM
Mar 1 02:36:17 xellb1 kernel: 2376 reserved pages
Mar 1 02:36:17 xellb1 kernel: 37546 pages shared
Mar 1 02:36:17 xellb1 kernel: 41 pages swap cached
Mar 1 02:36:17 xellb1 kernel: Out of Memory: Killed process 31617 (rsync).
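For reference, a minimal sketch of the rate-limited run that avoids the crashes, again with hypothetical paths. rsync's --bwlimit option takes kilobytes per second, so 50 Mbit/s works out to roughly 6250 KB/s:

  # Same transfer as before, capped at about half of a 100 Mbit link
  # (50 Mbit/s / 8 bits per byte = ~6250 KB/s; paths are examples).
  rsync -az --bwlimit=6250 /data/ xellb1:/backup/staging/client1/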
More symptoms: Today I was able to reproduce the problem much more easily. My problem does seem to be related to the lvm2 snapshots I am using. I have a 1 TB volume and a snapshot with about 150 GB of space. When I did this test, the snapshot volume was 97% full, and I tried to overwrite an old file with about 400 MB of new data (thus causing the snapshot volume to copy the original data). My shell was then killed by the oom-killer. When I logged in again, the kcopyd daemon was running heavily. This happens every time I try the overwrite. When I removed the almost-full snapshot, created a fresh new one, and did the overwrite operation again, it worked fine. A little note, referring back to my idea that it is the snapshots that make the disk accesses so slow that the oom-killer kicks in: writing a 400 MB file to a volume that has a snapshot can take more than 3 minutes, while writing the same file when no snapshot is active takes 6 seconds. The time is not an issue for me since this is just a staging server, but the oom-killing really hurts. A sketch of how I check snapshot usage and recreate the snapshot follows. /Lars
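For reference, a minimal sketch of the check-and-recreate procedure, assuming hypothetical volume group and volume names (vg0, bigvol, bigsnap); the lvm2 commands themselves are the standard ones:

  # Show how full the snapshot's copy-on-write area is
  # (usage appears in the Snap% column of lvs on this lvm2 version).
  lvs vg0

  # Replace the nearly full snapshot with a fresh one that has
  # 150 GB of copy-on-write space (names and size are examples).
  lvremove -f /dev/vg0/bigsnap
  lvcreate -s -L 150G -n bigsnap /dev/vg0/bigvol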
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.