Description of problem:
The kernel oom-killer kicks in and starts killing processes even though there is a lot of free swap space.

I have an FC3 i386 machine that I use as a staging server for my backups. I run rsync every night from several servers to this server. I also use lvm2 snapshots. After a couple of hours I get errors from rsync, and the staging server reports out of memory in the messages log (see below). There is a lot of swap free, yet it still runs out of memory.

My gut feeling is that this is a combination of using lvm2 snapshots (which make disk accesses slower) and extreme disk usage. Since the disks are heavily loaded and accesses are even slower with lvm2 snapshots, flushing dirty pages takes a long time, perhaps long enough for the oom-killer to conclude the system is out of memory and start killing.

Version-Release number of selected component (if applicable):
2.6.10-1.766_FC3smp (and also rsync-2.6.3, but it does not matter)

How reproducible:
Every time, but never at exactly the same time/point. I think it occurs more often when the disk system is more heavily loaded (for example, when I am simultaneously scanning for viruses).

Steps to Reproduce:
1. Start rsync of several hundred megabytes from multiple clients (a sketch of the kind of invocation is given after the log below).

Actual results:
After a couple of hours rsync reports errors on the client, and the server (the one I have problems with) logs that the oom-killer has started.

Expected results:

Additional info:
Feb 21 01:31:43 xellb1 kernel: oom-killer: gfp_mask=0xd2
Feb 21 01:31:44 xellb1 kernel: Mem-info:
Feb 21 01:31:44 xellb1 kernel: DMA per-cpu:
Feb 21 01:31:44 xellb1 kernel: cpu 0 hot: low 2, high 6, batch 1
Feb 21 01:31:44 xellb1 kernel: cpu 0 cold: low 0, high 2, batch 1
Feb 21 01:31:44 xellb1 kernel: cpu 1 hot: low 2, high 6, batch 1
Feb 21 01:31:44 xellb1 kernel: cpu 1 cold: low 0, high 2, batch 1
Feb 21 01:31:44 xellb1 kernel: Normal per-cpu:
Feb 21 01:31:44 xellb1 kernel: cpu 0 hot: low 32, high 96, batch 16
Feb 21 01:31:44 xellb1 kernel: cpu 0 cold: low 0, high 32, batch 16
Feb 21 01:31:44 xellb1 kernel: cpu 1 hot: low 32, high 96, batch 16
Feb 21 01:31:44 xellb1 kernel: cpu 1 cold: low 0, high 32, batch 16
Feb 21 01:31:44 xellb1 kernel: HighMem per-cpu: empty
Feb 21 01:31:45 xellb1 kernel:
Feb 21 01:31:51 xellb1 kernel: Free pages: 3408kB (0kB HighMem)
Feb 21 01:31:54 xellb1 kernel: Active:1119 inactive:46737 dirty:3664 writeback:41995 unstable:0 free:852 slab:76717 mapped:1323 pagetables:870
Feb 21 01:31:56 xellb1 kernel: DMA free:88kB min:88kB low:108kB high:132kB active:68kB inactive:9872kB present:16384kB pages_scanned:102 all_unreclaimable? no
Feb 21 01:31:57 xellb1 kernel: protections[]: 0 0 0
Feb 21 01:31:58 xellb1 kernel: Normal free:3320kB min:2800kB low:3500kB high:4200kB active:4408kB inactive:177204kB present:507648kB pages_scanned:921 all_unreclaimable? no
Feb 21 01:31:59 xellb1 kernel: protections[]: 0 0 0
Feb 21 01:32:01 xellb1 kernel: HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Feb 21 01:32:03 xellb1 kernel: protections[]: 0 0 0
Feb 21 01:32:04 xellb1 kernel: DMA: 0*4kB 3*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 88kB
Feb 21 01:32:05 xellb1 kernel: Normal: 140*4kB 5*8kB 40*16kB 3*32kB 1*64kB 15*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3320kB
Feb 21 01:32:06 xellb1 kernel: HighMem: empty
Feb 21 01:32:07 xellb1 kernel: Swap cache: add 239734, delete 239282, find 1176768/1203329, race 2+24
Feb 21 01:32:09 xellb1 kernel: Free swap: 910360kB
Feb 21 01:32:10 xellb1 kernel: 131008 pages of RAM
Feb 21 01:32:11 xellb1 kernel: 0 pages of HIGHMEM
Feb 21 01:32:12 xellb1 kernel: 2376 reserved pages
Feb 21 01:32:13 xellb1 kernel: 50693 pages shared
Feb 21 01:32:14 xellb1 kernel: 452 pages swap cached
Feb 21 01:32:16 xellb1 kernel: Out of Memory: Killed process 23734 (sendmail).
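For reference, a minimal sketch of the kind of nightly transfer that triggers this, using hypothetical source and destination paths (the real job pushes several hundred megabytes from each of several clients at once):

  # Run on each client machine at roughly the same time;
  # /data/ and the staging path are hypothetical examples.
  # -a preserves attributes, -z compresses over the wire.
  rsync -az /data/ xellb1:/backup/staging/client1/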
Can you try the kernel at http://people.redhat.com/davej/kernels/Fedora/FC3/? It has some leaks fixed, which could explain this problem.
The 770 kernel seems to have done the trick. The oom-killer no longer starts. I will let it run over the weekend to make sure I do not get the error any more. Thanks for your good work.
Nope, still out of memory with the 770 kernel. See below for a new extract from my messages file. One thing I forgot to mention earlier, and it is what makes me think the disk accesses are too slow and thereby cause the oom-killer to kick in: when I run rsync with a bandwidth limit (half the network speed, i.e. 50 Mbit/s), the crashes do not occur. A sketch of such an invocation follows after the log.

New messages log:
Mar 1 02:36:17 xellb1 kernel: oom-killer: gfp_mask=0x1d2
Mar 1 02:36:17 xellb1 kernel: Mem-info:
Mar 1 02:36:17 xellb1 kernel: DMA per-cpu:
Mar 1 02:36:17 xellb1 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 1 02:36:17 xellb1 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 1 02:36:17 xellb1 kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 1 02:36:17 xellb1 kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 1 02:36:17 xellb1 kernel: Normal per-cpu:
Mar 1 02:36:17 xellb1 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 1 02:36:17 xellb1 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 1 02:36:17 xellb1 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 1 02:36:17 xellb1 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 1 02:36:17 xellb1 kernel: HighMem per-cpu: empty
Mar 1 02:36:17 xellb1 kernel:
Mar 1 02:36:17 xellb1 kernel: Free pages: 4068kB (0kB HighMem)
Mar 1 02:36:17 xellb1 kernel: Active:3243 inactive:34969 dirty:1361 writeback:33479 unstable:0 free:1017 slab:86767 mapped:0 pagetables:433
Mar 1 02:36:17 xellb1 kernel: DMA free:92kB min:88kB low:108kB high:132kB active:320kB inactive:8888kB present:16384kB pages_scanned:739 all_unreclaimable? no
Mar 1 02:36:17 xellb1 kernel: protections[]: 0 0 0
Mar 1 02:36:17 xellb1 kernel: Normal free:3976kB min:2800kB low:3500kB high:4200kB active:12652kB inactive:130988kB present:507648kB pages_scanned:7106 all_unreclaimable? no
Mar 1 02:36:17 xellb1 kernel: protections[]: 0 0 0
Mar 1 02:36:17 xellb1 kernel: HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 02:36:17 xellb1 kernel: protections[]: 0 0 0
Mar 1 02:36:17 xellb1 kernel: DMA: 1*4kB 3*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 92kB
Mar 1 02:36:17 xellb1 kernel: Normal: 236*4kB 31*8kB 48*16kB 35*32kB 4*64kB 1*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3976kB
Mar 1 02:36:17 xellb1 kernel: HighMem: empty
Mar 1 02:36:17 xellb1 kernel: Swap cache: add 358711, delete 358670, find 1369042/1420012, race 0+29
Mar 1 02:36:17 xellb1 kernel: Free swap: 924448kB
Mar 1 02:36:17 xellb1 kernel: 131008 pages of RAM
Mar 1 02:36:17 xellb1 kernel: 0 pages of HIGHMEM
Mar 1 02:36:17 xellb1 kernel: 2376 reserved pages
Mar 1 02:36:17 xellb1 kernel: 37546 pages shared
Mar 1 02:36:17 xellb1 kernel: 41 pages swap cached
Mar 1 02:36:17 xellb1 kernel: Out of Memory: Killed process 31617 (rsync).
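For reference, a minimal sketch of the rate-limited run that avoids the crashes, again with hypothetical paths. rsync's --bwlimit option takes kilobytes per second, so 50 Mbit/s works out to roughly 6250 KB/s:

  # Same transfer as before, capped at about half of a 100 Mbit link
  # (50 Mbit/s / 8 bits per byte = ~6250 KB/s; paths are examples).
  rsync -az --bwlimit=6250 /data/ xellb1:/backup/staging/client1/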
More symptoms: Today I was able to reproduce the problem much more easily. My problem does seem to be related to the lvm2 snapshots I am using. I have a 1 TB volume and a snapshot with about 150 GB of space. When I did this test, the snapshot volume was 97% full, and I tried to overwrite an old file with about 400 MB of new data (thus causing the snapshot volume to copy the original data). My shell was then killed by the oom-killer. When I logged in again, the kcopyd daemon was running heavily. This happens every time I try the overwrite. When I removed the almost-full snapshot, created a fresh new one, and did the overwrite operation again, it worked fine. A little note, referring back to my idea that it is the snapshots that make the disk accesses so slow that the oom-killer kicks in: writing a 400 MB file to a volume that has a snapshot can take more than 3 minutes, while writing the same file when no snapshot is active takes 6 seconds. The time is not an issue for me since this is just a staging server, but the oom-killing really hurts. A sketch of how I check snapshot usage and recreate the snapshot follows. /Lars
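For reference, a minimal sketch of the check-and-recreate procedure, assuming hypothetical volume group and volume names (vg0, bigvol, bigsnap); the lvm2 commands themselves are the standard ones:

  # Show how full the snapshot's copy-on-write area is
  # (usage appears in the Snap% column of lvs on this lvm2 version).
  lvs vg0

  # Replace the nearly full snapshot with a fresh one that has
  # 150 GB of copy-on-write space (names and size are examples).
  lvremove -f /dev/vg0/bigsnap
  lvcreate -s -L 150G -n bigsnap /dev/vg0/bigvol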
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.