Description of problem:
We are running RHEL4 U4 on four of our servers. Two of them were rebooted by the cluster software because the OOM killer killed one of the cluster processes. We run MySQL Cluster and a few other applications on these servers, and the systems are sometimes under heavy load; however, when the OOM killer started killing processes the servers were not under heavy load.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux ES release 4 (Nahant Update 4)

How reproducible:
Random; we cannot reproduce it on demand.

Actual results:
Random OOM kills.

Expected results:
No OOM kills.

Additional info:
xxxxxxxx:user:~> uname -a
Linux xxxxxxxx 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux

Output from logs:
Mar 4 11:26:51 xxxxxxxx kernel: oom-killer: gfp_mask=0xd0
Mar 4 11:26:51 xxxxxxxx kernel: Mem-info:
Mar 4 11:26:51 xxxxxxxx kernel: DMA per-cpu:
Mar 4 11:26:51 xxxxxxxx kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 4 11:26:51 xxxxxxxx kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 4 11:26:51 xxxxxxxx kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 4 11:26:51 xxxxxxxx kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 4 11:26:51 xxxxxxxx kernel: Normal per-cpu:
Mar 4 11:26:51 xxxxxxxx kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 4 11:26:51 xxxxxxxx kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 4 11:26:51 xxxxxxxx kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 4 11:26:51 xxxxxxxx kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 4 11:26:51 xxxxxxxx kernel: HighMem per-cpu:
Mar 4 11:26:52 xxxxxxxx kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 4 11:26:52 xxxxxxxx kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 4 11:26:53 xxxxxxxx kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 4 11:26:53 xxxxxxxx kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 4 11:26:53 xxxxxxxx kernel:
Mar 4 11:26:53 xxxxxxxx kernel: Free pages: 2665012kB (2651520kB HighMem)
Mar 4 11:26:54 xxxxxxxx kernel: Active:136947 inactive:742 dirty:0 writeback:0 unstable:0 free:666253 slab:212160 mapped:72062 pagetables:1053
Mar 4 11:26:54 xxxxxxxx clurgmgrd: [9312]: <info> Executing /opt/pro/commdb-ndbd1.init status
Mar 4 11:26:54 xxxxxxxx kernel: DMA free:12564kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:8411 all_unreclaimable? yes
Mar 4 11:26:55 xxxxxxxx kernel: protections[]: 0 0 0
Mar 4 11:26:55 xxxxxxxx clurgmgrd[9312]: <notice> status on script "commdb-ndbd1" returned 2 (invalid argument(s))
Mar 4 11:26:55 xxxxxxxx kernel: Normal free:928kB min:928kB low:1856kB high:2784kB active:416kB inactive:252kB present:901120kB pages_scanned:1188 all_unreclaimable? yes
Mar 4 11:26:55 xxxxxxxx clurgmgrd: [9312]: <info> Executing /opt/pro/commdb-mgmd.init status
Mar 4 11:26:55 xxxxxxxx kernel: protections[]: 0 0 0
Mar 4 11:26:56 xxxxxxxx kernel: HighMem free:2651520kB min:512kB low:1024kB high:1536kB active:547372kB inactive:2716kB present:3735548kB pages_scanned:0 all_unreclaimable? no
Mar 4 11:26:56 xxxxxxxx kernel: protections[]: 0 0 0
Mar 4 11:26:56 xxxxxxxx kernel: DMA: 5*4kB 4*8kB 4*16kB 3*32kB 3*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12564kB
Mar 4 11:26:57 xxxxxxxx kernel: Normal: 10*4kB 3*8kB 0*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 928kB
Mar 4 11:26:57 xxxxxxxx clurgmgrd: [9312]: <info> Executing /opt/pro/smtpout1.init status
Mar 4 11:26:57 xxxxxxxx kernel: HighMem: 17938*4kB 16983*8kB 15212*16kB 9058*32kB 3920*64kB 981*128kB 491*256kB 387*512kB 342*1024kB 150*2048kB 135*4096kB = 2651520kB
Mar 4 11:26:57 xxxxxxxx kernel: Swap cache: add 1, delete 1, find 0/0, race 0+0
Mar 4 11:26:57 xxxxxxxx clurgmgrd[9312]: <notice> Stopping service commdb-ndbd1
Mar 4 11:26:57 xxxxxxxx kernel: 0 bounce buffer pages
Mar 4 11:26:57 xxxxxxxx kernel: Free swap: 2003252kB
Mar 4 11:26:58 xxxxxxxx kernel: 1163263 pages of RAM
Mar 4 11:26:58 xxxxxxxx kernel: 802802 pages of HIGHMEM
Mar 4 11:26:58 xxxxxxxx kernel: 141733 reserved pages
Mar 4 11:26:58 xxxxxxxx kernel: 64402 pages shared
Mar 4 11:26:59 xxxxxxxx kernel: 0 pages swap cached
Mar 4 11:26:59 xxxxxxxx kernel: Out of Memory: Killed process 17287 (ndbd).

Please let me know if you need more info.
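A note on watching for this condition: the log above shows ZONE_NORMAL down to its minimum (Normal free:928kB min:928kB) while HighMem is nearly all free, so the number to track between incidents is lowmem, not total free memory. A minimal helper, assuming a HighMem-enabled 32-bit kernel like this i686 2.6.9 one (the LowFree field only exists on such kernels; the function name and the optional file argument are illustrative):

```shell
# lowmem_free_kb: print the LowFree value in kB from a /proc/meminfo
# snapshot.  Reads /proc/meminfo by default; pass a saved snapshot file
# to inspect data captured at the time of an OOM kill.
lowmem_free_kb() {
    awk '/^LowFree:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

Logging this value periodically (e.g. from cron) would show whether lowmem drains gradually or collapses suddenly before the OOM kill.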
The problem here is that this is a 32-bit x86 system and all of lowmem is consumed by the slab cache:

  slab:212160
  Normal free:928kB min:928kB low:1856kB high:2784kB active:416kB inactive:252kB present:901120kB

Please get a /proc/slabinfo output when the OOM kill happens so we can see who is consuming all of this memory.

Larry Woodman
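For gathering that data, a small helper like the following can rank the largest slab caches from a /proc/slabinfo snapshot. This is a sketch, not part of the bug report: it assumes the 2.6-era slabinfo layout where the first four columns are name, active_objs, num_objs, and objsize, and the function name is illustrative (reading /proc/slabinfo may require root):

```shell
# top_slabs: estimate per-cache slab memory (num_objs * objsize) from a
# /proc/slabinfo-style file and print the ten largest caches in KB.
# Usage: top_slabs [slabinfo-file]   (defaults to /proc/slabinfo)
top_slabs() {
    file="${1:-/proc/slabinfo}"
    # Skip the two header lines, then compute num_objs ($3) * objsize ($4)
    # per cache and sort descending by the resulting size.
    awk 'NR > 2 && NF >= 4 { printf "%-24s %10.0f KB\n", $1, $3 * $4 / 1024 }' "$file" \
        | sort -k2 -rn | head -10
}
```

Running this from the same cron job that snapshots /proc/meminfo would show which cache is eating ZONE_NORMAL when the OOM kill fires.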
Our problem was caused by a memory leak in cman, so this bug can be closed. Thanks for your time.
The problem and its resolution are described in bug #212634: rgmanager consumed too much memory and caused the OOM killer to start killing processes.
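For anyone hitting a similar leak, one way to confirm that a specific daemon (here rgmanager or cman) is growing is to log its resident set size over time. A minimal sketch, assuming Linux's /proc/&lt;pid&gt;/status interface (the function name and the example service name are illustrative):

```shell
# rss_kb: print a process's resident set size in kB, taken from the
# VmRSS line of /proc/<pid>/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Example: append a timestamped RSS sample for rgmanager every 5 minutes.
#   while sleep 300; do
#       echo "$(date '+%F %T') $(rss_kb "$(pidof -s rgmanager)")" >> /var/log/rgmanager-rss.log
#   done
```

A steadily climbing VmRSS between cluster status checks, with no corresponding drop, is the signature of the leak described in bug #212634.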