Description of problem:
glusterd was OOM-killed (the RHS server is part of the cluster, but not a single volume brick is on it).

Version-Release number of selected component (if applicable):
3.4.0.8rhs-1.el6.x86_64

How reproducible:

Steps to Reproduce:

[root@fred ~]# gluster volume info

Volume Name: t1
Type: Distribute
Volume ID: 67627603-4968-46c5-a42f-b15d31e742d7
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick4: fred.lab.eng.blr.redhat.com:/rhs/brick1/t2
Brick5: fred.lab.eng.blr.redhat.com:/rhs/brick1/t3

Volume Name: sanity
Type: Distribute
Volume ID: f72df54d-410c-4f34-b181-65d8bd0cdcc4
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/sanity

[root@fred ~]# gluster peer status
Number of Peers: 3

Hostname: mia.lab.eng.blr.redhat.com
Port: 24007
Uuid: 1698dc55-2245-4b20-9b8c-60fbe77a06ff
State: Peer in Cluster (Connected)

Hostname: fan.lab.eng.blr.redhat.com
Uuid: c6dfd028-d46f-4d20-a9c6-17c04e7fb919
State: Peer in Cluster (Connected)

Hostname: cutlass.lab.eng.blr.redhat.com
Uuid: 8969af20-77e0-41a5-bb8e-500d1a238f1b
State: Peer in Cluster (Connected)

cutlass was part of the cluster, but no bricks were present on it.

On cutlass:

[root@cutlass ~]# gluster v info

Volume Name: t1
Type: Distribute
Volume ID: 67627603-4968-46c5-a42f-b15d31e742d7
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick4: fred.lab.eng.blr.redhat.com:/rhs/brick1/t2
Brick5: fred.lab.eng.blr.redhat.com:/rhs/brick1/t3

Volume Name: sanity
Type: Distribute
Volume ID: f72df54d-410c-4f34-b181-65d8bd0cdcc4
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/sanity

[root@cutlass ~]# pgrep glusterfsd
[root@cutlass ~]# pgrep glusterd
1501
[root@cutlass ~]# gluster volume status
Another transaction is in progress. Please try again after sometime.
[root@cutlass ~]# gluster volume status
Connection failed. Please check if gluster daemon is operational.
[root@cutlass ~]# service glusterd status
glusterd dead but pid file exists
[root@cutlass ~]# cat /proc/sys/kernel/core_pattern
/var/log/core/core.%p.%t.dump
[root@cutlass ~]# ls -l /var/log/core
total 244
-rw-------. 1 root root 247944 May 13 17:01 core.2579.1368442013.dump.1.xz
[root@cutlass ~]# date
Thu May 16 12:19:15 IST 2013

less /var/log/messages
<snip>
May 16 12:02:38 cutlass kernel: Pid: 1501, comm: glusterd Not tainted 2.6.32-358.6.1.el6.x86_64 #1
May 16 12:02:38 cutlass kernel: Call Trace:
May 16 12:02:38 cutlass kernel: [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
May 16 12:02:38 cutlass kernel: [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
May 16 12:02:38 cutlass kernel: [<ffffffff8121d1fc>] ? security_real_capable_noaudit+0x3c/0x70
May 16 12:02:38 cutlass kernel: [<ffffffff8111d272>] ? oom_kill_process+0x82/0x2a0
May 16 12:02:38 cutlass kernel: [<ffffffff8111d1b1>] ? select_bad_process+0xe1/0x120
May 16 12:02:38 cutlass kernel: [<ffffffff8111d6b0>] ? out_of_memory+0x220/0x3c0
May 16 12:02:38 cutlass kernel: [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
May 16 12:02:38 cutlass kernel: [<ffffffff81160a5a>] ? alloc_pages_vma+0x9a/0x150
May 16 12:02:38 cutlass kernel: [<ffffffff81143ddb>] ? handle_pte_fault+0x76b/0xb50
May 16 12:02:38 cutlass kernel: [<ffffffff81434730>] ? sock_aio_read+0x0/0x1b0
May 16 12:02:38 cutlass kernel: [<ffffffff81180c3b>] ? do_sync_readv_writev+0xfb/0x140
May 16 12:02:38 cutlass kernel: [<ffffffff811443fa>] ? handle_mm_fault+0x23a/0x310
May 16 12:02:38 cutlass kernel: [<ffffffff810474c9>] ? __do_page_fault+0x139/0x480
May 16 12:02:38 cutlass kernel: [<ffffffff815135ce>] ? do_page_fault+0x3e/0xa0
May 16 12:02:38 cutlass kernel: [<ffffffff81510985>] ? page_fault+0x25/0x30
May 16 12:02:38 cutlass kernel: Mem-Info:
May 16 12:02:38 cutlass kernel: Node 0 DMA per-cpu:
May 16 12:02:38 cutlass kernel: CPU 0: hi: 0, btch: 1 usd: 0
May 16 12:02:38 cutlass kernel: CPU 1: hi: 0, btch: 1 usd: 0
May 16 12:02:38 cutlass kernel: Node 0 DMA32 per-cpu:
May 16 12:02:38 cutlass kernel: CPU 0: hi: 186, btch: 31 usd: 0
May 16 12:02:38 cutlass kernel: CPU 1: hi: 186, btch: 31 usd: 30
May 16 12:02:38 cutlass kernel: Node 0 Normal per-cpu:
May 16 12:02:38 cutlass kernel: CPU 0: hi: 186, btch: 31 usd: 0
May 16 12:02:38 cutlass kernel: CPU 1: hi: 186, btch: 31 usd: 30
May 16 12:02:38 cutlass kernel: active_anon:756829 inactive_anon:190533 isolated_anon:0
May 16 12:02:38 cutlass kernel: active_file:29 inactive_file:18 isolated_file:0
May 16 12:02:38 cutlass kernel: unevictable:11001 dirty:0 writeback:0 unstable:0
May 16 12:02:38 cutlass kernel: free:21243 slab_reclaimable:2718 slab_unreclaimable:13978
May 16 12:02:38 cutlass kernel: mapped:1127 shmem:6 pagetables:5582 bounce:0
May 16 12:02:38 cutlass kernel: Node 0 DMA free:15724kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15320kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 16 12:02:38 cutlass kernel: lowmem_reserve[]: 0 3512 4017 4017
May 16 12:02:38 cutlass kernel: Node 0 DMA32 free:60836kB min:58868kB low:73584kB high:88300kB active_anon:2851384kB inactive_anon:587436kB active_file:80kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596500kB mlocked:0kB dirty:0kB writeback:0kB mapped:68kB shmem:0kB slab_reclaimable:696kB slab_unreclaimable:1208kB kernel_stack:56kB pagetables:12384kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 16 12:02:38 cutlass kernel: lowmem_reserve[]: 0 0 505 505
May 16 12:02:38 cutlass kernel: Node 0 Normal free:8412kB min:8464kB low:10580kB high:12696kB active_anon:175932kB inactive_anon:174696kB active_file:36kB inactive_file:72kB unevictable:44004kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:17432kB dirty:0kB writeback:0kB mapped:4440kB shmem:24kB slab_reclaimable:10176kB slab_unreclaimable:54704kB kernel_stack:1784kB pagetables:9944kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:19 all_unreclaimable? no
May 16 12:02:38 cutlass kernel: lowmem_reserve[]: 0 0 0 0
May 16 12:02:38 cutlass kernel: Node 0 DMA: 3*4kB 2*8kB 1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15724kB
May 16 12:02:38 cutlass kernel: Node 0 DMA32: 211*4kB 109*8kB 81*16kB 49*32kB 33*64kB 15*128kB 10*256kB 7*512kB 29*1024kB 6*2048kB 1*4096kB = 60836kB
May 16 12:02:38 cutlass kernel: Node 0 Normal: 273*4kB 165*8kB 59*16kB 18*32kB 8*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8412kB
May 16 12:02:38 cutlass kernel: 4255 total pagecache pages
May 16 12:02:38 cutlass kernel: 3102 pages in swap cache
May 16 12:02:38 cutlass kernel: Swap cache stats: add 1018276, delete 1015174, find 892/1146
May 16 12:02:38 cutlass kernel: Free swap = 0kB
May 16 12:02:38 cutlass kernel: Total swap = 4063224kB
May 16 12:02:38 cutlass kernel: 1048575 pages RAM
May 16 12:02:38 cutlass kernel: 34819 pages reserved
May 16 12:02:38 cutlass kernel: 9528 pages shared
May 16 12:02:38 cutlass kernel: 987608 pages non-shared
May 16 12:02:38 cutlass kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
May 16 12:02:38 cutlass kernel: [  552]  0   552    2864    59 1 -17 -1000 udevd
May 16 12:02:38 cutlass kernel: [ 1318]  0  1318    2279    74 0   0     0 dhclient
May 16 12:02:38 cutlass kernel: [ 1373]  0  1373   47283   623 1   0     0 vdsm-reg-setup
May 16 12:02:38 cutlass kernel: [ 1379]  0  1379   62270   210 0   0     0 rsyslogd
May 16 12:02:38 cutlass kernel: [ 1408]  0  1408    2704   113 1   0     0 irqbalance
May 16 12:02:38 cutlass kernel: [ 1427] 32  1427    4743   145 1   0     0 rpcbind
May 16 12:02:38 cutlass kernel: [ 1478]  0  1478    6290    63 0   0     0 rpc.idmapd
May 16 12:02:38 cutlass kernel: [ 1501]  0  1501 2060653 941420 1  0     0 glusterd
May 16 12:02:38 cutlass kernel: [ 1530] 81  1530    7943   130 1   0     0 dbus-daemon
May 16 12:02:38 cutlass kernel: [ 1604] 68  1604    6265   268 1   0     0 hald
May 16 12:02:38 cutlass kernel: [ 1605]  0  1605    4526   126 1   0     0 hald-runner
May 16 12:02:38 cutlass kernel: [ 1633]  0  1633    5055   115 1   0     0 hald-addon-inpu
May 16 12:02:38 cutlass kernel: [ 1643] 68  1643    4451   154 1   0     0 hald-addon-acpi
May 16 12:02:38 cutlass kernel: [ 1775]  0  1775   96424   198 1   0     0 automount
May 16 12:02:38 cutlass kernel: [ 1795]  0  1795   16029    87 0 -17 -1000 sshd
May 16 12:02:38 cutlass kernel: [ 1811]  0  1811   22204   142 1   0     0 sendmail
May 16 12:02:38 cutlass kernel: [ 1820] 51  1820   19540   101 1   0     0 sendmail
May 16 12:02:38 cutlass kernel: [ 1843]  0  1843   27544   142 0   0     0 abrtd
May 16 12:02:38 cutlass kernel: [ 1857]  0  1857   27051   107 1   0     0 ksmtuned
May 16 12:02:38 cutlass kernel: [ 1869]  0  1869   43658   267 1   0     0 tuned
May 16 12:02:38 cutlass kernel: [ 1879]  0  1879   29303   191 1   0     0 crond
May 16 12:02:38 cutlass kernel: [ 1892]  0  1892   25972    87 1   0     0 rhsmcertd
May 16 12:02:38 cutlass kernel: [ 1906]  0  1906    3387   829 1   0     0 wdmd
May 16 12:02:38 cutlass kernel: [ 1917] 179 1917   65809  4366 0   0     0 sanlock
May 16 12:02:38 cutlass kernel: [ 1919]  0  1919    5769    45 0   0     0 sanlock-helper
May 16 12:02:38 cutlass kernel: [ 1936]  0  1936   15480    47 1   0     0 certmonger
May 16 12:02:38 cutlass kernel: [ 2031]  0  2031   58412  1082 1 -17 -1000 multipathd
May 16 12:02:38 cutlass kernel: [ 2059] 38  2059    7540   223 1   0     0 ntpd
May 16 12:02:38 cutlass kernel: [ 2109]  0  2109    7238  5701 1   0   -17 iscsiuio
May 16 12:02:38 cutlass kernel: [ 2114]  0  2114    1219   115 1   0     0 iscsid
May 16 12:02:38 cutlass kernel: [ 2115]  0  2115    1344   832 1   0   -17 iscsid
May 16 12:02:38 cutlass kernel: [ 2126]  0  2126  199365   352 1   0     0 libvirtd
May 16 12:02:38 cutlass kernel: [ 2300]  0  2300    2864    52 1 -17 -1000 udevd
May 16 12:02:38 cutlass kernel: [ 2301]  0  2301    2864    39 0 -17 -1000 udevd
May 16 12:02:38 cutlass kernel: [ 2345] 36  2345    2309    48 1   0     0 respawn
May 16 12:02:38 cutlass kernel: [ 2349] 36  2349  359993  1977 1   0     0 vdsm
May 16 12:02:38 cutlass kernel: [ 2355]  0  2355    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2357]  0  2357    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2359]  0  2359    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2361]  0  2361    1015   112 1   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2363]  0  2363    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2365]  0  2365    1015   112 1   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2383]  0  2383   19104   215 0   0     0 sudo
May 16 12:02:38 cutlass kernel: [ 2384]  0  2384  148838   301 0   0     0 python
May 16 12:02:38 cutlass kernel: [ 3962]  0  3962   52871   192 1   0     0 smbd
May 16 12:02:38 cutlass kernel: [ 3965]  0  3965   53000    86 1   0     0 smbd
May 16 12:02:38 cutlass kernel: [ 3986]  0  3986   83703   331 0   0     0 glusterfs
May 16 12:02:38 cutlass kernel: [ 4003] 29  4003    6621   177 0   0     0 rpc.statd
May 16 12:02:38 cutlass kernel: [ 4044]  0  4044   74567   326 1   0     0 glusterfs
May 16 12:02:38 cutlass kernel: [14848]  0 14848   24465   279 1   0     0 sshd
May 16 12:02:38 cutlass kernel: [14852]  0 14852   27083   244 1   0     0 bash
May 16 12:02:38 cutlass kernel: [15046]  0 15046   25226   125 1   0     0 sleep
May 16 12:02:38 cutlass kernel: Out of memory: Kill process 1501 (glusterd) score 915 or sacrifice child
May 16 12:02:38 cutlass kernel: Killed process 1501, UID 0, (glusterd) total-vm:8242612kB, anon-rss:3764760kB, file-rss:916kB

Actual results:

Expected results:

Additional info:
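Additional note on why glusterd was chosen: the process table in the log shows glusterd with rss 941420 pages (~3.6 GB resident with 4 KiB pages), far above every other process, and the OOM killer reports score 915 for it. As a minimal illustrative sketch only (the real kernel badness heuristic also weighs total_vm, swap, page tables, and oom_score_adj scaling), this ranks the largest rows from the table above by RSS and excludes the -1000 ("never kill") entries:

```python
# Hedged sketch, not the kernel algorithm: approximate victim selection by
# resident set size. Rows are (pid, rss_pages, oom_score_adj, name), copied
# from the OOM report above. oom_score_adj == -1000 marks a process as
# exempt from OOM killing.
procs = [
    (1501, 941420, 0,     "glusterd"),
    (2109, 5701,   -17,   "iscsiuio"),
    (1917, 4366,   0,     "sanlock"),
    (2349, 1977,   0,     "vdsm"),
    (2031, 1082,   -1000, "multipathd"),  # exempt: oom_score_adj -1000
]

def pick_victim(procs):
    # Skip processes that are exempt from OOM killing, then take the
    # largest resident set -- enough to show why glusterd was selected.
    candidates = [p for p in procs if p[2] != -1000]
    return max(candidates, key=lambda p: p[1])

print(pick_victim(procs)[3])  # -> glusterd
```

With "Free swap = 0kB" in the log, nearly all of that 3.6 GB resident anonymous memory was unreclaimable, which matches the "Killed process 1501 ... anon-rss:3764760kB" line.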
Looking at the logs on cutlass, the issue seems to be with syncop. Several fixes to syncop were made after v3.4.0.8. The git hashes of the fixes are:

7503237 syncop: synctask shouldn't yawn, it could miss a 'wake
2b525e1 syncop: Remove task from syncbarrier's waitq before 'wake
3496933 syncop: Update synctask state appropriately

These fixes are available in v3.4.0.9 and above. Can you check whether this occurs again?
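For context on the first fix: "synctask shouldn't yawn, it could miss a 'wake'" describes a classic lost-wakeup race, where a task checks its condition and then goes to sleep without holding the lock across both steps, so a wake delivered in between is lost. The glusterfs synctask code is C; purely as a hedged analogue (not glusterfs code), here is the standard pattern that avoids the race, expressed with Python's threading.Condition:

```python
import threading

# Illustrative analogue of the lost-wakeup bug class fixed in syncop:
# the waiter must test its predicate and sleep under the SAME lock the
# waker takes. Otherwise a wake arriving between "check" and "sleep"
# is missed and the waiter blocks forever.

cond = threading.Condition()
done = False  # the predicate; only read/written while holding `cond`

def waker():
    global done
    with cond:
        done = True    # update the predicate first...
        cond.notify()  # ...then wake, still holding the lock

def waiter(results):
    with cond:
        while not done:   # re-check under the lock: a wake that fired
            cond.wait()   # before wait() began cannot be missed
    results.append("woke")

results = []
t = threading.Thread(target=waiter, args=(results,))
t.start()
waker()
t.join(timeout=5)
print(results)  # -> ['woke']
```

The predicate loop is the key point: even if the waker runs before the waiter ever calls wait(), the waiter observes `done` under the lock and returns immediately instead of sleeping through a wake that already happened.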
Moving to ON_QA. We haven't heard of any more OOM kills since the above-mentioned patches were pulled in. Please check again with the latest packages.
Verified with glusterfs-3.4.0.12rhs.beta3; working as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html