Description of problem:
glusterd was OOM-killed (the RHS server is part of the cluster, but not a single volume brick is on it).

Version-Release number of selected component (if applicable):
3.4.0.8rhs-1.el6.x86_64

How reproducible:

Steps to Reproduce:

[root@fred ~]# gluster volume info

Volume Name: t1
Type: Distribute
Volume ID: 67627603-4968-46c5-a42f-b15d31e742d7
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick4: fred.lab.eng.blr.redhat.com:/rhs/brick1/t2
Brick5: fred.lab.eng.blr.redhat.com:/rhs/brick1/t3

Volume Name: sanity
Type: Distribute
Volume ID: f72df54d-410c-4f34-b181-65d8bd0cdcc4
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/sanity

[root@fred ~]# gluster peer status
Number of Peers: 3

Hostname: mia.lab.eng.blr.redhat.com
Port: 24007
Uuid: 1698dc55-2245-4b20-9b8c-60fbe77a06ff
State: Peer in Cluster (Connected)

Hostname: fan.lab.eng.blr.redhat.com
Uuid: c6dfd028-d46f-4d20-a9c6-17c04e7fb919
State: Peer in Cluster (Connected)

Hostname: cutlass.lab.eng.blr.redhat.com
Uuid: 8969af20-77e0-41a5-bb8e-500d1a238f1b
State: Peer in Cluster (Connected)

cutlass was part of the cluster, but no bricks were present on it.

On cutlass:

[root@cutlass ~]# gluster v info

Volume Name: t1
Type: Distribute
Volume ID: 67627603-4968-46c5-a42f-b15d31e742d7
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/t1
Brick4: fred.lab.eng.blr.redhat.com:/rhs/brick1/t2
Brick5: fred.lab.eng.blr.redhat.com:/rhs/brick1/t3

Volume Name: sanity
Type: Distribute
Volume ID: f72df54d-410c-4f34-b181-65d8bd0cdcc4
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick2: mia.lab.eng.blr.redhat.com:/rhs/brick1/sanity
Brick3: fred.lab.eng.blr.redhat.com:/rhs/brick1/sanity

[root@cutlass ~]# pgrep glusterfsd
[root@cutlass ~]# pgrep glusterd
1501
[root@cutlass ~]# gluster volume status
Another transaction is in progress. Please try again after sometime.
[root@cutlass ~]# gluster volume status
Connection failed. Please check if gluster daemon is operational.
[root@cutlass ~]# service glusterd status
glusterd dead but pid file exists
[root@cutlass ~]# cat /proc/sys/kernel/core_pattern
/var/log/core/core.%p.%t.dump
[root@cutlass ~]# ls -l /var/log/core
total 244
-rw-------. 1 root root 247944 May 13 17:01 core.2579.1368442013.dump.1.xz
[root@cutlass ~]# date
Thu May 16 12:19:15 IST 2013

less /var/log/messages
<snip>
May 16 12:02:38 cutlass kernel: Pid: 1501, comm: glusterd Not tainted 2.6.32-358.6.1.el6.x86_64 #1
May 16 12:02:38 cutlass kernel: Call Trace:
May 16 12:02:38 cutlass kernel: [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
May 16 12:02:38 cutlass kernel: [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
May 16 12:02:38 cutlass kernel: [<ffffffff8121d1fc>] ? security_real_capable_noaudit+0x3c/0x70
May 16 12:02:38 cutlass kernel: [<ffffffff8111d272>] ? oom_kill_process+0x82/0x2a0
May 16 12:02:38 cutlass kernel: [<ffffffff8111d1b1>] ? select_bad_process+0xe1/0x120
May 16 12:02:38 cutlass kernel: [<ffffffff8111d6b0>] ? out_of_memory+0x220/0x3c0
May 16 12:02:38 cutlass kernel: [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
May 16 12:02:38 cutlass kernel: [<ffffffff81160a5a>] ? alloc_pages_vma+0x9a/0x150
May 16 12:02:38 cutlass kernel: [<ffffffff81143ddb>] ? handle_pte_fault+0x76b/0xb50
May 16 12:02:38 cutlass kernel: [<ffffffff81434730>] ? sock_aio_read+0x0/0x1b0
May 16 12:02:38 cutlass kernel: [<ffffffff81180c3b>] ? do_sync_readv_writev+0xfb/0x140
May 16 12:02:38 cutlass kernel: [<ffffffff811443fa>] ? handle_mm_fault+0x23a/0x310
May 16 12:02:38 cutlass kernel: [<ffffffff810474c9>] ? __do_page_fault+0x139/0x480
May 16 12:02:38 cutlass kernel: [<ffffffff815135ce>] ? do_page_fault+0x3e/0xa0
May 16 12:02:38 cutlass kernel: [<ffffffff81510985>] ? page_fault+0x25/0x30
May 16 12:02:38 cutlass kernel: Mem-Info:
May 16 12:02:38 cutlass kernel: Node 0 DMA per-cpu:
May 16 12:02:38 cutlass kernel: CPU 0: hi: 0, btch: 1 usd: 0
May 16 12:02:38 cutlass kernel: CPU 1: hi: 0, btch: 1 usd: 0
May 16 12:02:38 cutlass kernel: Node 0 DMA32 per-cpu:
May 16 12:02:38 cutlass kernel: CPU 0: hi: 186, btch: 31 usd: 0
May 16 12:02:38 cutlass kernel: CPU 1: hi: 186, btch: 31 usd: 30
May 16 12:02:38 cutlass kernel: Node 0 Normal per-cpu:
May 16 12:02:38 cutlass kernel: CPU 0: hi: 186, btch: 31 usd: 0
May 16 12:02:38 cutlass kernel: CPU 1: hi: 186, btch: 31 usd: 30
May 16 12:02:38 cutlass kernel: active_anon:756829 inactive_anon:190533 isolated_anon:0
May 16 12:02:38 cutlass kernel: active_file:29 inactive_file:18 isolated_file:0
May 16 12:02:38 cutlass kernel: unevictable:11001 dirty:0 writeback:0 unstable:0
May 16 12:02:38 cutlass kernel: free:21243 slab_reclaimable:2718 slab_unreclaimable:13978
May 16 12:02:38 cutlass kernel: mapped:1127 shmem:6 pagetables:5582 bounce:0
May 16 12:02:38 cutlass kernel: Node 0 DMA free:15724kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15320kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 16 12:02:38 cutlass kernel: lowmem_reserve[]: 0 3512 4017 4017
May 16 12:02:38 cutlass kernel: Node 0 DMA32 free:60836kB min:58868kB low:73584kB high:88300kB active_anon:2851384kB inactive_anon:587436kB active_file:80kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596500kB mlocked:0kB dirty:0kB writeback:0kB mapped:68kB shmem:0kB slab_reclaimable:696kB slab_unreclaimable:1208kB kernel_stack:56kB pagetables:12384kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 16 12:02:38 cutlass kernel: lowmem_reserve[]: 0 0 505 505
May 16 12:02:38 cutlass kernel: Node 0 Normal free:8412kB min:8464kB low:10580kB high:12696kB active_anon:175932kB inactive_anon:174696kB active_file:36kB inactive_file:72kB unevictable:44004kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:17432kB dirty:0kB writeback:0kB mapped:4440kB shmem:24kB slab_reclaimable:10176kB slab_unreclaimable:54704kB kernel_stack:1784kB pagetables:9944kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:19 all_unreclaimable? no
May 16 12:02:38 cutlass kernel: lowmem_reserve[]: 0 0 0 0
May 16 12:02:38 cutlass kernel: Node 0 DMA: 3*4kB 2*8kB 1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15724kB
May 16 12:02:38 cutlass kernel: Node 0 DMA32: 211*4kB 109*8kB 81*16kB 49*32kB 33*64kB 15*128kB 10*256kB 7*512kB 29*1024kB 6*2048kB 1*4096kB = 60836kB
May 16 12:02:38 cutlass kernel: Node 0 Normal: 273*4kB 165*8kB 59*16kB 18*32kB 8*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8412kB
May 16 12:02:38 cutlass kernel: 4255 total pagecache pages
May 16 12:02:38 cutlass kernel: 3102 pages in swap cache
May 16 12:02:38 cutlass kernel: Swap cache stats: add 1018276, delete 1015174, find 892/1146
May 16 12:02:38 cutlass kernel: Free swap = 0kB
May 16 12:02:38 cutlass kernel: Total swap = 4063224kB
May 16 12:02:38 cutlass kernel: 1048575 pages RAM
May 16 12:02:38 cutlass kernel: 34819 pages reserved
May 16 12:02:38 cutlass kernel: 9528 pages shared
May 16 12:02:38 cutlass kernel: 987608 pages non-shared
May 16 12:02:38 cutlass kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
May 16 12:02:38 cutlass kernel: [  552]  0   552    2864    59 1 -17 -1000 udevd
May 16 12:02:38 cutlass kernel: [ 1318]  0  1318    2279    74 0   0     0 dhclient
May 16 12:02:38 cutlass kernel: [ 1373]  0  1373   47283   623 1   0     0 vdsm-reg-setup
May 16 12:02:38 cutlass kernel: [ 1379]  0  1379   62270   210 0   0     0 rsyslogd
May 16 12:02:38 cutlass kernel: [ 1408]  0  1408    2704   113 1   0     0 irqbalance
May 16 12:02:38 cutlass kernel: [ 1427] 32  1427    4743   145 1   0     0 rpcbind
May 16 12:02:38 cutlass kernel: [ 1478]  0  1478    6290    63 0   0     0 rpc.idmapd
May 16 12:02:38 cutlass kernel: [ 1501]  0  1501 2060653 941420 1  0     0 glusterd
May 16 12:02:38 cutlass kernel: [ 1530] 81  1530    7943   130 1   0     0 dbus-daemon
May 16 12:02:38 cutlass kernel: [ 1604] 68  1604    6265   268 1   0     0 hald
May 16 12:02:38 cutlass kernel: [ 1605]  0  1605    4526   126 1   0     0 hald-runner
May 16 12:02:38 cutlass kernel: [ 1633]  0  1633    5055   115 1   0     0 hald-addon-inpu
May 16 12:02:38 cutlass kernel: [ 1643] 68  1643    4451   154 1   0     0 hald-addon-acpi
May 16 12:02:38 cutlass kernel: [ 1775]  0  1775   96424   198 1   0     0 automount
May 16 12:02:38 cutlass kernel: [ 1795]  0  1795   16029    87 0 -17 -1000 sshd
May 16 12:02:38 cutlass kernel: [ 1811]  0  1811   22204   142 1   0     0 sendmail
May 16 12:02:38 cutlass kernel: [ 1820] 51  1820   19540   101 1   0     0 sendmail
May 16 12:02:38 cutlass kernel: [ 1843]  0  1843   27544   142 0   0     0 abrtd
May 16 12:02:38 cutlass kernel: [ 1857]  0  1857   27051   107 1   0     0 ksmtuned
May 16 12:02:38 cutlass kernel: [ 1869]  0  1869   43658   267 1   0     0 tuned
May 16 12:02:38 cutlass kernel: [ 1879]  0  1879   29303   191 1   0     0 crond
May 16 12:02:38 cutlass kernel: [ 1892]  0  1892   25972    87 1   0     0 rhsmcertd
May 16 12:02:38 cutlass kernel: [ 1906]  0  1906    3387   829 1   0     0 wdmd
May 16 12:02:38 cutlass kernel: [ 1917] 179 1917   65809  4366 0   0     0 sanlock
May 16 12:02:38 cutlass kernel: [ 1919]  0  1919    5769    45 0   0     0 sanlock-helper
May 16 12:02:38 cutlass kernel: [ 1936]  0  1936   15480    47 1   0     0 certmonger
May 16 12:02:38 cutlass kernel: [ 2031]  0  2031   58412  1082 1 -17 -1000 multipathd
May 16 12:02:38 cutlass kernel: [ 2059] 38  2059    7540   223 1   0     0 ntpd
May 16 12:02:38 cutlass kernel: [ 2109]  0  2109    7238  5701 1   0   -17 iscsiuio
May 16 12:02:38 cutlass kernel: [ 2114]  0  2114    1219   115 1   0     0 iscsid
May 16 12:02:38 cutlass kernel: [ 2115]  0  2115    1344   832 1   0   -17 iscsid
May 16 12:02:38 cutlass kernel: [ 2126]  0  2126  199365   352 1   0     0 libvirtd
May 16 12:02:38 cutlass kernel: [ 2300]  0  2300    2864    52 1 -17 -1000 udevd
May 16 12:02:38 cutlass kernel: [ 2301]  0  2301    2864    39 0 -17 -1000 udevd
May 16 12:02:38 cutlass kernel: [ 2345] 36  2345    2309    48 1   0     0 respawn
May 16 12:02:38 cutlass kernel: [ 2349] 36  2349  359993  1977 1   0     0 vdsm
May 16 12:02:38 cutlass kernel: [ 2355]  0  2355    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2357]  0  2357    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2359]  0  2359    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2361]  0  2361    1015   112 1   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2363]  0  2363    1015   112 0   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2365]  0  2365    1015   112 1   0     0 mingetty
May 16 12:02:38 cutlass kernel: [ 2383]  0  2383   19104   215 0   0     0 sudo
May 16 12:02:38 cutlass kernel: [ 2384]  0  2384  148838   301 0   0     0 python
May 16 12:02:38 cutlass kernel: [ 3962]  0  3962   52871   192 1   0     0 smbd
May 16 12:02:38 cutlass kernel: [ 3965]  0  3965   53000    86 1   0     0 smbd
May 16 12:02:38 cutlass kernel: [ 3986]  0  3986   83703   331 0   0     0 glusterfs
May 16 12:02:38 cutlass kernel: [ 4003] 29  4003    6621   177 0   0     0 rpc.statd
May 16 12:02:38 cutlass kernel: [ 4044]  0  4044   74567   326 1   0     0 glusterfs
May 16 12:02:38 cutlass kernel: [14848]  0 14848   24465   279 1   0     0 sshd
May 16 12:02:38 cutlass kernel: [14852]  0 14852   27083   244 1   0     0 bash
May 16 12:02:38 cutlass kernel: [15046]  0 15046   25226   125 1   0     0 sleep
May 16 12:02:38 cutlass kernel: Out of memory: Kill process 1501 (glusterd) score 915 or sacrifice child
May 16 12:02:38 cutlass kernel: Killed process 1501, UID 0, (glusterd) total-vm:8242612kB, anon-rss:3764760kB, file-rss:916kB

Actual results:

Expected results:

Additional info:
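Additional note on why glusterd was chosen: the process table in the log shows glusterd with rss 941420 pages (~3.6 GB resident with 4 KiB pages), far above every other process, and the OOM killer reports score 915 for it. As a minimal illustrative sketch only (the real kernel badness heuristic also weighs total_vm, swap, page tables, and oom_score_adj scaling), this ranks the largest rows from the table above by RSS and excludes the -1000 ("never kill") entries:

```python
# Hedged sketch, not the kernel algorithm: approximate victim selection by
# resident set size. Rows are (pid, rss_pages, oom_score_adj, name), copied
# from the OOM report above. oom_score_adj == -1000 marks a process as
# exempt from OOM killing.
procs = [
    (1501, 941420, 0,     "glusterd"),
    (2109, 5701,   -17,   "iscsiuio"),
    (1917, 4366,   0,     "sanlock"),
    (2349, 1977,   0,     "vdsm"),
    (2031, 1082,   -1000, "multipathd"),  # exempt: oom_score_adj -1000
]

def pick_victim(procs):
    # Skip processes that are exempt from OOM killing, then take the
    # largest resident set -- enough to show why glusterd was selected.
    candidates = [p for p in procs if p[2] != -1000]
    return max(candidates, key=lambda p: p[1])

print(pick_victim(procs)[3])  # -> glusterd
```

With "Free swap = 0kB" in the log, nearly all of that 3.6 GB resident anonymous memory was unreclaimable, which matches the "Killed process 1501 ... anon-rss:3764760kB" line.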
Looking at the logs on cutlass, the issue seems to be with syncop. Several fixes to syncop were made after v3.4.0.8. The git hashes of the fixes are:

7503237 syncop: synctask shouldn't yawn, it could miss a 'wake
2b525e1 syncop: Remove task from syncbarrier's waitq before 'wake
3496933 syncop: Update synctask state appropriately

These fixes are available in v3.4.0.9 and above. Can you check whether this occurs again?
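For context on the first fix: "synctask shouldn't yawn, it could miss a 'wake'" describes a classic lost-wakeup race, where a task checks its condition and then goes to sleep without holding the lock across both steps, so a wake delivered in between is lost. The glusterfs synctask code is C; purely as a hedged analogue (not glusterfs code), here is the standard pattern that avoids the race, expressed with Python's threading.Condition:

```python
import threading

# Illustrative analogue of the lost-wakeup bug class fixed in syncop:
# the waiter must test its predicate and sleep under the SAME lock the
# waker takes. Otherwise a wake arriving between "check" and "sleep"
# is missed and the waiter blocks forever.

cond = threading.Condition()
done = False  # the predicate; only read/written while holding `cond`

def waker():
    global done
    with cond:
        done = True    # update the predicate first...
        cond.notify()  # ...then wake, still holding the lock

def waiter(results):
    with cond:
        while not done:   # re-check under the lock: a wake that fired
            cond.wait()   # before wait() began cannot be missed
    results.append("woke")

results = []
t = threading.Thread(target=waiter, args=(results,))
t.start()
waker()
t.join(timeout=5)
print(results)  # -> ['woke']
```

The predicate loop is the key point: even if the waker runs before the waiter ever calls wait(), the waiter observes `done` under the lock and returns immediately instead of sleeping through a wake that already happened.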
Moving to ON_QA. We haven't heard of any more OOM kills since the above-mentioned patches were pulled in. Please check again with the latest packages.
Verified with glusterfs-3.4.0.12rhs.beta3; working as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html