Description of problem:
========================
During my systemic testing, I hit issues which were reported as below:

https://bugzilla.redhat.com/show_bug.cgi?id=1415178 ===> systemic testing: seeing a lot of ping timeouts, which would lead to split-brains
https://bugzilla.redhat.com/show_bug.cgi?id=1415101 ===> glustershd process crashed on systemic setup

I kept the setup untouched over the weekend for further debugging, but with the IOs that were triggered previously still running. Today I saw that the brick #3 (dht-subvol 2) process got killed due to OOM.

One note: the shd crash for which the above bz was reported happened on node #4, and this brick is the replica pair of the node #4 brick.

Note: node 2 (10.70.37.86) was hung with kernel hangs and hence had to be rebooted.

[root@dhcp35-156 bricks]# gluster v status
Status of volume: systemic
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/systemic      49152     0          Y       4154
Brick 10.70.37.86:/rhs/brick1/systemic      49152     0          Y       2285
Brick 10.70.35.156:/rhs/brick1/systemic     N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/systemic     N/A       N/A        N       N/A
Brick 10.70.35.20:/rhs/brick2/systemic      N/A       N/A        N       N/A
Brick 10.70.37.86:/rhs/brick2/systemic      49153     0          Y       2290
Brick 10.70.35.156:/rhs/brick2/systemic     49153     0          Y       4045
Brick 10.70.37.154:/rhs/brick2/systemic     49153     0          Y       3873
Self-heal Daemon on localhost               N/A       N/A        Y       19630
Quota Daemon on localhost                   N/A       N/A        Y       19638
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       16060
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       16068
Self-heal Daemon on dhcp35-20.lab.eng.blr.redhat.com      N/A   N/A     Y       19131
Quota Daemon on dhcp35-20.lab.eng.blr.redhat.com          N/A   N/A     Y       19139
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       30598
Quota Daemon on 10.70.37.154                N/A       N/A        Y       30606

Task Status of Volume systemic
------------------------------------------------------------------------------
There are no active volume tasks

OOM-killer excerpt from the kernel log (the first line is truncated as captured):

free:43291 free_pcp:61 free_cma:0
[Sat Jan 21 07:28:01 2017] Node 0 DMA free:15908kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Sat Jan 21 07:28:01 2017] lowmem_reserve[]: 0 2815 16475 16475
[Sat Jan 21 07:28:01 2017] Node 0 DMA32 free:85160kB min:11532kB low:14412kB high:17296kB active_anon:1944336kB inactive_anon:546052kB active_file:0kB inactive_file:68kB unevictable:4524kB isolated(anon):0kB isolated(file):0kB present:3129332kB managed:2883568kB mlocked:4524kB dirty:0kB writeback:0kB mapped:2120kB shmem:556kB slab_reclaimable:10696kB slab_unreclaimable:277408kB kernel_stack:2016kB pagetables:5736kB unstable:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3 all_unreclaimable? yes
[Sat Jan 21 07:28:01 2017] lowmem_reserve[]: 0 0 13659 13659
[Sat Jan 21 07:28:01 2017] Node 0 Normal free:72096kB min:55984kB low:69980kB high:83976kB active_anon:12328876kB inactive_anon:1232716kB active_file:424kB inactive_file:352kB unevictable:27908kB isolated(anon):0kB isolated(file):0kB present:14262272kB managed:13987552kB mlocked:27908kB dirty:0kB writeback:44kB mapped:6116kB shmem:2528kB slab_reclaimable:33824kB slab_unreclaimable:68868kB kernel_stack:4176kB pagetables:50240kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1219 all_unreclaimable? yes
[Sat Jan 21 07:28:01 2017] lowmem_reserve[]: 0 0 0 0
[Sat Jan 21 07:28:01 2017] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[Sat Jan 21 07:28:01 2017] Node 0 DMA32: 21331*4kB (UEM) 1*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 85332kB
[Sat Jan 21 07:28:01 2017] Node 0 Normal: 17899*4kB (UEM) 29*8kB (UEM) 23*16kB (M) 2*32kB (M) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 72324kB
[Sat Jan 21 07:28:01 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Sat Jan 21 07:28:01 2017] 33005 total pagecache pages
[Sat Jan 21 07:28:01 2017] 29897 pages in swap cache
[Sat Jan 21 07:28:01 2017] Swap cache stats: add 295752345, delete 295722448, find 92579823/127325977
[Sat Jan 21 07:28:01 2017] Free swap  = 0kB
[Sat Jan 21 07:28:01 2017] Total swap = 8585212kB
[Sat Jan 21 07:28:01 2017] 4351899 pages RAM
[Sat Jan 21 07:28:01 2017] 0 pages HighMem/MovableOnly
[Sat Jan 21 07:28:01 2017] 130142 pages reserved
[Sat Jan 21 07:28:01 2017] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Sat Jan 21 07:28:01 2017] [  561]     0   561     9204      277      22       47             0 systemd-journal
[Sat Jan 21 07:28:01 2017] [  580]     0   580    87696      321      35     1470             0 lvmetad
[Sat Jan 21 07:28:01 2017] [  609]     0   609    79648     1811      36        0         -1000 multipathd
[Sat Jan 21 07:28:01 2017] [  616]     0   616    11745      272      23      615         -1000 systemd-udevd
[Sat Jan 21 07:28:01 2017] [  787]     0   787    46853     7504      44        0         -1000 dmeventd
[Sat Jan 21 07:28:01 2017] [  889]     0   889    13854      214      27       90         -1000 auditd
[Sat Jan 21 07:28:01 2017] [  909]     0   909     4202      201      14       46             0 alsactl
[Sat Jan 21 07:28:01 2017] [  910]     0   910     1092      105       8       16             0 rngd
[Sat Jan 21 07:28:01 2017] [  911]     0   911    53103      262      57      384             0 abrtd
[Sat Jan 21 07:28:01 2017] [  912]    81   912    24620      330      17       85          -900 dbus-daemon
[Sat Jan 21 07:28:01 2017] [  914]   997   914    28962      282      28       72             0 chronyd
[Sat Jan 21 07:28:01 2017] [  925]     0   925     4826      212      13       42             0 irqbalance
[Sat Jan 21 07:28:01 2017] [  926]   998   926   132071      328      55     1668             0 polkitd
[Sat Jan 21 07:28:01 2017] [  934]     0   934    52548      238      53      326             0 abrt-watch-log
[Sat Jan 21 07:28:01 2017] [  939]     0   939     6070      190      16      124             0 smartd
[Sat Jan 21 07:28:01 2017] [  940]     0   940     6048      252      16       37             0 systemd-logind
[Sat Jan 21 07:28:01 2017] [  944]     0   944     1083       95       7       34             0 acpid
[Sat Jan 21 07:28:01 2017] [  957]    32   957    16237      134      34      123             0 rpcbind
[Sat Jan 21 07:28:01 2017] [  963]     0   963    82803      853      84     5442             0 firewalld
[Sat Jan 21 07:28:01 2017] [  964]     0   964    50303      144      38      114             0 gssproxy
[Sat Jan 21 07:28:01 2017] [  972]     0   972    31555      249      18      129             0 crond
[Sat Jan 21 07:28:01 2017] [  973]     0   973     6461      183      16       47             0 atd
[Sat Jan 21 07:28:01 2017] [  979]     0   979    27509      164      10       30             0 agetty
[Sat Jan 21 07:28:01 2017] [  981]     0   981   127850      539      67      268             0 NetworkManager
[Sat Jan 21 07:28:01 2017] [ 1084]     0  1084    28206      274      56     3076             0 dhclient
[Sat Jan 21 07:28:01 2017] [ 1309]     0  1309   158898      279     122      972             0 libvirtd
[Sat Jan 21 07:28:01 2017] [ 1311]   995  1311    11050      143      24      172             0 nrpe
[Sat Jan 21 07:28:01 2017] [ 1314]     0  1314    71328      230      41      316             0 rsyslogd
[Sat Jan 21 07:28:01 2017] [ 1316]     0  1316   138290      452      90     2576             0 tuned
[Sat Jan 21 07:28:01 2017] [ 1319]     0  1319   168455     1525      80     6170             0 glusterd
[Sat Jan 21 07:28:01 2017] [ 1324]     0  1324    20617       74      39      213         -1000 sshd
[Sat Jan 21 07:28:01 2017] [ 1325]     0  1325    28335      114      11       18             0 rhsmcertd
[Sat Jan 21 07:28:01 2017] [ 1421]     0  1421    22244      189      42      236             0 master
[Sat Jan 21 07:28:01 2017] [ 1429]    89  1429    22314      256      41      233             0 qmgr
[Sat Jan 21 07:28:01 2017] [ 1657]     0  1657    26978       99       9       11             0 rhnsd
[Sat Jan 21 07:28:01 2017] [ 4026]     0  4026  4351104  2481504    7524  1271633             0 glusterfsd
[Sat Jan 21 07:28:01 2017] [ 4045]     0  4045  2933885  1494036    4702   816818             0 glusterfsd
[Sat Jan 21 07:28:01 2017] [ 4066]     0  4066   643284     2959     289    16464             0 glusterfs
[Sat Jan 21 07:28:01 2017] [ 4097]     0  4097   135927      322      73    13924             0 glusterfs
[Sat Jan 21 07:28:01 2017] [ 6258]     0  6258    31927      122      18      129             0 screen
[Sat Jan 21 07:28:01 2017] [ 6259]     0  6259    29317      279      14      523             0 bash
[Sat Jan 21 07:28:01 2017] [14012]     0 14012    26974      136      10        1             0 sleep
[Sat Jan 21 07:28:01 2017] [14127]    89 14127    22270      470      43        0             0 pickup
[Sat Jan 21 07:28:01 2017] Out of memory: Kill process 4026 (glusterfsd) score 572 or sacrifice child
[Sat Jan 21 07:28:01 2017] Killed process 4026 (glusterfsd) total-vm:17404416kB, anon-rss:9926016kB, file-rss:0kB, shmem-rss:0kB
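A quick cross-check of the numbers above: the kernel's OOM task list reports total_vm and rss in 4 kB pages, so the figures for pid 4026 can be verified against the kill message with simple arithmetic:

# rss and total_vm in the task list are counts of 4 kB pages
echo $((4351104 * 4))   # 17404416 kB -> matches total-vm:17404416kB
echo $((2481504 * 4))   # 9926016 kB  -> matches anon-rss:9926016kB

So this one brick process had grown to roughly 9.5 GiB resident, plus about 4.8 GiB swapped out (1271633 swap entries * 4 kB), on a node whose swap was already exhausted ("Free swap = 0kB"), which is why the OOM killer targeted it.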
Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1415632
Can we please check whether this is a duplicate of BZ 1413351?
Hi,

Is it possible to share a statedump of the glustershd process from the other nodes in the cluster, if it is up?

Regards,
Mohit Agrawal
(In reply to Mohit Agrawal from comment #4)
> Hi,
>
> Is it possible to share statedump of glustershd process from other nodes in
> cluster if it is up?
>
> Regards
> Mohit Agrawal

Please also share the statedumps of the brick processes that are running.

Regards,
Mohit Agrawal
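(For reference, one way to capture these statedumps — a sketch assuming default settings, so the output path and pid-matching pattern may differ on this setup:

# dump the state of all running brick processes of the volume;
# files land in /var/run/gluster by default
gluster volume statedump systemic

# glustershd, like any glusterfs process, writes a statedump on SIGUSR1
kill -USR1 $(pgrep -f glustershd)

# dumps are named after the brick path / process, with pid and timestamp
ls -lrt /var/run/gluster/
)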
Hi Mohit,

The statedumps of the bricks are available on the volume itself. You can mount the volume and visit the path below:

/mnt/systemic/logs/rhs-client11.lab.eng.blr.redhat.com/statedumps

It has the dumps of all bricks of all nodes. Given that this is systemic testing, it may be highly difficult to point to one particular statedump.
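(For anyone else picking this up, a minimal sketch of reaching those dumps; the server address is an assumption — any node hosting the volume, e.g. 10.70.35.20 from the status output above, should be able to serve the volfile:

mkdir -p /mnt/systemic
mount -t glusterfs 10.70.35.20:/systemic /mnt/systemic
ls /mnt/systemic/logs/rhs-client11.lab.eng.blr.redhat.com/statedumps
)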
Hi,

Thank you for sharing the location of the statedumps. I have checked the current brick process statedumps in /var/run/gluster; with respect to the dict leak, it is similar to bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1411329, and the current memory consumption of the brick process is also high:

ps -aef | grep glusterfsd
root      4154     1 75 Jan18 ?        4-10:47:51 /usr/sbin/glusterfsd -s 10.70.35.20 --volfile-id systemic.10.70.35.20.rhs-brick1-systemic -p /var/lib/glusterd/vols/systemic/run/10.70.35.20-rhs-brick1-systemic.pid -S /var/run/gluster/ecb70f1c5d8ade863b15ff45abb9d46c.socket --brick-name /rhs/brick1/systemic -l /var/log/glusterfs/bricks/rhs-brick1-systemic.log --xlator-option *-posix.glusterd-uuid=c5954ab2-6283-4e02-a6f6-1fa6cd943f49 --brick-port 49152 --xlator-option systemic-server.listen-port=49152
root     24472 24391  0 17:56 pts/4    00:00:00 grep --color=auto glusterfsd

[root@dhcp35-20 gluster]# pmap -x 4154 | grep "total"
total kB        21724380 14717652 14714608

The dict leak issue is already fixed via bugzilla 1411329.

Regards,
Mohit Agrawal
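(For the record, roughly how such a dict leak shows up in a brick statedump — a sketch; the gf_common_mt_dict_t memusage section name is taken from typical glusterfs statedumps and may vary by version:

# trigger a fresh dump of the suspect brick (pid 4154 from the ps output above)
kill -USR1 4154

# compare dict allocation counters across successive dumps; a num_allocs
# value that grows without bound points at a dict leak
grep -A4 "gf_common_mt_dict_t" /var/run/gluster/*.dump.* | tail -20
)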