Bug 1411329 - OOM kill of glusterfsd during continuous add-bricks
Summary: OOM kill of glusterfsd during continuous add-bricks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: marker
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: RHGS 3.2.0
Assignee: Mohit Agrawal
QA Contact: Prasad Desala
URL:
Whiteboard:
: 1415632 (view as bug list)
Depends On: 1412917 1417622
Blocks: 1351528
TreeView+ depends on / blocked
 
Reported: 2017-01-09 13:51 UTC by Prasad Desala
Modified: 2017-03-23 06:02 UTC (History)
13 users (show)

Fixed In Version: glusterfs-3.8.4-13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1412917 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:02:13 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1413351 0 unspecified CLOSED [Scale] : Brick process oom-killed and rebalance failed. 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Internal Links: 1413351

Description Prasad Desala 2017-01-09 13:51:43 UTC
Description of problem:
=======================
OOM kill of glusterfsd is seen during parallel untar + continuous add-bricks.

Version-Release number of selected component (if applicable):
3.8.4-10.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1) Create ganesha cluster and create a distributed-replicate volume.
2) Enable nfs-ganesha on the volume with mdcache settings.
3) Mount the volume.
4) On the mount point, copy linux kernel package and create three empty directories say 1,2 and 3
5) Now, simultaneously start untarring the kernel package to directories 1,2 and 3.
tar -xvf linux-4.4.1.tar -C 1
tar -xvf linux-4.4.1.tar -C 2
tar -xvf linux-4.4.1.tar -C 3
6) While untar is in-progress, continuously add bricks to the volume.

glusterfsd OOM killed.

Jan  9 17:54:09 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29592:stderr [ 0 bytes (0 B) copied, 11.162 s, 0.0 kB/s ]
Jan  9 17:54:24 dhcp47-155 corosync[14072]: [MAIN  ] Corosync main process was not scheduled for 2967.8682 ms (threshold is 1840.0000 ms). Consider token timeout increase.
Jan  9 17:54:24 dhcp47-155 kernel: lrmd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Jan  9 17:54:24 dhcp47-155 kernel: lrmd cpuset=/ mems_allowed=0
Jan  9 17:54:24 dhcp47-155 kernel: CPU: 1 PID: 14091 Comm: lrmd Not tainted 3.10.0-514.2.2.el7.x86_64 #1
Jan  9 17:54:24 dhcp47-155 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Jan  9 17:54:24 dhcp47-155 kernel: ffff880036283ec0 000000003402a8ea ffff88021aaebae8 ffffffff816860cc
Jan  9 17:54:24 dhcp47-155 kernel: ffff88021aaebb78 ffffffff81681077 ffffffff812ae87b ffff8800bb68efa0
Jan  9 17:54:24 dhcp47-155 kernel: ffff8800bb68efb8 ffffffff00000206 fbfeebff00000000 000000000000000d
Jan  9 17:54:24 dhcp47-155 kernel: Call Trace:
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff816860cc>] dump_stack+0x19/0x1b
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81681077>] dump_header+0x8e/0x225
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff812ae87b>] ? cred_has_capability+0x6b/0x120
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8113cc03>] ? proc_do_uts_string+0x123/0x130
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8118476e>] oom_kill_process+0x24e/0x3c0
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81184fa6>] out_of_memory+0x4b6/0x4f0
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81681b80>] __alloc_pages_slowpath+0x5d7/0x725
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8118b0b5>] __alloc_pages_nodemask+0x405/0x420
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff810831cd>] copy_process+0x1dd/0x1960
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81084b01>] do_fork+0x91/0x2c0
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8121cc80>] ? get_unused_fd_flags+0x30/0x40
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8121ccd7>] ? __fd_install+0x47/0x60
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81084db6>] SyS_clone+0x16/0x20
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81696a19>] stub_clone+0x69/0x90
Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff816966c9>] ? system_call_fastpath+0x16/0x1b
Jan  9 17:54:24 dhcp47-155 kernel: Mem-Info:
Jan  9 17:54:24 dhcp47-155 kernel: active_anon:1576807 inactive_anon:311606 isolated_anon:0#012 active_file:376 inactive_file:1858 isolated_file:0#012 unevictable:8617 dirty:0 writeback:6 unstable:0#012 slab_reclaimable:11488 slab_unreclaimable:18027#012 mapped:12310 shmem:93443 pagetables:9098 bounce:0#012 free:25869 free_pcp:19 free_cma:0
Jan  9 17:54:24 dhcp47-155 kernel: Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan  9 17:54:24 dhcp47-155 kernel: lowmem_reserve[]: 0 2816 7807 7807
Jan  9 17:54:24 dhcp47-155 kernel: Node 0 DMA32 free:44460kB min:24324kB low:30404kB high:36484kB active_anon:2178252kB inactive_anon:558164kB active_file:916kB inactive_file:4532kB unevictable:10208kB isolated(anon):0kB isolated(file):0kB present:3129336kB managed:2884596kB mlocked:10208kB dirty:0kB writeback:12kB mapped:14092kB shmem:114240kB slab_reclaimable:18136kB slab_unreclaimable:24584kB kernel_stack:5616kB pagetables:11996kB unstable:0kB bounce:0kB free_pcp:28kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7473 all_unreclaimable? yes
Jan  9 17:54:28 dhcp47-155 kernel: lowmem_reserve[]: 0 0 4990 4990
Jan  9 17:54:28 dhcp47-155 kernel: Node 0 Normal free:43108kB min:43120kB low:53900kB high:64680kB active_anon:4128976kB inactive_anon:688260kB active_file:588kB inactive_file:2900kB unevictable:24260kB isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5110044kB mlocked:24260kB dirty:0kB writeback:12kB mapped:35148kB shmem:259532kB slab_reclaimable:27816kB slab_unreclaimable:47524kB kernel_stack:13424kB pagetables:24396kB unstable:0kB bounce:0kB free_pcp:48kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5880 all_unreclaimable? yes
Jan  9 17:54:28 dhcp47-155 kernel: lowmem_reserve[]: 0 0 0 0
Jan  9 17:54:28 dhcp47-155 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jan  9 17:54:28 dhcp47-155 kernel: Node 0 DMA32: 5017*4kB (UE) 3064*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44580kB
Jan  9 17:54:28 dhcp47-155 kernel: Node 0 Normal: 10808*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43232kB
Jan  9 17:54:28 dhcp47-155 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan  9 17:54:28 dhcp47-155 kernel: 147527 total pagecache pages
Jan  9 17:54:28 dhcp47-155 kernel: 49426 pages in swap cache
Jan  9 17:54:28 dhcp47-155 kernel: Swap cache stats: add 1417257, delete 1367831, find 6662219/6739670
Jan  9 17:54:28 dhcp47-155 kernel: Free swap  = 0kB
Jan  9 17:54:28 dhcp47-155 kernel: Total swap = 2097148kB
Jan  9 17:54:28 dhcp47-155 kernel: 2097052 pages RAM
Jan  9 17:54:28 dhcp47-155 kernel: 0 pages HighMem/MovableOnly
Jan  9 17:54:28 dhcp47-155 kernel: 94415 pages reserved
Jan  9 17:54:28 dhcp47-155 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jan  9 17:54:28 dhcp47-155 kernel: [  698]     0   698    13398     4048      31       58             0 systemd-journal
Jan  9 17:54:28 dhcp47-155 kernel: [  705]     0   705   218768      813      42      637             0 lvmetad
Jan  9 17:54:28 dhcp47-155 kernel: [  723]     0   723    11580      674      23      430         -1000 systemd-udevd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1328]     0  1328    13854      217      27       84         -1000 auditd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1347]   998  1347   132067      472      54     1846             0 polkitd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1348]    81  1348    24617      310      18       62          -900 dbus-daemon
Jan  9 17:54:28 dhcp47-155 kernel: [ 1352]     0  1352     6066      291      15       30             0 systemd-logind
Jan  9 17:54:28 dhcp47-155 kernel: [ 1354]     0  1354     4827      213      14       44             0 irqbalance
Jan  9 17:54:28 dhcp47-155 kernel: [ 1355]     0  1355    53131      264      57      410             0 abrtd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1364]   997  1364    28962      313      27       36             0 chronyd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1367]    32  1367    16263      183      35      118             0 rpcbind
Jan  9 17:54:28 dhcp47-155 kernel: [ 1369]     0  1369    50303      148      37      114             0 gssproxy
Jan  9 17:54:28 dhcp47-155 kernel: [ 1370]     0  1370    82866     4667      82     1779             0 firewalld
Jan  9 17:54:28 dhcp47-155 kernel: [ 1734]     0  1734    28206      147      53     3061             0 dhclient
Jan  9 17:54:28 dhcp47-155 kernel: [ 1828]     0  1828   138288     1830      87     1280             0 tuned
Jan  9 17:54:28 dhcp47-155 kernel: [ 1832]     0  1832    28335      120      11       17             0 rhsmcertd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1847]     0  1847    20617       83      42      188         -1000 sshd
Jan  9 17:54:28 dhcp47-155 kernel: [ 1962]     0  1962    22244      196      42      236             0 master
Jan  9 17:54:28 dhcp47-155 kernel: [ 1964]    89  1964    22287      211      44      235             0 qmgr
Jan  9 17:54:28 dhcp47-155 kernel: [ 2354]     0  2354    31555      246      17      130             0 crond
Jan  9 17:54:28 dhcp47-155 kernel: [ 2362]     0  2362    26978      130       9        9             0 rhnsd
Jan  9 17:54:28 dhcp47-155 kernel: [ 2378]     0  2378    27509      165      10       30             0 agetty
Jan  9 17:54:28 dhcp47-155 kernel: [19586]     0 19586   179084     8344      48        0         -1000 dmeventd
Jan  9 17:54:28 dhcp47-155 kernel: [ 6147]     0  6147   137119     3257     121       91             0 rsyslogd
Jan  9 17:54:28 dhcp47-155 kernel: [14689]     0 14689   243755     3755      91     7687             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [17037]     0 17037   172515    23542     135    11287             0 glusterd
Jan  9 17:54:28 dhcp47-155 kernel: [15978]     0 15978   263238      721     106     8845             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [15998]     0 15998   263238      745     105     8190             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [16018]     0 16018   263238      177     104     8846             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [11397]     0 11397   314084     2379     110     8227             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [11417]     0 11417   330983     2919     111     8186             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [11437]     0 11437   330982     2398     113     8552             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [30763]     0 30763   262627      782     108     9807             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [30782]     0 30782   262626      839     105     9788             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [30801]     0 30801   262626      822     106     8770             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [15232]     0 15232   494829     3141     131     8902             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [14072]     0 14072    49475     5891      62     1220             0 corosync
Jan  9 17:54:28 dhcp47-155 kernel: [14088]     0 14088    33158      442      66      972             0 pacemakerd
Jan  9 17:54:28 dhcp47-155 kernel: [14089]   189 14089    34571     2375      67     1201             0 cib
Jan  9 17:54:28 dhcp47-155 kernel: [14090]     0 14090    34287     1119      69      364             0 stonithd
Jan  9 17:54:28 dhcp47-155 kernel: [14091]     0 14091    26273      459      52      227             0 lrmd
Jan  9 17:54:28 dhcp47-155 kernel: [14092]   189 14092    31731     1286      62      410             0 attrd
Jan  9 17:54:28 dhcp47-155 kernel: [14093]   189 14093    39003     3989      71      182             0 pengine
Jan  9 17:54:28 dhcp47-155 kernel: [14094]   189 14094    46982     4041      81      703             0 crmd
Jan  9 17:54:28 dhcp47-155 kernel: [14099]     0 14099   244360    10766     108      620             0 pcsd
Jan  9 17:54:28 dhcp47-155 kernel: [21690]     0 21690  1008670   407551    1106    55342             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [21709]     0 21709  1005223   346311    1125   128491             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [21736]     0 21736  1010977   307333    1124   163215             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [32553]     0 32553   151450      400      66     6330             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [22781]     0 22781   625453   229961     817    18368             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [ 8894]     0  8894  1480087   336954    1214        0             0 ganesha.nfsd
Jan  9 17:54:28 dhcp47-155 kernel: [27848]     0 27848    35743      602      71        0             0 sshd
Jan  9 17:54:28 dhcp47-155 kernel: [27860]     0 27860    28879      329      14        0             0 bash
Jan  9 17:54:28 dhcp47-155 kernel: [ 9987]    89  9987    22270      472      42        0             0 pickup
Jan  9 17:54:28 dhcp47-155 kernel: [21817]     0 21817    35726      595      72        0             0 sshd
Jan  9 17:54:28 dhcp47-155 kernel: [21894]     0 21894    28879      289      14        0             0 bash
Jan  9 17:54:28 dhcp47-155 kernel: [29371]     0 29371   367307     8841     116        0             0 glusterfsd
Jan  9 17:54:28 dhcp47-155 kernel: [29745]     0 29745    39435     2870      26        0             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [29746]     0 29746   205479    23309     131        0             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [29753]     0 29753    39435     2869      27        0             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [29754]     0 29754   111964    12874      87        0             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [29771]     0 29771    28811      231      13        0             0 S28Quota-root-x
Jan  9 17:54:28 dhcp47-155 kernel: [29777]     0 29777    40032     3400      28        0             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [29779]     0 29779    90002     3539      45        0             0 glusterfs
Jan  9 17:54:28 dhcp47-155 kernel: [29784]     0 29784     9807      135      21      418             0 systemd-udevd
Jan  9 17:54:28 dhcp47-155 kernel: Out of memory: Kill process 21709 (glusterfsd) score 182 or sacrifice child
Jan  9 17:54:28 dhcp47-155 kernel: Killed process 21709 (glusterfsd) total-vm:4020892kB, anon-rss:1384648kB, file-rss:596kB, shmem-rss:0kB
Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] A processor failed, forming new configuration.
Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] Process pause detected for 2433 ms, flushing membership messages.
Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] A new membership (10.70.46.42:2700) was formed. Members
Jan  9 17:54:28 dhcp47-155 corosync[14072]: [QUORUM] Members[4]: 1 2 3 4
Jan  9 17:54:28 dhcp47-155 corosync[14072]: [MAIN  ] Completed service synchronization, ready to provide service.
Jan  9 17:54:28 dhcp47-155 crmd[14094]:  notice: High CPU load detected: 9.650000
Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [ 0+0 records in ]
Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [ 0+0 records out ]
Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [ 0 bytes (0 B) copied, 0.0124017 s, 0.0 kB/s ]
Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [ 0+0 records in ]
Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [ 0+0 records out ]
Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [ 0 bytes (0 B) copied, 0.0972941 s, 0.0 kB/s ]
Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [ 0+0 records in ]
Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [ 0+0 records out ]
Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice: dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [ 0 bytes (0 B) copied, 0.0104596 s, 0.0 kB/s ]
Jan  9 17:54:57 dhcp47-155 crmd[14094]:  notice: High CPU load detected: 8.460000

Actual results:
===============
OOM kill of glusterfsd during continuous add-bricks.

Expected results:
=================
There should not be any OOM kills

Comment 2 Atin Mukherjee 2017-01-09 13:55:48 UTC
(In reply to Prasad Desala from comment #0)
> Description of problem:
> =======================
> OOM kill of glusterfsd is seen during parallel untar + continuous add-bricks.
> 
> Version-Release number of selected component (if applicable):
> 3.8.4-10.el7rhgs.x86_64
> 
> How reproducible:
> =================
> 2/2
> 
> Steps to Reproduce:
> ===================
> 1) Create ganesha cluster and create a distributed-replicate volume.
> 2) Enable nfs-ganesha on the volume with mdcache settings.
> 3) Mount the volume.
> 4) On the mount point, copy linux kernel package and create three empty
> directories say 1,2 and 3
> 5) Now, simultaneously start untarring the kernel package to directories 1,2
> and 3.
> tar -xvf linux-4.4.1.tar -C 1
> tar -xvf linux-4.4.1.tar -C 2
> tar -xvf linux-4.4.1.tar -C 3
> 6) While untar is in-progress, continuously add bricks to the volume.

How many add-brick commands here?

> 
> glusterfsd OOM killed.
> 
> Jan  9 17:54:09 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29592:stderr [ 0
> bytes (0 B) copied, 11.162 s, 0.0 kB/s ]
> Jan  9 17:54:24 dhcp47-155 corosync[14072]: [MAIN  ] Corosync main process
> was not scheduled for 2967.8682 ms (threshold is 1840.0000 ms). Consider
> token timeout increase.
> Jan  9 17:54:24 dhcp47-155 kernel: lrmd invoked oom-killer:
> gfp_mask=0x3000d0, order=2, oom_score_adj=0
> Jan  9 17:54:24 dhcp47-155 kernel: lrmd cpuset=/ mems_allowed=0
> Jan  9 17:54:24 dhcp47-155 kernel: CPU: 1 PID: 14091 Comm: lrmd Not tainted
> 3.10.0-514.2.2.el7.x86_64 #1
> Jan  9 17:54:24 dhcp47-155 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1
> 01/01/2011
> Jan  9 17:54:24 dhcp47-155 kernel: ffff880036283ec0 000000003402a8ea
> ffff88021aaebae8 ffffffff816860cc
> Jan  9 17:54:24 dhcp47-155 kernel: ffff88021aaebb78 ffffffff81681077
> ffffffff812ae87b ffff8800bb68efa0
> Jan  9 17:54:24 dhcp47-155 kernel: ffff8800bb68efb8 ffffffff00000206
> fbfeebff00000000 000000000000000d
> Jan  9 17:54:24 dhcp47-155 kernel: Call Trace:
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff816860cc>] dump_stack+0x19/0x1b
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81681077>]
> dump_header+0x8e/0x225
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff812ae87b>] ?
> cred_has_capability+0x6b/0x120
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8113cc03>] ?
> proc_do_uts_string+0x123/0x130
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8118476e>]
> oom_kill_process+0x24e/0x3c0
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff810937ee>] ?
> has_capability_noaudit+0x1e/0x30
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81184fa6>]
> out_of_memory+0x4b6/0x4f0
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81681b80>]
> __alloc_pages_slowpath+0x5d7/0x725
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8118b0b5>]
> __alloc_pages_nodemask+0x405/0x420
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff810831cd>]
> copy_process+0x1dd/0x1960
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81084b01>] do_fork+0x91/0x2c0
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8121cc80>] ?
> get_unused_fd_flags+0x30/0x40
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8121ccd7>] ?
> __fd_install+0x47/0x60
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81084db6>] SyS_clone+0x16/0x20
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81696a19>] stub_clone+0x69/0x90
> Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff816966c9>] ?
> system_call_fastpath+0x16/0x1b
> Jan  9 17:54:24 dhcp47-155 kernel: Mem-Info:
> Jan  9 17:54:24 dhcp47-155 kernel: active_anon:1576807 inactive_anon:311606
> isolated_anon:0#012 active_file:376 inactive_file:1858 isolated_file:0#012
> unevictable:8617 dirty:0 writeback:6 unstable:0#012 slab_reclaimable:11488
> slab_unreclaimable:18027#012 mapped:12310 shmem:93443 pagetables:9098
> bounce:0#012 free:25869 free_pcp:19 free_cma:0
> Jan  9 17:54:24 dhcp47-155 kernel: Node 0 DMA free:15908kB min:132kB
> low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB
> mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB
> local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? yes
> Jan  9 17:54:24 dhcp47-155 kernel: lowmem_reserve[]: 0 2816 7807 7807
> Jan  9 17:54:24 dhcp47-155 kernel: Node 0 DMA32 free:44460kB min:24324kB
> low:30404kB high:36484kB active_anon:2178252kB inactive_anon:558164kB
> active_file:916kB inactive_file:4532kB unevictable:10208kB
> isolated(anon):0kB isolated(file):0kB present:3129336kB managed:2884596kB
> mlocked:10208kB dirty:0kB writeback:12kB mapped:14092kB shmem:114240kB
> slab_reclaimable:18136kB slab_unreclaimable:24584kB kernel_stack:5616kB
> pagetables:11996kB unstable:0kB bounce:0kB free_pcp:28kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:7473 all_unreclaimable? yes
> Jan  9 17:54:28 dhcp47-155 kernel: lowmem_reserve[]: 0 0 4990 4990
> Jan  9 17:54:28 dhcp47-155 kernel: Node 0 Normal free:43108kB min:43120kB
> low:53900kB high:64680kB active_anon:4128976kB inactive_anon:688260kB
> active_file:588kB inactive_file:2900kB unevictable:24260kB
> isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5110044kB
> mlocked:24260kB dirty:0kB writeback:12kB mapped:35148kB shmem:259532kB
> slab_reclaimable:27816kB slab_unreclaimable:47524kB kernel_stack:13424kB
> pagetables:24396kB unstable:0kB bounce:0kB free_pcp:48kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:5880 all_unreclaimable? yes
> Jan  9 17:54:28 dhcp47-155 kernel: lowmem_reserve[]: 0 0 0 0
> Jan  9 17:54:28 dhcp47-155 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB
> (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M)
> 3*4096kB (M) = 15908kB
> Jan  9 17:54:28 dhcp47-155 kernel: Node 0 DMA32: 5017*4kB (UE) 3064*8kB (UE)
> 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 44580kB
> Jan  9 17:54:28 dhcp47-155 kernel: Node 0 Normal: 10808*4kB (UM) 0*8kB
> 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 43232kB
> Jan  9 17:54:28 dhcp47-155 kernel: Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=2048kB
> Jan  9 17:54:28 dhcp47-155 kernel: 147527 total pagecache pages
> Jan  9 17:54:28 dhcp47-155 kernel: 49426 pages in swap cache
> Jan  9 17:54:28 dhcp47-155 kernel: Swap cache stats: add 1417257, delete
> 1367831, find 6662219/6739670
> Jan  9 17:54:28 dhcp47-155 kernel: Free swap  = 0kB
> Jan  9 17:54:28 dhcp47-155 kernel: Total swap = 2097148kB
> Jan  9 17:54:28 dhcp47-155 kernel: 2097052 pages RAM
> Jan  9 17:54:28 dhcp47-155 kernel: 0 pages HighMem/MovableOnly
> Jan  9 17:54:28 dhcp47-155 kernel: 94415 pages reserved
> Jan  9 17:54:28 dhcp47-155 kernel: [ pid ]   uid  tgid total_vm      rss
> nr_ptes swapents oom_score_adj name
> Jan  9 17:54:28 dhcp47-155 kernel: [  698]     0   698    13398     4048    
> 31       58             0 systemd-journal
> Jan  9 17:54:28 dhcp47-155 kernel: [  705]     0   705   218768      813    
> 42      637             0 lvmetad
> Jan  9 17:54:28 dhcp47-155 kernel: [  723]     0   723    11580      674    
> 23      430         -1000 systemd-udevd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1328]     0  1328    13854      217    
> 27       84         -1000 auditd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1347]   998  1347   132067      472    
> 54     1846             0 polkitd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1348]    81  1348    24617      310    
> 18       62          -900 dbus-daemon
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1352]     0  1352     6066      291    
> 15       30             0 systemd-logind
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1354]     0  1354     4827      213    
> 14       44             0 irqbalance
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1355]     0  1355    53131      264    
> 57      410             0 abrtd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1364]   997  1364    28962      313    
> 27       36             0 chronyd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1367]    32  1367    16263      183    
> 35      118             0 rpcbind
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1369]     0  1369    50303      148    
> 37      114             0 gssproxy
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1370]     0  1370    82866     4667    
> 82     1779             0 firewalld
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1734]     0  1734    28206      147    
> 53     3061             0 dhclient
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1828]     0  1828   138288     1830    
> 87     1280             0 tuned
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1832]     0  1832    28335      120    
> 11       17             0 rhsmcertd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1847]     0  1847    20617       83    
> 42      188         -1000 sshd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1962]     0  1962    22244      196    
> 42      236             0 master
> Jan  9 17:54:28 dhcp47-155 kernel: [ 1964]    89  1964    22287      211    
> 44      235             0 qmgr
> Jan  9 17:54:28 dhcp47-155 kernel: [ 2354]     0  2354    31555      246    
> 17      130             0 crond
> Jan  9 17:54:28 dhcp47-155 kernel: [ 2362]     0  2362    26978      130    
> 9        9             0 rhnsd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 2378]     0  2378    27509      165    
> 10       30             0 agetty
> Jan  9 17:54:28 dhcp47-155 kernel: [19586]     0 19586   179084     8344    
> 48        0         -1000 dmeventd
> Jan  9 17:54:28 dhcp47-155 kernel: [ 6147]     0  6147   137119     3257    
> 121       91             0 rsyslogd
> Jan  9 17:54:28 dhcp47-155 kernel: [14689]     0 14689   243755     3755    
> 91     7687             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [17037]     0 17037   172515    23542    
> 135    11287             0 glusterd
> Jan  9 17:54:28 dhcp47-155 kernel: [15978]     0 15978   263238      721    
> 106     8845             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [15998]     0 15998   263238      745    
> 105     8190             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [16018]     0 16018   263238      177    
> 104     8846             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [11397]     0 11397   314084     2379    
> 110     8227             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [11417]     0 11417   330983     2919    
> 111     8186             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [11437]     0 11437   330982     2398    
> 113     8552             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [30763]     0 30763   262627      782    
> 108     9807             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [30782]     0 30782   262626      839    
> 105     9788             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [30801]     0 30801   262626      822    
> 106     8770             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [15232]     0 15232   494829     3141    
> 131     8902             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [14072]     0 14072    49475     5891    
> 62     1220             0 corosync
> Jan  9 17:54:28 dhcp47-155 kernel: [14088]     0 14088    33158      442    
> 66      972             0 pacemakerd
> Jan  9 17:54:28 dhcp47-155 kernel: [14089]   189 14089    34571     2375    
> 67     1201             0 cib
> Jan  9 17:54:28 dhcp47-155 kernel: [14090]     0 14090    34287     1119    
> 69      364             0 stonithd
> Jan  9 17:54:28 dhcp47-155 kernel: [14091]     0 14091    26273      459    
> 52      227             0 lrmd
> Jan  9 17:54:28 dhcp47-155 kernel: [14092]   189 14092    31731     1286    
> 62      410             0 attrd
> Jan  9 17:54:28 dhcp47-155 kernel: [14093]   189 14093    39003     3989    
> 71      182             0 pengine
> Jan  9 17:54:28 dhcp47-155 kernel: [14094]   189 14094    46982     4041    
> 81      703             0 crmd
> Jan  9 17:54:28 dhcp47-155 kernel: [14099]     0 14099   244360    10766    
> 108      620             0 pcsd
> Jan  9 17:54:28 dhcp47-155 kernel: [21690]     0 21690  1008670   407551   
> 1106    55342             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [21709]     0 21709  1005223   346311   
> 1125   128491             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [21736]     0 21736  1010977   307333   
> 1124   163215             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [32553]     0 32553   151450      400    
> 66     6330             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [22781]     0 22781   625453   229961    
> 817    18368             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [ 8894]     0  8894  1480087   336954   
> 1214        0             0 ganesha.nfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [27848]     0 27848    35743      602    
> 71        0             0 sshd
> Jan  9 17:54:28 dhcp47-155 kernel: [27860]     0 27860    28879      329    
> 14        0             0 bash
> Jan  9 17:54:28 dhcp47-155 kernel: [ 9987]    89  9987    22270      472    
> 42        0             0 pickup
> Jan  9 17:54:28 dhcp47-155 kernel: [21817]     0 21817    35726      595    
> 72        0             0 sshd
> Jan  9 17:54:28 dhcp47-155 kernel: [21894]     0 21894    28879      289    
> 14        0             0 bash
> Jan  9 17:54:28 dhcp47-155 kernel: [29371]     0 29371   367307     8841    
> 116        0             0 glusterfsd
> Jan  9 17:54:28 dhcp47-155 kernel: [29745]     0 29745    39435     2870    
> 26        0             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [29746]     0 29746   205479    23309    
> 131        0             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [29753]     0 29753    39435     2869    
> 27        0             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [29754]     0 29754   111964    12874    
> 87        0             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [29771]     0 29771    28811      231    
> 13        0             0 S28Quota-root-x
> Jan  9 17:54:28 dhcp47-155 kernel: [29777]     0 29777    40032     3400    
> 28        0             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [29779]     0 29779    90002     3539    
> 45        0             0 glusterfs
> Jan  9 17:54:28 dhcp47-155 kernel: [29784]     0 29784     9807      135    
> 21      418             0 systemd-udevd
> Jan  9 17:54:28 dhcp47-155 kernel: Out of memory: Kill process 21709
> (glusterfsd) score 182 or sacrifice child
> Jan  9 17:54:28 dhcp47-155 kernel: Killed process 21709 (glusterfsd)
> total-vm:4020892kB, anon-rss:1384648kB, file-rss:596kB, shmem-rss:0kB
> Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] A processor failed,
> forming new configuration.
> Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] Process pause detected
> for 2433 ms, flushing membership messages.
> Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] A new membership
> (10.70.46.42:2700) was formed. Members
> Jan  9 17:54:28 dhcp47-155 corosync[14072]: [QUORUM] Members[4]: 1 2 3 4
> Jan  9 17:54:28 dhcp47-155 corosync[14072]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jan  9 17:54:28 dhcp47-155 crmd[14094]:  notice: High CPU load detected:
> 9.650000
> Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [
> 0+0 records in ]
> Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [
> 0+0 records out ]
> Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [ 0
> bytes (0 B) copied, 0.0124017 s, 0.0 kB/s ]
> Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [
> 0+0 records in ]
> Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [
> 0+0 records out ]
> Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [ 0
> bytes (0 B) copied, 0.0972941 s, 0.0 kB/s ]
> Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [
> 0+0 records in ]
> Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [
> 0+0 records out ]
> Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice:
> dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [ 0
> bytes (0 B) copied, 0.0104596 s, 0.0 kB/s ]
> Jan  9 17:54:57 dhcp47-155 crmd[14094]:  notice: High CPU load detected:
> 8.460000
> 
> Actual results:
> ===============
> OOM kill of glusterfsd during continuous add-bricks.
> 
> Expected results:
> =================
> There should not be any OOM kills

Comment 4 Prasad Desala 2017-01-09 14:06:11 UTC
(In reply to Atin Mukherjee from comment #2)
> (In reply to Prasad Desala from comment #0)
> > Description of problem:
> > =======================
> > OOM kill of glusterfsd is seen during parallel untar + continuous add-bricks.
> > 
> > Version-Release number of selected component (if applicable):
> > 3.8.4-10.el7rhgs.x86_64
> > 
> > How reproducible:
> > =================
> > 2/2
> > 
> > Steps to Reproduce:
> > ===================
> > 1) Create ganesha cluster and create a distributed-replicate volume.
> > 2) Enable nfs-ganesha on the volume with mdcache settings.
> > 3) Mount the volume.
> > 4) On the mount point, copy linux kernel package and create three empty
> > directories say 1,2 and 3
> > 5) Now, simultaneously start untarring the kernel package to directories 1,2
> > and 3.
> > tar -xvf linux-4.4.1.tar -C 1
> > tar -xvf linux-4.4.1.tar -C 2
> > tar -xvf linux-4.4.1.tar -C 3
> > 6) While untar is in-progress, continuously add bricks to the volume.
> 
> How many add-brick commands here?

Initially, the volume is 6x2. I have started parallel untar and started adding bricks.

Below are the add-brick commands that were executed,

gluster v add-brick distrep replica 2 10.70.46.42:/bricks/brick3/b3 10.70.46.101:/bricks/brick3/b3
gluster v add-brick distrep replica 2 10.70.47.155:/bricks/brick3/b3 10.70.47.167:/bricks/brick3/b3
gluster v add-brick distrep replica 2 10.70.46.42:/bricks/brick4/b4 10.70.46.101:/bricks/brick4/b4
gluster v add-brick distrep replica 2 10.70.47.155:/bricks/brick4/b4 10.70.47.167:/bricks/brick4/b4

> > 
> > glusterfsd OOM killed.
> > 
> > Jan  9 17:54:09 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29592:stderr [ 0
> > bytes (0 B) copied, 11.162 s, 0.0 kB/s ]
> > Jan  9 17:54:24 dhcp47-155 corosync[14072]: [MAIN  ] Corosync main process
> > was not scheduled for 2967.8682 ms (threshold is 1840.0000 ms). Consider
> > token timeout increase.
> > Jan  9 17:54:24 dhcp47-155 kernel: lrmd invoked oom-killer:
> > gfp_mask=0x3000d0, order=2, oom_score_adj=0
> > Jan  9 17:54:24 dhcp47-155 kernel: lrmd cpuset=/ mems_allowed=0
> > Jan  9 17:54:24 dhcp47-155 kernel: CPU: 1 PID: 14091 Comm: lrmd Not tainted
> > 3.10.0-514.2.2.el7.x86_64 #1
> > Jan  9 17:54:24 dhcp47-155 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1
> > 01/01/2011
> > Jan  9 17:54:24 dhcp47-155 kernel: ffff880036283ec0 000000003402a8ea
> > ffff88021aaebae8 ffffffff816860cc
> > Jan  9 17:54:24 dhcp47-155 kernel: ffff88021aaebb78 ffffffff81681077
> > ffffffff812ae87b ffff8800bb68efa0
> > Jan  9 17:54:24 dhcp47-155 kernel: ffff8800bb68efb8 ffffffff00000206
> > fbfeebff00000000 000000000000000d
> > Jan  9 17:54:24 dhcp47-155 kernel: Call Trace:
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff816860cc>] dump_stack+0x19/0x1b
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81681077>]
> > dump_header+0x8e/0x225
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff812ae87b>] ?
> > cred_has_capability+0x6b/0x120
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8113cc03>] ?
> > proc_do_uts_string+0x123/0x130
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8118476e>]
> > oom_kill_process+0x24e/0x3c0
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff810937ee>] ?
> > has_capability_noaudit+0x1e/0x30
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81184fa6>]
> > out_of_memory+0x4b6/0x4f0
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81681b80>]
> > __alloc_pages_slowpath+0x5d7/0x725
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8118b0b5>]
> > __alloc_pages_nodemask+0x405/0x420
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff810831cd>]
> > copy_process+0x1dd/0x1960
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81084b01>] do_fork+0x91/0x2c0
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8121cc80>] ?
> > get_unused_fd_flags+0x30/0x40
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff8121ccd7>] ?
> > __fd_install+0x47/0x60
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81084db6>] SyS_clone+0x16/0x20
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff81696a19>] stub_clone+0x69/0x90
> > Jan  9 17:54:24 dhcp47-155 kernel: [<ffffffff816966c9>] ?
> > system_call_fastpath+0x16/0x1b
> > Jan  9 17:54:24 dhcp47-155 kernel: Mem-Info:
> > Jan  9 17:54:24 dhcp47-155 kernel: active_anon:1576807 inactive_anon:311606
> > isolated_anon:0#012 active_file:376 inactive_file:1858 isolated_file:0#012
> > unevictable:8617 dirty:0 writeback:6 unstable:0#012 slab_reclaimable:11488
> > slab_unreclaimable:18027#012 mapped:12310 shmem:93443 pagetables:9098
> > bounce:0#012 free:25869 free_pcp:19 free_cma:0
> > Jan  9 17:54:24 dhcp47-155 kernel: Node 0 DMA free:15908kB min:132kB
> > low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB
> > inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> > present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB
> > mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> > kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB
> > local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> > all_unreclaimable? yes
> > Jan  9 17:54:24 dhcp47-155 kernel: lowmem_reserve[]: 0 2816 7807 7807
> > Jan  9 17:54:24 dhcp47-155 kernel: Node 0 DMA32 free:44460kB min:24324kB
> > low:30404kB high:36484kB active_anon:2178252kB inactive_anon:558164kB
> > active_file:916kB inactive_file:4532kB unevictable:10208kB
> > isolated(anon):0kB isolated(file):0kB present:3129336kB managed:2884596kB
> > mlocked:10208kB dirty:0kB writeback:12kB mapped:14092kB shmem:114240kB
> > slab_reclaimable:18136kB slab_unreclaimable:24584kB kernel_stack:5616kB
> > pagetables:11996kB unstable:0kB bounce:0kB free_pcp:28kB local_pcp:0kB
> > free_cma:0kB writeback_tmp:0kB pages_scanned:7473 all_unreclaimable? yes
> > Jan  9 17:54:28 dhcp47-155 kernel: lowmem_reserve[]: 0 0 4990 4990
> > Jan  9 17:54:28 dhcp47-155 kernel: Node 0 Normal free:43108kB min:43120kB
> > low:53900kB high:64680kB active_anon:4128976kB inactive_anon:688260kB
> > active_file:588kB inactive_file:2900kB unevictable:24260kB
> > isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5110044kB
> > mlocked:24260kB dirty:0kB writeback:12kB mapped:35148kB shmem:259532kB
> > slab_reclaimable:27816kB slab_unreclaimable:47524kB kernel_stack:13424kB
> > pagetables:24396kB unstable:0kB bounce:0kB free_pcp:48kB local_pcp:0kB
> > free_cma:0kB writeback_tmp:0kB pages_scanned:5880 all_unreclaimable? yes
> > Jan  9 17:54:28 dhcp47-155 kernel: lowmem_reserve[]: 0 0 0 0
> > Jan  9 17:54:28 dhcp47-155 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB
> > (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M)
> > 3*4096kB (M) = 15908kB
> > Jan  9 17:54:28 dhcp47-155 kernel: Node 0 DMA32: 5017*4kB (UE) 3064*8kB (UE)
> > 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> > 44580kB
> > Jan  9 17:54:28 dhcp47-155 kernel: Node 0 Normal: 10808*4kB (UM) 0*8kB
> > 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> > 43232kB
> > Jan  9 17:54:28 dhcp47-155 kernel: Node 0 hugepages_total=0 hugepages_free=0
> > hugepages_surp=0 hugepages_size=2048kB
> > Jan  9 17:54:28 dhcp47-155 kernel: 147527 total pagecache pages
> > Jan  9 17:54:28 dhcp47-155 kernel: 49426 pages in swap cache
> > Jan  9 17:54:28 dhcp47-155 kernel: Swap cache stats: add 1417257, delete
> > 1367831, find 6662219/6739670
> > Jan  9 17:54:28 dhcp47-155 kernel: Free swap  = 0kB
> > Jan  9 17:54:28 dhcp47-155 kernel: Total swap = 2097148kB
> > Jan  9 17:54:28 dhcp47-155 kernel: 2097052 pages RAM
> > Jan  9 17:54:28 dhcp47-155 kernel: 0 pages HighMem/MovableOnly
> > Jan  9 17:54:28 dhcp47-155 kernel: 94415 pages reserved
> > Jan  9 17:54:28 dhcp47-155 kernel: [ pid ]   uid  tgid total_vm      rss
> > nr_ptes swapents oom_score_adj name
> > Jan  9 17:54:28 dhcp47-155 kernel: [  698]     0   698    13398     4048    
> > 31       58             0 systemd-journal
> > Jan  9 17:54:28 dhcp47-155 kernel: [  705]     0   705   218768      813    
> > 42      637             0 lvmetad
> > Jan  9 17:54:28 dhcp47-155 kernel: [  723]     0   723    11580      674    
> > 23      430         -1000 systemd-udevd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1328]     0  1328    13854      217    
> > 27       84         -1000 auditd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1347]   998  1347   132067      472    
> > 54     1846             0 polkitd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1348]    81  1348    24617      310    
> > 18       62          -900 dbus-daemon
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1352]     0  1352     6066      291    
> > 15       30             0 systemd-logind
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1354]     0  1354     4827      213    
> > 14       44             0 irqbalance
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1355]     0  1355    53131      264    
> > 57      410             0 abrtd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1364]   997  1364    28962      313    
> > 27       36             0 chronyd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1367]    32  1367    16263      183    
> > 35      118             0 rpcbind
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1369]     0  1369    50303      148    
> > 37      114             0 gssproxy
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1370]     0  1370    82866     4667    
> > 82     1779             0 firewalld
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1734]     0  1734    28206      147    
> > 53     3061             0 dhclient
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1828]     0  1828   138288     1830    
> > 87     1280             0 tuned
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1832]     0  1832    28335      120    
> > 11       17             0 rhsmcertd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1847]     0  1847    20617       83    
> > 42      188         -1000 sshd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1962]     0  1962    22244      196    
> > 42      236             0 master
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 1964]    89  1964    22287      211    
> > 44      235             0 qmgr
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 2354]     0  2354    31555      246    
> > 17      130             0 crond
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 2362]     0  2362    26978      130    
> > 9        9             0 rhnsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 2378]     0  2378    27509      165    
> > 10       30             0 agetty
> > Jan  9 17:54:28 dhcp47-155 kernel: [19586]     0 19586   179084     8344    
> > 48        0         -1000 dmeventd
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 6147]     0  6147   137119     3257    
> > 121       91             0 rsyslogd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14689]     0 14689   243755     3755    
> > 91     7687             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [17037]     0 17037   172515    23542    
> > 135    11287             0 glusterd
> > Jan  9 17:54:28 dhcp47-155 kernel: [15978]     0 15978   263238      721    
> > 106     8845             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [15998]     0 15998   263238      745    
> > 105     8190             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [16018]     0 16018   263238      177    
> > 104     8846             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [11397]     0 11397   314084     2379    
> > 110     8227             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [11417]     0 11417   330983     2919    
> > 111     8186             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [11437]     0 11437   330982     2398    
> > 113     8552             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [30763]     0 30763   262627      782    
> > 108     9807             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [30782]     0 30782   262626      839    
> > 105     9788             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [30801]     0 30801   262626      822    
> > 106     8770             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [15232]     0 15232   494829     3141    
> > 131     8902             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14072]     0 14072    49475     5891    
> > 62     1220             0 corosync
> > Jan  9 17:54:28 dhcp47-155 kernel: [14088]     0 14088    33158      442    
> > 66      972             0 pacemakerd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14089]   189 14089    34571     2375    
> > 67     1201             0 cib
> > Jan  9 17:54:28 dhcp47-155 kernel: [14090]     0 14090    34287     1119    
> > 69      364             0 stonithd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14091]     0 14091    26273      459    
> > 52      227             0 lrmd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14092]   189 14092    31731     1286    
> > 62      410             0 attrd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14093]   189 14093    39003     3989    
> > 71      182             0 pengine
> > Jan  9 17:54:28 dhcp47-155 kernel: [14094]   189 14094    46982     4041    
> > 81      703             0 crmd
> > Jan  9 17:54:28 dhcp47-155 kernel: [14099]     0 14099   244360    10766    
> > 108      620             0 pcsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [21690]     0 21690  1008670   407551   
> > 1106    55342             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [21709]     0 21709  1005223   346311   
> > 1125   128491             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [21736]     0 21736  1010977   307333   
> > 1124   163215             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [32553]     0 32553   151450      400    
> > 66     6330             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [22781]     0 22781   625453   229961    
> > 817    18368             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 8894]     0  8894  1480087   336954   
> > 1214        0             0 ganesha.nfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [27848]     0 27848    35743      602    
> > 71        0             0 sshd
> > Jan  9 17:54:28 dhcp47-155 kernel: [27860]     0 27860    28879      329    
> > 14        0             0 bash
> > Jan  9 17:54:28 dhcp47-155 kernel: [ 9987]    89  9987    22270      472    
> > 42        0             0 pickup
> > Jan  9 17:54:28 dhcp47-155 kernel: [21817]     0 21817    35726      595    
> > 72        0             0 sshd
> > Jan  9 17:54:28 dhcp47-155 kernel: [21894]     0 21894    28879      289    
> > 14        0             0 bash
> > Jan  9 17:54:28 dhcp47-155 kernel: [29371]     0 29371   367307     8841    
> > 116        0             0 glusterfsd
> > Jan  9 17:54:28 dhcp47-155 kernel: [29745]     0 29745    39435     2870    
> > 26        0             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [29746]     0 29746   205479    23309    
> > 131        0             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [29753]     0 29753    39435     2869    
> > 27        0             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [29754]     0 29754   111964    12874    
> > 87        0             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [29771]     0 29771    28811      231    
> > 13        0             0 S28Quota-root-x
> > Jan  9 17:54:28 dhcp47-155 kernel: [29777]     0 29777    40032     3400    
> > 28        0             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [29779]     0 29779    90002     3539    
> > 45        0             0 glusterfs
> > Jan  9 17:54:28 dhcp47-155 kernel: [29784]     0 29784     9807      135    
> > 21      418             0 systemd-udevd
> > Jan  9 17:54:28 dhcp47-155 kernel: Out of memory: Kill process 21709
> > (glusterfsd) score 182 or sacrifice child
> > Jan  9 17:54:28 dhcp47-155 kernel: Killed process 21709 (glusterfsd)
> > total-vm:4020892kB, anon-rss:1384648kB, file-rss:596kB, shmem-rss:0kB
> > Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] A processor failed,
> > forming new configuration.
> > Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] Process pause detected
> > for 2433 ms, flushing membership messages.
> > Jan  9 17:54:28 dhcp47-155 corosync[14072]: [TOTEM ] A new membership
> > (10.70.46.42:2700) was formed. Members
> > Jan  9 17:54:28 dhcp47-155 corosync[14072]: [QUORUM] Members[4]: 1 2 3 4
> > Jan  9 17:54:28 dhcp47-155 corosync[14072]: [MAIN  ] Completed service
> > synchronization, ready to provide service.
> > Jan  9 17:54:28 dhcp47-155 crmd[14094]:  notice: High CPU load detected:
> > 9.650000
> > Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [
> > 0+0 records in ]
> > Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [
> > 0+0 records out ]
> > Jan  9 17:54:29 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:29839:stderr [ 0
> > bytes (0 B) copied, 0.0124017 s, 0.0 kB/s ]
> > Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [
> > 0+0 records in ]
> > Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [
> > 0+0 records out ]
> > Jan  9 17:54:40 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30094:stderr [ 0
> > bytes (0 B) copied, 0.0972941 s, 0.0 kB/s ]
> > Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [
> > 0+0 records in ]
> > Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [
> > 0+0 records out ]
> > Jan  9 17:54:50 dhcp47-155 lrmd[14091]:  notice:
> > dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:30200:stderr [ 0
> > bytes (0 B) copied, 0.0104596 s, 0.0 kB/s ]
> > Jan  9 17:54:57 dhcp47-155 crmd[14094]:  notice: High CPU load detected:
> > 8.460000
> > 
> > Actual results:
> > ===============
> > OOM kill of glusterfsd during continuous add-bricks.
> > 
> > Expected results:
> > =================
> > There should not be any OOM kills

Comment 5 Raghavendra G 2017-01-10 05:59:17 UTC
Prasad,

Some questions:

* Is it possible to get statedump of the process which crashed (I assume its mount process)?

* Is it just add-brick or were you also running rebalance? In other words was there a data migration and/or fix-layout being run?

Comment 6 Raghavendra G 2017-01-10 06:01:48 UTC
There is a patch [1] which fixes leaks in inprogress/migration_check tasks. However, this codepaths get executed in mount process only if there is data migration going on files on which I/O is done.

[1] http://review.gluster.org/16308

Comment 7 Raghavendra G 2017-01-10 06:16:23 UTC
<raghuhg> Prasad, ping
<Prasad> raghuhg, pong
<raghuhg> on bz 1411329,
<raghuhg> is there a statedump you collected?
<raghuhg> was it the mount process which crashed?
<Prasad> nope, its a brick process
<Prasad> for the second question.. I was doing only add-bricks
<raghuhg> ok. If its brick process, my comment is invalid
<Prasad> hmm
<raghuhg> Is it possible to collect statedumps?
<raghuhg> you can possibly get statedumps of other bricks if they are still up
<Prasad> yes, but the system is not in the same state and I have done few add-brick/remove-brick opertions.. will it help if I take the state dumps of brick processes now ?
<raghuhg> its fine. 
<raghuhg> how is the memory consumption of other bricks?
<raghuhg> are they too consuming too much memory?
<Prasad> ok, I will collect it now 
<Prasad> I think so..
<raghuhg> also get mem consumption of other bricks
<raghuhg> and the brick that crashed, was it a new brick or old brick?
<Prasad> raghuhg, it was an old brick
<raghuhg> ok. If possible also post the sequence of add/remove bricks done
<raghuhg> it'll help us to figure out older bricks
<raghuhg> If one brick crashed, its more likely that other old bricks also consume more memory
<Prasad> yes, I have updted the add-brick sequence in the BZ, comment 4
<raghuhg> (of course based on I/O is happening on the files present on the brick)
<raghuhg> Prasad, you didn't do any add/remove-brick after that?
<raghuhg> Comment 4
<Prasad> nope, I hit this issue while I was adding bricks
<raghuhg> ok

Comment 9 Raghavendra G 2017-01-10 10:34:32 UTC
There seems to be a dict leak in brick process:

pool-name=glusterfs:data_t
hot-count=16384
cold-count=0
padded_sizeof=92
alloc-count=13191046
max-alloc=16384
pool-misses=11984577
cur-stdalloc=276176
max-stdalloc=276228
-----=-----
pool-name=glusterfs:data_pair_t
hot-count=481
cold-count=15903
padded_sizeof=68
alloc-count=11089148
max-alloc=724
pool-misses=0
cur-stdalloc=0
max-stdalloc=0
-----=-----
pool-name=glusterfs:dict_t
hot-count=4096
cold-count=0
padded_sizeof=172
alloc-count=5819264
max-alloc=4096
pool-misses=5717508
cur-stdalloc=287995
max-stdalloc=287998

Comment 10 Raghavendra G 2017-01-10 10:35:38 UTC
Note that there were no fops in progress:

pool-name=glusterfs:rpcsvc_request_t
hot-count=0
cold-count=8
padded_sizeof=2820
alloc-count=0
max-alloc=0
pool-misses=0
cur-stdalloc=0
max-stdalloc=0

Comment 11 Raghavendra G 2017-01-10 10:48:04 UTC
There seems to be some leaks of memory allocated with memdup:

[protocol/server.distrep-server - usage-type gf_common_mt_memdup memusage]
size=125110
num_allocs=6761
max_size=130591
max_num_allocs=6845
total_allocs=4002730

Went through server_setvolume. But couldn't find any obvious leaks. May be I should go through code once again

Comment 12 Raghavendra G 2017-01-10 10:55:25 UTC
Marker and Upcall are worst offenders with mem-pool usage. Both accounting to a total memory of ~(23 + 47) = 70MB

[features/marker.distrep-marker - usage-type gf_common_mt_mem_pool memusage]
size=24820556
num_allocs=269788
max_size=24821004
max_num_allocs=269792
total_allocs=4025204
--
[features/upcall.distrep-upcall - usage-type gf_common_mt_mem_pool memusage]
size=49547464
num_allocs=288157
max_size=49547636
max_num_allocs=288158
total_allocs=1435066

Comment 13 Raghavendra G 2017-01-10 11:02:14 UTC
(In reply to Raghavendra G from comment #12)
> Marker and Upcall are worst offenders with mem-pool usage. Both accounting
> to a total memory of ~(23 + 47) = 70MB
> 
> [features/upcall.distrep-upcall - usage-type gf_common_mt_mem_pool memusage]
> size=49547464
> num_allocs=288157
> max_size=49547636
> max_num_allocs=288158
> total_allocs=1435066

Poornima/Soumya,

Is it possible to take a look at why there is so much of memory consumption?

Comment 14 Raghavendra G 2017-01-10 11:07:02 UTC
(In reply to Raghavendra G from comment #12)
> Marker and Upcall are worst offenders with mem-pool usage. Both accounting
> to a total memory of ~(23 + 47) = 70MB
> 
> [features/marker.distrep-marker - usage-type gf_common_mt_mem_pool memusage]
> size=24820556
> num_allocs=269788
> max_size=24821004
> max_num_allocs=269792
> total_allocs=4025204

Sanoj,

Can you take a look at why marker is consuming so much of memory in mem-pools?

Comment 15 Mohit Agrawal 2017-01-11 04:26:50 UTC
Hi,

As per attached sosrepot, it seems memory footprint is high for glusterfsd and gansesha.nfsd.

@Prasad
  Is it possible for you to share the statedump of ganesha.nfsd also.

Regards
Mohit Agrawal

Comment 16 Sanoj Unnikrishnan 2017-01-11 06:00:55 UTC
In the code we try allocating from required mempool and if it fails, we fall  back to  mem of type gf_common_mt_mem_pool
So if we look at which mempool seems to be full in statedump we see,
From the statedump logs it looks
marker
gf_common_mt_asprintf
upcall
gf_common_mt_char

gf_common_mt_asprintf used in gf_vasprintf. This function expects  caller to free. Will have to investigate all the callers.

So we can suspect either a mem leak or it could be frequent logging that is building up a large queue, which is drained slowly.

looking further into where the logging happens in marker code path and map this with brick logs.

Comment 17 Prasad Desala 2017-01-11 06:05:11 UTC
(In reply to Mohit Agrawal from comment #15)
> Hi,
> 
> As per attached sosrepot, it seems memory footprint is high for glusterfsd
> and gansesha.nfsd.
> 
> @Prasad
>   Is it possible for you to share the statedump of ganesha.nfsd also.

The state dump of ganesha process is not possible at this moment.

Comment 20 Raghavendra G 2017-01-11 07:59:08 UTC
[features/marker.distrep-marker - usage-type gf_common_mt_asprintf memusage]
size=570964
num_allocs=285468
max_size=570970
max_num_allocs=285471
total_allocs=2194054


init_xtime_priv (xlator_t *this, dict_t *options)
{
                ret = gf_asprintf (& (priv->marker_xattr), "%s.%s.%s",
                                   MARKER_XATTR_PREFIX, priv->volume_uuid,
                                   XTIME);
}

Note that this is unconditionally allocating to priv->marker_xattr and is called in reconfigure codepath. This is a definite leak.

Comment 21 Raghavendra G 2017-01-11 08:14:35 UTC
All the above comments were made by referring to statedumps/server-1/bricks-brick7-b7.2428.dump.1484029391

Comment 22 Prasad Desala 2017-01-11 09:17:02 UTC
Mentioning a point here,
Before the OOM kill is seen, I had also issued rm -rf multiple times from multiple clients to clear the dataset on the mount point.

Comment 23 Soumya Koduri 2017-01-11 09:19:05 UTC
[features/upcall.distrep-upcall - usage-type gf_common_mt_mem_pool memusage]
size=49547464
num_allocs=288157
max_size=49547636
max_num_allocs=288158
total_allocs=1435066


[features/upcall.distrep-upcall - usage-type gf_upcall_mt_upcall_inode_ctx_t memusage]
size=18336
num_allocs=191
max_size=2038560
max_num_allocs=21235
total_allocs=271612

wrt upcall, We observed that number of inode_ctx allocations matches with the number of active+lru inodes.

The only thing unknown is why there are so many mem-pool allocations. It could be expected with so many fops being sent to the server. As mentioned by Sanoj in earlier comments, maybe even in this case, its taking a while to clean up this memory. Taking statedumps at frequent intervals and analysing the memory consumption may help in determining if its indeed a leak.

Comment 26 Mohit Agrawal 2017-01-12 10:42:27 UTC
Hi,


Currently we have found one below code(the same is for up_fsetxxat also) path that holds dict leak but there could be other path also those are having leak .

>>>>>>>>>>>>>>>>>>>>>>

0x7ff423dac373 : mem_get0+0x13/0x90 [/usr/lib64/libglusterfs.so.0.0.1]
 0x7ff423d7d355 : get_new_dict_full+0x25/0x120 [/usr/lib64/libglusterfs.so.0.0.1]
 0x7ff423d7dbab : dict_new+0xb/0x20 [/usr/lib64/libglusterfs.so.0.0.1]
 0x7ff423d7fa0a : dict_copy_with_ref+0x3a/0xe0 [/usr/lib64/libglusterfs.so.0.0.1]
 0x7ff41419733a : up_setxattr+0x3a/0x450 [/usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so]
 0x7ff423e16684 : default_setxattr_resume+0x1d4/0x250 [/usr/lib64/libglusterfs.so.0.0.1]
 0x7ff423da86ed : call_resume+0x7d/0xd0 [/usr/lib64/libglusterfs.so.0.0.1]
 0x7ff40fdf9957 : iot_worker+0x117/0x220 [/usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so]
 0x7ff422be6dc5 : 0x7ff422be6dc5 [/usr/lib64/libpthread-2.17.so+0x7dc5/0x218000]


>>>>>>>>>>>>>>>>>>>>>>>>>>

I am trying to find other path also, will send a patch after spend some more time on this.


Regards
Mohit Agrawal

Comment 27 Atin Mukherjee 2017-01-16 10:22:55 UTC
upstream patch : http://review.gluster.org/16392

Comment 30 Atin Mukherjee 2017-01-17 07:40:10 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/95350

Comment 32 Atin Mukherjee 2017-01-24 15:29:14 UTC
*** Bug 1415632 has been marked as a duplicate of this bug. ***

Comment 33 Prasad Desala 2017-02-07 06:44:33 UTC
This issue is not reproduced on glusterfs version 3.8.4-13.el7rhgs.x86_64.

Steps:
=====
1) Created a ganesha cluster and created a distributed-replicate volume (4x2).
2) Enable nfs-ganesha on the volume with mdcache settings.
3) Mount the volume.
4) On the mount point, copied linux kernel package and created four empty directories say 1, 2 ,3, 4.
5) Now, simultaneously started untarring the kernel package to directories 1,2,3 and 4.
tar -xvf linux-4.4.1.tar -C 1
tar -xvf linux-4.4.1.tar -C 2
tar -xvf linux-4.4.1.tar -C 3
tar -xvf linux-4.4.1.tar -C 4
6) While untar is in-progress, continuously added bricks to the volume. 

Initially the volume is 4x2, while untar is in-progress continuously added bricks to the volume and scaled it up to 14x2. During these add-bricks operations I Didn't see OOM kills of glusterfsd.

Hence, moving this BZ to Verified.

Comment 35 errata-xmlrpc 2017-03-23 06:02:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html


Note You need to log in before you can comment on or make changes to this bug.