Bug 1400071 - OOM kill of glusterfs fuse mount process seen on client where I was doing deletes
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: fuse
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Csaba Henk
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-30 12:02 UTC by Nag Pavan Chilakam
Modified: 2020-10-06 11:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-06 11:00:34 UTC
Embargoed:



Description Nag Pavan Chilakam 2016-11-30 12:02:16 UTC
The purpose of raising this bug is to separately track two actions performed on the same set of files.
While doing rm -rf of files that were also being renamed from a different client, OOM kills of the fuse mount process were seen.
As discussed in the triage meeting for bz https://bugzilla.redhat.com/show_bug.cgi?id=1381140, we are raising a separate bug.

For more info, refer to bz https://bugzilla.redhat.com/show_bug.cgi?id=1381140

Also refer to bug https://bugzilla.redhat.com/show_bug.cgi?id=1400067, which was raised to track the issue across 3 different bugs.

Comment 4 Amar Tumballi 2018-02-07 04:26:40 UTC
We have noticed that the bug is not reproducible in the latest version of the product (RHGS-3.3.1+).

If the bug is still relevant and reproducible, feel free to reopen it.

Comment 9 Amar Tumballi 2019-03-08 11:12:32 UTC
I noticed that 'readdir-ahead' was not enabled on the volume, so my suspicion is not valid.

@Nag, if you are restarting the client and starting the same type of job, I would like to get statedumps now, 1 hour later, and 24 hours later (a sketch for capturing these follows below). Just by looking at the logs, not much is evident.

I see that many errors are returned to the application (EPERM, ENOENT, etc.). So one possibility is that there is a leaky code path in negative scenarios which is not exercised in the happy path. We need similar tests to pinpoint the exact leak.
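
For example, something along these lines could be used to capture the three statedumps. This is only a sketch: the mount path is a placeholder, and it assumes the default behavior where sending SIGUSR1 to the fuse client makes it write a statedump under /var/run/gluster.

#!/bin/bash
# Sketch: capture client statedumps at t=0, t=1h and t=24h.
MNT=/mnt/rpcx3-new                          # placeholder mount point
PID=$(pgrep -f "glusterfs.*${MNT}")         # assumes a single matching client process

for delay in 0 3600 82800; do               # 0s, +1h, +23h => t=0, t=1h, t=24h
    sleep "${delay}"
    kill -USR1 "${PID}"                     # ask the client to write a statedump
    sleep 5
    ls -lt /var/run/gluster | head -5       # newest dump files
done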

Comment 12 Nag Pavan Chilakam 2019-03-12 05:23:50 UTC
Hit the problem again, in less than 7 hours.
The I/O being run from the client is listed below:

1) while true; do find * | xargs stat; done  ---> from the root of the volume mount

2) removing some directories (which contained an untarred Linux kernel) under
/mnt/rpcx3-new/IOs/kernel/dhcp35-77.lab.eng.blr.redhat.com

[root@dhcp35-77 dhcp35-77.lab.eng.blr.redhat.com]# time rm -rf dir.3*
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/main_usb.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/rf.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/usbpipe.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/Kconfig’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/TODO’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/baseband.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/card.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/card.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/channel.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/device.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/dpc.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/firmware.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/key.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/mac.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/power.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/rf.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/rxtx.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/rxtx.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/wcmd.c’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/vt6656/wcmd.h’: Transport endpoint is not connected
rm: cannot remove ‘dir.3/linux-4.20.8/drivers/staging/wlan-ng’: Transport endpoint is not connected
rm: fts_read failed: Transport endpoint is not connected

real    384m9.243s
user    0m0.622s
sys     0m7.248s


3) untar of the Linux kernel into the same parent directory as 2), but without conflicting with 2)

4) capturing resource usage output to a file on the mount point, in append mode, every 2 minutes

5) taking statedumps every 30 minutes (a sketch of how 4) and 5) could be scripted follows this list)
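
The exact scripts behind 4) and 5) are not pasted in this bug; the following is only a rough sketch of what they amount to. The log file name is a placeholder, a single matching client process is assumed, and statedumps are again triggered via SIGUSR1.

#!/bin/bash
# Sketch of 4) and 5): log resource usage every 2 minutes, statedump every 30 minutes.
MNT=/mnt/rpcx3-new
LOG=/var/tmp/rpcx3-resource-usage.log       # placeholder output file
PID=$(pgrep -f "glusterfs.*${MNT}")

i=0
while true; do
    { date; df -h "${MNT}"; ps -o rss=,vsz= -p "${PID}"; } >> "${LOG}"
    if [ $((i % 15)) -eq 0 ]; then          # every 15th pass = every 30 minutes
        kill -USR1 "${PID}"                 # statedump written under /var/run/gluster
    fi
    i=$((i + 1))
    sleep 120
done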


sosreports and logs, with statedumps of the fuse mount process taken every 30 minutes:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1400071/clients/reproducedissue-with-statedumps-of-fuse-mount/

Comment 13 Nag Pavan Chilakam 2019-03-12 05:33:00 UTC
[Mon Mar 11 17:48:50 2019] glustersigwait invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[Mon Mar 11 17:48:50 2019] glustersigwait cpuset=/ mems_allowed=0
[Mon Mar 11 17:48:50 2019] CPU: 0 PID: 32195 Comm: glustersigwait Kdump: loaded Not tainted 3.10.0-957.5.1.el7.x86_64 #1
[Mon Mar 11 17:48:50 2019] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[Mon Mar 11 17:48:50 2019] Call Trace:
[Mon Mar 11 17:48:50 2019]  [<ffffffff99761e41>] dump_stack+0x19/0x1b
[Mon Mar 11 17:48:50 2019]  [<ffffffff9975c86a>] dump_header+0x90/0x229
[Mon Mar 11 17:48:50 2019]  [<ffffffff99300f3b>] ? cred_has_capability+0x6b/0x120
[Mon Mar 11 17:48:50 2019]  [<ffffffff991ba524>] oom_kill_process+0x254/0x3d0
[Mon Mar 11 17:48:50 2019]  [<ffffffff9930101e>] ? selinux_capable+0x2e/0x40
[Mon Mar 11 17:48:50 2019]  [<ffffffff991bad66>] out_of_memory+0x4b6/0x4f0
[Mon Mar 11 17:48:50 2019]  [<ffffffff9975d36e>] __alloc_pages_slowpath+0x5d6/0x724
[Mon Mar 11 17:48:50 2019]  [<ffffffff991c1145>] __alloc_pages_nodemask+0x405/0x420
[Mon Mar 11 17:48:50 2019]  [<ffffffff99211535>] alloc_pages_vma+0xb5/0x200
[Mon Mar 11 17:48:50 2019]  [<ffffffff991ff785>] __read_swap_cache_async+0x115/0x190
[Mon Mar 11 17:48:50 2019]  [<ffffffff991ff826>] read_swap_cache_async+0x26/0x60
[Mon Mar 11 17:48:50 2019]  [<ffffffff991ff90c>] swapin_readahead+0xac/0x110
[Mon Mar 11 17:48:50 2019]  [<ffffffff991e9a02>] handle_pte_fault+0x812/0xd10
[Mon Mar 11 17:48:50 2019]  [<ffffffff991ec01d>] handle_mm_fault+0x39d/0x9b0
[Mon Mar 11 17:48:50 2019]  [<ffffffff9976f5e3>] __do_page_fault+0x203/0x500
[Mon Mar 11 17:48:50 2019]  [<ffffffff9976f9c6>] trace_do_page_fault+0x56/0x150
[Mon Mar 11 17:48:50 2019]  [<ffffffff9976ef42>] do_async_page_fault+0x22/0xf0
[Mon Mar 11 17:48:50 2019]  [<ffffffff9976b788>] async_page_fault+0x28/0x30
[Mon Mar 11 17:48:50 2019] Mem-Info:
[Mon Mar 11 17:48:50 2019] active_anon:665224 inactive_anon:250707 isolated_anon:0
 active_file:0 inactive_file:19 isolated_file:0
 unevictable:0 dirty:0 writeback:1 unstable:0
 slab_reclaimable:5816 slab_unreclaimable:9495
 mapped:118 shmem:226 pagetables:4908 bounce:0
 free:21607 free_pcp:487 free_cma:0
[Mon Mar 11 17:48:50 2019] Node 0 DMA free:15364kB min:276kB low:344kB high:412kB active_anon:148kB inactive_anon:356kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:12kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Mon Mar 11 17:48:50 2019] lowmem_reserve[]: 0 2815 3773 3773
[Mon Mar 11 17:48:50 2019] Node 0 DMA32 free:53972kB min:50200kB low:62748kB high:75300kB active_anon:2210716kB inactive_anon:552440kB active_file:0kB inactive_file:68kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129336kB managed:2883116kB mlocked:0kB dirty:0kB writeback:4kB mapped:336kB shmem:704kB slab_reclaimable:13916kB slab_unreclaimable:23952kB kernel_stack:1488kB pagetables:13244kB unstable:0kB bounce:0kB free_pcp:1580kB local_pcp:56kB free_cma:0kB writeback_tmp:0kB pages_scanned:1361 all_unreclaimable? yes
[Mon Mar 11 17:48:50 2019] lowmem_reserve[]: 0 0 958 958
[Mon Mar 11 17:48:50 2019] Node 0 Normal free:17092kB min:17100kB low:21372kB high:25648kB active_anon:450032kB inactive_anon:450032kB active_file:8kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1048576kB managed:981200kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:200kB slab_reclaimable:9344kB slab_unreclaimable:14008kB kernel_stack:1296kB pagetables:6376kB unstable:0kB bounce:0kB free_pcp:368kB local_pcp:108kB free_cma:0kB writeback_tmp:0kB pages_scanned:951 all_unreclaimable? yes
[Mon Mar 11 17:48:50 2019] lowmem_reserve[]: 0 0 0 0
[Mon Mar 11 17:48:50 2019] Node 0 DMA: 3*4kB (EM) 3*8kB (UE) 4*16kB (UE) 3*32kB (UE) 1*64kB (E) 2*128kB (UE) 2*256kB (UE) 2*512kB (EM) 3*1024kB (UEM) 1*2048kB (E) 2*4096kB (M) = 15364kB
[Mon Mar 11 17:48:50 2019] Node 0 DMA32: 695*4kB (UEM) 776*8kB (UE) 741*16kB (UEM) 383*32kB (UEM) 204*64kB (UEM) 58*128kB (UEM) 2*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 54092kB
[Mon Mar 11 17:48:50 2019] Node 0 Normal: 227*4kB (UEM) 266*8kB (UEM) 293*16kB (UEM) 135*32kB (UEM) 59*64kB (UEM) 10*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17100kB
[Mon Mar 11 17:48:50 2019] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Mar 11 17:48:50 2019] 8296 total pagecache pages
[Mon Mar 11 17:48:50 2019] 8015 pages in swap cache
[Mon Mar 11 17:48:50 2019] Swap cache stats: add 495331795, delete 495325510, find 411130330/473729900
[Mon Mar 11 17:48:50 2019] Free swap  = 0kB
[Mon Mar 11 17:48:50 2019] Total swap = 4063228kB
[Mon Mar 11 17:48:50 2019] 1048476 pages RAM
[Mon Mar 11 17:48:50 2019] 0 pages HighMem/MovableOnly
[Mon Mar 11 17:48:50 2019] 78420 pages reserved
[Mon Mar 11 17:48:50 2019] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Mon Mar 11 17:48:50 2019] [ 1603]     0  1603     9866       98      25       63             0 systemd-journal
[Mon Mar 11 17:48:50 2019] [ 1631]     0  1631    50270        0      31      413             0 lvmetad
[Mon Mar 11 17:48:50 2019] [ 1636]     0  1636    11920        2      27      574         -1000 systemd-udevd
[Mon Mar 11 17:48:50 2019] [ 2973]     0  2973    13880        8      26      103         -1000 auditd
[Mon Mar 11 17:48:50 2019] [ 2997]     0  2997     6594       18      18       58             0 systemd-logind
[Mon Mar 11 17:48:50 2019] [ 2998]   999  2998   153254        0      59     2306             0 polkitd
[Mon Mar 11 17:48:50 2019] [ 3002]     0  3002     5383       20      14       41             0 irqbalance
[Mon Mar 11 17:48:50 2019] [ 3005]    81  3005    16686       72      34      175          -900 dbus-daemon
[Mon Mar 11 17:48:50 2019] [ 3008]   998  3008    29446       17      29       97             0 chronyd
[Mon Mar 11 17:48:50 2019] [ 3054]     0  3054    31572       16      21      141             0 crond
[Mon Mar 11 17:48:50 2019] [ 3058]     0  3058    27523        1      10       31             0 agetty
[Mon Mar 11 17:48:50 2019] [ 3062]     0  3062    90507        0      98     6499             0 firewalld
[Mon Mar 11 17:48:50 2019] [ 3063]     0  3063   156358      233      89      332             0 NetworkManager
[Mon Mar 11 17:48:50 2019] [ 3366]     0  3366    26865       44      54      455             0 dhclient
[Mon Mar 11 17:48:50 2019] [ 3575]     0  3575    28215        0      57      270         -1000 sshd
[Mon Mar 11 17:48:50 2019] [ 3578]     0  3578   143546      130      98     3259             0 tuned
[Mon Mar 11 17:48:50 2019] [ 3579]     0  3579    54102      592      43     1131             0 rsyslogd
[Mon Mar 11 17:48:50 2019] [ 3601]     0  3601    26992        2      10       37             0 rhnsd
[Mon Mar 11 17:48:50 2019] [ 3807]     0  3807    22411       20      44      239             0 master
[Mon Mar 11 17:48:50 2019] [ 3812]    89  3812    22454       17      48      244             0 qmgr
[Mon Mar 11 17:48:50 2019] [ 4284]     0  4284    32008       26      17      169             0 screen
[Mon Mar 11 17:48:50 2019] [ 4285]     0  4285    33513      186      23     4582             0 bash
[Mon Mar 11 17:48:50 2019] [ 6544]     0  6544    79626        0     121     3537          -900 rhsmd
[Mon Mar 11 17:48:50 2019] [21953]     0 21953    39331        0      79      504             0 sshd
[Mon Mar 11 17:48:50 2019] [22543]     0 22543    28912        2      13      146             0 bash
[Mon Mar 11 17:48:50 2019] [24745]     0 24745    31976        1      18      145             0 screen
[Mon Mar 11 17:48:50 2019] [24746]     0 24746    28893        2      13      139             0 bash
[Mon Mar 11 17:48:50 2019] [25498]     0 25498    31975        1      18      148             0 screen
[Mon Mar 11 17:48:50 2019] [25499]     0 25499    28893        2      14      131             0 bash
[Mon Mar 11 17:48:50 2019] [ 1624]     0  1624    31975       91      18       74             0 screen
[Mon Mar 11 17:48:50 2019] [ 1625]     0  1625    29021      168      13       80             0 bash
[Mon Mar 11 17:48:50 2019] [11285]     0 11285    45590        3      46      232             0 crond
[Mon Mar 11 17:48:50 2019] [17449]     0 17449    45590        3      46      232             0 crond
[Mon Mar 11 17:48:50 2019] [20152]     0 20152    45590        3      46      232             0 crond
[Mon Mar 11 17:48:50 2019] [25375]     0 25375    45590        3      46      232             0 crond
[Mon Mar 11 17:48:50 2019] [27958]     0 27958    45590        3      46      232             0 crond
[Mon Mar 11 17:48:50 2019] [10551]     0 10551    45590        3      46      232             0 crond
[Mon Mar 11 17:48:50 2019] [32193]     0 32193  1956326   905728    3303   740779             0 glusterfs
[Mon Mar 11 17:48:50 2019] [32712]     0 32712    32008        2      17      185             0 screen
[Mon Mar 11 17:48:50 2019] [32713]     0 32713    28893       45      13       88             0 bash
[Mon Mar 11 17:48:50 2019] [ 9021]    89  9021    22437       14      44      237             0 pickup
[Mon Mar 11 17:48:50 2019] [ 3081]     0  3081     4120       31      13        0             0 find
[Mon Mar 11 17:48:50 2019] [ 3082]     0  3082    27063       24       9        0             0 xargs
[Mon Mar 11 17:48:50 2019] [ 3104]     0  3104    26988       18      10        0             0 sleep
[Mon Mar 11 17:48:50 2019] [ 3105]     0  3105    26988       19       8        0             0 sleep
[Mon Mar 11 17:48:50 2019] Out of memory: Kill process 32193 (glusterfs) score 805 or sacrifice child
[Mon Mar 11 17:48:50 2019] Killed process 32193 (glusterfs) total-vm:7825304kB, anon-rss:3622912kB, file-rss:0kB, shmem-rss:0kB
[root@dhcp35-109 glusterfs]# ls

Comment 14 Amar Tumballi 2019-03-12 08:14:51 UTC
A few questions:

Looks like the process got killed when the overall memory usage was 900MB ??

> [Mon Mar 11 17:48:50 2019] [32193]     0 32193  1956326   905728    3303   740779             0 glusterfs


Please note, if you want to run with that little memory, the recommended client configuration is '-o lru-limit=10000'. I checked the inode table details in the statedump, and the new feature already looks like it is working fine (the limit is 128k, and the number of inodes in the lru-list is always lower).


Next observation: the graph which was not active held significant memory in its mem-pools. Some of the top contributors are:

quick-read - ~150MB
io-cache - ~30MB
io-stats - ~30MB


The inode_ctx of replicate/client-protocol/dht were also in the tens of MBs. So I am not able to see any issue per se with the process; for the given workload, either the memory is not enough or the above option should be set.
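
For illustration, applying the suggested option on the client would look something like the sketch below. The server, volume, and mount point names are taken from this bug's setup; the exact mount command used on the client is not recorded here.

# Remount the fuse client with an inode lru-limit of 10000.
umount /mnt/rpcx3-new
mount -t glusterfs -o lru-limit=10000 \
      rhs-client19.lab.eng.blr.redhat.com:/rpcx3 /mnt/rpcx3-new

# Equivalent direct invocation of the client process:
# glusterfs --lru-limit=10000 \
#           --volfile-server=rhs-client19.lab.eng.blr.redhat.com \
#           --volfile-id=rpcx3 /mnt/rpcx3-new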

Comment 15 Nithya Balachandran 2019-03-12 08:52:09 UTC
Please include the gluster volume info output whenever filing a BZ.

Comment 16 Nag Pavan Chilakam 2019-03-12 09:28:54 UTC
My bad. The requested details are below, and are also available at https://docs.google.com/spreadsheets/d/17Yf9ZRWnWOpbRyFQ2ZYxAAlp9I_yarzKZdjN8idBJM0/edit#gid=1472913705

[root@rhs-client19 ~]# gluster v info
 
Volume Name: rpcx3
Type: Distributed-Replicate
Volume ID: f7532c65-63d0-4e4a-a5b5-c95238635eff
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick4: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick5: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick6: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick7: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick8: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick9: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick10: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb
Brick11: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3-newb
Brick12: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb
Brick13: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Brick14: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Brick15: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Options Reconfigured:
cluster.rebal-throttle: aggressive
diagnostics.client-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.uss: enable
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
[root@rhs-client19 ~]# gluster v status
Status of volume: rpcx3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3                        49152     0          Y       10824
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3                        49152     0          Y       5232 
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3                        49152     0          Y       10898
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3                        49153     0          Y       5253 
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3                        49153     0          Y       10904
Brick rhs-client38.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3                        49152     0          Y       31256
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3                        49154     0          Y       10998
Brick rhs-client38.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3                        49153     0          Y       31277
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3                        49153     0          Y       10826
Brick rhs-client38.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3-newb                   49155     0          Y       19062
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3-newb                   49155     0          Y       29805
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3-newb                   49155     0          Y       30021
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3-newb                   49156     0          Y       29826
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3-newb                   49156     0          Y       30042
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3-newb                   49156     0          Y       1636 
Snapshot Daemon on localhost                49154     0          Y       10872
Self-heal Daemon on localhost               N/A       N/A        Y       29849
Quota Daemon on localhost                   N/A       N/A        Y       29860
Snapshot Daemon on rhs-client32.lab.eng.blr
.redhat.com                                 49155     0          Y       11221
Self-heal Daemon on rhs-client32.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       1658 
Quota Daemon on rhs-client32.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       1668 
Snapshot Daemon on rhs-client38.lab.eng.blr
.redhat.com                                 49154     0          Y       18492
Self-heal Daemon on rhs-client38.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       19097
Quota Daemon on rhs-client38.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       19115
Snapshot Daemon on rhs-client25.lab.eng.blr
.redhat.com                                 49154     0          Y       9833 
Self-heal Daemon on rhs-client25.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       30065
Quota Daemon on rhs-client25.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       30076
 
Task Status of Volume rpcx3
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 2cd252ed-3202-4c7f-99bd-6326058c797f
Status               : in progress

Comment 17 Nag Pavan Chilakam 2019-03-12 09:52:41 UTC
(In reply to Amar Tumballi from comment #14)
> Few questions:
> 
> Looks like the process got killed when the overall memory usage was 900MB ??
> 
> > [Mon Mar 11 17:48:50 2019] [32193]     0 32193  1956326   905728    3303   740779             0 glusterfs
> 
> 
> Please note, if you want to run with that much less memory, recommended
> client configuration is '-olru-limit=10000'. As I checked the inode table
> details in statedump, and the new feature is already looks like it is
> working fine. (limit is 128k, and number of inodes in lru-list is always
> lesser).
> 
> 
> Next observation, the graph which was not active, held significant memory in
> its mem-pools. Some of the top contributors are:
> 
> quick-read- ~150MB
> io-cache - ~30MB
> io-stats - ~30MB
> 
> 
> While the inode_ctx of replicate/client-protocol/dht were also above 10s of
> MBs. So, I am not able to see any issues per say with the process, but for
> the given workload, the memory is not enough, or the above option should be
> set.


I am retrying with a 16 GB bare-metal client. Will update the results accordingly.

Comment 24 Amar Tumballi 2019-08-06 04:16:54 UTC
Re-open if you happen to test it with 3.5.0 bits and still see the behavior. For now, it is DEFERRED.

Comment 25 Nag Pavan Chilakam 2019-09-03 06:40:53 UTC
While an OOM kill hasn't happened yet, given that I am testing on a 64 GB machine, I see memory spiking to almost 30 GB. So I have been able to reproduce the memory leak.
I have the statedumps and will be attaching them.
Hence I am proposing it back for fixing in 3.5.0, as it has not been fixed in 3.5.0 (see comment #23).

Comment 27 Nag Pavan Chilakam 2019-09-03 06:46:29 UTC
Sunil,
can you help expedite this issue? I need the machines back due to limited resources.

Comment 47 Raghavendra G 2019-11-25 04:22:38 UTC
(In reply to Csaba Henk from comment #36)
> Nithya, I see your point.
> 
> In Glusterfs:
> 
> struct _dentry {
>     struct list_head inode_list; /* list of dentries of inode */
>     struct list_head hash;       /* hash table pointers */
>     inode_t *inode;              /* inode of this directory entry */
>     char *name;                  /* name of the directory entry */
>     inode_t *parent;             /* directory of the entry */
> };
> 
> In Linux kernel:
> 
> struct dentry {
> 	/* RCU lookup touched fields */
> 	unsigned int d_flags;		/* protected by d_lock */
> 	seqcount_t d_seq;		/* per dentry seqlock */
> 	struct hlist_bl_node d_hash;	/* lookup hash list */
> 	struct dentry *d_parent;	/* parent directory */
>         ...
> }
> 
> (https://github.com/torvalds/linux/blob/v5.2/include/linux/dcache.h#L94)
> 
> That is, in Glusterfs, parent of a dentry is an inode, while in kernel,
> parent of dentry is a dentry. So in kernel the in-memory tree is laid out
> purely from dentries, decoupled from inodes and their lifetime cycle.

But the dentry structure definition also has a pointer to an inode:

struct dentry {
	/* RCU lookup touched fields */
	unsigned int d_flags;		/* protected by d_lock */
	seqcount_t d_seq;		/* per dentry seqlock */
	struct hlist_bl_node d_hash;	/* lookup hash list */
	struct dentry *d_parent;	/* parent directory */
	struct qstr d_name;
	struct inode *d_inode;		/* Where the name belongs to - NULL is
					 * negative */

So, what happens in the kernel when a file in a deep directory structure is looked up?
1. Are all the dentries leading from the file to the root present in memory? I think yes.
2. Do all the dentries point to a valid inode? Is it OK for many/all of these parent dentries to have a NULL inode?

I think we should consider whether maintaining a tree-based hierarchical namespace requires all the parent inodes to be present. What would happen if we don't do that? Which operations on that namespace could fail?

> 
> It could be worth to consider adopting a similar model.

Comment 54 Csaba Henk 2020-09-29 11:39:42 UTC
Hi Nag, is this issue still relevant?

Comment 56 Csaba Henk 2020-10-02 09:59:10 UTC
I went through this bug again. What I think should be done is to continue the investigation into the inode/dentry layout optimization brought up in comments #36 and #47. That is something which 1) should definitely happen in the future, but 2) won't happen in the foreseeable future (say, the current upstream release cycle).

Comment 57 Csaba Henk 2020-10-06 09:16:59 UTC
The bug got closed accidentally in comment #56. Reopening, as it is deemed to capture a relevant area for improvement.

Comment 58 Csaba Henk 2020-10-06 11:00:34 UTC
Having consulted with Sunil, we decided to continue the investigation of the supposed root cause in an RFE (https://github.com/gluster/glusterfs/issues/1544, "file tree memory layout optimization"), and close this bz.

