Bug 1379324

Summary: fuse mount crashed with OOM kill
Product: Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: core
Assignee: Vijay Bellur <vbellur>
Status: CLOSED WORKSFORME
QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, csaba, nchilaka, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-07 04:26:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
statedump1
statedump2
statedump3
5 statedumps tar image
glusterfs-fuse rpm for debugging

Description Nag Pavan Chilakam 2016-09-26 11:33:58 UTC
Description of problem:
=======================
I am running systemic testing for the 3.2 downstream release.
I created a 4x2 volume on a 4-node cluster and triggered IOs from 4 different clients.
After about 2 days, one client mount crashed, with the following details in dmesg:



[Fri Sep 23 20:02:17 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:02:17 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:02:17 2016] SELinux:  8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:02:17 2016] SELinux:  83 classes, 105340 rules
[Fri Sep 23 20:02:17 2016] SELinux:  Context system_u:unconfined_r:sandbox_t:s0-s0:c0.c1023 became invalid (unmapped).
[Fri Sep 23 20:02:17 2016] SELinux:  Context unconfined_u:unconfined_r:sandbox_t:s0-s0:c0.c1023 became invalid (unmapped).
[Fri Sep 23 20:14:23 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:23 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:23 2016] SELinux:  8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:23 2016] SELinux:  83 classes, 105340 rules
[Fri Sep 23 20:14:28 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:28 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:28 2016] SELinux:  8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:28 2016] SELinux:  83 classes, 105340 rules
[Fri Sep 23 20:14:32 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:32 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:32 2016] SELinux:  8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:32 2016] SELinux:  83 classes, 105340 rules
[Fri Sep 23 20:14:36 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:36 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:36 2016] SELinux:  8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:36 2016] SELinux:  83 classes, 105340 rules
[Fri Sep 23 20:14:40 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:40 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:40 2016] SELinux:  8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:40 2016] SELinux:  83 classes, 105340 rules
[Fri Sep 23 20:15:00 2016] type=1305 audit(1474641907.651:954375): audit_pid=0 old=975 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[Fri Sep 23 20:15:00 2016] type=1130 audit(1474641907.656:954376): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[Fri Sep 23 20:15:00 2016] type=1131 audit(1474641907.656:954377): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954378): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=7)
 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954379): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=8)
 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954380): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=9)
 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954381): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=10)
 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954382): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=11)
 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1305 audit(1474641908.712:954383): audit_enabled=1 old=1 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[Fri Sep 23 20:15:01 2016] type=1305 audit(1474641908.712:954384): audit_pid=21101 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[Fri Sep 23 20:27:08 2016] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
[Sun Sep 25 21:12:11 2016] ksmtuned invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
[Sun Sep 25 21:12:11 2016] ksmtuned cpuset=/ mems_allowed=0
[Sun Sep 25 21:12:11 2016] CPU: 3 PID: 6773 Comm: ksmtuned Not tainted 3.10.0-327.18.2.el7.x86_64 #1
[Sun Sep 25 21:12:11 2016] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[Sun Sep 25 21:12:11 2016]  ffff8802310e8000 000000001068a1c2 ffff880233c77b68 ffffffff816359e4
[Sun Sep 25 21:12:11 2016]  ffff880233c77bf8 ffffffff8163097f 0000000000000000 ffff880233c77b98
[Sun Sep 25 21:12:11 2016]  ffffffff81641188 fffeefff00000000 000000000000000b ffffffff8163dc03
[Sun Sep 25 21:12:11 2016] Call Trace:
[Sun Sep 25 21:12:11 2016]  [<ffffffff816359e4>] dump_stack+0x19/0x1b
[Sun Sep 25 21:12:11 2016]  [<ffffffff8163097f>] dump_header+0x8e/0x214
[Sun Sep 25 21:12:11 2016]  [<ffffffff81641188>] ? do_async_page_fault+0x78/0xe0
[Sun Sep 25 21:12:11 2016]  [<ffffffff8163dc03>] ? page_fault+0x23/0x30
[Sun Sep 25 21:12:11 2016]  [<ffffffff8116d04e>] oom_kill_process+0x24e/0x3b0
[Sun Sep 25 21:12:11 2016]  [<ffffffff8116d876>] out_of_memory+0x4b6/0x4f0
[Sun Sep 25 21:12:11 2016]  [<ffffffff81173a55>] __alloc_pages_nodemask+0xa95/0xb90
[Sun Sep 25 21:12:11 2016]  [<ffffffff81078d53>] copy_process.part.25+0x163/0x1610
[Sun Sep 25 21:12:11 2016]  [<ffffffff812861e6>] ? security_file_alloc+0x16/0x20
[Sun Sep 25 21:12:11 2016]  [<ffffffff8107a3e1>] do_fork+0xe1/0x320
[Sun Sep 25 21:12:11 2016]  [<ffffffff81090711>] ? __set_task_blocked+0x41/0xa0
[Sun Sep 25 21:12:11 2016]  [<ffffffff8107a6a6>] SyS_clone+0x16/0x20
[Sun Sep 25 21:12:11 2016]  [<ffffffff816464d9>] stub_clone+0x69/0x90
[Sun Sep 25 21:12:11 2016]  [<ffffffff81646189>] ? system_call_fastpath+0x16/0x1b
[Sun Sep 25 21:12:11 2016] Mem-Info:
[Sun Sep 25 21:12:11 2016] Node 0 DMA per-cpu:
[Sun Sep 25 21:12:11 2016] CPU    0: hi:    0, btch:   1 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    1: hi:    0, btch:   1 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    2: hi:    0, btch:   1 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    3: hi:    0, btch:   1 usd:   0
[Sun Sep 25 21:12:11 2016] Node 0 DMA32 per-cpu:
[Sun Sep 25 21:12:11 2016] CPU    0: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    1: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    2: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    3: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] Node 0 Normal per-cpu:
[Sun Sep 25 21:12:11 2016] CPU    0: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    1: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    2: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] CPU    3: hi:  186, btch:  31 usd:   0
[Sun Sep 25 21:12:11 2016] active_anon:1513633 inactive_anon:398755 isolated_anon:0
 active_file:7 inactive_file:2 isolated_file:0
 unevictable:11103 dirty:0 writeback:0 unstable:0
 free:29209 slab_reclaimable:8504 slab_unreclaimable:16123
 mapped:3226 shmem:84 pagetables:6587 bounce:0
 free_cma:0
[Sun Sep 25 21:12:11 2016] Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Sun Sep 25 21:12:11 2016] lowmem_reserve[]: 0 2815 7812 7812
[Sun Sep 25 21:12:11 2016] Node 0 DMA32 free:44336kB min:24308kB low:30384kB high:36460kB active_anon:2152032kB inactive_anon:625544kB active_file:0kB inactive_file:8kB unevictable:19840kB isolated(anon):0kB isolated(file):0kB present:3129336kB managed:2884968kB mlocked:19840kB dirty:0kB writeback:0kB mapped:5368kB shmem:248kB slab_reclaimable:10044kB slab_unreclaimable:16136kB kernel_stack:1104kB pagetables:7340kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:180121 all_unreclaimable? yes
[Sun Sep 25 21:12:11 2016] lowmem_reserve[]: 0 0 4996 4996
[Sun Sep 25 21:12:11 2016] Node 0 Normal free:56592kB min:43136kB low:53920kB high:64704kB active_anon:3902500kB inactive_anon:969476kB active_file:28kB inactive_file:0kB unevictable:24572kB isolated(anon):0kB isolated(file):0kB present:5251072kB managed:5116220kB mlocked:24572kB dirty:0kB writeback:0kB mapped:7536kB shmem:88kB slab_reclaimable:23972kB slab_unreclaimable:48356kB kernel_stack:3344kB pagetables:19008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable? no
[Sun Sep 25 21:12:11 2016] lowmem_reserve[]: 0 0 0 0
[Sun Sep 25 21:12:11 2016] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
[Sun Sep 25 21:12:11 2016] Node 0 DMA32: 11062*4kB (UM) 3*8kB (M) 2*16kB (M) 1*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44336kB
[Sun Sep 25 21:12:11 2016] Node 0 Normal: 10910*4kB (UEM) 113*8kB (UM) 104*16kB (UM) 68*32kB (UM) 42*64kB (UM) 27*128kB (UM) 9*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56832kB
[Sun Sep 25 21:12:11 2016] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Sun Sep 25 21:12:11 2016] 195281 total pagecache pages
[Sun Sep 25 21:12:11 2016] 192047 pages in swap cache
[Sun Sep 25 21:12:11 2016] Swap cache stats: add 97289535, delete 97097488, find 18915469/30778116
[Sun Sep 25 21:12:11 2016] Free swap  = 0kB
[Sun Sep 25 21:12:11 2016] Total swap = 4194300kB
[Sun Sep 25 21:12:11 2016] 2099100 pages RAM
[Sun Sep 25 21:12:11 2016] 0 pages HighMem/MovableOnly
[Sun Sep 25 21:12:11 2016] 94826 pages reserved
[Sun Sep 25 21:12:11 2016] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Sun Sep 25 21:12:11 2016] [  565]     0   565     9491      291      22       70             0 systemd-journal
[Sun Sep 25 21:12:11 2016] [  997]     0   997     4189      178      14       45             0 alsactl
[Sun Sep 25 21:12:11 2016] [ 1002]     0  1002    81554      296      80     4315             0 firewalld
[Sun Sep 25 21:12:11 2016] [ 1003]     0  1003     6621      203      17      143             0 smartd
[Sun Sep 25 21:12:11 2016] [ 1005]   994  1005     2133      129       9       29             0 lsmd
[Sun Sep 25 21:12:11 2016] [ 1008]     0  1008     8196      227      20     1662             0 systemd-logind
[Sun Sep 25 21:12:11 2016] [ 1010]     0  1010     1094      127       8       17             0 rngd
[Sun Sep 25 21:12:11 2016] [ 1019]     0  1019    52567      249      56      401             0 abrtd
[Sun Sep 25 21:12:11 2016] [ 1020]     0  1020    51996      215      56      335             0 abrt-watch-log
[Sun Sep 25 21:12:11 2016] [ 1023]    81  1023    25699      301      19      120          -900 dbus-daemon
[Sun Sep 25 21:12:11 2016] [ 1033]   990  1033    25162      241      20       53             0 chronyd
[Sun Sep 25 21:12:11 2016] [ 1074]     0  1074     1085      100       7       34             0 acpid
[Sun Sep 25 21:12:11 2016] [ 1076]   179  1076    64898     3954      29       11             0 sanlock
[Sun Sep 25 21:12:11 2016] [ 1077]     0  1077     5883        2      15       47             0 sanlock-helper
[Sun Sep 25 21:12:11 2016] [ 1125]     0  1125     3453      863      12        0             0 wdmd
[Sun Sep 25 21:12:11 2016] [ 1679]     0  1679    27631       97      52     3115             0 dhclient
[Sun Sep 25 21:12:11 2016] [ 1764]     0  1764    20640      233      44      213         -1000 sshd
[Sun Sep 25 21:12:11 2016] [ 1769]     0  1769     7336       88      16       37             0 iscsid
[Sun Sep 25 21:12:11 2016] [ 1771]     0  1771     7462     1268      19        0           -17 iscsid
[Sun Sep 25 21:12:11 2016] [ 1780]   991  1780    11592      146      26      173             0 nrpe
[Sun Sep 25 21:12:11 2016] [ 1781]     0  1781    28335       99      10       39             0 rhsmcertd
[Sun Sep 25 21:12:11 2016] [ 1791]     0  1791     6491      180      17       65             0 atd
[Sun Sep 25 21:12:11 2016] [ 1806]     0  1806    27509      170      10       31             0 agetty
[Sun Sep 25 21:12:11 2016] [ 2071]     0  2071    26978      124       9       19             0 rhnsd
[Sun Sep 25 21:12:11 2016] [12606]    29 12606    13222      194      29     1771             0 rpc.statd
[Sun Sep 25 21:12:11 2016] [24557]     0 24557    31592      255      17      142             0 crond
[Sun Sep 25 21:12:11 2016] [24619]     0 24619    11369      601      24      383         -1000 systemd-udevd
[Sun Sep 25 21:12:11 2016] [24665]     0 24665    82764     5427      41        0         -1000 dmeventd
[Sun Sep 25 21:12:11 2016] [ 6451]     0  6451    32283      275      27      392             0 lvmetad
[Sun Sep 25 21:12:11 2016] [ 6546]     0  6546   138264      420      91     2545             0 tuned
[Sun Sep 25 21:12:11 2016] [ 6593]     0  6593    63162     1709      35        0         -1000 multipathd
[Sun Sep 25 21:12:11 2016] [ 6701]    32  6701    16227      129      35      117             0 rpcbind
[Sun Sep 25 21:12:11 2016] [ 6773]     0  6773    28845       85      12       23             0 ksmtuned
[Sun Sep 25 21:12:11 2016] [ 6833]     0  6833    50842      151      40      115             0 gssproxy
[Sun Sep 25 21:12:11 2016] [ 6858]     0  6858   125958      356     120      932             0 libvirtd
[Sun Sep 25 21:12:11 2016] [ 6917]   996  6917   132532      205      54     2264             0 polkitd
[Sun Sep 25 21:12:11 2016] [20641]     0 20641   167221      215     121      161             0 rsyslogd
[Sun Sep 25 21:12:11 2016] [21101]     0 21101    12797      216      25       92         -1000 auditd
[Sun Sep 25 21:12:11 2016] [21277]     0 21277  2849455  1719417    5229   923965             0 glusterfs
[Sun Sep 25 21:12:11 2016] [21557]     0 21557    31473      182      16      112             0 screen
[Sun Sep 25 21:12:11 2016] [21558]     0 21558    29110      227      13      358             0 bash
[Sun Sep 25 21:12:11 2016] [21658]     0 21658    31407      190      14       87             0 screen
[Sun Sep 25 21:12:11 2016] [21659]     0 21659    30727      397      16     1809             0 bash
[Sun Sep 25 21:12:11 2016] [14809]     0 14809    30718      303      16      102             0 find
[Sun Sep 25 21:12:11 2016] Out of memory: Kill process 21277 (glusterfs) score 841 or sacrifice child
[Sun Sep 25 21:12:11 2016] Killed process 21277 (glusterfs) total-vm:11397820kB, anon-rss:6876624kB, file-rss:1044kB
[root@dhcp35-79 glusterfs]# 



Mount details:
10.70.37.150:distrepvol on /mnt/distrepvol type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@dhcp35-79 glusterfs]# cd /mnt/distrepvol
-bash: cd: /mnt/distrepvol: Transport endpoint is not connected

I am unable to cd into the mount point; it fails with a "Transport endpoint is not connected" error.

I remember raising something similar previously that was attributed to an old glibc, but this system has the newer glibc (refer https://bugzilla.redhat.com/show_bug.cgi?id=1339197).

[root@dhcp35-79 glusterfs]# rpm -qa|grep gluster
glusterfs-3.8.4-1.el7rhgs.x86_64
glusterfs-server-3.8.4-1.el7rhgs.x86_64
glusterfs-api-3.8.4-1.el7rhgs.x86_64
python-gluster-3.8.4-1.el7rhgs.noarch
glusterfs-events-3.8.4-1.el7rhgs.x86_64
glusterfs-libs-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-1.el7rhgs.x86_64
glusterfs-devel-3.8.4-1.el7rhgs.x86_64
[root@dhcp35-79 glusterfs]# rpm -qa|grep glibc
glibc-common-2.17-106.el7_2.8.x86_64
glibc-2.17-106.el7_2.8.x86_64
[root@dhcp35-79 glusterfs]# ^C
[root@dhcp35-79 glusterfs]# 




More information:
https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=760435885



The IO run from the client is a lookup loop, as below:
for i in {1..1000};do ls -lRt;stat *;sleep 500;done

Comment 2 Nag Pavan Chilakam 2016-09-27 13:09:54 UTC
With the same cluster setup, I am rerunning the lookups on a new client. The issue appears to be with the lookups: the memory consumption of the glusterfs mount process increases by about ~0.1% every minute.
From this specific client I created a directory and started a lookup on the root of the mount directory using the command below:
for i in {1..1000};do ls -lRt;stat *;sleep 500;done
Also, note that I am collecting top and free -h output by appending it to a file in the directory I created from this mount. This is the only write IO being done from *THIS* mount point (there can be other IOs from other clients).


Following is the memory consumption of the glusterfs mount process, sampled every minute:
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14567 root      20   0  720424  64348   3716 S   0.0  0.8   0:00.29 glusterfs
14567 root      20   0  720424  68472   3716 S   0.0  0.9   0:00.32 glusterfs
14567 root      20   0  785960  68676   3732 S   0.0  0.9   0:00.38 glusterfs
14567 root      20   0  785960  68708   3732 S   0.0  0.9   0:00.42 glusterfs
14567 root      20   0  985732  80352   4016 S  18.8  1.0   0:06.80 glusterfs
14567 root      20   0  985732  89856   4016 S  31.2  1.1   0:24.41 glusterfs
14567 root      20   0  985732 101584   4016 S  31.2  1.3   0:42.36 glusterfs
14567 root      20   0  985732 111684   4016 S   0.0  1.4   0:59.91 glusterfs
14567 root      20   0  985732 123144   4016 S  31.2  1.5   1:18.09 glusterfs
14567 root      20   0  985732 135124   4016 S  31.2  1.7   1:36.66 glusterfs
14567 root      20   0  985732 144356   4016 S  31.2  1.8   1:55.16 glusterfs
14567 root      20   0  985732 157132   4016 S  26.7  2.0   2:12.72 glusterfs
14567 root      20   0  985732 168676   4016 S  40.0  2.1   2:30.25 glusterfs
14567 root      20   0  985732 182460   4016 S  26.7  2.3   2:49.70 glusterfs
14567 root      20   0  985732 193696   4016 S  29.4  2.4   3:07.33 glusterfs
14567 root      20   0  985732 208052   4032 S  33.3  2.6   3:24.99 glusterfs
14567 root      20   0  985732 219024   4032 S  31.2  2.7   3:42.46 glusterfs
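For reference, a minimal sketch of a per-minute sampling loop like the one described above (the log path, interval, and single-process assumption are mine, not the exact script used here):

#!/bin/bash
# Hypothetical monitoring loop: append timestamped top/free samples for the
# glusterfs mount process to a log file on the volume under test.
LOG=/mnt/distrepvol/$(hostname)/mem-usage.log   # assumed log location
PID=$(pidof glusterfs)                          # assumes a single glusterfs client process

while true; do
    date >> "$LOG"
    top -b -n 1 -p "$PID" | tail -2 >> "$LOG"   # header plus VIRT/RES/%MEM row for the mount process
    free -h >> "$LOG"
    sleep 60
done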



Another sample:
I took two samples over a span of ~3 minutes.
First sample:
[root@dhcp35-179 logs]# free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        357M        6.0G        763M        1.3G        6.2G
Swap:          4.0G        108M        3.9G
[root@dhcp35-179 logs]# top -n 1 -b
top - 18:24:26 up 67 days,  5:06,  1 user,  load average: 0.29, 0.23, 0.14
Tasks: 225 total,   1 running, 224 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.7 us,  3.4 sy,  0.0 ni, 93.1 id,  0.0 wa,  0.0 hi,  0.2 si,  0.6 st
KiB Mem :  8017096 total,  6336308 free,   368696 used,  1312092 buff/cache
KiB Swap:  4194300 total,  4082984 free,   111316 used.  6525196 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14567 root      20   0  985732 134596   4016 S  53.3  1.7   1:35.92 glusterfs
 1013 root      20   0    4376    528    496 S  20.0  0.0  54:06.94 rngd
15233 root      20   0  157680   2140   1488 R   6.7  0.0   0:00.01 top
    1 root      20   0  191868   6656   3524 S   0.0  0.1  16:33.93 systemd

Second Sample:
============
[root@dhcp35-179 logs]# free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        401M        6.0G        773M        1.3G        6.2G
Swap:          4.0G        108M        3.9G
[root@dhcp35-179 logs]# top -n 1 -b
top - 18:28:16 up 67 days,  5:10,  1 user,  load average: 0.22, 0.26, 0.17
Tasks: 224 total,   1 running, 223 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.7 us,  3.4 sy,  0.0 ni, 93.1 id,  0.0 wa,  0.0 hi,  0.2 si,  0.6 st
KiB Mem :  8017096 total,  6262540 free,   413840 used,  1340716 buff/cache
KiB Swap:  4194300 total,  4082984 free,   111316 used.  6453096 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14567 root      20   0  985732 179488   4016 S  25.0  2.2   2:44.08 glusterfs
    1 root      20   0  191868   6656   3524 S   0.0  0.1  16:33.94 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:02.81 kthreadd



I have also taken process statedumps between two intervals; they are available in /var/run (not in /var/run/gluster, just /var/run).
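For reference, a client-side statedump can be triggered by sending SIGUSR1 to the fuse mount process; a minimal sketch, assuming a single glusterfs client process on the box:

# The glusterfs process writes a statedump named glusterdump.<pid>.dump.<timestamp>
# into its statedump directory (/var/run/gluster by default; plain /var/run on
# some builds, as noted above).
kill -USR1 $(pidof glusterfs)
ls -lt /var/run/gluster/glusterdump.* /var/run/glusterdump.* 2>/dev/null | head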

Comment 3 Csaba Henk 2016-09-29 10:28:20 UTC
(In reply to nchilaka from comment #2)
> With the same cluster setup, I am rerunning the lookups on a new client. I
> feel the issue seems to be with the lookups as I see that for every 1 min
> the Mem consumption by glusterfs mount process is increasing by about ~0.1%
> From this specific Client I created a directory and started lookup on the
> root of mount directory using below command:

What is the exact setup now? In particular:
- This time are you doing lookups only from the new client, or do the four (five?) clients exhibit simultaneous activity?
- Also, how does the file system tree look in the volume? Is it just the single dir you created now? If not, and there is already stuff there, what's the significance of creating a new directory?

My other type of question is about reproducibility. Do you see the same issue if you create a similar setup on another set of nodes? And if yes, do you still see it if you simplify the cluster, i.e., cut back on the number of replicas / the size of the distribution set (maybe simplifying as far as only distributed / only replicated)?

Comment 4 Nag Pavan Chilakam 2016-09-29 10:48:20 UTC
(In reply to Csaba Henk from comment #3)
> (In reply to nchilaka from comment #2)
> > With the same cluster setup, I am rerunning the lookups on a new client. I
> > feel the issue seems to be with the lookups as I see that for every 1 min
> > the Mem consumption by glusterfs mount process is increasing by about ~0.1%
> > From this specific Client I created a directory and started lookup on the
> > root of mount directory using below command:
> 
> What is the exact setup now? Ie. particular questions I have:
> - This time are you doing lookups only from the new client, or the four
> (five?) clients exhibit simultaneous activity?
> - Also, how does the file system tree look like in the voiume? Is it just
> the single dir you created now? If not, and there is already stuff there,
> what's the significance of creating a new directory?
I have a 4x2 volume spanning 4 gluster nodes.
Regarding IOs:
I have 4 clients, each mounting the volume from a different cluster node IP:
client1: untar of the Linux kernel
client2: using dd to create 2 million files of 10 KB each
client3: mkdir creation with a lot of depth and breadth
client4: doing lookups using ls and stat from the root ===> hitting the OOM kill only here.

> 
> My other type of questions are about reproducibility. Do you see the same
> issue if you create a similar setup on an other set of nodes? And if yes,
> then if you try to simplify the cluster? Ie. cutting back on the number of
> replicas / size of the distribution set (maybe simplifying as far as only
> distributed / only replicated)?

Reproducible: I observed this on at least 2 other machines while doing lookups, with the same workload as client4.
I have yet to try reproducing it on a smaller/simplified setup, but this setup is intended for systemic-style testing.

The reason I have directory names like that is that they are all client hostnames, and I am collecting log information such as memory and CPU usage into those directories. This tackles two things: one is log collection and the other is continuous append IO.

Comment 5 Nag Pavan Chilakam 2016-09-29 11:18:43 UTC
(In reply to nchilaka from comment #4)
> (In reply to Csaba Henk from comment #3)
> > (In reply to nchilaka from comment #2)
> > > With the same cluster setup, I am rerunning the lookups on a new client. I
> > > feel the issue seems to be with the lookups as I see that for every 1 min
> > > the Mem consumption by glusterfs mount process is increasing by about ~0.1%
> > > From this specific Client I created a directory and started lookup on the
> > > root of mount directory using below command:
> > 
> > What is the exact setup now? Ie. particular questions I have:
> > - This time are you doing lookups only from the new client, or the four
> > (five?) clients exhibit simultaneous activity?
> > - Also, how does the file system tree look like in the voiume? Is it just
> > the single dir you created now? If not, and there is already stuff there,
> > what's the significance of creating a new directory?
> I have a 4x2 volume, spanning across 4 gluster nodes
> Regarding IOs:
> I have 4 clients, each mounting the volume from different cluster node IP
> client1: untar of linux kernel
> client2: using dd for 2million file creation of 10KB size 
> client3:mkdir creation with lot of deapth and breadth
> client4: doing Lookup using ls and stat from the root ===>hitting oom kill
> only here.
> 
> > 
> > My other type of questions are about reproducibility. Do you see the same
> > issue if you create a similar setup on an other set of nodes? And if yes,
> > then if you try to simplify the cluster? Ie. cutting back on the number of
> > replicas / size of the distribution set (maybe simplifying as far as only
> > distributed / only replicated)?
> 
> Reproducible: I observed this on atleast another 2 machines while doing
> lookups....same workload as client4
> I have to see about reproducing on a smaller/simplified setup. But this
> setup is for testing systemic style .
> 
> Reason why I am having directory names like that,they are all
> client-hostnames and I am collecting log information like memory and cpu
> usage into those directories.....this tackles two things:one is log
> collection and the other is continuous append IO




Also, the file structure is as below.
I created the directories in the dhcp35-86 folder using:
for i in {1..1000};do for j in {1..1000};do for k in {1..1000};do for l in {1..1000};do for m in {1..1000};do mkdir -p level1.$i;mkdir -p level1.$i/level2.$j;mkdir -p level1.$i/level2.$j/level3.$k;mkdir -p level1.$i/level2.$j/level3.$k/level4.$l;mkdir -p level1.$i/level2.$j/level3.$k/level4.$l;mkdir -p level1.$i/level2.$j/level3.$k/level4.$l/level5.$m;done;done;done;done;done

The current structure is as below and growing:
distrepvol/rootdir1/dhcp35-86.lab.eng.blr.redhat.com/dirs/level1.1/level2.1/level{3.1..3.12 and growing}/level{4.1..4.1000}/level{1..1000}

The other host directories do not have much depth.

Comment 6 Csaba Henk 2016-09-30 10:42:46 UTC
Can you please attach the statedumps you have?

Comment 7 Csaba Henk 2016-09-30 10:53:35 UTC
(In reply to nchilaka from comment #2)
> Also, note that I am collecting top and free -h output by appending it to a
> file in the directory i created from this mount. This is the only write IO
> being done from *THIS* mount point (there can be some other IOs from other
> clients)
> 
> 
> Following is the memory consumption by glusterfs mount process every 1 min
>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 14567 root      20   0  720424  64348   3716 S   0.0  0.8   0:00.29 glusterfs
> 14567 root      20   0  720424  68472   3716 S   0.0  0.9   0:00.32 glusterfs
> 14567 root      20   0  785960  68676   3732 S   0.0  0.9   0:00.38 glusterfs
> 14567 root      20   0  785960  68708   3732 S   0.0  0.9   0:00.42 glusterfs
> 14567 root      20   0  985732  80352   4016 S  18.8  1.0   0:06.80 glusterfs
> 14567 root      20   0  985732  89856   4016 S  31.2  1.1   0:24.41 glusterfs
> 14567 root      20   0  985732 101584   4016 S  31.2  1.3   0:42.36 glusterfs
> 14567 root      20   0  985732 111684   4016 S   0.0  1.4   0:59.91 glusterfs
> 14567 root      20   0  985732 123144   4016 S  31.2  1.5   1:18.09 glusterfs
> 14567 root      20   0  985732 135124   4016 S  31.2  1.7   1:36.66 glusterfs
> 14567 root      20   0  985732 144356   4016 S  31.2  1.8   1:55.16 glusterfs
> 14567 root      20   0  985732 157132   4016 S  26.7  2.0   2:12.72 glusterfs
> 14567 root      20   0  985732 168676   4016 S  40.0  2.1   2:30.25 glusterfs
> 14567 root      20   0  985732 182460   4016 S  26.7  2.3   2:49.70 glusterfs
> 14567 root      20   0  985732 193696   4016 S  29.4  2.4   3:07.33 glusterfs
> 14567 root      20   0  985732 208052   4032 S  33.3  2.6   3:24.99 glusterfs
> 14567 root      20   0  985732 219024   4032 S  31.2  2.7   3:42.46 glusterfs

Methodologically, it's a bit odd that you write these stats to the filesystem under investigation, but you are probably right that this does not make much of a difference.

What would be a useful extension: record these stats not at regular times but after each run of the "ls -lRt; stat *" loop. Or even better, keep making records at regular times but *also* after the runs (indicating that a given record is at a run boundary), with timestamps, too.

Then we could see the amount of non-freed memory originating from a given run.
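A minimal sketch of such a run-boundary recording loop (the log path and the output redirects are assumptions, not the reporter's exact script):

#!/bin/bash
# Hypothetical variant of the lookup loop that records a timestamped memory
# snapshot at each run boundary, so per-run memory growth can be compared.
LOG=/mnt/distrepvol/$(hostname)/mem-per-run.log   # assumed log location
PID=$(pidof glusterfs)                            # assumes a single glusterfs client process

for i in {1..1000}; do
    ls -lRt > /dev/null       # the lookup workload itself
    stat * > /dev/null
    {
        echo "=== run $i boundary: $(date) ==="
        top -b -n 1 -p "$PID" | tail -2
        free -h
    } >> "$LOG"
    sleep 500
done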

> I have even take the process dumps between two intervals, which is available
> at /var/run (not in /var/run/gluster but in just /var/run)

Do you mean statedumps? Would it be possible to attach them to the BZ entry?

Comment 8 Nag Pavan Chilakam 2016-09-30 11:04:37 UTC
(In reply to Csaba Henk from comment #7)
> (In reply to nchilaka from comment #2)
> > Also, note that I am collecting top and free -h output by appending it to a
> > file in the directory i created from this mount. This is the only write IO
> > being done from *THIS* mount point (there can be some other IOs from other
> > clients)
> > 
> > 
> > Following is the memory consumption by glusterfs mount process every 1 min
> >  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> > 14567 root      20   0  720424  64348   3716 S   0.0  0.8   0:00.29 glusterfs
> > 14567 root      20   0  720424  68472   3716 S   0.0  0.9   0:00.32 glusterfs
> > 14567 root      20   0  785960  68676   3732 S   0.0  0.9   0:00.38 glusterfs
> > 14567 root      20   0  785960  68708   3732 S   0.0  0.9   0:00.42 glusterfs
> > 14567 root      20   0  985732  80352   4016 S  18.8  1.0   0:06.80 glusterfs
> > 14567 root      20   0  985732  89856   4016 S  31.2  1.1   0:24.41 glusterfs
> > 14567 root      20   0  985732 101584   4016 S  31.2  1.3   0:42.36 glusterfs
> > 14567 root      20   0  985732 111684   4016 S   0.0  1.4   0:59.91 glusterfs
> > 14567 root      20   0  985732 123144   4016 S  31.2  1.5   1:18.09 glusterfs
> > 14567 root      20   0  985732 135124   4016 S  31.2  1.7   1:36.66 glusterfs
> > 14567 root      20   0  985732 144356   4016 S  31.2  1.8   1:55.16 glusterfs
> > 14567 root      20   0  985732 157132   4016 S  26.7  2.0   2:12.72 glusterfs
> > 14567 root      20   0  985732 168676   4016 S  40.0  2.1   2:30.25 glusterfs
> > 14567 root      20   0  985732 182460   4016 S  26.7  2.3   2:49.70 glusterfs
> > 14567 root      20   0  985732 193696   4016 S  29.4  2.4   3:07.33 glusterfs
> > 14567 root      20   0  985732 208052   4032 S  33.3  2.6   3:24.99 glusterfs
> > 14567 root      20   0  985732 219024   4032 S  31.2  2.7   3:42.46 glusterfs
> 
> Methodologically, it's a bit weird that you write these stats to the
> filesystem under investigation, but probably you are right and this does not
> make much of a difference.
> 
> What would be a useful extension: record these stats not at regular times
> but after each run of the .. ls -ltR; stat * ... loop. Or even better, keep
> making records at regular times but *also* after the runs (indicating that a
> given record is at a run boundary). (Endowed with timeststamps, too.)
> 
> Then we could see the amount of non-freed memory originating from a given
> run.
> 
> > I have even take the process dumps between two intervals, which is available
> > at /var/run (not in /var/run/gluster but in just /var/run)
> 
> Do you mean statedumps? Would it be possible to attach them to the BZ entry?




I have attached some of the statedumps 

Also, note that the last lookup message displayed was a "Transport endpoint is not connected" error:
./dhcp35-86.lab.eng.blr.redhat.com/dirs/level1.1/level2.1/level3.11/level4.267:
ls: ./dhcp35-86.lab.eng.blr.redhat.com/dirs/level1.1/level2.1/level3.11/level4.267/level5.275: Transport endpoint is not connected

I tried to cd to the directory but the session hangs

I have the tree structure of the directory, which hosts many directories, but the problem is that it is a 600 MB file.

Comment 9 Nag Pavan Chilakam 2016-09-30 11:06:37 UTC
Created attachment 1206220 [details]
statedump1

Comment 10 Nag Pavan Chilakam 2016-09-30 11:08:20 UTC
Created attachment 1206221 [details]
statedump2

Comment 11 Nag Pavan Chilakam 2016-09-30 11:10:25 UTC
Created attachment 1206222 [details]
statedump3

Comment 12 Nag Pavan Chilakam 2016-09-30 11:14:04 UTC
Created attachment 1206223 [details]
5 statedumps tar image

Comment 13 Nag Pavan Chilakam 2016-09-30 11:17:11 UTC
Csaba,
You can look into the statedumps available in attachment 1206223 [details]; ignore the other attachments.
The tree-path log, even after tarring and zipping, is still 31 MB, which cannot be attached here.

Comment 14 Csaba Henk 2016-10-13 00:38:50 UTC
Nag, the RPM

http://download.eng.bos.redhat.com/brewroot/work/tasks/434/11900434/glusterfs-fuse-3.8.4-0.el7rhgs.x86_64.rpm

delivers a glusterfs fuse client patched to produce statedumps with more details about memory usage.

So could you please:
- install the RPM
- repeat the test
- meanwhile take statedumps the same way you did before
- attach the statedumps.
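A minimal sketch of these steps (the --oldpackage flag and the remount are my assumptions: the debug build carries a lower release number than the installed 3.8.4-1 package, and the patched client only takes effect on a fresh mount):

# Install the debug build over the existing glusterfs-fuse package.
rpm -Uvh --oldpackage glusterfs-fuse-3.8.4-0.el7rhgs.x86_64.rpm

# Remount the volume so the patched client is in use, rerun the workload,
# and take statedumps the same way as before.
umount /mnt/distrepvol
mount -t glusterfs 10.70.37.150:distrepvol /mnt/distrepvol
kill -USR1 $(pidof glusterfs)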

Comment 15 Csaba Henk 2016-11-02 23:05:43 UTC
Created attachment 1216808 [details]
glusterfs-fuse rpm for debugging

The rpm is no longer available at the location indicated in the previous comment, so I am providing it as an attachment.

Comment 16 Atin Mukherjee 2016-11-07 11:42:58 UTC
Can we retest with the rpms shared in comment 15 and share the results?

Comment 17 Csaba Henk 2016-11-11 05:52:14 UTC
Comment on attachment 1216808 [details]
glusterfs-fuse rpm for debugging

For the record -- the change applied on the source is this one:

http://review.gluster.org/13736

Comment 24 Amar Tumballi 2018-02-07 04:26:30 UTC
We have noticed that the bug is no longer reproducible in the latest version of the product (RHGS-3.3.1+).

If the bug is still relevant and is being reproduced, feel free to reopen the bug.