Description of problem:
=======================
I am running systemic testing for 3.2 downstream. I created a 4x2 volume on a 4-node cluster and triggered IOs from 4 different clients. After about 2 days, one client mount crashed with the following details in dmesg:

[Fri Sep 23 20:02:17 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:02:17 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:02:17 2016] SELinux: 8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:02:17 2016] SELinux: 83 classes, 105340 rules
[Fri Sep 23 20:02:17 2016] SELinux: Context system_u:unconfined_r:sandbox_t:s0-s0:c0.c1023 became invalid (unmapped).
[Fri Sep 23 20:02:17 2016] SELinux: Context unconfined_u:unconfined_r:sandbox_t:s0-s0:c0.c1023 became invalid (unmapped).
[Fri Sep 23 20:14:23 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:23 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:23 2016] SELinux: 8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:23 2016] SELinux: 83 classes, 105340 rules
[Fri Sep 23 20:14:28 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:28 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:28 2016] SELinux: 8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:28 2016] SELinux: 83 classes, 105340 rules
[Fri Sep 23 20:14:32 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:32 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:32 2016] SELinux: 8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:32 2016] SELinux: 83 classes, 105340 rules
[Fri Sep 23 20:14:36 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:36 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:36 2016] SELinux: 8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:36 2016] SELinux: 83 classes, 105340 rules
[Fri Sep 23 20:14:40 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:40 2016] SELinux: 2048 avtab hash slots, 105340 rules.
[Fri Sep 23 20:14:40 2016] SELinux: 8 users, 105 roles, 4953 types, 303 bools, 1 sens, 1024 cats
[Fri Sep 23 20:14:40 2016] SELinux: 83 classes, 105340 rules
[Fri Sep 23 20:15:00 2016] type=1305 audit(1474641907.651:954375): audit_pid=0 old=975 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[Fri Sep 23 20:15:00 2016] type=1130 audit(1474641907.656:954376): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[Fri Sep 23 20:15:00 2016] type=1131 audit(1474641907.656:954377): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954378): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=7) exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954379): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=8) exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954380): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=9) exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954381): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=10) exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1107 audit(1474641908.663:954382): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=11) exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[Fri Sep 23 20:15:01 2016] type=1305 audit(1474641908.712:954383): audit_enabled=1 old=1 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[Fri Sep 23 20:15:01 2016] type=1305 audit(1474641908.712:954384): audit_pid=21101 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[Fri Sep 23 20:27:08 2016] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
[Sun Sep 25 21:12:11 2016] ksmtuned invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
[Sun Sep 25 21:12:11 2016] ksmtuned cpuset=/ mems_allowed=0
[Sun Sep 25 21:12:11 2016] CPU: 3 PID: 6773 Comm: ksmtuned Not tainted 3.10.0-327.18.2.el7.x86_64 #1
[Sun Sep 25 21:12:11 2016] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[Sun Sep 25 21:12:11 2016] ffff8802310e8000 000000001068a1c2 ffff880233c77b68 ffffffff816359e4
[Sun Sep 25 21:12:11 2016] ffff880233c77bf8 ffffffff8163097f 0000000000000000 ffff880233c77b98
[Sun Sep 25 21:12:11 2016] ffffffff81641188 fffeefff00000000 000000000000000b ffffffff8163dc03
[Sun Sep 25 21:12:11 2016] Call Trace:
[Sun Sep 25 21:12:11 2016] [<ffffffff816359e4>] dump_stack+0x19/0x1b
[Sun Sep 25 21:12:11 2016] [<ffffffff8163097f>] dump_header+0x8e/0x214
[Sun Sep 25 21:12:11 2016] [<ffffffff81641188>] ? do_async_page_fault+0x78/0xe0
[Sun Sep 25 21:12:11 2016] [<ffffffff8163dc03>] ? page_fault+0x23/0x30
[Sun Sep 25 21:12:11 2016] [<ffffffff8116d04e>] oom_kill_process+0x24e/0x3b0
[Sun Sep 25 21:12:11 2016] [<ffffffff8116d876>] out_of_memory+0x4b6/0x4f0
[Sun Sep 25 21:12:11 2016] [<ffffffff81173a55>] __alloc_pages_nodemask+0xa95/0xb90
[Sun Sep 25 21:12:11 2016] [<ffffffff81078d53>] copy_process.part.25+0x163/0x1610
[Sun Sep 25 21:12:11 2016] [<ffffffff812861e6>] ? security_file_alloc+0x16/0x20
[Sun Sep 25 21:12:11 2016] [<ffffffff8107a3e1>] do_fork+0xe1/0x320
[Sun Sep 25 21:12:11 2016] [<ffffffff81090711>] ? __set_task_blocked+0x41/0xa0
[Sun Sep 25 21:12:11 2016] [<ffffffff8107a6a6>] SyS_clone+0x16/0x20
[Sun Sep 25 21:12:11 2016] [<ffffffff816464d9>] stub_clone+0x69/0x90
[Sun Sep 25 21:12:11 2016] [<ffffffff81646189>] ? system_call_fastpath+0x16/0x1b
[Sun Sep 25 21:12:11 2016] Mem-Info:
[Sun Sep 25 21:12:11 2016] Node 0 DMA per-cpu:
[Sun Sep 25 21:12:11 2016] CPU 0: hi: 0, btch: 1 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 1: hi: 0, btch: 1 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 2: hi: 0, btch: 1 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 3: hi: 0, btch: 1 usd: 0
[Sun Sep 25 21:12:11 2016] Node 0 DMA32 per-cpu:
[Sun Sep 25 21:12:11 2016] CPU 0: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 1: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 2: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 3: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] Node 0 Normal per-cpu:
[Sun Sep 25 21:12:11 2016] CPU 0: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 1: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 2: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] CPU 3: hi: 186, btch: 31 usd: 0
[Sun Sep 25 21:12:11 2016] active_anon:1513633 inactive_anon:398755 isolated_anon:0 active_file:7 inactive_file:2 isolated_file:0 unevictable:11103 dirty:0 writeback:0 unstable:0 free:29209 slab_reclaimable:8504 slab_unreclaimable:16123 mapped:3226 shmem:84 pagetables:6587 bounce:0 free_cma:0
[Sun Sep 25 21:12:11 2016] Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Sun Sep 25 21:12:11 2016] lowmem_reserve[]: 0 2815 7812 7812
[Sun Sep 25 21:12:11 2016] Node 0 DMA32 free:44336kB min:24308kB low:30384kB high:36460kB active_anon:2152032kB inactive_anon:625544kB active_file:0kB inactive_file:8kB unevictable:19840kB isolated(anon):0kB isolated(file):0kB present:3129336kB managed:2884968kB mlocked:19840kB dirty:0kB writeback:0kB mapped:5368kB shmem:248kB slab_reclaimable:10044kB slab_unreclaimable:16136kB kernel_stack:1104kB pagetables:7340kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:180121 all_unreclaimable? yes
[Sun Sep 25 21:12:11 2016] lowmem_reserve[]: 0 0 4996 4996
[Sun Sep 25 21:12:11 2016] Node 0 Normal free:56592kB min:43136kB low:53920kB high:64704kB active_anon:3902500kB inactive_anon:969476kB active_file:28kB inactive_file:0kB unevictable:24572kB isolated(anon):0kB isolated(file):0kB present:5251072kB managed:5116220kB mlocked:24572kB dirty:0kB writeback:0kB mapped:7536kB shmem:88kB slab_reclaimable:23972kB slab_unreclaimable:48356kB kernel_stack:3344kB pagetables:19008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable? no
[Sun Sep 25 21:12:11 2016] lowmem_reserve[]: 0 0 0 0
[Sun Sep 25 21:12:11 2016] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
[Sun Sep 25 21:12:11 2016] Node 0 DMA32: 11062*4kB (UM) 3*8kB (M) 2*16kB (M) 1*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44336kB
[Sun Sep 25 21:12:11 2016] Node 0 Normal: 10910*4kB (UEM) 113*8kB (UM) 104*16kB (UM) 68*32kB (UM) 42*64kB (UM) 27*128kB (UM) 9*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56832kB
[Sun Sep 25 21:12:11 2016] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Sun Sep 25 21:12:11 2016] 195281 total pagecache pages
[Sun Sep 25 21:12:11 2016] 192047 pages in swap cache
[Sun Sep 25 21:12:11 2016] Swap cache stats: add 97289535, delete 97097488, find 18915469/30778116
[Sun Sep 25 21:12:11 2016] Free swap = 0kB
[Sun Sep 25 21:12:11 2016] Total swap = 4194300kB
[Sun Sep 25 21:12:11 2016] 2099100 pages RAM
[Sun Sep 25 21:12:11 2016] 0 pages HighMem/MovableOnly
[Sun Sep 25 21:12:11 2016] 94826 pages reserved
[Sun Sep 25 21:12:11 2016] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Sun Sep 25 21:12:11 2016] [ 565] 0 565 9491 291 22 70 0 systemd-journal
[Sun Sep 25 21:12:11 2016] [ 997] 0 997 4189 178 14 45 0 alsactl
[Sun Sep 25 21:12:11 2016] [ 1002] 0 1002 81554 296 80 4315 0 firewalld
[Sun Sep 25 21:12:11 2016] [ 1003] 0 1003 6621 203 17 143 0 smartd
[Sun Sep 25 21:12:11 2016] [ 1005] 994 1005 2133 129 9 29 0 lsmd
[Sun Sep 25 21:12:11 2016] [ 1008] 0 1008 8196 227 20 1662 0 systemd-logind
[Sun Sep 25 21:12:11 2016] [ 1010] 0 1010 1094 127 8 17 0 rngd
[Sun Sep 25 21:12:11 2016] [ 1019] 0 1019 52567 249 56 401 0 abrtd
[Sun Sep 25 21:12:11 2016] [ 1020] 0 1020 51996 215 56 335 0 abrt-watch-log
[Sun Sep 25 21:12:11 2016] [ 1023] 81 1023 25699 301 19 120 -900 dbus-daemon
[Sun Sep 25 21:12:11 2016] [ 1033] 990 1033 25162 241 20 53 0 chronyd
[Sun Sep 25 21:12:11 2016] [ 1074] 0 1074 1085 100 7 34 0 acpid
[Sun Sep 25 21:12:11 2016] [ 1076] 179 1076 64898 3954 29 11 0 sanlock
[Sun Sep 25 21:12:11 2016] [ 1077] 0 1077 5883 2 15 47 0 sanlock-helper
[Sun Sep 25 21:12:11 2016] [ 1125] 0 1125 3453 863 12 0 0 wdmd
[Sun Sep 25 21:12:11 2016] [ 1679] 0 1679 27631 97 52 3115 0 dhclient
[Sun Sep 25 21:12:11 2016] [ 1764] 0 1764 20640 233 44 213 -1000 sshd
[Sun Sep 25 21:12:11 2016] [ 1769] 0 1769 7336 88 16 37 0 iscsid
[Sun Sep 25 21:12:11 2016] [ 1771] 0 1771 7462 1268 19 0 -17 iscsid
[Sun Sep 25 21:12:11 2016] [ 1780] 991 1780 11592 146 26 173 0 nrpe
[Sun Sep 25 21:12:11 2016] [ 1781] 0 1781 28335 99 10 39 0 rhsmcertd
[Sun Sep 25 21:12:11 2016] [ 1791] 0 1791 6491 180 17 65 0 atd
[Sun Sep 25 21:12:11 2016] [ 1806] 0 1806 27509 170 10 31 0 agetty
[Sun Sep 25 21:12:11 2016] [ 2071] 0 2071 26978 124 9 19 0 rhnsd
[Sun Sep 25 21:12:11 2016] [12606] 29 12606 13222 194 29 1771 0 rpc.statd
[Sun Sep 25 21:12:11 2016] [24557] 0 24557 31592 255 17 142 0 crond
[Sun Sep 25 21:12:11 2016] [24619] 0 24619 11369 601 24 383 -1000 systemd-udevd
[Sun Sep 25 21:12:11 2016] [24665] 0 24665 82764 5427 41 0 -1000 dmeventd
[Sun Sep 25 21:12:11 2016] [ 6451] 0 6451 32283 275 27 392 0 lvmetad
[Sun Sep 25 21:12:11 2016] [ 6546] 0 6546 138264 420 91 2545 0 tuned
[Sun Sep 25 21:12:11 2016] [ 6593] 0 6593 63162 1709 35 0 -1000 multipathd
[Sun Sep 25 21:12:11 2016] [ 6701] 32 6701 16227 129 35 117 0 rpcbind
[Sun Sep 25 21:12:11 2016] [ 6773] 0 6773 28845 85 12 23 0 ksmtuned
[Sun Sep 25 21:12:11 2016] [ 6833] 0 6833 50842 151 40 115 0 gssproxy
[Sun Sep 25 21:12:11 2016] [ 6858] 0 6858 125958 356 120 932 0 libvirtd
[Sun Sep 25 21:12:11 2016] [ 6917] 996 6917 132532 205 54 2264 0 polkitd
[Sun Sep 25 21:12:11 2016] [20641] 0 20641 167221 215 121 161 0 rsyslogd
[Sun Sep 25 21:12:11 2016] [21101] 0 21101 12797 216 25 92 -1000 auditd
[Sun Sep 25 21:12:11 2016] [21277] 0 21277 2849455 1719417 5229 923965 0 glusterfs
[Sun Sep 25 21:12:11 2016] [21557] 0 21557 31473 182 16 112 0 screen
[Sun Sep 25 21:12:11 2016] [21558] 0 21558 29110 227 13 358 0 bash
[Sun Sep 25 21:12:11 2016] [21658] 0 21658 31407 190 14 87 0 screen
[Sun Sep 25 21:12:11 2016] [21659] 0 21659 30727 397 16 1809 0 bash
[Sun Sep 25 21:12:11 2016] [14809] 0 14809 30718 303 16 102 0 find
[Sun Sep 25 21:12:11 2016] Out of memory: Kill process 21277 (glusterfs) score 841 or sacrifice child
[Sun Sep 25 21:12:11 2016] Killed process 21277 (glusterfs) total-vm:11397820kB, anon-rss:6876624kB, file-rss:1044kB
[root@dhcp35-79 glusterfs]#

Mount details:
10.70.37.150:distrepvol on /mnt/distrepvol type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@dhcp35-79 glusterfs]# cd /mnt/distrepvol
-bash: cd: /mnt/distrepvol: Transport endpoint is not connected

I am unable to cd into the mount point; it fails with the "Transport endpoint is not connected" error.
I remember raising something like this previously due to an old glibc, but this one has a new glibc (refer to https://bugzilla.redhat.com/show_bug.cgi?id=1339197).

[root@dhcp35-79 glusterfs]# rpm -qa|grep gluster
glusterfs-3.8.4-1.el7rhgs.x86_64
glusterfs-server-3.8.4-1.el7rhgs.x86_64
glusterfs-api-3.8.4-1.el7rhgs.x86_64
python-gluster-3.8.4-1.el7rhgs.noarch
glusterfs-events-3.8.4-1.el7rhgs.x86_64
glusterfs-libs-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-1.el7rhgs.x86_64
glusterfs-devel-3.8.4-1.el7rhgs.x86_64
[root@dhcp35-79 glusterfs]# rpm -qa|grep glibc
glibc-common-2.17-106.el7_2.8.x86_64
glibc-2.17-106.el7_2.8.x86_64
[root@dhcp35-79 glusterfs]# ^C
[root@dhcp35-79 glusterfs]#

More information: https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=760435885

The IO run from the client is a lookup loop, as below:
IO: to run lookups in a loop
for i in {1..1000};do ls -lRt;stat *;sleep 500;done
With the same cluster setup, I am rerunning the lookups on a new client. I feel the issue is with the lookups, as I see that the memory consumption of the glusterfs mount process increases by about ~0.1% every minute.
From this specific client I created a directory and started lookups on the root of the mount using the command below:

for i in {1..1000};do ls -lRt;stat *;sleep 500;done

Also, note that I am collecting top and free -h output by appending it to a file in the directory I created from this mount. This is the only write IO being done from *THIS* mount point (there can be some other IOs from other clients).

Following is the memory consumption of the glusterfs mount process, sampled every minute:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14567 root 20 0 720424 64348 3716 S 0.0 0.8 0:00.29 glusterfs
14567 root 20 0 720424 68472 3716 S 0.0 0.9 0:00.32 glusterfs
14567 root 20 0 785960 68676 3732 S 0.0 0.9 0:00.38 glusterfs
14567 root 20 0 785960 68708 3732 S 0.0 0.9 0:00.42 glusterfs
14567 root 20 0 985732 80352 4016 S 18.8 1.0 0:06.80 glusterfs
14567 root 20 0 985732 89856 4016 S 31.2 1.1 0:24.41 glusterfs
14567 root 20 0 985732 101584 4016 S 31.2 1.3 0:42.36 glusterfs
14567 root 20 0 985732 111684 4016 S 0.0 1.4 0:59.91 glusterfs
14567 root 20 0 985732 123144 4016 S 31.2 1.5 1:18.09 glusterfs
14567 root 20 0 985732 135124 4016 S 31.2 1.7 1:36.66 glusterfs
14567 root 20 0 985732 144356 4016 S 31.2 1.8 1:55.16 glusterfs
14567 root 20 0 985732 157132 4016 S 26.7 2.0 2:12.72 glusterfs
14567 root 20 0 985732 168676 4016 S 40.0 2.1 2:30.25 glusterfs
14567 root 20 0 985732 182460 4016 S 26.7 2.3 2:49.70 glusterfs
14567 root 20 0 985732 193696 4016 S 29.4 2.4 3:07.33 glusterfs
14567 root 20 0 985732 208052 4032 S 33.3 2.6 3:24.99 glusterfs
14567 root 20 0 985732 219024 4032 S 31.2 2.7 3:42.46 glusterfs

Another sample: I took two different samples within a span of ~3 minutes.

First sample:
============
[root@dhcp35-179 logs]# free -h
total used free shared buff/cache available
Mem: 7.6G 357M 6.0G 763M 1.3G 6.2G
Swap: 4.0G 108M 3.9G
[root@dhcp35-179 logs]# top -n 1 -b
top - 18:24:26 up 67 days, 5:06, 1 user, load average: 0.29, 0.23, 0.14
Tasks: 225 total, 1 running, 224 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.7 us, 3.4 sy, 0.0 ni, 93.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.6 st
KiB Mem : 8017096 total, 6336308 free, 368696 used, 1312092 buff/cache
KiB Swap: 4194300 total, 4082984 free, 111316 used. 6525196 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14567 root 20 0 985732 134596 4016 S 53.3 1.7 1:35.92 glusterfs
1013 root 20 0 4376 528 496 S 20.0 0.0 54:06.94 rngd
15233 root 20 0 157680 2140 1488 R 6.7 0.0 0:00.01 top
1 root 20 0 191868 6656 3524 S 0.0 0.1 16:33.93 systemd

Second sample:
============
[root@dhcp35-179 logs]# free -h
total used free shared buff/cache available
Mem: 7.6G 401M 6.0G 773M 1.3G 6.2G
Swap: 4.0G 108M 3.9G
[root@dhcp35-179 logs]# top -n 1 -b
top - 18:28:16 up 67 days, 5:10, 1 user, load average: 0.22, 0.26, 0.17
Tasks: 224 total, 1 running, 223 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.7 us, 3.4 sy, 0.0 ni, 93.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.6 st
KiB Mem : 8017096 total, 6262540 free, 413840 used, 1340716 buff/cache
KiB Swap: 4194300 total, 4082984 free, 111316 used. 6453096 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14567 root 20 0 985732 179488 4016 S 25.0 2.2 2:44.08 glusterfs
1 root 20 0 191868 6656 3524 S 0.0 0.1 16:33.94 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:02.81 kthreadd

I have even taken the process dumps between two intervals, which are available at /var/run (not in /var/run/gluster, but just /var/run).
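For reference, a minimal sketch of how the per-minute sampling described above can be scripted. The actual collection commands were not shared verbatim; the mount point, output file, and process-matching pattern below are assumptions:

MNT=/mnt/distrepvol                      # assumed mount point
OUT=$MNT/$(hostname)/memstats.log        # assumed output file inside the mount
PID=$(pgrep -f "glusterfs.*distrepvol" | head -1)
while true; do
    {
        date '+%F %T'                    # timestamp each record
        top -b -n 1 -p "$PID" | tail -2  # header plus the glusterfs row (VIRT/RES/%MEM)
        free -h
        echo "----"
    } >> "$OUT"
    sleep 60
done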
(In reply to nchilaka from comment #2)
> With the same cluster setup, I am rerunning the lookups on a new client. I
> feel the issue is with the lookups, as I see that the memory consumption of
> the glusterfs mount process increases by about ~0.1% every minute.
> From this specific client I created a directory and started lookups on the
> root of the mount using the command below:

What is the exact setup now? In particular, the questions I have are:
- This time, are you doing lookups only from the new client, or do the four (five?) clients exhibit simultaneous activity?
- Also, what does the file system tree look like in the volume? Is it just the single dir you created now? If not, and there is already stuff there, what's the significance of creating a new directory?

My other type of question is about reproducibility. Do you see the same issue if you create a similar setup on another set of nodes? And if yes, what if you try to simplify the cluster? I.e., cutting back on the number of replicas / the size of the distribution set (maybe simplifying as far as only distributed / only replicated)?
(In reply to Csaba Henk from comment #3)
> (In reply to nchilaka from comment #2)
> > With the same cluster setup, I am rerunning the lookups on a new client. I
> > feel the issue is with the lookups, as I see that the memory consumption of
> > the glusterfs mount process increases by about ~0.1% every minute.
> > From this specific client I created a directory and started lookups on the
> > root of the mount using the command below:
>
> What is the exact setup now? In particular, the questions I have are:
> - This time, are you doing lookups only from the new client, or do the four
> (five?) clients exhibit simultaneous activity?
> - Also, what does the file system tree look like in the volume? Is it just
> the single dir you created now? If not, and there is already stuff there,
> what's the significance of creating a new directory?

I have a 4x2 volume, spanning across 4 gluster nodes.
Regarding IOs, I have 4 clients, each mounting the volume from a different cluster node IP:
client1: untar of the Linux kernel
client2: using dd to create 2 million files of 10KB size
client3: mkdir creation with a lot of depth and breadth
client4: doing lookups using ls and stat from the root ===> hitting the OOM kill only here.

> My other type of question is about reproducibility. Do you see the same
> issue if you create a similar setup on another set of nodes? And if yes,
> what if you try to simplify the cluster? I.e., cutting back on the number of
> replicas / the size of the distribution set (maybe simplifying as far as only
> distributed / only replicated)?

Reproducibility: I observed this on at least 2 other machines while doing lookups.... same workload as client4.
I have to see about reproducing it on a smaller/simplified setup, but this setup is for systemic-style testing.

The reason why I have directory names like that: they are all client hostnames, and I am collecting log information like memory and CPU usage into those directories..... this tackles two things: one is log collection and the other is continuous append IO.
(In reply to nchilaka from comment #4)
> (In reply to Csaba Henk from comment #3)
> > (In reply to nchilaka from comment #2)
> > > With the same cluster setup, I am rerunning the lookups on a new client. I
> > > feel the issue is with the lookups, as I see that the memory consumption of
> > > the glusterfs mount process increases by about ~0.1% every minute.
> > > From this specific client I created a directory and started lookups on the
> > > root of the mount using the command below:
> >
> > What is the exact setup now? In particular, the questions I have are:
> > - This time, are you doing lookups only from the new client, or do the four
> > (five?) clients exhibit simultaneous activity?
> > - Also, what does the file system tree look like in the volume? Is it just
> > the single dir you created now? If not, and there is already stuff there,
> > what's the significance of creating a new directory?
>
> I have a 4x2 volume, spanning across 4 gluster nodes.
> Regarding IOs, I have 4 clients, each mounting the volume from a different cluster node IP:
> client1: untar of the Linux kernel
> client2: using dd to create 2 million files of 10KB size
> client3: mkdir creation with a lot of depth and breadth
> client4: doing lookups using ls and stat from the root ===> hitting the OOM kill only here.
>
> > My other type of question is about reproducibility. Do you see the same
> > issue if you create a similar setup on another set of nodes? And if yes,
> > what if you try to simplify the cluster? I.e., cutting back on the number of
> > replicas / the size of the distribution set (maybe simplifying as far as only
> > distributed / only replicated)?
>
> Reproducibility: I observed this on at least 2 other machines while doing
> lookups.... same workload as client4.
> I have to see about reproducing it on a smaller/simplified setup, but this
> setup is for systemic-style testing.
>
> The reason why I have directory names like that: they are all client
> hostnames, and I am collecting log information like memory and CPU usage
> into those directories..... this tackles two things: one is log collection
> and the other is continuous append IO.

Also, the file structure is as below. For directory creation in the dhcp35-86 folder I used:

for i in {1..1000};do for j in {1..1000};do for k in {1..1000};do for l in {1..1000};do for m in {1..1000};do mkdir -p level1.$i;mkdir -p level1.$i/level2.$j;mkdir -p level1.$i/level2.$j/level3.$k;mkdir -p level1.$i/level2.$j/level3.$k/level4.$l;mkdir -p level1.$i/level2.$j/level3.$k/level4.$l;mkdir -p level1.$i/level2.$j/level3.$k/level4.$l/level5.$m;done;done;done;done;done

The current structure is as below, and growing:
distrepvol/rootdir1/dhcp35-86.lab.eng.blr.redhat.com/dirs/level1.1/level2.1/level{3.1..3.12 and growing}/level{4.1..4.1000}/level{1..1000}

The other host directories are not of much depth.
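For clarity, a minimal sketch that is equivalent to the directory-creation loop above: a single mkdir -p per leaf creates all ancestor levels. The bounds are reduced here purely for illustration (the test itself used {1..1000} at each level):

# Equivalent tree-creation sketch; one mkdir -p per leaf path.
for i in {1..2}; do
  for j in {1..2}; do
    for k in {1..2}; do
      for l in {1..2}; do
        for m in {1..2}; do
          mkdir -p "level1.$i/level2.$j/level3.$k/level4.$l/level5.$m"
        done
      done
    done
  done
done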
Can you please attach the statedumps you have?
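For context, a statedump of a fuse client is typically generated by sending SIGUSR1 to the glusterfs mount process; the dump directory and file naming can vary by build and configuration (the reporter notes /var/run here), so the sketch below is an assumption rather than a record of the commands actually used:

PID=$(pgrep -f "glusterfs.*distrepvol" | head -1)   # hypothetical match pattern
kill -USR1 "$PID"
# The dump usually lands in the statedump directory (e.g. /var/run or /var/run/gluster),
# named like glusterdump.<pid>.dump.<timestamp>.
ls -lt /var/run/glusterdump.* /var/run/gluster/glusterdump.* 2>/dev/null | head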
(In reply to nchilaka from comment #2)
> Also, note that I am collecting top and free -h output by appending it to a
> file in the directory I created from this mount. This is the only write IO
> being done from *THIS* mount point (there can be some other IOs from other
> clients).
>
> Following is the memory consumption of the glusterfs mount process, sampled
> every minute:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 14567 root 20 0 720424 64348 3716 S 0.0 0.8 0:00.29 glusterfs
> 14567 root 20 0 720424 68472 3716 S 0.0 0.9 0:00.32 glusterfs
> 14567 root 20 0 785960 68676 3732 S 0.0 0.9 0:00.38 glusterfs
> 14567 root 20 0 785960 68708 3732 S 0.0 0.9 0:00.42 glusterfs
> 14567 root 20 0 985732 80352 4016 S 18.8 1.0 0:06.80 glusterfs
> 14567 root 20 0 985732 89856 4016 S 31.2 1.1 0:24.41 glusterfs
> 14567 root 20 0 985732 101584 4016 S 31.2 1.3 0:42.36 glusterfs
> 14567 root 20 0 985732 111684 4016 S 0.0 1.4 0:59.91 glusterfs
> 14567 root 20 0 985732 123144 4016 S 31.2 1.5 1:18.09 glusterfs
> 14567 root 20 0 985732 135124 4016 S 31.2 1.7 1:36.66 glusterfs
> 14567 root 20 0 985732 144356 4016 S 31.2 1.8 1:55.16 glusterfs
> 14567 root 20 0 985732 157132 4016 S 26.7 2.0 2:12.72 glusterfs
> 14567 root 20 0 985732 168676 4016 S 40.0 2.1 2:30.25 glusterfs
> 14567 root 20 0 985732 182460 4016 S 26.7 2.3 2:49.70 glusterfs
> 14567 root 20 0 985732 193696 4016 S 29.4 2.4 3:07.33 glusterfs
> 14567 root 20 0 985732 208052 4032 S 33.3 2.6 3:24.99 glusterfs
> 14567 root 20 0 985732 219024 4032 S 31.2 2.7 3:42.46 glusterfs

Methodologically, it's a bit weird that you write these stats to the filesystem under investigation, but you are probably right that this does not make much of a difference.

What would be a useful extension: record these stats not at regular times but after each run of the .. ls -ltR; stat * ... loop. Or even better, keep making records at regular times but *also* after the runs (indicating that a given record is at a run boundary), endowed with timestamps, too.

Then we could see the amount of non-freed memory originating from a given run.

> I have even taken the process dumps between two intervals, which are
> available at /var/run (not in /var/run/gluster, but just /var/run).

Do you mean statedumps? Would it be possible to attach them to the BZ entry?
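A minimal sketch of the logging suggested above, with a hypothetical log path and helper name; it keeps the periodic records and adds a tagged record after every ls/stat run:

log_record() {                       # $1 = label, e.g. PERIODIC or RUN-BOUNDARY
    {
        echo "=== $1 $(date '+%F %T') ==="
        top -b -n 1 -p "$(pgrep -f glusterfs | head -1)" | tail -2
        free -h
    } >> /tmp/memstats.log
}

( while true; do log_record PERIODIC; sleep 60; done ) &   # background periodic sampler

for i in {1..1000}; do
    ls -lRt > /dev/null
    stat * > /dev/null
    log_record RUN-BOUNDARY          # record right after each run, as suggested
    sleep 500
done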
(In reply to Csaba Henk from comment #7)
> (In reply to nchilaka from comment #2)
> > Also, note that I am collecting top and free -h output by appending it to a
> > file in the directory I created from this mount. This is the only write IO
> > being done from *THIS* mount point (there can be some other IOs from other
> > clients).
> >
> > Following is the memory consumption of the glusterfs mount process, sampled
> > every minute:
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 14567 root 20 0 720424 64348 3716 S 0.0 0.8 0:00.29 glusterfs
> > 14567 root 20 0 720424 68472 3716 S 0.0 0.9 0:00.32 glusterfs
> > 14567 root 20 0 785960 68676 3732 S 0.0 0.9 0:00.38 glusterfs
> > 14567 root 20 0 785960 68708 3732 S 0.0 0.9 0:00.42 glusterfs
> > 14567 root 20 0 985732 80352 4016 S 18.8 1.0 0:06.80 glusterfs
> > 14567 root 20 0 985732 89856 4016 S 31.2 1.1 0:24.41 glusterfs
> > 14567 root 20 0 985732 101584 4016 S 31.2 1.3 0:42.36 glusterfs
> > 14567 root 20 0 985732 111684 4016 S 0.0 1.4 0:59.91 glusterfs
> > 14567 root 20 0 985732 123144 4016 S 31.2 1.5 1:18.09 glusterfs
> > 14567 root 20 0 985732 135124 4016 S 31.2 1.7 1:36.66 glusterfs
> > 14567 root 20 0 985732 144356 4016 S 31.2 1.8 1:55.16 glusterfs
> > 14567 root 20 0 985732 157132 4016 S 26.7 2.0 2:12.72 glusterfs
> > 14567 root 20 0 985732 168676 4016 S 40.0 2.1 2:30.25 glusterfs
> > 14567 root 20 0 985732 182460 4016 S 26.7 2.3 2:49.70 glusterfs
> > 14567 root 20 0 985732 193696 4016 S 29.4 2.4 3:07.33 glusterfs
> > 14567 root 20 0 985732 208052 4032 S 33.3 2.6 3:24.99 glusterfs
> > 14567 root 20 0 985732 219024 4032 S 31.2 2.7 3:42.46 glusterfs
>
> Methodologically, it's a bit weird that you write these stats to the
> filesystem under investigation, but you are probably right that this does
> not make much of a difference.
>
> What would be a useful extension: record these stats not at regular times
> but after each run of the .. ls -ltR; stat * ... loop. Or even better, keep
> making records at regular times but *also* after the runs (indicating that a
> given record is at a run boundary), endowed with timestamps, too.
>
> Then we could see the amount of non-freed memory originating from a given
> run.
>
> > I have even taken the process dumps between two intervals, which are
> > available at /var/run (not in /var/run/gluster, but just /var/run).
>
> Do you mean statedumps? Would it be possible to attach them to the BZ entry?

I have attached some of the statedumps.

Also, note that the last lookup message displayed was a "Transport endpoint is not connected" error:

./dhcp35-86.lab.eng.blr.redhat.com/dirs/level1.1/level2.1/level3.11/level4.267:
ls: ./dhcp35-86.lab.eng.blr.redhat.com/dirs/level1.1/level2.1/level3.11/level4.267/level5.275: Transport endpoint is not connected

I tried to cd to the directory, but the session hangs.

I have the tree structure of the directory which hosts many directories, but the problem is that it is a 600MB file.
Created attachment 1206220 [details] statedump1
Created attachment 1206221 [details] statedump2
Created attachment 1206222 [details] statedump3
Created attachment 1206223 [details] 5 statedumps tar image
Csaba, you can look into the statedumps available in attachment 1206223 [details]; ignore the other attachments.
The treepath log, even after zipping and tarring, is still 31MB, which cannot be attached here.
Nag, the RPM http://download.eng.bos.redhat.com/brewroot/work/tasks/434/11900434/glusterfs-fuse-3.8.4-0.el7rhgs.x86_64.rpm delivers a glusterfs fuse client patched to produce statedumps with more details about memory usage. So could you please:
- install the RPM
- repeat the test
- meanwhile take statedumps the same way you did before
- attach the statedumps.
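A sketch of the requested steps on the client; apart from the RPM named above, the mount point, remount step, and flags are assumptions:

rpm -Uvh --oldpackage glusterfs-fuse-3.8.4-0.el7rhgs.x86_64.rpm   # debug build from the link above
umount /mnt/distrepvol
mount -t glusterfs 10.70.37.150:distrepvol /mnt/distrepvol        # remount so the patched client is used
# rerun the lookup loop, take statedumps as before, and attach them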
Created attachment 1216808 [details]
glusterfs-fuse rpm for debugging

The RPM is no longer available at the location indicated in the previous comment, so I am providing it as an attachment.
Can we retest the same scenario with the RPM shared in comment 15 and share the results?
Comment on attachment 1216808 [details]
glusterfs-fuse rpm for debugging

For the record -- the change applied to the source is this one: http://review.gluster.org/13736
We have noticed that the bug is not reproducible in the latest version of the product (RHGS-3.3.1+). If the bug is still relevant and can still be reproduced, feel free to reopen it.