Description of problem:
========================
During my systemic testing, I hit issues which were reported as below:

https://bugzilla.redhat.com/show_bug.cgi?id=1415178 ===> systemic testing: seeing a lot of ping timeouts, which would lead to split-brains
https://bugzilla.redhat.com/show_bug.cgi?id=1415101 ===> glustershd process crashed on systemic setup

I kept the setup untouched over the weekend for further debugging, but with the IOs that were triggered previously still running. Today I saw that the brick #3 (dht-subvol 2) process got killed due to OOM.

One note: the shd crash for which the above bz was reported happened on node #4, and this brick is the replica pair of the node #4 brick.

Note: node 2 (10.70.37.86) was hung with kernel hangs and hence had to be rebooted.

[root@dhcp35-156 bricks]# gluster v status
Status of volume: systemic
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/systemic      49152     0          Y       4154
Brick 10.70.37.86:/rhs/brick1/systemic      49152     0          Y       2285
Brick 10.70.35.156:/rhs/brick1/systemic     N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/systemic     N/A       N/A        N       N/A
Brick 10.70.35.20:/rhs/brick2/systemic      N/A       N/A        N       N/A
Brick 10.70.37.86:/rhs/brick2/systemic      49153     0          Y       2290
Brick 10.70.35.156:/rhs/brick2/systemic     49153     0          Y       4045
Brick 10.70.37.154:/rhs/brick2/systemic     49153     0          Y       3873
Self-heal Daemon on localhost               N/A       N/A        Y       19630
Quota Daemon on localhost                   N/A       N/A        Y       19638
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       16060
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       16068
Self-heal Daemon on dhcp35-20.lab.eng.blr.redhat.com      N/A   N/A     Y       19131
Quota Daemon on dhcp35-20.lab.eng.blr.redhat.com          N/A   N/A     Y       19139
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       30598
Quota Daemon on 10.70.37.154                N/A       N/A        Y       30606

Task Status of Volume systemic
------------------------------------------------------------------------------
There are no active volume tasks

OOM-killer excerpt from the kernel log (the first line is truncated as captured):

free:43291 free_pcp:61 free_cma:0
[Sat Jan 21 07:28:01 2017] Node 0 DMA free:15908kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Sat Jan 21 07:28:01 2017] lowmem_reserve[]: 0 2815 16475 16475
[Sat Jan 21 07:28:01 2017] Node 0 DMA32 free:85160kB min:11532kB low:14412kB high:17296kB active_anon:1944336kB inactive_anon:546052kB active_file:0kB inactive_file:68kB unevictable:4524kB isolated(anon):0kB isolated(file):0kB present:3129332kB managed:2883568kB mlocked:4524kB dirty:0kB writeback:0kB mapped:2120kB shmem:556kB slab_reclaimable:10696kB slab_unreclaimable:277408kB kernel_stack:2016kB pagetables:5736kB unstable:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3 all_unreclaimable? yes
[Sat Jan 21 07:28:01 2017] lowmem_reserve[]: 0 0 13659 13659
[Sat Jan 21 07:28:01 2017] Node 0 Normal free:72096kB min:55984kB low:69980kB high:83976kB active_anon:12328876kB inactive_anon:1232716kB active_file:424kB inactive_file:352kB unevictable:27908kB isolated(anon):0kB isolated(file):0kB present:14262272kB managed:13987552kB mlocked:27908kB dirty:0kB writeback:44kB mapped:6116kB shmem:2528kB slab_reclaimable:33824kB slab_unreclaimable:68868kB kernel_stack:4176kB pagetables:50240kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1219 all_unreclaimable? yes
[Sat Jan 21 07:28:01 2017] lowmem_reserve[]: 0 0 0 0
[Sat Jan 21 07:28:01 2017] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[Sat Jan 21 07:28:01 2017] Node 0 DMA32: 21331*4kB (UEM) 1*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 85332kB
[Sat Jan 21 07:28:01 2017] Node 0 Normal: 17899*4kB (UEM) 29*8kB (UEM) 23*16kB (M) 2*32kB (M) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 72324kB
[Sat Jan 21 07:28:01 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Sat Jan 21 07:28:01 2017] 33005 total pagecache pages
[Sat Jan 21 07:28:01 2017] 29897 pages in swap cache
[Sat Jan 21 07:28:01 2017] Swap cache stats: add 295752345, delete 295722448, find 92579823/127325977
[Sat Jan 21 07:28:01 2017] Free swap  = 0kB
[Sat Jan 21 07:28:01 2017] Total swap = 8585212kB
[Sat Jan 21 07:28:01 2017] 4351899 pages RAM
[Sat Jan 21 07:28:01 2017] 0 pages HighMem/MovableOnly
[Sat Jan 21 07:28:01 2017] 130142 pages reserved
[Sat Jan 21 07:28:01 2017] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Sat Jan 21 07:28:01 2017] [  561]     0   561     9204      277      22       47             0 systemd-journal
[Sat Jan 21 07:28:01 2017] [  580]     0   580    87696      321      35     1470             0 lvmetad
[Sat Jan 21 07:28:01 2017] [  609]     0   609    79648     1811      36        0         -1000 multipathd
[Sat Jan 21 07:28:01 2017] [  616]     0   616    11745      272      23      615         -1000 systemd-udevd
[Sat Jan 21 07:28:01 2017] [  787]     0   787    46853     7504      44        0         -1000 dmeventd
[Sat Jan 21 07:28:01 2017] [  889]     0   889    13854      214      27       90         -1000 auditd
[Sat Jan 21 07:28:01 2017] [  909]     0   909     4202      201      14       46             0 alsactl
[Sat Jan 21 07:28:01 2017] [  910]     0   910     1092      105       8       16             0 rngd
[Sat Jan 21 07:28:01 2017] [  911]     0   911    53103      262      57      384             0 abrtd
[Sat Jan 21 07:28:01 2017] [  912]    81   912    24620      330      17       85          -900 dbus-daemon
[Sat Jan 21 07:28:01 2017] [  914]   997   914    28962      282      28       72             0 chronyd
[Sat Jan 21 07:28:01 2017] [  925]     0   925     4826      212      13       42             0 irqbalance
[Sat Jan 21 07:28:01 2017] [  926]   998   926   132071      328      55     1668             0 polkitd
[Sat Jan 21 07:28:01 2017] [  934]     0   934    52548      238      53      326             0 abrt-watch-log
[Sat Jan 21 07:28:01 2017] [  939]     0   939     6070      190      16      124             0 smartd
[Sat Jan 21 07:28:01 2017] [  940]     0   940     6048      252      16       37             0 systemd-logind
[Sat Jan 21 07:28:01 2017] [  944]     0   944     1083       95       7       34             0 acpid
[Sat Jan 21 07:28:01 2017] [  957]    32   957    16237      134      34      123             0 rpcbind
[Sat Jan 21 07:28:01 2017] [  963]     0   963    82803      853      84     5442             0 firewalld
[Sat Jan 21 07:28:01 2017] [  964]     0   964    50303      144      38      114             0 gssproxy
[Sat Jan 21 07:28:01 2017] [  972]     0   972    31555      249      18      129             0 crond
[Sat Jan 21 07:28:01 2017] [  973]     0   973     6461      183      16       47             0 atd
[Sat Jan 21 07:28:01 2017] [  979]     0   979    27509      164      10       30             0 agetty
[Sat Jan 21 07:28:01 2017] [  981]     0   981   127850      539      67      268             0 NetworkManager
[Sat Jan 21 07:28:01 2017] [ 1084]     0  1084    28206      274      56     3076             0 dhclient
[Sat Jan 21 07:28:01 2017] [ 1309]     0  1309   158898      279     122      972             0 libvirtd
[Sat Jan 21 07:28:01 2017] [ 1311]   995  1311    11050      143      24      172             0 nrpe
[Sat Jan 21 07:28:01 2017] [ 1314]     0  1314    71328      230      41      316             0 rsyslogd
[Sat Jan 21 07:28:01 2017] [ 1316]     0  1316   138290      452      90     2576             0 tuned
[Sat Jan 21 07:28:01 2017] [ 1319]     0  1319   168455     1525      80     6170             0 glusterd
[Sat Jan 21 07:28:01 2017] [ 1324]     0  1324    20617       74      39      213         -1000 sshd
[Sat Jan 21 07:28:01 2017] [ 1325]     0  1325    28335      114      11       18             0 rhsmcertd
[Sat Jan 21 07:28:01 2017] [ 1421]     0  1421    22244      189      42      236             0 master
[Sat Jan 21 07:28:01 2017] [ 1429]    89  1429    22314      256      41      233             0 qmgr
[Sat Jan 21 07:28:01 2017] [ 1657]     0  1657    26978       99       9       11             0 rhnsd
[Sat Jan 21 07:28:01 2017] [ 4026]     0  4026  4351104  2481504    7524  1271633             0 glusterfsd
[Sat Jan 21 07:28:01 2017] [ 4045]     0  4045  2933885  1494036    4702   816818             0 glusterfsd
[Sat Jan 21 07:28:01 2017] [ 4066]     0  4066   643284     2959     289    16464             0 glusterfs
[Sat Jan 21 07:28:01 2017] [ 4097]     0  4097   135927      322      73    13924             0 glusterfs
[Sat Jan 21 07:28:01 2017] [ 6258]     0  6258    31927      122      18      129             0 screen
[Sat Jan 21 07:28:01 2017] [ 6259]     0  6259    29317      279      14      523             0 bash
[Sat Jan 21 07:28:01 2017] [14012]     0 14012    26974      136      10        1             0 sleep
[Sat Jan 21 07:28:01 2017] [14127]    89 14127    22270      470      43        0             0 pickup
[Sat Jan 21 07:28:01 2017] Out of memory: Kill process 4026 (glusterfsd) score 572 or sacrifice child
[Sat Jan 21 07:28:01 2017] Killed process 4026 (glusterfsd) total-vm:17404416kB, anon-rss:9926016kB, file-rss:0kB, shmem-rss:0kB
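A quick cross-check of the numbers above: the kernel's OOM task list reports total_vm and rss in 4 kB pages, so the figures for pid 4026 can be verified against the kill message with simple arithmetic:

# rss and total_vm in the task list are counts of 4 kB pages
echo $((4351104 * 4))   # 17404416 kB -> matches total-vm:17404416kB
echo $((2481504 * 4))   # 9926016 kB  -> matches anon-rss:9926016kB

So this one brick process had grown to roughly 9.5 GiB resident, plus about 4.8 GiB swapped out (1271633 swap entries * 4 kB), on a node whose swap was already exhausted ("Free swap = 0kB"), which is why the OOM killer targeted it.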
Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1415632
Can we please check whether this is a duplicate of BZ 1413351?
Hi,

Is it possible to share a statedump of the glustershd process from the other nodes in the cluster, if it is up?

Regards,
Mohit Agrawal
(In reply to Mohit Agrawal from comment #4)
> Hi,
>
> Is it possible to share statedump of glustershd process from other nodes in
> cluster if it is up?
>
> Regards
> Mohit Agrawal

Please also share the statedumps of the brick processes that are running.

Regards,
Mohit Agrawal
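(For reference, one way to capture these statedumps — a sketch assuming default settings, so the output path and pid-matching pattern may differ on this setup:

# dump the state of all running brick processes of the volume;
# files land in /var/run/gluster by default
gluster volume statedump systemic

# glustershd, like any glusterfs process, writes a statedump on SIGUSR1
kill -USR1 $(pgrep -f glustershd)

# dumps are named after the brick path / process, with pid and timestamp
ls -lrt /var/run/gluster/
)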
Hi Mohit,

The statedumps of the bricks are available on the volume itself. You can mount the volume and visit the path below:

/mnt/systemic/logs/rhs-client11.lab.eng.blr.redhat.com/statedumps

It has the dumps of all bricks of all nodes. Given that this is systemic testing, it may be highly difficult to point to one particular statedump.
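(For anyone else picking this up, a minimal sketch of reaching those dumps; the server address is an assumption — any node hosting the volume, e.g. 10.70.35.20 from the status output above, should be able to serve the volfile:

mkdir -p /mnt/systemic
mount -t glusterfs 10.70.35.20:/systemic /mnt/systemic
ls /mnt/systemic/logs/rhs-client11.lab.eng.blr.redhat.com/statedumps
)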
Hi,

Thank you for sharing the location of the statedumps. I have checked the current brick process statedumps in /var/run/gluster; with respect to the dict leak, it is similar to bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1411329, and the current memory consumption of the brick process is also high:

ps -aef | grep glusterfsd
root      4154     1 75 Jan18 ?        4-10:47:51 /usr/sbin/glusterfsd -s 10.70.35.20 --volfile-id systemic.10.70.35.20.rhs-brick1-systemic -p /var/lib/glusterd/vols/systemic/run/10.70.35.20-rhs-brick1-systemic.pid -S /var/run/gluster/ecb70f1c5d8ade863b15ff45abb9d46c.socket --brick-name /rhs/brick1/systemic -l /var/log/glusterfs/bricks/rhs-brick1-systemic.log --xlator-option *-posix.glusterd-uuid=c5954ab2-6283-4e02-a6f6-1fa6cd943f49 --brick-port 49152 --xlator-option systemic-server.listen-port=49152
root     24472 24391  0 17:56 pts/4    00:00:00 grep --color=auto glusterfsd

[root@dhcp35-20 gluster]# pmap -x 4154 | grep "total"
total kB        21724380 14717652 14714608

The dict leak issue is already fixed via bugzilla 1411329.

Regards,
Mohit Agrawal
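(For the record, roughly how such a dict leak shows up in a brick statedump — a sketch; the gf_common_mt_dict_t memusage section name is taken from typical glusterfs statedumps and may vary by version:

# trigger a fresh dump of the suspect brick (pid 4154 from the ps output above)
kill -USR1 4154

# compare dict allocation counters across successive dumps; a num_allocs
# value that grows without bound points at a dict leak
grep -A4 "gf_common_mt_dict_t" /var/run/gluster/*.dump.* | tail -20
)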