Description of problem:
======================
Had a 6-node brick-mux setup for verifying BZ#1463517 - Brick Multiplexing: dmesg shows request_sock_TCP: Possible SYN flooding on port 49152 and memory related backtraces. Had run IOs on 10 of the 40 1x3 volumes for almost 2 days without any issues.

With this setup, I wanted to check an in-service upgrade of a brick-mux setup, and went ahead with 2 nodes at a time (1 node hosting bricks, 1 just a dummy peer). All was good until I had upgraded 5 nodes (3 dummy, 2 brick nodes). But after upgrading the 3rd node, glustershd did not come online.

I tried to trigger heal, but heal failed for all volumes as below:

Launching heal operation to perform index self heal on volume vname_20 has been unsuccessful on bricks that are down. Please check if all brick processes are running.

This is possibly due to BZ#1476828 - self-heal daemon getting connection refused, due to bricks listening on different ports.

I then went to stop all the volumes, but after a few stopped successfully the rest failed as below:

[root@dhcp35-45 ~]# for i in $(gluster v list);do gluster v stop $i --mode=script;done
volume stop: vname_1: success
volume stop: vname_10: success
volume stop: vname_11: success
volume stop: vname_12: success
volume stop: vname_13: failed: Volume vname_13 is not in the started state
volume stop: vname_14: failed: Commit failed on 10.70.35.122. Error: error
volume stop: vname_15: failed: Commit failed on 10.70.35.122. Error: error

Found a glustershd core (a standalone sketch of the suspected NULL dereference is included at the end of this comment):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/gl'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000555dca759ff1 in glusterfs_handle_translator_op (req=0x7fb750003490) at glusterfsd-mgmt.c:674
674             any = active->first;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64

(gdb) bt
#0  0x0000555dca759ff1 in glusterfs_handle_translator_op (req=0x7fb750003490) at glusterfsd-mgmt.c:674
#1  0x00007fb762c14f12 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2  0x00007fb761258d40 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()
(gdb) t a a bt

Thread 7 (Thread 0x7fb7630b9780 (LWP 14305)):
#0  0x00007fb761a3ef57 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fb762c394a0 in event_dispatch_epoll (event_pool=0x555dcc8dffd0) at event-epoll.c:732
#2  0x0000555dca752eb3 in main (argc=13, argv=<optimized out>) at glusterfsd.c:2479

Thread 6 (Thread 0x7fb7599e7700 (LWP 14307)):
#0  0x00007fb761a45371 in sigwait () from /lib64/libpthread.so.0
#1  0x0000555dca75601b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2069
#2  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb75a1e8700 (LWP 14306)):
#0  0x00007fb761a44e4d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fb762bebcfe in gf_timer_proc (data=0x555dcc8e7d20) at timer.c:176
#2  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb755f24700 (LWP 14311)):
#0  0x00007fb76130a923 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb762c38fe2 in event_dispatch_epoll_worker (data=0x555dcc9268b0) at event-epoll.c:638
#2  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb7581e4700 (LWP 14310)):
#0  0x00007fb761a41cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fb762c17448 in syncenv_task (proc=proc@entry=0x555dcc8e8930) at syncop.c:603
#2  0x00007fb762c18290 in syncenv_processor (thdata=0x555dcc8e8930) at syncop.c:695
#3  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fb7591e6700 (LWP 14308)):
#0  0x00007fb7612d11ad in nanosleep () from /lib64/libc.so.6
#1  0x00007fb7612d1044 in sleep () from /lib64/libc.so.6
#2  0x00007fb762c051cd in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb7589e5700 (LWP 14309)):
#0  0x0000555dca759ff1 in glusterfs_handle_translator_op (req=0x7fb750003490) at glusterfsd-mgmt.c:674
#1  0x00007fb762c14f12 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2  0x00007fb761258d40 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()

Version-Release number of selected component (if applicable):
=============================================================
Upgrading from 3.8.4-36 to 3.8.4-37

Note: On the problem node, I had restarted the gluster-related services about 3 times (not back to back, but with some time interval in between) before actually stopping the node and upgrading it. This was a manual error, due to multitasking between different test executions on my part.
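For context on frame #0: glusterfsd-mgmt.c:674 dereferences active->first, so the segfault implies that "active" (the process's active graph) was still NULL when the translator-op request reached glustershd, presumably because its graph had not been (re)initialised at that point of the in-service upgrade. Below is a minimal standalone C sketch of that suspected pattern and of the kind of NULL guard that would turn the crash into a plain error; the structures are simplified stand-ins invented for illustration, not the real glusterfs_ctx_t/glusterfs_graph_t definitions, and this is not the actual fix.

/* Standalone illustration only; structure names are hypothetical stand-ins. */
#include <stdio.h>
#include <stdlib.h>

typedef struct xlator { const char *name; struct xlator *next; } xlator_t;
typedef struct graph  { xlator_t *first; } graph_t;
typedef struct ctx    { graph_t *active; } ctx_t;

/* Mimics the crash site: if ctx->active is still NULL (graph not yet
 * initialised), dereferencing active->first would segfault just like the
 * "any = active->first;" line in frame #0 above. */
static int handle_translator_op(ctx_t *ctx)
{
        graph_t  *active = ctx->active;
        xlator_t *any    = NULL;

        if (!active) {          /* the guard that appears to be missing */
                fprintf(stderr, "graph not active yet, rejecting op\n");
                return -1;
        }

        any = active->first;
        printf("dispatching op starting at xlator %s\n", any->name);
        return 0;
}

int main(void)
{
        ctx_t ctx = { .active = NULL };  /* the state glustershd was likely in */

        return handle_translator_op(&ctx) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}

With ctx.active left NULL, the guarded version simply logs and returns an error to the caller instead of dumping core.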
Created attachment 1308724 [details] core file
The core is attached; logs can be found at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.<id>/