Bug 1478010 - glustershd crashes when upgrading a brick mux setup
Summary: glustershd crashes when upgrading a brick mux setup
Keywords:
Status: CLOSED DUPLICATE of bug 1460245
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ravishankar N
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-03 12:52 UTC by Nag Pavan Chilakam
Modified: 2018-01-20 07:14 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-24 09:15:01 UTC
Embargoed:


Attachments
core file (291.49 KB, application/x-gzip)
2017-08-03 12:55 UTC, Nag Pavan Chilakam

Description Nag Pavan Chilakam 2017-08-03 12:52:22 UTC
Description of problem:
======================

Had a 6-node brick-mux setup for verifying BZ#1463517 - Brick Multiplexing: dmesg shows request_sock_TCP: Possible SYN flooding on port 49152 and memory related backtraces


I had run I/O on 10 of the 40 1x3 volumes for almost 2 days without any issues.

With this setup, I wanted to check an in-service upgrade of a brick-mux setup, and went ahead with 2 nodes at a time (1 node hosting bricks, 1 just a dummy peer).

All was good until I had updated 5 nodes (3 dummy, 2 brick nodes).

But after upgrading the 3rd node, glustershd didn't come online.
I tried to trigger heal, but it failed for all volumes as below:

Launching heal operation to perform index self heal on volume vname_20 has been unsuccessful on bricks that are down. Please check if all brick processes are running.


Possibly due to bug 1476828 - self-heal daemon getting connection refused, due to bricks listening on different ports.


I then went on to stop all the volumes, but the stops started failing after a few had succeeded, as below:
[root@dhcp35-45 ~]# for i in $(gluster v list);do gluster v stop $i --mode=script;done
volume stop: vname_1: success
volume stop: vname_10: success
volume stop: vname_11: success
volume stop: vname_12: success
volume stop: vname_13: failed: Volume vname_13 is not in the started state
volume stop: vname_14: failed: Commit failed on 10.70.35.122. Error: error
volume stop: vname_15: failed: Commit failed on 10.70.35.122. Error: error

Found a glustershd core

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/gl'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000555dca759ff1 in glusterfs_handle_translator_op (req=0x7fb750003490) at glusterfsd-mgmt.c:674
674	        any = active->first;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x0000555dca759ff1 in glusterfs_handle_translator_op (req=0x7fb750003490) at glusterfsd-mgmt.c:674
#1  0x00007fb762c14f12 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2  0x00007fb761258d40 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()
(gdb) t a  a bt

Thread 7 (Thread 0x7fb7630b9780 (LWP 14305)):
#0  0x00007fb761a3ef57 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fb762c394a0 in event_dispatch_epoll (event_pool=0x555dcc8dffd0) at event-epoll.c:732
#2  0x0000555dca752eb3 in main (argc=13, argv=<optimized out>) at glusterfsd.c:2479

Thread 6 (Thread 0x7fb7599e7700 (LWP 14307)):
#0  0x00007fb761a45371 in sigwait () from /lib64/libpthread.so.0
#1  0x0000555dca75601b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2069
#2  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb75a1e8700 (LWP 14306)):
#0  0x00007fb761a44e4d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fb762bebcfe in gf_timer_proc (data=0x555dcc8e7d20) at timer.c:176
#2  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb755f24700 (LWP 14311)):
#0  0x00007fb76130a923 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb762c38fe2 in event_dispatch_epoll_worker (data=0x555dcc9268b0) at event-epoll.c:638
#2  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb7581e4700 (LWP 14310)):
#0  0x00007fb761a41cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fb762c17448 in syncenv_task (proc=proc@entry=0x555dcc8e8930) at syncop.c:603
#2  0x00007fb762c18290 in syncenv_processor (thdata=0x555dcc8e8930) at syncop.c:695
#3  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fb7591e6700 (LWP 14308)):
#0  0x00007fb7612d11ad in nanosleep () from /lib64/libc.so.6
#1  0x00007fb7612d1044 in sleep () from /lib64/libc.so.6
#2  0x00007fb762c051cd in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3  0x00007fb761a3de25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb76130a34d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb7589e5700 (LWP 14309)):
#0  0x0000555dca759ff1 in glusterfs_handle_translator_op (req=0x7fb750003490) at glusterfsd-mgmt.c:674
#1  0x00007fb762c14f12 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2  0x00007fb761258d40 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()


Version-Release number of selected component (if applicable):
=======
3.8.4-36 to 3.8.4-37

Note: I had restarted the gluster-related services about 3 times on the problem node before actually stopping and upgrading it (not back to back, but with some time interval); this was a manual error on my part, due to multitasking between different test executions.

Comment 2 Nag Pavan Chilakam 2017-08-03 12:55:26 UTC
Created attachment 1308724 [details]
core file

Comment 3 Nag Pavan Chilakam 2017-08-03 12:56:04 UTC
Core attached; logs can be found at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.<id>/

