Description of problem:
=========================
Following is the backtrace of the observed crash:

[New LWP 13894]
[New LWP 13892]
[New LWP 13891]
[New LWP 13890]
[New LWP 14045]
[New LWP 13893]
[New LWP 14044]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fded99ed126 in rcu_read_unlock_bp () from /lib64/liburcu-bp.so.1

Thread 7 (Thread 0x7fded6535700 (LWP 14044)):
#0  0x00007fdee45bb705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fdeda33be23 in hooks_worker (args=<optimized out>) at glusterd-hooks.c:523
        conf = <optimized out>
        hooks_priv = 0x7fdee754e480
        stub = <optimized out>
#2  0x00007fdee45b7df5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007fdee3efe1ad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x7fdedb5d5700 (LWP 13893)):
#0  0x00007fdee45bbab2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fdee5488438 in syncenv_task (proc=proc@entry=0x7fdee74f0ae0) at syncop.c:591
        env = 0x7fdee74f0ae0
        task = 0x0
        sleep_till = {tv_sec = 1434179489, tv_nsec = 0}
        ret = <optimized out>
#2  0x00007fdee54890a0 in syncenv_processor (thdata=0x7fdee74f0ae0) at syncop.c:683
        env = 0x7fdee74f0ae0
        proc = 0x7fdee74f0ae0
        task = <optimized out>
#3  0x00007fdee45b7df5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007fdee3efe1ad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x7fded5d34700 (LWP 14045)):
#0  0x00007fdee45bb705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fdee548657b in __synclock_lock (lock=lock@entry=0x7fded6afc7d8) at syncop.c:909
        task = 0x0
        __FUNCTION__ = "__synclock_lock"
#2  0x00007fdee5489366 in synclock_lock (lock=lock@entry=0x7fded6afc7d8) at syncop.c:935
        ret = 0
#3  0x00007fdeda2eb84d in glusterd_big_locked_cbk (req=0x7fdee7585278, iov=0x7fdee75852b8, count=1, myframe=0x7fdee30921dc, fn=0x7fdeda2e8ab0 <__glusterd_friend_update_cbk>) at glusterd-rpc-ops.c:211
        priv = 0x7fded6ad3050
        ret = -1
#4  0x00007fdee521b480 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fdee7584d40, pollin=pollin@entry=0x7fdec800e4a0) at rpc-clnt.c:766
        conn = 0x7fdee7584d70
        saved_frame = <optimized out>
        ret = 0
        req = 0x7fdee7585278
        xid = 9
        __FUNCTION__ = "rpc_clnt_handle_reply"
#5  0x00007fdee521b6f4 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fdee7584d70, event=<optimized out>, data=0x7fdec800e4a0) at rpc-clnt.c:894
        conn = 0x7fdee7584d70
        clnt = 0x7fdee7584d40
        ret = -1
        req_info = 0x0
        pollin = 0x7fdec800e4a0
        clnt_mydata = 0x0
#6  0x00007fdee52175d3 in rpc_transport_notify (this=this@entry=0x7fdee7587e70, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fdec800e4a0) at rpc-transport.c:543
        ret = -1
        __FUNCTION__ = "rpc_transport_notify"
#7  0x00007fded7d045a6 in socket_event_poll_in (this=this@entry=0x7fdee7587e70) at socket.c:2290
        ret = <optimized out>
        pollin = 0x7fdec800e4a0
        priv = <optimized out>
#8  0x00007fded7d07374 in socket_event_handler (fd=fd@entry=14, idx=idx@entry=4, data=0x7fdee7587e70, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
        this = 0x7fdee7587e70
        priv = 0x7fdee7588b00
        ret = 0
        __FUNCTION__ = "socket_event_handler"
#9  0x00007fdee54a448a in event_dispatch_epoll_handler (event=0x7fded5d33e80, event_pool=0x7fdee74e1c10) at event-epoll.c:572
        handler = 0x7fded7d07210 <socket_event_handler>
        gen = 1
        slot = 0x7fdee7524900
        data = <optimized out>
        ret = -1
        fd = 14
        ev_data = 0x7fded5d33e84
        idx = 4
#10 event_dispatch_epoll_worker (data=0x7fdee74f4a30) at event-epoll.c:674
        event = {events = 1, data = {ptr = 0x100000004, fd = 4, u32 = 4, u64 = 4294967300}}
        ret = <optimized out>
        ev_data = 0x7fdee74f4a30
        event_pool = 0x7fdee74e1c10
        myindex = 1
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#11 0x00007fdee45b7df5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#12 0x00007fdee3efe1ad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x7fdee58f6780 (LWP 13890)):
#0  0x00007fdee45b8f27 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fdee54a4795 in event_dispatch_epoll (event_pool=0x7fdee74e1c10) at event-epoll.c:759
        i = <optimized out>
        t_id = 140595046860544
        pollercount = 1
        ret = 0
        ev_data = 0x7fdee74f4a30
        __FUNCTION__ = "event_dispatch_epoll"
#2  0x00007fdee59116e7 in main (argc=3, argv=0x7fff45d697a8) at glusterfsd.c:2326
        ctx = 0x7fdee74c3010
        ret = 0
        cmdlinestr = "/usr/sbin/glusterd -p /var/run/glusterd.pid", '\000' <repeats 4052 times>
        cmd = 0x7fdee74c3010
        __FUNCTION__ = "main"

Thread 3 (Thread 0x7fdedc5d7700 (LWP 13891)):
#0  0x00007fdee45be99d in nanosleep () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fdee5469934 in gf_timer_proc (ctx=0x7fdee74c3010) at timer.c:195
        now = 153571559234362
        now_ts = {tv_sec = 153571, tv_nsec = 559234362}
        reg = 0x7fdee74f0730
        sleepts = {tv_sec = 1, tv_nsec = 0}
        event = 0x7fdecc000ad0
        old_THIS = <optimized out>
        __FUNCTION__ = "gf_timer_proc"
#2  0x00007fdee45b7df5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007fdee3efe1ad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x7fdedbdd6700 (LWP 13892)):
#0  0x00007fdee45beec1 in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fdee59146eb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:1982
        set = {__val = {18947, 0 <repeats 15 times>}}
        ret = <optimized out>
        sig = 0
#2  0x00007fdee45b7df5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007fdee3efe1ad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7fdedadd4700 (LWP 13894)):
#0  0x00007fded99ed126 in rcu_read_unlock_bp () from /lib64/liburcu-bp.so.1
No symbol table info available.
#1  0x00007fdeda2ba629 in glusterd_ac_send_friend_update (event=<optimized out>, ctx=<optimized out>) at glusterd-sm.c:575
        ret = 0
        cur_peerinfo = 0x7fdee7564120
        peerinfo = <optimized out>
        proc = <optimized out>
        this = 0x7fdee74f9040
        ev_ctx = {uuid = {<optimized out> <repeats 16 times>}, hostname = <optimized out>, op = 1}
        priv = 0x7fded6ad3050
        friends = 0x7fdee570b510
        key = "friend4", '\000' <repeats 92 times>
        count = <optimized out>
        __FUNCTION__ = "glusterd_ac_send_friend_update"
#2  0x00007fdeda2bbec2 in glusterd_friend_sm () at glusterd-sm.c:1228
        event = 0x7fdec00079e0
        tmp = 0x7fdeda5d1160 <gd_friend_sm_queue>
        ret = <optimized out>
        handler = <optimized out>
        state = 0x7fdeda5c7900 <glusterd_state_befriended>
        peerinfo = <optimized out>
        event_type = GD_FRIEND_EVENT_LOCAL_ACC
        is_await_conn = _gf_false
        quorum_action = _gf_false
        old_state = GD_FRIEND_STATE_BEFRIENDED
        this = <optimized out>
        priv = 0x7fded6ad3050
        __FUNCTION__ = "glusterd_friend_sm"
#3  0x00007fdeda35c736 in glusterd_op_state_machine_mgmt_v3_lock (ctx=0x7fdec0001c80, lock_req=0x7fdec84076e0, req=0x7fded7f11b78) at glusterd-mgmt-handler.c:119
        ret = 0
        this = 0x7fdee74f9040
        txn_op_info = {state = {state = GD_OP_STATE_DEFAULT, time = {tv_sec = 0, tv_usec = 0}}, pending_count = 0, brick_pending_count = 0, op_count = 0, op = GD_OP_STATUS_VOLUME, op_peers = {next = 0x0, prev = 0x0}, op_ctx = 0x7fdee570b59c, req = 0x7fded7f11b78, op_ret = 0, op_errno = 0, op_errstr = 0x0, pending_bricks = {next = 0x0, prev = 0x0}, txn_generation = 8}
#4  glusterd_handle_mgmt_v3_lock_fn (req=req@entry=0x7fded7f11b78) at glusterd-mgmt-handler.c:210
        lock_req = {uuid = "\316d\213\066\255\224G\242\261\225r\204\367\230fM", txn_id = "\324\325\342\314\334\356E|\221\375\315\343'\312\247\224", op = 18, dict = {dict_len = 194, dict_val = 0x7fdec0012c80 ""}}
        ret = <optimized out>
        ctx = 0x7fdec0001c80
        this = 0x7fdee74f9040
        is_synctasked = <optimized out>
        free_ctx = _gf_false
        __FUNCTION__ = "glusterd_handle_mgmt_v3_lock_fn"
#5  0x00007fdeda2b1370 in glusterd_big_locked_handler (req=0x7fded7f11b78, actor_fn=0x7fdeda35c260 <glusterd_handle_mgmt_v3_lock_fn>) at glusterd-handler.c:83
        priv = 0x7fded6ad3050
        ret = -1
#6  0x00007fdee5486482 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7fdec800eb00
#7  0x00007fdee3e4f0f0 in ?? () from /lib64/libc.so.6
No symbol table info available.
#8  0x0000000000000000 in ?? ()
No symbol table info available.

Version-Release number of selected component (if applicable):
============================================================
:: [ BEGIN ] :: Running 'rpm -qa | grep gluster'
glusterfs-fuse-3.7.1-1.el7rhgs.x86_64
glusterfs-server-3.7.1-1.el7rhgs.x86_64

How reproducible:
================
1/1

Steps to Reproduce:
=====================
1. Running the AFR "heal_full" automated tests.

Actual results:
================
https://beaker.engineering.redhat.com/jobs/980505
Shwetha spoke to me about this before raising this bug. The sos-reports for this Beaker run are missing for some unknown reason, so we do not have enough information to confirm the cause of this bug. Shwetha volunteered to run the test again and check whether it fails again; hopefully we can capture the sos-reports if it happens again. Until then we cannot proceed further with this bug.
Shwetha, could you provide the steps you performed to hit this issue?
RCA: the crash is caused by rcu_read_unlock() being called from a different thread than the one that took the lock, i.e. the RCU read lock is acquired in one thread and rcu_read_unlock() is called in another thread. I sent patch http://review.gluster.org/#/c/10285/14 to avoid this, but there are still some places we missed.
Upstream mainline patch http://review.gluster.org/#/c/11276/ has been posted for review.
I have observed this crash in my setup too. I cannot say exactly which steps led to the crash, but I was doing the following operations:

1. On a volume that was mounted on a FUSE client, I tried to disable quota and unmount the volume. The `umount' command was stuck and did not complete for a long time.
2. I tried to stop the volume, but the stop request timed out and failed.
3. After the volume stop, I was unable to see volume status as it failed with "Another transaction in progress".
4. The volume was now in an inconsistent state: it was shown as started in volume info output, but attempts to stop it failed with a "volume not in started state" message.
5. I removed the volume directory from /var/lib/glusterd/vols and tried to restart glusterd. The restart failed and I checked the logs: glusterd had crashed.

From glusterd logs:
-------------------------------------------------------------------------------
[2015-07-03 07:43:07.474874] E [MSGID: 106069] [glusterd-volgen.c:1021:volgen_write_volfile] 0-management: failed to create volfile /var/lib/glusterd/vols/rep/rep-snapd.vol
[2015-07-03 07:43:07.474925] E [MSGID: 106069] [glusterd-snapd-svc.c:254:glusterd_snapdsvc_start] 0-management: Couldn't create snapd volfile for volume: rep
[2015-07-03 07:43:07.475000] E [MSGID: 106113] [glusterd-snapd-svc.c:341:glusterd_snapdsvc_restart] 0-management: Couldn't start snapd for vol: rep
[2015-07-03 07:43:07.562023] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a1b83030-3890-45fc-9489-0815341722a3
[2015-07-03 07:43:08.153177] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-03 07:43:08.304427] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-07-03 07:43:08
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fe6234c8826]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fe6234e83ef]
/lib64/libc.so.6(+0x38072326a0)[0x7fe621e676a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7fe617925de6]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_friend_update+0x701)[0x7fe617ed0bd1]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fe617eb5a0f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x295)[0x7fe623291ee5]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7fe623292123]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fe623293ad8]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xa255)[0x7fe616557255]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xbe4d)[0x7fe616558e4d]
/usr/lib64/libglusterfs.so.0(+0x89970)[0x7fe62352c970]
/lib64/libpthread.so.0(+0x3807607a51)[0x7fe6225b3a51]
/lib64/libc.so.6(clone+0x6d)[0x7fe621f1d96d]
---------
Patch is available upstream, so moving to POST.
Version: glusterfs-3.7.1-7.el6rhs.x86_64

Faced a similar glusterd crash while probing another node.

[2015-07-06 20:53:06.046749] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 3ed85019-8616-4581-8115-22a2554dea26
[2015-07-06 20:53:06.046775] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-06 20:53:06.048525] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c1cd644-75ea-4d2d-b228-09e25827cd45
[2015-07-06 20:53:06.048562] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-07-06 20:53:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fd560ce7826]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fd560d073ef]
/lib64/libc.so.6(+0x3da84326a0)[0x7fd55f6866a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7fd555144de6]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(+0x58a57)[0x7fd5556f4a57]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x189)[0x7fd5556f4479]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_cli_probe+0x14f)[0x7fd5556f0a6f]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd5556d4a6f]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7fd560d2f1f2]
/lib64/libc.so.6(+0x3da84438f0)[0x7fd55f6978f0]

bt:
(gdb) bt
#0  0x00007fd555144de6 in rcu_read_unlock_bp () from /usr/lib64/liburcu-bp.so.1
#1  0x00007fd5556f4a57 in glusterd_ac_send_friend_update (event=<value optimized out>, ctx=<value optimized out>) at glusterd-sm.c:592
#2  0x00007fd5556f4479 in glusterd_friend_sm () at glusterd-sm.c:1257
#3  0x00007fd5556f0a6f in __glusterd_handle_cli_probe (req=0x7fd561eb45cc) at glusterd-handler.c:1220
#4  0x00007fd5556d4a6f in glusterd_big_locked_handler (req=0x7fd561eb45cc, actor_fn=0x7fd5556f0920 <__glusterd_handle_cli_probe>) at glusterd-handler.c:83
#5  0x00007fd560d2f1f2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#6  0x00007fd55f6978f0 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
Logs for comment 11: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1231635/
Since this issue has been hit at least 3 times so far, and the fix is available upstream, proposing this issue as a blocker for RHGS 3.1.
As suggested by Dev (Anand Nekkunti), this is a rarely hit race. The probability of hitting it is expected to increase when there are more activated snapshots and a peer probe is done. I tried the case where snapshot creation was happening in one script while peer probe/deprobe was happening concurrently. I created 100+ snaps along with peer probing/deprobing, and no glusterd crash was found. I also tried the disk replacement procedure and triggered 'heal full' on the replaced disks; that works well and there are no issues. Marking this bug as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days