Description of problem:
=======================
As part of verifying BZ#1593865 (shd crash on startup), I am executing testcase RHG3-13523:
https://polarion.engineering.redhat.com/polarion/#/project/RHG3/workitem?id=RHG3-13523

Steps:
1. Create a 2x3 volume on a 6-node setup and start it.
2. In one terminal, run:
   while true; do gluster volume heal <volname>; sleep 0.5; done
3. In another terminal, keep restarting glusterd in a loop, i.e.:
   while true; do systemctl start glusterd; sleep 3; systemctl stop glusterd; sleep 2; date; done

First time backtrace log:
#########################
[2018-07-24 13:26:54.638204] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-07-24 13:26:54.638422] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: eeb4fc95-4d67-4237-9738-ffdb8910539c, host: rhs-client32.lab.eng.blr.redhat.com, port: 0
[2018-07-24 13:26:54.688005] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.688832] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:26:54.721627] I [MSGID: 106143] [glusterd-pmap.c:282:pmap_registry_bind] 0-pmap: adding brick /gluster/brick1/distrep3 on port 49152
[2018-07-24 13:26:54.722021] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:26:54.746702] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.763606] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client19.lab.eng.blr.redhat.com (0), ret: 0, op_ret: 0
[2018-07-24 13:26:54.813848] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: eeb4fc95-4d67-4237-9738-ffdb8910539c
[2018-07-24 13:26:54.814770] E [MSGID: 106010] [glusterd-utils.c:3412:glusterd_compare_friend_volume] 0-management: Version of Cksums get differ. local cksum = 1268816566, remote cksum = 2466777340 on peer rhs-client32.lab.eng.blr.redhat.com
[2018-07-24 13:26:54.814936] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client32.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:26:54.880925] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fe4c9247dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55a8b3503575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55a8b35033eb] ) 0-: received signum (15), shutting down
[2018-07-24 13:26:54.881062] I [MSGID: 106492] [glusterd-handler.c:2805:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.881116] I [MSGID: 106502] [glusterd-handler.c:2850:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2018-07-24 13:26:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
[2018-07-24 13:26:56.904618] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:26:56.909603] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-07-24 13:26:56.909657] I [MSGID: 106479] [glusterd.c:1509:init] 0-management: Using /var/lib/glusterd as working directory
[2018-07-24 13:26:56.909675] I [MSGID: 106479] [glusterd.c:1514:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-07-24 13:26:56.914451] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-07-24 13:26:56.914476] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-07-24 13:26:56.914497] W [rpcsvc.c:1661:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-07-24 13:26:56.914541] E [MSGID: 106243] [glusterd.c:1792:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-07-24 13:26:56.915921] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-07-24 13:26:56.916332] I [MSGID: 106513] [glusterd-store.c:2246:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31302
[2018-07-24 13:26:56.917027] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: 37c7f3f1-8d16-435f-849c-0bc14e385152
[2018-07-24 13:26:57.205347] I [MSGID: 106498] [glusterd-handler.c:3693:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
The message "I [MSGID: 106498] [glusterd-handler.c:3693:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2018-07-24 13:26:57.205347] and [2018-07-24 13:26:57.205619]
[2018-07-24 13:26:57.205667] W [MSGID: 106062] [glusterd-handler.c:3487:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout

Second time:
############
[2018-07-24 13:30:02.617709] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped
[2018-07-24 13:30:02.617733] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-07-24 13:30:02.617783] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2018-07-24 13:30:02.618007] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped
[2018-07-24 13:30:02.618034] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-07-24 13:30:02.921435] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-07-24 13:30:02.921700] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:02.921809] I [MSGID: 106492] [glusterd-handler.c:2805:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:02.921917] I [MSGID: 106502] [glusterd-handler.c:2850:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-07-24 13:30:03.055263] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:03.055487] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 7ab0e428-0982-4e26-ad1e-3347f2c838f0, host: rhs-client30.lab.eng.blr.redhat.com, port: 0
[2018-07-24 13:30:03.097658] I [MSGID: 106143] [glusterd-pmap.c:282:pmap_registry_bind] 0-pmap: adding brick /gluster/brick1/distrep3 on port 49152
[2018-07-24 13:30:03.098543] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:30:03.140050] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7ab0e428-0982-4e26-ad1e-3347f2c838f0
[2018-07-24 13:30:03.140829] E [MSGID: 106010] [glusterd-utils.c:3412:glusterd_compare_friend_volume] 0-management: Version of Cksums get differ. local cksum = 1268816566, remote cksum = 2466777340 on peer rhs-client30.lab.eng.blr.redhat.com
[2018-07-24 13:30:03.141052] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client30.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:30:03.163881] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f28b8dd5dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55d19b27b575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55d19b27b3eb] ) 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2018-07-24 13:30:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f28b9f76cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f28b9f80c04]
/lib64/libc.so.6(+0x36280)[0x7f28b85d6280]
/lib64/liburcu-bp.so.1(rcu_read_lock_bp+0x2d)[0x7f28ae46f0ad]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x30b81)[0x7f28aea0db81]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x29844)[0x7f28aea06844]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x23a5e)[0x7f28aea00a5e]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f28b9d38955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f28b9d38b3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f28b9d3aa73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f28abc4a566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f28abc4cb0c]
[2018-07-24 13:30:05.187577] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:30:05.192508] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-07-24 13:30:05.192560] I [MSGID: 106479] [glusterd.c:1509:init] 0-management: Using /var/lib/glusterd as working directory
[2018-07-24 13:30:05.192577] I [MSGID: 106479] [glusterd.c:1514:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-07-24 13:30:05.197134] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-07-24 13:30:05.197159] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-07-24 13:30:05.197172] W [rpcsvc.c:1661:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-07-24 13:30:05.197185] E [MSGID: 106243] [glusterd.c:1792:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-07-24 13:30:05.198410] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-07-24 13:30:05.198806] I [MSGID: 106513] [glusterd-store.c:2246:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31302

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.12.2-14

How reproducible:
=================
Twice in about 50 attempts.
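The two reproduction loops from the steps above can be sketched as a pair of shell functions. This is a hedged sketch only: the volume name `distrep3` is an assumption taken from the brick path in the logs, and the bounded `ATTEMPTS` count (matching "about 50 attempts") replaces the original unbounded `while true` loops so the script has a natural stopping point.

```shell
#!/usr/bin/env bash
# Sketch of testcase RHG3-13523's stress loops. VOLNAME and ATTEMPTS
# are assumptions; the original test ran both loops unbounded.
VOLNAME="${VOLNAME:-distrep3}"   # hypothetical volume name
ATTEMPTS="${ATTEMPTS:-50}"       # report saw 2 crashes in ~50 cycles

heal_loop() {      # terminal 1: issue heal requests continuously
    while true; do
        gluster volume heal "$VOLNAME"
        sleep 0.5
    done
}

restart_loop() {   # terminal 2: bounce glusterd every ~5 seconds
    local i
    for ((i = 0; i < ATTEMPTS; i++)); do
        systemctl start glusterd
        sleep 3
        systemctl stop glusterd
        sleep 2
        date     # timestamp each cycle for correlation with the logs
    done
}
```

Run `heal_loop` and `restart_loop` in separate terminals; the functions only define the procedure and do nothing until invoked.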
Logs available @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1607888/

glusterd log for the below trace @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1607888/rhs-client18.lab.eng.blr.redhat.com_glusterd_backtrace_node/glusterfs/glusterd.log

Below is the complete backtrace dumped in the glusterd log:

[2018-07-24 13:30:03.141052] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client30.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:30:03.163881] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f28b8dd5dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55d19b27b575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55d19b27b3eb] ) 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2018-07-24 13:30:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f28b9f76cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f28b9f80c04]
/lib64/libc.so.6(+0x36280)[0x7f28b85d6280]
/lib64/liburcu-bp.so.1(rcu_read_lock_bp+0x2d)[0x7f28ae46f0ad]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x30b81)[0x7f28aea0db81]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x29844)[0x7f28aea06844]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x23a5e)[0x7f28aea00a5e]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f28b9d38955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f28b9d38b3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f28b9d3aa73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f28abc4a566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f28abc4cb0c]
[2018-07-24 13:30:05.187577] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:30:05.192508] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
Also, I only see the backtrace; no core file was found. If I need to change the title (remove the word "crash"), kindly feel free to correct me.
How can we have a crash without having a core file? That is only possible if we have messed with the core pattern, isn't it?
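The core-pattern question above can be checked directly on the affected nodes. A minimal sketch using standard Linux interfaces (nothing gluster-specific is assumed):

```shell
# Where the kernel writes core dumps, or which helper it pipes them to.
# A leading '|' means cores go to a collector (e.g. abrt or
# systemd-coredump) instead of a plain file on disk.
core_pattern=$(cat /proc/sys/kernel/core_pattern)
echo "core_pattern: $core_pattern"

# Per-process soft limit on core size; 0 suppresses core files entirely.
echo "core size limit: $(ulimit -c)"

case "$core_pattern" in
    \|*) echo "cores are piped to a collector" ;;
    *)   echo "cores are written as files" ;;
esac
```

If `ulimit -c` reports 0 for the daemon's environment, or the collector named in `core_pattern` is discarding dumps, a SEGV would leave a backtrace in the log but no core on disk.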
Also, from the backtrace in the logs it does look like glusterd crashed during cleanup_and_exit(), and I believe this is similar to the urcu-cleanup-related crash we observed earlier. OTOH, restarting a glusterd instance every 3 seconds in a loop is not something that is recommended. We could run into issues like port 24007 not being free in such a short interval, and glusterd can also go down before finishing its handshake with other peers, which can lead to urcu cleanup races.
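The port-reuse concern above can be made concrete: a stress loop could wait until the management port is actually released before starting the next cycle. A sketch using bash's built-in `/dev/tcp` probe; the port number 24007 comes from the comment above, while `wait_port_free`, its timeout, and the commented restart cycle are illustrative assumptions:

```shell
# Return 0 once nothing is listening on the given local TCP port,
# or 1 after ~max_wait seconds. /dev/tcp is a bash extension.
wait_port_free() {
    local port=$1 max_wait=${2:-30} waited=0
    # The connect attempt runs in a subshell so its fd closes on exit.
    while (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; do
        sleep 1
        waited=$((waited + 1))
        [ "$waited" -ge "$max_wait" ] && return 1
    done
    return 0
}

# Hypothetical safer restart cycle:
#   systemctl stop glusterd
#   wait_port_free 24007 || echo "port 24007 still busy"
#   systemctl start glusterd
```

This does not excuse the crash, but it would separate genuine glusterd bugs from artifacts of restarting faster than the port can be released.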
(In reply to Atin Mukherjee from comment #6)
> How can we have a crash without having a core file? That is only possible
> if we have messed with the core pattern, isn't it?

As mentioned in c#5, there was no crash, only a backtrace. Changing the title too.
"there was no crash, only backtrace." - what does this mean? Can this ever happen? We dump the backtrace logs in the log file when there's a SEGV.
A core was not generated, and I haven't changed any settings w.r.t. core generation. These machines have been in use for quite some time, and I have never seen a core fail to generate before. For now, I am OK with the current resolution of closing this as insufficient data, as it is not possible for dev to debug in this case.