Bug 1607888

Summary: backtrace seen in glusterd log when glusterd is restarted while index heal is being issued (TC#RHG3-13523)
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Nag Pavan Chilakam <nchilaka>
Component: glusterd    Assignee: Atin Mukherjee <amukherj>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Bala Konda Reddy M <bmekala>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.4    CC: nchilaka, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-12 05:30:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2018-07-24 13:41:11 UTC
Description of problem:
======================
As part of verifying BZ#1593865 (shd crash on startup), I am executing testcase RHG3-13523, described below:

https://polarion.engineering.redhat.com/polarion/#/project/RHG3/workitem?id=RHG3-13523

Steps:
1. Create a 2x3 volume on a 6-node setup and start it.
2. Run `while true; do gluster volume heal <volname>; sleep 0.5; done` in one terminal.
3. In another terminal, keep restarting glusterd via systemd, i.e. as below:
while true; do systemctl start glusterd; sleep 3; systemctl stop glusterd; sleep 2; date; done
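A way to catch the event while the loops run (standard glusterd log path; a monitoring sketch, not part of the original test case):

# Terminal 3 (optional): watch the glusterd log for a dumped backtrace while the loops above run
tail -F /var/log/glusterfs/glusterd.log | grep -A 5 "signal received"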

First backtrace in the log:
######################
[2018-07-24 13:26:54.638204] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-07-24 13:26:54.638422] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: eeb4fc95-4d67-4237-9738-ffdb8910539c, host: rhs-client32.lab.eng.blr.redhat.com, port: 0
[2018-07-24 13:26:54.688005] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.688832] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:26:54.721627] I [MSGID: 106143] [glusterd-pmap.c:282:pmap_registry_bind] 0-pmap: adding brick /gluster/brick1/distrep3 on port 49152
[2018-07-24 13:26:54.722021] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:26:54.746702] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.763606] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client19.lab.eng.blr.redhat.com (0), ret: 0, op_ret: 0
[2018-07-24 13:26:54.813848] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: eeb4fc95-4d67-4237-9738-ffdb8910539c
[2018-07-24 13:26:54.814770] E [MSGID: 106010] [glusterd-utils.c:3412:glusterd_compare_friend_volume] 0-management: Version of Cksums get differ. local cksum = 1268816566, remote cksum = 2466777340 on peer rhs-client32.lab.eng.blr.redhat.com
[2018-07-24 13:26:54.814936] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client32.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:26:54.880925] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fe4c9247dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55a8b3503575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55a8b35033eb] ) 0-: received signum (15), shutting down
[2018-07-24 13:26:54.881062] I [MSGID: 106492] [glusterd-handler.c:2805:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.881116] I [MSGID: 106502] [glusterd-handler.c:2850:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-07-24 13:26:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
[2018-07-24 13:26:56.904618] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:26:56.909603] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-07-24 13:26:56.909657] I [MSGID: 106479] [glusterd.c:1509:init] 0-management: Using /var/lib/glusterd as working directory
[2018-07-24 13:26:56.909675] I [MSGID: 106479] [glusterd.c:1514:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-07-24 13:26:56.914451] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-07-24 13:26:56.914476] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-07-24 13:26:56.914497] W [rpcsvc.c:1661:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-07-24 13:26:56.914541] E [MSGID: 106243] [glusterd.c:1792:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-07-24 13:26:56.915921] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-07-24 13:26:56.916332] I [MSGID: 106513] [glusterd-store.c:2246:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31302
[2018-07-24 13:26:56.917027] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: 37c7f3f1-8d16-435f-849c-0bc14e385152
[2018-07-24 13:26:57.205347] I [MSGID: 106498] [glusterd-handler.c:3693:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
The message "I [MSGID: 106498] [glusterd-handler.c:3693:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2018-07-24 13:26:57.205347] and [2018-07-24 13:26:57.205619]
[2018-07-24 13:26:57.205667] W [MSGID: 106062] [glusterd-handler.c:3487:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout

Second backtrace in the log:
###############
[2018-07-24 13:30:02.617709] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped
[2018-07-24 13:30:02.617733] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-07-24 13:30:02.617783] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2018-07-24 13:30:02.618007] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped
[2018-07-24 13:30:02.618034] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-07-24 13:30:02.921435] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-07-24 13:30:02.921700] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:02.921809] I [MSGID: 106492] [glusterd-handler.c:2805:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:02.921917] I [MSGID: 106502] [glusterd-handler.c:2850:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-07-24 13:30:03.055263] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:03.055487] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 7ab0e428-0982-4e26-ad1e-3347f2c838f0, host: rhs-client30.lab.eng.blr.redhat.com, port: 0
[2018-07-24 13:30:03.097658] I [MSGID: 106143] [glusterd-pmap.c:282:pmap_registry_bind] 0-pmap: adding brick /gluster/brick1/distrep3 on port 49152
[2018-07-24 13:30:03.098543] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:30:03.140050] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7ab0e428-0982-4e26-ad1e-3347f2c838f0
[2018-07-24 13:30:03.140829] E [MSGID: 106010] [glusterd-utils.c:3412:glusterd_compare_friend_volume] 0-management: Version of Cksums get differ. local cksum = 1268816566, remote cksum = 2466777340 on peer rhs-client30.lab.eng.blr.redhat.com
[2018-07-24 13:30:03.141052] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client30.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:30:03.163881] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f28b8dd5dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55d19b27b575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55d19b27b3eb] ) 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-07-24 13:30:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f28b9f76cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f28b9f80c04]
/lib64/libc.so.6(+0x36280)[0x7f28b85d6280]
/lib64/liburcu-bp.so.1(rcu_read_lock_bp+0x2d)[0x7f28ae46f0ad]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x30b81)[0x7f28aea0db81]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x29844)[0x7f28aea06844]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x23a5e)[0x7f28aea00a5e]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f28b9d38955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f28b9d38b3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f28b9d3aa73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f28abc4a566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f28abc4cb0c]
[2018-07-24 13:30:05.187577] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:30:05.192508] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-07-24 13:30:05.192560] I [MSGID: 106479] [glusterd.c:1509:init] 0-management: Using /var/lib/glusterd as working directory
[2018-07-24 13:30:05.192577] I [MSGID: 106479] [glusterd.c:1514:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-07-24 13:30:05.197134] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-07-24 13:30:05.197159] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-07-24 13:30:05.197172] W [rpcsvc.c:1661:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-07-24 13:30:05.197185] E [MSGID: 106243] [glusterd.c:1792:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-07-24 13:30:05.198410] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-07-24 13:30:05.198806] I [MSGID: 106513] [glusterd-store.c:2246:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31302

Version-Release number of selected component (if applicable):
---------------
3.12.2-14

How reproducible:
==============
Twice in about 50 attempts.
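Note: the "Cksums ... differ" rejections in the logs above can be cross-checked against the per-volume checksum file that glusterd keeps in its working directory on each peer (standard /var/lib/glusterd layout; <volname> is a placeholder):

# Run on every peer; values that differ reproduce the friend-add rejection logged above
cat /var/lib/glusterd/vols/<volname>/cksum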

Comment 4 Nag Pavan Chilakam 2018-07-25 06:27:18 UTC
Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1607888/

The glusterd log for the trace below is at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1607888/rhs-client18.lab.eng.blr.redhat.com_glusterd_backtrace_node/glusterfs/glusterd.log

Below is the complete backtrace dumped in the glusterd log:
[2018-07-24 13:30:03.141052] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client30.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:30:03.163881] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f28b8dd5dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55d19b27b575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55d19b27b3eb] ) 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-07-24 13:30:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f28b9f76cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f28b9f80c04]
/lib64/libc.so.6(+0x36280)[0x7f28b85d6280]
/lib64/liburcu-bp.so.1(rcu_read_lock_bp+0x2d)[0x7f28ae46f0ad]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x30b81)[0x7f28aea0db81]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x29844)[0x7f28aea06844]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x23a5e)[0x7f28aea00a5e]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f28b9d38955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f28b9d38b3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f28b9d3aa73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f28abc4a566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f28abc4cb0c]
[2018-07-24 13:30:05.187577] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:30:05.192508] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536

Comment 5 Nag Pavan Chilakam 2018-07-25 06:28:08 UTC
Also, I only see the backtrace; no core was found. If the title needs changing (remove the word crash), kindly feel free to correct me.

Comment 6 Atin Mukherjee 2018-07-25 06:40:47 UTC
How can we have a crash without having a core file? That is only possible if the core pattern has been messed with, isn't it?
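A minimal way to check whether core generation was suppressed (standard kernel and systemd locations; a sketch, not output captured from these machines):

# Where the kernel writes cores; a leading "|" hands them to a collector such as systemd-coredump
cat /proc/sys/kernel/core_pattern

# Core size limit of the running glusterd; a soft limit of 0 means no core is written even on SIGSEGV
grep "Max core file size" /proc/$(pidof glusterd)/limits

# If systemd-coredump is collecting, list anything it captured for glusterd
coredumpctl list glusterd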

Comment 7 Atin Mukherjee 2018-07-25 06:48:10 UTC
Also, from the backtrace in the logs it does look like glusterd crashed during cleanup_and_exit(), and I believe this is similar to the urcu cleanup related crash we observed earlier.

OTOH, restarting the glusterd instance every 3 seconds in a loop is also not something that is recommended. We could run into issues like port 24007 not being free again within such a short interval, and glusterd can also go down before finishing its handshake with other peers, which can lead to urcu cleanup races.
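Both hazards are easy to observe between restart iterations (standard ss and gluster CLI commands; a sketch):

# Is the management port still held right after a stop (e.g. sockets lingering in TIME_WAIT)?
ss -tan | grep 24007

# Has handshaking finished? Peers should show "Peer in Cluster (Connected)"
gluster peer status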

Comment 8 Nag Pavan Chilakam 2018-07-25 08:58:23 UTC
(In reply to Atin Mukherjee from comment #6)
> How can we have a crash with out not having core file? That is only possible
> if we have messed up with core pattern, isn't it?

As mentioned in c#5, there was no crash, only a backtrace.
Changing the title too.

Comment 9 Atin Mukherjee 2018-07-25 11:28:02 UTC
"there was no crash, only backtrace." - what does this mean? Can this ever happen? We dump the backtrace logs in the log file when there's a SEGV.

Comment 19 Nag Pavan Chilakam 2018-11-30 11:51:53 UTC
A core was not generated, and I haven't changed any settings w.r.t. core generation.
These machines have been in use for quite some time, and I have never seen a core fail to generate on them before.
For now, I am OK with the current resolution of closing this as INSUFFICIENT_DATA, as it is not possible for dev to debug this case.