Bug 1607888 - backtrace seen in glusterd log when triggering glusterd restart on issuing of index heal (TC#RHG3-13523)
Summary: backtrace seen in glusterd log when triggering glusterd restart on issuing of index heal
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-24 13:41 UTC by Nag Pavan Chilakam
Modified: 2018-11-30 11:51 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-12 05:30:44 UTC
Embargoed:


Attachments:

Description Nag Pavan Chilakam 2018-07-24 13:41:11 UTC
Description of problem:
======================
As part of verifying BZ#1593865 (shd crash on startup), I am executing test case RHG3-13523, linked below:

https://polarion.engineering.redhat.com/polarion/#/project/RHG3/workitem?id=RHG3-13523

Steps:
1. Create a 2x3 volume on a 6-node setup and start it.
2. In one terminal, run `while true; do gluster volume heal <volname>; sleep 0.5; done`.
3. In another terminal, keep restarting glusterd in a loop (see the consolidated sketch below), i.e.:
`while true; do systemctl start glusterd; sleep 3; systemctl stop glusterd; sleep 2; date; done`
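
For convenience, the two loops from steps 2 and 3 can be kept side by side as below. This is only a reproducer sketch; the volume name "distrep3" is an assumption for illustration (taken from the brick path in the logs), substitute the actual volume name.

# Terminal 1: issue index heal repeatedly (volume name is hypothetical)
while true; do gluster volume heal distrep3; sleep 0.5; done

# Terminal 2: cycle glusterd down and up with short sleeps
while true; do systemctl start glusterd; sleep 3; systemctl stop glusterd; sleep 2; date; done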




First backtrace log:
######################
[2018-07-24 13:26:54.638204] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-07-24 13:26:54.638422] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: eeb4fc95-4d67-4237-9738-ffdb8910539c, host: rhs-client32.lab.eng.blr.redhat.com, port: 0
[2018-07-24 13:26:54.688005] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.688832] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:26:54.721627] I [MSGID: 106143] [glusterd-pmap.c:282:pmap_registry_bind] 0-pmap: adding brick /gluster/brick1/distrep3 on port 49152
[2018-07-24 13:26:54.722021] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:26:54.746702] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.763606] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client19.lab.eng.blr.redhat.com (0), ret: 0, op_ret: 0
[2018-07-24 13:26:54.813848] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: eeb4fc95-4d67-4237-9738-ffdb8910539c
[2018-07-24 13:26:54.814770] E [MSGID: 106010] [glusterd-utils.c:3412:glusterd_compare_friend_volume] 0-management: Version of Cksums get differ. local cksum = 1268816566, remote cksum = 2466777340 on peer rhs-client32.lab.eng.blr.redhat.com
[2018-07-24 13:26:54.814936] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client32.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:26:54.880925] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7fe4c9247dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55a8b3503575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55a8b35033eb] ) 0-: received signum (15), shutting down
[2018-07-24 13:26:54.881062] I [MSGID: 106492] [glusterd-handler.c:2805:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:26:54.881116] I [MSGID: 106502] [glusterd-handler.c:2850:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-07-24 13:26:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
st_atim.tv_nsec 1
[2018-07-24 13:26:56.904618] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:26:56.909603] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-07-24 13:26:56.909657] I [MSGID: 106479] [glusterd.c:1509:init] 0-management: Using /var/lib/glusterd as working directory
[2018-07-24 13:26:56.909675] I [MSGID: 106479] [glusterd.c:1514:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-07-24 13:26:56.914451] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-07-24 13:26:56.914476] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-07-24 13:26:56.914497] W [rpcsvc.c:1661:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-07-24 13:26:56.914541] E [MSGID: 106243] [glusterd.c:1792:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-07-24 13:26:56.915921] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-07-24 13:26:56.916332] I [MSGID: 106513] [glusterd-store.c:2246:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31302
[2018-07-24 13:26:56.917027] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: 37c7f3f1-8d16-435f-849c-0bc14e385152
[2018-07-24 13:26:57.205347] I [MSGID: 106498] [glusterd-handler.c:3693:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
The message "I [MSGID: 106498] [glusterd-handler.c:3693:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2018-07-24 13:26:57.205347] and [2018-07-24 13:26:57.205619]
[2018-07-24 13:26:57.205667] W [MSGID: 106062] [glusterd-handler.c:3487:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout





Second backtrace log:
###############
[2018-07-24 13:30:02.617709] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped
[2018-07-24 13:30:02.617733] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-07-24 13:30:02.617783] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2018-07-24 13:30:02.618007] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped
[2018-07-24 13:30:02.618034] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-07-24 13:30:02.921435] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-07-24 13:30:02.921700] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:02.921809] I [MSGID: 106492] [glusterd-handler.c:2805:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:02.921917] I [MSGID: 106502] [glusterd-handler.c:2850:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-07-24 13:30:03.055263] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 372f4f6e-2ff8-46dc-ba0f-8a836793c24f
[2018-07-24 13:30:03.055487] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 7ab0e428-0982-4e26-ad1e-3347f2c838f0, host: rhs-client30.lab.eng.blr.redhat.com, port: 0
[2018-07-24 13:30:03.097658] I [MSGID: 106143] [glusterd-pmap.c:282:pmap_registry_bind] 0-pmap: adding brick /gluster/brick1/distrep3 on port 49152
[2018-07-24 13:30:03.098543] I [MSGID: 106163] [glusterd-handshake.c:1319:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31302
[2018-07-24 13:30:03.140050] I [MSGID: 106490] [glusterd-handler.c:2627:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7ab0e428-0982-4e26-ad1e-3347f2c838f0
[2018-07-24 13:30:03.140829] E [MSGID: 106010] [glusterd-utils.c:3412:glusterd_compare_friend_volume] 0-management: Version of Cksums get differ. local cksum = 1268816566, remote cksum = 2466777340 on peer rhs-client30.lab.eng.blr.redhat.com
[2018-07-24 13:30:03.141052] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client30.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:30:03.163881] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f28b8dd5dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55d19b27b575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55d19b27b3eb] ) 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-07-24 13:30:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f28b9f76cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f28b9f80c04]
/lib64/libc.so.6(+0x36280)[0x7f28b85d6280]
/lib64/liburcu-bp.so.1(rcu_read_lock_bp+0x2d)[0x7f28ae46f0ad]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x30b81)[0x7f28aea0db81]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x29844)[0x7f28aea06844]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x23a5e)[0x7f28aea00a5e]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f28b9d38955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f28b9d38b3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f28b9d3aa73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f28abc4a566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f28abc4cb0c]
[2018-07-24 13:30:05.187577] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:30:05.192508] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-07-24 13:30:05.192560] I [MSGID: 106479] [glusterd.c:1509:init] 0-management: Using /var/lib/glusterd as working directory
[2018-07-24 13:30:05.192577] I [MSGID: 106479] [glusterd.c:1514:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-07-24 13:30:05.197134] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-07-24 13:30:05.197159] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-07-24 13:30:05.197172] W [rpcsvc.c:1661:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-07-24 13:30:05.197185] E [MSGID: 106243] [glusterd.c:1792:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-07-24 13:30:05.198410] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-07-24 13:30:05.198806] I [MSGID: 106513] [glusterd-store.c:2246:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31302



Version-Release number of selected component (if applicable):
---------------
3.12.2-14

How reproducible:
==============
Twice in about 50 attempts.

Comment 4 Nag Pavan Chilakam 2018-07-25 06:27:18 UTC
Logs available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1607888/

The glusterd log containing the backtrace below is at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1607888/rhs-client18.lab.eng.blr.redhat.com_glusterd_backtrace_node/glusterfs/glusterd.log



Below is the complete backtrace dumped in the glusterd log:
[2018-07-24 13:30:03.141052] I [MSGID: 106493] [glusterd-handler.c:3890:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhs-client30.lab.eng.blr.redhat.com (0), ret: 0, op_ret: -1
[2018-07-24 13:30:03.163881] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f28b8dd5dd5] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x55d19b27b575] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55d19b27b3eb] ) 0-: received signum (15), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-07-24 13:30:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f28b9f76cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f28b9f80c04]
/lib64/libc.so.6(+0x36280)[0x7f28b85d6280]
/lib64/liburcu-bp.so.1(rcu_read_lock_bp+0x2d)[0x7f28ae46f0ad]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x30b81)[0x7f28aea0db81]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x29844)[0x7f28aea06844]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x23a5e)[0x7f28aea00a5e]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f28b9d38955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f28b9d38b3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f28b9d3aa73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f28abc4a566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f28abc4cb0c]
[2018-07-24 13:30:05.187577] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.2 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-07-24 13:30:05.192508] I [MSGID: 106478] [glusterd.c:1451:init] 0-management: Maximum allowed open file descriptors set to 65536

Comment 5 Nag Pavan Chilakam 2018-07-25 06:28:08 UTC
Also, I only see the backtrace; no core was found. If I need to change the title (remove the word "crash"), kindly feel free to correct me.

Comment 6 Atin Mukherjee 2018-07-25 06:40:47 UTC
How can we have a crash without having a core file? That is only possible if we have messed up the core pattern, isn't it?
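
For reference, a quick way to check whether anything is suppressing or redirecting cores on the test nodes; this is only a sketch for a standard RHEL 7 node (abrt may or may not be installed), not output collected from these machines:

# Where does the kernel write cores? (abrt, if installed, pipes them to its spool by default)
cat /proc/sys/kernel/core_pattern

# Is the core size limit zero for the running glusterd process?
grep -i core /proc/$(pidof glusterd)/limits

# If abrt is in use, list any captured crashes
abrt-cli list 2>/dev/null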

Comment 7 Atin Mukherjee 2018-07-25 06:48:10 UTC
Also, from the backtrace in the logs it does look like glusterd crashed during cleanup_and_exit(), and I believe this is similar to the urcu cleanup related crash we observed earlier.

OTOH, restarting a glusterd instance every 3 seconds in a loop is also not something that is recommended. We could run into issues such as port 24007 not being freed in such a short interval, and glusterd can also go down before finishing its handshake with the other peers, which can lead to urcu cleanup races.
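
If the restart loop is kept, one way to reduce that risk is to wait until port 24007 is actually released before starting glusterd again. A sketch of such a loop (not part of the original test case, purely an illustration):

# Stop glusterd, then poll until nothing is listening on 24007 before starting it again
systemctl stop glusterd
while ss -tln | grep -q ':24007 '; do
    sleep 1
done
systemctl start glusterd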

Comment 8 Nag Pavan Chilakam 2018-07-25 08:58:23 UTC
(In reply to Atin Mukherjee from comment #6)
> How can we have a crash without having a core file? That is only possible
> if we have messed up the core pattern, isn't it?

As mentioned in comment #5, there was no crash, only a backtrace.
Changing the title too.

Comment 9 Atin Mukherjee 2018-07-25 11:28:02 UTC
"there was no crash, only backtrace." - what does this mean? Can this ever happen? We dump the backtrace logs in the log file when there's a SEGV.

Comment 19 Nag Pavan Chilakam 2018-11-30 11:51:53 UTC
A core was not generated, and I haven't changed any settings with respect to core generation.
These machines have been in use for quite some time, and I have never seen a core fail to generate on them before.
For now, I am OK with the current resolution of closing this as INSUFFICIENT_DATA, as it is not possible for dev to debug in this case.

