Description: After glusterd restart with approx 30 seconds(+/- 5 seconds) time gap executed mentioned commands "gluster peer status --xml" and "gluster vol status --xml" and these are timing out with "Error: Request timed out". Version: glusterfs-6.0-3.el7rhgs.x86_64 How Reproducible: 2/3 Steps to Reproduce: 1. Peer probe 5 nodes and form cluster 2. Create a replicate volume 1X3 and start it 3. Restart glusterd on one node N1 4. Wait for peer handshake to complete. 5. Execute gluster peer status --xml Actual Output: ]# gluster peer status --xml Error : Request timed out Expected Output: gluster commands should not time out after glusterd restart with a time gap of 30 seconds. Additional Information: [root@dhcp47-112 ~]# date Tue May 28 03:29:47 EDT 2019 [root@dhcp47-112 ~]# systemctl restart glusterd [root@dhcp47-112 ~]# date Tue May 28 03:30:04 EDT 2019 [root@dhcp47-112 ~]# date; gluster peer status --xml Tue May 28 03:30:13 EDT 2019 Error : Request timed out After glusterd restart there is enough time of 26 seconds for peer handshake [root@dhcp47-112 ~]# date Tue May 28 03:33:38 EDT 2019 [root@dhcp47-112 ~]# systemctl restart glusterd [root@dhcp47-112 ~]# date; gluster vol status --xml Tue May 28 03:34:01 EDT 2019 Error : Request timed out attaching glusterd log: [2019-05-28 07:29:52.491938] I [MSGID: 100030] [glusterfsd.c:2821:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.0 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) [2019-05-28 07:29:52.493093] I [glusterfsd.c:2546:daemonize] 0-glusterfs: Pid of current running process is 21499 [2019-05-28 07:29:52.497786] I [MSGID: 106478] [glusterd.c:1426:init] 0-management: Maximum allowed open file descriptors set to 65536 [2019-05-28 07:29:52.497888] I [MSGID: 106479] [glusterd.c:1482:init] 0-management: Using /var/lib/glusterd as working directory [2019-05-28 07:29:52.497911] I [MSGID: 106479] [glusterd.c:1488:init] 0-management: Using /var/run/gluster as pid file working directory [2019-05-28 07:29:52.504612] I [socket.c:991:__socket_server_bind] 0-socket.management: process started listening on port (24007) [2019-05-28 07:29:52.506864] W [MSGID: 103071] [rdma.c:4472:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device] [2019-05-28 07:29:52.506917] W [MSGID: 103055] [rdma.c:4782:init] 0-rdma.management: Failed to initialize IB Device [2019-05-28 07:29:52.506935] W [rpc-transport.c:368:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed [2019-05-28 07:29:52.507098] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed [2019-05-28 07:29:52.507125] E [MSGID: 106244] [glusterd.c:1781:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport [2019-05-28 07:29:52.509600] I [socket.c:962:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 12 [2019-05-28 07:29:52.510440] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 60999 [2019-05-28 07:29:54.300708] I [MSGID: 106513] [glusterd-store.c:2397:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100 [2019-05-28 07:29:54.518744] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: 6e7a6b9f-79cd-4673-a234-123ef3c52be6 [2019-05-28 07:29:54.573243] I [MSGID: 106498] [glusterd-handler.c:3752:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 The message "I [MSGID: 106498] [glusterd-handler.c:3752:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2019-05-28 07:29:54.573243] and [2019-05-28 07:29:54.574044] [2019-05-28 07:29:54.574163] W [MSGID: 106061] [glusterd-handler.c:3548:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout [2019-05-28 07:29:54.574242] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2019-05-28 07:29:54.579167] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2019-05-28 07:29:54.583370] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2019-05-28 07:29:54.587709] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2019-05-28 07:29:54.591973] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 Final graph: +------------------------------------------------------------------------------+ 1: volume management 2: type mgmt/glusterd 3: option rpc-auth.auth-glusterfs on 4: option rpc-auth.auth-unix on 5: option rpc-auth.auth-null on 6: option rpc-auth-allow-insecure on 7: option transport.listen-backlog 1024 8: option max-port 60999 9: option event-threads 1 10: option ping-timeout 0 11: option transport.rdma.listen-port 24008 12: option transport.socket.listen-port 24007 13: option transport.socket.read-fail-log off 14: option transport.socket.keepalive-interval 2 15: option transport.socket.keepalive-time 10 16: option transport-type rdma 17: option working-directory /var/lib/glusterd 18: end-volume 19: +------------------------------------------------------------------------------+ The message "W [MSGID: 106061] [glusterd-handler.c:3548:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout" repeated 4 times between [2019-05-28 07:29:54.574163] and [2019-05-28 07:29:54.591959] [2019-05-28 07:29:54.598103] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2019-05-28 07:29:56.224737] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31, host: 10.70.46.198, port: 0 [2019-05-28 07:29:56.290035] I [glusterd-utils.c:6424:glusterd_brick_start] 0-management: discovered already-running brick /bricks/brick1/dddistdisp [2019-05-28 07:29:56.290148] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2019-05-28 07:29:56.290388] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a, host: 10.70.47.162, port: 0 [2019-05-28 07:29:56.674280] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600 [2019-05-28 07:29:56.674607] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping nfs daemon running in pid: 21396 [2019-05-28 07:29:56.674919] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e, host: 10.70.47.147, port: 0 [2019-05-28 07:29:56.757093] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31 [2019-05-28 07:29:56.757174] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:29:56.940748] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a [2019-05-28 07:29:56.940844] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:29:57.099100] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5, host: 10.70.47.9, port: 0 [2019-05-28 07:29:57.189502] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b090abe4-3993-4421-9ceb-05d83971d93c, host: 10.70.46.248, port: 0 [2019-05-28 07:29:57.249514] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-28 07:29:57.249576] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:29:57.374471] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31 [2019-05-28 07:29:57.374557] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a [2019-05-28 07:29:57.374654] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b090abe4-3993-4421-9ceb-05d83971d93c [2019-05-28 07:29:57.374720] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:29:57.592196] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-28 07:29:57.592406] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5 [2019-05-28 07:29:57.592497] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:29:57.842358] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: nfs service is stopped [2019-05-28 07:29:57.843790] I [MSGID: 106540] [glusterd-utils.c:5318:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully [2019-05-28 07:29:57.844206] I [MSGID: 106540] [glusterd-utils.c:5325:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully [2019-05-28 07:29:57.844583] I [MSGID: 106540] [glusterd-utils.c:5332:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully [2019-05-28 07:29:57.844952] I [MSGID: 106540] [glusterd-utils.c:5339:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully [2019-05-28 07:29:57.845318] I [MSGID: 106540] [glusterd-utils.c:5346:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully [2019-05-28 07:29:57.845684] I [MSGID: 106540] [glusterd-utils.c:5353:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully [2019-05-28 07:29:57.869929] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting nfs service [2019-05-28 07:29:58.873184] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600 [2019-05-28 07:29:58.873544] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped [2019-05-28 07:29:58.873589] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: quotad service is stopped [2019-05-28 07:29:58.873647] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600 [2019-05-28 07:29:58.873879] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped [2019-05-28 07:29:58.873918] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: bitd service is stopped [2019-05-28 07:29:58.874046] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600 [2019-05-28 07:29:58.874239] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped [2019-05-28 07:29:58.874273] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: scrub service is stopped [2019-05-28 07:29:58.874343] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2019-05-28 07:29:58.874551] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600 [2019-05-28 07:29:58.924321] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-28 07:29:58.924644] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: glustershd already stopped [2019-05-28 07:29:58.924717] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-28 07:29:58.957770] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-28 07:29:58.957950] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-28 07:29:59.963315] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21553 [2019-05-28 07:29:59.964004] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5 [2019-05-28 07:29:59.964460] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b090abe4-3993-4421-9ceb-05d83971d93c [2019-05-28 07:29:59.964555] I [MSGID: 106163] [glusterd-handshake.c:1410:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100 The message "I [MSGID: 106163] [glusterd-handshake.c:1410:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100" repeated 4 times between [2019-05-28 07:29:59.964555] and [2019-05-28 07:30:00.149536] [2019-05-28 07:30:00.191379] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31 [2019-05-28 07:30:00.224842] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.46.198 (0), ret: 0, op_ret: 0 [2019-05-28 07:30:00.383140] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a [2019-05-28 07:30:00.449695] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.162 (0), ret: 0, op_ret: 0 [2019-05-28 07:30:00.525416] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-28 07:30:00.549857] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.147 (0), ret: 0, op_ret: 0 [2019-05-28 07:30:00.624930] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b090abe4-3993-4421-9ceb-05d83971d93c [2019-05-28 07:30:00.649959] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.46.248 (0), ret: 0, op_ret: 0 [2019-05-28 07:30:00.741715] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5 [2019-05-28 07:30:00.775125] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.9 (0), ret: 0, op_ret: 0 [2019-05-28 07:30:00.875102] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31 [2019-05-28 07:30:00.875170] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:30:01.117623] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31 [2019-05-28 07:30:01.118081] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a [2019-05-28 07:30:01.118292] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:30:01.359766] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-28 07:30:01.359877] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:30:01.501224] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a [2019-05-28 07:30:01.501344] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b090abe4-3993-4421-9ceb-05d83971d93c [2019-05-28 07:30:01.501388] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:30:01.601546] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-28 07:30:01.601651] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5 [2019-05-28 07:30:01.601690] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-05-28 07:30:01.685244] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b090abe4-3993-4421-9ceb-05d83971d93c [2019-05-28 07:30:01.685334] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5
When I tried to reproduce this in Bala's setup, there were no volumes present in the cluster. After creating one volume I'm able to hit this issue. In glusterd.log I have observed the following. If we execute the peer status command once glustershd is started, it works as expected. But If we execute the command, just after friend handshake complete it hungs. Also, one thing I didn't understand is why glustershd is started and stopped several times, once the glusterd is restarted? I see different pid's in the following log. [2019-05-30 05:39:51.667174] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-30 05:39:51.872878] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:51.900464] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:51.900700] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-30 05:39:52.907268] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21169 [2019-05-30 05:39:53.907629] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:54.274356] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:54.274626] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-30 05:39:55.281439] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21181 [2019-05-30 05:39:56.281810] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:56.307311] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:56.307605] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-30 05:39:57.313569] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21193 [2019-05-30 05:39:58.313830] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:58.356303] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:58.356562] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service Rafi, can you please take a look at this? If you are loaded with work, please provide me with some pointers to debug this issue. Thanks, Sanju
(In reply to Sanju from comment #4) > When I tried to reproduce this in Bala's setup, there were no volumes > present in the cluster. After creating one volume I'm able to hit this issue. > > In glusterd.log I have observed the following. If we execute the peer status > command once glustershd is started, it works as expected. But If we execute > the command, just after friend handshake complete it hungs. Also, one thing > I didn't understand is why glustershd is started and stopped several times, > once the glusterd is restarted? I see different pid's in the following log. > > [2019-05-30 05:39:51.667174] I [MSGID: 106493] > [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received > ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e > [2019-05-30 05:39:51.872878] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:51.900464] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:51.900700] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-05-30 05:39:52.907268] I [MSGID: 106568] > [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 21169 > [2019-05-30 05:39:53.907629] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:54.274356] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:54.274626] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-05-30 05:39:55.281439] I [MSGID: 106568] > [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 21181 > [2019-05-30 05:39:56.281810] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:56.307311] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:56.307605] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-05-30 05:39:57.313569] I [MSGID: 106568] > [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 21193 > [2019-05-30 05:39:58.313830] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:58.356303] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:58.356562] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > > Rafi, can you please take a look at this? If you are loaded with work, > please provide me with some pointers to debug this issue. > > Thanks, > Sanju This is an expected behavior where we call svc manager multiple times during glusterd restart, especially when you have more than 1 node in the trusted storage pool. This is not just for shd, but for all daemons. Particular to shd, I have sent a patch to optimise the restart scenario https://review.gluster.org/#/c/glusterfs/+/22729/. But that won't help with other daemons. Nevertheless, Do you have any reason why you suspect calling multiple managers will lead to a deadlock?