Bug 1714460

Summary:	gluster peer status --xml command is timing out post glusterd restart (due to multiplexing)
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Bala Konda Reddy M <bmekala>
Component:	glusterd	Assignee:	Sanju <srakonde>
Status:	CLOSED DEFERRED	QA Contact:	Bala Konda Reddy M <bmekala>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.5	CC:	amukherj, rhs-bugs, rkavunga, srakonde, storage-qa-internal, vbellur, vdas
Target Milestone:	---	Keywords:	Automation
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:	shd-multiplexing
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-21 09:26:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Bala Konda Reddy M 2019-05-28 07:42:53 UTC

Description:

After glusterd restart with approx 30 seconds(+/- 5 seconds) time gap executed mentioned commands "gluster peer status --xml"  and "gluster vol status --xml"  and these are timing out with "Error: Request timed out".  
 
Version:
glusterfs-6.0-3.el7rhgs.x86_64

How Reproducible:
2/3


Steps to Reproduce:
1. Peer probe 5 nodes and form cluster
2. Create a replicate volume 1X3 and start it
3. Restart glusterd on one node N1
4. Wait for peer handshake to complete.
5. Execute gluster peer status --xml

Actual Output:
]# gluster peer status --xml
Error : Request timed out

Expected Output:
gluster commands should not time out after glusterd restart with a time gap of 30 seconds.


Additional Information:

[root@dhcp47-112 ~]# date
Tue May 28 03:29:47 EDT 2019
[root@dhcp47-112 ~]# systemctl restart glusterd
[root@dhcp47-112 ~]# date
Tue May 28 03:30:04 EDT 2019
[root@dhcp47-112 ~]# date; gluster peer status --xml
Tue May 28 03:30:13 EDT 2019
Error : Request timed out

After glusterd restart there is enough time of 26 seconds for peer handshake

[root@dhcp47-112 ~]# date
Tue May 28 03:33:38 EDT 2019
[root@dhcp47-112 ~]# systemctl restart glusterd
[root@dhcp47-112 ~]# date; gluster vol status --xml
Tue May 28 03:34:01 EDT 2019
Error : Request timed out


attaching glusterd log:

[2019-05-28 07:29:52.491938] I [MSGID: 100030] [glusterfsd.c:2821:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.0 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2019-05-28 07:29:52.493093] I [glusterfsd.c:2546:daemonize] 0-glusterfs: Pid of current running process is 21499
[2019-05-28 07:29:52.497786] I [MSGID: 106478] [glusterd.c:1426:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-05-28 07:29:52.497888] I [MSGID: 106479] [glusterd.c:1482:init] 0-management: Using /var/lib/glusterd as working directory
[2019-05-28 07:29:52.497911] I [MSGID: 106479] [glusterd.c:1488:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-05-28 07:29:52.504612] I [socket.c:991:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2019-05-28 07:29:52.506864] W [MSGID: 103071] [rdma.c:4472:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-05-28 07:29:52.506917] W [MSGID: 103055] [rdma.c:4782:init] 0-rdma.management: Failed to initialize IB Device
[2019-05-28 07:29:52.506935] W [rpc-transport.c:368:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-05-28 07:29:52.507098] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-05-28 07:29:52.507125] E [MSGID: 106244] [glusterd.c:1781:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-05-28 07:29:52.509600] I [socket.c:962:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 12
[2019-05-28 07:29:52.510440] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 60999
[2019-05-28 07:29:54.300708] I [MSGID: 106513] [glusterd-store.c:2397:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-05-28 07:29:54.518744] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: 6e7a6b9f-79cd-4673-a234-123ef3c52be6
[2019-05-28 07:29:54.573243] I [MSGID: 106498] [glusterd-handler.c:3752:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
The message "I [MSGID: 106498] [glusterd-handler.c:3752:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 4 times between [2019-05-28 07:29:54.573243] and [2019-05-28 07:29:54.574044]
[2019-05-28 07:29:54.574163] W [MSGID: 106061] [glusterd-handler.c:3548:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-05-28 07:29:54.574242] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-05-28 07:29:54.579167] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-05-28 07:29:54.583370] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-05-28 07:29:54.587709] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-05-28 07:29:54.591973] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option max-port 60999
  9:     option event-threads 1
 10:     option ping-timeout 0
 11:     option transport.rdma.listen-port 24008
 12:     option transport.socket.listen-port 24007
 13:     option transport.socket.read-fail-log off
 14:     option transport.socket.keepalive-interval 2
 15:     option transport.socket.keepalive-time 10
 16:     option transport-type rdma
 17:     option working-directory /var/lib/glusterd
 18: end-volume
 19:  
+------------------------------------------------------------------------------+
The message "W [MSGID: 106061] [glusterd-handler.c:3548:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout" repeated 4 times between [2019-05-28 07:29:54.574163] and [2019-05-28 07:29:54.591959]
[2019-05-28 07:29:54.598103] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2019-05-28 07:29:56.224737] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31, host: 10.70.46.198, port: 0
[2019-05-28 07:29:56.290035] I [glusterd-utils.c:6424:glusterd_brick_start] 0-management: discovered already-running brick /bricks/brick1/dddistdisp
[2019-05-28 07:29:56.290148] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-05-28 07:29:56.290388] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a, host: 10.70.47.162, port: 0
[2019-05-28 07:29:56.674280] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2019-05-28 07:29:56.674607] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping nfs daemon running in pid: 21396
[2019-05-28 07:29:56.674919] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e, host: 10.70.47.147, port: 0
[2019-05-28 07:29:56.757093] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31
[2019-05-28 07:29:56.757174] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:29:56.940748] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a
[2019-05-28 07:29:56.940844] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:29:57.099100] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5, host: 10.70.47.9, port: 0
[2019-05-28 07:29:57.189502] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b090abe4-3993-4421-9ceb-05d83971d93c, host: 10.70.46.248, port: 0
[2019-05-28 07:29:57.249514] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: c621234d-e846-4f82-9f05-062b01b5095e
[2019-05-28 07:29:57.249576] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:29:57.374471] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31
[2019-05-28 07:29:57.374557] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a
[2019-05-28 07:29:57.374654] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b090abe4-3993-4421-9ceb-05d83971d93c
[2019-05-28 07:29:57.374720] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:29:57.592196] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e
[2019-05-28 07:29:57.592406] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5
[2019-05-28 07:29:57.592497] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:29:57.842358] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-05-28 07:29:57.843790] I [MSGID: 106540] [glusterd-utils.c:5318:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully
[2019-05-28 07:29:57.844206] I [MSGID: 106540] [glusterd-utils.c:5325:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully
[2019-05-28 07:29:57.844583] I [MSGID: 106540] [glusterd-utils.c:5332:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully
[2019-05-28 07:29:57.844952] I [MSGID: 106540] [glusterd-utils.c:5339:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully
[2019-05-28 07:29:57.845318] I [MSGID: 106540] [glusterd-utils.c:5346:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully
[2019-05-28 07:29:57.845684] I [MSGID: 106540] [glusterd-utils.c:5353:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully
[2019-05-28 07:29:57.869929] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting nfs service
[2019-05-28 07:29:58.873184] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2019-05-28 07:29:58.873544] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped
[2019-05-28 07:29:58.873589] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: quotad service is stopped
[2019-05-28 07:29:58.873647] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2019-05-28 07:29:58.873879] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-05-28 07:29:58.873918] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-05-28 07:29:58.874046] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2019-05-28 07:29:58.874239] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped
[2019-05-28 07:29:58.874273] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-05-28 07:29:58.874343] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-05-28 07:29:58.874551] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-05-28 07:29:58.924321] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-05-28 07:29:58.924644] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: glustershd already stopped
[2019-05-28 07:29:58.924717] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-05-28 07:29:58.957770] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-05-28 07:29:58.957950] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service
[2019-05-28 07:29:59.963315] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21553
[2019-05-28 07:29:59.964004] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5
[2019-05-28 07:29:59.964460] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b090abe4-3993-4421-9ceb-05d83971d93c
[2019-05-28 07:29:59.964555] I [MSGID: 106163] [glusterd-handshake.c:1410:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
The message "I [MSGID: 106163] [glusterd-handshake.c:1410:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100" repeated 4 times between [2019-05-28 07:29:59.964555] and [2019-05-28 07:30:00.149536]
[2019-05-28 07:30:00.191379] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31
[2019-05-28 07:30:00.224842] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.46.198 (0), ret: 0, op_ret: 0
[2019-05-28 07:30:00.383140] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a
[2019-05-28 07:30:00.449695] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.162 (0), ret: 0, op_ret: 0
[2019-05-28 07:30:00.525416] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c621234d-e846-4f82-9f05-062b01b5095e
[2019-05-28 07:30:00.549857] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.147 (0), ret: 0, op_ret: 0
[2019-05-28 07:30:00.624930] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b090abe4-3993-4421-9ceb-05d83971d93c
[2019-05-28 07:30:00.649959] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.46.248 (0), ret: 0, op_ret: 0
[2019-05-28 07:30:00.741715] I [MSGID: 106490] [glusterd-handler.c:2670:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5
[2019-05-28 07:30:00.775125] I [MSGID: 106493] [glusterd-handler.c:3948:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.9 (0), ret: 0, op_ret: 0
[2019-05-28 07:30:00.875102] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31
[2019-05-28 07:30:00.875170] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:30:01.117623] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b1ba1a47-86ab-42f4-bbdc-24793c6eaf31
[2019-05-28 07:30:01.118081] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a
[2019-05-28 07:30:01.118292] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:30:01.359766] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: c621234d-e846-4f82-9f05-062b01b5095e
[2019-05-28 07:30:01.359877] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:30:01.501224] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b9c26960-cd32-495f-a7a2-f1aba4a4ae3a
[2019-05-28 07:30:01.501344] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b090abe4-3993-4421-9ceb-05d83971d93c
[2019-05-28 07:30:01.501388] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:30:01.601546] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e
[2019-05-28 07:30:01.601651] I [MSGID: 106492] [glusterd-handler.c:2855:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5
[2019-05-28 07:30:01.601690] I [MSGID: 106502] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-05-28 07:30:01.685244] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b090abe4-3993-4421-9ceb-05d83971d93c
[2019-05-28 07:30:01.685334] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: bf50c5f1-dd75-4d8a-b42b-a1014f8873a5

Comment 4 Sanju 2019-05-30 12:22:33 UTC

When I tried to reproduce this in Bala's setup, there were no volumes present in the cluster. After creating one volume I'm able to hit this issue.

In glusterd.log I have observed the following. If we execute the peer status command once glustershd is started, it works as expected. But If we execute the command, just after friend handshake complete it hungs. Also, one thing I didn't understand is why glustershd is started and stopped several times, once the glusterd is restarted? I see different pid's in the following log.

[2019-05-30 05:39:51.667174] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e
[2019-05-30 05:39:51.872878] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-05-30 05:39:51.900464] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-05-30 05:39:51.900700] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service
[2019-05-30 05:39:52.907268] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21169
[2019-05-30 05:39:53.907629] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-05-30 05:39:54.274356] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-05-30 05:39:54.274626] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service
[2019-05-30 05:39:55.281439] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21181
[2019-05-30 05:39:56.281810] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-05-30 05:39:56.307311] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-05-30 05:39:56.307605] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service
[2019-05-30 05:39:57.313569] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21193
[2019-05-30 05:39:58.313830] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-05-30 05:39:58.356303] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-05-30 05:39:58.356562] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service

Rafi, can you please take a look at this? If you are loaded with work, please provide me with some pointers to debug this issue.

Thanks,
Sanju

Comment 5 Mohammed Rafi KC 2019-05-31 07:35:45 UTC

(In reply to Sanju from comment #4)
> When I tried to reproduce this in Bala's setup, there were no volumes
> present in the cluster. After creating one volume I'm able to hit this issue.
> 
> In glusterd.log I have observed the following. If we execute the peer status
> command once glustershd is started, it works as expected. But If we execute
> the command, just after friend handshake complete it hungs. Also, one thing
> I didn't understand is why glustershd is started and stopped several times,
> once the glusterd is restarted? I see different pid's in the following log.
> 
> [2019-05-30 05:39:51.667174] I [MSGID: 106493]
> [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received
> ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e
> [2019-05-30 05:39:51.872878] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service
> is stopped
> [2019-05-30 05:39:51.900464] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
> 0-glustershd: setting frame-timeout to 600
> [2019-05-30 05:39:51.900700] I [MSGID: 106567]
> [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting
> glustershd service
> [2019-05-30 05:39:52.907268] I [MSGID: 106568]
> [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping
> glustershd daemon running in pid: 21169
> [2019-05-30 05:39:53.907629] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service
> is stopped
> [2019-05-30 05:39:54.274356] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
> 0-glustershd: setting frame-timeout to 600
> [2019-05-30 05:39:54.274626] I [MSGID: 106567]
> [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting
> glustershd service
> [2019-05-30 05:39:55.281439] I [MSGID: 106568]
> [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping
> glustershd daemon running in pid: 21181
> [2019-05-30 05:39:56.281810] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service
> is stopped
> [2019-05-30 05:39:56.307311] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
> 0-glustershd: setting frame-timeout to 600
> [2019-05-30 05:39:56.307605] I [MSGID: 106567]
> [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting
> glustershd service
> [2019-05-30 05:39:57.313569] I [MSGID: 106568]
> [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping
> glustershd daemon running in pid: 21193
> [2019-05-30 05:39:58.313830] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service
> is stopped
> [2019-05-30 05:39:58.356303] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
> 0-glustershd: setting frame-timeout to 600
> [2019-05-30 05:39:58.356562] I [MSGID: 106567]
> [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting
> glustershd service
> 
> Rafi, can you please take a look at this? If you are loaded with work,
> please provide me with some pointers to debug this issue.
> 
> Thanks,
> Sanju

This is an expected behavior where we call svc manager multiple times during glusterd restart, especially when you have more than 1 node in the trusted storage pool. This is not just for shd, but for all daemons. Particular to shd, I have sent a patch to optimise the restart scenario https://review.gluster.org/#/c/glusterfs/+/22729/.

But that won't help with other daemons. Nevertheless, Do you have any reason why you suspect calling multiple managers will lead to a deadlock?