Description of problem:
gluster v status is showing the wrong status for the glustershd process.

Execute the command:

gluster v status <volname>

The output shows:

Status of volume: ccs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sn-0.local:/mnt/bricks/ccs/brick      49152     0          Y       18501
Brick sn-1.local:/mnt/bricks/ccs/brick     49152     0          Y       1362
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Self-heal Daemon on sn-2.local              N/A       N/A        Y       22531
Self-heal Daemon on sn-1.local              N/A       N/A        Y       27465

Task Status of Volume ccs

However, the glustershd process is alive:

systemctl status glusterd.service --line=0 | cat
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/glusterd.service.d
           └─distro.conf
   Active: active (running) since Sat 2018-05-12 00:08:57 EEST; 24min ago
  Process: 9926 ExecStart=/usr/sbin/glusterd --volfile=/opt/nokia/libexec/StorageUtils/etc/glusterd/glusterd.vol -p /run/glusterd.pid (code=exited, status=0/SUCCESS)
  Process: 9912 ExecStartPre=/opt/nokia/libexec/StorageUtils/script/glusterd_precheck.sh (code=exited, status=0/SUCCESS)
 Main PID: 9927 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─ 1513 /usr/sbin/glusterfsd -s sn-0.local --volfile-id log.sn-0.local.mnt-bricks-log-brick -p /var/run/gluster/vols/log/sn-0.local-mnt-bricks-log-brick.pid -S /var/run/gluster/0af76d9541921cccad34e5fd3041a44d.socket --brick-name /mnt/bricks/log/brick -l /var/log/glusterfs/bricks/mnt-bricks-log-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49154 --xlator-option log-server.listen-port=49154 --xlator-option transport.socket.bind-address=sn-0.local
           ├─ 1522 /usr/sbin/glusterfsd -s sn-0.local --volfile-id mstate.sn-0.local.mnt-bricks-mstate-brick -p /var/run/gluster/vols/mstate/sn-0.local-mnt-bricks-mstate-brick.pid -S /var/run/gluster/79d00394b1085adcf7dc1fd4c57f71bc.socket --brick-name /mnt/bricks/mstate/brick -l /var/log/glusterfs/bricks/mnt-bricks-mstate-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49155 --xlator-option mstate-server.listen-port=49155 --xlator-option transport.socket.bind-address=sn-0.local
           ├─ 9927 /usr/sbin/glusterd --volfile=/opt/nokia/libexec/StorageUtils/etc/glusterd/glusterd.vol -p /run/glusterd.pid
           ├─18501 /usr/sbin/glusterfsd -s sn-0.local --volfile-id ccs.sn-0.local.mnt-bricks-ccs-brick -p /var/run/gluster/vols/ccs/sn-0.local-mnt-bricks-ccs-brick.pid -S /var/run/gluster/67bb7eb0e198bb60470be2d6acfb1806.socket --brick-name /mnt/bricks/ccs/brick -l /var/log/glusterfs/bricks/mnt-bricks-ccs-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49152 --xlator-option ccs-server.listen-port=49152 --xlator-option transport.socket.bind-address=sn-0.local
           ├─18513 /usr/sbin/glusterfsd -s sn-0.local --volfile-id export.sn-0.local.mnt-bricks-export-brick -p /var/run/gluster/vols/export/sn-0.local-mnt-bricks-export-brick.pid -S /var/run/gluster/44f098f14fe45b74b61c5a1d98993d15.socket --brick-name /mnt/bricks/export/brick -l /var/log/glusterfs/bricks/mnt-bricks-export-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49153 --xlator-option export-server.listen-port=49153 --xlator-option transport.socket.bind-address=sn-0.local
           ├─18532 /usr/sbin/glusterfsd -s sn-0.local --volfile-id services.sn-0.local.mnt-bricks-services-brick -p /var/run/gluster/vols/services/sn-0.local-mnt-bricks-services-brick.pid -S /var/run/gluster/fc16d9fe6fd0695555a5592343da7ba3.socket --brick-name /mnt/bricks/services/brick -l /var/log/glusterfs/bricks/mnt-bricks-services-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49156 --xlator-option services-server.listen-port=49156 --xlator-option transport.socket.bind-address=sn-0.local
           └─18542 /usr/sbin/glusterfs -s sn-0.local --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/a1336ae50b7469874a2eec8f3a257d5e.socket --xlator-option *replicate*.node-uuid=715168aa-c232-4517-a7eb-58af28e13d33

/////////////////////////////////////

When I check the glusterd log, I find that the disconnect event is handled after glusterd has already started the glustershd process. The disconnect event is probably handled so late because of the "glusterd big lock" problem.

[2018-05-11 21:15:19.383039] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a5c8dd1d-1988-42f6-bc96-fe2d61bffead
[2018-05-11 21:15:19.383101] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-05-11 21:15:19.392601] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: a5c8dd1d-1988-42f6-bc96-fe2d61bffead, host: sn-2.local, port: 0
[2018-05-11 21:15:19.400627] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 1a08fa77-4b2b-4053-b9b3-32c5898d625e, host: sn-1.local, port: 0
[2018-05-11 21:15:19.409643] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-05-11 21:15:19.409681] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-05-11 21:15:19.409699] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-05-11 21:15:19.514445] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 10032
[2018-05-11 21:15:20.514585] I [MSGID: 106568] [glusterd-proc-mgmt.c:125:glusterd_proc_stop] 0-management: glustershd is stopped!
[2018-05-11 21:15:20.514637] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-05-11 21:15:20.514700] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-05-11 21:15:20.514722] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-05-11 21:15:20.514737] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-05-11 21:15:20.620276] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2018-05-11 21:15:20.620318] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-05-11 21:15:20.620389] C [MSGID: 106003] [glusterd-server-quorum.c:352:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume ccs. Starting local bricks.
[2018-05-11 21:15:20.621012] I [glusterd-utils.c:5933:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/ccs/brick
[2018-05-11 21:15:20.621085] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /mnt/bricks/ccs/brick on port 49152
[2018-05-11 21:15:20.725294] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-05-11 21:15:20.802903] C [MSGID: 106003] [glusterd-server-quorum.c:352:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume export. Starting local bricks.
[2018-05-11 21:15:20.802984] I [glusterd-utils.c:5933:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/export/brick
[2018-05-11 21:15:20.803036] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /mnt/bricks/export/brick on port 49153
[2018-05-11 21:15:20.905595] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-05-11 21:15:20.953613] C [MSGID: 106003] [glusterd-server-quorum.c:352:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume services. Starting local bricks.
[2018-05-11 21:15:20.954150] I [glusterd-utils.c:5933:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/services/brick
[2018-05-11 21:15:21.061193] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-05-11 21:15:21.130280] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 1a08fa77-4b2b-4053-b9b3-32c5898d625e
[2018-05-11 21:15:21.146497] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-05-11 21:15:21.146585] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a5c8dd1d-1988-42f6-bc96-fe2d61bffead
[2018-05-11 21:15:21.146747] I [MSGID: 106567] [glusterd-svc-mgmt.c:198:glusterd_svc_start] 0-management: Starting glustershd service
[2018-05-11 21:15:22.149771] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-05-11 21:15:22.149849] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-05-11 21:15:22.150290] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-05-11 21:15:22.150355] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-05-11 21:15:22.150743] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2018-05-11 21:15:22.150768] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-05-11 21:15:22.341074] I [MSGID: 106567] [glusterd-svc-mgmt.c:162:glusterd_svc_start] 0-management: found svc glustershd is already running
[2018-05-11 21:15:22.342115] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-05-11 21:15:22.342161] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-05-11 21:15:22.342565] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-05-11 21:15:22.342590] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-05-11 21:15:22.342992] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2018-05-11 21:15:22.343015] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-05-11 21:15:22.519608] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 1a08fa77-4b2b-4053-b9b3-32c5898d625e
[2018-05-11 21:15:22.522863] I [MSGID: 106006] [glusterd-svc-mgmt.c:337:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.

// glusterd has already started the glustershd process by the time this disconnect notification is handled, so glusterd ends up with the wrong status for the glustershd process. I suspect the disconnect event is handled so late because of the glusterd big lock: the handler function (glusterd_conn_common_notify) must acquire the big lock before it can run.
Version-Release number of selected component (if applicable): N/A

How reproducible: Always
REVIEW: https://review.gluster.org/20131 (glusterd: gluster v status is showing wrong status for glustershd) posted (#1) for review on master by Sanju Rakonde
COMMIT: https://review.gluster.org/20131 committed in master by "Atin Mukherjee" <amukherj> with a commit message:

glusterd: gluster v status is showing wrong status for glustershd

When we restart the bricks, connect and disconnect events happen for glustershd. glusterd uses two threads to handle disconnect and connect events from glustershd, so on a brick restart both threads compete for the big lock. We want the disconnect event to finish before the connect event, but if the connect thread gets the big lock first it sets svc->online to true, and the disconnect thread then sets svc->online to false. As a result glustershd appears disconnected from glusterd and the wrong status is shown.

After killing shd, glusterd sleeps for 1 second. To avoid the problem, glusterd releases the lock before the sleep and reacquires it after the sleep, so the disconnect thread gets a chance to run glusterd_svc_common_rpc_notify before the other thread completes the connect event.

Change-Id: Ie82e823fdfc936feb7c0ae10599297b050ee9986
fixes: bz#1585391
Signed-off-by: Sanju Rakonde <srakonde>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/