Bug 1585391 - glustershd wrong status caused by glusterd big lock
Summary: glustershd wrong status caused by glusterd big lock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1582443
 
Reported: 2018-06-02 10:35 UTC by Sanju
Modified: 2020-01-09 17:43 UTC (History)
CC List: 1 user

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-23 15:10:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Sanju 2018-06-02 10:35:48 UTC
Description of problem:
gluster v status shows the wrong status for the glustershd process.

Execute command:

gluster v status <volname>

output shows:

Status of volume: ccs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sn-0.local:/mnt/bricks/ccs/brick      49152     0          Y       18501
Brick sn-1.local:/mnt/bricks/ccs/brick      49152     0          Y       1362
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Self-heal Daemon on sn-2.local              N/A       N/A        Y       22531
Self-heal Daemon on sn-1.local              N/A       N/A        Y       27465

Task Status of Volume ccs

However, the glustershd process is alive:

systemctl status glusterd.service --line=0 | cat

● glusterd.service - GlusterFS, a clustered file-system server

   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: disabled)

  Drop-In: /etc/systemd/system/glusterd.service.d

           └─distro.conf

   Active: active (running) since Sat 2018-05-12 00:08:57 EEST; 24min ago

  Process: 9926 ExecStart=/usr/sbin/glusterd --volfile=/opt/nokia/libexec/StorageUtils/etc/glusterd/glusterd.vol -p /run/glusterd.pid (code=exited, status=0/SUCCESS)

  Process: 9912 ExecStartPre=/opt/nokia/libexec/StorageUtils/script/glusterd_precheck.sh (code=exited, status=0/SUCCESS)

Main PID: 9927 (glusterd)

   CGroup: /system.slice/glusterd.service

           ├─ 1513 /usr/sbin/glusterfsd -s sn-0.local --volfile-id log.sn-0.local.mnt-bricks-log-brick -p /var/run/gluster/vols/log/sn-0.local-mnt-bricks-log-brick.pid -S /var/run/gluster/0af76d9541921cccad34e5fd3041a44d.socket --brick-name /mnt/bricks/log/brick -l /var/log/glusterfs/bricks/mnt-bricks-log-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49154 --xlator-option log-server.listen-port=49154 --xlator-option transport.socket.bind-address=sn-0.local

           ├─ 1522 /usr/sbin/glusterfsd -s sn-0.local --volfile-id mstate.sn-0.local.mnt-bricks-mstate-brick -p /var/run/gluster/vols/mstate/sn-0.local-mnt-bricks-mstate-brick.pid -S /var/run/gluster/79d00394b1085adcf7dc1fd4c57f71bc.socket --brick-name /mnt/bricks/mstate/brick -l /var/log/glusterfs/bricks/mnt-bricks-mstate-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49155 --xlator-option mstate-server.listen-port=49155 --xlator-option transport.socket.bind-address=sn-0.local

           ├─ 9927 /usr/sbin/glusterd --volfile=/opt/nokia/libexec/StorageUtils/etc/glusterd/glusterd.vol -p /run/glusterd.pid

           ├─18501 /usr/sbin/glusterfsd -s sn-0.local --volfile-id ccs.sn-0.local.mnt-bricks-ccs-brick -p /var/run/gluster/vols/ccs/sn-0.local-mnt-bricks-ccs-brick.pid -S /var/run/gluster/67bb7eb0e198bb60470be2d6acfb1806.socket --brick-name /mnt/bricks/ccs/brick -l /var/log/glusterfs/bricks/mnt-bricks-ccs-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49152 --xlator-option ccs-server.listen-port=49152 --xlator-option transport.socket.bind-address=sn-0.local

           ├─18513 /usr/sbin/glusterfsd -s sn-0.local --volfile-id export.sn-0.local.mnt-bricks-export-brick -p /var/run/gluster/vols/export/sn-0.local-mnt-bricks-export-brick.pid -S /var/run/gluster/44f098f14fe45b74b61c5a1d98993d15.socket --brick-name /mnt/bricks/export/brick -l /var/log/glusterfs/bricks/mnt-bricks-export-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49153 --xlator-option export-server.listen-port=49153 --xlator-option transport.socket.bind-address=sn-0.local

           ├─18532 /usr/sbin/glusterfsd -s sn-0.local --volfile-id services.sn-0.local.mnt-bricks-services-brick -p /var/run/gluster/vols/services/sn-0.local-mnt-bricks-services-brick.pid -S /var/run/gluster/fc16d9fe6fd0695555a5592343da7ba3.socket --brick-name /mnt/bricks/services/brick -l /var/log/glusterfs/bricks/mnt-bricks-services-brick.log --xlator-option *-posix.glusterd-uuid=715168aa-c232-4517-a7eb-58af28e13d33 --brick-port 49156 --xlator-option services-server.listen-port=49156 --xlator-option transport.socket.bind-address=sn-0.local

           └─18542 /usr/sbin/glusterfs -s sn-0.local --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/a1336ae50b7469874a2eec8f3a257d5e.socket --xlator-option *replicate*.node-uuid=715168aa-c232-4517-a7eb-58af28e13d33

 

/////////////////////////////////////

When I check the glusterd log, I find that the disconnect event is handled only after glusterd has already started the new glustershd process. The disconnect event is probably handled so late because of the "glusterd big lock" problem.

 

[2018-05-11 21:15:19.383039] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a5c8dd1d-1988-42f6-bc96-fe2d61bffead

[2018-05-11 21:15:19.383101] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend

[2018-05-11 21:15:19.392601] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: a5c8dd1d-1988-42f6-bc96-fe2d61bffead, host: sn-2.local, port: 0

[2018-05-11 21:15:19.400627] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 1a08fa77-4b2b-4053-b9b3-32c5898d625e, host: sn-1.local, port: 0

[2018-05-11 21:15:19.409643] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped

[2018-05-11 21:15:19.409681] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped

[2018-05-11 21:15:19.409699] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed

[2018-05-11 21:15:19.514445] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 10032

[2018-05-11 21:15:20.514585] I [MSGID: 106568] [glusterd-proc-mgmt.c:125:glusterd_proc_stop] 0-management: glustershd is stopped!

[2018-05-11 21:15:20.514637] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped

[2018-05-11 21:15:20.514700] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped

[2018-05-11 21:15:20.514722] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped

[2018-05-11 21:15:20.514737] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed

[2018-05-11 21:15:20.620276] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped

[2018-05-11 21:15:20.620318] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped

[2018-05-11 21:15:20.620389] C [MSGID: 106003] [glusterd-server-quorum.c:352:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume ccs. Starting local bricks.

[2018-05-11 21:15:20.621012] I [glusterd-utils.c:5933:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/ccs/brick

[2018-05-11 21:15:20.621085] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /mnt/bricks/ccs/brick on port 49152

[2018-05-11 21:15:20.725294] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600

[2018-05-11 21:15:20.802903] C [MSGID: 106003] [glusterd-server-quorum.c:352:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume export. Starting local bricks.

[2018-05-11 21:15:20.802984] I [glusterd-utils.c:5933:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/export/brick

[2018-05-11 21:15:20.803036] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /mnt/bricks/export/brick on port 49153

[2018-05-11 21:15:20.905595] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600

[2018-05-11 21:15:20.953613] C [MSGID: 106003] [glusterd-server-quorum.c:352:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume services. Starting local bricks.

[2018-05-11 21:15:20.954150] I [glusterd-utils.c:5933:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/services/brick

[2018-05-11 21:15:21.061193] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600

[2018-05-11 21:15:21.130280] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 1a08fa77-4b2b-4053-b9b3-32c5898d625e

[2018-05-11 21:15:21.146497] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend

[2018-05-11 21:15:21.146585] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a5c8dd1d-1988-42f6-bc96-fe2d61bffead

[2018-05-11 21:15:21.146747] I [MSGID: 106567] [glusterd-svc-mgmt.c:198:glusterd_svc_start] 0-management: Starting glustershd service

[2018-05-11 21:15:22.149771] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped

[2018-05-11 21:15:22.149849] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped

[2018-05-11 21:15:22.150290] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped

[2018-05-11 21:15:22.150355] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped

[2018-05-11 21:15:22.150743] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped

[2018-05-11 21:15:22.150768] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: scrub service is stopped

[2018-05-11 21:15:22.341074] I [MSGID: 106567] [glusterd-svc-mgmt.c:162:glusterd_svc_start] 0-management: found svc glustershd is already running

[2018-05-11 21:15:22.342115] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped

[2018-05-11 21:15:22.342161] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped

[2018-05-11 21:15:22.342565] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped

[2018-05-11 21:15:22.342590] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped

[2018-05-11 21:15:22.342992] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped

[2018-05-11 21:15:22.343015] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: scrub service is stopped

[2018-05-11 21:15:22.519608] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 1a08fa77-4b2b-4053-b9b3-32c5898d625e

[2018-05-11 21:15:22.522863] I [MSGID: 106006] [glusterd-svc-mgmt.c:337:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
 

// glusterd has already started the new glustershd process, and only then is the disconnect notify handled. Because of this, glusterd does not record the correct status of the glustershd process. I suspect the disconnect event is handled so late because of the glusterd big lock (glusterd_conn_common_notify), since the handler function requires the big lock.
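
For illustration only, here is a minimal, self-contained sketch of the race described above, written with plain pthreads (the names big_lock, shd_online, connect_handler and disconnect_handler are illustrative, not taken from the glusterd source). Both notify handlers serialize on a single big lock, so whichever handler is applied last decides the final state; when the stale DISCONNECT from the killed daemon lands after the CONNECT from the freshly started one, the daemon ends up marked offline even though it is running.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static bool shd_online = false;              /* stands in for svc->online */

/* CONNECT notification from the freshly started glustershd. */
static void *connect_handler(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    shd_online = true;
    printf("CONNECT handled: online=Y\n");
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Stale DISCONNECT notification from the glustershd killed earlier. */
static void *disconnect_handler(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    shd_online = false;
    printf("DISCONNECT handled: online=N\n");
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

int main(void)
{
    pthread_t c, d;

    /* Both notifications are pending at the same time; their handlers
     * race for the big lock, so either one may be applied last. */
    pthread_create(&d, NULL, disconnect_handler, NULL);
    pthread_create(&c, NULL, connect_handler, NULL);
    pthread_join(c, NULL);
    pthread_join(d, NULL);

    /* If the stale DISCONNECT is applied last (the interleaving seen in
     * the log above), the daemon is reported offline although it runs. */
    printf("gluster v status would show Online: %s\n", shd_online ? "Y" : "N");
    return 0;
}

Built with gcc -pthread and run repeatedly, the sketch prints either ordering; the "Online: N" outcome is the one reported in this bug.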

Version-Release number of selected component (if applicable):
N/A

How reproducible:
Always

Comment 1 Worker Ant 2018-06-02 13:38:25 UTC
REVIEW: https://review.gluster.org/20131 (glusterd: gluster v status is showing wrong status for glustershd) posted (#1) for review on master by Sanju Rakonde

Comment 2 Worker Ant 2018-06-06 12:20:01 UTC
COMMIT: https://review.gluster.org/20131 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: gluster v status is showing wrong status for glustershd

When we restart the bricks, connect and disconnect events happen for
glustershd. glusterd uses two threads to handle the disconnect and
connect events from glustershd, so both threads compete for the big
lock.

We want the disconnect event to finish before the connect event. But
if the connect thread gets the big lock first, it sets svc->online to
true, and then the disconnect thread sets svc->online to false. So
glustershd is considered disconnected from glusterd and the wrong
status is shown.

After killing shd, glusterd sleeps for 1 second. To avoid the problem,
glusterd releases the lock before the sleep and reacquires it after
the sleep, so the disconnect thread gets a chance to handle
glusterd_svc_common_rpc_notify before the other thread completes the
connect event.

Change-Id: Ie82e823fdfc936feb7c0ae10599297b050ee9986
fixes: bz#1585391
Signed-off-by: Sanju Rakonde <srakonde>
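
A minimal sketch of the approach the commit message describes, continuing the pthread model from the earlier comment (stop_shd_and_wait and big_lock are illustrative, hypothetical names, not taken from the patch): the big lock is dropped around the 1-second wait after killing glustershd, so the pending DISCONNECT handler can run before the daemon is started again and a new CONNECT arrives.

#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

extern pthread_mutex_t big_lock;   /* the "big lock" from the earlier sketch */

/* Illustrative stop path for glustershd (hypothetical helper). */
void stop_shd_and_wait(pid_t shd_pid)
{
    pthread_mutex_lock(&big_lock);

    kill(shd_pid, SIGTERM);        /* ask the running glustershd to exit */

    /* Drop the big lock around the 1-second grace period so the
     * DISCONNECT notification for the killed daemon is handled now,
     * before the restart path produces a new CONNECT event. */
    pthread_mutex_unlock(&big_lock);
    sleep(1);
    pthread_mutex_lock(&big_lock);

    /* ... verify the process is gone and continue with the restart ... */

    pthread_mutex_unlock(&big_lock);
}

With the lock released during the sleep, the disconnect thread can mark the daemon offline before the new connect event flips svc->online back to true, so the final state reflects the freshly started glustershd.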

Comment 3 Shyamsundar 2018-10-23 15:10:44 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

