Bug 1511318 - output of command "gluster volume status <volume>" shows something wrong with shd status
Summary: output of command "gluster volume status <volume>" shows something wrong with shd status
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-09 07:05 UTC by zhou lin
Modified: 2018-06-20 18:27 UTC (History)
CC List: 2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-20 18:27:16 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description zhou lin 2017-11-09 07:05:29 UTC
Description of problem:
Very often, after restarting an sn node, the output of "gluster v status log" shows something wrong with the Self-heal Daemon on the sn nodes.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Restart sn1 several times.
2. On sn1, execute "gluster v status log".
3. The "Self-heal Daemon on localhost" (or on other sn nodes) line shows "N" in the Online column.

Actual results:
on sn-1 node:
[root@sn-1:/root]
# gluster v status log
Status of volume: log
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sn-0.local:/mnt/bricks/log/brick      49154     0          Y       1259 
Brick sn-1.local:/mnt/bricks/log/brick      49153     0          Y       1521 
Self-heal Daemon on localhost               N/A       N/A        N       N/A  
Self-heal Daemon on sn-0.local              N/A       N/A        Y       28844
Self-heal Daemon on sn-2.local              N/A       N/A        Y       20706
 
Task Status of Volume log
------------------------------------------------------------------------------
There are no active volume tasks
 
However, the shd process is actually running normally on sn-1.

Checking events.log makes it easy to see that the events arrive in the wrong order: the "disconnect" event is received later than the "connect" event.
[root@sn-1:/root]
# cat /var/log/glusterfs/events.log | grep glustershd
[2017-11-01 10:54:29,525] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509504869}
[2017-11-01 10:54:41,574] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509504881}
[2017-11-01 11:00:49,247] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505249}
[2017-11-01 11:06:17,278] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505577}
[2017-11-01 11:07:44,057] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505664}
[2017-11-01 11:08:51,518] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505731}
[2017-11-01 11:08:51,527] INFO [utils - 54:log_event] - {'event': 'SVC_DISCONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505731}
[2017-11-01 11:08:55,476] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505735}
[2017-11-01 11:09:52,482] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505792}
[2017-11-01 11:13:14,514] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505994}
[2017-11-01 13:27:40,517] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509514060}
[2017-11-01 13:27:40,517] INFO [utils - 54:log_event] - {'event': 'SVC_DISCONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509514060}


Expected results:


Additional info:

Comment 1 zhou lin 2017-11-09 07:14:54 UTC
After investigating, I found that when an sn node restarts, glusterd_restart_bricks invokes glusterd_svcs_manager, which first stops shd and then starts it again.
Normally the stop causes an RPC disconnect event and the start causes an RPC connect event. However, the event handler (glusterd_conn_common_notify) needs the glusterd big lock, so the disconnect event stays blocked while the restart path holds that lock; the lock is only released for a short while when shd is being started, and very often that release lets the "connect event" notify function execute before the "disconnect event" notify function.

There is no clean way to handle this issue.
One option is to release the glusterd "big lock" for a short time, after glusterd_shdsvc_manager triggers the shd stop and before it triggers the shd start, so that the "disconnect event" notify function gets a chance to run, as in the sketch below.
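A minimal sketch of that idea (the synclock big lock and the svc->stop/svc->start callbacks exist in glusterd, but the placement and surrounding lines here are illustrative only, not an actual patch):

        /* Inside the shd manager path: stop shd, then yield the big
         * lock once so the pending disconnect notification can be
         * processed before shd is started again. */
        glusterd_conf_t *conf = THIS->private;

        ret = svc->stop (svc, SIGTERM);        /* produces the disconnect event */
        if (ret)
                goto out;

        synclock_unlock (&conf->big_lock);     /* let the disconnect notify run */
        synclock_lock (&conf->big_lock);       /* re-acquire before the restart */

        ret = svc->start (svc, flags);         /* produces the connect event */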

Comment 2 zhou lin 2018-05-25 08:39:47 UTC
I find that when glusterd kills glustershd in glusterd_proc_stop, it sleeps for 1 second while holding the big lock. This makes the contention on the big lock worse, because the disconnect event can lose the race against the connect event. One solution is for glusterd to unlock before the sleep and re-lock after it, so the disconnect event has enough time to be handled before the connect event arrives.
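A rough sketch of that suggestion (the 1-second wait and the big lock are taken from the description above; the surrounding lines of glusterd_proc_stop are illustrative, not the actual source):

        /* Inside glusterd_proc_stop: drop the big lock around the
         * existing 1s wait after sending SIGTERM, so the shd disconnect
         * notification can be handled before the connect event produced
         * by the later restart. */
        glusterd_conf_t *conf = THIS->private;

        ret = kill (pid, SIGTERM);             /* ask glustershd to exit */
        if (ret)
                goto out;

        synclock_unlock (&conf->big_lock);     /* give the disconnect notify time to run */
        sleep (1);                             /* the existing wait, now without the lock */
        synclock_lock (&conf->big_lock);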

Comment 4 Shyamsundar 2018-06-20 18:27:16 UTC
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline gluster repository, request that it be reopened and the Version field be marked appropriately.


