Description of problem: very often , after restart sn node, the command output of "gluster v status log" will show sth wrong with self-heal Daemon on sn nodes. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1.restart sn1 for some times 2.on sn1 execute "gluster v status log" 3.the "self-heal Daemon on localhost/other sn nodes" will show "online" <N> Actual results: on sn-1 node: [root@sn-1:/root] # gluster v status log Status of volume: log Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick sn-0.local:/mnt/bricks/log/brick 49154 0 Y 1259 Brick sn-1.local:/mnt/bricks/log/brick 49153 0 Y 1521 Self-heal Daemon on localhost N/A N/A N N/A Self-heal Daemon on sn-0.local N/A N/A Y 28844 Self-heal Daemon on sn-2.local N/A N/A Y 20706 Task Status of Volume log ------------------------------------------------------------------------------ There are no active volume tasks but actually process shd is working normally on sn-1, If check events.log, it is easy to find that the event has some wrong order, the “disconnect event” is received latter than the “connect event”. [root@sn-1:/root] # cat /var/log/glusterfs/events.log | grep glustershd [2017-11-01 10:54:29,525] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509504869} [2017-11-01 10:54:41,574] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509504881} [2017-11-01 11:00:49,247] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505249} [2017-11-01 11:06:17,278] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505577} [2017-11-01 11:07:44,057] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505664} [2017-11-01 11:08:51,518] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505731} [2017-11-01 11:08:51,527] INFO [utils - 54:log_event] - {'event': 'SVC_DISCONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505731} [2017-11-01 11:08:55,476] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505735} [2017-11-01 11:09:52,482] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505792} [2017-11-01 11:13:14,514] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509505994} [2017-11-01 13:27:40,517] INFO [utils - 54:log_event] - {'event': 'SVC_CONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509514060} [2017-11-01 13:27:40,517] INFO [utils - 54:log_event] - {'event': 'SVC_DISCONNECTED', 'message': {'svc_name': 'glustershd'}, 'nodeid': '17c2b6f5-55c5-4952-9cbb-561d87a35c4f', 'ts': 1509514060} Expected results: Additional info:
after my investigation, i found that the when sn node execute restart, the glusterd_restart_bricks will invode glusterd_svcs_manager, the shd will be firstly stopped then started, normally the stop will cause rpc disconnect event and start will cause rpc connect event, however, the hanlder for events (glusterd_conn_common_notify) involved the glusterd big lock, it will block the disconnect event, and only when shd get started, it will release the big lock for a while and very often, that release cause the "connect event" notify function get a change to execute before the "disconect event" notify function. there is no good way to handle this issue, one way is to release the glusterd "big lock" for a short time, before glusterd_shdsvc_manager trigger shd start and after glusterd_shdsvc_manager trigger shd stop. so the "disconnect event" notify function got a chance to run.
i find that while glusterd kill glustershd in glusterd_proc_stop, it sleep 1s with the hold of big lock, this make the competition of big-lock worse, because the disconnect event may lose in competition with the connect event.one solution to this is that the glusterd could unlock before sleep and lock after sleep so the disconnect event have enough time to be handled. before the arrival of the connect event.
This bug reported is against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result this bug is being closed. If the bug persists on a maintained version of gluster or against the mainline gluster repository, request that it be reopened and the Version field be marked appropriately.