Created attachment 1384641 [details]
glusterd, glusterfsd, glustershd logs

Description of problem:

Sometimes, after rebooting one of the SN nodes, the output of the command "gluster v heal mstate info" shows:

[root@testsn-1:/var/log/glusterfs/bricks]
# gluster v heal mstate info
Brick testsn-0.local:/mnt/bricks/mstate/brick
/testas-0/var/lib/ntp/drift
/testas-2/var/lib/ntp/drift
/.install-done
/testas-0/var/lib/ntp
/testmn-1/var/lib/ntp
/testas-2/var/lib/ntp
/testmn-0/var/lib/ntp
/testmn-1/var/lib/ntp/drift
/testas-1/var/lib/ntp
/testas-1/var/lib/ntp/drift
/testmn-0/var/lib/ntp/drift
Status: Connected
Number of entries: 11

Brick testsn-1.local:/mnt/bricks/mstate/brick
Status: Transport endpoint is not connected
Number of entries: -

glustershd cannot connect to the local brick process! When I check the glustershd log, it always fails while trying to connect to the glusterfsd process on port 49155:

[2018-01-18 10:42:29.891811] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-mstate-client-1: changing port to 49155 (from 0)
[2018-01-18 10:42:29.892120] E [socket.c:2369:socket_connect_finish] 0-mstate-client-1: connection to 192.168.1.3:49155 failed (Connection refused); disconnecting

However, the local mstate glusterfsd process is listening on port 49153!

Version-Release number of selected component (if applicable):
glusterfs 3.12.3

How reproducible:
Reboot an SN node.

Steps to Reproduce:
1. Reboot an SN node.

Actual results:
glustershd cannot connect to one local glusterfsd brick process. This can be seen from the following netstat output:

[root@testsn-1:/var/log/glusterfs/bricks]
# ps -ef | grep glustershd
root      1295     1  0 Jan18 ?        00:00:18 /usr/sbin/glusterfs -s testsn-1.local --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/178dba826edae38df4ba67f25beeb1e6.socket --xlator-option *replicate*.node-uuid=9ccea6b1-4d81-4020-a4ba-ee6821268ba8
root     19900 27911  0 04:10 pts/1    00:00:00 grep glustershd
[root@testsn-1:/var/log/glusterfs/bricks]
# netstat -p | grep 1295
tcp        0      0 testsn-1.local:49098   testsn-0.local:49154   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49099   testsn-0.local:49152   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49140   testsn-1.local:24007   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49097   testsn-0.local:49153   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49096   testsn-0.local:49155   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49120   testsn-1.local:49156   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49121   testsn-1.local:49152   ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49126   testsn-1.local:49154   ESTABLISHED 1295/glusterfs
unix  3      [ ]         STREAM     CONNECTED     36264    1295/glusterfs       /var/run/gluster/178dba826edae38df4ba67f25beeb1e6.socket
unix  2      [ ]         DGRAM                    36258    1295/glusterfs

Expected results:
glustershd should be able to connect to the local brick process.

Additional info:
# gluster v status mstate
Status of volume: mstate
Gluster process                                TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------
Brick testsn-0.local:/mnt/bricks/mstate/brick  49154     0          Y       1113
Brick testsn-1.local:/mnt/bricks/mstate/brick  49155     0          Y       1117
Self-heal Daemon on localhost                  N/A       N/A        Y       1295
Self-heal Daemon on testsn-2.local             N/A       N/A        Y       1813
Self-heal Daemon on testsn-0.local             N/A       N/A        Y       1135

Task Status of Volume mstate
--------------------------------------------------------------------------------
There are no active volume tasks

It is quite strange that the mstate brick process listen port is shown as 49155 in "gluster v status mstate" but as 49153 in the ps output!

[root@testsn-1:/var/log/glusterfs/bricks]
# ps -ef | grep -i glusterfsd | grep mstate
root      1117     1  0 Jan18 ?        00:00:05 /usr/sbin/glusterfsd -s testsn-1.local --volfile-id mstate.testsn-1.local.mnt-bricks-mstate-brick -p /var/run/gluster/vols/mstate/testsn-1.local-mnt-bricks-mstate-brick.pid -S /var/run/gluster/b520b934b415e6a68776cc4852901a77.socket --brick-name /mnt/bricks/mstate/brick -l /var/log/glusterfs/bricks/mnt-bricks-mstate-brick.log --xlator-option *-posix.glusterd-uuid=9ccea6b1-4d81-4020-a4ba-ee6821268ba8 --brick-port 49153 --xlator-option mstate-server.listen-port=49153 --xlator-option transport.socket.bind-address=testsn-1.local
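
The output above points at a stale brick-to-port mapping inside glusterd's portmapper. As a rough illustration only (this is a hypothetical, simplified model; struct entry, registry and lookup_port() are invented names, not glusterd code), once a stale entry for 49155 coexists with the live 49153 entry for the same brick path, a first-match brick-to-port query keeps returning the dead port, which matches the "Connection refused" in the glustershd log:

#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified portmap: each entry maps one port to a brick. */
struct entry {
    int         port;
    const char *brickname;
};

static struct entry registry[] = {
    { 49155, "/mnt/bricks/mstate/brick" }, /* stale: that process is gone   */
    { 49153, "/mnt/bricks/mstate/brick" }, /* live: what glusterfsd binds   */
};

/* Naive first-match brick->port query over the duplicate entries. */
static int lookup_port(const char *brick)
{
    size_t i;
    for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
        if (strcmp(registry[i].brickname, brick) == 0)
            return registry[i].port;
    return -1;
}

int main(void)
{
    /* glustershd asks for the brick's port, gets the stale 49155, and its
     * connect() to that port is refused while glusterfsd sits on 49153. */
    printf("portmap answer: %d\n", lookup_port("/mnt/bricks/mstate/brick"));
    return 0;
}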
REVIEW: https://review.gluster.org/19263 (glusterd: process pmap sign in only when port is marked as free) posted (#3) for review on master by Atin Mukherjee
REVISION POSTED: https://review.gluster.org/19263 (glusterd: process pmap sign in only when port is marked as free) posted (#4) for review on master by Atin Mukherjee
REVIEW: https://review.gluster.org/19323 (glusterd: process pmap sign in only when port is marked as free) posted (#1) for review on release-3.12 by Atin Mukherjee
COMMIT: https://review.gluster.org/19323 committed in release-3.12 by "jiffin tony Thottan" <jthottan> with a commit message-

glusterd: process pmap sign in only when port is marked as free

Because of a race in the volume start code path, caused by friend handshaking on volumes with quorum enabled, glusterd might start a brick, get a disconnect, and then immediately try to start the same brick instance based on another friend update request. Even if the very first brick process never comes up, its sign-in event still gets sent at the end, and we end up with two duplicate portmap entries for the same brick. Since brick start marks the previous port as free, it is better to treat a sign-in request as a no-op if the corresponding port type is marked as free.

>mainline patch : https://review.gluster.org/#/c/19263/

Change-Id: I995c348c7b6988956d24b06bf3f09ab64280fc32
BUG: 1537346
Signed-off-by: Atin Mukherjee <amukherj>
(cherry picked from commit 9d708a3739c8201d23f996c413d6b08f8b13dd90)
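
To make the fix concrete, here is a minimal sketch of the guard the commit message describes, assuming a port-indexed table with free/leased/in-use states roughly like glusterd's pmap registry. The names pmap_signin, port_entry and the enum below are simplified stand-ins for illustration, not the actual code in xlators/mgmt/glusterd/src/glusterd-pmap.c:

#include <stdio.h>

/* Simplified port states; everything here is a hypothetical model. */
enum port_type { PORT_FREE = 0, PORT_LEASED, PORT_BRICKSERVER };

struct port_entry {
    enum port_type  type;
    const char     *brickname;
};

#define PORT_MAX 65535
static struct port_entry ports[PORT_MAX + 1]; /* zero-init: all PORT_FREE */

/* The guard: brick (re)start marks the previous port FREE, so a sign-in
 * arriving for a FREE port can only be a leftover from a dead start
 * attempt. Dropping it avoids a second portmap entry for the same brick. */
static void pmap_signin(int port, const char *brickname)
{
    if (ports[port].type == PORT_FREE) {
        printf("ignoring stale sign-in for port %d\n", port);
        return;
    }
    ports[port].type      = PORT_BRICKSERVER;
    ports[port].brickname = brickname;
    printf("signed in %s on port %d\n", brickname, port);
}

int main(void)
{
    ports[49153].type = PORT_LEASED; /* restart leased 49153 for the brick */
    /* ports[49155] is already PORT_FREE: brick start freed the old port  */

    pmap_signin(49155, "/mnt/bricks/mstate/brick"); /* stale -> dropped */
    pmap_signin(49153, "/mnt/bricks/mstate/brick"); /* live  -> bound   */
    return 0;
}

With this guard, the late sign-in from the dead first brick instance (port 49155, already marked free by the restart) is ignored, so only 49153 stays mapped to the brick and glustershd is handed the right port.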
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.6, please open a new bug report.

glusterfs-3.12.6 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2018-February/033552.html
[2] https://www.gluster.org/pipermail/gluster-users/