Bug 1537346 - glustershd/glusterd is not using right port when connecting to glusterfsd process
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.12
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
Depends On: 1537362 1543711
Reported: 2018-01-23 01:29 UTC by zhou lin
Modified: 2019-02-28 14:27 UTC

Fixed In Version: glusterfs-3.12.6
Doc Text:
Clone Of:
: 1537362 (view as bug list)
Last Closed: 2018-03-05 07:14:08 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:

Attachments
glusterd, glusterfsd,glustershd log (3.21 MB, application/zip)
2018-01-23 01:29 UTC, zhou lin

Description zhou lin 2018-01-23 01:29:52 UTC
Created attachment 1384641
glusterd, glusterfsd,glustershd log

Description of problem:
Sometimes, after rebooting one of the sn nodes, the output of the command “gluster v heal mstate info” shows:
# gluster v heal mstate info
Brick testsn-0.local:/mnt/bricks/mstate/brick
Status: Connected
Number of entries: 11

Brick testsn-1.local:/mnt/bricks/mstate/brick
Status: Transport endpoint is not connected
Number of entries: -

glustershd cannot connect to the local brick process. When I check the glustershd process, I find that it always fails when trying to connect to the glusterfsd process on port 49155:
[2018-01-18 10:42:29.891811] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-mstate-client-1: changing port to 49155 (from 0)
[2018-01-18 10:42:29.892120] E [socket.c:2369:socket_connect_finish] 0-mstate-client-1: connection to failed (Connection refused); disconnecting 

However, the local mstate glusterfsd process is actually listening on port 49153!
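
One way to confirm which of the two ports actually accepts connections is a plain TCP connect probe. The following is only a minimal sketch and not part of GlusterFS; the helper name port_accepts_connections is hypothetical, and a refused or timed-out connect is simply reported as False.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name resolution failures.
        return False
```

On the affected node, calling port_accepts_connections("testsn-1.local", 49155) would be expected to return False if nothing listens there, matching the "Connection refused" line in the glustershd log above, while the same call with 49153 should succeed if the brick is listening on that port.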

Version-Release number of selected component (if applicable):

How reproducible:

Sometimes, after rebooting an sn node
Steps to Reproduce:
1. Reboot an sn node

Actual results:
glustershd cannot connect to one local glusterfsd brick process.
This can be seen from the following netstat output:
# ps -ef | grep glustershd
root      1295     1  0 Jan18 ?        00:00:18 /usr/sbin/glusterfs -s testsn-1.local --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/178dba826edae38df4ba67f25beeb1e6.socket --xlator-option *replicate*.node-uuid=9ccea6b1-4d81-4020-a4ba-ee6821268ba8
root     19900 27911  0 04:10 pts/1    00:00:00 grep glustershd
# netstat -p | grep 1295
tcp        0      0 testsn-1.local:49098    testsn-0.local:49154    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49099    testsn-0.local:49152    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49140    testsn-1.local:24007    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49097    testsn-0.local:49153    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49096    testsn-0.local:49155    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49120    testsn-1.local:49156    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49121    testsn-1.local:49152    ESTABLISHED 1295/glusterfs
tcp        0      0 testsn-1.local:49126    testsn-1.local:49154    ESTABLISHED 1295/glusterfs
unix  3      [ ]         STREAM     CONNECTED      36264 1295/glusterfs      /var/run/gluster/178dba826edae38df4ba67f25beeb1e6.socket
unix  2      [ ]         DGRAM                     36258 1295/glusterfs      

Expected results:
glustershd should be able to connect to the local brick process.

Additional info:
# gluster v status mstate
Status of volume: mstate
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick testsn-0.local:/mnt/bricks/mstate/bri
ck                                          49154     0          Y       1113 
Brick testsn-1.local:/mnt/bricks/mstate/bri
ck                                          49155     0          Y       1117 
Self-heal Daemon on localhost               N/A       N/A        Y       1295 
Self-heal Daemon on testsn-2.local          N/A       N/A        Y       1813 
Self-heal Daemon on testsn-0.local          N/A       N/A        Y       1135 	
Task Status of Volume mstate
There are no active volume tasks
It is quite strange that the listening port of the mstate brick process is shown as 49155 in “gluster v status mstate” but as 49153 in the ps output:
# ps -ef | grep -i glusterfsd | grep mstate
root      1117     1  0 Jan18 ?        00:00:05 /usr/sbin/glusterfsd -s testsn-1.local --volfile-id mstate.testsn-1.local.mnt-bricks-mstate-brick -p /var/run/gluster/vols/mstate/testsn-1.local-mnt-bricks-mstate-brick.pid -S /var/run/gluster/b520b934b415e6a68776cc4852901a77.socket --brick-name /mnt/bricks/mstate/brick -l /var/log/glusterfs/bricks/mnt-bricks-mstate-brick.log --xlator-option *-posix.glusterd-uuid=9ccea6b1-4d81-4020-a4ba-ee6821268ba8 --brick-port 49153 --xlator-option mstate-server.listen-port=49153 --xlator-option transport.socket.bind-address=testsn-1.local
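
The comparison above (port reported by “gluster v status” versus the --brick-port argument of the running glusterfsd) can be scripted. This is a rough sketch under stated assumptions: the function names are hypothetical, the command line is a shortened copy of the ps output in this report, and the reported port is passed in by hand rather than parsed from the CLI.

```python
import re
from typing import Optional


def brick_port_from_cmdline(cmdline: str) -> Optional[int]:
    """Extract the value of --brick-port from a glusterfsd command line."""
    match = re.search(r"--brick-port[ =](\d+)", cmdline)
    return int(match.group(1)) if match else None


def port_mismatch(reported_port: int, cmdline: str) -> bool:
    """True if the process was started with a port other than the reported one."""
    actual = brick_port_from_cmdline(cmdline)
    return actual is not None and actual != reported_port


# Shortened from the ps output in this report.
CMDLINE = (
    "/usr/sbin/glusterfsd -s testsn-1.local "
    "--brick-name /mnt/bricks/mstate/brick "
    "--brick-port 49153 "
    "--xlator-option mstate-server.listen-port=49153"
)
```

With the values from this report, port_mismatch(49155, CMDLINE) is True, which is the inconsistency described: glusterd advertises 49155 while the brick was started with 49153.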

Comment 1 Worker Ant 2018-01-23 02:29:06 UTC
REVIEW: https://review.gluster.org/19263 (glusterd: process pmap sign in only when port is marked as free) posted (#3) for review on master by Atin Mukherjee

Comment 2 Worker Ant 2018-01-23 02:45:20 UTC
REVISION POSTED: https://review.gluster.org/19263 (glusterd: process pmap sign in only when port is marked as free) posted (#4) for review on master by Atin Mukherjee

Comment 3 Worker Ant 2018-01-25 08:03:13 UTC
REVIEW: https://review.gluster.org/19323 (glusterd: process pmap sign in only when port is marked as free) posted (#1) for review on release-3.12 by Atin Mukherjee

Comment 4 Worker Ant 2018-02-02 06:49:01 UTC
COMMIT: https://review.gluster.org/19323 committed in release-3.12 by "jiffin tony Thottan" <jthottan@redhat.com> with a commit message- glusterd: process pmap sign in only when port is marked as free

Because of some crazy race in volume start code path because of friend
handshaking with volumes with quorum enabled we might end up into a situation
where glusterd would start a brick and get a disconnect and then immediately try
to start the same brick instance based on another friend update request. And
then if for the very first brick even if the process doesn't come up at the end
sign in event gets sent and we end up having two duplicate portmap entries for
the same brick. Since in brick start we mark the previous port as free, its
better to consider a sign in request as no op if the corresponding port type is
marked as free.

>mainline patch : https://review.gluster.org/#/c/19263/

Change-Id: I995c348c7b6988956d24b06bf3f09ab64280fc32
BUG: 1537346
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
(cherry picked from commit 9d708a3739c8201d23f996c413d6b08f8b13dd90)
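
The idea in the commit message can be illustrated with a toy model. This is not glusterd's code or its data structures: the PortMap class, its method names, and the port numbers are all hypothetical and only sketch the rule "ignore a sign in if the port is marked as free".

```python
from enum import Enum
from typing import Dict, Optional


class PortType(Enum):
    FREE = "free"
    BRICKSERVER = "brickserver"


class PortMap:
    """Toy port registry (illustrative model only, not glusterd's pmap)."""

    def __init__(self) -> None:
        self._type: Dict[int, PortType] = {}
        self._brick: Dict[int, str] = {}

    def release(self, port: int) -> None:
        """Mark a port as free and drop its brick entry."""
        self._type[port] = PortType.FREE
        self._brick.pop(port, None)

    def allocate(self, port: int, brick: str) -> None:
        """Brick start: free any port the brick held before, then assign `port`."""
        for old_port, name in list(self._brick.items()):
            if name == brick:
                self.release(old_port)
        self._type[port] = PortType.BRICKSERVER
        self._brick[port] = brick

    def sign_in(self, port: int, brick: str) -> bool:
        """Process a sign in; treat it as a no-op if the port is marked free."""
        if self._type.get(port, PortType.FREE) is PortType.FREE:
            return False  # stale sign in from an earlier brick instance
        self._brick[port] = brick
        return True

    def lookup(self, brick: str) -> Optional[int]:
        """Return the port currently registered for `brick`, if any."""
        for port, name in self._brick.items():
            if name == brick:
                return port
        return None
```

In this model, if a brick is started on port 50001 and then restarted on port 50002, a late sign_in(50001, ...) from the first instance is ignored, so lookup() keeps returning 50002. Without the check in sign_in, the registry would end up with two entries for the same brick, which is the duplicate portmap situation the commit message describes.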

Comment 5 Jiffin 2018-03-05 07:14:08 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.6, please open a new bug report.

glusterfs-3.12.6 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2018-February/033552.html
[2] https://www.gluster.org/pipermail/gluster-users/
