Bug 1476828 - selfheal deamon getting connection refused, due to bricks listening on different ports
selfheal deamon getting connection refused, due to bricks listening on differ...
Status: CLOSED WONTFIX
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd (Show other bugs)
3.3
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Atin Mukherjee
Bala Konda Reddy M
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-31 10:42 EDT by nchilaka
Modified: 2017-08-02 00:43 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-02 00:43:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description nchilaka 2017-07-31 10:42:03 EDT
Description of problem:
======================
I was testing BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica 
for a arbiter volume
I ended up in a situation where all bricks are up, but heal info and shd sees 2 bricks still down

See steps for more info

Version-Release number of selected component (if applicable):
========
3.8.4-36


Steps to Reproduce:
1.had a 3 node setup, n1..n3
2.created an arbiter volume 1 x (2 + 1) with n3 hosting arbiter brick
3.mounted volume and created dir{1..10}/f{1..3000} with each file have word "hello" (non zero, very tiny file)
4. added new set of bricks
5. triggered rebalance
6. while rebalance was going on, brought down both the data bricks of first subvolume( wanted to validate of arbiter node, would rebalance files, from arbiter node, ie metadata zero size files or from other data nodes, as wrt the fix)
Checked rebalnce, and found that it showed n1 as complete while n2 and n3 as failed

Estimated time left for rebalance to complete :       15:49:55
volume rebalance: arbo: success
[root@dhcp35-192 glusterfs]# gluster v rebal arbo status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              925         5.4KB          5717             0             0          in progress        0:00:39
                            10.70.35.214              918         5.4KB          5720             1             0          in progress        0:00:39
                            10.70.35.215              912         5.3KB          5706             0             0          in progress        0:00:39
Estimated time left for rebalance to complete :       15:49:43
volume rebalance: arbo: success
[root@dhcp35-192 glusterfs]# kill -9 9174
[root@dhcp35-192 glusterfs]# gluster v rebal arbo status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1111         6.5KB          7507             3             0            completed        0:00:48
                            10.70.35.214             1104         6.5KB          6771             4             0               failed        0:00:47
                            10.70.35.215             1098         6.4KB          7679             4             0               failed        0:00:48
volume rebalance: arbo: success

7. now started volume with force, to bring back bricks
8. waited for heal to triggere and clean up if any heals are pending, but heals didn't complete as below for some time, even with manual heal command
[root@dhcp35-192 glusterfs]# gluster v status arbo
Status of volume: arbo
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.192:/rhs/brick1/arbo         49158     0          Y       12521
Brick 10.70.35.214:/rhs/brick1/arbo         49162     0          Y       29544
Brick 10.70.35.215:/rhs/brick1/arbo         49159     0          Y       12866
Brick 10.70.35.192:/rhs/brick2/arbo         49155     0          Y       12260
Brick 10.70.35.214:/rhs/brick2/arbo         49158     0          Y       29332
Brick 10.70.35.215:/rhs/brick2/arbo         49160     0          Y       13292
Self-heal Daemon on localhost               N/A       N/A        Y       12541
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       13458
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       29565
 
Task Status of Volume arbo
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 4775ca69-7c93-45ad-b2c4-4a13b1fff38e
Status               : failed              
 
[root@dhcp35-192 glusterfs]# gluster v heal arbo info|grep ntriesNumber of entries: -
Number of entries: 0
Number of entries: 4
Number of entries: -
Number of entries: 3324
Number of entries: 3324


[root@dhcp35-192 glusterfs]# gluster v heal arbo 
Launching heal operation to perform index self heal on volume arbo has been unsuccessful on bricks that are down. Please check if all brick processes are running.

[root@dhcp35-192 glusterfs]# gluster v heal arbo info|more
Brick 10.70.35.192:/rhs/brick1/arbo
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.214:/rhs/brick1/arbo
Status: Connected
Number of entries: 0

Brick 10.70.35.215:/rhs/brick1/arbo
Status: Connected
Number of entries: 0

Brick 10.70.35.192:/rhs/brick2/arbo
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.214:/rhs/brick2/arbo
/ 
/.trashcan 
/.trashcan/internal_op 
/dir1 
/dir1/f367 
/dir1/f256 
/dir1/f353 
/dir1/f21 
/dir1/f431 
/dir1/f141 
/dir1/f99 
/dir1/f122 
/dir1/f54 
/dir1/f204 
/dir1/f438 
/dir1/f16 
/dir1/f394 
/dir1/f165 
/dir1/f28 
/dir1/f136 
/dir1/f406 
/dir1/f40 
/dir1/f195 
/dir1/f398 
/dir1/f439 
/dir1/f151 
/dir1/f105 
/dir1/f359 
/dir1/f53 
/dir1/f179 
/dir1/f369 
/dir1/f447 
/dir1/f65 
/dir1/f348 
/dir1/f407 
/dir1/f180 
/dir1/f257 
/dir1/f139 

9. So re-triggered rebalance
[root@dhcp35-192 glusterfs]# gluster v rebal arbo start
volume rebalance: arbo: success: Rebalance on arbo has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 7e1f6c4c-f3bb-4746-89a1-bb890ebdec08

[root@dhcp35-192 glusterfs]# gluster v rebal arbo status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0            completed        0:00:01
                            10.70.35.214              508         3.0KB          9857             0             0          in progress        0:00:18
                            10.70.35.215              512         3.0KB          9906             0             0          in progress        0:00:18
Estimated time left for rebalance to complete :        6:30:12

[root@dhcp35-192 glusterfs]# gluster v rebal arbo status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0            completed        0:00:01
                            10.70.35.214             3922        23.0KB         29905             0             0            completed        0:02:04
                            10.70.35.215             3932        23.0KB         30005             0             0            completed        0:02:04
volume rebalance: arbo: success
[root@dhcp35-192 glusterfs]# gluster v status arbo
Status of volume: arbo
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.192:/rhs/brick1/arbo         49158     0          Y       12521
Brick 10.70.35.214:/rhs/brick1/arbo         49162     0          Y       29544
Brick 10.70.35.215:/rhs/brick1/arbo         49159     0          Y       12866
Brick 10.70.35.192:/rhs/brick2/arbo         49155     0          Y       12260
Brick 10.70.35.214:/rhs/brick2/arbo         49158     0          Y       29332
Brick 10.70.35.215:/rhs/brick2/arbo         49160     0          Y       13292
Self-heal Daemon on localhost               N/A       N/A        Y       12541
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       29565
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       13458
 
Task Status of Volume arbo
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 7e1f6c4c-f3bb-4746-89a1-bb890ebdec08
Status               : completed           
 


10. even after rebalance, still below is heal info
[root@dhcp35-192 glusterfs]# gluster v heal arbo info|more
Brick 10.70.35.192:/rhs/brick1/arbo
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.214:/rhs/brick1/arbo
/ 
/dir4 
/dir5 
/dir6 
/dir7 
/dir8 
/dir9 
/dir10 
/dir2 
/dir3 
Status: Connected
Number of entries: 10

Brick 10.70.35.215:/rhs/brick1/arbo
/ 
/dir4 
/dir5 
/dir6 
/dir7 
/dir8 

[root@dhcp35-192 glusterfs]# gluster v status arbo
Status of volume: arbo
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.192:/rhs/brick1/arbo         49158     0          Y       12521
Brick 10.70.35.214:/rhs/brick1/arbo         49162     0          Y       29544
Brick 10.70.35.215:/rhs/brick1/arbo         49159     0          Y       12866
Brick 10.70.35.192:/rhs/brick2/arbo         49155     0          Y       12260
Brick 10.70.35.214:/rhs/brick2/arbo         49158     0          Y       29332
Brick 10.70.35.215:/rhs/brick2/arbo         49160     0          Y       13292
Self-heal Daemon on localhost               N/A       N/A        Y       12541
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       13458
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       29565
 
Task Status of Volume arbo
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 7e1f6c4c-f3bb-4746-89a1-bb890ebdec08
Status               : completed           


11. Heal triggering fails saying bricks are down when they are up
[root@dhcp35-192 glusterfs]# gluster v heal arbo 
Launching heal operation to perform index self heal on volume arbo has been unsuccessful on bricks that are down. Please check if all brick processes are running.


12. shd log says connection failed
2017-07-31 14:39:51.523313] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-3: changing port to 49157 (from 0)
[2017-07-31 14:39:51.525435] E [socket.c:2360:socket_connect_finish] 0-arbo-client-0: connection to 10.70.35.192:49160 failed (Connection refused); disconnecting socket
[2017-07-31 14:39:51.527889] E [socket.c:2360:socket_connect_finish] 0-arbo-client-3: connection to 10.70.35.192:49157 failed (Connection refused); disconnecting socket
[2017-07-31 14:39:55.528636] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-0: changing port to 49160 (from 0)
[2017-07-31 14:39:55.531813] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-3: changing port to 49157 (from 0)
[2017-07-31 14:39:55.533989] E [socket.c:2360:socket_connect_finish] 0-arbo-client-0: connection to 10.70.35.192:49160 failed (Connection refused); disconnecting socket
[2017-07-31 14:39:55.536651] E [socket.c:2360:socket_connect_finish] 0-arbo-client-3: connection to 10.70.35.192:49157 failed (Connection refused); disconnecting socket

##########

I haven't tried client side heal, because if that works, then I would lose setup for debugging
Comment 2 nchilaka 2017-07-31 10:58:24 EDT
From Pranith's initial debugging:

Even fuse client see the bricks down as below
[root@dhcp35-126 arbo]# grep connected /mnt/arbo/.meta/graphs/active/arbo-client-*/private
/mnt/arbo/.meta/graphs/active/arbo-client-0/private:connected = 0
/mnt/arbo/.meta/graphs/active/arbo-client-1/private:connected = 1
/mnt/arbo/.meta/graphs/active/arbo-client-2/private:connected = 1
/mnt/arbo/.meta/graphs/active/arbo-client-3/private:connected = 0
/mnt/arbo/.meta/graphs/active/arbo-client-4/private:connected = 1
/mnt/arbo/.meta/graphs/active/arbo-client-5/private:connected = 1
[root@dhcp35-126 arbo]#
Comment 3 Pranith Kumar K 2017-07-31 12:37:21 EDT
Raghu/Milind,
     I asked Nag to keep the setup in that state for you guys to take a look to find why the connection is not successful.

Pranith
Comment 4 Raghavendra G 2017-08-01 02:33:20 EDT
> [2017-07-31 14:39:51.525435] E [socket.c:2360:socket_connect_finish] 0-arbo-client-0: connection to 10.70.35.192:49160 failed (Connection refused); disconnecting socket
> [2017-07-31 14:39:51.527889] E [socket.c:2360:socket_connect_finish] 0-arbo-client-3: connection to 10.70.35.192:49157 failed (Connection refused); disconnecting socket
> [2017-07-31 14:39:55.528636] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-0: changing port to 49160 (from 0)
> [2017-07-31 14:39:55.531813] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-3: changing port to 49157 (from 0)
> [2017-07-31 14:39:55.533989] E [socket.c:2360:socket_connect_finish] 0-arbo-client-0: connection to 10.70.35.192:49160 failed (Connection refused); disconnecting socket
> [2017-07-31 14:39:55.536651] E [socket.c:2360:socket_connect_finish] 0-arbo-client-3: connection to 10.70.35.192:49157 failed (Connection refused); disconnecting socket

Note that the error here is "Connection refused". Looks like the brick is not listening on the port mentioned here. Can we check whether the brick is actually listening on the port shd is trying to connect (use netstat on brick or cmdline of brick process)? In the past we've seen portmapper issue where glusterd returns stale ports to clients resulting in these kind of errors.
Comment 5 nchilaka 2017-08-01 02:43:00 EDT
(In reply to Raghavendra G from comment #4)
> > [2017-07-31 14:39:51.525435] E [socket.c:2360:socket_connect_finish] 0-arbo-client-0: connection to 10.70.35.192:49160 failed (Connection refused); disconnecting socket
> > [2017-07-31 14:39:51.527889] E [socket.c:2360:socket_connect_finish] 0-arbo-client-3: connection to 10.70.35.192:49157 failed (Connection refused); disconnecting socket
> > [2017-07-31 14:39:55.528636] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-0: changing port to 49160 (from 0)
> > [2017-07-31 14:39:55.531813] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-arbo-client-3: changing port to 49157 (from 0)
> > [2017-07-31 14:39:55.533989] E [socket.c:2360:socket_connect_finish] 0-arbo-client-0: connection to 10.70.35.192:49160 failed (Connection refused); disconnecting socket
> > [2017-07-31 14:39:55.536651] E [socket.c:2360:socket_connect_finish] 0-arbo-client-3: connection to 10.70.35.192:49157 failed (Connection refused); disconnecting socket
> 
> Note that the error here is "Connection refused". Looks like the brick is
> not listening on the port mentioned here. Can we check whether the brick is
> actually listening on the port shd is trying to connect (use netstat on
> brick or cmdline of brick process)? In the past we've seen portmapper issue
> where glusterd returns stale ports to clients resulting in these kind of
> errors.




Yes, you are right, they are not listening on these ports
[root@dhcp35-192 ~]# ps -ef|grep glusterfsd|grep arbo
root     12260     1  0 Jul31 ?        00:00:08 /usr/sbin/glusterfsd -s 10.70.35.192 --volfile-id arbo.10.70.35.192.rhs-brick2-arbo -p /var/lib/glusterd/vols/arbo/run/10.70.35.192-rhs-brick2-arbo.pid -S /var/run/gluster/b3da6309c22e2f4797517b5244c7979f.socket --brick-name /rhs/brick2/arbo -l /var/log/glusterfs/bricks/rhs-brick2-arbo.log --xlator-option *-posix.glusterd-uuid=f641078d-dd40-4316-8136-1e6dc58210a2 --brick-port 49155 --xlator-option arbo-server.listen-port=49155
root     12521     1  0 Jul31 ?        00:00:08 /usr/sbin/glusterfsd -s 10.70.35.192 --volfile-id arbo.10.70.35.192.rhs-brick1-arbo -p /var/lib/glusterd/vols/arbo/run/10.70.35.192-rhs-brick1-arbo.pid -S /var/run/gluster/e37f5747fa0bff95d27ef5693d32a873.socket --brick-name /rhs/brick1/arbo -l /var/log/glusterfs/bricks/rhs-brick1-arbo.log --xlator-option *-posix.glusterd-uuid=f641078d-dd40-4316-8136-1e6dc58210a2 --brick-port 49158 --xlator-option arbo-server.listen-port=49158
[root@dhcp35-192 ~]# netstat -ntap|egrep "12260|12521|Addr"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:49155           0.0.0.0:*               LISTEN      12260/glusterfsd    
tcp        0      0 0.0.0.0:49158           0.0.0.0:*               LISTEN      12521/glusterfsd    
tcp        0      0 10.70.35.192:1023       10.70.35.192:24007      ESTABLISHED 12260/glusterfsd    
tcp        0      0 10.70.35.192:963        10.70.35.192:24007      ESTABLISHED 12521/glusterfsd    


Moving to glusterd component as it is a port mapper issue
Comment 7 nchilaka 2017-08-01 10:35:00 EDT
Yes, I used kill -9

Note You need to log in before you can comment on or make changes to this bug.