Description of problem:
Observing continuous "disconnecting socket" error messages in the client-side logs on all 4 clients, even though all bricks of the volume are up and running. On debugging the setup with Dev, it appears the client is trying to reach a brick via a stale port.

Version-Release number of selected component (if applicable):
glusterfs-server-3.12.2-4.el7rhgs.x86_64

How reproducible:
Reporting the first instance

Steps to Reproduce:
1. Create a 4x3 distributed-replicate volume.
2. Set client access permissions on the volume via the auth.allow volume set option to export the fuse subdirs:
   # gluster v set Ganeshavol1 auth.allow "/dir1(10.70.46.125),/dir1(10.70.46.20),/dir3(10.70.47.33),/dir4(10.70.47.170),/(*)"
3. Mount the subdirs and the volume on 4 different clients.
4. Start running I/O (bonnie, file renames, etc.) from the clients having fuse subdir mounts.

Actual results:
Continuous "disconnecting socket" error messages on the client side.

Expected results:
No "disconnecting socket" messages should be observed on the client side.

Additional info:
Before testing fuse subdir, the same setup was used for Ganesha and GNFS testing. A new volume was created for the fuse subdir testing, but with the same volume name and brick names.
======================================
Snippet from the glusterd statedump:

glusterd.pmap[49154].brickname=/gluster/brick1/b1
glusterd.pmap_port=49155
glusterd.pmap[49155].type=4
glusterd.pmap[49155].brickname=/gluster/brick1/b1
======================================

# gluster v status
Status of volume: Ganeshavol1
Gluster process                                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp47-193.lab.eng.blr.redhat.com:/gluster/brick1/b1  49153     0          Y       7984
Brick dhcp46-116.lab.eng.blr.redhat.com:/gluster/brick1/b1  49152     0          Y       26930
Brick dhcp46-184.lab.eng.blr.redhat.com:/gluster/brick1/b1  49156     0          Y       6500
Brick dhcp47-2.lab.eng.blr.redhat.com:/gluster/brick1/b1    49154     0          Y       12190
Brick dhcp47-193.lab.eng.blr.redhat.com:/gluster/brick2/b2  49153     0          Y       7984
Brick dhcp46-116.lab.eng.blr.redhat.com:/gluster/brick2/b2  49152     0          Y       26930
Brick dhcp46-184.lab.eng.blr.redhat.com:/gluster/brick2/b2  49156     0          Y       6500
Brick dhcp47-2.lab.eng.blr.redhat.com:/gluster/brick2/b2    49154     0          Y       12190
Brick dhcp47-193.lab.eng.blr.redhat.com:/gluster/brick3/b3  49153     0          Y       7984
Brick dhcp46-116.lab.eng.blr.redhat.com:/gluster/brick3/b3  49152     0          Y       26930
Brick dhcp46-184.lab.eng.blr.redhat.com:/gluster/brick3/b3  49156     0          Y       6500
Brick dhcp47-2.lab.eng.blr.redhat.com:/gluster/brick3/b3    49154     0          Y       12190
Self-heal Daemon on localhost                               N/A       N/A        Y       12241
Self-heal Daemon on dhcp47-193.lab.eng.blr.redhat.com       N/A       N/A        Y       8034
Self-heal Daemon on dhcp46-184.lab.eng.blr.redhat.com       N/A       N/A        Y       6552
Self-heal Daemon on dhcp46-116.lab.eng.blr.redhat.com       N/A       N/A        Y       26978

Task Status of Volume Ganeshavol1
------------------------------------------------------------------------------
There are no active volume tasks

Attaching sosreports and statedumps in a while. The setup is being kept in the same state for further debugging.
Client-side volume (mount) logs:

[2018-02-26 09:30:10.504195] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:10.509405] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:14.509735] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:14.515452] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:18.515330] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:18.520898] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:22.521159] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:22.526885] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:26.527157] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:26.533034] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:30.539459] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:30.545487] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:34.545203] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:34.550923] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
Clearing the needinfo. Setup details have been provided over email, and the permissions on the logs have been changed to make them accessible.
After initial analysis, I can conclude that the problem you're seeing is caused by stale entries in the pmap_registry. From gdb:

```
(gdb) p $8.ports[49152]
$13 = {type = GF_PMAP_PORT_BRICKSERVER, brickname = 0x7f10d40b1230 "/var/lib/glusterd/ss_brick /gluster/brick1/b1 /gluster/brick2/b1 /gluster/brick1/b2 /gluster/brick2/b2 /gluster/brick1/b3 /gluster/brick2/b3 /gluster/brick1/b4 /gluster/brick1/b5 /gluster/brick1/b6 /g"..., xprt = 0x7f10dc083eb0}
(gdb) p $8.ports[49153]
$14 = {type = GF_PMAP_PORT_BRICKSERVER, brickname = 0x7f10d40724c0 "/gluster/brick1/b1 /gluster/brick2/b1", ' ' <repeats 20 times>, "/gluster/brick2/b2 /gluster/brick1/b3 /gluster/brick2/b3", ' ' <repeats 20 times>, "/gluster/brick1/b5", ' ' <repeats 49 times>..., xprt = 0x0}
```

Note that the pmap_registry contains a stale pmap_entry on port 49153 for the affected bricks. pmap_registry_search does a reverse search from the highest port to the lowest, so the stale port is returned to the clients when they do a pmap query for one of the bricks listed in the stale entry. The client then tries to connect to the stale port and gets the "Connection refused" errors.

I've not yet root-caused how the stale entries remained. The only path I see that could lead to the situation above (a pmap_entry with xprt = NULL, but still having brickname entries and type = BRICKSERVER) is glusterd_rpcsvc_notify calling pmap_registry_remove on a DISCONNECT event when the brick process stops. Normally, a glusterd_rpcsvc_notify DISCONNECT event should have had a related __glusterd_brick_rpc_notify DISCONNECT event, which in turn would also have called pmap_registry_remove, but with brick_disconnect=true, causing forceful removal of the pmap_entry. In this case that doesn't seem to have happened, and I don't know why. glusterd_rpcsvc_notify calls pmap_registry_remove with brick_disconnect=false.

This leads to pmap_registry_remove setting pmap_entry.xprt to NULL, but it does not reset pmap_entry.type or clean up pmap_entry.brickname while brickname still has entries, leaving the stale pmap_entry behind. Ensuring that glusterd_rpcsvc_notify calls pmap_registry_remove with brick_disconnect=true (which it should have been doing already) should fix the observed problem.
Sanju - Can you please check if commit e2ee5cb fixes this problem?
Have we done any SIGKILL of the brick process i.e. kill -9 <brick process> during this testing?
(In reply to Atin Mukherjee from comment #17)
> Have we done any SIGKILL of the brick process i.e. kill -9 <brick process>
> during this testing?

No. It was a freshly created volume, without any brick kill operation.
The problem is not with the current/new volume, but the previous volume. How was it stopped?
I went through all the code paths and can confirm this can happen only when a brick is SIGKILLed. Kaushal's analysis, although technically correct, does not describe a case I think we can actually land in. As discussed with Manisha yesterday, we need a proper reproducer for this problem, without which nothing can be done here; hence Manisha is going to try the scenario out and update the bug.
Verified this with:

# rpm -qa | grep gluster
glusterfs-3.12.2-14.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-14.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-fuse-3.12.2-14.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python2-gluster-3.12.2-14.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-8.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-14.el7rhgs.x86_64
glusterfs-server-3.12.2-14.el7rhgs.x86_64
glusterfs-rdma-3.12.2-14.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-cli-3.12.2-14.el7rhgs.x86_64
glusterfs-libs-3.12.2-14.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-14.el7rhgs.x86_64
glusterfs-api-3.12.2-14.el7rhgs.x86_64
glusterfs-debuginfo-3.12.2-14.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64

Tried creating/deleting volumes multiple times. I am unable to hit this issue. Moving this BZ to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607