Bug 1549023
| Summary: | Observing continuous "disconnecting socket" error messages on client glusterd logs | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Manisha Saini <msaini> |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.4 | CC: | amukherj, kaushal, msaini, rhinduja, rhs-bugs, srakonde, storage-qa-internal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-04 06:42:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1548829, 1549996 | | |
| Bug Blocks: | 1503137 | | |
Description
Manisha Saini
2018-02-26 09:25:21 UTC
Client-side glusterd volume logs:

```
[2018-02-26 09:30:10.504195] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:10.509405] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:14.509735] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:14.515452] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:18.515330] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:18.520898] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:22.521159] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:22.526885] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:26.527157] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:26.533034] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:30.539459] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:30.545487] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
[2018-02-26 09:30:34.545203] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-Ganeshavol1-client-0: changing port to 49155 (from 0)
[2018-02-26 09:30:34.550923] E [socket.c:2369:socket_connect_finish] 0-Ganeshavol1-client-0: connection to 10.70.47.193:49155 failed (Connection refused); disconnecting socket
```

Clearing the needinfo. Setup details have been provided over email, and the permissions of the logs have been changed so they are accessible.

After initial analysis, I can conclude that the problem you're seeing now is happening due to stale entries in pmap_registry. From gdb:

```
(gdb) p $8.ports[49152]
$13 = {type = GF_PMAP_PORT_BRICKSERVER, brickname = 0x7f10d40b1230 "/var/lib/glusterd/ss_brick /gluster/brick1/b1 /gluster/brick2/b1 /gluster/brick1/b2 /gluster/brick2/b2 /gluster/brick1/b3 /gluster/brick2/b3 /gluster/brick1/b4 /gluster/brick1/b5 /gluster/brick1/b6 /g"..., xprt = 0x7f10dc083eb0}
(gdb) p $8.ports[49153]
$14 = {type = GF_PMAP_PORT_BRICKSERVER, brickname = 0x7f10d40724c0 "/gluster/brick1/b1 /gluster/brick2/b1", ' ' <repeats 20 times>, "/gluster/brick2/b2 /gluster/brick1/b3 /gluster/brick2/b3", ' ' <repeats 20 times>, "/gluster/brick1/b5", ' ' <repeats 49 times>..., xprt = 0x0}
```

Note that pmap_registry contains a stale pmap_entry for port 49153 covering the affected bricks. pmap_registry_search does a reverse search from the highest port to the lowest port, so the stale port is returned to clients when they do a pmap query for one of the bricks listed in the stale entry. The client then tries to connect to the stale port and gets the connection refused errors. I've not yet root caused how the stale entries remained.
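To make the failure mode described above concrete, here is a minimal, self-contained C sketch. The structure and function names (pmap_entry, pmap_search, the port layout) are simplified stand-ins, not the actual glusterd pmap code: a stale BRICKSERVER entry whose transport is already gone still claims the brick, and a reverse search from the highest port downwards returns it ahead of the live entry.

```c
/* Hypothetical sketch (not the real glusterd code) of why a reverse
 * port-map search can hand out a stale, dead port to a client. */
#include <stdio.h>
#include <string.h>

#define PMAP_BASE 49152
#define PMAP_SIZE 4

enum pmap_port_type { PMAP_PORT_FREE, PMAP_PORT_BRICKSERVER };

struct pmap_entry {
    enum pmap_port_type type;
    const char *brickname;   /* space-separated brick paths */
    void *xprt;              /* NULL once the transport has disconnected */
};

/* Reverse search: highest port first, first matching brickname wins. */
static int pmap_search(struct pmap_entry *ports, const char *brick)
{
    for (int i = PMAP_SIZE - 1; i >= 0; i--) {
        if (ports[i].type == PMAP_PORT_BRICKSERVER &&
            ports[i].brickname && strstr(ports[i].brickname, brick))
            return PMAP_BASE + i;
    }
    return -1;
}

int main(void)
{
    int live_xprt = 1; /* stand-in for a live rpc transport */

    struct pmap_entry ports[PMAP_SIZE] = {
        /* 49152: live entry for the brick, transport still connected */
        [0] = { PMAP_PORT_BRICKSERVER, "/gluster/brick1/b1", &live_xprt },
        /* 49153: stale entry - xprt is NULL but type/brickname were never
         * reset, the same state seen in the gdb dump above */
        [1] = { PMAP_PORT_BRICKSERVER,
                "/gluster/brick1/b1 /gluster/brick2/b1", NULL },
    };

    /* The port query walks downwards and returns 49153, the dead port. */
    printf("port for /gluster/brick1/b1 -> %d\n",
           pmap_search(ports, "/gluster/brick1/b1"));
    return 0;
}
```

Under these assumptions the query returns the dead port (49153 here), which is the kind of port a client would then keep retrying, producing the Connection refused loop shown in the log excerpt.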
The only path I see which could lead to a situation as above (a pmap_entry with xprt = NULL but still having brickname entries and type = BRICKSERVER) is if glusterd_rpcsvc_notify called pmap_registry_remove on a DISCONNECT event when the brick process stopped. Normally, the glusterd_rpcsvc_notify DISCONNECT event should have had a related __glusterd_brick_rpc_notify DISCONNECT event, which in turn would also have called pmap_registry_remove, but with brick_disconnect=true, causing forceful removal of the pmap_entry. In this case that doesn't seem to have happened, and I don't know why. glusterd_rpcsvc_notify calls pmap_registry_remove with brick_disconnect=false. This leads to pmap_registry_remove setting pmap_entry.xprt to NULL, but it will not reset pmap_entry.type or clean up pmap_entry.brickname while it still has entries, leaving the stale pmap_entry (a sketch of this cleanup path appears at the end of this report). Ensuring that glusterd_rpcsvc_notify calls pmap_registry_remove with brick_disconnect=true (which it should have been doing already) should fix the observed problem.

Sanju - Can you please check if commit e2ee5cb fixes this problem?

Have we done any SIGKILL of the brick process, i.e. kill -9 <brick process>, during this testing?

(In reply to Atin Mukherjee from comment #17)
> Have we done any SIGKILL of the brick process i.e. kill -9 <brick process>
> during this testing?

No. It was a freshly created volume without any brick kill operation.

The problem is not with the current/new volume, but the previous volume. How was it stopped? I went through all the code paths and can confirm this can happen only when a brick is SIGKILLed. Kaushal's analysis is technically correct, but I don't think we can eventually land up in that case. As discussed with Manisha yesterday, we need a proper reproducer of this problem, without which nothing can be done here, and hence Manisha is going to try the scenario out and update the bug.

Verified this with:

```
# rpm -qa | grep gluster
glusterfs-3.12.2-14.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-14.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-fuse-3.12.2-14.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python2-gluster-3.12.2-14.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-8.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-14.el7rhgs.x86_64
glusterfs-server-3.12.2-14.el7rhgs.x86_64
glusterfs-rdma-3.12.2-14.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-cli-3.12.2-14.el7rhgs.x86_64
glusterfs-libs-3.12.2-14.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-14.el7rhgs.x86_64
glusterfs-api-3.12.2-14.el7rhgs.x86_64
glusterfs-debuginfo-3.12.2-14.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
```

Tried creating/deleting volumes multiple times. I am unable to hit this issue. Moving this BZ to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
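For completeness, below is the sketch referenced in the analysis above: a minimal illustration, with hypothetical names and structures rather than the actual glusterd pmap_registry_remove implementation, of why the non-forceful path (brick_disconnect=false) leaves a stale BRICKSERVER entry behind while the forceful path (brick_disconnect=true) resets the entry completely.

```c
/* Hypothetical sketch (not the real glusterd implementation) of the two
 * cleanup paths described in the analysis above. */
#include <stdbool.h>
#include <stdio.h>

enum pmap_port_type { PMAP_PORT_FREE, PMAP_PORT_BRICKSERVER };

struct pmap_entry {
    enum pmap_port_type type;
    const char *brickname;   /* may list several bricks for one port */
    void *xprt;
};

static void pmap_remove(struct pmap_entry *e, bool brick_disconnect)
{
    e->xprt = NULL;  /* the transport is gone in either case */

    if (!brick_disconnect && e->brickname && e->brickname[0]) {
        /* Non-forceful path: brickname still lists bricks, so type and
         * brickname are left alone, producing the stale entry that a
         * later port query hands back to clients. */
        return;
    }

    /* Forceful path (what the DISCONNECT handler should use): reset the
     * whole entry so the port can be reused cleanly. */
    e->brickname = NULL;
    e->type = PMAP_PORT_FREE;
}

int main(void)
{
    int xprt = 1; /* stand-in for a live rpc transport */
    struct pmap_entry e = { PMAP_PORT_BRICKSERVER, "/gluster/brick1/b1", &xprt };

    pmap_remove(&e, false); /* leaves a stale BRICKSERVER entry behind */
    printf("non-forceful remove: type=%d brickname=%s xprt=%p\n",
           e.type, e.brickname, e.xprt);

    pmap_remove(&e, true);  /* forceful removal clears the entry */
    printf("forceful remove:     type=%d brickname=%s\n",
           e.type, e.brickname ? e.brickname : "(none)");
    return 0;
}
```

This only illustrates why the two code paths leave different state behind; it does not reproduce the actual glusterd logic or the fix shipped in glusterfs-3.12.2-8.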