Description of problem:
-----------------------
On a 4-node cluster, `gluster peer status` on one of the nodes shows another peer as disconnected, but the "disconnected" peer appears to be connected to the other peers and has glusterd running on it.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7.5-14.el7rhgs.x86_64

How reproducible:
-----------------
Only observed once.

Steps to Reproduce:
-------------------
I am not sure of the exact steps to reproduce the issue, but glusterd was restarted a few times on the peer that appears to be disconnected.

Actual results:
---------------
One peer appears to be disconnected from only one of the nodes in the cluster, whereas it actually has glusterd running and appears to be connected to the other peers.

Expected results:
-----------------
The state of the peer should be consistent across the cluster. Since glusterd is running on the node without any issues, it should be connected to all other nodes.
After debugging the leftover setup, I found that rpc_clnt_reconnect was being triggered every 3 seconds, which means the node was still trying to re-establish the connection with the disconnected peer. However, socket_connect() was returning a failure saying that the underlying transport is already connected. Basically, the code expects the socket fd to be -1 at that point, but it was set to 17.
We tried to debug this problem further but could not reach any concrete conclusion. However, it seems that the (DIS)CONNECT events raced, because of which notifyfn to the upper layer was never called. We'd need to add a few logs into the code and see whether we can reproduce it to get to the RCA. By no means is it a blocker for RHGS 3.1.2.
After going through the log files once again, it seems that multi-threaded epoll was enabled for GlusterD in this setup. Ideally, once GlusterD starts up, the final graph would look like this, with an INFO log:

+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option transport.socket.listen-backlog 128
  7:     option rpc-auth-allow-insecure on
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:
+------------------------------------------------------------------------------+
[2016-02-11 13:36:36.167635] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1

But here in the logs I can see the following:

Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option transport.socket.listen-backlog 128
  7:     option rpc-auth-allow-insecure on
  8:     option ping-timeout 0
  9:     option transport.socket.read-fail-log off
 10:     option transport.socket.keepalive-interval 2
 11:     option transport.socket.keepalive-time 10
 12:     option transport-type rdma
 13:     option working-directory /var/lib/glusterd
 14: end-volume
 15:
+------------------------------------------------------------------------------+
[2016-01-19 12:38:35.428986] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-19 12:38:35.429108] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2

And the above indicates that the thread count defaulted to 2 because the glusterd.vol file doesn't have the event-threads option set. Since this is not a supported configuration, we'd need to close this bug.
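For reference, a glusterd.vol in the supported shape would pin glusterd to a single epoll thread with the event-threads option, matching the first graph above. This is a minimal config sketch built from the options visible in that graph, not a copy of any node's actual file:

```
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type rdma
    option ping-timeout 0
    option event-threads 1
end-volume
```

With the `option event-threads 1` line removed, glusterd falls back to the multi-threaded epoll default, which is what the second graph shows.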
(In reply to Atin Mukherjee from comment #6)
> Since this is not a supported configuration we'd need to close this bug.

I did an upgrade test from RHGS 3.0 to RHGS 3.1.0, and here are the results:
1. In RHGS 3.0, glusterd runs with a single epoll thread only.
2. In RHGS 3.0, no option is available to configure the number of epoll threads for glusterd.

After upgrade to RHGS 3.1.0, there are no changes to the glusterd volfile, and the above observations hold true in RHGS 3.1.0 too. After upgrade to RHGS 3.1.1, there is a change in the glusterd volfile with 'event-threads 1', which prompts glusterd to start with only 1 epoll thread. If this option is removed, glusterd starts with 2 epoll threads. We haven't really reached a conclusion until we know how the 'event-threads 1' option got removed from the glusterd volfile (in glusterfs-3.7.5-14.el7rhgs).
As per the discussion with Shruti, it seems the glusterd.vol file didn't have the event-threads option configured: in this container setup the glusterd.vol file was shared, and it came from a different build. Hence closing this bug.