Description of problem:
I have a 1 x (2 + 1) = 3 volume.  One host in the cluster is reporting that one of the peers is down, but the other two hosts don't show any problems.

[root@store01 ~]# gluster volume status shardvol1
Status of volume: shardvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick store01:/srv/gluster/shardbrick1      49153     0          Y       23039
Brick store03:/srv/gluster/shardbrick1      49153     0          Y       7120
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       27426
NFS Server on store03                       N/A       N/A        N       N/A
Self-heal Daemon on store03                 N/A       N/A        Y       31088

[root@store02 ~]# gluster volume status shardvol1
Status of volume: shardvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick store01:/srv/gluster/shardbrick1      49153     0          Y       23039
Brick store02:/srv/gluster/shardbrick1      49153     0          Y       5660
Brick store03:/srv/gluster/shardbrick1      49153     0          Y       7120
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       31843
NFS Server on store03                       N/A       N/A        N       N/A
Self-heal Daemon on store03                 N/A       N/A        Y       31088
NFS Server on store01                       N/A       N/A        N       N/A
Self-heal Daemon on store01                 N/A       N/A        Y       27426

[root@store03 ~]# gluster volume status shardvol1
Status of volume: shardvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick store01:/srv/gluster/shardbrick1      49153     0          Y       23039
Brick store02:/srv/gluster/shardbrick1      49153     0          Y       5660
Brick store03:/srv/gluster/shardbrick1      49153     0          Y       7120
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       31088
NFS Server on store01                       N/A       N/A        N       N/A
Self-heal Daemon on store01                 N/A       N/A        Y       27426
NFS Server on store02                       N/A       N/A        N       N/A
Self-heal Daemon on store02                 N/A       N/A        Y       31843

Note that store01's view of the volume is missing the store02 brick entirely.

[root@store01 ~]# gluster peer status
Number of Peers: 2

Hostname: store02
Uuid: 97d6abee-f2ac-47d8-bb96-738ffb99b38f
State: Peer in Cluster (Disconnected)

Hostname: store03
Uuid: 2cbf1e99-4fb7-410d-b1f1-e385000b20ec
State: Peer in Cluster (Connected)

[root@store02 ~]# gluster peer status
Number of Peers: 2

Hostname: store01
Uuid: 7a931ab9-6075-4fd4-868d-9deeb91295c0
State: Peer in Cluster (Connected)

Hostname: store03
Uuid: 2cbf1e99-4fb7-410d-b1f1-e385000b20ec
State: Peer in Cluster (Connected)

[root@store03 ~]# gluster peer status
Number of Peers: 2

Hostname: store01
Uuid: 7a931ab9-6075-4fd4-868d-9deeb91295c0
State: Peer in Cluster (Connected)

Hostname: store02
Uuid: 97d6abee-f2ac-47d8-bb96-738ffb99b38f
State: Peer in Cluster (Connected)

Version-Release number of selected component (if applicable):
glusterfs 3.8.6

How reproducible:
Probably not very

Steps to Reproduce:
1. Create a 1 x (2 + 1) = 3 volume
2. ???

Additional info:
I've uploaded gluster sosreports from each of the hosts here:
http://drop.ceph.com/qa/dgalloway/

As far as I can tell, store02 started showing as disconnected from store01 today.  No changes (iptables or otherwise) were made to the gluster hosts.  I attempted a 'service glusterd restart' on store01 and store02 without improvement.
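For anyone trying to fill in the reproduction steps: a 1 x (2 + 1) volume of this shape (two data bricks plus one arbiter) would be created roughly as follows. This is a sketch using the hostnames and brick paths from this report, not the exact commands originally run:

    gluster peer probe store02
    gluster peer probe store03
    gluster volume create shardvol1 replica 3 arbiter 1 \
        store01:/srv/gluster/shardbrick1 \
        store02:/srv/gluster/shardbrick1 \
        store03:/srv/gluster/shardbrick1
    gluster volume start shardvol1

The 'replica 3 arbiter 1' syntax makes the third brick a metadata-only arbiter, which is what the "2 + 1" notation refers to.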
We had seen this issue a long time back but couldn't root-cause it at the time. We were able to debug it to the point of finding a discrepancy between the socket and application layers: the application layer kept retrying the connection while the underlying socket was already connected, so every reconnect attempt failed. The good news is that we recently managed to debug and root-cause this issue. It has now been fixed in upstream mainline, and the patch http://review.gluster.org/#/c/16025 has been backported to the release-3.8 branch, so it should go into the next 3.8.x release.
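To make the symptom of that race concrete: the kernel may show an established TCP connection on the management port while glusterd's application layer still considers the peer disconnected. A quick way to compare the two views on an affected node (a sketch, assuming the default management port 24007):

    # Kernel's view: established connections on glusterd's port
    ss -tn state established '( dport = :24007 or sport = :24007 )'

    # Application layer's view
    gluster peer status

If ss shows an established connection to a peer that 'gluster peer status' reports as Disconnected, the node is likely hitting this race.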
Thanks for the quick response! Am I at any risk continuing to run my cluster in this state? In other words, are the peers really disconnected, meaning replication is not happening, or is the 'peer status' output bogus?
No, it's a race; restarting the glusterd instance on the node where the peer shows as disconnected should bring everything back to normal.
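For reference, the suggested recovery would look something like this on the affected node (the exact service command depends on the init system; this report uses 'service'):

    service glusterd restart    # or: systemctl restart glusterd
    gluster peer status         # expect: State: Peer in Cluster (Connected)

Restarting glusterd only restarts the management daemon; brick processes and client mounts are not interrupted by it.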
(In reply to Atin Mukherjee from comment #3)
> No, it's a race; restarting the glusterd instance on the node where the
> peer shows as disconnected should bring everything back to normal.

It doesn't, though.  I restarted glusterd on store01 and store02 multiple times.
(In reply to David Galloway from comment #4)
> It doesn't, though.  I restarted glusterd on store01 and store02 multiple
> times.

In that case, my initial suspicion doesn't hold true.  Is there any firewall setting preventing the connection from going through from store01 to store02?
(In reply to Atin Mukherjee from comment #5)
> In that case, my initial suspicion doesn't hold true.  Is there any
> firewall setting preventing the connection from going through from store01
> to store02?

Relevant 'iptables -L' output:

store01
-------
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24007
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24009
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:38465:38468
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:49152:49162

store02
-------
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24007
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24009
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:38465:38467
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:49152:49162
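Those rules do accept TCP 24007 from 172.21.0.0/20 on both nodes, so the ACCEPT rules themselves look fine. One way to verify beyond reading the rules is to probe the management port in both directions and inspect the rule counters (a sketch; assumes nc is installed):

    # From store01:
    nc -zvw5 store02 24007

    # From store02:
    nc -zvw5 store01 24007

    # On each node, show chains with packet counters and numeric addresses:
    iptables -L -n -v

Note that 'iptables -L' without '-v' hides the interface and counter columns, and doesn't make the chain's default policy obvious if no rule matches.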
David, I wanted to have a look at the sosreport but was unable to access it due to insufficient permissions. In the meantime, can you check whether glusterd has seen any disconnect events? If that's the case, you should see a 'peer X is disconnected' type of entry in the glusterd log. Otherwise, could you restart glusterd with debug logging enabled (kill glusterd and then bring it back with 'glusterd -LDEBUG') and share the glusterd logs from both nodes?
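Spelled out, that procedure would look roughly like this on each node (the log path assumes the default /var/log/glusterfs location; adjust if your installation logs elsewhere):

    # 1. Check the existing log for peer disconnect events:
    grep -i 'disconnected' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

    # 2. If nothing conclusive, restart glusterd at debug verbosity:
    pkill glusterd
    glusterd -LDEBUG

    # 3. Reproduce the problem, then collect the same log file from both nodes.

Starting 'glusterd -LDEBUG' this way bypasses the init system, so remember to restart glusterd normally (e.g. 'service glusterd restart') once the logs are collected.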
I fixed the file permissions on the sosreports; sorry about that. We had a network glitch on 2016-12-03, so there are a few entries about peers disconnecting then, as well as from when I tried restarting glusterd on 2016-12-06. Let me know if you still need the debug logging after checking out the sosreports.
Do you still need debug logs, or can you tell from the sosreports what happened?
I just logged in to store01 and all peers are showing as connected. Nothing has changed as far as firewalls or configs on my end. I uploaded new sosreports from store01 and store02 to http://drop.ceph.com/qa/dgalloway/ if you want to investigate.
Sorry for the delay, I'll have a look at the logs and get back.
[2016-12-06 23:40:10.394118] E [socket.c:2309:socket_connect_finish] 0-management: connection to 172.21.0.9:24007 failed (Connection timed out)
[2016-12-06 23:40:10.394317] I [MSGID: 106004] [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer <store02> (<97d6abee-f2ac-47d8-bb96-738ffb99b38f>), in state <Peer in Cluster>, has disconnected from glusterd.

The above glusterd log snippet from store01 convinces me that there was a connectivity issue between store01 and store02 (172.21.0.9). Doesn't restarting all the glusterd instances solve this issue?
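To correlate those timeouts with the network glitch mentioned in comment #8, one could pull every connect failure and peer-disconnect event, with timestamps, from both nodes' glusterd logs, e.g. (again assuming the default log path):

    grep -E 'socket_connect_finish|has disconnected from glusterd' \
        /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

If the failures cluster around 2016-12-03 and then stop on their own, that points at a transient network problem rather than the glusterd reconnect race described in comment #1.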
(In reply to Atin Mukherjee from comment #12)
> The above glusterd log snippet from store01 convinces me that there was a
> connectivity issue between store01 and store02 (172.21.0.9).
>
> Doesn't restarting all the glusterd instances solve this issue?

It didn't, no.  But all the peers are showing as connected now without any intervention on my part, so I'm not sure this bug is valid anymore.
I am closing this BZ based on comment 13; please reopen if the issue is seen again.