Created attachment 1430394 [details]
screen shot to display that the 10g nic 'em1' is down on the dell-per630-06, while gluster peer status shows that it is 'connected'

Description of problem:

During a network failure test on a RHHI pod, gluster peer status is not showing the correct status. A network failure test in this context simply refers to running 'ifdown em1', with em1 being the 10g nic for these machines.

Three node gluster cluster:
dell-per630-05
dell-per630-06
dell-per630-07

Before the test is run, 'gluster peer status' shows all peers as connected:

[root@dell-per630-05 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)

-----------------------------------------------

[root@dell-per630-06 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)

-----------------------------------------------

[root@dell-per630-07 ~]# gluster peer status
Number of Peers: 2

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)

During the test I fail the 10g network on dell-per630-06. This causes gluster peer status to show the following:
[root@dell-per630-05 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Disconnected)

-----------------------------------------------

[root@dell-per630-06 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Disconnected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Disconnected)

-----------------------------------------------

[root@dell-per630-07 ~]# gluster peer status
Number of Peers: 2

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)

gluster peer status on the third machine, dell-per630-07, never recognizes that the failed peer should be shown as disconnected. The test I have been running waits about 30 min, and the command has failed to report the peer as disconnected several times. The results have been inconsistent: in some runs the command correctly reported the peer as disconnected almost immediately, in others it took up to an hour or even 8 hours, and in some cases it never displayed correctly.

I have included a screen shot to show that the 10g nic 'em1' is down on dell-per630-06, together with the gluster peer status output of all three machines. The terminal in the bottom right corner is the gluster peer status command that is failing; it should show Hostname: 192.168.50.22 as disconnected.
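For anyone re-running this test, the quickest way to check what a node is currently reporting is to count the Disconnected entries in its status output. A minimal sketch; the sample text is the dell-per630-06 output captured above (in a live run you would pipe 'gluster peer status' straight into grep):

```shell
# Tally how many peers this node currently reports as Disconnected.
# 'sample' is canned output from dell-per630-06 during the failure;
# live usage: gluster peer status | grep -c 'Disconnected'
sample='Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Disconnected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Disconnected)'

disconnected=$(printf '%s\n' "$sample" | grep -c 'Disconnected')
echo "peers reported Disconnected: $disconnected"   # -> peers reported Disconnected: 2
```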
I have supplied the contents of /var/log/glusterfs/glusterd.log here:
http://pastebin.test.redhat.com/585538

Version-Release number of selected component (if applicable):
glusterfs 3.8.4
rhhi 1.1

How reproducible:
Inconsistent

Steps to Reproduce:
See above

Actual results:
gluster peer status on the third machine, dell-per630-07, never recognizes that the peer should be shown as disconnected from the cluster.

Expected results:
gluster peer status should recognize that the 10g nic is down on the failed node and display that node as disconnected.

Additional info:
The behaviour is the same when the 10g nic is recovered. In the scenarios where gluster peer status correctly reported the failure, recovering the nic hit the same inconsistencies in reverse: the nic would be back up, but gluster peer status would still display the node as disconnected.
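Since the time-to-converge varies from seconds to hours here, the steps above can be timed with a small polling loop. A sketch only: wait_for_peer_state and STATUS_CMD are hypothetical names invented for this report, not part of the gluster CLI; STATUS_CMD defaults to the real command but can be pointed at canned output for a dry run.

```shell
# Hypothetical watcher: poll the reported state of one peer until it
# matches the expected value, or give up after a timeout (seconds).
# Override STATUS_CMD to drive the loop from captured output.
STATUS_CMD="${STATUS_CMD:-gluster peer status}"

wait_for_peer_state() {
    peer="$1"; want="$2"; timeout="${3:-1800}"; waited=0
    while [ "$waited" -le "$timeout" ]; do
        # pull the parenthesised state, e.g. Connected / Disconnected,
        # for the Hostname: block matching the requested peer
        state=$(eval "$STATUS_CMD" | awk -v p="$peer" '
            $1 == "Hostname:" { cur = $2 }
            $1 == "State:" && cur == p { sub(/.*\(/, ""); sub(/\).*/, ""); print }')
        if [ "$state" = "$want" ]; then
            echo "peer $peer reported $want after ${waited}s"
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "peer $peer never reported $want within ${timeout}s" >&2
    return 1
}
```

Example use on dell-per630-07 after running 'ifdown em1' on dell-per630-06: wait_for_peer_state 192.168.50.22 Disconnected 1800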