Bug 1574291 - Failing the 10g nic on a node that is part of a three node gluster cluster does not display disconnected in the gluster peer status command on a different node
Keywords:
Status: CLOSED DUPLICATE of bug 1408354
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: rhhi-1.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Milind Changire
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks: 1724792
Reported: 2018-05-03 02:12 UTC by Adam Scerra
Modified: 2019-06-28 16:04 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-15 10:48:00 UTC
Embargoed:


Attachments
Screenshot showing that the 10g nic 'em1' is down on dell-per630-06 while gluster peer status shows it as 'connected' (224.74 KB, image/png)
2018-05-03 02:12 UTC, Adam Scerra

Description Adam Scerra 2018-05-03 02:12:59 UTC
Created attachment 1430394
Screenshot showing that the 10g nic 'em1' is down on dell-per630-06 while gluster peer status shows it as 'connected'

Description of problem:
During a network failure test on an RHHI pod, 'gluster peer status' does not report the correct peer state. The network failure test here simply means running 'ifdown em1', where em1 is the 10g nic on these machines.
 
Three-node gluster cluster:
dell-per630-05
dell-per630-06
dell-per630-07

Before the test is run, 'gluster peer status' shows all peers as connected:

[root@dell-per630-05 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)
-----------------------------------------------
[root@dell-per630-06 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)
-----------------------------------------------
[root@dell-per630-07 ~]# gluster peer status
Number of Peers: 2

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)


During the test I fail the 10g network on dell-per630-06. 'gluster peer status' then shows the following:

[root@dell-per630-05 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Disconnected)
-----------------------------------------------
[root@dell-per630-06 ~]# gluster peer status
Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Disconnected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Disconnected)
-----------------------------------------------
[root@dell-per630-07 ~]# gluster peer status
Number of Peers: 2

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)

'gluster peer status' on the third machine, dell-per630-07, never recognizes that the peer (192.168.50.22) is disconnected from the cluster. The test I have been running waits about 30 minutes, and several times the command has failed to show the peer as disconnected within that time.
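
The mismatch between the per-node outputs can be checked mechanically. The following is a minimal Python sketch, assuming the 'gluster peer status' output format shown in this report. The helper names parse_peer_status and disagreeing_peers are hypothetical and not part of gluster. It parses the output captured on each node and lists the peers whose connection state differs between observers.

```python
import re

# Matches one peer stanza of 'gluster peer status' as pasted in this report.
PEER_RE = re.compile(
    r"Hostname:\s*(?P<host>\S+)\s*\n"
    r"Uuid:\s*(?P<uuid>\S+)\s*\n"
    r"State:\s*(?P<state>.+?)\s*\((?P<conn>Connected|Disconnected)\)"
)


def parse_peer_status(text):
    """Return {uuid: (hostname, connection)} for each peer in the output."""
    return {m["uuid"]: (m["host"], m["conn"]) for m in PEER_RE.finditer(text)}


def disagreeing_peers(views):
    """views maps an observer name to its peer-status text.

    Returns {uuid: {observer: connection}} for every peer that at least
    two observers report with different connection states.
    """
    seen = {}
    for observer, text in views.items():
        for uuid, (_host, conn) in parse_peer_status(text).items():
            seen.setdefault(uuid, {})[observer] = conn
    return {u: obs for u, obs in seen.items() if len(set(obs.values())) > 1}


# Outputs from dell-per630-05 and dell-per630-07 during the failure,
# copied from this report.
VIEW_05 = """Number of Peers: 2

Hostname: dell-per630-07
Uuid: 1e96827a-fb45-4aa1-bf04-98cc44113504
State: Peer in Cluster (Connected)

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Disconnected)
"""

VIEW_07 = """Number of Peers: 2

Hostname: 192.168.50.22
Uuid: 44030434-1b6a-466b-b670-94a1e4b7a49e
State: Peer in Cluster (Connected)

Hostname: 192.168.50.21
Uuid: a4aa2e0e-b015-4b2e-a821-5af2f515ebb8
State: Peer in Cluster (Connected)
"""

print(disagreeing_peers({"dell-per630-05": VIEW_05, "dell-per630-07": VIEW_07}))
```

With the two outputs above, only the peer with Uuid 44030434-1b6a-466b-b670-94a1e4b7a49e is reported: dell-per630-05 sees it as Disconnected while dell-per630-07 still sees it as Connected.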

The results have been inconsistent. In some runs the command shows the peer as disconnected almost immediately; in others it has taken up to an hour, or even 8 hours; and in some cases it never showed the correct state.
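
Because the detection delay varies so much between runs, it may help to record it. The following is a minimal Python sketch, again assuming the output format shown in this report. The names peer_connection and wait_for_state, their parameters, and the injected fetch callable are hypothetical. In a real run, fetch would wrap 'gluster peer status' on the observing node, as in the commented usage at the end.

```python
import re
import time


def peer_connection(status_text, uuid):
    """Return 'Connected' or 'Disconnected' for the peer UUID, else None."""
    pattern = re.compile(
        r"Uuid:\s*" + re.escape(uuid)
        + r"\s*\nState:.*?\((Connected|Disconnected)\)"
    )
    m = pattern.search(status_text)
    return m.group(1) if m else None


def wait_for_state(fetch, uuid, wanted, timeout_s, interval_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until the peer reports `wanted`.

    Returns the elapsed seconds at the first matching poll, or None if
    the state was not seen before timeout_s elapsed.
    """
    start = clock()
    while True:
        if peer_connection(fetch(), uuid) == wanted:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Possible usage on the observing node (assumption: run as root on
# dell-per630-07 right after 'ifdown em1' on dell-per630-06):
#
#   import subprocess
#
#   def local_status():
#       return subprocess.run(["gluster", "peer", "status"],
#                             capture_output=True, text=True,
#                             check=True).stdout
#
#   elapsed = wait_for_state(local_status,
#                            "44030434-1b6a-466b-b670-94a1e4b7a49e",
#                            "Disconnected", timeout_s=1800)
#   print("never detected" if elapsed is None else f"{elapsed:.0f}s")
```

The clock and sleep parameters are injectable only so that the loop can be exercised without waiting in real time; the 1800-second timeout in the usage comment mirrors the roughly 30 minute wait used in the test.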

I have included a screenshot showing that the 10g nic 'em1' is down on dell-per630-06, along with the gluster peer status of all three machines. The terminal in the bottom right corner shows the gluster peer status output that is wrong; it should show Hostname: 192.168.50.22 as disconnected.

I have supplied the contents of /var/log/glusterfs/glusterd.log here:
http://pastebin.test.redhat.com/585538

Version-Release number of selected component (if applicable):
glusterfs 3.8.4
rhhi 1.1

How reproducible:
Inconsistent

Steps to Reproduce:
See above

Actual results:

'gluster peer status' on the third machine, dell-per630-07, never recognizes that the peer is disconnected from the cluster.

Expected results:

Gluster peer status command should recognize that the 10g nic is down for that node and display disconnected for that node.

Additional info:
This behaviour is the same when the 10g nic is recovered.

In the runs where gluster peer status correctly reported the failure, recovering the nic led to the same inconsistency in reverse: the nic was back up, but gluster peer status still showed the node as disconnected.

