Bug 1314373

Summary: Peer information is not propagated to all the nodes in the cluster, when the peer is probed with its second interface FQDN/IP
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: SATHEESARAN <sasundar>
Component: glusterd Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA QA Contact: Byreddy <bsrirama>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: amukherj, asrivast, kaushal, rhinduja, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: --- Keywords: Regression, ZStream
Target Release: RHGS 3.1.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.9-2 Doc Type: Bug Fix
Doc Text:
Cause: The fix for bug #1291386 introduced changes to reduce the number of updates exchanged between GlusterD instances. This change inadvertently stopped the updates that need to be sent when a peer probe command is issued to attach a new address to an existing peer.
Consequence: The newly attached address was known only to the peer where the peer probe command was issued. This could cause gluster volume commands using the new address to fail.
Fix: GlusterD was fixed to send updates to all other nodes when a peer probe attaches a new address.
Result: The new address is available on all nodes, and commands using it no longer fail.
Story Points: ---
Clone Of: 1314366 Environment:
Last Closed: 2016-06-23 05:10:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1314366    
Bug Blocks: 1299184    

Description SATHEESARAN 2016-03-03 13:18:11 UTC
+++ This bug was initially created as a clone of Bug #1314366 +++

Description of problem:
-----------------------
When a gluster node has multiple network interfaces and both are to be used for gluster traffic, the peer must be probed with each of its network identifiers (IP or FQDN).

While doing so, the other names for that peer are updated.
The problem is that the other name of the host is not propagated to all the nodes in the cluster, leading to the error "staging failed on the host" on the other hosts for any volume-related operation, as those nodes are unaware of the new hostname or IP.
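
For illustration, attaching a second network identity is done by simply probing the same peer again with its other name. A minimal sketch (the hostnames below are placeholders, not the ones from this report):

    # probe the peer by its first-network FQDN; this creates the peer entry
    gluster peer probe peer1-net1.example.com

    # probe the same peer by its second-network FQDN; glusterd records it
    # under "Other names" instead of creating a new peer
    gluster peer probe peer1-net2.example.com

    # every node in the cluster (not just the one that issued the probe)
    # should then list the second name under "Other names"
    gluster peer status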

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.7.8

How reproducible:
-----------------
Always

Steps to Reproduce:
--------------------
1. Create 3 gluster nodes, each with 2 network interfaces connected to different (isolated) networks
2. Form a gluster cluster of 2 nodes by peer probing with one set of IPs (from network1)
3. Probe node2 (from node1) with its IP from network2
4. Check peer status on both nodes
5. From node1, peer probe node3 with its IP from network1
6. From node1, peer probe node3 with its IP from network2 (see the command sketch after this list)
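
A command-level sketch of the above steps (node names and FQDNs are placeholders; run each command on the node shown in the prompt):

    [root@node1 ~]# gluster peer probe node2-net1.example.com   # step 2: form the cluster over network1
    [root@node1 ~]# gluster peer probe node2-net2.example.com   # step 3: attach node2's network2 address
    [root@node1 ~]# gluster peer status                         # step 4: compare with the output on node2
    [root@node2 ~]# gluster peer status
    [root@node1 ~]# gluster peer probe node3-net1.example.com   # step 5
    [root@node1 ~]# gluster peer probe node3-net2.example.com   # step 6
    [root@node2 ~]# gluster peer status                         # node2 should also list node3-net2.example.com under "Other names"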

Actual results:
---------------
Peer status on node2 does not get updated with the other name of node3

Expected results:
-----------------
Peer information should be consistent/updated across all the nodes in the cluster

--- Additional comment from SATHEESARAN on 2016-03-03 08:15:17 EST ---

Peer status on 2 nodes
-----------------------
[root@data-node1 ~]# gluster peer status
Number of Peers: 1

Hostname: mgmt-node2.lab.eng.blr.redhat.com
Uuid: 204a51d3-3c2c-4bec-a005-4e974a49aa7e
State: Peer in Cluster (Connected)
Other names:
data-node2.lab.eng.blr.redhat.com
mgmt-node2

[root@data-node2 ~]# gluster peer status
Number of Peers: 1

Hostname: mgmt-node1.lab.eng.blr.redhat.com
Uuid: 5ba71f4c-fe2e-410d-939a-d5fc903a1ec4
State: Peer in Cluster (Connected)
Other names:
data-node1.lab.eng.blr.redhat.com

Peer status on 3 nodes after probing node3 with network1
---------------------------------------------------------
[root@data-node1 ~]# gluster peer status
Number of Peers: 2

Hostname: mgmt-node2.lab.eng.blr.redhat.com
Uuid: 204a51d3-3c2c-4bec-a005-4e974a49aa7e
State: Peer in Cluster (Connected)
Other names:
data-node2.lab.eng.blr.redhat.com
mgmt-node2

Hostname: mgmt-node3.lab.eng.blr.redhat.com
Uuid: 5b4abfd3-9397-4527-a39e-ee3bc00f5710
State: Peer in Cluster (Connected)

[root@data-node2 ~]# gluster peer status
Number of Peers: 2

Hostname: mgmt-node1.lab.eng.blr.redhat.com
Uuid: 5ba71f4c-fe2e-410d-939a-d5fc903a1ec4
State: Peer in Cluster (Connected)
Other names:
data-node1.lab.eng.blr.redhat.com

Hostname: mgmt-node3.lab.eng.blr.redhat.com
Uuid: 5b4abfd3-9397-4527-a39e-ee3bc00f5710
State: Peer in Cluster (Connected)

[root@localhost ~]# gluster peer status
Number of Peers: 2

Hostname: mgmt-node1.lab.eng.blr.redhat.com
Uuid: 5ba71f4c-fe2e-410d-939a-d5fc903a1ec4
State: Peer in Cluster (Connected)
Other names:
data-node1.lab.eng.blr.redhat.com

Hostname: mgmt-node2.lab.eng.blr.redhat.com
Uuid: 204a51d3-3c2c-4bec-a005-4e974a49aa7e
State: Peer in Cluster (Connected)
Other names:
data-node2.lab.eng.blr.redhat.com
mgmt-node2


Peer status on 3 nodes after probing node3 with network2
---------------------------------------------------------
[root@data-node1 ~]# gluster peer probe data-node3.lab.eng.blr.redhat.com
peer probe: success. Host data-node3.lab.eng.blr.redhat.com port 24007 already in peer list

[root@data-node1 ~]# gluster peer status
Number of Peers: 2

Hostname: mgmt-node2.lab.eng.blr.redhat.com
Uuid: 204a51d3-3c2c-4bec-a005-4e974a49aa7e
State: Peer in Cluster (Connected)
Other names:
data-node2.lab.eng.blr.redhat.com
mgmt-node2

Hostname: mgmt-node3.lab.eng.blr.redhat.com
Uuid: 5b4abfd3-9397-4527-a39e-ee3bc00f5710
State: Peer in Cluster (Connected)
Other names:
data-node3.lab.eng.blr.redhat.com  <--- other name updated in node1

[root@data-node2 ~]# gluster pe s
Number of Peers: 2

Hostname: mgmt-node1.lab.eng.blr.redhat.com
Uuid: 5ba71f4c-fe2e-410d-939a-d5fc903a1ec4
State: Peer in Cluster (Connected)
Other names:
data-node1.lab.eng.blr.redhat.com

Hostname: mgmt-node3.lab.eng.blr.redhat.com <---not updated with other name
Uuid: 5b4abfd3-9397-4527-a39e-ee3bc00f5710
State: Peer in Cluster (Connected)

[root@localhost ~]# gluster peer status
Number of Peers: 2

Hostname: mgmt-node1.lab.eng.blr.redhat.com
Uuid: 5ba71f4c-fe2e-410d-939a-d5fc903a1ec4
State: Peer in Cluster (Connected)
Other names:
data-node1.lab.eng.blr.redhat.com

Hostname: mgmt-node2.lab.eng.blr.redhat.com
Uuid: 204a51d3-3c2c-4bec-a005-4e974a49aa7e
State: Peer in Cluster (Connected)
Other names:
data-node2.lab.eng.blr.redhat.com
mgmt-node

[root@data-node1 ~]# gluster volume create testvol data-node3.lab.eng.blr.redhat.com:/rhs/brick1/brc1
volume create: testvol: failed: Staging failed on mgmt-node2.lab.eng.blr.redhat.com. Error: Host data-node3.lab.eng.blr.redhat.com is not in 'Peer in Cluster' state

Error messages in glusterd log in node1 - 
<snip>
[2016-03-03 18:40:38.034436] I [MSGID: 106487] [glusterd-handler.c:1411:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2016-03-03 18:45:20.723287] E [MSGID: 106452] [glusterd-utils.c:5735:glusterd_new_brick_validate] 0-management: Host data-node3.lab.eng.blr.redhat.com is not in 'Peer in Cluster' state
[2016-03-03 18:45:20.723323] E [MSGID: 106536] [glusterd-volume-ops.c:1336:glusterd_op_stage_create_volume] 0-management: Host data-node3.lab.eng.blr.redhat.com is not in 'Peer in Cluster' state
[2016-03-03 18:45:20.723338] E [MSGID: 106301] [glusterd-op-sm.c:5241:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Create', Status : -1
</snip>
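
One way to see the inconsistency directly is to compare the on-disk peer entries on the probing node and on an uninvolved node. This is a hedged sketch: the peer store layout shown (one hostnameN= line per known address under /var/lib/glusterd/peers/) is what glusterd 3.7.x typically writes; adjust the path if glusterd uses a non-default working directory.

    # list every address recorded for every peer on this node
    grep -H '^hostname' /var/lib/glusterd/peers/*

    # on the node that issued the probe, the affected peer's file carries both
    # names (hostname1=..., hostname2=...); on an uninvolved node it carries
    # only the first one, matching the missing "Other names" in peer status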

Comment 2 Atin Mukherjee 2016-03-04 04:09:37 UTC
This is a regression caused by the fix for BZ 1291386

Comment 4 Kaushal 2016-03-07 06:17:07 UTC
The fix for #1291386 reduced the number of updates sent when a peer, already in the befriended state, establishes a connection with another peer.

Before this fix, the updates were sent to all other peers when this happened. The fix changed it so that the updates are only sent between the peers involved. This was done by changing the action for an ACC or LOCAL_ACC event when in the BEFRIENDED state, in the state table used by the peer state machine.

This caused a regression when attempting to attach other names to a peer using peer probe. Attaching another name to a peer in the befriended state gives rise to a LOCAL_ACC event, which leads to the updates being exchanged only between the two involved peers. The other peers do not get updates with the newly attached name, which can lead to command failures later.
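
To relate this to what is visible on a running cluster, a rough sketch (the log path below is the glusterd default for these builds, and the friend-update message text may vary between versions, so treat both as assumptions): watch the glusterd log on a peer that is NOT involved in the probe while the new name is attached elsewhere.

    # with the fix, a friend update should arrive here as well; without it,
    # nothing is logged for the probe on this uninvolved peer
    tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | grep -i 'friend update'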

Comment 6 Atin Mukherjee 2016-03-23 10:19:48 UTC
Upstream patch http://review.gluster.org/13817 posted for review.

Comment 7 Mike McCune 2016-03-28 22:51:39 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions

Comment 8 Atin Mukherjee 2016-04-04 10:45:14 UTC
Downstream patch https://code.engineering.redhat.com/gerrit/#/c/71313/ is now merged. Moving the status to Modified.

Comment 10 Byreddy 2016-04-25 10:55:13 UTC
Verified this bug using the build "glusterfs-3.7.9-2.el7rhgs"

Steps followed to verify this bug:
==================================
1. Have 3 rhgs nodes with 3.1.3 (node1, node2 and node3)
2. Probed node2 from node1
    - Using the node2 IP
    - Using the FQDN of node2
    - Using the short name.

3. Checked peer status on node1 and node2 // it was correct: node1 peer status had the short name and FQDN of node2 under other names.

4. Again probed node3 from node1
    - Using the node3 IP
    - Using the FQDN of node3
    - Using the short name.

5. Checked peer status on node1, node2 and node3 // it was correct:
    - Node1 peer status had the short name and FQDN of node2 under other names, and the short name and FQDN of node3 under other names.

    - Node2 peer status had the short name and FQDN of node3 under other names.
    - Node3 peer status had the short name and FQDN of node2 under other names.

With the above details, moving this bug to the verified state.

Comment 12 errata-xmlrpc 2016-06-23 05:10:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240