Bug 1390537

Summary: [GSS] One of the peers remains in "Sent and Received peer request (Connected)" state
Product: Red Hat Gluster Storage
Reporter: Riyas Abdulrasak <rnalakka>
Component: glusterd
Assignee: Patric Uebele <puebele>
Status: CLOSED WONTFIX
QA Contact: Bala Konda Reddy M <bmekala>
Severity: medium
Priority: low
Docs Contact:
Version: rhgs-3.1
CC: amukherj, bkunal, mtaru, puebele, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: ---
Keywords: Reopened, ZStream
Flags: sankarshan: needinfo-
       sankarshan: needinfo-
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-06 17:57:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1481177

Description Riyas Abdulrasak 2016-11-01 10:31:41 UTC
Description of problem:

One of the RHGS nodes in a 14-node cluster remains in the "Sent and Received peer request (Connected)" state.

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-11.el6rhs.x86_64
Red Hat Gluster Storage 3.1.3

How reproducible:

Reproducible in the customer environment.


Steps to Reproduce:

All the nodes were initially in the 'Peer in Cluster' state. A brick replace was carried out on one of the volumes because of a faulty filesystem on one of the nodes. After this activity a 'Peer Rejected' issue appeared in the cluster, and the glusterd log files reported a cksum mismatch. To resolve it, glusterd was stopped on the affected node, /var/lib/glusterd/vols was deleted, the vols directory was copied over from a good node, and glusterd was started again.
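
For reference, below is a minimal sketch of that recovery procedure, assuming RHEL 6 style service management, rsync over passwordless SSH, and a hypothetical hostname good-node; it is the generic workflow for the cksum-mismatch recovery described above, not a verbatim record of the commands run in this case:

~~~~~~~~~~
# On the affected node: stop glusterd before touching its on-disk state.
service glusterd stop

# Keep a backup of the local volume definitions, then remove them.
cp -a /var/lib/glusterd/vols /var/lib/glusterd/vols.bak
rm -rf /var/lib/glusterd/vols

# Copy the volume definitions from a node whose checksums are known good.
# "good-node" is a placeholder hostname.
rsync -avz good-node:/var/lib/glusterd/vols/ /var/lib/glusterd/vols/

# Restart glusterd and re-check peer and volume state.
service glusterd start
gluster peer status
gluster volume info
~~~~~~~~~~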

After this activity all the nodes showed 'Peer in Cluster' except one. The problematic node (node y) showed "Sent and Received peer request (Connected)" for node x, while node x showed 'Peer in Cluster' for node y.

Actual results:

The peer remains in the "Sent and Received peer request (Connected)" state.


Expected results:

The peer should move to the 'Peer in Cluster (Connected)' state.
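
glusterd persists its view of each peer under /var/lib/glusterd/peers/<uuid>, including a numeric state= field, so comparing those files on node x and node y shows the asymmetry directly. A minimal sketch; the state-number-to-name mapping noted below is our reading of glusterd's friend state machine (glusterd-sm.c) and should be treated as an assumption:

~~~~~~~~~~
# Each file under /var/lib/glusterd/peers/ is named after a peer UUID and
# contains lines such as uuid=, state= and hostname1=.
for f in /var/lib/glusterd/peers/*; do
    echo "== $f =="
    cat "$f"
done

# Assumed mapping (from glusterd-sm.c): state=3 is "Peer in Cluster" and
# state=5 is "Sent and Received peer request". A mismatch between node x's
# record of node y and node y's record of node x matches the symptom above.
~~~~~~~~~~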

Additional info:

On node x, the following message is seen for node y:

~~~~~~~~~~~~~
[2016-10-31 18:12:06.810129] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7fd8fa12-fee4-403e-a43d-43796dc624d7
The message "I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req" repeated 2 times between [2016-10-31 18:11:50.377711] and [2016-10-31 18:12:55.740337]
[2016-10-31 18:14:58.387733] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
~~~~~~~~~~~~~

For the other nodes we see both "__glusterd_handle_incoming_friend_req" and "glusterd_xfer_friend_add_resp" messages, but the "glusterd_xfer_friend_add_resp" message never appeared for node y:

~~~~~~~~~~
[2016-10-31 17:49:59.110303] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 9ec82d44-031a-4f2d-a194-2c2ab3b46c7f
[2016-10-31 17:49:59.127713] I [MSGID: 106493] [glusterd-handler.c:3842:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to other.node.name (0), ret: 0
~~~~~~~~~~

We tried restarting glusterd on both node y and node x multiple times, but that did not change the state.
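
To check from the logs whether the friend-add response was ever sent for a given peer, the two message IDs quoted above can be grepped for on node x. A minimal sketch, assuming the default glusterd log path for this release and using <peer-uuid> as a placeholder for node y's UUID:

~~~~~~~~~~
LOG=/var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# MSGID 106490: an incoming friend request was received from the peer.
grep 'MSGID: 106490' "$LOG" | grep '<peer-uuid>'

# MSGID 106493: a response to the friend request was sent back. If this
# never appears for the peer, the handshake stalled half-way, which matches
# the "Sent and Received peer request" state seen here.
grep 'MSGID: 106493' "$LOG"
~~~~~~~~~~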

Comment 2 Atin Mukherjee 2016-11-07 08:46:11 UTC
Samikshan has started looking into this case now.

Comment 3 Samikshan Bairagya 2016-11-08 14:49:13 UTC
The RCA (as far as it has progressed) was done on a 3-node setup where a similar problem was encountered, and is as follows:

Let the 3 nodes be n1, n2 and n3, with n2's state shown as "connected and Accepted (Connected)" and n3's state shown as "Peer in Cluster (Connected)" from node n1. From the analysis so far, what seems to be happening is that when glusterd on node n1 is restarted, n2 and n3 get different events from n1: n2 gets a GD_FRIEND_EVENT_CONNECTED event, while n3 gets a GD_FRIEND_EVENT_RCVD_FRIEND_REQ. In addition, n3 gets a probe request from n1, which n2 does not seem to get. We have yet to figure out why this happens and will continue the analysis.
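
To see which friend event each node actually receives when n1 restarts, the logs on n2 and n3 can be compared. A hedged sketch: the GD_FRIEND_EVENT_* traces are, as far as we can tell, only emitted at DEBUG log level in this glusterfs version, while the probe message (MSGID 106490) is logged at INFO as in the excerpts above:

~~~~~~~~~~
LOG=/var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# Friend state-machine events (assumed visible only with debug logging).
grep -E 'GD_FRIEND_EVENT_(CONNECTED|RCVD_FRIEND_REQ)' "$LOG"

# Probe requests, logged at INFO; per the analysis above, n3 should show
# one from n1's UUID while n2 should not.
grep 'MSGID: 106490' "$LOG"
~~~~~~~~~~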