Bug 1390537 - [GSS] One of the peers remains in "Sent and Received peer request (Connected)" state
Summary: [GSS] One of the peers remains in "Sent and Received peer request (Connected)" state
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: rhgs-3.1
Hardware: All
OS: All
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Patric Uebele
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks: 1481177
 
Reported: 2016-11-01 10:31 UTC by Riyas Abdulrasak
Modified: 2022-03-13 14:08 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-06 17:57:43 UTC
Target Upstream Version:
sankarshan: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2795241 0 None None None 2016-12-06 04:06:40 UTC

Description Riyas Abdulrasak 2016-11-01 10:31:41 UTC
Description of problem:

One of the RHGS nodes in a 14-node cluster remains in the "Sent and Received peer request (Connected)" state.

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-11.el6rhs.x86_64
Red Hat Gluster Storage 3.1.3

How reproducible:

Reproducible in the customer environment.


Steps to Reproduce:

All the nodes were in the 'Peer in Cluster' state. A replace-brick activity was carried out for one of the volumes because of a faulty filesystem on one of the nodes. After this activity there was a 'Peer Rejected' issue in the cluster, and the gluster log files were complaining about a cksum mismatch. We tried to resolve this issue by stopping glusterd, deleting /var/lib/glusterd/vols, copying the vols directory from a good node, and starting glusterd (a sketch of these steps is shown below).
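
For reference, here is a minimal sketch of the recovery steps that were attempted, assuming an el6 RHGS node; the peer hostname and backup path are illustrative, not taken from the customer environment:

~~~~~~~~~~
# On the problematic node: stop glusterd before touching its configuration.
service glusterd stop

# Preserve the existing volume configuration before removing it.
cp -a /var/lib/glusterd/vols /var/lib/glusterd/vols.bak
rm -rf /var/lib/glusterd/vols

# Copy the volume configuration from a known-good peer (good-node is a placeholder).
scp -r good-node:/var/lib/glusterd/vols /var/lib/glusterd/

# Restart glusterd and re-check the peer state.
service glusterd start
gluster peer status
~~~~~~~~~~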

After this activity all the nodes were showing 'Peer in Cluster' except one. The problematic node (node y) was showing "Sent and Received peer request (Connected)" for node x, while node x showed 'Peer in Cluster' for node y.
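
For illustration, the stuck entry in "gluster peer status" output on node y looks roughly like this (the hostname and UUID below are placeholders, not values from the customer setup):

~~~~~~~~~~
Hostname: node-x.example.com
Uuid: <uuid-of-node-x>
State: Sent and Received peer request (Connected)
~~~~~~~~~~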

Actual results:

The peer remains in the "Sent and Received peer request (Connected)" state.


Expected results:

The peer should move to the 'Peer in Cluster (Connected)' state.

Additional info:

On node x, the following message is seen for node y:

~~~~~~~~~~~~~
[2016-10-31 18:12:06.810129] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7fd8fa12-fee4-403e-a43d-43796dc624d7
The message "I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req" repeated 2 times between [2016-10-31 18:11:50.377711] and [2016-10-31 18:12:55.740337]
[2016-10-31 18:14:58.387733] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
~~~~~~~~~~~~~

For the other nodes we see both "__glusterd_handle_incoming_friend_req" and "glusterd_xfer_friend_add_resp" messages, but the "glusterd_xfer_friend_add_resp" message was never logged for node y.

~~~~~~~~~~
[2016-10-31 17:49:59.110303] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 9ec82d44-031a-4f2d-a194-2c2ab3b46c7f
[2016-10-31 17:49:59.127713] I [MSGID: 106493] [glusterd-handler.c:3842:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to other.node.name (0), ret: 0
~~~~~~~~~~
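
One way to confirm this from node x's logs is to list both message types together and check which incoming friend requests were never answered; the log path below is the default glusterd log location and may differ on the customer setup:

~~~~~~~~~~
# On node x: every __glusterd_handle_incoming_friend_req should normally be
# followed by a matching glusterd_xfer_friend_add_resp entry.
grep -E '__glusterd_handle_incoming_friend_req|glusterd_xfer_friend_add_resp' \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
~~~~~~~~~~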

We tried restarting glusterd on both node y and node x multiple times, but that did not change the state.

Comment 2 Atin Mukherjee 2016-11-07 08:46:11 UTC
Samikshan has started looking into this case now.

Comment 3 Samikshan Bairagya 2016-11-08 14:49:13 UTC
The RCA done so far was performed on a 3-node setup where a similar problem was encountered, and is as follows:

Let the 3 nodes be n1, n2 and n3, with n2's state shown as "connected and Accepted (Connected)" and n3's state shown as "Peer in Cluster (Connected)" from node n1. From the analysis so far, what seems to be happening is that when glusterd on node n1 is restarted, n2 and n3 get different events from n1: n2 gets a GD_FRIEND_EVENT_CONNECTED event from n1 while n3 gets a GD_FRIEND_EVENT_RCVD_FRIEND_REQ event. Also, n3 gets a probe request from n1, which n2 does not seem to get. We have yet to figure out why this might be happening and will continue with the analysis.
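
As an additional data point for the analysis, the persisted peer state can be compared across n1, n2 and n3; glusterd keeps one file per peer under /var/lib/glusterd/peers/, named by the peer's UUID, and the numeric state= field records the friend state machine state (the number-to-state mapping is internal and version-dependent, so compare the values across nodes rather than reading meaning into them):

~~~~~~~~~~
# Run on each of n1, n2 and n3 and compare the output.
for f in /var/lib/glusterd/peers/*; do
    echo "== $f =="
    cat "$f"
done
~~~~~~~~~~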

