Description of problem:
Peer goes to the Rejected state after peer probing a new node into an RHGS 3.1.1 cluster. The cluster already has 16 nodes and multiple volumes.
State: Peer Rejected (Connected)
The glusterd log on the node from which the peer probe was issued contained the following messages:
[2016-08-10 09:08:55.536503] I [MSGID: 106490] [glusterd-handler.c:2530:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: e80d1a60-12ec-4fb3-a9d2-d19cf70f3dfb
[2016-08-10 09:09:00.267679] E [MSGID: 106012] [glusterd-utils.c:2686:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume volname differ. local cksum = 1271429249, remote cksum = 1405647489 on peer newnode
[2016-08-10 09:09:00.267982] I [MSGID: 106493] [glusterd-handler.c:3771:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to newnode (0), ret: 0
[2016-08-10 09:09:01.088596] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 78, Invalid argument
[2016-08-10 09:09:01.088672] E [socket.c:3019:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
[2016-08-10 09:09:01.383764] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: e80d1a60-12ec-4fb3-a9d2-d19cf70f3dfb, host: newnode, port: 0
[2016-08-10 09:07:43.925602] I [MSGID: 106490] [glusterd-handler.c:2875:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: e80d1a60-12ec-4fb3-a9d2-d19cf70f3dfb
[2016-08-10 09:07:43.925754] I [MSGID: 106493] [glusterd-handler.c:2938:__glusterd_handle_probe_query] 0-glusterd: Responded to newnode, op_ret: 0, op_errno: 0, ret: 0
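The rejection stems from the quota-configuration checksum comparison glusterd performs while handling the incoming friend request. As a hedged illustration (the real check lives in C inside glusterd; this is only a simplified model, not the actual implementation), the comparison named in the MSGID 106012 log line above reduces to:

```python
# Simplified model (an assumption for illustration, not glusterd's actual C
# code) of the quota checksum comparison done in
# glusterd_compare_friend_volume(), the function named in the log above.
def quota_cksums_match(local_cksum: int, remote_cksum: int) -> bool:
    """A mismatch makes glusterd send RJT back to the probing peer,
    leaving it in the "Peer Rejected (Connected)" state."""
    return local_cksum == remote_cksum

# Checksum values taken from the MSGID 106012 log line above.
local, remote = 1271429249, 1405647489
if not quota_cksums_match(local, remote):
    print(f"Cksums of quota configuration differ: "
          f"local cksum = {local}, remote cksum = {remote}")
```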
1) We checked the following with the customer, based on the existing KCS articles and the BZs we found:
a) The node from which the peer probe was done and the new node run the same RHGS version, 3.1.1.
b) Neither the new node nor the old node was upgraded recently.
c) The volume in question has quota enabled, and the quota version and quota checksum differ between the two nodes. No other difference in the volfile info was noticed.
d) Confirmed with the customer that the volume in question is not a clone of any snapshot.
2) We asked the customer to follow the actions mentioned in the knowledge base articles below, but it didn't help.
3) We took a remote session and performed the following steps:
i) Tried the resolutions in https://access.redhat.com/solutions/1354563 and https://access.redhat.com/solutions/2041033; neither worked.
ii) We then performed the steps below:
a) Did a peer detach of the new node from the old node.
b) Stopped glusterd on the new node.
c) Copied quota.conf and quota.cksum from the old node.
d) Started glusterd.
e) Did the peer probe again.
This also didn't help.
Version-Release number of selected component (if applicable):
RHGS 3.1.1

How reproducible:
Not always reproducible.

Actual results:
Peer goes to the Rejected state.

Expected results:
Peer should be in the "Peer in Cluster (Connected)" state.
So our RCA is correct: the quota.conf version mismatch between the existing nodes and the newly installed node causes the checksum difference.
http://review.gluster.org/15352 posted upstream for review.
upstream mainline : http://review.gluster.org/15352
upstream 3.8 : http://review.gluster.org/15791
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/89554
Verified this bug using the build glusterfs-3.8.4-5; the fix is working as expected.
Confirmed the fix with the following steps:
1. Set up two nodes with the 3.0.4 build.
2. Created a simple volume with quota enabled.
3. Updated to the 3.1.1 build and performed the op-version bump-up.
4. Probed a newly installed 3.1.1 RHGS node.
Result of step 4: peer status was in the Rejected state; the quota.conf version on the updated nodes was v1.1, while on the newly installed node it was v1.2.
(Reported issue is reproduced.)
5. Updated both 3.1.1 nodes to 3.2 and performed the op-version bump-up.
6. Checked the quota.conf version; it changed from v1.1 to v1.2.
7. Probed a new 3.2 node; the probe was successful and peer status displayed the correct result.
Moving to the Verified state based on the above results.
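The v1.1/v1.2 difference observed in step 4 is what produces the checksum mismatch. A hedged sketch (the header strings and the CRC32 stand-in are assumptions for illustration; glusterd uses its own file format and internal checksum): two nodes whose quota.conf files carry different conf-version headers can never agree on the checksum, even when the quota limits themselves are identical.

```python
import zlib

# Assumed quota.conf header strings for illustration; glusterd's actual file
# format and checksum algorithm differ, but the mechanism is the same: the
# conf version is part of the checksummed content.
V11_HEADER = b"GlusterFS Quota conf | version: v1.1\n"
V12_HEADER = b"GlusterFS Quota conf | version: v1.2\n"
LIMITS = b"<identical quota limit entries on both nodes>"

def quota_cksum(header: bytes, body: bytes) -> int:
    # zlib.crc32 is a stand-in for glusterd's internal file checksum.
    return zlib.crc32(header + body)

upgraded_node = quota_cksum(V11_HEADER, LIMITS)  # node upgraded from 3.0.4
fresh_node = quota_cksum(V12_HEADER, LIMITS)     # freshly installed node
print(upgraded_node != fresh_node)  # True: same limits, different versions
```

This matches the result of steps 5-7: once the op-version bump makes every node write quota.conf in the v1.2 format, the checksums agree and the probe succeeds.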
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.