Description of problem:
Peer goes to the Rejected state after peer probing a new node into an RHGS 3.1.1 cluster. The cluster already has 16 nodes and multiple volumes.
State: Peer Rejected (Connected)
The glusterd log on the node from which the peer probe was issued contained the following messages:
[2016-08-10 09:08:55.536503] I [MSGID: 106490] [glusterd-handler.c:2530:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: e80d1a60-12ec-4fb3-a9d2-d19cf70f3dfb
[2016-08-10 09:09:00.267679] E [MSGID: 106012] [glusterd-utils.c:2686:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume volname differ. local cksum = 1271429249, remote cksum = 1405647489 on peer newnode
[2016-08-10 09:09:00.267982] I [MSGID: 106493] [glusterd-handler.c:3771:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to newnode (0), ret: 0
[2016-08-10 09:09:01.088596] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 78, Invalid argument
[2016-08-10 09:09:01.088672] E [socket.c:3019:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
[2016-08-10 09:09:01.383764] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: e80d1a60-12ec-4fb3-a9d2-d19cf70f3dfb, host: newnode, port: 0
[2016-08-10 09:07:43.925602] I [MSGID: 106490] [glusterd-handler.c:2875:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: e80d1a60-12ec-4fb3-a9d2-d19cf70f3dfb
[2016-08-10 09:07:43.925754] I [MSGID: 106493] [glusterd-handler.c:2938:__glusterd_handle_probe_query] 0-glusterd: Responded to newnode, op_ret: 0, op_errno: 0, ret: 0
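The rejection stems from the quota-configuration checksum comparison glusterd performs while handling the incoming friend request. As a hedged illustration (the real check lives in C inside glusterd; this is only a simplified model, not the actual implementation), the comparison named in the MSGID 106012 log line above reduces to:

```python
# Simplified model (an assumption for illustration, not glusterd's actual C
# code) of the quota checksum comparison done in
# glusterd_compare_friend_volume(), the function named in the log above.
def quota_cksums_match(local_cksum: int, remote_cksum: int) -> bool:
    """A mismatch makes glusterd send RJT back to the probing peer,
    leaving it in the "Peer Rejected (Connected)" state."""
    return local_cksum == remote_cksum

# Checksum values taken from the MSGID 106012 log line above.
local, remote = 1271429249, 1405647489
if not quota_cksums_match(local, remote):
    print(f"Cksums of quota configuration differ: "
          f"local cksum = {local}, remote cksum = {remote}")
```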
1) We checked the following with the customer, based on the existing KCS articles and the BZs we found:
a) The node from which the peer probe was done and the new node run the same RHGS version, 3.1.1.
b) Neither the new node nor the old node was upgraded recently.
c) The volume in question has quota enabled, and the quota version and quota checksum differ between the two nodes. No other difference in the volfile info was noticed.
d) Confirmed with the customer that the volume in question is not a clone of any snapshot.
2) We asked the customer to follow the actions mentioned in the knowledge base articles below, but it didn't help.
3) We took a remote session and performed the following steps:
i) Tried the resolutions in https://access.redhat.com/solutions/1354563 and https://access.redhat.com/solutions/2041033; neither worked.
ii) We then performed the steps below:
a) Did a peer detach of the new node from the old node.
b) Stopped glusterd on the new node.
c) Copied quota.conf and quota.cksum from the old node.
d) Started glusterd.
e) Did the peer probe again.
This also didn't help.
Version-Release number of selected component (if applicable):
RHGS 3.1.1

How reproducible:
Not always reproducible.

Actual results:
Peer goes to the Rejected state.

Expected results:
Peer should be in the "Peer in Cluster (Connected)" state.
So our RCA is correct: the quota.conf version mismatch between the existing nodes and the newly installed node causes the checksum difference.
http://review.gluster.org/15352 posted upstream for review.
upstream mainline : http://review.gluster.org/15352
upstream 3.8 : http://review.gluster.org/15791
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/89554
Verified this bug using the build glusterfs-3.8.4-5; the fix is working as expected.
Confirmed the fix with the following steps:
1. Set up two nodes with the 3.0.4 build.
2. Created a simple volume with quota enabled.
3. Updated to the 3.1.1 build and performed the op-version bump-up.
4. Probed a newly installed 3.1.1 RHGS node.
Result of step 4: peer status was in the Rejected state; the quota.conf version on the updated nodes was v1.1, while on the newly installed node it was v1.2.
(Reported issue is reproduced.)
5. Updated both 3.1.1 nodes to 3.2 and performed the op-version bump-up.
6. Checked the quota.conf version; it changed from v1.1 to v1.2.
7. Probed a new 3.2 node; the probe was successful and peer status displayed the correct result.
Moving to the Verified state based on the above results.
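The v1.1/v1.2 difference observed in step 4 is what produces the checksum mismatch. A hedged sketch (the header strings and the CRC32 stand-in are assumptions for illustration; glusterd uses its own file format and internal checksum): two nodes whose quota.conf files carry different conf-version headers can never agree on the checksum, even when the quota limits themselves are identical.

```python
import zlib

# Assumed quota.conf header strings for illustration; glusterd's actual file
# format and checksum algorithm differ, but the mechanism is the same: the
# conf version is part of the checksummed content.
V11_HEADER = b"GlusterFS Quota conf | version: v1.1\n"
V12_HEADER = b"GlusterFS Quota conf | version: v1.2\n"
LIMITS = b"<identical quota limit entries on both nodes>"

def quota_cksum(header: bytes, body: bytes) -> int:
    # zlib.crc32 is a stand-in for glusterd's internal file checksum.
    return zlib.crc32(header + body)

upgraded_node = quota_cksum(V11_HEADER, LIMITS)  # node upgraded from 3.0.4
fresh_node = quota_cksum(V12_HEADER, LIMITS)     # freshly installed node
print(upgraded_node != fresh_node)  # True: same limits, different versions
```

This matches the result of steps 5-7: once the op-version bump makes every node write quota.conf in the v1.2 format, the checksums agree and the probe succeeds.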
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.