Description of problem: Unable to upgrade a Gluster cluster from version 3.8 to 3.12
Version-Release number of selected component (if applicable): old version: 3.8, new version: 3.12
How reproducible: Always
Steps to Reproduce:
1. Install a 3-node Gluster cluster using version 3.8. I used the binaries shipped with node-ng version 4.1.6-20170921.
2. Upgrade one of those nodes to 3.12. I upgraded my node-ng to the latest tested 4.2 beta1.
3. The newly upgraded node is rejected from the Gluster cluster.
Actual results: The node is rejected from the cluster
Expected results: The node should be accepted
Additional info: Here is the log from the new node:
[2017-11-09 05:15:08.481680] I [MSGID: 106163] [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
[2017-11-09 05:15:08.489219] I [MSGID: 106490] [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 6b2193c1-63bd-408b-961e-51e01de486b7
[2017-11-09 05:15:11.165116] E [MSGID: 106010] [glusterd-utils.c:2938:glusterd_compare_friend_volume] 0-management: Version of Cksums data differ. local cksum = 1799370953, remote cksum = 3144964316 on peer 172.19.11.7
[2017-11-09 05:15:11.165328] I [MSGID: 106493] [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 172.19.11.7 (0), ret: 0, op_ret: -1
[2017-11-09 05:15:11.175332] I [MSGID: 106493] [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 6b2193c1-63bd-408b-961e-51e01de486b7, host: 1
After upgrading, have you ensured that you bumped up the op-version? If not, please do so and then restart the glusterd service on all nodes to see if they are accepted back into the cluster.
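For reference, the commands involved would be roughly the following (a sketch, assuming the op-version for 3.12 is 31200; please check the release notes for the exact value):

gluster volume get all cluster.op-version          # show the current cluster op-version
gluster volume set all cluster.op-version 31200    # bump it

Note that the bump only succeeds once every peer in the cluster has been upgraded.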
If the above is ensured and the issue still persists, can you please share the following file from all the nodes?
No, I didn't, as I hadn't finished upgrading my cluster. All nodes were still on 3.8, while just a single node had been moved to 3.12.
Please provide the following:
Output of cat /var/lib/glusterd/vols/remote/info from 172.19.11.7 and from the node where the peer got rejected, i.e. the new node from which you attached the log.
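If I read the glusterd code correctly, the volume checksum is computed over exactly this info file, so a plain diff of the two copies (one collected from each node) should point directly at the field that differs.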
Sorry, I can't reproduce this on a clean environment, and the one where we found this bug has already been rebuilt.
Basically the steps I did were:
1) Set up a hyperconverged oVirt 4.1 environment with 3 NGN hosts, created some VMs, and let them run for a few weeks.
2) Upgraded one host to 4.2 beta.
Not sure how we can proceed here without a stable reproducer, but just to clarify: what needs to mismatch to get the "Version of Cksums data differ" error? What file is checksummed, and how is this checksum computed? Here's the code that seems to be responsible:
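(Paraphrased from glusterd-utils.c; an approximate sketch of the check in glusterd_compare_friend_volume(), not a verbatim copy of the 3.12 tree:)

/* During the friend handshake, each peer sends its view of every
 * volume, including a checksum; if the checksums differ, the peer
 * is rejected, which is the RJT seen in the log above. */
if (volinfo->cksum != cksum) {
        gf_msg (this->name, GF_LOG_ERROR, 0,
                GD_MSG_CKSUM_VERS_MISMATCH,
                "Version of Cksums %s differ. local cksum = %u, "
                "remote cksum = %u on peer %s",
                volinfo->volname, volinfo->cksum, cksum, hostname);
        *status = GLUSTERD_VOL_COMP_RJT;
        goto out;
}

As far as I can tell, the local cksum comes from glusterd_volume_compute_cksum(), which checksums the on-disk /var/lib/glusterd/vols/<volname>/info file, so any difference in that file between two peers (e.g. a volume option present on one side but not the other) would produce this rejection.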
Without the info file from the two nodes where the mismatch happens (as mentioned in comment 3), unfortunately we won't be able to debug this further. If you happen to reproduce this again, please reopen this bug.