Bug 1511903

Summary: 3.8 -> 3.12 rolling upgrade fails
Product: [Community] GlusterFS
Component: glusterd
Version: 3.12
Hardware: Unspecified
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Keywords: Triaged
Reporter: Denis Chaplygin <dchaplyg>
Assignee: bugs <bugs>
CC: amukherj, bugs, dchaplyg, ederevea
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2017-11-14 06:03:06 UTC

Description Denis Chaplygin 2017-11-10 11:59:54 UTC
Description of problem: Unable to upgrade a gluster cluster from version 3.8 to version 3.12


Version-Release number of selected component (if applicable): old version - 3.8, new version - 3.12


How reproducible: Always


Steps to Reproduce:
1. Install a 3-node gluster cluster using version 3.8. I used the binaries shipped with node-ng version 4.1.6-20170921
2. Upgrade one of those nodes to 3.12. I upgraded my node-ng to the latest tested 4.2beta1
3. The newly upgraded node is rejected from the gluster cluster (see the check commands after this list)
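
A rough sketch of how the rejection shows up from the CLI (standard gluster commands; the state line shown is illustrative, not output captured from this setup):

# On the upgraded node: confirm the installed version
gluster --version

# On any node: the upgraded peer is shown as rejected
gluster peer status
# Illustrative state line for the rejected peer:
#   State: Peer Rejected (Connected)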

Actual results: The node is rejected from the cluster


Expected results: The node should be accepted into the cluster


Additional info: Here is the log from the new node: 

[2017-11-09 05:15:08.481680] I [MSGID: 106163] [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
[2017-11-09 05:15:08.489219] I [MSGID: 106490] [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 6b2193c1-63bd-408b-961e-51e01de486b7
[2017-11-09 05:15:11.165116] E [MSGID: 106010] [glusterd-utils.c:2938:glusterd_compare_friend_volume] 0-management: Version of Cksums data differ. local cksum = 1799370953, remote cksum = 3144964316 on peer 172.19.11.7
[2017-11-09 05:15:11.165328] I [MSGID: 106493] [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 172.19.11.7 (0), ret: 0, op_ret: -1
[2017-11-09 05:15:11.175332] I [MSGID: 106493] [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 6b2193c1-63bd-408b-961e-51e01de486b7, host: 1

Comment 1 Atin Mukherjee 2017-11-10 13:10:01 UTC
After upgrading, have you ensured that you bumped up the op-version? If not, please do so and then restart the glusterd service on all the nodes to see if they get accepted into the cluster.
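
(For reference, a minimal sketch of those steps using the standard gluster commands; 31200 is only an example target for a 3.12 cluster, so use the value reported by cluster.max-op-version on the installed build:)

# Current cluster op-version and the highest one the installed binaries support
# (the "get all" queries need a 3.10+ gluster installation)
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version

# Bump the op-version once every node runs 3.12 (31200 is an example value)
gluster volume set all cluster.op-version 31200

# Restart glusterd on each node
systemctl restart glusterd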

If the above is ensured and still the issue persists, can you please share the following file from all the nodes?

cat /var/lib/glusterd/vols/remote/info

Comment 2 Denis Chaplygin 2017-11-10 15:09:17 UTC
No, I didn't, as I hadn't finished upgrading my cluster. All the other nodes were still on 3.8, while just a single node had moved to 3.12.

Comment 3 Atin Mukherjee 2017-11-13 03:55:44 UTC
Please provide the following:

Output of cat /var/lib/glusterd/vols/remote/info from 172.19.11.7 and from the node where the peer got rejected, i.e. the new node from which you attached the log.
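
A rough sketch of how that data could be collected and compared (assuming ssh access between the nodes; "remote" is the volume name from the path above, and the cksum file is the companion checksum glusterd keeps next to the info file):

# On each node, for the volume in question:
cat /var/lib/glusterd/vols/remote/info
cat /var/lib/glusterd/vols/remote/cksum

# Or diff the peer's info file against the local copy
# (run from the rejected node; the IP is the peer from the log):
diff <(ssh 172.19.11.7 cat /var/lib/glusterd/vols/remote/info) \
     /var/lib/glusterd/vols/remote/info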

Comment 4 Evgheni Dereveanchin 2017-11-13 17:10:05 UTC
Sorry, I can't reproduce this on a clean environment, and the one where we found this bug has already been rebuilt.

Basically the steps I did were:
1) set up a HyperConverged oVirt 4.1 environment with 3 NGN hosts, created some VMs and let them run for a few weeks
2) upgrade one host to 4.2 beta

Not sure how we can proceed here without a stable reproducer, but just to clarify: what needs to mismatch to trigger the "Version of Cksums data differ" error? What file is checksummed, and how is this checksum computed? Here's the code that seems to be responsible:

[1] https://github.com/gluster/glusterfs/blob/master/xlators/mgmt/glusterd/src/glusterd-utils.c#L3386

Comment 5 Atin Mukherjee 2017-11-14 06:03:06 UTC
Without the info file from the two nodes where the mismatch happens (as mentioned in comment 3), unfortunately we won't be able to debug this further. If you happen to reproduce this again, please reopen this bug.