1511903 – 3.8 -> 3.12 rolling upgrade fails

Summary: 3.8 -> 3.12 rolling upgrade fails

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Sub Component:
Version:	3.12
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-11-10 11:59 UTC by Denis Chaplygin
Modified:	2017-11-14 06:03 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-11-14 06:03:06 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Denis Chaplygin 2017-11-10 11:59:54 UTC

Description of problem: Unable to upgrade a Gluster cluster from version 3.8 to version 3.12. The upgraded node is rejected by the cluster.


Version-Release number of selected component (if applicable): old one - 3.8, new one - 3.12


How reproducible: Always


Steps to Reproduce:
1. Install a 3-node Gluster cluster running version 3.8. I used the binaries shipped with node-ng version 4.1.6-20170921.
2. Upgrade one of those nodes to 3.12. I upgraded my node-ng to the latest tested 4.2beta1.
3. The newly upgraded node is rejected from the Gluster cluster.

Actual results: Node is rejected from cluster


Expected results: Node must be accepted


Additional info: Here is the log from the new node: 

[2017-11-09 05:15:08.481680] I [MSGID: 106163] [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
[2017-11-09 05:15:08.489219] I [MSGID: 106490] [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 6b2193c1-63bd-408b-961e-51e01de486b7
[2017-11-09 05:15:11.165116] E [MSGID: 106010] [glusterd-utils.c:2938:glusterd_compare_friend_volume] 0-management: Version of Cksums data differ. local cksum = 1799370953, remote cksum = 3144964316 on peer 172.19.11.7
[2017-11-09 05:15:11.165328] I [MSGID: 106493] [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 172.19.11.7 (0), ret: 0, op_ret: -1
[2017-11-09 05:15:11.175332] I [MSGID: 106493] [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 6b2193c1-63bd-408b-961e-51e01de486b7, host: 1
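
When scanning glusterd logs from several nodes, it can help to pull the two checksums and the peer address out of the MSGID 106010 line automatically. The following is a minimal sketch, assuming the message text keeps the wording shown in the log above. The function name parse_cksum_mismatch and the regular expression are illustrative and are not part of GlusterFS.

```python
import re

# Matches the checksum-mismatch message text as it appears in the log above.
# \s+ tolerates a line wrap between "remote" and "cksum".
CKSUM_RE = re.compile(
    r"local cksum = (\d+), remote\s+cksum = (\d+) on peer (\S+)"
)


def parse_cksum_mismatch(line):
    """Return (local_cksum, remote_cksum, peer) or None if the line does not match."""
    match = CKSUM_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


line = (
    "[2017-11-09 05:15:11.165116] E [MSGID: 106010] "
    "[glusterd-utils.c:2938:glusterd_compare_friend_volume] 0-management: "
    "Version of Cksums data differ. local cksum = 1799370953, remote "
    "cksum = 3144964316 on peer 172.19.11.7"
)
print(parse_cksum_mismatch(line))
```

A non-matching line returns None, so the helper can be applied to every line of a log file and the None results filtered out.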

Comment 1 Atin Mukherjee 2017-11-10 13:10:01 UTC

After upgrading, have you ensured that you bumped up the op-version? If not, please do so and then restart the glusterd service on all the nodes to see whether they get accepted into the cluster.
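
As a rough illustration of the op-version bump, the sketch below builds the gluster CLI command for a given release. It assumes the op-version number is encoded as major * 10000 + minor * 100 + patch, which is consistent with the value 30800 logged for 3.8 above but should be checked against the installed release. The helper names op_version and bump_command are hypothetical. The sketch only builds the command and does not run it.

```python
def op_version(major, minor, patch=0):
    """Encode a release as an op-version integer (assumed scheme, e.g. 3.8.0 -> 30800)."""
    return major * 10000 + minor * 100 + patch


def bump_command(version):
    """Build the CLI invocation that sets the cluster-wide op-version."""
    return ["gluster", "volume", "set", "all", "cluster.op-version", str(version)]


# The value logged by the 3.8 peers in this report.
print(op_version(3, 8))
# Command for a cluster in which every node already runs 3.12.
print(" ".join(bump_command(op_version(3, 12))))
```

The op-version is normally raised only once every node runs the new release, so in a half-finished rolling upgrade the cluster is expected to stay at the old value.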

If the above is ensured and still the issue persists, can you please share the following file from all the nodes?

cat /var/lib/glusterd/vols/remote/info

Comment 2 Denis Chaplygin 2017-11-10 15:09:17 UTC

No, I didn't, as I hadn't finished upgrading my cluster. All the other nodes were still on 3.8, and only a single node had been upgraded to 3.12.

Comment 3 Atin Mukherjee 2017-11-13 03:55:44 UTC

Please provide the following:

Output of cat /var/lib/glusterd/vols/remote/info from 172.19.11.7 & the node where the peer got rejected i.e. the new node from where you have attached the log.
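
Once the two info files are collected, a key-by-key comparison shows which entries differ between the nodes. This is a minimal sketch, assuming the info file is a list of key=value lines. The function names and the sample contents are made up for illustration and are not taken from the affected cluster.

```python
def parse_info(text):
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries[key] = value
    return entries


def diff_info(local_text, remote_text):
    """Return {key: (local_value, remote_value)} for every key that differs."""
    local, remote = parse_info(local_text), parse_info(remote_text)
    diffs = {}
    for key in sorted(set(local) | set(remote)):
        if local.get(key) != remote.get(key):
            diffs[key] = (local.get(key), remote.get(key))
    return diffs


# Hypothetical file contents, for illustration only.
old_node = "type=2\ncount=3\nop-version=30800\n"
new_node = "type=2\ncount=3\nop-version=30800\nexample-new-key=0\n"
print(diff_info(old_node, new_node))
```

A key present on only one side is reported with None for the other side, which is how a field written only by the newer release would show up.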

Comment 4 Evgheni Dereveanchin 2017-11-13 17:10:05 UTC

Sorry, I can't reproduce this in a clean environment, and the one where we found this bug has already been rebuilt.

Basically the steps I did were:
1) set up a HyperConverged oVirt 4.1 environment with 3 NGN hosts, create some VMs and let them run for a few weeks
2) upgrade one host to 4.2 beta

Not sure how we can proceed here without a stable reproducer, but just to clarify - what needs to mismatch to get the "Version of Cksums data differ"? What file is checksummed and how is this checksum computed? Here's the code that seems to be responsible:

[1] https://github.com/gluster/glusterfs/blob/master/xlators/mgmt/glusterd/src/glusterd-utils.c#L3386

Comment 5 Atin Mukherjee 2017-11-14 06:03:06 UTC

Without the info file from the two nodes where the mismatch happens (as mentioned in comment 3), unfortunately we won't be able to debug this further. If you happen to reproduce this again, please reopen this bug.