Bug 1685120

Summary: upgrade from 3.12, 4.1 and 5 to 6 broken
Product: [Community] GlusterFS Reporter: Sanju <srakonde>
Component: core Assignee: Sanju <srakonde>
Status: CLOSED NEXTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: high    
Version: mainline CC: amukherj, archon810, bugs, guillaume.pavese, hgowtham, kompastver, pasik, revirii, srakonde
Target Milestone: --- Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: gluster-test-day
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1684029 Environment:
Last Closed: 2019-03-11 06:13:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1684029    

Description Sanju 2019-03-04 11:42:45 UTC
+++ This bug was initially created as a clone of Bug #1684029 +++

Description of problem:
While trying to upgrade from older versions such as 3.12, 4.1, and 5 to gluster 6 RC, the upgrade ends with peers rejected on one node after another.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create a replica 3 volume on an older version (3.12, 4.1, or 5)
2. Kill the gluster processes on one node and install gluster 6
3. Start glusterd

Actual results:
The node with the new version gets peer rejected, and the brick processes are not started by glusterd.

Expected results:
Peer rejection should not happen. The cluster should be healthy.

Additional info:
Status shows only the bricks on that particular node, with N/A as their status. Other nodes aren't visible.
Looks like a volfile mismatch.
The new volfile has "option transport.socket.ssl-enabled off" added, while the old volfile lacks it.
The order of quick-read and open-behind differs between the old and new versions.

These changes cause the volfile mismatch and break the cluster.
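
The two differences listed above can be illustrated with a small comparison script. This is only a sketch: the function name `compare_volfiles`, the volume names, and both sample fragments are made up for illustration, and a plain text comparison like this is not necessarily how glusterd itself detects a mismatch. Which translator comes first in the old and new layouts is also chosen arbitrarily here.

```python
# Hypothetical helper: report option lines that exist in only one volfile
# and whether the translator ("type ...") lines appear in a different order.
def compare_volfiles(old_text: str, new_text: str) -> dict:
    old_lines = [line.strip() for line in old_text.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new_text.splitlines() if line.strip()]

    old_set, new_set = set(old_lines), set(new_lines)
    added = [line for line in new_lines if line not in old_set]
    removed = [line for line in old_lines if line not in new_set]

    # Same translators in both files, but listed in a different sequence.
    old_order = [line for line in old_lines if line.startswith("type ")]
    new_order = [line for line in new_lines if line.startswith("type ")]
    reordered = sorted(old_order) == sorted(new_order) and old_order != new_order

    return {"added": added, "removed": removed, "reordered": reordered}


# Made-up fragments; real volfiles contain many more translators and options.
OLD_VOLFILE = """
volume demo-client-0
    type protocol/client
end-volume
volume demo-quick-read
    type performance/quick-read
end-volume
volume demo-open-behind
    type performance/open-behind
end-volume
"""

NEW_VOLFILE = """
volume demo-client-0
    type protocol/client
    option transport.socket.ssl-enabled off
end-volume
volume demo-open-behind
    type performance/open-behind
end-volume
volume demo-quick-read
    type performance/quick-read
end-volume
"""

if __name__ == "__main__":
    print(compare_volfiles(OLD_VOLFILE, NEW_VOLFILE))
```

Running such a check against volfiles copied from an upgraded and a non-upgraded node would show whether the extra option and the reordering are the only differences.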

--- Additional comment from Sanju on 2019-02-28 17:25:57 IST ---

The peers are running into the rejected state because there is a mismatch in the volfiles. The differences are:
1. Newer volfiles have "option transport.socket.ssl-enabled off", whereas older volfiles do not have this option.
2. The order of quick-read and open-behind has changed.

Commit 4e0fab4 introduced this issue. Previously we didn't have any default value for the option transport.socket.ssl-enabled, so this option was not captured in the volfile. With the above commit, we are adding a default value, so it now gets captured in the volfile.

Commit 4e0fab4 contains a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1651059. I feel this commit has little significance, so we can revert it. If we do so, the first problem goes away.

Not sure why the order of quick-read and open-behind has changed.

Atin, do let me know your thoughts on the proposal to revert commit 4e0fab4.

Thanks,
Sanju

--- Additional comment from Sanju on 2019-03-04 14:58:55 IST ---

Root cause:
Commit 5a152a changed the mechanism for computing the checksum. Because of this change, in a heterogeneous cluster, glusterd on an upgraded node computes the cksum with the new mechanism while non-upgraded nodes still use the old one. So the cksum on the upgraded node doesn't match that of the non-upgraded nodes, which results in the peer rejection issue.

Thanks,
Sanju
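
The root cause described above, and the idea of making the checksum computation op-version compatible, can be sketched as follows. Everything here is an assumption made for illustration: the buffer size, the op-version threshold, the use of CRC32, and the guess that the old mechanism checksummed the whole fixed-size buffer while the new one checksums only the bytes actually read. It is not glusterd's actual code.

```python
import zlib

BUF_SIZE = 16                 # hypothetical read-buffer size
NEW_CKSUM_OP_VERSION = 50400  # hypothetical op-version that enables the new mechanism


def compute_cksum(data: bytes, op_version: int) -> int:
    """Checksum data chunk by chunk, choosing the mechanism by cluster op-version."""
    value = 0
    for start in range(0, len(data), BUF_SIZE):
        chunk = data[start:start + BUF_SIZE]
        if op_version < NEW_CKSUM_OP_VERSION:
            # Assumed old behaviour: checksum the full buffer, padding included.
            chunk = chunk.ljust(BUF_SIZE, b"\0")
        # Assumed new behaviour: checksum only the bytes that were read.
        value = zlib.crc32(chunk, value)
    return value


if __name__ == "__main__":
    info = b"type=2\ncount=3\nstatus=1\n"  # made-up volume info content
    # Nodes using different mechanisms disagree on the same content.
    print(compute_cksum(info, 50300) == compute_cksum(info, 50400))
    # Nodes that agree on the cluster op-version compute the same value.
    print(compute_cksum(info, 50300) == compute_cksum(info, 50300))
```

The design point is that the mechanism is selected by the op-version the whole cluster agrees on rather than by the version of the installed binary, so an upgraded node keeps producing checksums its older peers can match until the op-version is raised.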

Comment 1 Worker Ant 2019-03-04 15:01:00 UTC
REVIEW: https://review.gluster.org/22297 (core: pass buffer size for computing the cksum) posted (#1) for review on master by Sanju Rakonde

Comment 2 Hubert 2019-03-05 07:44:38 UTC
FYI: this also happens when upgrading from 5.3 to 5.4.

Comment 3 Artem Russakovskii 2019-03-05 18:56:54 UTC
Noticed the same when upgrading from 5.3 to 5.4, as mentioned.

I'm confused though. Is actual replication affected? The 5.4 server and the 3x 5.3 servers still show heal info as all 4 connected, and the files seem to be replicating correctly as well.

So what's actually affected - just the status command? Is it fixable by tweaking transport.socket.ssl-enabled? Does upgrading all servers to 5.4 resolve it, or should we revert back to 5.3?

Comment 4 Artem Russakovskii 2019-03-05 19:09:34 UTC
Ended up downgrading to 5.3 just in case. Peer status and volume status are OK now.

zypper install --oldpackage glusterfs-5.3-lp150.100.1
Loading repository data...
Reading installed packages...
Resolving package dependencies...

Problem: glusterfs-5.3-lp150.100.1.x86_64 requires libgfapi0 = 5.3, but this requirement cannot be provided
  not installable providers: libgfapi0-5.3-lp150.100.1.x86_64[glusterfs]
 Solution 1: Following actions will be done:
  downgrade of libgfapi0-5.4-lp150.100.1.x86_64 to libgfapi0-5.3-lp150.100.1.x86_64
  downgrade of libgfchangelog0-5.4-lp150.100.1.x86_64 to libgfchangelog0-5.3-lp150.100.1.x86_64
  downgrade of libgfrpc0-5.4-lp150.100.1.x86_64 to libgfrpc0-5.3-lp150.100.1.x86_64
  downgrade of libgfxdr0-5.4-lp150.100.1.x86_64 to libgfxdr0-5.3-lp150.100.1.x86_64
  downgrade of libglusterfs0-5.4-lp150.100.1.x86_64 to libglusterfs0-5.3-lp150.100.1.x86_64
 Solution 2: do not install glusterfs-5.3-lp150.100.1.x86_64
 Solution 3: break glusterfs-5.3-lp150.100.1.x86_64 by ignoring some of its dependencies

Choose from above solutions by number or cancel [1/2/3/c] (c): 1
Resolving dependencies...
Resolving package dependencies...

The following 6 packages are going to be downgraded:
  glusterfs libgfapi0 libgfchangelog0 libgfrpc0 libgfxdr0 libglusterfs0

6 packages to downgrade.

Comment 5 Worker Ant 2019-03-07 05:01:44 UTC
REVIEW: https://review.gluster.org/22297 (core: make compute_cksum function op_version compatible) merged (#4) on master by Amar Tumballi

Comment 6 Artem Russakovskii 2019-03-07 06:22:52 UTC
Is the next release going to be an imminent hotfix, i.e. something like today/tomorrow, or are we talking weeks?

Comment 7 Worker Ant 2019-03-08 05:47:10 UTC
REVIEW: https://review.gluster.org/22326 (glusterd: change the op-version) posted (#1) for review on master by Sanju Rakonde

Comment 8 Worker Ant 2019-03-11 06:13:07 UTC
REVIEW: https://review.gluster.org/22326 (glusterd: change the op-version) merged (#2) on master by Atin Mukherjee