Description of problem: While trying to upgrade from older versions (3.12, 4.1, and 5) to gluster 6 RC, the upgrade ends with the peer getting rejected on one node after another.

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:
1. Create a replica 3 volume on an older version (3, 4, or 5).
2. Kill the gluster processes on one node and install gluster 6.
3. Start glusterd.

Actual results: The upgraded node gets peer rejected, and the brick processes are not started by glusterd.

Expected results: Peer rejection should not happen. The cluster should be healthy.

Additional info: Volume status on that particular node shows only its own bricks, with N/A as the status; bricks on the other nodes aren't visible. This looks like a volfile mismatch: the new volfile has "option transport.socket.ssl-enabled off" added while the old volfile does not, and the order of quick-read and open-behind differs between the old and new versions. These differences cause the volfile mismatch and break the cluster.
The peers are running into the rejected state because there is a mismatch in the volfiles. The differences are:
1. Newer volfiles have "option transport.socket.ssl-enabled off", whereas older volfiles do not have this option.
2. The order of quick-read and open-behind is changed.

Commit 4e0fab4 introduced this issue. Previously we didn't have any default value for the option transport.socket.ssl-enabled, so this option was not captured in the volfile. With the above commit, we are adding a default value, so it is now getting captured in the volfile. Commit 4e0fab4 is a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1651059. I feel this commit has little significance, so we can revert this change. If we do so, the 1st problem goes away. I'm not sure why the order of quick-read and open-behind changed.

Atin, do let me know your thoughts on the proposal of reverting commit 4e0fab4.

Thanks, Sanju
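For illustration only, here is a minimal C sketch (invented structures and function names, not the actual glusterd volgen code) of why adding a default value makes the option appear in the generated volfile: an option with no user-set value and no compiled-in default produces no "option" line, while one with a default is always written out, so pre- and post-commit nodes generate different volfiles for the same volume.

```c
/* Illustrative sketch only -- simplified stand-in for glusterd's volfile
 * generation, not the real data structures or function names. */
#include <stdio.h>

struct opt {
    const char *key;
    const char *value;          /* value explicitly set by the user, or NULL */
    const char *default_value;  /* compiled-in default, or NULL */
};

/* Write one "option <key> <value>" line per option that has something to
 * emit. Before commit 4e0fab4, transport.socket.ssl-enabled had no default,
 * so an unset option produced no line; after the commit the default "off"
 * is emitted, and old and new volfiles no longer match. */
static void write_volfile_options(FILE *fp, const struct opt *opts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const char *val = opts[i].value ? opts[i].value : opts[i].default_value;
        if (!val)
            continue;           /* nothing set and no default: skip the line */
        fprintf(fp, "    option %s %s\n", opts[i].key, val);
    }
}

int main(void)
{
    const struct opt old_side[] = {
        { "transport.socket.ssl-enabled", NULL, NULL },  /* pre-4e0fab4: no default */
    };
    const struct opt new_side[] = {
        { "transport.socket.ssl-enabled", NULL, "off" }, /* post-4e0fab4: default "off" */
    };

    printf("old volfile fragment:\n");
    write_volfile_options(stdout, old_side, 1);
    printf("new volfile fragment:\n");
    write_volfile_options(stdout, new_side, 1);
    return 0;
}
```

Under this reading, reverting the commit (as proposed) removes the default again, so both old and new nodes skip the option and the generated fragments match.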
Root cause: Commit 5a152a changed the mechanism of computing the checksum. Because of this change, in a heterogeneous cluster, glusterd on the upgraded node follows the new mechanism for computing the cksum while non-upgraded nodes follow the old mechanism. As a result, the cksum on the upgraded node doesn't match the one on the non-upgraded nodes, which results in the peer rejection issue.

Thanks, Sanju
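A rough sketch of what making compute_cksum op_version compatible means (the names, the op-version constant, and the stand-in checksum routines below are illustrative, not the code merged via https://review.gluster.org/22319): the checksum algorithm is selected based on the cluster op-version, so an upgraded node keeps producing the old cksum until every peer runs the new version.

```c
/* Illustrative sketch only -- invented helper names and stand-in algorithms,
 * not the actual glusterd implementation. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define GD_OP_VERSION_6_0 60000 /* assumed op-version value for release 6 */

/* Stand-in for the pre-6 checksum mechanism. */
static uint32_t cksum_old_mechanism(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Stand-in for the mechanism changed by commit 5a152a (FNV-1a here). */
static uint32_t cksum_new_mechanism(const unsigned char *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * While the cluster still contains non-upgraded peers, its effective
 * op-version stays below GD_OP_VERSION_6_0, so the upgraded node computes
 * the old cksum and volinfo comparison with the old peers succeeds; the new
 * mechanism is used only once the whole cluster runs 6.x and the op-version
 * is bumped.
 */
static uint32_t compute_volinfo_cksum(const unsigned char *buf, size_t len,
                                      int cluster_op_version)
{
    if (cluster_op_version < GD_OP_VERSION_6_0)
        return cksum_old_mechanism(buf, len);
    return cksum_new_mechanism(buf, len);
}

int main(void)
{
    const unsigned char volinfo[] = "volume1:replica3";

    printf("cksum at lower op-version:  %u\n",
           compute_volinfo_cksum(volinfo, sizeof(volinfo) - 1, 50400));
    printf("cksum at op-version 60000: %u\n",
           compute_volinfo_cksum(volinfo, sizeof(volinfo) - 1, 60000));
    return 0;
}
```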
REVIEW: https://review.gluster.org/22313 (core: make compute_cksum function op_version compatible) posted (#1) for review on release-6 by Sanju Rakonde
REVIEW: https://review.gluster.org/22319 (core: make compute_cksum function op_version compatible) posted (#1) for review on release-6 by Sanju Rakonde
REVIEW: https://review.gluster.org/22319 (core: make compute_cksum function op_version compatible) merged (#3) on release-6 by Shyamsundar Ranganathan
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/
Upgrade from 3.12.15 to 6.3-1 failed:
1) Have a cluster of 3 nodes on 3.12.15.
2) Upgraded the 1st node to 6.3-1; bricks on that volume went offline and couldn't be brought online until the node was rolled back to 3.12.15.

The following are the two lines reported in /var/log/glusterfs/glusterd.log:

[2019-07-08 03:11:18.641072] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api
The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2019-07-08 03:11:18.641072] and [2019-07-08 03:11:18.641729]

This is really blocking!
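For context, a minimal sketch of the dlopen/dlsym pattern that produces the quoted message (the path is taken from the log; the surrounding code is simplified and not glusterd's actual implementation): glusterd loads the shared object and looks up the xlator_api symbol, and when the object does not export that symbol, dlsym returns NULL and a "dlsym(xlator_api) missing" line is logged. This shows only how the message arises, not whether it is the cause of the bricks staying offline.

```c
/* Minimal sketch of the dlopen/dlsym pattern behind the quoted log line;
 * the error handling is simplified and not glusterd's actual code.
 * Build with: cc demo.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/usr/lib64/glusterfs/6.3/rpc-transport/socket.so";

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the "xlator_api" symbol in the loaded object; when the shared
     * object does not export it, dlsym returns NULL and a message like
     * "dlsym(xlator_api) missing: ... undefined symbol: xlator_api" is what
     * ends up in glusterd.log. */
    void *api = dlsym(handle, "xlator_api");
    if (!api)
        fprintf(stderr, "dlsym(xlator_api) missing: %s\n", dlerror());

    dlclose(handle);
    return 0;
}
```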
amgad: Isn't that a different/separate issue? If so, you should open a new bugzilla entry for it.
Amgad, I tried upgrading from 3.12.15 to 6.3 and I haven't observed any issue with bricks coming up; they are online. You will face https://bugzilla.redhat.com/show_bug.cgi?id=1728126 during an in-service upgrade. A fix for that bug has been posted.

Thanks, Sanju
Hi Sanju: This is one of the issues -- I opened https://bugzilla.redhat.com/show_bug.cgi?id=1727682 to cover all the issues, including glusterd not starting on the default port 24007.

Regards, Amgad
Which release is this fix going into, and when will it be available? How about the heal issue with the online rollback, https://bugzilla.redhat.com/show_bug.cgi?id=1687051 -- does that fix address it? It should be in the same area!
You may expect to have it in 6.3; please follow the bug to know in which release the fix will be present. I have replied to you on the bugs you mentioned above; please do get back with the information.

Thanks, Sanju
Thanks Sanju:

> You may expect to have it in 6.3

Do you mean 6.3-2 or 6.4? 6.3-1 is already out, and that's the one I experienced the issue with. The bug fix states "release-6".