Description of problem: While trying to upgrade from older versions (3.12, 4.1, and 5) to gluster 6 RC, the upgrade ends with the peer getting rejected on one node after another.

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:
1. Create a replica 3 volume on an older version (3, 4, or 5).
2. Kill the gluster processes on one node and install gluster 6.
3. Start glusterd.

Actual results: The upgraded node gets peer rejected, and the brick processes are not started by glusterd.

Expected results: Peer rejection should not happen. The cluster should be healthy.

Additional info: Volume status on that particular node shows only its own bricks, with N/A as the status; bricks on the other nodes aren't visible. This looks like a volfile mismatch: the new volfile has "option transport.socket.ssl-enabled off" added while the old volfile does not, and the order of quick-read and open-behind differs between the old and new versions. These differences cause the volfile mismatch and break the cluster.
The peers are running into the rejected state because there is a mismatch in the volfiles. The differences are:
1. Newer volfiles have "option transport.socket.ssl-enabled off", whereas older volfiles do not have this option.
2. The order of quick-read and open-behind is changed.

Commit 4e0fab4 introduced this issue. Previously we didn't have any default value for the option transport.socket.ssl-enabled, so this option was not captured in the volfile. With the above commit, we are adding a default value, so it is now getting captured in the volfile. Commit 4e0fab4 is a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1651059. I feel this commit has little significance, so we can revert this change. If we do so, the 1st problem goes away. I'm not sure why the order of quick-read and open-behind changed.

Atin, do let me know your thoughts on the proposal of reverting commit 4e0fab4.

Thanks, Sanju
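For illustration only, here is a minimal C sketch (invented structures and function names, not the actual glusterd volgen code) of why adding a default value makes the option appear in the generated volfile: an option with no user-set value and no compiled-in default produces no "option" line, while one with a default is always written out, so pre- and post-commit nodes generate different volfiles for the same volume.

```c
/* Illustrative sketch only -- simplified stand-in for glusterd's volfile
 * generation, not the real data structures or function names. */
#include <stdio.h>

struct opt {
    const char *key;
    const char *value;          /* value explicitly set by the user, or NULL */
    const char *default_value;  /* compiled-in default, or NULL */
};

/* Write one "option <key> <value>" line per option that has something to
 * emit. Before commit 4e0fab4, transport.socket.ssl-enabled had no default,
 * so an unset option produced no line; after the commit the default "off"
 * is emitted, and old and new volfiles no longer match. */
static void write_volfile_options(FILE *fp, const struct opt *opts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const char *val = opts[i].value ? opts[i].value : opts[i].default_value;
        if (!val)
            continue;           /* nothing set and no default: skip the line */
        fprintf(fp, "    option %s %s\n", opts[i].key, val);
    }
}

int main(void)
{
    const struct opt old_side[] = {
        { "transport.socket.ssl-enabled", NULL, NULL },  /* pre-4e0fab4: no default */
    };
    const struct opt new_side[] = {
        { "transport.socket.ssl-enabled", NULL, "off" }, /* post-4e0fab4: default "off" */
    };

    printf("old volfile fragment:\n");
    write_volfile_options(stdout, old_side, 1);
    printf("new volfile fragment:\n");
    write_volfile_options(stdout, new_side, 1);
    return 0;
}
```

Under this reading, reverting the commit (as proposed) removes the default again, so both old and new nodes skip the option and the generated fragments match.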
Root cause: Commit 5a152a changed the mechanism of computing the checksum. Because of this change, in a heterogeneous cluster, glusterd on the upgraded node follows the new mechanism for computing the cksum while non-upgraded nodes follow the old mechanism. As a result, the cksum on the upgraded node doesn't match the one on the non-upgraded nodes, which results in the peer rejection issue.

Thanks, Sanju
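A rough sketch of what making compute_cksum op_version compatible means (the names, the op-version constant, and the stand-in checksum routines below are illustrative, not the code merged via https://review.gluster.org/22319): the checksum algorithm is selected based on the cluster op-version, so an upgraded node keeps producing the old cksum until every peer runs the new version.

```c
/* Illustrative sketch only -- invented helper names and stand-in algorithms,
 * not the actual glusterd implementation. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define GD_OP_VERSION_6_0 60000 /* assumed op-version value for release 6 */

/* Stand-in for the pre-6 checksum mechanism. */
static uint32_t cksum_old_mechanism(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Stand-in for the mechanism changed by commit 5a152a (FNV-1a here). */
static uint32_t cksum_new_mechanism(const unsigned char *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * While the cluster still contains non-upgraded peers, its effective
 * op-version stays below GD_OP_VERSION_6_0, so the upgraded node computes
 * the old cksum and volinfo comparison with the old peers succeeds; the new
 * mechanism is used only once the whole cluster runs 6.x and the op-version
 * is bumped.
 */
static uint32_t compute_volinfo_cksum(const unsigned char *buf, size_t len,
                                      int cluster_op_version)
{
    if (cluster_op_version < GD_OP_VERSION_6_0)
        return cksum_old_mechanism(buf, len);
    return cksum_new_mechanism(buf, len);
}

int main(void)
{
    const unsigned char volinfo[] = "volume1:replica3";

    printf("cksum at lower op-version:  %u\n",
           compute_volinfo_cksum(volinfo, sizeof(volinfo) - 1, 50400));
    printf("cksum at op-version 60000: %u\n",
           compute_volinfo_cksum(volinfo, sizeof(volinfo) - 1, 60000));
    return 0;
}
```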
REVIEW: https://review.gluster.org/22313 (core: make compute_cksum function op_version compatible) posted (#1) for review on release-6 by Sanju Rakonde
REVIEW: https://review.gluster.org/22319 (core: make compute_cksum function op_version compatible) posted (#1) for review on release-6 by Sanju Rakonde
REVIEW: https://review.gluster.org/22319 (core: make compute_cksum function op_version compatible) merged (#3) on release-6 by Shyamsundar Ranganathan
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/
Upgrade from 3.12.15 to 6.3-1 failed:
1) Have a cluster of 3 nodes on 3.12.15.
2) Upgraded the 1st node to 6.3-1; bricks on that volume went offline and couldn't be brought online until the node was rolled back to 3.12.15.

The following are the two lines reported in /var/log/glusterfs/glusterd.log:

[2019-07-08 03:11:18.641072] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api
The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2019-07-08 03:11:18.641072] and [2019-07-08 03:11:18.641729]

This is really blocking!
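For context, a minimal sketch of the dlopen/dlsym pattern that produces the quoted message (the path is taken from the log; the surrounding code is simplified and not glusterd's actual implementation): glusterd loads the shared object and looks up the xlator_api symbol, and when the object does not export that symbol, dlsym returns NULL and a "dlsym(xlator_api) missing" line is logged. This shows only how the message arises, not whether it is the cause of the bricks staying offline.

```c
/* Minimal sketch of the dlopen/dlsym pattern behind the quoted log line;
 * the error handling is simplified and not glusterd's actual code.
 * Build with: cc demo.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/usr/lib64/glusterfs/6.3/rpc-transport/socket.so";

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the "xlator_api" symbol in the loaded object; when the shared
     * object does not export it, dlsym returns NULL and a message like
     * "dlsym(xlator_api) missing: ... undefined symbol: xlator_api" is what
     * ends up in glusterd.log. */
    void *api = dlsym(handle, "xlator_api");
    if (!api)
        fprintf(stderr, "dlsym(xlator_api) missing: %s\n", dlerror());

    dlclose(handle);
    return 0;
}
```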
amgad: Isn't that a different/separate issue? If so, you should open a new bugzilla entry for it.
Amgad, I tried upgrading from 3.12.15 to 6.3 and I haven't observed any issue with bricks coming up; they are online. You will face https://bugzilla.redhat.com/show_bug.cgi?id=1728126 during an in-service upgrade. A fix for that bug has been posted.

Thanks, Sanju
Hi Sanju: This is one of the issues -- I opened https://bugzilla.redhat.com/show_bug.cgi?id=1727682 to cover all the issues, including glusterd not starting on the default port 24007.

Regards, Amgad
Which release is this fix going into, and when will it be available? How about the heal issue with the online rollback, https://bugzilla.redhat.com/show_bug.cgi?id=1687051 -- does that fix address it? It should be in the same area!
You may expect to have it in 6.3; please follow the bug to know in which release the fix will be present. I have replied to you on the bugs you mentioned above; please do get back with the information.

Thanks, Sanju
Thanks Sanju:

> You may expect to have it in 6.3

Do you mean 6.3-2 or 6.4? 6.3-1 is already out, and that's the one I experienced the issue with. The bug fix states "release-6".