Bug 1544600 - 3.8 -> 3.10 rolling upgrade fails (same for 3.12 or 3.13) on Ubuntu 14
Summary: 3.8 -> 3.10 rolling upgrade fails (same for 3.12 or 3.13) on Ubuntu 14
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
Depends On: 1544461
Blocks: 1544637 1544638
Reported: 2018-02-13 02:38 UTC by Atin Mukherjee
Modified: 2018-09-24 22:22 UTC

Fixed In Version: glusterfs-v4.1.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1544461
: 1544637 1544638 (view as bug list)
Last Closed: 2018-06-20 17:59:24 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:


Description Atin Mukherjee 2018-02-13 02:38:06 UTC
+++ This bug was initially created as a clone of Bug #1544461 +++

Description of problem: Unable to upgrade a Gluster cluster from version 3.8.15 to version 3.10.10 (the same happens with 3.12 and 3.13). I think this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1511903

Version-Release number of selected component (if applicable): old version 3.8.15, new version 3.10.10

How reproducible: Always (also tried with 3.12 and 3.13)

Steps to Reproduce:
1. Install 3.8.15 on Ubuntu 14 from the PPA.
2. Upgrade one of the nodes to the latest 3.10 release (currently 3.10.10).
3. The newly upgraded node is rejected from the Gluster cluster.

Actual results: Node is rejected from cluster

Expected results:  Node must be accepted

Additional info:
I have a 5-way replicated volume on Ubuntu 14.
I am trying to update GlusterFS. I started on version 3.7 and tried several scenarios for moving directly to the newer GlusterFS versions (3.10, 3.12, 3.13); all of them failed. I then noticed that 3.8 works fine, so I updated from 3.7.20 to 3.8.15 as an intermediate version. When I tried to move on to the next LTM release, 3.10 (I updated only 1 of the 5 servers to 3.10.10; the rest are still at 3.8.15), the updated node threw the following error:

"Version of Cksums gluster_volume differ. local cksum = 3272345312, remote cksum = 469010668 on peer 1-gls-dus21-ci-efood-real-de.openstacklocal" 

Also, all peers are now in the "Peer Rejected (Connected)" state after the update.

Volume Name: gluster_volume
Type: Replicate
Volume ID: 2e6bd6ba-37c8-4808-9156-08545cea3e3e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 5 = 5
Transport-type: tcp
Brick1: 2-gls-dus10-ci-efood-real-de.openstack.local:/export_vdb
Brick2: 1-gls-dus10-ci-efood-real-de.openstack.local:/export_vdb
Brick3: 1-gls-dus21-ci-efood-real-de:/export_vdb
Brick4: 3-gls-dus10-ci-efood-real-de.openstack.local:/export_vdb
Brick5: 2-gls-dus21-ci-efood-real-de.openstacklocal:/export_vdb
Options Reconfigured:
features.barrier: off
performance.readdir-ahead: on
nfs.disable: on
performance.cache-size: 2GB
performance.cache-max-file-size: 1MB
cluster.self-heal-window-size: 64
performance.io-thread-count: 32

root@1-gls-dus21-ci-efood-real-de:/home/ubuntu# gluster peer status
Number of Peers: 4

Hostname: 3-gls-dus10-ci-efood-real-de.openstack.local
Uuid: 3d141235-9b93-4798-8e03-82a758216b0b
State: Peer in Cluster (Connected)

Hostname: 1-gls-dus10-ci-efood-real-de.openstack.local
Uuid: 00839049-2ade-48f8-b5f3-66db0e2b9377
State: Peer in Cluster (Connected)

Hostname: 2-gls-dus10-ci-efood-real-de.openstack.local
Uuid: 1617cd54-9b2a-439e-9aa6-30d4ecf303f8
State: Peer in Cluster (Connected)

Hostname: 2-gls-dus21-ci-efood-real-de.openstacklocal
Uuid: 0c698b11-9078-441a-9e7f-442befeef7a9
State: Peer Rejected (Connected)
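
With five nodes it can be tedious to spot the rejected peer by eye. The following is a minimal sketch, not part of GlusterFS, that parses the text printed by `gluster peer status` (in the layout shown above) and lists the peers whose state starts with "Peer Rejected". The function names `parse_peer_status` and `rejected_peers` are hypothetical, and the parser assumes the blank-line-separated Hostname/Uuid/State layout seen in this report.

```python
def parse_peer_status(output):
    """Parse `gluster peer status` text into a list of per-peer dicts.

    Assumes the layout shown in this report: one Hostname/Uuid/State
    group per peer, with groups separated by blank lines.
    """
    peers = []
    current = {}
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current peer group, if any.
            if current:
                peers.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        # Ignore other lines such as "Number of Peers: 4".
        if key in ("hostname", "uuid", "state"):
            current[key] = value.strip()
    if current:
        peers.append(current)
    return peers


def rejected_peers(output):
    """Return hostnames of peers whose state starts with 'Peer Rejected'."""
    return [
        peer["hostname"]
        for peer in parse_peer_status(output)
        if peer.get("state", "").startswith("Peer Rejected")
    ]
```

The output of `gluster peer status` could be captured with `subprocess.run` and passed to `rejected_peers`; that step is left out here so the sketch does not depend on a running cluster.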

Volume status from one of the nodes that was not updated:

root@1-gls-dus21-ci-efood-real-de:/home/ubuntu# gluster volume status
Status of volume: gluster_volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick 2-gls-dus10-ci-efood-real-de.openstac
k.local:/export_vdb                         49153     0          Y       30521
Brick 1-gls-dus10-ci-efood-real-de.openstac
k.local:/export_vdb                         49152     0          Y       23166
Brick 1-gls-dus21-ci-efood-real-de:/export_
vdb                                         49153     0          Y       2322
Brick 3-gls-dus10-ci-efood-real-de.openstac
k.local:/export_vdb                         49153     0          Y       10854
Self-heal Daemon on localhost               N/A       N/A        Y       4931
Self-heal Daemon on 3-gls-dus10-ci-efood-re
al-de.openstack.local                       N/A       N/A        Y       16591
Self-heal Daemon on 2-gls-dus10-ci-efood-re
al-de.openstack.local                       N/A       N/A        Y       4621
Self-heal Daemon on 1-gls-dus10-ci-efood-re
al-de.openstack.local                       N/A       N/A        Y       3487

Task Status of Volume gluster_volume
There are no active volume tasks

And from the updated one:

root@2-gls-dus21-ci-efood-real-de:/var/log/glusterfs# gluster volume status
Status of volume: gluster_volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick 2-gls-dus21-ci-efood-real-de.openstac
klocal:/export_vdb                          N/A       N/A        N       N/A
NFS Server on localhost                     N/A       N/A        N       N/A

Task Status of Volume gluster_volume
There are no active volume tasks

[2018-02-12 13:35:53.400122] E [MSGID: 106010] [glusterd-utils.c:3043:glusterd_compare_friend_volume] 0-management: Version of Cksums gluster_volume differ. local cksum = 3272345312, remote cksum = 469010668 on peer 1-gls-dus10-ci-efood-real-de.openstack.local
[2018-02-12 13:35:53.400211] I [MSGID: 106493] [glusterd-handler.c:3866:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 1-gls-dus10-ci-efood-real-de.openstack.local (0), ret: 0, op_ret: -1
[2018-02-12 13:35:53.417588] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
[2018-02-12 13:35:53.430748] I [MSGID: 106490] [glusterd-handler.c:2606:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 3d141235-9b93-4798-8e03-82a758216b0b
[2018-02-12 13:35:53.431024] E [MSGID: 106010] [glusterd-utils.c:3043:glusterd_compare_friend_volume] 0-management: Version of Cksums gluster_volume differ. local cksum = 3272345312, remote cksum = 469010668 on peer 3-gls-dus10-ci-efood-real-de.openstack.local
[2018-02-12 13:35:53.431121] I [MSGID: 106493] [glusterd-handler.c:3866:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 3-gls-dus10-ci-efood-real-de.openstack.local (0), ret: 0, op_ret: -1
[2018-02-12 13:35:53.473344] I [MSGID: 106493] [glusterd-rpc-ops.c:485:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 7488286f-6bfa-46f8-bc50-9ee815e96c66, host: 1-gls-dus21-ci-efood-real-de.openstacklocal, port: 0
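
To collect every checksum mismatch from a glusterd log, the MSGID 106010 lines above can be matched with a regular expression. This is a small sketch that assumes the message wording stays exactly as it appears in this log excerpt; the names `CKSUM_RE` and `find_cksum_mismatches` are hypothetical.

```python
import re

# Pattern modeled on the "Version of Cksums ... differ" lines quoted above.
CKSUM_RE = re.compile(
    r"Version of Cksums (?P<volume>\S+) differ\. "
    r"local cksum = (?P<local>\d+), remote cksum = (?P<remote>\d+) "
    r"on peer (?P<peer>\S+)"
)


def find_cksum_mismatches(log_text):
    """Return one dict per checksum-mismatch message found in log_text."""
    return [
        {
            "volume": m["volume"],
            "local": int(m["local"]),
            "remote": int(m["remote"]),
            "peer": m["peer"],
        }
        for m in CKSUM_RE.finditer(log_text)
    ]
```

Run against the excerpt above, this would report the same local/remote checksum pair for each peer, which matches the single changed volume file on the upgraded node.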

I do not have this file on any of the servers: `/var/lib/glusterd/vols/remote/info`, but I attached `/var/lib/glusterd/vols/gluster_volume/info` from the upgraded server and from a server which was not upgraded.

Version 3.7 was running fine for quite some time, so we can exclude network issues, SELinux, etc.

--- Additional comment from Marc on 2018-02-12 09:51:24 EST ---

I see that on the new node I have the new "tier-enabled=0" entry. Could this also be related to https://www.spinics.net/lists/gluster-users/msg33329.html?
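
A quick way to confirm which keys differ between the info file of the upgraded node and that of a non-upgraded node is a key-by-key comparison. The sketch below assumes the info file is a flat list of `key=value` lines, which is how "tier-enabled=0" appears; `parse_info` and `diff_info` are hypothetical helper names, and the sample keys in the usage note are purely illustrative.

```python
def parse_info(text):
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries[key] = value
    return entries


def diff_info(local_text, remote_text):
    """Compare two info files.

    Returns (keys only in local, keys only in remote, keys whose values
    differ), each as a sorted list.
    """
    local = parse_info(local_text)
    remote = parse_info(remote_text)
    only_local = sorted(set(local) - set(remote))
    only_remote = sorted(set(remote) - set(local))
    changed = sorted(k for k in set(local) & set(remote) if local[k] != remote[k])
    return only_local, only_remote, changed
```

For example, comparing an upgraded node's file containing `tier-enabled=0` against an otherwise identical file without that line would return `(["tier-enabled"], [], [])`.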

--- Additional comment from Atin Mukherjee on 2018-02-12 10:07:17 EST ---

This is indeed a bug, and we managed to root-cause it a couple of days ago. I am assigning it to my colleague Hari, who is aware of this issue and of the fix required. For the time being, please remove tier-enabled=0 from all the info files on the node which has been upgraded, and then, once all nodes are upgraded, bump up the cluster.op-version.

@Hari - we need to send this fix to the 3.10, 3.12 and 4.0 branches by changing the op-version check to 3.11 instead of 3.7.6.
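
The manual part of this workaround could be scripted roughly as follows. This is a hedged sketch, not an official tool: it assumes the volume info files live at `/var/lib/glusterd/vols/<volume>/info` (the path pattern mentioned in the description), that the offending line is exactly `tier-enabled=0`, and that it is run on the upgraded node only. The names `strip_tier_enabled` and `fix_info_files` and the `.bak` backup suffix are hypothetical. It defaults to a dry run so nothing is modified until the list of affected files has been reviewed.

```python
from pathlib import Path


def strip_tier_enabled(text):
    """Return text with every line that is exactly 'tier-enabled=0' removed."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if line.strip() != "tier-enabled=0"
    ]
    return "".join(kept)


def fix_info_files(vols_dir="/var/lib/glusterd/vols", dry_run=True):
    """Strip 'tier-enabled=0' from each <vols_dir>/<volume>/info file.

    Returns the paths that need (dry run) or received (real run) a change.
    On a real run, the original content is first saved next to the file
    with a '.bak' suffix (a hypothetical convention for this sketch).
    """
    changed = []
    for info in sorted(Path(vols_dir).glob("*/info")):
        original = info.read_text()
        cleaned = strip_tier_enabled(original)
        if cleaned != original:
            changed.append(str(info))
            if not dry_run:
                info.with_name(info.name + ".bak").write_text(original)
                info.write_text(cleaned)
    return changed
```

Calling `fix_info_files()` first prints nothing and changes nothing; it only returns the files that contain the line. Calling `fix_info_files(dry_run=False)` then applies the edit.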

Comment 1 Worker Ant 2018-02-13 02:39:11 UTC
REVIEW: https://review.gluster.org/19552 (glusterd: fix tier-enabled flag op-version check) posted (#1) for review on master by Atin Mukherjee

Comment 2 Worker Ant 2018-02-13 13:51:48 UTC
COMMIT: https://review.gluster.org/19552 committed in master by "Atin Mukherjee" <amukherj@redhat.com> with a commit message- glusterd: fix tier-enabled flag op-version check

tier-enabled flag in volinfo structure was introduced in 3.10, however
while writing this value to the glusterd store was done with a wrong
op-version check which results into volume checksum failure during upgrades.

Change-Id: I4330d0c4594eee19cba42e2cdf49a63f106627d4
BUG: 1544600
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
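
The effect described in the commit message can be illustrated with a toy model. This is only a sketch of the idea, not glusterd's actual store code: `zlib.crc32` stands in for glusterd's real checksum routine, the `type`/`count` entries are placeholder keys, and the numeric op-versions follow the convention visible in the log above (30800 for 3.8), so 30706 for 3.7.6 and 31100 for 3.11 are assumptions derived from that pattern.

```python
import zlib

CLUSTER_OP_VERSION = 30800   # from the log: "using the op-version 30800"
OP_VERSION_3_7_6 = 30706     # assumed encoding of the old (wrong) check
OP_VERSION_3_11 = 31100      # assumed encoding of the corrected check


def serialize_volinfo(cluster_op_version, tier_flag_min_op_version=None):
    """Toy serialization of a volume info file.

    tier_flag_min_op_version=None models a release that does not know the
    tier-enabled flag at all (for example 3.8). Otherwise the flag is
    written only when the cluster op-version reaches the given minimum.
    """
    lines = ["type=2", "count=5"]  # placeholder entries
    if (tier_flag_min_op_version is not None
            and cluster_op_version >= tier_flag_min_op_version):
        lines.append("tier-enabled=0")
    return "\n".join(lines) + "\n"


def cksum(text):
    """Stand-in checksum over the serialized file."""
    return zlib.crc32(text.encode())


old_node = serialize_volinfo(CLUSTER_OP_VERSION)                     # 3.8 peer
buggy_node = serialize_volinfo(CLUSTER_OP_VERSION, OP_VERSION_3_7_6)  # wrong check
fixed_node = serialize_volinfo(CLUSTER_OP_VERSION, OP_VERSION_3_11)   # fixed check
```

In this model the upgraded node with the 3.7.6 check writes `tier-enabled=0` while the cluster is still at op-version 30800, so its file and therefore its checksum differ from those of the 3.8 peers. With the check raised to 3.11, the file stays identical to the peers' until the cluster op-version is bumped.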

Comment 3 Shyamsundar 2018-06-20 17:59:24 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
