Bug 949625 - Peer rejected after upgrading
Summary: Peer rejected after upgrading
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: unspecified
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-04-08 15:58 UTC by ivano.talamo
Modified: 2014-12-14 19:40 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:30 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description ivano.talamo 2013-04-08 15:58:59 UTC
Description of problem:

We have a setup with 2 SL5 servers (t2-test02, t2-test03) running gluster 3.2.5 with a replicated volume. After upgrading these to 3.3.1, it is impossible to add an SL6 server (t2-test04) with gluster 3.3.1. After a "peer probe t2-test04" the peer is rejected:

[root@t2-test02 ~]# gluster peer status
Number of Peers: 2

Hostname: t2-test03
Uuid: 477e485d-448e-4c20-aeea-8362b826e3eb
State: Peer in Cluster (Connected)

Hostname: t2-test04
Uuid: 26bd1e66-1455-438c-83d4-b6dd13d4389c
State: Peer Rejected (Connected)



Version-Release number of selected component (if applicable):
3.3.1


Steps to Reproduce:
1. Install server A and B with Scientific Linux 5.7 and install gluster 3.2.5 on them. 
2. Peer probe between them and create a replicated volume.
3. Upgrade both to 3.3.1.
4. Install server C with Scientific Linux 6.3 and install gluster 3.3.1.
5. On server A run "peer probe C". The probe is rejected with "State: Peer Rejected (Connected)".

Actual results:
State: Peer Rejected (Connected)


Expected results:
State: Peer in Cluster (Connected)


Additional info:
If no volume is present, the probe succeeds.
Name resolution works correctly.
In /var/log/glusterfs/etc-glusterfs-glusterd.vol.log on t2-test02, after the peer probe I get the following (IP address removed):

[2013-04-08 17:41:35.507138] I [glusterd-handler.c:685:glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req t2-test04 24007
[2013-04-08 17:41:35.516756] I [glusterd-handler.c:428:glusterd_friend_find] 0-glusterd: Unable to find hostname: t2-test04
[2013-04-08 17:41:35.516780] I [glusterd-handler.c:2245:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: t2-test04 (24007)
[2013-04-08 17:41:35.517089] I [rpc-clnt.c:968:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-04-08 17:41:35.520220] I [glusterd-handler.c:2227:glusterd_friend_add] 0-management: connect returned 0
[2013-04-08 17:41:35.521552] I [glusterd-handshake.c:397:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd mgmt, Num (1238433), Version (2)
[2013-04-08 17:41:35.521580] I [glusterd-handshake.c:403:glusterd_set_clnt_mgmt_program] 0-: Using Program Peer mgmt, Num (1238437), Version (2)
[2013-04-08 17:41:35.531927] I [glusterd-rpc-ops.c:219:glusterd3_1_probe_cbk] 0-glusterd: Received probe resp from uuid: 26bd1e66-1455-438c-83d4-b6dd13d4389c, host: t2-test04
[2013-04-08 17:41:35.532181] I [glusterd-handler.c:416:glusterd_friend_find] 0-glusterd: Unable to find peer by uuid
[2013-04-08 17:41:35.533168] I [glusterd-rpc-ops.c:287:glusterd3_1_probe_cbk] 0-glusterd: Received resp to probe req
[2013-04-08 17:41:35.555106] I [glusterd-rpc-ops.c:329:glusterd3_1_friend_add_cbk] 0-glusterd: Received ACC from uuid: 26bd1e66-1455-438c-83d4-b6dd13d4389c, host: t2-test04, port: 0
[2013-04-08 17:41:35.555218] I [glusterd-handler.c:2423:glusterd_xfer_cli_probe_resp] 0-glusterd: Responded to CLI, ret: 0
[2013-04-08 17:41:35.557322] I [glusterd-handler.c:1758:glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 26bd1e66-1455-438c-83d4-b6dd13d4389c
[2013-04-08 17:41:35.557502] I [glusterd-handler.c:1799:glusterd_handle_probe_query] 0-glusterd: Responded to XXX.XXX.XXX.XXX, op_ret: 0, op_errno: 0, ret: 0
[2013-04-08 17:41:35.558289] I [glusterd-handler.c:1486:glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 26bd1e66-1455-438c-83d4-b6dd13d4389c
[2013-04-08 17:41:35.558361] E [glusterd-utils.c:1926:glusterd_compare_friend_volume] 0-: Cksums of volume gluster-test differ. local cksum = 1934649064, remote cksum = -1823208808
[2013-04-08 17:41:35.558502] I [glusterd-handler.c:2395:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to XXX.XXX.XXX.XXX (0), ret: 0
[2013-04-08 17:41:45.287974] I [glusterd-handler.c:819:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req

Comment 1 Kaushal 2013-04-09 04:43:29 UTC
Can you check if the volfiles for volume 'gluster-test', in /var/lib/glusterd/vols/ , are the same on test{02,03,04} (after the probe fails)? If there are any differences, please post them here.

Comment 2 ivano.talamo 2013-04-09 09:15:04 UTC
Hi,
the files on /var/lib/glusterd/vols/gluster-test are the same on test{02,03}.
Some of them differ on test04, though:
[root@t2-test03 gluster-test]# md5sum *
md5sum: bricks: Is a directory
5dfe5b652e3539efa37fc1c8801ea66b  cksum
ea87d872aa0870f358dbd8ee03bd8a5c  gluster-test-fuse.vol
086f16c89d72fc639876ef2b5d4876cd  gluster-test.t2-test02.root-b1.vol
086f16c89d72fc639876ef2b5d4876cd  gluster-test.t2-test03.root-b1.vol
27446113de33c6112d41c4306d24e776  info
f1b55d145d2987c1b23c80d3dcc689ed  node_state.info
7539d230a861bcba000f71047da6b2b4  rbstate

[root@t2-test04 gluster-test]# md5sum *
md5sum: bricks: Is a directory
4117326b54141c43d5f9f34fc15334c0  cksum
ecdc91a2c769a2867a367946ef2b8897  gluster-test-fuse.vol
086f16c89d72fc639876ef2b5d4876cd  gluster-test.t2-test02.root-b1.vol
086f16c89d72fc639876ef2b5d4876cd  gluster-test.t2-test03.root-b1.vol
795e55064584312169ebc12f70b0f234  info
f1b55d145d2987c1b23c80d3dcc689ed  node_state.info
7539d230a861bcba000f71047da6b2b4  rbstate
ecdc91a2c769a2867a367946ef2b8897  trusted-gluster-test-fuse.vol

Here are the diffs:
[root@t2-test04 ~]# diff gluster-test-t2-test03/cksum /var/lib/glusterd/vols/gluster-test/cksum
1c1
< info=600099517
---
> info=2600514254
[root@t2-test04 ~]# diff gluster-test-t2-test03/gluster-test-fuse.vol /var/lib/glusterd/vols/gluster-test/gluster-test-fuse.vol
40,41c40,41
< volume gluster-test-stat-prefetch
<     type performance/stat-prefetch
---
> volume gluster-test-md-cache
>     type performance/md-cache
49c49
<     subvolumes gluster-test-stat-prefetch
---
>     subvolumes gluster-test-md-cache
[root@t2-test04 ~]# diff gluster-test-t2-test03/info /var/lib/glusterd/vols/gluster-test/info
4a5,6
> stripe_count=1
> replica_count=1

Furthermore, only on t2-test04, there's the file trusted-gluster-test-fuse.vol.

Comment 3 Kaushal 2013-04-17 04:52:47 UTC
Thanks for the update.

Gluster v3.3 brings some changes to the volfiles and some options, so on upgrade from 3.2 to 3.3 these files need to be regenerated. If the upgrade was done via rpms, the regeneration should have happened automatically. Since it didn't, either you did a source install or the rpms you used were faulty (in which case, please provide details on the rpms).

When you added a new peer to the cluster, it got the volume details from the original peers, but saved them to disk in the newer format. This caused the checksum mismatch, which led to the peer being Rejected.
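The mechanism can be illustrated with a toy example (glusterd uses its own checksum routine over the info file; CRC32 here is just for illustration, and the file contents are made up): the new peer writes the same volume with extra 3.3-era keys appended, so the bytes on disk, and therefore the checksum, no longer match the older peers'.

```python
import zlib

# Hypothetical 3.2-format info file as stored on the original peers.
old_info = b"type=2\ncount=2\nstatus=1\nversion=1\n"

# The 3.3-format peer persists the same volume with extra keys appended
# (as seen in the diff above: stripe_count and replica_count).
new_info = old_info + b"stripe_count=1\nreplica_count=1\n"

old_cksum = zlib.crc32(old_info)
new_cksum = zlib.crc32(new_info)

# Same logical volume, different on-disk bytes -> checksums differ,
# which is what glusterd_compare_friend_volume rejects the peer over.
assert old_cksum != new_cksum
```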

To solve this problem, perform step 5 from this article http://vbellur.wordpress.com/2012/05/31/upgrading-to-glusterfs-3-3/ on the original 2 peers. That should regenerate the volfiles and solve the problem.
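For reference, the regeneration step from the linked article amounts to restarting glusterd once in upgrade mode on each original peer. This is a sketch; verify the flag against your installed version, and note it defaults to only printing the commands (pass an empty prefix to actually run them):

```shell
#!/bin/sh
# Regenerate volfiles after a 3.2 -> 3.3 upgrade (step 5 of the article).
# The first argument is a command prefix: "echo" for a dry run, "" to execute.
regen_volfiles() {
    "$@" service glusterd stop
    "$@" glusterd --xlator-option '*.upgrade=on' -N   # rewrite volfiles, then exit
    "$@" service glusterd start
}

# Dry run: print the commands instead of executing them.
regen_volfiles echo
```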

Comment 4 ivano.talamo 2013-04-17 16:57:09 UTC
Hi Kaushal,
thank you! I followed the procedure you suggested and yes, the probe now succeeds.
The repo file in the servers is configured to take RPMs from here:
http://download.gluster.org/pub/gluster/glusterfs/3.3/3.3.1/EPEL.repo/epel-5/x86_64/
Is it the right one?
However, a second problem has come up. After adding a brick from the new server to the existing distributed volume, all the files disappeared from gluster: an ls shows nothing, although I can still read/write the files by giving their exact path, and the files are present on the bricks.
I also tried restarting the gluster daemon and rebooting the server, but nothing changed...

Thanks,
Ivano

Comment 5 Kaushal 2013-04-19 05:12:11 UTC
Hi Ivano,
It seems there is a problem with the rpms. We'll need to check them to see what caused this.

Your new problem shouldn't be happening. Can you give more information (basically, what you did after the upgrade) and the logs of both servers and the client? I'd suggest filing a new bug if this turns out to be a valid issue.

Regards,
Kaushal

Comment 6 Niels de Vos 2014-11-27 14:54:18 UTC
The version this bug was reported against no longer receives updates from the Gluster Community. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will be closed automatically.

