Bug 1072720 - gluster peer probe results in peer rejected state
Summary: gluster peer probe results in peer rejected state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Ravishankar N
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-03-05 05:33 UTC by Ravishankar N
Modified: 2014-11-11 08:28 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.6.0beta1
Clone Of:
Environment:
Last Closed: 2014-11-11 08:28:25 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Ravishankar N 2014-03-05 05:33:27 UTC
Description of problem:

[root@ravi1 glusterfs]# gluster peer probe 10.70.42.252
peer probe: success. 
[root@ravi1 glusterfs]# gluster peer status
Number of Peers: 1

Hostname: 10.70.42.252
Uuid: 53da4d10-fa90-44fa-aeb2-11c306f23d8b
State: Peer Rejected (Connected)

From /usr/local/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log:
E [glusterd-utils.c:2372:glusterd_compare_friend_volume] 0-management: Cksums of volume testvol differ. local cksum = 2871551223, remote cksum = 329029812 on peer 10.70.42.252

How reproducible:
Always

Steps to Reproduce:
1. Create a volume on a node.
2. Peer probe a second node.
3. Check peer status.

Actual results:
State: Peer Rejected (Connected)

Expected results:
State: Peer in Cluster (Connected)

Additional info:
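For reference, a minimal CLI sketch of the reproduction, plus a quick check of the stored volume checksum that glusterd compares during the friend handshake. The brick path and peer address are placeholders, and the paths assume a packaged install under /var/lib/glusterd and /var/log/glusterfs (adjust for source builds under /usr/local):

# On node A: create and start a volume before probing (brick path is illustrative).
gluster volume create testvol <NODE-A-IP>:/bricks/testvol force
gluster volume start testvol

# Probe the second node and check its state.
gluster peer probe <NODE-B-IP>
gluster peer status            # shows "Peer Rejected (Connected)" on affected builds

# On both nodes: compare the stored volume checksum; differing values correspond
# to the "Cksums of volume testvol differ" error in the glusterd log.
cat /var/lib/glusterd/vols/testvol/cksum
grep "Cksums of volume" /var/log/glusterfs/*glusterd*.log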

Comment 1 Anand Avati 2014-03-05 05:34:36 UTC
REVIEW: http://review.gluster.org/7186 (glusterd: send/receive volinfo->caps during peer probe.) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 2 Anand Avati 2014-03-09 07:24:14 UTC
COMMIT: http://review.gluster.org/7186 committed in master by Vijay Bellur (vbellur) 
------
commit dec7950d4b0944697e4bb8788cc02de2ac4d8708
Author: Ravishankar N <ravishankar>
Date:   Wed Mar 5 04:46:50 2014 +0000

    glusterd: send/receive volinfo->caps during peer probe.
    
    Problem: volinfo->caps was not sent over to newly probed peers, resulting in a
    'Peer Rejected' state due to volinfo checksum mismatch.
    
    Fix: send/receive volinfo capability when peer probing.
    
    Change-Id: I2508d3fc7a6e4aeac9c22dd7fb2d3b362f4c21ff
    BUG: 1072720
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/7186
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 3 Richard 2014-04-03 09:04:12 UTC
Has this made it into 3.5.0beta4?

I get this on a distributed volume when adding the 2nd node:

Hostname: 172.16.185.106
Uuid: 41068245-072b-48fe-91ce-249feaea3813
State: Probe Sent to Peer (Connected)

Comment 4 Awktane 2014-04-21 19:28:53 UTC
Waiting for this on the 3.4 branch as well. I wanted to translate the above into English for anybody trying to work around this issue. I believe this applies to an upgrade only:

New nodes add the following lines to /var/lib/glusterd/{mount}/info
op-version=2
client-op-version=2

You will notice that these lines do not appear on the old nodes. This causes a checksum mismatch, and therefore a rejected peer. Take your nodes down, add the lines (or remove them) so they match on every node, and restart.

Comment 5 Richard 2014-04-22 07:51:03 UTC
Hi,
this applies to a fresh install setting up a brand new single-node volume. When adding the 2nd node to a distributed setup, the peer probe fails too.
Rich

Comment 6 Richard 2014-04-22 07:51:37 UTC
When I say "this applies", I mean that the problem is still there, not that the fix in comment 4 works for it :-(

Comment 7 Ravishankar N 2014-04-22 08:51:52 UTC
Hi Richard, comments #3 and #4 seem to describe a different issue from the one I fixed. In the fresh-install setup you described, what errors do you get in the glusterd logs of the nodes when you peer probe?

Comment 8 Awktane 2014-04-22 11:40:20 UTC
Note the typo in comment 4: /var/lib/glusterd/{mount}/info should be /var/lib/glusterd/vol/{mount}/info.

I get the same checksum error if I don't manually edit the info file.
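To spell the workaround out (combining comments 4 and 8), here is a minimal sketch. It assumes the info file lives at /var/lib/glusterd/vols/<VOLNAME>/info on packaged installs and that the volume is named testvol; adjust the path for your layout, and run the edit on every node while glusterd is stopped:

# Stop glusterd on all nodes first.
service glusterd stop

# Check whether the op-version lines are present; they must match on every node.
grep -E '^(client-)?op-version=' /var/lib/glusterd/vols/testvol/info

# Either add the two lines on the nodes that lack them ...
printf 'op-version=2\nclient-op-version=2\n' >> /var/lib/glusterd/vols/testvol/info
# ... or remove them from the nodes that have them:
# sed -i '/^op-version=/d; /^client-op-version=/d' /var/lib/glusterd/vols/testvol/info

# Restart glusterd and re-check the peer state.
service glusterd start
gluster peer status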

Comment 9 Richard 2014-04-22 12:20:14 UTC
(In reply to Ravishankar N from comment #7)

I get this on the initial node:

[2014-04-22 12:17:37.821414] I [glusterd-handler.c:918:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req 172.16.242.241 24007
[2014-04-22 12:17:37.829156] I [glusterd-handler.c:2931:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: 172.16.242.241 (24007)
[2014-04-22 12:17:37.835996] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-04-22 12:17:37.836161] I [socket.c:3561:socket_init] 0-management: SSL support is NOT enabled
[2014-04-22 12:17:37.836174] I [socket.c:3576:socket_init] 0-management: using system polling thread
[2014-04-22 12:17:37.840074] I [glusterd-handler.c:2912:glusterd_friend_add] 0-management: connect returned 0
[2014-04-22 12:17:37.869206] I [glusterd-rpc-ops.c:234:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: c1f2632f-ea5f-467b-a701-0ea29caa153c, host: 172.16.242.241
[2014-04-22 12:17:37.874531] I [glusterd-rpc-ops.c:306:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req

and this on the new node trying to join:

[2014-04-22 12:17:37.844723] I [glusterd.c:168:glusterd_uuid_generate_save] 0-management: generated UUID: c1f2632f-ea5f-467b-a701-0ea29caa153c
[2014-04-22 12:17:37.852461] I [glusterd-handler.c:2346:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 4a652daa-a614-4034-93af-e7e57f90add8
[2014-04-22 12:17:37.853880] I [glusterd-handler.c:2374:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 172.16.0.1 (24007)
[2014-04-22 12:17:37.855150] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-04-22 12:17:37.855246] I [socket.c:3561:socket_init] 0-management: SSL support is NOT enabled
[2014-04-22 12:17:37.855262] I [socket.c:3576:socket_init] 0-management: using system polling thread
[2014-04-22 12:17:37.868301] I [glusterd-handler.c:2912:glusterd_friend_add] 0-management: connect returned 0
[2014-04-22 12:17:37.868498] I [glusterd-handler.c:2398:__glusterd_handle_probe_query] 0-glusterd: Responded to 172.16.0.1, op_ret: 0, op_errno: 0, ret: 0
[2014-04-22 12:17:37.871329] I [glusterd-handler.c:2050:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 4a652daa-a614-4034-93af-e7e57f90add8
[2014-04-22 12:17:37.927078] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-04-22 12:17:37.927374] I [socket.c:3561:socket_init] 0-management: SSL support is NOT enabled
[2014-04-22 12:17:37.927385] I [socket.c:3576:socket_init] 0-management: using system polling thread

If I downgrade to 3.4.x everything works just fine.
Thanks,
Rich

Comment 10 Richard 2014-04-22 12:21:43 UTC
oh, and on the first node, my peer status is this:

[DEV root@e5234340b67d11e glusterfs]$ gluster peer status
Number of Peers: 1

Hostname: 172.16.242.241
Uuid: c1f2632f-ea5f-467b-a701-0ea29caa153c
State: Probe Sent to Peer (Connected)

On the new node, my peer status is not available.

[PXE root@dde3904ac98711e glusterfs]$ gluster peer status
peer status: failed

Comment 11 Ravishankar N 2014-04-23 05:24:16 UTC
(In reply to Richard from comment #10)
> oh, and on the first node, my peer status is this:
> 
> [DEV root@e5234340b67d11e glusterfs]$ gluster peer status
> Number of Peers: 1
> 
> Hostname: 172.16.242.241
> Uuid: c1f2632f-ea5f-467b-a701-0ea29caa153c
> State: Probe Sent to Peer (Connected)
> 
> On the new node, my peer status is not available.
> 
> [PXE root@dde3904ac98711e glusterfs]$ gluster peer status
> peer status: failed

Tested and found that the problem exists when upgrading from 3.3 to 3.4, as reported by Awktane, and needs to be fixed (until then, the cause/workaround is in comment #4).

But I am not able to recreate this on a 3.4 to 3.5 upgrade, and there don't seem to be any errors in the glusterd logs either. If you are able to reproduce this, could you attach the log files of both nodes? It would also help if you could run glusterd in debug mode before probing (upgrade all nodes to 3.5, `pkill glusterd` on both nodes, run `glusterd -LDEBUG` on the nodes, then do the peer probing).
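A minimal sketch of that debug-log collection, assuming packaged installs that log to /var/log/glusterfs/etc-glusterfs-glusterd.vol.log (source builds log under /usr/local/var/log/glusterfs/ instead); the peer address is a placeholder:

# On both nodes: stop the running glusterd and restart it at DEBUG log level.
pkill glusterd
glusterd -LDEBUG

# From the first node only: retry the probe.
gluster peer probe <PEER-ADDRESS>
gluster peer status

# On each node: copy the management log so it can be attached to this bug.
cp /var/log/glusterfs/etc-glusterfs-glusterd.vol.log /tmp/glusterd-debug-$(hostname).log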

Comment 12 Awktane 2014-04-23 05:27:32 UTC
(In reply to Ravishankar N from comment #11)

My scenario was from 3.3 to 3.4. Is yours for 3.5, and should I therefore request a backport?

Comment 13 Ravishankar N 2014-04-23 05:36:52 UTC
(In reply to Awktane from comment #12)
> (In reply to Ravishankar N from comment #11)
> 
> My scenario was from 3.3 to 3.4. Is yours for 3.5 and therefore I should
> request a backport?

No, because it is not a backport but a new fix. The problem I faced was due to the "caps" key not being sent to the peers, causing a checksum mismatch. But to fix the problem you faced, we need to regenerate the info files with the 'op-version' and 'client-op-version' key-value pairs after an upgrade is done.

Comment 14 Awktane 2014-04-23 05:38:15 UTC
(In reply to Ravishankar N from comment #13)
> (In reply to Awktane from comment #12)
> > (In reply to Ravishankar N from comment #11)
> > 
> > My scenario was from 3.3 to 3.4. Is yours for 3.5 and therefore I should
> > request a backport?
> 
> No,because it is not a backport but a new fix. The problem I faced was due
> to the "caps" key not being sent to the peers causing checksum mismatch. But
> to fix the problem faced by you, we need to regenerate the info files with
> the 'op-version' and 'client-op-version' key value pairs after an upgrade is
> done.

So shall I resubmit as new bug then?

Comment 15 Ravishankar N 2014-04-23 05:39:50 UTC
(In reply to Awktane from comment #14)
> (In reply to Ravishankar N from comment #13)
> > (In reply to Awktane from comment #12)
> > > (In reply to Ravishankar N from comment #11)
> > > 
> > > My scenario was from 3.3 to 3.4. Is yours for 3.5 and therefore I should
> > > request a backport?
> > 
> > No,because it is not a backport but a new fix. The problem I faced was due
> > to the "caps" key not being sent to the peers causing checksum mismatch. But
> > to fix the problem faced by you, we need to regenerate the info files with
> > the 'op-version' and 'client-op-version' key value pairs after an upgrade is
> > done.
> 
> So shall I resubmit as new bug then?
Sure :)

Comment 16 Awktane 2014-04-23 05:57:39 UTC
(In reply to Ravishankar N from comment #15)
> (In reply to Awktane from comment #14)
> > (In reply to Ravishankar N from comment #13)
> > > (In reply to Awktane from comment #12)
> > > > (In reply to Ravishankar N from comment #11)
> > > > 
> > > > My scenario was from 3.3 to 3.4. Is yours for 3.5 and therefore I should
> > > > request a backport?
> > > 
> > > No,because it is not a backport but a new fix. The problem I faced was due
> > > to the "caps" key not being sent to the peers causing checksum mismatch. But
> > > to fix the problem faced by you, we need to regenerate the info files with
> > > the 'op-version' and 'client-op-version' key value pairs after an upgrade is
> > > done.
> > 
> > So shall I resubmit as new bug then?
> Sure :)

Done. Note that my issue was unrelated to this one. Split to https://bugzilla.redhat.com/show_bug.cgi?id=1090298

Comment 17 Niels de Vos 2014-09-22 12:36:32 UTC
A beta release for GlusterFS 3.6.0 has been released [1]. Please verify whether the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
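For anyone verifying, a quick sketch of re-running the original scenario once the beta packages are installed on both nodes (peer address is a placeholder):

# Confirm both nodes run the beta.
glusterfs --version

# With a volume already created on this node (as in the description), probe the second node.
gluster peer probe <NEW-NODE-IP>
gluster peer status    # expect "State: Peer in Cluster (Connected)"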

Comment 18 Richard 2014-09-22 18:13:08 UTC
Hi

Thank you for the update, once it appears in the QA folder here, I will be able to test:

http://download.gluster.org/pub/gluster/glusterfs/qa-releases/

Thanks,

Rich

Comment 19 Richard 2014-09-23 20:21:57 UTC
Hi, this beta release has resolved the problem I was having with peer probes... and thrown up a new one: noatime is no longer a supported mount option.
Thanks,
Rich

Comment 20 Niels de Vos 2014-11-11 08:28:25 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users

