1051992 – Peer stuck on "accepted peer request"

Summary: Peer stuck on "accepted peer request"

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kaushal
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-01-13 04:27 UTC by purpleidea
Modified:	2018-08-29 03:54 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-08-29 03:54:03 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description purpleidea 2014-01-13 04:27:22 UTC

Description of problem:

When building clusters of gluster hosts, one of the peers is invariably reported as being in the "accepted peer request" state (state == 4) instead of a normal accepted state.

As seen from the other host, the connection looks fine. The problem is that gluster volume create commands fail because one peer looks like it's not ready.

Version-Release number of selected component (if applicable):

3.4+

How reproducible:

Joe told me that he had heard of this issue, but nobody had a reliable way to reproduce it. I think I can reproduce it nearly 100% of the time. In fact, I had to patch Puppet-Gluster to work around it, because it was disrupting my automatic builds!

Steps to Reproduce:
1. Remove workaround patch that I added to:
https://github.com/purpleidea/puppet-gluster/blob/master/manifests/volume.pp#L217

That exec {} block prevents it. There are also 2 references to it that look like:
Exec["gluster-volume-stuck-${name}"],
Currently on lines: 205 and 213. Comment out those too.

2. Deploy puppet-gluster using vagrant:
https://ttboj.wordpress.com/2014/01/08/automatically-deploying-glusterfs-with-puppet-gluster-vagrant/

3. Volume create will eventually fail. Look at 'gluster peer status' on all nodes. One entry on one of the hosts will show a host in state == 4.

Actual results:

Cluster doesn't build because not all hosts are in the right state.

Expected results:

Cluster should build smoothly :)

Additional info:

You can work around this problem easily by restarting glusterd on the host that shows the affected peer. The problem is that the state transition should happen automatically...
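
To find which peer is affected before applying the workaround, the output of 'gluster peer status' can be scanned for the stuck state. The following is only a minimal sketch in Python: it assumes the Hostname/Port/Uuid/State layout that appears in the peer status output quoted elsewhere in this report, and it assumes the state text begins with "Accepted peer request" as in the bug title. The sample text, the host names in it and the function names are made up for illustration.

```python
# Assumption: the state text starts with this string (taken from the bug title).
STUCK_STATE = "accepted peer request"


def parse_peer_status(text):
    """Turn `gluster peer status` style output into one dict per peer."""
    peers = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "Key: value"
        key, value = key.strip().lower(), value.strip()
        if key == "hostname":
            peers.append({})  # every peer entry starts with Hostname
        if peers:
            peers[-1][key] = value
    return peers


def stuck_peers(text):
    """Return hostnames of peers reported in the stuck state."""
    return [
        peer["hostname"]
        for peer in parse_peer_status(text)
        if peer.get("state", "").lower().startswith(STUCK_STATE)
    ]


# Hypothetical sample, not captured from a real cluster.
SAMPLE = """\
Number of Peers: 2

Hostname: host-a
Port: 24007
Uuid: 00000000-0000-0000-0000-00000000000a
State: Peer in Cluster (Connected)

Hostname: host-b
Port: 24007
Uuid: 00000000-0000-0000-0000-00000000000b
State: Accepted peer request (Connected)
"""

print(stuck_peers(SAMPLE))
```

If the list is non-empty on a given host, that host is the one on which glusterd would be restarted according to the workaround above.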

Comment 2 Joonas Vilenius 2014-12-30 12:56:49 UTC

Possibly related, so I'll provide my story, but if needed I'll either submit a new bug or move the comment to a more appropriate existing bug.

Since we upgraded to 3.4.5 we haven't added any peers to the existing cluster. Now that we are attempting it, it seems to always fail.

Some details for starters:
- Debian Wheezy (amd64)
- 12 existing servers in the cluster
- 4 new servers in another datacenter
- network expanded between the datacenters
- GlusterFS 3.4.5-1 from Debian packages

1. gluster peer status reports 11 peers, all in state Peer in Cluster (Connected)
2. host gluster13 is empty, glusterfs installed and started, minimal amount of data under /var/lib/glusterd
3. gluster peer probe gluster13 from gluster01 reports "peer probe: success"
4. gluster peer status run on gluster01 reports now for the added host:

Hostname: gluster13
Port: 24007
Uuid: 2902f0a9-73ba-48ea-a185-e2a94799ac3b
State: Peer Rejected (Connected)

5. gluster peer status run on gluster13 reports:

Hostname: 10.10.30.101
Port: 24007
Uuid: 0a684cf6-ae7c-44a5-b7a4-16173f311e45
State: Peer Rejected (Connected)

6. /var/lib/glusterd/peers/0a684cf6-ae7c-44a5-b7a4-16173f311e45 contains:

uuid=0a684cf6-ae7c-44a5-b7a4-16173f311e45
state=6
hostname1=10.10.30.101

7. on gluster13 the /var/log/glusterfs/etc-glusterfs-glusterd.vol.log contains:

[2014-12-30 12:26:19.748973] I [glusterd-rpc-ops.c:225:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 0a684cf6-ae7c-44a5-b7a4-16173f311e45, host: 10.10.30.101
[2014-12-30 12:26:19.807746] I [glusterd-rpc-ops.c:295:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2014-12-30 12:26:19.807838] I [glusterd-rpc-ops.c:345:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 0a684cf6-ae7c-44a5-b7a4-16173f311e45, host: 10.10.30.101, port: 0
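
The peer file shown in step 6 is plain key=value text, so the stored state can be read without going through the CLI. The following is a rough sketch in Python, not glusterd code. The state-to-name mapping is an assumption inferred only from this report (state 4 is described as "accepted peer request" in the original description, and state=6 appears here alongside "Peer Rejected"); other values are left unnamed. The function names are made up for illustration.

```python
# Assumption: mapping inferred from this bug report only, deliberately incomplete.
STATE_NAMES = {
    4: "Accepted peer request",
    6: "Peer Rejected",
}


def parse_peer_file(text):
    """Parse the key=value lines of a /var/lib/glusterd/peers/<uuid> file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_state(fields):
    """Return a readable name for the numeric state, if known."""
    code = int(fields["state"])
    return STATE_NAMES.get(code, f"unknown state {code}")


# Contents of the peer file quoted in step 6.
EXAMPLE = """\
uuid=0a684cf6-ae7c-44a5-b7a4-16173f311e45
state=6
hostname1=10.10.30.101
"""

fields = parse_peer_file(EXAMPLE)
print(fields["hostname1"], describe_state(fields))
```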

At this point I am puzzled. I'm not doing anything differently than with the existing 12 hosts. There is no firewall in between, and the hosts are on the same network.

I can see all the existing volumes and some details of them from gluster13 even while I'm rejected from the trusted peers, but I cannot, for example, create a new volume on the new hosts.

If I probe from gluster13 to gluster15 (13-16 are the new hosts) they get peered without issues!?

As said, I'm puzzled: where does the reject come from?

Comment 3 Joonas Vilenius 2014-12-30 15:01:58 UTC

Never mind the above; I had to do some (dummy) operation on each volume to get the peers to connect.

Comment 4 krishnan parthasarathi 2015-04-16 04:42:51 UTC

Root cause analysis
--------------------

The following sequence of events leads to the issue observed.

Let us take 4 nodes, namely A, B, C and D for forming a cluster with them.
- From A, probe B.
- After A and B are part of the cluster, say B goes offline.
- From A, probe C.
- From A, probe D.
- After C and D are part of the cluster, say B comes online.

At this point, C and D share their view of the cluster with B, as part of
glusterd's handshake algorithm. This is to ensure that the members' views of
the cluster are consistent. If this happens before A informs B of the addition
of C and D to the cluster, B would reject requests from C and D as 'illegal'
(i.e., out of cluster). This would result in C and D seeing B in "Accepted Peer
Request" state, due to a bug in the internal state machine transitions that
didn't anticipate this sequence of events.
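
The ordering dependence described above can be illustrated with a toy model in Python. This is only a sketch of the race, not glusterd's actual state machine or handshake code: each node is reduced to the set of members it knows about, a request from an unknown sender is rejected, and a rejected sender is left seeing B as "Accepted Peer Request". All function and variable names are made up for illustration.

```python
def build_views():
    """Replay the probe sequence and return each node's view of the cluster."""
    known = {name: {name} for name in "ABCD"}
    online = set("ABCD")

    def probe(origin, target):
        known[origin].add(target)
        # Every online member that origin knows about learns origin's view.
        for member in known[origin] & online:
            known[member] |= known[origin]

    probe("A", "B")
    online.discard("B")   # B goes offline and misses later updates
    probe("A", "C")
    probe("A", "D")
    return known


def b_comes_online(order):
    """Handshake with B in the given order; return how each sender sees B."""
    known = build_views()
    seen_state_of_b = {}
    for sender in order:
        if sender in known["B"]:
            # B knows the sender, so it accepts and merges the sender's view.
            known["B"] |= known[sender]
            seen_state_of_b[sender] = "Peer in Cluster"
        else:
            # B treats the sender as out of cluster; the sender stays stuck.
            seen_state_of_b[sender] = "Accepted Peer Request"
    return seen_state_of_b


print(b_comes_online(["C", "D", "A"]))  # C and D reach B before A does
print(b_comes_online(["A", "C", "D"]))  # A updates B first
```

In this model, C and D end up stuck only when they contact B before A has told B about them, which is the sequence the analysis identifies.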

Analogy
--------

Imagine 4 like-minded people, namely A, B, C and D, who register for a
conference. Only A and B make it and become friends. A meets C and D on a
different occasion, where B isn't present, and they become friends. A tells C
and D about B. C and D, being their enthusiastic selves, introduce themselves
to B when A isn't present. B doesn't entertain C and D since she doesn't know
them. Later, A informs B about C and D, but by then it is too late.

N.B. This analogy is an aid to explain the internal algorithm at a high level.
Like all analogies, this is bound to break soon.

Comment 6 Amar Tumballi 2018-08-29 03:54:03 UTC

It has been a long time since there was any activity on this bug. We have either fixed it already or it is mostly not critical anymore!

Please re-open the bug if the issue is burning for you, or you want to take the bug to closure with fixes.
