Bug 1051992
Summary: | Peer stuck on "accepted peer request" | | |
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | purpleidea |
Component: | glusterd | Assignee: | Kaushal <kaushal> |
Status: | CLOSED WONTFIX | Severity: | unspecified |
Priority: | unspecified | Version: | mainline |
CC: | bugs, purpleidea, rhbugzilla, sasundar, smohan | Doc Type: | Bug Fix |
Hardware: | Unspecified | OS: | Unspecified |
Last Closed: | 2018-08-29 03:54:03 UTC | Type: | Bug |
Description
purpleidea
2014-01-13 04:27:22 UTC
Possibly related, so I'll provide my story, but if needed I'll either submit a new bug or move the comment to some more appropriate existing bug.

Since we upgraded to 3.4.5 we haven't added any peers to the existing cluster. Now that it is attempted, it seems to always fail. Some details for starters:

- Debian Wheezy (amd64)
- 12 existing servers in the cluster
- 4 new servers in another datacenter
- network expanded between the datacenters
- GlusterFS 3.4.5-1 from Debian packages

1. gluster peer status reports 11 peers, all in state Peer in Cluster (Connected)
2. host gluster13 is empty, glusterfs installed and started, minimal amount of data under /var/lib/glusterd
3. gluster peer probe gluster13 from gluster01 reports "peer probe: success"
4. gluster peer status run on gluster01 now reports for the added host:

       Hostname: gluster13
       Port: 24007
       Uuid: 2902f0a9-73ba-48ea-a185-e2a94799ac3b
       State: Peer Rejected (Connected)

5. gluster peer status run on gluster13 reports:

       Hostname: 10.10.30.101
       Port: 24007
       Uuid: 0a684cf6-ae7c-44a5-b7a4-16173f311e45
       State: Peer Rejected (Connected)

6. /var/lib/glusterd/peers/0a684cf6-ae7c-44a5-b7a4-16173f311e45 contains:

       uuid=0a684cf6-ae7c-44a5-b7a4-16173f311e45
       state=6
       hostname1=10.10.30.101

7. on gluster13 the /var/log/glusterfs/etc-glusterfs-glusterd.vol.log contains:

       [2014-12-30 12:26:19.748973] I [glusterd-rpc-ops.c:225:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 0a684cf6-ae7c-44a5-b7a4-16173f311e45, host: 10.10.30.101
       [2014-12-30 12:26:19.807746] I [glusterd-rpc-ops.c:295:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
       [2014-12-30 12:26:19.807838] I [glusterd-rpc-ops.c:345:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 0a684cf6-ae7c-44a5-b7a4-16173f311e45, host: 10.10.30.101, port: 0

At this point I get puzzled. I'm not doing anything differently than with the existing 12 hosts. No firewall in between, same network.
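For anyone repeating steps 4 and 5 across many hosts, a small script can pick out peers that are not in the "Peer in Cluster" state. The following is only a sketch: the function names `parse_peers` and `unhealthy` are made up for illustration, the sample text is modelled on the fields quoted above, and the gluster02 entry with its UUID is a hypothetical placeholder, not output from the reporter's cluster.

```python
# Sample modelled on the `gluster peer status` fields quoted in this report.
# The gluster02 block (including its UUID) is a hypothetical placeholder.
SAMPLE = """\
Hostname: gluster02
Port: 24007
Uuid: 11111111-2222-3333-4444-555555555555
State: Peer in Cluster (Connected)

Hostname: gluster13
Port: 24007
Uuid: 2902f0a9-73ba-48ea-a185-e2a94799ac3b
State: Peer Rejected (Connected)
"""

FIELDS = ("Hostname", "Port", "Uuid", "State")


def parse_peers(text):
    """Split peer-status text into one dict per peer, keyed by field name."""
    peers, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Hostname" and current:
            # A new Hostname line starts the next peer block.
            peers.append(current)
            current = {}
        if key in FIELDS:
            current[key] = value
    if current:
        peers.append(current)
    return peers


def unhealthy(peers):
    """Return peers whose State does not start with 'Peer in Cluster'."""
    return [p for p in peers if not p.get("State", "").startswith("Peer in Cluster")]


if __name__ == "__main__":
    for peer in unhealthy(parse_peers(SAMPLE)):
        print(peer["Hostname"], "->", peer["State"])
```

In practice the text would come from running the command on each host rather than from a hard-coded sample; how to collect it is left open here.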
I can see all the existing volumes and some details of them from gluster13 even while I'm rejected from the trusted peers, but I cannot, for example, create a new volume on the new hosts. If I probe from gluster13 to gluster15 (13-16 are the new hosts) they get peered without issues!? As said, I'm puzzled: where does the reject come from?

Never mind the above, I had to do some (dummy) operation on each volume to get the peers to connect.

Root cause analysis
--------------------

The following sequence of events leads to the issue observed. Let us take 4 nodes, namely A, B, C and D, and form a cluster with them.

- From A, probe B.
- After A and B are part of the cluster, say B goes offline.
- From A, probe C.
- From A, probe D.
- After C and D are part of the cluster, say B comes online.

At this point, C and D share their view of the cluster with B, as part of glusterd's handshake algorithm. This is to ensure that the members' views of the cluster are consistent. If this happens before A informs B of the addition of C and D to the cluster, B rejects requests from C and D as 'illegal' (i.e., out of cluster). As a result, C and D see B in the "Accepted Peer Request" state, due to a bug in the internal state machine transitions, which didn't anticipate this sequence of events.

Analogy
--------

Imagine 4 like-minded people, namely A, B, C and D, who register for a conference. Only A and B make it and become friends. A meets C and D on a different occasion, where B isn't present, and they become friends. A tells C and D about B. C and D, being their enthusiastic selves, introduce themselves to B when A isn't present. B doesn't entertain C and D, since she doesn't know them. Later, A informs B about C and D, but by then it is too late.

N.B. This analogy is an aid to explain the internal algorithm at a high level. Like all analogies, it is bound to break down soon.

It has been a long time since there was any activity on this bug. We have either fixed it already or it is mostly not critical anymore!

Please re-open the bug if the issue is burning for you, or you want to take the bug to closure with fixes.
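
The ordering problem described in the root cause analysis can be illustrated with a toy model. This is a sketch under simplifying assumptions and not glusterd's actual code: `Node`, `probe` and `run` are hypothetical names, each node's "view" is just a set of member names, and the peer state machine itself is not modelled. It only shows that B's answer to C and D depends on whether A's update reaches B first.

```python
class Node:
    """Toy cluster member holding its own view of who is in the cluster."""

    def __init__(self, name):
        self.name = name
        self.known = {name}   # this node's view of cluster membership
        self.online = True

    def accepts(self, peer):
        """A handshake is accepted only if the peer is already in our view."""
        return peer.name in self.known


def probe(prober, new, cluster):
    """Toy 'peer probe': add `new` and update every member that is online."""
    members = [n for n in cluster if n.name in prober.known]
    new.known = set(prober.known) | {new.name}   # newcomer learns the full view
    for member in members:
        if member.online:                        # offline members miss the update
            member.known.add(new.name)


def run(a_syncs_first):
    """Replay the sequence from the analysis; return B's verdict on C and D."""
    a, b, c, d = (Node(n) for n in "ABCD")
    cluster = [a, b, c, d]
    probe(a, b, cluster)
    b.online = False          # B goes offline
    probe(a, c, cluster)
    probe(a, d, cluster)
    b.online = True           # B comes back
    if a_syncs_first:
        b.known |= a.known    # A informs B before C and D handshake
    return {peer.name: b.accepts(peer) for peer in (c, d)}


if __name__ == "__main__":
    print("C/D handshake first:", run(a_syncs_first=False))
    print("A syncs first:      ", run(a_syncs_first=True))
```

When C and D reach B first, B's view is still {A, B}, so both are rejected; when A's update arrives first, both are accepted. The real bug concerned how the rejected side then ended up in "Accepted Peer Request", which this model deliberately leaves out.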