Bug 1030660 - [glusterd] On the newly probed nodes, "gluster peer status" outputs the state of node as 'Accepted peer request', when that node was offline (glusterd stopped) during peer probing and came online (glusterd started) thereafter
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Nagaprasad Sathyanarayana
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-14 21:24 UTC by SATHEESARAN
Modified: 2016-02-18 00:21 UTC
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-03 17:18:47 UTC
Embargoed:



Description SATHEESARAN 2013-11-14 21:24:13 UTC
Description of problem:
In a 2-node cluster, one of the nodes was taken offline (glusterd stopped) and two more nodes were then probed into the cluster. When the offline node came back up (glusterd started), "gluster peer status" on one of the newly probed peers (sometimes both) showed the node that had come back online in the state 'Accepted peer request'.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44rhs-1

How reproducible:
5/6 times

Steps to Reproduce:
1. Create 4 VMs with RHS 2.1 installed

2. Update all RHS 2.1 nodes to RHS 2.1 U1 ( glusterfs-3.4.0.44rhs-1 )

3. From node1 peer probe node2
(i.e) gluster peer probe <node2>

4. Now stop glusterd in node2
(i.e) service glusterd stop

5. Check the peer status from node1. node2 should be rendered as "Disconnected", as glusterd was down
(i.e) gluster peer status

6. Probe for 2 more nodes (node3, node4) from node1
(i.e) gluster peer probe <node3>
      gluster peer probe <node4>

7. Check "gluster peer status" on all the nodes except node2, as glusterd is down on that node
(i.e) gluster peer status

NOTE: all the nodes will show node2 as Disconnected, as glusterd is stopped on that node

8. Bring up glusterd in node2
(i.e) service glusterd start

9. Check the peer status on the newly probed nodes [node3 & node4]
(i.e) gluster peer status
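
A consolidated shell sketch of the reproduction steps above, run from node1 [10.70.37.180] (the other IPs are the test nodes listed under 'Additional info'; root ssh access to them is assumed here purely for illustration):

NODE2=10.70.37.182; NODE3=10.70.37.131; NODE4=10.70.37.116

gluster peer probe $NODE2                   # step 3
ssh root@$NODE2 'service glusterd stop'     # step 4
gluster peer status                         # step 5: node2 shown as Disconnected
gluster peer probe $NODE3                   # step 6
gluster peer probe $NODE4                   # step 6
ssh root@$NODE3 'gluster peer status'       # step 7: node2 still Disconnected
ssh root@$NODE2 'service glusterd start'    # step 8
ssh root@$NODE3 'gluster peer status'       # step 9: node2 frequently shows as
ssh root@$NODE4 'gluster peer status'       #         'Accepted peer request'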

Actual results:
Newly probed nodes (node3 & node4) show node2 with the state 'Accepted peer request' in the 'gluster peer status' output

Expected results:
All nodes should be in the 'Peer in Cluster (Connected)' state

Additional info:
1. RHS Nodes - 10.70.37.{180,182,131,116}
2. Initially 10.70.37.{180,182} were in cluster [Trusted storage pool]
3. glusterd in 10.70.37.182 was brought down
4. New hosts 10.70.37.{131,116} are probed from 10.70.37.180

5. Console logs from all the RHS Nodes
======================================
1. RHS NODE-1 [10.70.37.180]
=============================
[Thu Nov 14 20:09:54 UTC 2013 satheesaran@unused:~ ] # ssh root.37.180
Last login: Thu Nov 14 14:42:48 2013 from vpn-56-5.rdu2.redhat.com
[Thu Nov 14 20:14:08 UTC 2013 root.37.180:~ ] # gluster peer probe 10.70.37.182
peer probe: success. 
[Thu Nov 14 20:14:25 UTC 2013 root.37.180:~ ] # gluster peer status
Number of Peers: 1

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Peer in Cluster (Connected)
[Thu Nov 14 20:14:52 UTC 2013 root.37.180:~ ] # gluster peer status
Number of Peers: 1

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Peer in Cluster (Disconnected)
[Thu Nov 14 20:15:26 UTC 2013 root.37.180:~ ] # gluster peer probe 10.70.37.131
peer probe: success. 
[Thu Nov 14 20:15:44 UTC 2013 root.37.180:~ ] # gluster peer probe 10.70.37.116
peer probe: success. 
[Thu Nov 14 20:15:52 UTC 2013 root.37.180:~ ] # gluster peer status
Number of Peers: 3

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Peer in Cluster (Disconnected)

Hostname: 10.70.37.131
Uuid: 1dd789ea-3552-47da-a61e-792f1b5cf2e7
State: Peer in Cluster (Connected)

Hostname: 10.70.37.116
Uuid: 16ac6c31-7bfa-47cc-ae4a-ed6647c80f58
State: Peer in Cluster (Connected)

2. RHS NODE-2 [10.70.37.182]
============================
[Thu Nov 14 20:14:07 UTC 2013 root.37.182:~ ] # service glusterd stop
Stopping glusterd:                                         [  OK  ]
[Thu Nov 14 20:15:04 UTC 2013 root.37.182:~ ] # service glusterd start
Starting glusterd:                                         [  OK  ]

3. RHS NODE-3 [10.70.37.131]
============================
[Thu Nov 14 20:14:07 UTC 2013 root.37.131:~ ] # gluster peer status
Number of Peers: 3

Hostname: 10.70.37.180
Uuid: de489372-69aa-4f61-8604-5a722d396aa8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Peer in Cluster (Disconnected)

Hostname: 10.70.37.116
Uuid: 16ac6c31-7bfa-47cc-ae4a-ed6647c80f58
State: Peer in Cluster (Connected)
[Thu Nov 14 20:16:04 UTC 2013 root.37.131:~ ] # gluster peer status
Number of Peers: 3

Hostname: 10.70.37.180
Uuid: de489372-69aa-4f61-8604-5a722d396aa8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Accepted peer request (Connected)

Hostname: 10.70.37.116
Uuid: 16ac6c31-7bfa-47cc-ae4a-ed6647c80f58
State: Peer in Cluster (Connected)

4. RHS NODE-4 [10.70.37.116]
=============================
[Thu Nov 14 20:14:08 UTC 2013 root.37.116:~ ] # gluster peer status
Number of Peers: 3

Hostname: 10.70.37.180
Uuid: de489372-69aa-4f61-8604-5a722d396aa8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Peer in Cluster (Disconnected)

Hostname: 10.70.37.131
Uuid: 1dd789ea-3552-47da-a61e-792f1b5cf2e7
State: Peer in Cluster (Connected)
[Thu Nov 14 20:16:12 UTC 2013 root.37.116:~ ] # gluster peer status
Number of Peers: 3

Hostname: 10.70.37.180
Uuid: de489372-69aa-4f61-8604-5a722d396aa8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.182
Uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3
State: Accepted peer request (Connected)

Hostname: 10.70.37.131
Uuid: 1dd789ea-3552-47da-a61e-792f1b5cf2e7
State: Peer in Cluster (Connected)

6. Interesting glusterd log file snippet
========================================
A. Snip from glusterd log file [/var/log/glusterfs/etc-glusterfs-glusterd.vol.log] from 10.70.37.116
---------------------------------------------------------------------
[2013-11-14 20:15:52.200772] I [glusterd-handler.c:2244:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2013-11-14 20:16:12.361447] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-11-14 20:16:31.292529] I [glusterd-rpc-ops.c:357:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3, host: 10.70.37.182, port: 24007

B. Snip from glusterd log file [/var/log/glusterfs/etc-glusterfs-glusterd.vol.log] from 10.70.37.131
---------------------------------------------------------------------
[2013-11-14 20:16:04.220409] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-11-14 20:16:30.574948] I [glusterd-rpc-ops.c:357:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 80126e8e-ddf5-492c-8e6d-c87a053a04c3, host: 10.70.37.182, port: 24007

Comment 2 Vivek Agarwal 2014-02-20 08:36:49 UTC
adding 3.0 flag and removing 2.1.z

Comment 3 James (purpleidea) 2014-05-05 21:14:09 UTC
This looks like a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=1051992

Note that Puppet-Gluster detects this issue, and includes a workaround for it.

Comment 5 krishnan parthasarathi 2015-04-16 04:41:16 UTC
Root cause analysis
--------------------

The following sequence of events leads to the issue observed.

Let us take 4 nodes, namely A, B, C and D, which will form a cluster.
- From A, probe B.
- After A and B are part of the cluster, say B goes offline.
- From A, probe C.
- From A, probe D.
- After C and D are part of the cluster, say B comes online.

At this point, C and D share their view of the cluster with B, as part of
glusterd's handshake algorithm. This is to ensure that the members' views of the
cluster are consistent. If this happens before A informs B of the addition of
C and D to the cluster, B rejects the requests from C and D as 'illegal' (i.e.,
from outside the cluster). This results in C and D seeing B in the "Accepted Peer
Request" state, due to a bug in the internal state machine transitions that
did not anticipate this sequence of events.
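
The quickest way to confirm this sequence on C or D (here, 10.70.37.131 and 10.70.37.116) is to look for the rejection that B sends back when C and D introduce themselves, as captured in the log snippets quoted in the description:

grep "Received RJT" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
# e.g. __glusterd_friend_add_cbk 0-glusterd: Received RJT from uuid: 80126e8e-..., host: 10.70.37.182, port: 24007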

Analogy
-------

Imagine 4 like-minded people, namely A, B, C and D, who register for a
conference. Only A and B make it, and they become friends. A meets C and D on a
different occasion where B isn't present, and they become friends too. A
introduces B to C and D. C and D, being their enthusiastic selves, introduce
themselves to B when A isn't present. B doesn't entertain C and D, since she
doesn't know them. Later, A informs B about C and D, but by then it is too late.

N.B. This analogy is only an aid to explain the internal algorithm at a high
level. Like all analogies, it is bound to break down at some point.

Comment 8 Vivek Agarwal 2015-12-03 17:18:47 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.

