1300241 – `gluster peer status' on one node shows one peer as disconnected but it appears to be connected to other peers

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	rhgs-3.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Atin Mukherjee
QA Contact:	SATHEESARAN
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-01-20 10:38 UTC by Shruti Sampat
Modified:	2016-09-17 16:48 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-02-19 11:37:28 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1304274	1	None	None	None	2024-09-18 00:44:28 UTC

Internal Links: 1304274

Description Shruti Sampat 2016-01-20 10:38:41 UTC

Description of problem:
-----------------------

On a 4-node cluster, `gluster peer status' on one of the nodes shows another peer as disconnected, but the "disconnected" peer seems to be connected to other peers and has glusterd running on it.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7.5-14.el7rhgs.x86_64

How reproducible:
-----------------
Only observed once

Steps to Reproduce:
-------------------
I am not sure about the exact steps to reproduce the issue, but glusterd was restarted a few times on the peer that seems to be disconnected.

Actual results:
---------------
One peer seems to be disconnected from only one of the nodes in the cluster, whereas actually it has glusterd running and appears to be connected to other peers.

Expected results:
-----------------
The state of the peer should be consistent across the cluster. Since glusterd is running on the node without any issues, it should be connected to all other nodes.

Comment 2 Atin Mukherjee 2016-01-20 10:47:00 UTC

After debugging the leftover setup, I found that rpc_clnt_reconnect was being triggered every 3 seconds, which means the node was trying to re-establish the connection with the disconnected peer. However, socket_connect () was returning a failure saying that the underlying transport is already connected. Basically, the code expects the socket to be set to -1, but it is set to 17.
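
To illustrate why a leftover descriptor makes every retry fail, here is a toy model in Python. It is a sketch under stated assumptions, not glusterfs code: the class ToyTransport, its methods, and reconnect_attempts are all hypothetical, and the only behaviour borrowed from the comment above is that a connect attempt is refused whenever the recorded socket is not -1.

```python
class ToyTransport:
    """Hypothetical stand-in for a client transport; not glusterfs code."""

    def __init__(self):
        self.sock = -1          # -1 means "no socket open"
        self.connected = False  # what the upper layer believes

    def connect(self):
        # Guard modelled on the behaviour described in the comment:
        # refuse to connect while a descriptor is still recorded.
        if self.sock != -1:
            return False
        self.sock = 17          # arbitrary descriptor number for the demo
        self.connected = True
        return True

    def disconnect(self, reset_sock=True):
        # reset_sock=False models a disconnect that never cleared the
        # descriptor (an assumption made only for this illustration).
        self.connected = False
        if reset_sock:
            self.sock = -1


def reconnect_attempts(transport, tries):
    """Return the result of each periodic reconnect attempt."""
    return [transport.connect() for _ in range(tries)]


healthy = ToyTransport()
healthy.connect()
healthy.disconnect()
healthy_results = reconnect_attempts(healthy, 3)  # first retry succeeds

stuck = ToyTransport()
stuck.connect()
stuck.disconnect(reset_sock=False)                # descriptor left at 17
stuck_results = reconnect_attempts(stuck, 3)      # every retry is refused
```

In the stuck case the peer stays marked as disconnected no matter how often the timer fires, which matches the symptom seen on the one node.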

Comment 4 Atin Mukherjee 2016-01-21 03:39:09 UTC

We tried to debug this problem further but could not conclude anything concrete. However, it seems that the (DIS)CONNECT events raced, because of which the notifyfn to the upper layer was never called. We'd need to add a few logs to the code and see whether we can reproduce it to get to the RCA. By no means is it a blocker for RHGS 3.1.2.

Comment 6 Atin Mukherjee 2016-02-19 08:13:35 UTC

After going through the log files once again, it seems that multi-threaded epoll was enabled for GlusterD in the setup. Ideally, once GlusterD starts up, the final graph would look like this, followed by an INFO log:

+------------------------------------------------------------------------------+
  1: volume management                                                               
  2:     type mgmt/glusterd                                                          
  3:     option rpc-auth.auth-glusterfs on                                           
  4:     option rpc-auth.auth-unix on                                                
  5:     option rpc-auth.auth-null on                                                
  6:     option transport.socket.listen-backlog 128                                  
  7:     option rpc-auth-allow-insecure on                                           
  8:     option event-threads 1                                                      
  9:     option ping-timeout 0                                                       
 10:     option transport.socket.read-fail-log off                                   
 11:     option transport.socket.keepalive-interval 2                                
 12:     option transport.socket.keepalive-time 10                                                                                    
 13:     option transport-type rdma                                                  
 14:     option working-directory /var/lib/glusterd                                  
 15: end-volume                                                                      
 16:                                                                                 
+------------------------------------------------------------------------------+
[2016-02-11 13:36:36.167635] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1

But here in the logs I can see the following:

Final graph:                                                                       
+------------------------------------------------------------------------------+
  1: volume management                                                             
  2:     type mgmt/glusterd                                                        
  3:     option rpc-auth.auth-glusterfs on                                         
  4:     option rpc-auth.auth-unix on                                              
  5:     option rpc-auth.auth-null on                                              
  6:     option transport.socket.listen-backlog 128                                
  7:     option rpc-auth-allow-insecure on                                         
  8:     option ping-timeout 0                                                     
  9:     option transport.socket.read-fail-log off                                 
 10:     option transport.socket.keepalive-interval 2                              
 11:     option transport.socket.keepalive-time 10                                 
 12:     option transport-type rdma                                                
 13:     option working-directory /var/lib/glusterd                                
 14: end-volume                                                                    
 15:                                                                               
+------------------------------------------------------------------------------+
[2016-01-19 12:38:35.428986] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-19 12:38:35.429108] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2

The above indicates that the thread count defaults to 2 if the glusterd.vol file doesn't have the event-threads option set.

Since this is not a supported configuration we'd need to close this bug.
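
The two startup excerpts above differ in how many "Started thread with index N" lines follow the graph. A minimal Python sketch, assuming the log line format shown above, for listing the distinct epoll worker indices in a log excerpt; the function name epoll_worker_indices is hypothetical and not part of gluster.

```python
import re

# Matches the INFO line emitted when an epoll worker thread starts,
# in the format quoted in this bug.
WORKER_RE = re.compile(
    r"event_dispatch_epoll_worker\] 0-epoll: Started thread with index (\d+)"
)


def epoll_worker_indices(log_text):
    """Return the sorted, distinct worker indices found in log_text."""
    return sorted({int(m.group(1)) for m in WORKER_RE.finditer(log_text)})


# Log lines copied from the excerpts in this comment.
expected_log = (
    "[2016-02-11 13:36:36.167635] I [MSGID: 101190] "
    "[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: "
    "Started thread with index 1\n"
)
observed_log = (
    "[2016-01-19 12:38:35.428986] I [MSGID: 101190] "
    "[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: "
    "Started thread with index 1\n"
    "[2016-01-19 12:38:35.429108] I [MSGID: 101190] "
    "[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: "
    "Started thread with index 2\n"
)

expected_indices = epoll_worker_indices(expected_log)
observed_indices = epoll_worker_indices(observed_log)
```

More than one index in the result suggests glusterd started with multi-threaded epoll. A log that spans several restarts would need to be split per startup first, since this sketch only collects distinct indices.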

Comment 7 SATHEESARAN 2016-02-19 08:42:26 UTC

(In reply to Atin Mukherjee from comment #6)
> Since this is not a supported configuration we'd need to close this bug.

I did an upgrade test from RHGS 3.0 to RHGS 3.1.0, and here are the results:

1. In RHGS 3.0, glusterd runs with a single epoll thread only
2. In RHGS 3.0, there is no option available to configure the number of epoll threads for glusterd

After upgrade to RHGS 3.1.0, there are no changes to glusterd volfile. The above observation holds true in RHGS 3.1.0 too.

After upgrade to RHGS 3.1.1, there is a change in the glusterd volfile: it now contains 'event-threads 1', which makes glusterd start with only 1 epoll thread.
If this option is removed, then glusterd starts with 2 epoll threads.

We haven't really reached a conclusion until we know how the 'event-threads 1' option got removed from the glusterd volfile (in glusterfs-3.7.5-14.el7rhgs).
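
A quick way to check a node for this condition is to look for the option in the volfile text or in the graph dump from the log. The Python sketch below assumes the "option event-threads N" syntax shown in the graphs above; the function name event_threads is hypothetical, and reading the file from disk is left out.

```python
def event_threads(volfile_text):
    """Return the event-threads value from volfile text, or None if unset."""
    for line in volfile_text.splitlines():
        words = line.split()
        # Graph dumps in the log prefix each line with "N:"; drop that token.
        if words and words[0].endswith(":") and words[0][:-1].isdigit():
            words = words[1:]
        if len(words) == 3 and words[:2] == ["option", "event-threads"]:
            return int(words[2])
    return None


# Fragments copied from the two graphs quoted in comment 6.
with_option = """
  7:     option rpc-auth-allow-insecure on
  8:     option event-threads 1
  9:     option ping-timeout 0
"""
without_option = """
  7:     option rpc-auth-allow-insecure on
  8:     option ping-timeout 0
"""

configured = event_threads(with_option)
missing = event_threads(without_option)
```

A result of None means the option is absent, in which case, per the observations above, glusterd starts with 2 epoll threads.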

Comment 8 Atin Mukherjee 2016-02-19 11:37:28 UTC

As per the discussion with Shruti, it seems that the glusterd.vol file didn't have the event-threads option configured because, in this container setup, the glusterd.vol file was shared and came from a different build. Hence closing this bug.
