Description of problem:
I have a 1 x (2 + 1) = 3 volume.  One host in the cluster is reporting that one of the peers is down, but the other two hosts don't show any problems.

[root@store01 ~]# gluster volume status shardvol1
Status of volume: shardvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick store01:/srv/gluster/shardbrick1      49153     0          Y       23039
Brick store03:/srv/gluster/shardbrick1      49153     0          Y       7120
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       27426
NFS Server on store03                       N/A       N/A        N       N/A
Self-heal Daemon on store03                 N/A       N/A        Y       31088

[root@store02 ~]# gluster volume status shardvol1
Status of volume: shardvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick store01:/srv/gluster/shardbrick1      49153     0          Y       23039
Brick store02:/srv/gluster/shardbrick1      49153     0          Y       5660
Brick store03:/srv/gluster/shardbrick1      49153     0          Y       7120
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       31843
NFS Server on store03                       N/A       N/A        N       N/A
Self-heal Daemon on store03                 N/A       N/A        Y       31088
NFS Server on store01                       N/A       N/A        N       N/A
Self-heal Daemon on store01                 N/A       N/A        Y       27426

[root@store03 ~]# gluster volume status shardvol1
Status of volume: shardvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick store01:/srv/gluster/shardbrick1      49153     0          Y       23039
Brick store02:/srv/gluster/shardbrick1      49153     0          Y       5660
Brick store03:/srv/gluster/shardbrick1      49153     0          Y       7120
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       31088
NFS Server on store01                       N/A       N/A        N       N/A
Self-heal Daemon on store01                 N/A       N/A        Y       27426
NFS Server on store02                       N/A       N/A        N       N/A
Self-heal Daemon on store02                 N/A       N/A        Y       31843

Note that store01's view of the volume is missing the store02 brick entirely.

[root@store01 ~]# gluster peer status
Number of Peers: 2

Hostname: store02
Uuid: 97d6abee-f2ac-47d8-bb96-738ffb99b38f
State: Peer in Cluster (Disconnected)

Hostname: store03
Uuid: 2cbf1e99-4fb7-410d-b1f1-e385000b20ec
State: Peer in Cluster (Connected)

[root@store02 ~]# gluster peer status
Number of Peers: 2

Hostname: store01
Uuid: 7a931ab9-6075-4fd4-868d-9deeb91295c0
State: Peer in Cluster (Connected)

Hostname: store03
Uuid: 2cbf1e99-4fb7-410d-b1f1-e385000b20ec
State: Peer in Cluster (Connected)

[root@store03 ~]# gluster peer status
Number of Peers: 2

Hostname: store01
Uuid: 7a931ab9-6075-4fd4-868d-9deeb91295c0
State: Peer in Cluster (Connected)

Hostname: store02
Uuid: 97d6abee-f2ac-47d8-bb96-738ffb99b38f
State: Peer in Cluster (Connected)

Version-Release number of selected component (if applicable):
glusterfs 3.8.6

How reproducible:
Probably not very

Steps to Reproduce:
1. Create a 1 x (2 + 1) = 3 volume
2. ???

Additional info:
I've uploaded gluster sosreports from each of the hosts here:
http://drop.ceph.com/qa/dgalloway/

As far as I can tell, store02 started showing as disconnected from store01 today.  No changes (iptables or otherwise) were made to the gluster hosts.  I attempted a 'service glusterd restart' on store01 and store02 without improvement.
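For anyone trying to fill in the reproduction steps: a 1 x (2 + 1) volume of this shape (two data bricks plus one arbiter) would be created roughly as follows. This is a sketch using the hostnames and brick paths from this report, not the exact commands originally run:

    gluster peer probe store02
    gluster peer probe store03
    gluster volume create shardvol1 replica 3 arbiter 1 \
        store01:/srv/gluster/shardbrick1 \
        store02:/srv/gluster/shardbrick1 \
        store03:/srv/gluster/shardbrick1
    gluster volume start shardvol1

The 'replica 3 arbiter 1' syntax makes the third brick a metadata-only arbiter, which is what the "2 + 1" notation refers to.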
We had seen this issue a long time back but couldn't root-cause it at the time. We were able to debug it to the point of finding a discrepancy between the socket and application layers: the application layer kept retrying the connection while the underlying socket was already connected, so every reconnect attempt failed. The good news is that we recently managed to debug and root-cause this issue. It has now been fixed in upstream mainline, and the patch http://review.gluster.org/#/c/16025 has been backported to the release-3.8 branch, so it should go into the next 3.8.x release.
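To make the symptom of that race concrete: the kernel may show an established TCP connection on the management port while glusterd's application layer still considers the peer disconnected. A quick way to compare the two views on an affected node (a sketch, assuming the default management port 24007):

    # Kernel's view: established connections on glusterd's port
    ss -tn state established '( dport = :24007 or sport = :24007 )'

    # Application layer's view
    gluster peer status

If ss shows an established connection to a peer that 'gluster peer status' reports as Disconnected, the node is likely hitting this race.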
Thanks for the quick response! Am I at any risk continuing to run my cluster in this state? In other words, are the peers really disconnected, meaning replication is not happening, or is the 'peer status' output bogus?
No, it's a race; restarting the glusterd instance on the node where the peer shows as disconnected should bring everything back to normal.
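For reference, the suggested recovery would look something like this on the affected node (the exact service command depends on the init system; this report uses 'service'):

    service glusterd restart    # or: systemctl restart glusterd
    gluster peer status         # expect: State: Peer in Cluster (Connected)

Restarting glusterd only restarts the management daemon; brick processes and client mounts are not interrupted by it.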
(In reply to Atin Mukherjee from comment #3)
> No, it's a race; restarting the glusterd instance on the node where the
> peer shows as disconnected should bring everything back to normal.

It doesn't, though.  I restarted glusterd on store01 and store02 multiple times.
(In reply to David Galloway from comment #4)
> It doesn't, though.  I restarted glusterd on store01 and store02 multiple
> times.

In that case, my initial suspicion doesn't hold true.  Is there any firewall setting preventing the connection from going through from store01 to store02?
(In reply to Atin Mukherjee from comment #5)
> In that case, my initial suspicion doesn't hold true.  Is there any
> firewall setting preventing the connection from going through from store01
> to store02?

Relevant 'iptables -L' output:

store01
-------
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24007
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24009
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:38465:38468
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:49152:49162

store02
-------
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24007
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpt:24009
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:38465:38467
ACCEPT     tcp  --  172.21.0.0/20   anywhere   tcp dpts:49152:49162
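Those rules do accept TCP 24007 from 172.21.0.0/20 on both nodes, so the ACCEPT rules themselves look fine. One way to verify beyond reading the rules is to probe the management port in both directions and inspect the rule counters (a sketch; assumes nc is installed):

    # From store01:
    nc -zvw5 store02 24007

    # From store02:
    nc -zvw5 store01 24007

    # On each node, show chains with packet counters and numeric addresses:
    iptables -L -n -v

Note that 'iptables -L' without '-v' hides the interface and counter columns, and doesn't make the chain's default policy obvious if no rule matches.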
David, I wanted to have a look at the sosreport but was unable to access it due to insufficient permissions. In the meantime, can you check whether glusterd has seen any disconnect events? If that's the case, you should see a 'peer X is disconnected' type of entry in the glusterd log. Otherwise, could you restart glusterd with debug logging enabled (kill glusterd and then bring it back with 'glusterd -LDEBUG') and share the glusterd logs from both nodes?
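Spelled out, that procedure would look roughly like this on each node (the log path assumes the default /var/log/glusterfs location; adjust if your installation logs elsewhere):

    # 1. Check the existing log for peer disconnect events:
    grep -i 'disconnected' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

    # 2. If nothing conclusive, restart glusterd at debug verbosity:
    pkill glusterd
    glusterd -LDEBUG

    # 3. Reproduce the problem, then collect the same log file from both nodes.

Starting 'glusterd -LDEBUG' this way bypasses the init system, so remember to restart glusterd normally (e.g. 'service glusterd restart') once the logs are collected.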
I fixed the file permissions on the sosreports; sorry about that. We had a network glitch on 2016-12-03, so there are a few entries about peers disconnecting then, as well as from when I tried restarting glusterd on 2016-12-06. Let me know if you still need the debug logging after checking out the sosreports.
Do you still need debug logs, or can you tell from the sosreports what happened?
I just logged in to store01 and all peers are showing as connected. Nothing has changed as far as firewalls or configs on my end. I uploaded new sosreports from store01 and store02 to http://drop.ceph.com/qa/dgalloway/ if you want to investigate.
Sorry for the delay, I'll have a look at the logs and get back.
[2016-12-06 23:40:10.394118] E [socket.c:2309:socket_connect_finish] 0-management: connection to 172.21.0.9:24007 failed (Connection timed out)
[2016-12-06 23:40:10.394317] I [MSGID: 106004] [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer <store02> (<97d6abee-f2ac-47d8-bb96-738ffb99b38f>), in state <Peer in Cluster>, has disconnected from glusterd.

The above glusterd log snippet from store01 convinces me that there was a connectivity issue between store01 and store02 (172.21.0.9). Doesn't restarting all the glusterd instances solve this issue?
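To correlate those timeouts with the network glitch mentioned in comment #8, one could pull every connect failure and peer-disconnect event, with timestamps, from both nodes' glusterd logs, e.g. (again assuming the default log path):

    grep -E 'socket_connect_finish|has disconnected from glusterd' \
        /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

If the failures cluster around 2016-12-03 and then stop on their own, that points at a transient network problem rather than the glusterd reconnect race described in comment #1.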
(In reply to Atin Mukherjee from comment #12)
> The above glusterd log snippet from store01 convinces me that there was a
> connectivity issue between store01 and store02 (172.21.0.9).
>
> Doesn't restarting all the glusterd instances solve this issue?

It didn't, no.  But all the peers are showing as connected now without any intervention on my part, so I'm not sure this bug is valid anymore.
I am closing this BZ based on comment 13; please reopen if the issue is seen again.