1327590 – restarting a node in cluster shows peer status as Peer in Cluster (Disconnected) for ever on other nodes

Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Version:	rhgs-3.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Atin Mukherjee
QA Contact:	storage-qa-internal@redhat.com

Reported:	2016-04-15 12:58 UTC by Nag Pavan Chilakam
Modified:	2018-01-23 10:17 UTC
CC List:	4 users
Doc Type:	Bug Fix
Last Closed:	2016-04-17 17:23:01 UTC

Description Nag Pavan Chilakam 2016-04-15 12:58:04 UTC

Description of problem:
========================
In a cluster, if I restart one of the nodes, all the other nodes show the peer status of the restarted node as "State: Peer in Cluster (Disconnected)".

This state persists indefinitely.

On the rebooted node itself, all the other nodes are shown as connected, but I cannot perform any operation (such as gluster v delete) because glusterd is unable to communicate with its peers.

Version-Release number of selected component (if applicable):
=============================================
[root@network glusterfs]# rpm -qa|grep gluster
glusterfs-client-xlators-3.7.9-1.el7rhgs.x86_64
glusterfs-server-3.7.9-1.el7rhgs.x86_64
python-gluster-3.7.5-19.el7rhgs.noarch
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.9-1.el7rhgs.x86_64
glusterfs-api-3.7.9-1.el7rhgs.x86_64
glusterfs-cli-3.7.9-1.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-1.el7rhgs.x86_64
glusterfs-debuginfo-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
glusterfs-libs-3.7.9-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-1.el7rhgs.x86_64
glusterfs-rdma-3.7.9-1.el7rhgs.x86_64


How reproducible:
================
Always

Steps to Reproduce:
1. Had a 6-node cluster.
2. Created a 4+2 EC volume.
3. Rebooted one node, to check I/O behaviour while a node is down.

After the reboot, gluster peer status on any of the nodes that stayed up shows all peers as connected except the rebooted node, which is shown as disconnected:

[root@dhcp35-228 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp35-70.lab.eng.blr.redhat.com
Uuid: a1089307-8628-4927-8016-dbf8a5e25370
State: Peer in Cluster (Connected)
Other names:
10.70.35.70

Hostname: 10.70.35.71
Uuid: 64828171-ee80-4b2c-b0ad-4db040c77778
State: Peer in Cluster (Connected)

Hostname: 10.70.35.108
Uuid: 16c133d3-7e32-4518-84e3-59d65e3fae3b
State: Peer in Cluster (Connected)

Hostname: 10.70.35.209
Uuid: c27b1ffb-382c-4771-94a1-4d972fa4263f
State: Peer in Cluster (Disconnected) ---------------->rebooted node

Hostname: 10.70.35.140
Uuid: 5e69cedc-7070-4f6d-be92-5254fff17064
State: Peer in Cluster (Connected)
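When checking several nodes, the disconnected peer(s) can be picked out of output like the above with a small filter. This is a minimal sketch: `list_disconnected_peers` is a hypothetical helper name, and it assumes the format shown above, where each peer has a `Hostname:` line followed later by its `State:` line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `gluster peer status` text on stdin and prints
# the Hostname of every peer whose State line contains "(Disconnected)".
list_disconnected_peers() {
  awk '
    # Remember the most recent Hostname seen.
    /^Hostname:/ { host = $2 }
    # Print it when that peer'"'"'s State line reports it as disconnected.
    /^State:/ && index($0, "(Disconnected)") { print host }
  '
}

# Usage on a node that stayed up:
#   gluster peer status | list_disconnected_peers
```

Using `index()` rather than a regex avoids having to escape the parentheses, and any trailing annotation on the State line (as in the paste above) does not affect the match.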

Comment 4 SATHEESARAN 2016-04-17 17:23:01 UTC

From the sosreport of the node that went down (10.70.35.209), I observed that iptables rules are enabled and that no rule allowing the glusterd port (24007) was added.

On the nodes that stayed up, the iptables rules had simply been flushed instead of the glusterd ports being opened.

So the RCA for this issue is: the trusted storage pool was formed after flushing the iptables rules on all the nodes, and after the reboot these rules came back into effect on the rebooted node, disconnecting it from the other nodes.
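Whether glusterd's port will be open after a reboot can be checked roughly against the persistent rules file. This is a sketch under stated assumptions: it assumes the node uses iptables-services, whose saved rules on RHEL 7 normally live in `/etc/sysconfig/iptables` in iptables-save format, and `has_glusterd_accept_rule` is a hypothetical helper. It only looks for an exact `--dport 24007` ACCEPT rule in the INPUT chain; port ranges, multiport rules, custom chains and rule order (for example a REJECT placed before the ACCEPT) are not considered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the given iptables-save style rules file
# contains an INPUT-chain ACCEPT rule for TCP port 24007 (glusterd), else 1.
# Simplification: exact "--dport 24007" only; ranges and rule order ignored.
has_glusterd_accept_rule() {
  local rules_file=$1
  grep -Eq -- '^-A INPUT .*-p tcp .*--dport 24007( |$).*-j ACCEPT' "$rules_file"
}

# Usage (path is an assumption for iptables-services on RHEL 7):
#   has_glusterd_accept_rule /etc/sysconfig/iptables || echo "24007 not opened"
```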

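It is highly recommended to open the required ports instead, now that a glusterfs service file for firewalld is available (from RHGS 3.1.1). The first command below applies the change to the running firewall; the second makes it persist across reboots: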

# firewall-cmd --zone=public --add-service=glusterfs
# firewall-cmd --zone=public --add-service=glusterfs --permanent