1241336 – When one of the RHGS nodes in the cluster abruptly goes down, all gluster cli commands fail

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	rhgs-3.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Satish Mohan
QA Contact:	Bala Konda Reddy M
Docs Contact:
URL:
Whiteboard:	glusterd
Depends On:	1250809
Blocks:	1216951

Reported:	2015-07-09 03:42 UTC by SATHEESARAN
Modified:	2019-04-27 02:16 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	When a Red Hat Gluster Storage node is shut down due to power failure or hardware failure, or when the network interface on a node goes down abruptly, subsequent Gluster commands may time out. This happens because the corresponding TCP connection remains in the ESTABLISHED state. You can confirm this by executing the following command: "ss -tap state established '( dport = :24007 )' dst <IP-addr-of-powered-off-RHGS-node>" Workaround: Restart the glusterd service on all other nodes.
Clone Of:
Environment:	RHEL6
Last Closed:	2018-01-30 02:04:20 UTC
Embargoed:
Dependent Products:

Description SATHEESARAN 2015-07-09 03:42:31 UTC

Description of problem:
-----------------------
When one of the RHGS nodes in the cluster goes down abruptly (due to forced shutdown, power failure, hardware failure, or network disconnect), gluster is unable to detect that the host is down. As a consequence, all gluster cli commands fail with "Error : Request timed out".

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.1 Nightly build

How reproducible:
-----------------
Consistent

Steps to Reproduce:
--------------------
1. Power off one of the RHGS nodes in the 'Trusted Storage Pool'
2. Execute 'gluster volume status'

Actual results:
---------------
All gluster cli commands fail with error "Error : Request timed out"

Expected results:
-----------------
At least after some time, gluster should detect that the RHGS node is down, and should not block or fail subsequent gluster cli commands.

Additional info:
----------------
In RHGS 3.0.4, this issue was not present; gluster was able to detect when an RHGS node was down.

I also tried the test case by blocking all network traffic (both incoming and outgoing) between the particular RHGS node and all other nodes, and I hit the problem again.

Comment 3 SATHEESARAN 2015-07-09 10:20:35 UTC

With the latest testing, I had only 2 nodes in the cluster and did the following steps:

1. Created a 2 node 'Trusted Storage Pool'
2. Created a plain distribute volume with a single brick on node1
3. Powered off node2 ( as this RHGS node was a VM, I did 'virsh destroy rhsvm' )

Result - 
All gluster cli commands started to error out.
[root@ ~]# gluster v status
Error : Request timed out

Proposing this bug as a BLOCKER for the following reason:

Any node in the cluster could go down abruptly (hardware failure can't be predicted), and that leads to all gluster cli commands failing.

Comment 4 SATHEESARAN 2015-07-10 06:18:44 UTC

I have tried the same case with bare-metal machines and I see the same behaviour: gluster cli commands hang after one of the machines is shut down forcefully.

Here I performed 'Power off server - Immediate' through the Supermicro console.

Comment 6 monti lawrence 2015-07-22 21:13:12 UTC

Doc text is edited. Please sign off to be included in Known Issues.

Comment 8 Anjana Suparna Sriram 2015-07-28 03:18:37 UTC

Updated the doc text, please review and sign off.

Comment 9 krishnan parthasarathi 2015-07-28 09:21:31 UTC

Anjana, 
The updated documentation looks good to me. Thanks for editing it.
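
For anyone who wants to script the check from the Doc Text, here is a minimal sketch in Python. It only wraps the ss command quoted there and looks for leftover connection rows. The function names are hypothetical, and the header detection (a row starting with "Recv-Q") is an assumption about ss output when a state filter is given; it has not been verified against every iproute2 version.

```python
import subprocess

GLUSTERD_PORT = 24007  # port used in the ss filter quoted in the Doc Text


def build_ss_command(peer_ip):
    """Return the argv for the ss check quoted in the Doc Text."""
    return [
        "ss", "-tap", "state", "established",
        "( dport = :%d )" % GLUSTERD_PORT,
        "dst", peer_ip,
    ]


def stale_connections(ss_output):
    """Return the connection rows from ss output, skipping the header.

    Assumption: the header row is the one that starts with "Recv-Q".
    """
    rows = []
    for line in ss_output.splitlines():
        line = line.strip()
        if line and not line.startswith("Recv-Q"):
            rows.append(line)
    return rows


def peer_looks_stale(peer_ip):
    """Run ss and report whether a connection to the peer is still ESTABLISHED."""
    result = subprocess.run(
        build_ss_command(peer_ip),
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return bool(stale_connections(result.stdout))
```

Calling peer_looks_stale("<IP-addr-of-powered-off-RHGS-node>") on a surviving node and getting True while that peer is known to be powered off would match the situation described in the Doc Text, in which case the documented workaround (restarting glusterd on the other nodes) applies.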
