Bug 1241336

Summary: When one of the RHGS nodes in the cluster abruptly goes down, all gluster cli commands fail
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: SATHEESARAN <sasundar>
Component: glusterd Assignee: Satish Mohan <smohan>
Status: CLOSED WORKSFORME QA Contact: Bala Konda Reddy M <bmekala>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: amukherj, asriram, asrivast, nlevinki, sasundar, smohan, vbellur
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: glusterd
Fixed In Version: Doc Type: Known Issue
Doc Text:
When a Red Hat Gluster Storage node is shut down due to power failure or hardware failure, or when the network interface on a node goes down abruptly, subsequent Gluster commands may time out. This happens because the corresponding TCP connection remains in the ESTABLISHED state. You can confirm this by executing the following command on one of the other nodes: "ss -tap state established '( dport = :24007 )' dst <IP-addr-of-powered-off-RHGS-node>" Workaround: Restart the glusterd service on all other nodes.
Story Points: ---
Clone Of: Environment:
RHEL6
Last Closed: 2018-01-30 02:04:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1250809    
Bug Blocks: 1216951    
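
The check and workaround described in the Doc Text field could be scripted roughly as follows. This is only a sketch under stated assumptions: DOWN_NODE_IP and DRY_RUN are hypothetical variable names, 192.0.2.10 is a placeholder address, and "service glusterd restart" assumes the RHEL 6 init script from the Environment field. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: confirm a stale glusterd connection to a dead peer, then
# apply the documented workaround. Run on each node that is still up.

# Hypothetical settings (not from the bug report):
DOWN_NODE_IP="${DOWN_NODE_IP:-192.0.2.10}"  # placeholder address of the dead node
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
# Note: the printed form drops the shell quoting around the ss filter.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# The ss filter from the Doc Text: established connections to the
# glusterd management port (24007) on the given peer.
check_stale() {
    run ss -tap state established '( dport = :24007 )' dst "$1"
}

# Workaround from the Doc Text; the service command is an assumption
# based on the RHEL 6 environment.
restart_glusterd() {
    run service glusterd restart
}

check_stale "$DOWN_NODE_IP"
restart_glusterd
```

If the ss command lists a connection to the powered-off node, that connection is still considered ESTABLISHED and the restart applies. Set DRY_RUN=0 to execute the commands instead of printing them.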

Description SATHEESARAN 2015-07-09 03:42:31 UTC
Description of problem:
-----------------------
When one of the RHGS nodes in the cluster goes down abruptly (due to forced shutdown, power failure, hardware failure, or network disconnect), gluster is unable to detect that the host is down. As a result, all gluster cli commands fail with "Error : Request timed out".

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.1 Nightly build

How reproducible:
-----------------
Consistent

Steps to Reproduce:
--------------------
1. Power off one of the RHGS nodes in the 'Trusted Storage Pool'
2. Execute 'gluster volume status'

Actual results:
---------------
All gluster cli commands fail with error "Error : Request timed out"

Expected results:
-----------------
At least after some time, gluster should detect that the RHGS node is down and should not block or fail subsequent gluster cli commands.

Additional info:
----------------
In RHGS 3.0.4, this issue was not present, and gluster was able to detect when an RHGS node was down.

I also tried the test case by blocking all network traffic between that RHGS node and all other nodes (both incoming and outgoing), and I hit the problem again.

Comment 3 SATHEESARAN 2015-07-09 10:20:35 UTC
With the latest testing, I had only 2 nodes in the cluster and did the following steps:

1. Created a 2 node 'Trusted Storage Pool'
2. Created a plain distribute volume with a single brick on node1
3. Powered off node2 ( as this RHGS node was a VM, I did 'virsh destroy rhsvm' )

Result - 
All gluster cli commands started to error out.
[root@ ~]# gluster v status
Error : Request timed out

Proposing this bug as a BLOCKER based on the following reasoning:

Any node in the cluster could go down abruptly (hardware failure can't be predicted), and that leads to all gluster cli commands failing.

Comment 4 SATHEESARAN 2015-07-10 06:18:44 UTC
I have tried the same case with bare-metal machines and I see the same behaviour: gluster cli commands hang after one of the machines is forcefully shut down.

Here I performed 'Power off server - Immediate' through the Supermicro console.

Comment 6 monti lawrence 2015-07-22 21:13:12 UTC
Doc text is edited. Please sign off to be included in Known Issues.

Comment 8 Anjana Suparna Sriram 2015-07-28 03:18:37 UTC
Updated the doc text, please review and sign off.

Comment 9 krishnan parthasarathi 2015-07-28 09:21:31 UTC
Anjana, 
The updated documentation looks good to me. Thanks for editing it.