Bug 1329421

Summary: [GSS] - Failover did not occur when a brick went unresponsive.
Product: Red Hat Gluster Storage
Reporter: Oonkwee Lim_ <olim>
Component: core
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED WORKSFORME
QA Contact: Anoop <annair>
Severity: urgent
Priority: urgent
Docs Contact:
Version: rhgs-3.1
CC: mmalhotr, olim, pkarampu, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-10 05:15:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Oonkwee Lim_ 2016-04-21 22:03:10 UTC
Description of problem:
Customer stated that when one of his nodes went unresponsive, failover to the good node did not occur, causing the FUSE client to hang.

This specific machine was under a high CPU load. When I tried to ssh to gluster server node 1, it took up to two minutes before it asked for the password. The VMware console did not show a 'frozen' server; the server was alive but appeared to be too busy. When I pressed Enter, the Red Hat screen advanced lines but never showed the prompt.

In the client's log, we do see the unwinding of saved frames, possibly due to timeouts.
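The unwinding of saved frames on the client is typically tied to the glusterfs RPC timeouts. As a diagnostic sketch (the volume name `myvol` is a placeholder), the relevant timeouts can be inspected and tuned with the gluster CLI:

```shell
# Show the current ping timeout (default 42s). A brick that stays
# silent longer than this should be marked down, which is what
# allows the client to fail over to the remaining node.
gluster volume get myvol network.ping-timeout

# Optionally lower the timeout so unresponsive bricks are detected
# sooner (trade-off: transient network blips may disconnect clients).
gluster volume set myvol network.ping-timeout 20

# frame-timeout governs when outstanding RPC frames are unwound on
# the client side (default 1800s), matching the log messages seen here.
gluster volume get myvol network.frame-timeout
```

Note this is a general tuning sketch, not the root cause identified for this bug; the values shown for illustration are not from the customer's environment.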

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.1 (Maipo)
Red Hat Gluster Storage Server 3.1 Update 1

How reproducible:
Occurred only once

Steps to Reproduce:
1. None

Actual results:
The FUSE client remained hung.

Expected results:
The FUSE client should not hang; it should fail over to the other node.

Additional info: