Bug 763824 (GLUSTER-2092)

Summary: Detect a non-working brick volume and remove it from service
Product: [Community] GlusterFS
Reporter: Allen Lu <allen>
Component: core
Assignee: Anand Avati <aavati>
Status: CLOSED WONTFIX
QA Contact:
Severity: low
Docs Contact:
Priority: low
Version: 3.1.0
CC: amarts, chrisw, gluster-bugs, joe, vbellur, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-28 03:13:52 UTC
Type: ---

Description Allen Lu 2010-11-11 19:53:55 UTC
A problem with the SSD on mcloud4 caused a directory (not the entire
drive) to hang, and Gluster hung on that directory. Since the drive was
not reported as failed at the OS level, we never stopped trying. The
result was a complete hang of the volume. We would like to see a timeout
function so that, if Gluster detects a hanging directory, it shuts down
that particular brick, as long as it's replicated.

Comment 1 Amar Tumballi 2012-04-28 03:13:52 UTC
This is not valid as per the design. We don't want to take that decision automatically. Admin can use 'gluster volume remove-brick' to do this intentionally if needed.

Comment 2 Joe Julian 2012-04-28 03:41:53 UTC
This bug wasn't about removing a brick, but rather about glusterfsd exiting when its posix translator fails. I believe that this bug should be re-evaluated on that basis.

Comment 3 Joe Julian 2012-06-15 22:17:31 UTC
This interpretation of the request was flawed. Please reopen this.

A problem exists that can block the entire volume from use. Louis and I have both also had occasions where the brick's filesystem or drive has failed. glusterfsd tries to access that drive and hangs indefinitely. This should be detected, and the glusterfsd process should time out and exit gracefully.

Currently, filesystem blocks like this can lead to a zombie process that can only be restored by rebooting the server. This is not acceptable behavior. The priority and severity of this problem should be considered high.
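
To illustrate the kind of detection being requested, here is a minimal sketch in Python (not GlusterFS code, and not how glusterfsd is implemented). It assumes a brick can be probed with an ordinary stat() call on a directory, and that a probe which does not return within a deadline should be treated as a failed brick. The function name probe_path, the timeout_s parameter and the probe argument are all hypothetical names chosen for this example.

```python
import os
import threading


def probe_path(path, timeout_s=5.0, probe=os.stat):
    """Return True if probe(path) finishes without error within timeout_s.

    The probe runs in a daemon thread so that a call stuck inside the
    kernel cannot block the caller or keep the process alive at exit.
    """
    result = {"ok": False}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            # The filesystem answered, but with an error (e.g. EIO, ENOENT).
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)  # wait at most timeout_s seconds for the probe
    # A thread that is still alive means the probe is hanging.
    return (not t.is_alive()) and result["ok"]


if __name__ == "__main__":
    # Hypothetical usage: a monitor could call this periodically and
    # take the brick out of service after a failed probe.
    print(probe_path(".", timeout_s=1.0))
```

Note that a thread blocked on a hung filesystem cannot be cancelled, which is why the sketch abandons it rather than waiting for it; a real brick process would additionally need to decide how to exit while such a call is still outstanding.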