Bug 763824 (GLUSTER-2092) - Detect a non-working brick volume and remove it from service
Summary: Detect a non-working brick volume and remove it from service
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-2092
Product: GlusterFS
Classification: Community
Component: core
Version: 3.1.0
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Anand Avati
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-11-11 19:53 UTC by Allen Lu
Modified: 2015-09-01 23:05 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-04-28 03:13:52 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Allen Lu 2010-11-11 19:53:55 UTC
A problem with the SSD on mcloud4 caused a directory (not the entire
drive) to hang, and Gluster hung on that directory. Since the drive was
not reported as failed at the OS level, we never stopped trying. The
result was a complete hang on the volume. We would like to see a timeout
function so that, if Gluster detects a hanging directory, it shuts down
that particular brick, as long as it's replicated.
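
To illustrate the kind of timeout check requested above, here is a minimal sketch in Python. It is not GlusterFS code (GlusterFS is written in C), and the names probe_directory and should_take_brick_offline, the 5-second default, and the "replicated" flag are all hypothetical. It assumes that a blocked stat() call on the brick directory is an acceptable signal for a hang. The probe runs in a separate thread so the caller can stop waiting even if the call never returns.

```python
import os
import threading


def probe_directory(path, timeout_s=5.0, probe=os.stat):
    """Return True if probe(path) succeeds within timeout_s, else False.

    Hypothetical helper: the probe runs in a daemon thread so that a
    call stuck in the kernel does not block the caller forever.
    """
    result = {"ok": False}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            # An I/O error means the directory is unhealthy, but not hung.
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    # If the thread is still alive, the probe has not returned: treat as hung.
    return result["ok"] and not t.is_alive()


def should_take_brick_offline(path, replicated, timeout_s=5.0, probe=os.stat):
    """Suggest taking the brick offline only if a replica can still serve data."""
    return replicated and not probe_directory(path, timeout_s, probe)
```

Note that a thread blocked in an uninterruptible I/O wait cannot be cancelled, so this only lets the caller detect the hang and react. It does not free the stuck call itself.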

Comment 1 Amar Tumballi 2012-04-28 03:13:52 UTC
This is not valid as per the design. We don't want to make that decision automatically. The admin can use 'gluster volume remove-brick' to do this intentionally if needed.

Comment 2 Joe Julian 2012-04-28 03:41:53 UTC
This bug wasn't about removing a brick, but rather about glusterfsd exiting when its posix translator fails. I believe that this bug should be re-evaluated on that basis.

Comment 3 Joe Julian 2012-06-15 22:17:31 UTC
This interpretation of the request was flawed. Please reopen this.

A problem exists that can block the entire volume from use. Louis and I have both also had occasions where the brick's filesystem or drive has failed. glusterfsd tries to access that drive and hangs indefinitely. This should be detected, and the glusterfsd process should time out and exit gracefully.

Currently, filesystem blocks like this can lead to a zombie process that can only be cleared by rebooting the server. This is not acceptable behavior. The priority and severity of this problem should be considered high.

