Red Hat Bugzilla – Bug 848556
glusterfsd apparently unaware of brick failure.
Last modified: 2014-09-19 10:27:09 EDT
Description of problem:
When a drive dies, the associated brick is not marked offline, leading to data being written to the root filesystem and eventually to a system crash (possibly related to 832609?)
Version-Release number of selected component (if applicable):
3.2.5; the behavior was also reproduced on the current version by user 'jdarcy' on IRC.
Steps to Reproduce:
Assuming a working gluster cluster with sdc as the "faulty" drive/brick:
1. begin writing data to the gluster mount
2. unmount or disable /dev/sdc
3. wait for the root partition to fill up and take down the entire cluster.
Actual results:
Catastrophic system failure.
Expected results:
Sane handling of hardware asset identification, addition and removal.
Additional info:
None at this time, the problem seems self-explanatory to me. Please reach out if you require any more information.
With the commit below, one can no longer 'umount' the brick partition while the brick process is running, so I suspect this won't be an issue anymore.
amar@supernova:~/work/glusterfs$ git show 2e00396e04f261af45c33b55b9b73157a2e8fc72
Author: Kaushal M <email@example.com>
Date: Tue Sep 27 12:37:22 2011 +0530
storage/posix : prevent unmount of underlying fs
posix xlator now performs opendir () on the brick directory during init ().
This will prevent the underlying filesystem mounted to that directory from being
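As a rough illustration of the idea in the commit message above (keeping the brick directory open so that the filesystem stays busy), here is a minimal Python sketch. It is not the posix xlator's code; the class name `BrickDirHold` is hypothetical, and it assumes a Linux-like system where `os.O_DIRECTORY` is available.

```python
import os


class BrickDirHold:
    """Hypothetical helper: keep a descriptor open on a brick directory."""

    def __init__(self, brick_path):
        self.path = brick_path
        # O_DIRECTORY makes the open fail if the path is not a directory.
        self.fd = os.open(brick_path, os.O_RDONLY | os.O_DIRECTORY)

    def device(self):
        # Device ID of the filesystem the open directory lives on.
        return os.fstat(self.fd).st_dev

    def close(self):
        # Release the descriptor; safe to call more than once.
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None
```

While such a descriptor is open, a plain `umount` of that filesystem is expected to be refused as busy. This would not by itself detect a disk that fails while still mounted.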
My case was a TOTAL disk failure, and glusterfsd was not aware of the missing brick. As a result, the client gets I/O errors, and files that were meant to be saved onto the broken brick are lost! Resending the files results in the same I/O errors for the same files, since it is still trying to write to the broken/missing disk.
Number of Bricks: 4 x 2 = 8
The system still had the disk mounted; dmesg shows messages for the disk hardware failure:
XFS (vdc1): xfs_log_force: error 5 returned.
Until I restarted the glusterfsd service by hand, 'gluster volume status' showed all bricks as ONLINE.
I read case 832609 and it seems to be my issue as well. I don't have an Amazon instance, but the disk used as a brick is mounted from shared storage, and it became unavailable.
Version 3.4.0-2 seems to have it resolved.
(In reply to Marcin from comment #4)
> Version 3.4.0-2 seems to have it resolved.
Current versions also have brick failure detection:
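To give a rough idea of what detecting a failed brick can involve, here is a minimal Python sketch of a write-and-read-back probe. It is an assumption-laden illustration, not GlusterFS's implementation; the function name `brick_is_healthy` and the probe file name are hypothetical.

```python
import os


def brick_is_healthy(brick_path, probe_name=".health_probe"):
    """Return True if a small file can be written, synced and read back."""
    probe = os.path.join(brick_path, probe_name)
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"ok")
            # fsync pushes the data to the device, so I/O errors surface here.
            os.fsync(fd)
        finally:
            os.close(fd)
        with open(probe, "rb") as f:
            ok = f.read() == b"ok"
        os.unlink(probe)
        return ok
    except OSError:
        # Covers a missing directory as well as errors such as EIO.
        return False
```

A monitoring loop could call such a probe periodically and mark the brick offline when it returns False, instead of waiting for a manual restart.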
Note that it is recommended to configure your volumes like this:
- mountpoint for a brick = /bricks/<volume>-<brick-number>
- path used in 'gluster volume create': /bricks/<volume>-<brick-number>/data
(<volume> would be the name of your volume, <brick-number> is just a counter)
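The point of this layout, as I understand it, is that the 'data' directory only exists while the brick filesystem is mounted, so a missing mount leaves the brick path absent instead of silently pointing at the root filesystem. A minimal Python sketch of a pre-start check along those lines follows; the function name `brick_layout_ok` and the injectable `is_mount` parameter are hypothetical and only there for illustration.

```python
import os


def brick_layout_ok(brick_path, is_mount=os.path.ismount):
    """Check that brick_path is a directory directly below a mountpoint.

    Example brick_path: /bricks/<volume>-<brick-number>/data
    """
    brick_path = os.path.abspath(brick_path)
    # The parent of the brick directory should be the brick's mountpoint.
    mountpoint = os.path.dirname(brick_path)
    return os.path.isdir(brick_path) and is_mount(mountpoint)
```

If the filesystem is not mounted, the parent is a plain directory and the check fails, so a wrapper script could refuse to start the brick process.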