Description of problem:
When a drive dies, the associated brick is not marked offline, leading to data being written to the root filesystem and eventually to a system crash (possibly related to bug 832609?).

Version-Release number of selected component (if applicable):
3.2.5; the behavior was also duplicated on the current release on IRC by user 'jdarcy'.

How reproducible:
Always

Steps to Reproduce:
Assuming a working gluster cluster with sdc as the "faulty" drive/brick (a command-level sketch follows below):
1. Begin writing data to the gluster mount.
2. Unmount or disable /dev/sdc.
3. Wait for the root partition to fill up and take down the entire cluster.

Actual results:
Catastrophic system failure.

Expected results:
Sane handling of hardware asset identification, addition and removal.

Additional info:
None at this time; the problem seems self-explanatory to me. Please reach out if you require any more information.
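For completeness, a command-level sketch of the reproduction. The device, paths and volume name ('testvol', /bricks/testvol-1, /mnt/gluster) are assumptions for illustration, not taken from the report:

# Assumed layout: volume 'testvol' with a brick backed by /dev/sdc mounted at
# /bricks/testvol-1 on a server, and the volume mounted at /mnt/gluster on a client.

# 1. Begin writing data to the gluster mount from the client.
dd if=/dev/zero of=/mnt/gluster/testfile bs=1M count=10240 &

# 2. On the server, take the brick's backing device away (lazy unmount here,
#    or physically pull/disable /dev/sdc).
umount -l /bricks/testvol-1

# 3. Watch the server's root partition fill up: new writes now land in the
#    bare brick directory on the root filesystem.
watch df -h /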
Now, one can't 'umount' the brick partition while the brick process is running, so I suspect this won't be an issue anymore.

----
amar@supernova:~/work/glusterfs$ git show 2e00396e04f261af45c33b55b9b73157a2e8fc72
commit 2e00396e04f261af45c33b55b9b73157a2e8fc72
Author: Kaushal M <kaushal>
Date:   Tue Sep 27 12:37:22 2011 +0530

    storage/posix : prevent unmount of underlying fs

    posix xlator now performs opendir () on the brick directory during
    init (). This will prevent the underlying filesystem mounted to that
    directory from being unmounted.

    Change-Id: I02c190ab8a91abc4ab06959b36f50e0a3fa527ae
    BUG: GLUSTER-3578
    Reviewed-on: http://review.gluster.com/509
----
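To illustrate the effect the opendir() in init() has, a minimal sketch (paths are hypothetical): any process holding a reference inside a mount makes the kernel refuse a plain unmount, and the brick process now holds such a reference itself.

# Any held reference inside the mount (an open directory handle, or simply a
# shell's working directory) makes a plain unmount fail with EBUSY:
cd /bricks/testvol-1
umount /bricks/testvol-1     # fails: "target is busy" / "device is busy"

# Since the posix xlator keeps the brick root open via opendir(), the brick
# filesystem can no longer be unmounted out from under a running brick.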
My case was a TOTAL disk failure: glusterfsd was not aware of the missing BRICK, so the client got I/O errors and the files that were meant to be saved onto the broken brick were lost! Resending the files resulted in the same I/O errors for the same files, since it was still trying to write to the broken/missing disk.

Setup:
Type: Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Version: 3.3.1

The system still had the disk mounted, and dmesg reported the disk hardware failure:
XFS (vdc1): xfs_log_force: error 5 returned.

Until I restarted the glusterfsd service by hand, 'gluster volume status' showed all bricks as ONLINE.
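As a manual workaround in this situation, the dead brick process can be restarted once the disk is back; a minimal sketch, assuming a volume named 'myvol' (a placeholder, not from this report):

# Check which brick processes gluster thinks are online.
gluster volume status myvol

# If a brick is dead but still reported as ONLINE (as above), force-starting
# the volume respawns any brick processes that are not running, once the
# failed disk has been replaced and remounted. Restarting the gluster
# services by hand, as described above, has a similar effect.
gluster volume start myvol force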
I read bug 832609 and it seems to be my issue as well. I don't have an Amazon instance, but the disk used as a brick is mounted from shared storage and it became unavailable.
Version 3.4.0-2 seems to have it resolved.
(In reply to Marcin from comment #4)
> Version 3.4.0-2 seems to have it resolved.

Current versions also have Brick Failure Detection:
- http://www.gluster.org/community/documentation/index.php/Features/Brick_Failure_Detection

Note that it is recommended to configure your volumes like this (see the sketch below):
- mountpoint for a brick: /bricks/<volume>-<brick-number>
- path used in 'gluster volume create': /bricks/<volume>-<brick-number>/data

(<volume> would be the name of your volume, <brick-number> is just a counter.)
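A minimal sketch of that layout for a two-node replicated volume. The hostnames, device name and the volume name 'myvol' are placeholders; the storage.health-check-interval option belongs to the Brick Failure Detection feature linked above, and its default may vary between releases:

# On server1: mount the disk at the brick mountpoint and create the data
# subdirectory inside it. Repeat on server2 with /bricks/myvol-2.
mkfs.xfs /dev/sdc1
mkdir -p /bricks/myvol-1
mount /dev/sdc1 /bricks/myvol-1
mkdir -p /bricks/myvol-1/data

# Create and start the volume using the data subdirectory as the brick path.
# If the brick filesystem is ever not mounted, the data directory is simply
# missing, instead of data silently being written onto the root filesystem.
gluster volume create myvol replica 2 \
    server1:/bricks/myvol-1/data \
    server2:/bricks/myvol-2/data
gluster volume start myvol

# Brick failure detection: the posix translator periodically checks the
# health of the underlying filesystem and takes the brick offline when the
# check fails (interval in seconds, 0 disables the check).
gluster volume set myvol storage.health-check-interval 30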