Bug 848556 - glusterfsd apparently unaware of brick failure.
Product: GlusterFS
Classification: Community
Component: core
Platform: x86_64 Linux
Priority: medium
Severity: high
Assigned To: GlusterFS Bugs list
Depends On:
Reported: 2012-08-15 17:08 EDT by Zach Morgan
Modified: 2014-09-19 10:27 EDT (History)
6 users

See Also:
Fixed In Version: glusterfs-3.4.0-2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-09-19 10:27:09 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Zach Morgan 2012-08-15 17:08:42 EDT
Description of problem:
When a drive dies, the associated brick is not marked offline. glusterfsd keeps writing under the brick's mountpoint directory, which now sits on the root filesystem, so the root partition fills up and the system eventually crashes (possibly related to bug 832609?).

Version-Release number of selected component (if applicable):
3.2.5; the behavior was also reproduced on the current release by user 'jdarcy' on IRC.

How reproducible:

Steps to Reproduce:
Assuming a working gluster cluster with sdc as the "faulty" drive/brick:
1. Begin writing data to the gluster mount.
2. Unmount or disable /dev/sdc.
3. Wait for the root partition to fill up and take down the entire cluster.
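The steps above can be sketched as a shell session (the device, brick, and mount paths are hypothetical, and step 2 requires root):

```shell
# 1. Begin writing data to the gluster mount
dd if=/dev/zero of=/mnt/gluster/bigfile bs=1M count=100000 &

# 2. Simulate the drive failing out from under the brick
umount -l /export/brick-sdc      # or physically pull / disable /dev/sdc

# 3. glusterfsd keeps writing into the now-bare mountpoint directory,
#    which lives on the root filesystem; watch it fill up:
watch df -h /
```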
Actual results:
Catastrophic system failure.

Expected results:
Sane handling of hardware asset identification, addition and removal.

Additional info:
None at this time; the problem seems self-explanatory to me. Please reach out if you require any more information.
Comment 1 Amar Tumballi 2013-02-26 05:41:52 EST
As of the change below, one can't 'umount' the brick partition while the brick process is running, so I suspect this won't be an issue anymore.
amar@supernova:~/work/glusterfs$ git show 2e00396e04f261af45c33b55b9b73157a2e8fc72
commit 2e00396e04f261af45c33b55b9b73157a2e8fc72
Author: Kaushal M <kaushal@gluster.com>
Date:   Tue Sep 27 12:37:22 2011 +0530

    storage/posix : prevent unmount of underlying fs
    posix xlator now performs opendir () on the brick directory during init ().
    This will prevent the underlying filesystem mounted to that directory from
    being unmounted.
    Change-Id: I02c190ab8a91abc4ab06959b36f50e0a3fa527ae
    BUG: GLUSTER-3578
    Reviewed-on: http://review.gluster.com/509
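The effect of that change can be checked from a shell: while a process holds an open handle on the brick directory, a plain unmount of the underlying filesystem is refused (mountpoint hypothetical):

```shell
# glusterfsd holds an opendir() handle on the brick directory, so:
umount /export/brick-sdc
# umount: /export/brick-sdc: target is busy.

# Only a lazy (or forced) unmount would still detach it:
umount -l /export/brick-sdc
```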
Comment 2 Marcin 2013-07-11 13:02:46 EDT
My case was a total disk failure: glusterfsd was not aware of the missing brick, so the client got I/O errors and the files that were meant to be saved onto the broken brick were lost. Resending the files resulted in the same I/O errors for the same files, since Gluster kept trying to write to the broken/missing disk.

Type: Distributed-Replicate
Number of Bricks: 4 x 2 = 8

Version 3.3.1

The system still had the disk mounted; dmesg showed hardware failure messages for the disk:

    XFS (vdc1): xfs_log_force: error 5 returned.

Until I restarted the glusterfsd service by hand, 'gluster volume status' showed all bricks as ONLINE.
Comment 3 Marcin 2013-07-11 13:10:02 EDT
I read bug 832609 and it seems to be my issue as well. I don't have an Amazon instance, but the disk used as a brick is mounted from shared storage, and it became unavailable.
Comment 4 Marcin 2013-07-18 11:45:38 EDT
Version 3.4.0-2 seems to have it resolved.
Comment 5 Niels de Vos 2014-09-19 10:27:09 EDT
(In reply to Marcin from comment #4)
> Version 3.4.0-2 seems to have it resolved.

Current versions also have Brick Failure Detection:
- http://www.gluster.org/community/documentation/index.php/Features/Brick_Failure_Detection

Note that it is recommended to configure your volumes like this:
- mountpoint for a brick = /bricks/<volume>-<brick-number>
- path used in 'gluster volume create': /bricks/<volume>-<brick-number>/data

  (<volume> would be the name of your volume, <brick-number> is just a counter)
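Following that recommendation, provisioning a hypothetical replicated volume named 'myvol' might look like the sketch below (hostnames and devices are made up). Because the brick path is a subdirectory of the mountpoint, a missing mount leaves the 'data' directory absent, so the brick fails to start instead of silently writing to the root filesystem:

```shell
# Mount each disk at /bricks/<volume>-<brick-number>
mkdir -p /bricks/myvol-1 /bricks/myvol-2
mount /dev/sdb1 /bricks/myvol-1
mount /dev/sdc1 /bricks/myvol-2

# The brick path handed to gluster is a subdirectory of the mountpoint
mkdir -p /bricks/myvol-1/data /bricks/myvol-2/data
gluster volume create myvol replica 2 \
    server1:/bricks/myvol-1/data server2:/bricks/myvol-2/data
```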
