Currently there is no way to replace a brick of a gluster volume in a CNS environment in case of low-level issues like LVM or FS corruption of the brick. We have to replace the whole disk on which that brick resides, which is a tedious and time-consuming task. Also, because of an issue with just one brick, all bricks residing on that disk are affected, which is undesirable. There should be a replace-brick type utility, like the one present in core gluster, to overcome this limitation.
So the ask here is to be able to replace a brick for two main reasons:

1. LVM corrupts the LV
2. The FS gets corrupted

Both cases are rare and could possibly be handled as one-off cases rather than introducing a whole feature in heketi. If there are other reasons for the ask, let us know.
Pranith,

If one of the replica bricks gets corrupted due to LVM/FS issues, will formatting the brick and making it empty ensure that self-heal fixes it?
(In reply to Raghavendra Talur from comment #7)
> Pranith,
>
> If one of the replica bricks gets corrupted due to LVM/FS issues, will
> formatting the brick and making it empty ensure that self-heal fixes it?

You have to use the reset-brick workflow for this. Otherwise the pending xattrs won't be set in the direction of the heal. I searched for documentation about the exact steps, but couldn't find any for replicate volumes. Maybe Ravi knows. Leaving a needinfo.
reset-brick (https://review.gluster.org/#/c/glusterfs/+/12250/16//COMMIT_MSG) is documented in section 11.9.5 of the admin guide: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html-single/administration_guide/index#sect-Migrating_Volumes-Reconfigure_Brick
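For reference, a minimal sketch of what that workflow looks like, assuming the corrupted brick is reformatted and reused at the same path (VOLNAME, HOST, and /bricks/brick1 below are placeholders, not values from this bug):

  # take the faulty brick offline within the volume
  gluster volume reset-brick VOLNAME HOST:/bricks/brick1 start

  # reformat/recreate the brick filesystem out of band, then re-add the same
  # brick path; this sets the pending xattrs so self-heal runs towards it
  gluster volume reset-brick VOLNAME HOST:/bricks/brick1 HOST:/bricks/brick1 commit force

See the admin guide section linked above for the exact, supported steps.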
*** Bug 1727918 has been marked as a duplicate of this bug. ***
Bug is in POST but lacks a link to the PR?
For verification:

Heketi now has a new command-line subcommand 'brick evict' that can be invoked like:

  heketi-cli brick evict [brick_id]

Example:

  heketi-cli brick evict f37409fe4ab83a150307a1b622b3da4f

The brick id can be determined from the topology (for example). A brick belongs to only one volume, so only the brick id is needed; heketi will automatically determine which volume is affected.

The behavior of the command is that the named brick is removed from the volume (evicted), and to maintain the volume heketi automatically replaces the evicted brick with a new brick, following the same brick allocation rules as volume creation, expansion, etc. Users do not get to directly control the brick's replacement. Users can influence the replacement the same way as before: by setting devices/nodes online or offline, or by device and node tagging.

Brick eviction is done via an operation. While the eviction and replacement are being performed, a new operation can be seen via 'heketi-cli server operations [info|list]'. If the operation fails, or the server is terminated uncleanly, then during cleanup heketi will try to determine whether the brick has been changed in glusterd; if so, the old brick will be removed. If the brick has not been replaced in glusterd, the new brick will be cleaned up and the user can manually try again at a later time. 'heketi-cli server operations cleanup' can be used to trigger an early cleanup of failed/stale brick evict operations.

For verification, test that:
* the brick evict command line functions as described
* brick evict creates an operation
* the components of evicted bricks are removed from the device storage (LVs are deleted, etc.)
* the bricks are replaced within glusterd
* terminating heketi during a brick evict operation leaves a failed/stale operation behind
* a failed/stale operation can be cleaned up successfully

See also: https://github.com/heketi/heketi/pull/1656
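As a rough illustration of the verification flow using only the commands described above (the brick id shown is a placeholder taken from the example, not a value to reuse):

  # find a brick id to evict from the topology output
  heketi-cli topology info

  # evict the brick; heketi chooses the replacement brick automatically
  heketi-cli brick evict f37409fe4ab83a150307a1b622b3da4f

  # while the eviction runs, the pending operation should be visible
  heketi-cli server operations list

  # if heketi was terminated mid-eviction, trigger cleanup of the stale operation
  heketi-cli server operations cleanup

After the eviction completes, the old brick's LVs should be gone from the device and 'gluster volume info' on the affected volume should show the replacement brick.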
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Storage 3.11.z bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5602