Bug 1630172 - [GSS][RFE] Support remove/replace/re-layout of a brick in a volume
Summary: [GSS][RFE] Support remove/replace/re-layout of a brick in a volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: heketi
Version: cns-3.10
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
: OCS 3.11.z Batch Update 6
Assignee: John Mulligan
QA Contact: Vinayak Papnoi
Docs Contact: Amrita
URL:
Whiteboard:
Duplicates: 1727918
Depends On:
Blocks: OCS-3.11.1-devel-triage-done 1646910 1812122 1930644
 
Reported: 2018-09-18 06:26 UTC by Abhishek Kumar
Modified: 2024-03-25 15:08 UTC (History)
CC List: 18 users

Fixed In Version: heketi-9.0.0-10
Doc Type: Enhancement
Doc Text:
With this update, the Heketi administrator is provided with a command line interface and API to evict a single brick from an existing volume, identified by the brick's Heketi ID. Upon eviction, the brick is automatically replaced by a suitable new brick; if no suitable replacement can be found, the operation fails.
Clone Of:
Environment:
Last Closed: 2020-12-17 04:31:42 UTC
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2020:5602 (last updated 2020-12-17 04:32:12 UTC)

Description Abhishek Kumar 2018-09-18 06:26:56 UTC
Currently there is no way to replace a brick of a Gluster volume in a CNS environment in case of low-level issues such as LVM or filesystem corruption of the brick.

We have to replace the whole disk on which that brick resides, which is a tedious and time-consuming task. Also, because of an issue with a single brick, all bricks residing on that disk are affected, which is undesirable.

There should be a replace-brick style utility, like the one that exists in core Gluster, to overcome this limitation.

Comment 6 Raghavendra Talur 2019-01-23 21:14:46 UTC
So the ask here is to be able to replace a brick, mainly for two reasons:
1. LVM corrupts the LV
2. FS gets corrupted

Both cases are rare and should probably be handled as one-off cases rather than by introducing a whole feature in heketi.

If there are other reasons for the ask, let us know.

Comment 7 Raghavendra Talur 2019-01-23 21:15:26 UTC
Pranith,

If one of the replica bricks gets corrupted due to LVM/FS issues, will formatting the brick and making it empty ensure that self-heal fixes it?

Comment 9 Pranith Kumar K 2019-01-28 09:38:01 UTC
(In reply to Raghavendra Talur from comment #7)
> Pranith,
> 
> If one of the replica bricks gets corrupted due to LVM/FS issues, will
> formatting the brick and making it empty ensure that self-heal fixes it?

You have to use the reset-brick workflow for this. Otherwise the pending xattrs won't be set in the direction of the heal. I searched for documentation about the exact steps, but couldn't find any for replicate volumes. Maybe Ravi knows. Leaving a needinfo.
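
For reference, the core Gluster reset-brick flow looks roughly like the following; VOLNAME and the host:brick path are placeholders, and the exact steps should be confirmed against the Gluster documentation:
   # Take the affected brick offline so its backing store can be wiped or recreated
   gluster volume reset-brick VOLNAME HOST:/path/to/brick start
   # (reformat or recreate the brick's filesystem/LV here)
   # Re-add the same brick path; pending xattrs are then set so self-heal
   # repopulates it from the good replica
   gluster volume reset-brick VOLNAME HOST:/path/to/brick HOST:/path/to/brick commit force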

Comment 16 John Mulligan 2019-07-08 18:19:36 UTC
*** Bug 1727918 has been marked as a duplicate of this bug. ***

Comment 26 Yaniv Kaul 2019-12-01 08:15:45 UTC
Bug is in POST but lacks a link to the PR?

Comment 35 John Mulligan 2020-06-02 17:52:09 UTC
For verification:


Heketi now has a new command line subcommand 'brick evict' that can be invoked like:
   heketi-cli brick evict [brick_id]
Example:
   heketi-cli brick evict f37409fe4ab83a150307a1b622b3da4f

The brick ID can be determined from the topology, for example. A brick belongs to exactly one volume, so only the brick ID is needed; heketi will automatically determine which volume is affected.
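
For example, the brick ID can be read from the topology output; the volume ID below is a placeholder:
   # Dump the whole cluster topology, including volumes, bricks and their IDs
   heketi-cli topology info
   # Or inspect a single volume and its bricks
   heketi-cli volume info VOLUME_ID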

The behavior of the command is that the named brick is removed from the volume (evicted) and, to maintain the volume, heketi automatically replaces the evicted brick with a new brick, following the same brick allocation logic as volume creation, expansion, etc.
Users do not get to directly control the brick's replacement.
Users can influence the brick's replacement in the same way as before: by setting devices/nodes online or offline, or by device and node tagging.
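
For example, a device or node can be taken out of consideration before running the eviction; the IDs below are placeholders:
   # Exclude a device from new brick placement
   heketi-cli device disable DEVICE_ID
   # Re-enable it once maintenance is complete
   heketi-cli device enable DEVICE_ID
   # The same can be done at the node level
   heketi-cli node disable NODE_ID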

Brick eviction is done via an operation. While the eviction and replacement are being performed, the new operation can be seen via 'heketi-cli server operations [info|list]'. If the operation fails, or the server is terminated uncleanly, then during cleanup heketi will try to determine whether the brick has already been changed in glusterd: if so, the old brick is removed; if the brick has not been replaced in glusterd, the new brick is cleaned up and the user can manually try again at a later time.
'heketi-cli server operations cleanup' can be used to trigger an early cleanup of failed/stale brick evict operations.
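
For example:
   # Show a summary of server operations
   heketi-cli server operations info
   # List pending/failed operations and their IDs
   heketi-cli server operations list
   # Trigger cleanup of failed/stale operations
   heketi-cli server operations cleanup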

For verification, test that (see the example checks after this list):
* the brick evict command line functions as described
* brick evict creates an operation
* the components of evicted bricks are removed from the device storage (LVs are deleted, etc.)
* the bricks are replaced within glusterd
* terminating heketi during a brick evict operation leaves a failed/stale operation behind
* the failed/stale operation left behind can be cleaned up successfully
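
A rough sketch of the checks, assuming direct access to the Gluster pods/nodes; VOLNAME is a placeholder:
   # Confirm the brick list changed after the eviction
   gluster volume info VOLNAME
   # On the node that hosted the evicted brick, confirm its LV was removed
   lvs
   # After cleanup, confirm no stale brick evict operations remain
   heketi-cli server operations list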


See also: https://github.com/heketi/heketi/pull/1656

Comment 41 errata-xmlrpc 2020-12-17 04:31:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Storage 3.11.z bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5602

