1576094 – Using heketi-cli command, attempt to remove non-empty device from a terminated glusterfs pod(glusterfs label removed) fails, even with spare nodes available


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	heketi
Sub Component:
Version:	cns-3.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	CNS 3.10
Assignee:	John Mulligan
QA Contact:	Neha Berry
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1568862

Reported:	2018-05-08 19:47 UTC by Neha Berry
Modified:	2018-09-12 09:23 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-09-12 09:22:13 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	heketi heketi pull 1173	0	None	closed	apps: fix selection of node for issuing gluster replace command	2020-02-19 10:13:13 UTC
Red Hat Product Errata	RHEA-2018:2686	0	None	None	None	2018-09-12 09:23:23 UTC

Description Neha Berry 2018-05-08 19:47:25 UTC

Description of problem:
+++++++++++++++++++++++++

While a node is in disconnected peer status in a 4-node gluster CNS+OCP setup, an attempt to remove devices (which hold bricks) from that disconnected peer fails with an error.
Since 3 healthy peer nodes are still available, the expectation is that the bricks should move to the remaining nodes and devices, and that device and node deletion should succeed, as long as n>=3.

Note:

a) The Storage Class used for creating volumes has "volumeoptions=user.heketi.arbiter true" set, hence a 1 x (2 + 1) = 3 PVC/volume is created.
b) None of the nodes or devices were explicitly tagged using the settag option. All settings were default.

Following is the snippet of the error message seen:
+++++++++++++++++++++++++++++++++++++++++++++++++++++

[root@dhcp47-178 neha]# heketi-cli device remove 92bd12308fb66ea3f1948a178def0c6d
Error: Failed to remove device, error: Unable to replace brick 10.70.46.175:/var/lib/heketi/mounts/vg_92bd12308fb66ea3f1948a178def0c6d/brick_96a3281077ea60f69ea9a5197857c3f4/brick with 10.70.47.165:/var/lib/heketi/mounts/vg_6244d2715412489f22987d91ac1526cf/brick_fe4cd563fc7b64917ec3050f58cae4f6/brick for volume ar_glusterfs_mongodb-ar2_170e0769-52d9-11e8-b8d2-005056a5aac9

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++
CNS 3.9 with arbiter support

How reproducible:
++++++++++++++++++++++++++
The issue is reproducible in current setup.

Steps to Reproduce:
++++++++++++++++++++++++++
1. From a running 4 node CNS setup, removed "glusterfs=storage-host" label for node dhcp46-175.lab.eng.blr.redhat.com. Thus the gluster pod "glusterfs-storage-7r4nz" terminated.
2. The node status changed to peer disconnected.
3. Using the heketi-cli device remove command, tried removing the 2 devices of the node so that the node could later be deleted completely from the cluster.
4. Even though a third node was available for the removed bricks to be re-created on, the remove commands failed with the "Unable to replace brick" error message.
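
The steps above can be sketched as a short shell sequence. This is only an illustration under stated assumptions: the node name and device ID are the ones quoted in this report, HEALTHY_POD is a hypothetical placeholder for any surviving glusterfs pod, and the script prints the commands instead of running them unless DRY_RUN=0 is set. Removing a label with a trailing "-" is standard oc/kubectl syntax; nothing here is taken from the reporter's actual shell history.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Prints commands unless DRY_RUN=0.
# NODE and DEVICE are quoted from the report; HEALTHY_POD is a placeholder.
NODE="${NODE:-dhcp46-175.lab.eng.blr.redhat.com}"
DEVICE="${DEVICE:-92bd12308fb66ea3f1948a178def0c6d}"
HEALTHY_POD="${HEALTHY_POD:-<surviving-glusterfs-pod>}"

# Echo the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 1: drop the glusterfs label so the gluster pod on NODE terminates.
    run oc label node "$NODE" glusterfs-
    # Step 2: from a surviving pod, confirm the peer now shows as disconnected.
    run oc rsh "$HEALTHY_POD" gluster peer status
    # Step 3: try to evacuate a device that still holds bricks.
    run heketi-cli device remove "$DEVICE"
}

reproduce
```

On an affected build, the last command is the one that returns the "Unable to replace brick" error shown earlier.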

Full details of commands executed will be shared shortly.

Actual results:
+++++++++++++++++++
Even with a spare node present to take the replacement brick, the device remove command failed. Thus, in scenarios where the disconnected peer is never restored, removing its devices and ultimately the node itself will be problematic. Also, the volumes that used the failed node would continue to work with only 2 bricks instead of 3.

Expected results:
++++++++++++++++++++

With a spare node present to take the replacement brick, the device remove command should have succeeded and moved the brick to another node.

Additional info:
+++++++++++++++++++
This issue is not seen when the following sequence is followed for node removal (with the node in connected peer state):
first, disabled->removed->deleted the devices of the node using heketi-cli;
second, removed->deleted the node using heketi-cli;
third, edited the label of the node to remove glusterfs=storage-host and hence terminate the gluster pod.
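
The working order described above can be sketched as a small shell function. It is a hedged illustration, not the exact commands that were run: the heketi node ID and the second device ID are not given in this report, so they appear as placeholders, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the removal order that worked while the peer was still connected.
# Prints commands unless DRY_RUN=0. IDs in angle brackets are placeholders.

# Echo the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: remove_node_cleanly HEKETI_NODE_ID OC_NODE_NAME DEVICE_ID...
remove_node_cleanly() {
    node_id="$1"
    oc_node="$2"
    shift 2
    # First: disable, remove, then delete each device on the node.
    for dev in "$@"; do
        run heketi-cli device disable "$dev"
        run heketi-cli device remove "$dev"
        run heketi-cli device delete "$dev"
    done
    # Second: remove, then delete the node itself from heketi.
    run heketi-cli node remove "$node_id"
    run heketi-cli node delete "$node_id"
    # Third: only now drop the label so the gluster pod terminates.
    run oc label node "$oc_node" glusterfs-
}

remove_node_cleanly "<heketi-node-id>" "dhcp46-175.lab.eng.blr.redhat.com" \
    "92bd12308fb66ea3f1948a178def0c6d" "<second-device-id>"
```

The difference from the failing case is the ordering: the bricks are migrated while the gluster pod on the node is still running, and the label is removed last.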

But the concern is node & device removal in case of disconnected peer.

Comment 13 Humble Chirammal 2018-05-18 08:03:47 UTC

The mentioned patch https://github.com/heketi/heketi/pull/1173 is merged and available in the latest heketi build, i.e. heketi-6.0.0-13.el7rhg. I am moving this bug to ON_QA.

Comment 18 errata-xmlrpc 2018-09-12 09:22:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2686
