Bug 1576094

Summary: Using heketi-cli command, attempt to remove non-empty device from a terminated glusterfs pod(glusterfs label removed) fails, even with spare nodes available
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Neha Berry <nberry>
Component: heketi Assignee: John Mulligan <jmulligan>
Status: CLOSED ERRATA QA Contact: Neha Berry <nberry>
Severity: medium Docs Contact:
Priority: unspecified    
Version: cns-3.9 CC: hchiramm, jmulligan, madam, pprakash, rhs-bugs, rtalur, sankarshan, storage-qa-internal
Target Milestone: --- Keywords: Regression
Target Release: CNS 3.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-12 09:22:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1568862    

Description Neha Berry 2018-05-08 19:47:25 UTC
Description of problem:
+++++++++++++++++++++++++

While a node is in the disconnected peer state in a 4-node gluster CNS+OCP setup, an attempt to remove devices (which have bricks) from that disconnected peer throws an error.
Since 3 healthy peer nodes are still available, the assumption is that the bricks should move to the remaining nodes, and that device and node deletion should succeed as long as n>=3.

Note: 

a) The Storage Class used for creating volumes has "volumeoptions=user.heketi.arbiter true" set, and hence a 1 x (2 + 1) = 3 PVC/volume is created.
b) None of the nodes or devices were explicitly tagged using the settag option. All settings were default.

Following is the snippet of the error message seen:
+++++++++++++++++++++++++++++++++++++++++++++++++++++

[root@dhcp47-178 neha]# heketi-cli device remove 92bd12308fb66ea3f1948a178def0c6d
Error: Failed to remove device, error: Unable to replace brick 10.70.46.175:/var/lib/heketi/mounts/vg_92bd12308fb66ea3f1948a178def0c6d/brick_96a3281077ea60f69ea9a5197857c3f4/brick with 10.70.47.165:/var/lib/heketi/mounts/vg_6244d2715412489f22987d91ac1526cf/brick_fe4cd563fc7b64917ec3050f58cae4f6/brick for volume ar_glusterfs_mongodb-ar2_170e0769-52d9-11e8-b8d2-005056a5aac9


Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++
CNS 3.9 with arbiter support

How reproducible:
++++++++++++++++++++++++++
The issue is reproducible in current setup.

Steps to Reproduce:
++++++++++++++++++++++++++
1. From a running 4-node CNS setup, removed the "glusterfs=storage-host" label from node dhcp46-175.lab.eng.blr.redhat.com. As a result, the gluster pod "glusterfs-storage-7r4nz" terminated.
2. The node status changed to peer disconnected.
3. Using the heketi-cli device remove command, tried removing the 2 devices of the node, so that the node could later be deleted completely from the cluster.
4. Even though a third node was available for the deleted bricks to be re-created elsewhere, the remove commands failed with the "Unable to replace brick" error message.
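
The steps above can be sketched as a small dry-run script. This is only an illustration under stated assumptions, not the reporter's actual transcript: the `run` helper and the `DRY_RUN` switch are hypothetical conveniences, the second device ID is a placeholder because only one device ID appears in this report, and `oc label node <node> glusterfs-` is assumed to be how the label was removed.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps (assumptions are marked below).
# DRY_RUN and run() are hypothetical helpers: with DRY_RUN=1 (the default)
# commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

NODE="dhcp46-175.lab.eng.blr.redhat.com"      # node named in step 1
DEVICE_1="92bd12308fb66ea3f1948a178def0c6d"   # device ID from the error snippet
DEVICE_2="<second-device-id>"                 # placeholder: not given in this report

reproduce() {
    # Step 1: drop the "glusterfs" label so the gluster pod on NODE terminates
    # (assumed to have been done with "oc label").
    run oc label node "$NODE" glusterfs-
    # Step 3: try to remove both devices of the now-disconnected node.
    for dev in "$DEVICE_1" "$DEVICE_2"; do
        run heketi-cli device remove "$dev"
    done
}

reproduce
```

With the default `DRY_RUN=1` the script only prints the commands it would run; setting `DRY_RUN=0` on a test cluster would execute them, after replacing the placeholder device ID.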

Full details of commands executed will be shared shortly.

Actual results:
+++++++++++++++++++
Even with a spare node present to replace the brick, the device delete command failed. Thus, in scenarios where the disconnected peer is never restored, removing its devices and ultimately the node itself will be a problem. Also, the volumes which used the failed node would continue to work with only 2 bricks instead of 3.

Expected results:
++++++++++++++++++++

With a spare node present to replace the brick, the device delete command should have succeeded and the brick should have moved to another node.

Additional info:
+++++++++++++++++++
This issue is not seen when the following sequence is followed for node removal (the node was in connected peer state):
first, disabled->removed->deleted the devices of the node using heketi-cli,
second, removed->deleted the node using heketi-cli,
third, edited the label of the gluster pod to remove glusterfs=storage-host and hence terminate it.
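
The working sequence above can be sketched as follows. Again this is a hedged dry-run illustration: `run` and `DRY_RUN` are hypothetical helpers, the heketi node ID and the second device ID are placeholders that do not appear in this report, and the label is assumed to be removed from the node with `oc label`, as in the reproduction steps. Some heketi versions may also require `heketi-cli node disable` before `node remove`; that step is not mentioned in the report and is left out here.

```shell
#!/bin/sh
# Dry-run sketch of the removal order that worked while the peer was connected.
# DRY_RUN and run() are hypothetical helpers: with DRY_RUN=1 (the default)
# commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

NODE_NAME="dhcp46-175.lab.eng.blr.redhat.com"   # node named in the report
NODE_ID="<heketi-node-id>"                      # placeholder: not given in this report
# First ID is from the error snippet; the second is a placeholder.
DEVICES="92bd12308fb66ea3f1948a178def0c6d <second-device-id>"

remove_connected_node() {
    # First: disable, remove, then delete each device of the node.
    for dev in $DEVICES; do
        run heketi-cli device disable "$dev"
        run heketi-cli device remove "$dev"
        run heketi-cli device delete "$dev"
    done
    # Second: remove, then delete the node itself.
    run heketi-cli node remove "$NODE_ID"
    run heketi-cli node delete "$NODE_ID"
    # Third: only now drop the label so the gluster pod terminates
    # (assumed to be done with "oc label" on the node).
    run oc label node "$NODE_NAME" glusterfs-
}

remove_connected_node
```

The point of the ordering is that the bricks are migrated while the peer is still reachable, and the pod is terminated only after heketi no longer tracks the node.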

But the concern is node and device removal in the case of a disconnected peer.

Comment 13 Humble Chirammal 2018-05-18 08:03:47 UTC
The mentioned patch https://github.com/heketi/heketi/pull/1173 is merged and available in the latest heketi build, i.e. heketi-6.0.0-13.el7rhg. I am moving this bug to ON_QA.

Comment 18 errata-xmlrpc 2018-09-12 09:22:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2686