Description of problem:
While a node is in disconnected peer state in a 4-node gluster CNS+OCP setup, an attempt to remove devices (which have bricks) from that disconnected peer throws an error.
Since 3 healthy peer nodes are still available, the assumption is that the bricks should move to the remaining healthy node, and that device and node delete should succeed, as long as n>=3.
a) The Storage Class used for creating volumes has "volumeoptions=user.heketi.arbiter true" set, and hence a 1 x (2 + 1) = 3 PVC/volume is created.
b) None of the nodes or devices were explicitly tagged using the settag option. All settings were default.
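For reference, a storage class carrying the arbiter option from (a) might look like the sketch below. Only the `volumeoptions` value comes from this report; the class name, `resturl`, `restuser` and the secret names are placeholder assumptions, not values from the affected setup. The function only prints the manifest, so it can be reviewed before being piped to `oc create -f -`.

```shell
#!/bin/sh
# Print a StorageClass manifest with the arbiter volume option set.
# Only the volumeoptions value is taken from this report; every other
# value (name, resturl, restuser, secret names) is a placeholder.
arbiter_storageclass() {
    cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-arbiter-example
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi.example.com:8080"
  restuser: "admin"
  secretName: "heketi-secret-example"
  secretNamespace: "default"
  volumeoptions: "user.heketi.arbiter true"
EOF
}

arbiter_storageclass
```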
Following is the snippet of the error message seen:
[root@dhcp47-178 neha]# heketi-cli device remove 92bd12308fb66ea3f1948a178def0c6d
Error: Failed to remove device, error: Unable to replace brick 10.70.46.175:/var/lib/heketi/mounts/vg_92bd12308fb66ea3f1948a178def0c6d/brick_96a3281077ea60f69ea9a5197857c3f4/brick with 10.70.47.165:/var/lib/heketi/mounts/vg_6244d2715412489f22987d91ac1526cf/brick_fe4cd563fc7b64917ec3050f58cae4f6/brick for volume ar_glusterfs_mongodb-ar2_170e0769-52d9-11e8-b8d2-005056a5aac9
Version-Release number of selected component (if applicable):
CNS 3.9 with arbiter support
The issue is reproducible in current setup.
Steps to Reproduce:
1. From a running 4-node CNS setup, removed the "glusterfs=storage-host" label from node dhcp46-175.lab.eng.blr.redhat.com. As a result, the gluster pod "glusterfs-storage-7r4nz" terminated.
2. The node status changed to peer disconnected.
3. Using the heketi-cli device remove command, tried removing the 2 devices of the node, so that the node itself could later be deleted completely from the cluster.
4. Even though a third node was available for the removed bricks to be re-created on, the remove commands failed with the "Unable to replace brick" error message.
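The steps above can be sketched as a small script. This is an approximation, not the exact command history: the node name and the device ID are taken from this report, while the name of the surviving gluster pod and the `run`/`reproduce` helpers are assumptions made for illustration. It runs in dry-run mode by default and only prints the commands.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to run against a real (test) cluster.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reproduce NODE HEALTHY_POD DEVICE_ID...
reproduce() {
    node="$1"; healthy_pod="$2"; shift 2
    # Step 1: drop the "glusterfs" label so the gluster pod on NODE terminates.
    run oc label node "$node" glusterfs-
    # Step 2: from a surviving gluster pod, check that the peer shows as disconnected.
    run oc rsh "$healthy_pod" gluster peer status
    # Steps 3-4: try to remove each device of the disconnected node.
    for dev in "$@"; do
        run heketi-cli device remove "$dev"
    done
}

# Node name and device ID are from this report; the pod name is a placeholder.
reproduce dhcp46-175.lab.eng.blr.redhat.com glusterfs-storage-EXAMPLE \
    92bd12308fb66ea3f1948a178def0c6d
```

Only one of the two device IDs appears in this report; the second would be passed as an additional argument.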
Full details of commands executed will be shared shortly.
Even with a spare node present to replace the brick, the device remove command failed. Thus, in scenarios where the disconnected peer is never restored, removing its devices and ultimately the node itself will be problematic. Also, the volumes which used the failed node would continue to work with only 2 bricks instead of 3.
With a spare node present to replace the brick, the device remove command should have succeeded and moved the brick to the other node.
This issue is not seen when the following sequence is followed for node removal (with the node in connected peer state):
first, disabled->removed->deleted the devices of the node using heketi-cli;
second, removed->deleted the node using heketi-cli;
third, edited the label of the gluster pod to remove glusterfs=storage-host and hence terminate it.
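The working sequence can be sketched as follows. The `run` and `remove_node_cleanly` helpers and the heketi node ID are placeholders introduced for illustration; the report does not list the exact commands. The label is removed from the node, as in the reproduction steps. The report does not mention `heketi-cli node disable`, so it is left out here, though some heketi versions may require it before `node remove`.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# remove_node_cleanly OCP_NODE HEKETI_NODE_ID DEVICE_ID...
remove_node_cleanly() {
    ocp_node="$1"; heketi_node="$2"; shift 2
    # First: disable -> remove -> delete every device on the node.
    for dev in "$@"; do
        run heketi-cli device disable "$dev" || return 1
        run heketi-cli device remove "$dev" || return 1
        run heketi-cli device delete "$dev" || return 1
    done
    # Second: remove -> delete the node itself.
    run heketi-cli node remove "$heketi_node" || return 1
    run heketi-cli node delete "$heketi_node" || return 1
    # Third: drop the label so the gluster pod on the node terminates.
    run oc label node "$ocp_node" glusterfs-
}

# The heketi node ID below is a placeholder; it is not given in this report.
remove_node_cleanly dhcp46-175.lab.eng.blr.redhat.com HEKETI_NODE_ID_EXAMPLE \
    92bd12308fb66ea3f1948a178def0c6d
```

Stopping at the first failed command keeps the node from being deleted while bricks are still placed on its devices.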
But the concern is node and device removal in the case of a disconnected peer.
The mentioned patch https://github.com/heketi/heketi/pull/1173 is merged and available in the latest heketi build, i.e. heketi-6.0.0-13.el7rhg. I am moving this bug to ON_QA.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.