Description of problem:

Currently gluster allows deleting a volume even if one of the nodes is down; once the node comes back up, it syncs the volume info back to the node and starts the volume. Heketi does not check whether any node is disconnected: it deletes the volume from the Heketi database even though it fails to clean up the bricks on the disconnected node.

[kubeexec] DEBUG 2016/06/08 08:09:13 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:301: Host: glusterfs-glusterfs-2-1-dbksi Command: sudo gluster --mode=script volume stop vol_660ce53ba483a2865a2b0647123733a3 force
Result: volume stop: vol_660ce53ba483a2865a2b0647123733a3: success

[kubeexec] DEBUG 2016/06/08 08:09:13 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:301: Host: glusterfs-glusterfs-2-1-dbksi Command: sudo gluster --mode=script volume delete vol_660ce53ba483a2865a2b0647123733a3
Result: volume delete: vol_660ce53ba483a2865a2b0647123733a3: success

[kubeexec] ERROR 2016/06/08 08:09:14 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:298: Failed to run command [sudo umount /var/lib/heketi/mounts/vg_682b3989ce3434bceef7feb1b0b2ff9b/brick_af650fa234362f7a2759f5db6e8aba3b] on glusterfs-glusterfs-1-1-k2c7y: Err[Error executing remote command: Error executing command in container: Error executing in Docker Container: 32]: Stdout []: Stderr [umount: /var/lib/heketi/mounts/vg_682b3989ce3434bceef7feb1b0b2ff9b/brick_af650fa234362f7a2759f5db6e8aba3b: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))]

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Stop glusterd on any one of the nodes
2. Try to delete a volume using Heketi

Actual results:
The volume is deleted from the Heketi db and the bricks are cleaned up only on the connected nodes.

Expected results:
The volume should not be deleted from the Heketi db.

Additional info:
Upstream BZ for GlusterD https://bugzilla.redhat.com/show_bug.cgi?id=1291262
The fix for BZ 1291262 was not planned for rhgs-3.1.3. IMHO, we should add a validation at the Heketi layer to prevent the volume deletion operation until we get this fixed upstream and pull it into downstream in the next release.
(In reply to Atin Mukherjee from comment #5)
> The fix for BZ 1291262 was not planned for rhgs-3.1.3. IMHO, we should add a
> validation at heketi layer to prevent volume deletion operation till we get
> this fixed upstream and pull it in downstream in next release.

AFAICT, it's *not* correct to add such a validation in Heketi, considering that Heketi gets a 'success' from glusterd for the volume deletion operation. Upon a volume deletion request, Heketi just calls the 'gluster volume delete' command, the same way a cluster admin would when deleting a volume. If glusterd returned a 'failure', we could have implemented some checks in Heketi. IMHO, this kind of logic should live in glusterd and not in the caller. Luis can share his thoughts though.
This is a very interesting situation. We have a volume which was successfully deleted, but the Heketi "garbage collector" was unable to free the space. This creates an out-of-sync situation between the actual storage used and the database. But if all else fails, and now that devices have state, I think that what Humble suggested could be possible.

Here is a possible way to deal with the situation after a successful volume deletion from glusterd:

1. Do not free the space in the DB. Place the volume in a "zombie" state. This would mean that volumes would also need state.
2. Place the disks used by this volume in an "offline" state.
3. Somehow notify the admin (probable future event-based system in Heketi).

To re-enable a disk, the admin would need to re-delete the zombied volume, and Heketi would then retry freeing the storage. Heketi would remove successfully freed storage bricks until none are left. It would re-enable any disk from which it has freed all of the bricks, and update the db.

I think that Heketi should also check for errors from glusterd. What do you guys think?
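To make the proposal above concrete, here is a rough, purely illustrative Go sketch. The VolumeState/DeviceState types and the markZombie/redelete helpers are hypothetical assumptions for discussion, not Heketi's actual data model or API.

package main

import "fmt"

// Hypothetical volume/device states; Heketi's real schema may differ.
type VolumeState int
type DeviceState int

const (
    VolumeOK VolumeState = iota
    VolumeZombie
)

const (
    DeviceOnline DeviceState = iota
    DeviceOffline
)

type Device struct {
    ID    string
    State DeviceState
}

type Brick struct {
    ID     string
    Device *Device
    Freed  bool
}

type Volume struct {
    ID     string
    State  VolumeState
    Bricks []*Brick
}

// markZombie handles the case where glusterd deleted the volume but brick
// cleanup failed: keep the space allocated in the DB, mark the volume as a
// zombie, and take the devices holding unfreed bricks offline.
func markZombie(v *Volume) {
    v.State = VolumeZombie
    for _, b := range v.Bricks {
        if !b.Freed {
            b.Device.State = DeviceOffline
        }
    }
    // A future event-based system would notify the admin here.
}

// redelete retries freeing the remaining bricks of a zombied volume and
// re-enables any device from which all bricks have now been freed.
func redelete(v *Volume, freeBrick func(*Brick) error) {
    pending := map[string]int{}
    for _, b := range v.Bricks {
        if !b.Freed && freeBrick(b) == nil {
            b.Freed = true
        }
        if !b.Freed {
            pending[b.Device.ID]++
        }
    }
    for _, b := range v.Bricks {
        if pending[b.Device.ID] == 0 {
            b.Device.State = DeviceOnline
        }
    }
}

func main() {
    dev := &Device{ID: "dev1", State: DeviceOnline}
    vol := &Volume{ID: "vol1", Bricks: []*Brick{{ID: "brick1", Device: dev}}}

    markZombie(vol)
    fmt.Println("device offline after failed cleanup:", dev.State == DeviceOffline)

    redelete(vol, func(b *Brick) error { return nil }) // retry succeeds this time
    fmt.Println("device back online:", dev.State == DeviceOnline)
}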
Now glusterd does not allow deleting a volume if a peer is down:

[kubeexec] ERROR 2016/06/15 05:19:34 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:298: Failed to run command [sudo gluster --mode=script volume delete vol_2222514b7d40f2caa5c5a0ea9cb434e1] on glusterfs-glusterfs-3-1-0y27w: Err[Error executing remote command: Error executing command in container: Error executing in Docker Container: 1]: Stdout []: Stderr [volume delete: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Some of the peers are down
]
[sshexec] ERROR 2016/06/15 05:19:34 /src/github.com/heketi/heketi/executors/sshexec/volume.go:158: Unable to delete volume vol_2222514b7d40f2caa5c5a0ea9cb434e1: Unable to execute command on glusterfs-glusterfs-3-1-0y27w: volume delete: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Some of the peers are down

But Heketi still runs the complete brick cleanup/volume deletion. I think we still need to add a validation at the Heketi layer based on the above errors returned from glusterd.

heketi-cli volume delete 2222514b7d40f2caa5c5a0ea9cb434e1
Volume 2222514b7d40f2caa5c5a0ea9cb434e1 deleted

gluster v status
Volume vol_2222514b7d40f2caa5c5a0ea9cb434e1 is not started

gluster v start vol_2222514b7d40f2caa5c5a0ea9cb434e1
volume start: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Failed to find brick directory /var/lib/heketi/mounts/vg_5c1f7aab4382462ddb05d2221d2f457e/brick_07ddb6b991c87e04d4adfd861684c957/brick for volume vol_2222514b7d40f2caa5c5a0ea9cb434e1. Reason : No such file or directory
Created attachment 1168273 [details]
volume deletion logs for reference

Attaching delete logs.
Created attachment 1168276 [details]
volume_deletion

Attaching the correct logs: volume_deletion.
@Neha and @Luis, this issue is fixed on the Gluster side, as mentioned in BZ 1344625. I am moving this bug to ON_QA for further validation.
Already tested after the 3.1.3 release. Moving back as per comment #18.
Is this a Heketi bug or a glusterd bug? Please set values accordingly.
(In reply to Luis Pabón from comment #24)
> Is this a Heketi bug or a glusterd bug? Please set values accordingly.

This is a Heketi bug.
Ok, thanks Neha, I was confused. So, do we need to use Comment #10 to solve the issue?
(In reply to Luis Pabón from comment #26)
> Ok, thanks Neha, I was confused. So, do we need to use Comment #10 to solve
> the issue?

This issue is now fixed on the glusterd side, so if any node is down, gluster will not allow the volume to be deleted from the backend. So I believe comment #10 is not required here. But we still need a validation in the Heketi layer based on comment #18.
In Heketi, if the volume deletion fails, do not continue deleting bricks
https://github.com/heketi/heketi/issues/421
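As an illustration of the fix described above ("if the volume deletion fails, do not continue deleting bricks"), here is a minimal Go sketch. The runCommand type, the deleteVolume signature, and the cleanupBricks callback are hypothetical assumptions for this example, not Heketi's actual executor API.

package main

import (
    "fmt"
    "strings"
)

// runCommand stands in for Heketi's remote executor (sshexec/kubeexec): it
// runs a gluster CLI command on a node and returns its output or an error.
type runCommand func(host, command string) (string, error)

// deleteVolume stops and deletes the volume through glusterd first, and only
// proceeds to brick cleanup when the delete actually succeeded. If glusterd
// refuses (for example "Some of the peers are down"), the bricks and the
// database entry are left untouched so Heketi stays in sync with the cluster.
func deleteVolume(run runCommand, host, volume string, cleanupBricks func() error) error {
    if _, err := run(host, fmt.Sprintf("gluster --mode=script volume stop %s force", volume)); err != nil {
        return fmt.Errorf("unable to stop volume %s: %v", volume, err)
    }
    if _, err := run(host, fmt.Sprintf("gluster --mode=script volume delete %s", volume)); err != nil {
        // Abort here: do not delete bricks or remove the volume from the DB.
        return fmt.Errorf("unable to delete volume %s: %v", volume, err)
    }
    return cleanupBricks()
}

func main() {
    // Simulate glusterd rejecting the delete because a peer is down.
    run := func(host, cmd string) (string, error) {
        if strings.Contains(cmd, "volume delete") {
            return "", fmt.Errorf("volume delete: failed: Some of the peers are down")
        }
        return "success", nil
    }

    err := deleteVolume(run, "node1", "vol_2222514b7d40f2caa5c5a0ea9cb434e1",
        func() error { return fmt.Errorf("brick cleanup should not run") })
    fmt.Println(err)
}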
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1498.html