Bug 1344239 - Volume deletion: Validate if gluster node is down
Summary: Volume deletion: Validate if gluster node is down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: heketi
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS Container Converged 1.0
Assignee: Luis Pabón
QA Contact: Neha
URL:
Whiteboard:
Depends On: 1344625
Blocks: 1332128
 
Reported: 2016-06-09 08:55 UTC by Neha
Modified: 2016-11-08 22:24 UTC
CC List: 9 users

Fixed In Version: v2.0.4-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-04 04:51:03 UTC
Embargoed:


Attachments
volume deletion logs (17.47 KB, text/plain), 2016-06-15 09:35 UTC, Neha
volume_deletion (15.63 KB, text/plain), 2016-06-15 09:38 UTC, Neha


Links
Red Hat Product Errata RHBA-2016:1498 (normal, SHIPPED_LIVE): heketi update for Red Hat Gluster Storage 3.1. Last Updated: 2016-08-04 08:49:19 UTC

Description Neha 2016-06-09 08:55:16 UTC
Description of problem:

Currently gluster allows a volume to be deleted even if one of the nodes is down; once that node comes back up, it syncs the volume info back and starts the volume.

Heketi does not check whether any node is disconnected. It deletes the volume from the Heketi database even though it fails to clean up the bricks on the disconnected node.

[kubeexec] DEBUG 2016/06/08 08:09:13 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:301: Host: glusterfs-glusterfs-2-1-dbksi Command: sudo gluster --mode=script volume stop vol_660ce53ba483a2865a2b0647123733a3 force
Result: volume stop: vol_660ce53ba483a2865a2b0647123733a3: success
[kubeexec] DEBUG 2016/06/08 08:09:13 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:301: Host: glusterfs-glusterfs-2-1-dbksi Command: sudo gluster --mode=script volume delete vol_660ce53ba483a2865a2b0647123733a3
Result: volume delete: vol_660ce53ba483a2865a2b0647123733a3: success

[kubeexec] ERROR 2016/06/08 08:09:14 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:298: Failed to run command [sudo umount /var/lib/heketi/mounts/vg_682b3989ce3434bceef7feb1b0b2ff9b/brick_af650fa234362f7a2759f5db6e8aba3b] on glusterfs-glusterfs-1-1-k2c7y: Err[Error executing remote command: Error executing command in container: Error executing in Docker Container: 32]: Stdout []: Stderr [umount: /var/lib/heketi/mounts/vg_682b3989ce3434bceef7feb1b0b2ff9b/brick_af650fa234362f7a2759f5db6e8aba3b: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
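
For illustration only, a minimal Go sketch of the kind of pre-check this bug asks for: refuse to start a volume delete while any peer in the trusted pool is not connected. The allPeersConnected helper and the reliance on the 'gluster pool list' output format are assumptions for the sketch, not heketi's actual code.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// allPeersConnected is a hypothetical helper, not heketi's real executor code.
// It shells out to the gluster CLI and reports whether every peer in the
// trusted storage pool is in the "Connected" state.
func allPeersConnected() (bool, error) {
    out, err := exec.Command("gluster", "pool", "list").CombinedOutput()
    if err != nil {
        return false, fmt.Errorf("gluster pool list failed: %v: %s", err, out)
    }
    lines := strings.Split(strings.TrimSpace(string(out)), "\n")
    for _, line := range lines[1:] { // skip the UUID/Hostname/State header row
        fields := strings.Fields(line)
        if len(fields) > 0 && fields[len(fields)-1] != "Connected" {
            return false, nil
        }
    }
    return true, nil
}

func main() {
    ok, err := allPeersConnected()
    if err != nil {
        fmt.Println("unable to check peer state:", err)
        return
    }
    if !ok {
        fmt.Println("refusing to delete the volume: some peers are down")
        return
    }
    fmt.Println("all peers connected, safe to proceed with the volume delete")
}

heketi's real executors (sshexec/kubeexec, as seen in the logs above) run such commands remotely rather than via os/exec on the local node.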


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Stop glusterd on any of the node
2. Try to delete a volume using Heketi

Actual results:
Heketi deletes the volume from its db and cleans up the bricks on the connected nodes, leaving the bricks on the disconnected node behind.

Expected results:
Heketi should not delete the volume from its db.

Additional info:

Comment 4 Neha 2016-06-09 09:46:37 UTC
Upstream BZ for GlusterD https://bugzilla.redhat.com/show_bug.cgi?id=1291262

Comment 5 Atin Mukherjee 2016-06-09 10:10:35 UTC
The fix for BZ 1291262 was not planned for rhgs-3.1.3. IMHO, we should add a validation at the heketi layer to prevent the volume deletion operation until this is fixed upstream and pulled into downstream in the next release.

Comment 6 Humble Chirammal 2016-06-09 11:04:32 UTC
(In reply to Atin Mukherjee from comment #5)
> The fix for BZ 1291262 was not planned for rhgs-3.1.3. IMHO, we should add
> a validation at the heketi layer to prevent the volume deletion operation
> until this is fixed upstream and pulled into downstream in the next release.

AFAICT, it's *not* correct to add such a validation in heketi, considering that heketi gets a 'success' from glusterd for the volume deletion operation. On a volume deletion request, heketi simply calls the 'gluster volume delete' command, just as a cluster admin would when deleting a volume. If glusterd had returned a failure, we could have implemented some checks in heketi. IMHO, this kind of logic should live in glusterd, not in the caller. Luis can share his thoughts though.

Comment 10 Luis Pabón 2016-06-10 02:51:48 UTC
This is a very interesting situation. We have a volume which was successfully deleted, but the Heketi "garbage collector" was unable to free the space. This would create an out-of-sync situation between the actual storage used and the database.

But if all else fails, and now that devices have state, I think that what Humble suggested could be possible.  Here is a possible solution to deal with the situation after a successful volume deletion from glusterd.

1. Do not free the space in the DB. Place the volume in a "zombie" state. This would mean that volumes would also need state.
2. Place the disks used by this volume in an "offline" state.
3. Somehow notify the admin (probably via a future event-based system in Heketi).

To re-enable a disk, the admin would need to re-delete the zombied volume, and Heketi would retry freeing the storage. Heketi would remove successfully freed bricks until none are left, re-enable any disk from which all bricks have been freed, and update the db.

I think that Heketi should also check for errors from glusterd.

What do you guys think?
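
For discussion, a small Go sketch of the retry flow proposed above. The Volume/Brick types, the state strings, and the freeBrick/setDeviceOnline callbacks are made up for the example; heketi's actual data model and executor layer differ.

package main

import "fmt"

// Brick and Volume are hypothetical stand-ins for heketi's data model; they
// only illustrate the "zombie volume" flow described in this comment.
type Brick struct {
    ID     string
    Device string
    Freed  bool
}

type Volume struct {
    ID     string
    State  string // "online", "zombie", or "deleted"
    Bricks []Brick
}

// retryZombieCleanup re-attempts freeing every remaining brick of a zombied
// volume. freeBrick stands in for the real umount/lvremove work and may keep
// failing while the owning node is down; setDeviceOnline stands in for
// re-enabling a device (the real logic would also check the device's other bricks).
func retryZombieCleanup(v *Volume, freeBrick func(Brick) error, setDeviceOnline func(string)) {
    remaining := 0
    for i := range v.Bricks {
        if v.Bricks[i].Freed {
            continue
        }
        if err := freeBrick(v.Bricks[i]); err != nil {
            remaining++ // leave the brick for a later retry
            continue
        }
        v.Bricks[i].Freed = true
        setDeviceOnline(v.Bricks[i].Device)
    }
    if remaining == 0 {
        v.State = "deleted" // only now is the space freed in the db
    }
}

func main() {
    v := &Volume{ID: "vol_x", State: "zombie", Bricks: []Brick{{ID: "b1", Device: "d1"}}}
    retryZombieCleanup(v, func(Brick) error { return nil }, func(string) {})
    fmt.Println(v.State) // prints "deleted" once every brick has been freed
}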

Comment 18 Neha 2016-06-15 09:28:29 UTC
Now glusterd does not allow a volume to be deleted if a peer is down:

[kubeexec] ERROR 2016/06/15 05:19:34 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:298: Failed to run command [sudo gluster --mode=script volume delete vol_2222514b7d40f2caa5c5a0ea9cb434e1] on glusterfs-glusterfs-3-1-0y27w: Err[Error executing remote command: Error executing command in container: Error executing in Docker Container: 1]: Stdout []: Stderr [volume delete: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Some of the peers are down
]
[sshexec] ERROR 2016/06/15 05:19:34 /src/github.com/heketi/heketi/executors/sshexec/volume.go:158: Unable to delete volume vol_2222514b7d40f2caa5c5a0ea9cb434e1: Unable to execute command on glusterfs-glusterfs-3-1-0y27w: volume delete: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Some of the peers are down

But Heketi still runs the complete brick cleanup and volume deletion.

I think we still need to add validation in the Heketi layer based on the errors returned from glusterd above.

 heketi-cli volume delete 2222514b7d40f2caa5c5a0ea9cb434e1
Volume 2222514b7d40f2caa5c5a0ea9cb434e1 deleted

gluster  v status
Volume vol_2222514b7d40f2caa5c5a0ea9cb434e1 is not started

gluster v start vol_2222514b7d40f2caa5c5a0ea9cb434e1
volume start: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Failed to find brick directory /var/lib/heketi/mounts/vg_5c1f7aab4382462ddb05d2221d2f457e/brick_07ddb6b991c87e04d4adfd861684c957/brick for volume vol_2222514b7d40f2caa5c5a0ea9cb434e1. Reason : No such file or directory
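
A tiny Go sketch of the validation being asked for above, i.e. recognizing this specific glusterd failure in the command's stderr and stopping instead of continuing with brick cleanup. The peersDown helper and the exact string match are assumptions, not heketi code.

package main

import (
    "fmt"
    "strings"
)

// peersDown reports whether a gluster CLI error message indicates that the
// delete was rejected because part of the trusted pool is unreachable.
func peersDown(stderr string) bool {
    return strings.Contains(stderr, "Some of the peers are down")
}

func main() {
    stderr := "volume delete: vol_2222514b7d40f2caa5c5a0ea9cb434e1: failed: Some of the peers are down"
    if peersDown(stderr) {
        fmt.Println("abort: do not clean up bricks or remove the volume from the heketi db")
    }
}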

Comment 19 Neha 2016-06-15 09:35:15 UTC
Created attachment 1168273 [details]
volume deletion logs

For reference, attaching the delete logs.

Comment 20 Neha 2016-06-15 09:38:11 UTC
Created attachment 1168276 [details]
volume_deletion

Attaching the correct logs: volume_deletion.

Comment 22 Humble Chirammal 2016-07-05 10:11:09 UTC
@Neha and @Luis, this issue is fixed on the Gluster side, as mentioned in BZ 1344625. I am moving this bug to ON_QA for further validation.

Comment 23 Neha 2016-07-05 13:31:17 UTC
Already tested after the 3.1.3 release. Moving back as per comment #18.

Comment 24 Luis Pabón 2016-07-06 16:32:28 UTC
Is this a Heketi bug or a glusterd bug?  Please set values accordingly.

Comment 25 Neha 2016-07-07 03:41:39 UTC
(In reply to Luis Pabón from comment #24)
> Is this a Heketi bug or a glusterd bug?  Please set values accordingly.

This is a Heketi Bug.

Comment 26 Luis Pabón 2016-07-07 03:50:12 UTC
Ok, thanks Neha, I was confused.  So, do we need to use Comment #10 to solve the issue?

Comment 27 Neha 2016-07-07 04:13:13 UTC
(In reply to Luis Pabón from comment #26)
> Ok, thanks Neha, I was confused.  So, do we need to use Comment #10 to solve
> the issue?

This issue is now fixed on the glusterd side, so if any node is down, gluster will not allow the volume to be deleted from the backend.

So I believe comment #10 is not required here, but we still need validation in the Heketi layer based on comment #18.

Comment 28 Luis Pabón 2016-07-07 11:48:15 UTC
In Heketi, if the volume deletion fails, do not continue deleting bricks
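
A minimal Go sketch of that ordering, assuming hypothetical deleteGlusterVolume, destroyBricks, and removeFromDB stand-ins for heketi's internals: the gluster-side delete must succeed before any brick or db cleanup happens.

package main

import (
    "errors"
    "fmt"
)

// deleteVolume sketches the control flow: stop at the first failure. The
// three callbacks are hypothetical stand-ins, not heketi's actual functions.
func deleteVolume(id string,
    deleteGlusterVolume func(string) error,
    destroyBricks func(string) error,
    removeFromDB func(string) error) error {

    // If glusterd refuses the delete (for example because a peer is down),
    // leave the bricks and the db entry untouched and surface the error.
    if err := deleteGlusterVolume(id); err != nil {
        return fmt.Errorf("volume %v not deleted: %v", id, err)
    }
    if err := destroyBricks(id); err != nil {
        return err
    }
    return removeFromDB(id)
}

func main() {
    err := deleteVolume("vol_x",
        func(string) error { return errors.New("failed: Some of the peers are down") },
        func(string) error { return nil },
        func(string) error { return nil })
    fmt.Println(err) // volume vol_x not deleted: failed: Some of the peers are down
}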

Comment 29 Luis Pabón 2016-07-07 19:55:31 UTC
https://github.com/heketi/heketi/issues/421

Comment 33 errata-xmlrpc 2016-08-04 04:51:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1498.html

