Description of problem:
=======================
If a node goes down after a snapshot delete is issued but before the snap is marked for deletion, then when the node comes back up the snaps are propagated back to the other nodes and glusterd hangs.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.0.12-1.el6rhs.x86_64

Steps to Reproduce:
===================
1. Set up a 4-node cluster.
2. Create a volume.
3. Create 256 snapshots of the volume.
4. Start deleting the snapshots of the volume in a loop (--mode=script); see the sketch below.
5. While snap deletion is in progress, stop and start the glusterd service on one node multiple times.

Actual results:
===============
1. Snapshot commit failed on the node which went down.
2. Once the node is brought back, the snap is present on all the systems and there is no entry in missed_snaps_list.
3. glusterd hangs on the machines which were up.

Expected results:
=================
1. Snapshot delete should fail with a proper message.
2. Once the node is brought back, the snap should be deleted from all the nodes.
3. glusterd should not hang.

Since this not only hampers the missed-snap functionality but also makes the whole cluster unresponsive, raising the bug with urgent severity.
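To make steps 3-5 concrete, a minimal reproduction sketch follows. The volume name vol1, the snap1..snap256 naming, the iteration counts, and the sleep intervals are placeholders, not values from the original report; the create/delete loops run on one node while glusterd is bounced on another:

# Steps 3-4: create the snapshots, then delete them in a loop.
# --mode=script suppresses the interactive y/n confirmation.
for i in $(seq 1 256); do
    gluster --mode=script snapshot create snap$i vol1
done

for i in $(seq 1 256); do
    gluster --mode=script snapshot delete snap$i
done

# Step 5: in parallel, on another node, bounce glusterd repeatedly
# while the deletion loop above is still running.
for n in 1 2 3; do
    service glusterd stop
    sleep 10
    service glusterd start
    sleep 10
done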
Fix at https://code.engineering.redhat.com/gerrit/26884
Verified this with build: glusterfs-3.6.0.17-1.el6rhs.x86_64

Initially had 180 snaps and started deletion in a loop. While deletion was in progress, brought glusterd down and back up multiple times on one server.

A few of the snap deletes failed with the message:
"snapshot delete: failed: snap snap70 might not be in an usable state. Snapshot command failed"

Once the deletion loop completed and glusterd was brought back online, all the snaps except the ones that might be in an unusable state were deleted. The respective entries were marked as :2:2 in missed_snaps_list:

[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list | wc
    171     171   30609
[root@rhs-arch-srv2 ~]#
[root@rhs-arch-srv2 ~]# service glusterd status
glusterd (pid 19503) is running...
[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list | grep ":2:2" | wc
    171     171   30609
[root@rhs-arch-srv2 ~]#

The above confirms that the snaps were marked for deletion and were successfully deleted after the handshake. glusterd was not hung, and the snaps for which snapshot delete had failed could then be deleted:

[root@inception ~]# ls /var/lib/glusterd/snaps/
missed_snaps_list  snap143  snap166  snap50  snap70  snap91
[root@inception ~]#
[root@inception ~]# gluster snapshot list
snap50
snap70
snap91
snap143
snap166
[root@inception ~]# gluster snapshot delete snap143
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap143: snap removed successfully
[root@inception ~]# gluster snapshot list
snap50
snap70
snap91
snap166
[root@inception ~]# gluster snapshot delete snap50
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap50: snap removed successfully
[root@inception ~]#
[root@inception ~]#
[root@inception ~]# gluster snapshot list
No snapshots present
[root@inception ~]#

[root@rhs-arch-srv2 ~]# gluster snapshot list
snap50
snap70
snap91
snap143
snap166
[root@rhs-arch-srv2 ~]# gluster snapshot list
snap50
snap70
snap91
snap166
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap166
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap166: snap removed successfully
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap91
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap91: snap removed successfully
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap70
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap70: snap removed successfully
[root@rhs-arch-srv2 ~]#

Moving the bug to verified state.
Version: glusterfs 3.6.0.28
========
While deleting snaps in a loop, restarted glusterd on a few nodes. Some snapshots still remained in the system because they were not marked for decommission on the nodes where glusterd went down. When glusterd comes back up on those nodes, it recreates the snaps on the other nodes, so when the snap deletion is tried again it fails. However, glusterd does not hang.

The below snapshots failed with a 'Commit failed' error when glusterd was restarted on other nodes:

gluster snapshot list
vol1_snap_6
vol1_snap_18
vol1_snap_19
vol1_snap_56
vol1_snap_71
vol1_snap_72
vol1_snap_73
vol1_snap_87
vol1_snap_107
vol1_snap_113
vol1_snap_138
vol1_snap_143
vol1_snap_177
vol1_snap_189

Delete snapshot:

gluster snapshot delete vol1_snap_6
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: failed: Commit failed on snapshot14.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed

Re-opening the bug.
Edited doc text. Please review and sign off.
The doc text looks fine to me.
Fixed with https://code.engineering.redhat.com/gerrit/36489
Version: glusterfs 3.6.0.37
========
Retried the steps as mentioned in the Description and Comment 5; unable to reproduce the issue. Marking the bug as 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html