Bug 1104635

Summary: [SNAPSHOT]: if the node goes down before the snap is marked to be deleted, the snaps are propagated to the other nodes and glusterd hangs
Product: Red Hat Gluster Storage
Component: snapshot
Reporter: Rahul Hinduja <rhinduja>
Assignee: Avra Sengupta <asengupt>
QA Contact: Rahul Hinduja <rhinduja>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: rhgs-3.0
Target Release: RHGS 3.0.3
Keywords: ZStream
Hardware: x86_64
OS: Linux
Whiteboard: SNAPSHOT
Fixed In Version: glusterfs-3.6.0.33-1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-01-15 13:37:37 UTC
CC: asengupt, nsathyan, rhs-bugs, rjoseph, senaik, sharne, ssamanta, storage-qa-internal
Cloned To: 1104714
Bug Depends On: 1086145
Bug Blocks: 1087818, 1104714, 1162694, 1175754

Description Rahul Hinduja 2014-06-04 11:53:23 UTC
Description of problem:
=======================

In a case where the snap delete is issued, if the node goes down before the snap is marked to be deleted, then when the node comes back the snaps are propagated to the other nodes and glusterd hangs.



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.6.0.12-1.el6rhs.x86_64


Steps to Reproduce:
===================
1. Set up a 4-node cluster
2. Create a volume
3. Create 256 snapshots of the volume
4. Start deleting snapshots of the volume in a loop (--mode=script)
5. While snap deletion is in progress, stop and start the glusterd service on one node multiple times.
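The delete/restart race in steps 4-5 can be sketched as a small shell harness. This is a hypothetical repro script, not from the original report: the volume name "vol1", the snapshot names, and the restart cadence are all assumptions.

```shell
#!/bin/sh
# Sketch of the reproduction loop. Run create_snaps and then delete_snaps
# on one node; run bounce_glusterd on a second node while deletes are in flight.

create_snaps() {            # step 3: create 256 snapshots of the volume
    for i in $(seq 1 256); do
        gluster --mode=script snapshot create "snap$i" vol1
    done
}

delete_snaps() {            # step 4: delete them in a loop, non-interactively
    for i in $(seq 1 256); do
        gluster --mode=script snapshot delete "snap$i"
    done
}

bounce_glusterd() {         # step 5: restart glusterd repeatedly on another node
    for _ in 1 2 3; do
        service glusterd stop
        sleep 5
        service glusterd start
        sleep 10
    done
}
```

`--mode=script` suppresses the interactive "Do you still want to continue? (y/n)" prompt, which is what lets the delete loop run unattended.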


Actual results:
===============
1. Snapshot commit failed on the node that went down.
2. Once the node was brought back, the snap was present on all the systems, with no entry in missed_snaps_list.
3. glusterd hangs on the machines that were up.


Expected results:
=================

1. The snapshot delete should fail with a proper message.
2. Once the node is brought back, the snap should be deleted from all the nodes.
3. glusterd should not hang.

Since this not only hampers the missed-snap functionality but also makes the whole cluster unresponsive, raising the bug with urgent severity.

Comment 3 Avra Sengupta 2014-06-13 11:45:42 UTC
Fix at https://code.engineering.redhat.com/gerrit/26884

Comment 4 Rahul Hinduja 2014-06-16 12:32:08 UTC
Verified this with build: glusterfs-3.6.0.17-1.el6rhs.x86_64

Initially had 180 snaps and started deleting them in a loop. While deletion was in progress, brought glusterd down and back up multiple times on one server.

A few of the snap deletes failed with the message:

"snapshot delete: failed: snap snap70 might not be in an usable state.
Snapshot command failed"

Once all the snap deletes had been issued and glusterd was brought back online, all the snaps except the ones that might be in an unusable state were deleted. The respective entries were marked as 2:2 in missed_snaps_list:

[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list  | wc
    171     171   30609
[root@rhs-arch-srv2 ~]# 
[root@rhs-arch-srv2 ~]# 
[root@rhs-arch-srv2 ~]# service glusterd status
glusterd (pid  19503) is running...
[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list  | grep ":2:2" | wc
    171     171   30609
[root@rhs-arch-srv2 ~]# 


The above confirms that the snaps were marked for deletion and were successfully deleted after the handshake.
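The same check can be scripted. A minimal sketch, with the assumption (based only on the log above) that entries whose trailing op:status fields are "2:2" are missed delete ops that have been replayed; the exact field layout of missed_snaps_list is not documented here, so treat the pattern as an assumption.

```shell
#!/bin/sh
# count_replayed PATH: print "<total> <replayed>" for a missed_snaps_list
# file, where "replayed" counts lines whose trailing fields are ":2:2"
# (assumed here to mean a delete op replayed after the handshake).
count_replayed() {
    total=$(wc -l < "$1")
    replayed=$(grep -c ':2:2$' "$1")
    echo "$total $replayed"
}
```

Usage on a node would be `count_replayed /var/lib/glusterd/snaps/missed_snaps_list`; in the run above both numbers were 171.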


glusterd was not hung and was able to delete the snaps for which snapshot delete had failed.

[root@inception ~]# ls /var/lib/glusterd/snaps/
missed_snaps_list  snap143  snap166  snap50  snap70  snap91
[root@inception ~]# 
[root@inception ~]# gluster snapshot list
snap50
snap70
snap91
snap143
snap166
[root@inception ~]# gluster snapshot delete snap143
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap143: snap removed successfully
[root@inception ~]# gluster snapshot list
snap50
snap70
snap91
snap166
[root@inception ~]# gluster snapshot delete snap50
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap50: snap removed successfully
[root@inception ~]# 
[root@inception ~]# 
[root@inception ~]# gluster snapshot list
No snapshots present
[root@inception ~]# 




[root@rhs-arch-srv2 ~]# gluster snapshot list
snap50
snap70
snap91
snap143
snap166
[root@rhs-arch-srv2 ~]# gluster snapshot list
snap50
snap70
snap91
snap166
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap166
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap166: snap removed successfully
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap91
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap91: snap removed successfully
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap70
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap70: snap removed successfully
[root@rhs-arch-srv2 ~]# 


Moving the bug to verified state.

Comment 5 senaik 2014-09-18 12:25:56 UTC
Version : glusterfs 3.6.0.28
========

While deleting snaps in a loop, glusterd was restarted on a few nodes. Some snapshots remained in the system because they had not been marked for decommission on the nodes where glusterd went down. When glusterd came back up on those nodes, it recreated the snaps on the other nodes, so retrying the snap deletion fails. However, glusterd does not hang.

The snapshots below failed with a 'Commit failed' error when glusterd was restarted on the other nodes:

gluster snapshot list
vol1_snap_6
vol1_snap_18
vol1_snap_19
vol1_snap_56
vol1_snap_71
vol1_snap_72
vol1_snap_73
vol1_snap_87
vol1_snap_107
vol1_snap_113
vol1_snap_138
vol1_snap_143
vol1_snap_177
vol1_snap_189

Delete snapshot :

gluster snapshot delete vol1_snap_6
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: failed: Commit failed on snapshot14.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed

Re-opening the bug

Comment 9 Shalaka 2014-09-25 07:27:24 UTC
Edited doc text. Please review and sign-off.

Comment 10 rjoseph 2014-09-25 08:20:09 UTC
The doc text looks fine to me.

Comment 11 Avra Sengupta 2014-11-12 11:59:53 UTC
Fixed with https://code.engineering.redhat.com/gerrit/36489

Comment 12 senaik 2014-12-16 09:32:36 UTC
Version : glusterfs 3.6.0.37
========

Retried the steps mentioned in the Description and in Comment 5; unable to reproduce the issue.

Marking the bug as 'Verified'

Comment 14 errata-xmlrpc 2015-01-15 13:37:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html