Description of problem:
=======================
If a node goes down after a snapshot delete is issued but before the snap is marked for deletion, then when the node comes back up the snaps are propagated back to the other nodes and glusterd hangs.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.0.12-1.el6rhs.x86_64

Steps to Reproduce:
===================
1. Set up a 4-node cluster.
2. Create a volume.
3. Create 256 snapshots of the volume.
4. Start deleting the snapshots of the volume in a loop (--mode=script); see the sketch below.
5. While snap deletion is in progress, stop and start the glusterd service on one node multiple times.

Actual results:
===============
1. Snapshot commit failed on the node which went down.
2. Once the node is brought back, the snap is present on all the systems and there is no entry in missed_snaps_list.
3. glusterd hangs on the machines which were up.

Expected results:
=================
1. Snapshot delete should fail with a proper message.
2. Once the node is brought back, the snap should be deleted from all the nodes.
3. glusterd should not hang.

Since this not only hampers the missed-snap functionality but also makes the whole cluster unresponsive, raising the bug with urgent severity.
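To make steps 3-5 concrete, a minimal reproduction sketch follows. The volume name vol1, the snap1..snap256 naming, the iteration counts, and the sleep intervals are placeholders, not values from the original report; the create/delete loops run on one node while glusterd is bounced on another:

# Steps 3-4: create the snapshots, then delete them in a loop.
# --mode=script suppresses the interactive y/n confirmation.
for i in $(seq 1 256); do
    gluster --mode=script snapshot create snap$i vol1
done

for i in $(seq 1 256); do
    gluster --mode=script snapshot delete snap$i
done

# Step 5: in parallel, on another node, bounce glusterd repeatedly
# while the deletion loop above is still running.
for n in 1 2 3; do
    service glusterd stop
    sleep 10
    service glusterd start
    sleep 10
done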
Fix at https://code.engineering.redhat.com/gerrit/26884
Verified this with build: glusterfs-3.6.0.17-1.el6rhs.x86_64

Initially had 180 snaps and started deletion in a loop. While deletion was in progress, brought glusterd down and back up multiple times on one server.

A few of the snap deletes failed with the message:
"snapshot delete: failed: snap snap70 might not be in an usable state. Snapshot command failed"

Once the deletion loop completed and glusterd was brought back online, all the snaps except the ones that might be in an unusable state were deleted. The respective entries were marked as :2:2 in missed_snaps_list:

[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list | wc
    171     171   30609
[root@rhs-arch-srv2 ~]#
[root@rhs-arch-srv2 ~]# service glusterd status
glusterd (pid 19503) is running...
[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list | grep ":2:2" | wc
    171     171   30609
[root@rhs-arch-srv2 ~]#

The above confirms that the snaps were marked for deletion and were successfully deleted after the handshake. glusterd was not hung, and the snaps for which snapshot delete had failed could then be deleted:

[root@inception ~]# ls /var/lib/glusterd/snaps/
missed_snaps_list  snap143  snap166  snap50  snap70  snap91
[root@inception ~]#
[root@inception ~]# gluster snapshot list
snap50
snap70
snap91
snap143
snap166
[root@inception ~]# gluster snapshot delete snap143
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap143: snap removed successfully
[root@inception ~]# gluster snapshot list
snap50
snap70
snap91
snap166
[root@inception ~]# gluster snapshot delete snap50
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap50: snap removed successfully
[root@inception ~]#
[root@inception ~]#
[root@inception ~]# gluster snapshot list
No snapshots present
[root@inception ~]#

[root@rhs-arch-srv2 ~]# gluster snapshot list
snap50
snap70
snap91
snap143
snap166
[root@rhs-arch-srv2 ~]# gluster snapshot list
snap50
snap70
snap91
snap166
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap166
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap166: snap removed successfully
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap91
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap91: snap removed successfully
[root@rhs-arch-srv2 ~]# gluster snapshot delete snap70
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap70: snap removed successfully
[root@rhs-arch-srv2 ~]#

Moving the bug to verified state.
Version: glusterfs 3.6.0.28
========
While deleting snaps in a loop, restarted glusterd on a few nodes. Some snapshots still remained in the system because they were not marked for decommission on the nodes where glusterd went down. When glusterd comes back up on those nodes, it recreates the snaps on the other nodes, so when the snap deletion is tried again it fails. However, glusterd does not hang.

The below snapshots failed with a 'Commit failed' error when glusterd was restarted on other nodes:

gluster snapshot list
vol1_snap_6
vol1_snap_18
vol1_snap_19
vol1_snap_56
vol1_snap_71
vol1_snap_72
vol1_snap_73
vol1_snap_87
vol1_snap_107
vol1_snap_113
vol1_snap_138
vol1_snap_143
vol1_snap_177
vol1_snap_189

Delete snapshot:

gluster snapshot delete vol1_snap_6
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: failed: Commit failed on snapshot14.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed

Re-opening the bug.
Edited doc text. Please review and sign off.
The doc text looks fine to me.
Fixed with https://code.engineering.redhat.com/gerrit/36489
Version: glusterfs 3.6.0.37
========
Retried the steps as mentioned in the Description and Comment 5; unable to reproduce the issue. Marking the bug as 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html