Bug 1168803 - [USS]: When snapd has crashed, gluster volume stop/delete operations fail, leaving the cluster in an inconsistent state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard: USS
Depends On: 1168607
Blocks: 1175765
 
Reported: 2014-11-28 04:58 UTC by Atin Mukherjee
Modified: 2015-05-14 17:45 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.7.0
Doc Type: Bug Fix
Doc Text:
Clone Of: 1168607
: 1175765
Environment:
Last Closed: 2015-05-14 17:28:36 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Atin Mukherjee 2014-11-28 04:58:08 UTC
+++ This bug was initially created as a clone of Bug #1168607 +++

Description of problem:
When snapd has crashed, the gluster volume stop/delete operation fails, which leaves the cluster in an inconsistent state.


Version-Release number of selected component (if applicable):

[root@dhcp42-244 core]# rpm -qa | grep gluster
gluster-nagios-common-0.1.3-2.el6rhs.noarch
samba-glusterfs-3.6.509-169.1.el6rhs.x86_64
glusterfs-libs-3.6.0.34-1.el6rhs.x86_64
glusterfs-server-3.6.0.34-1.el6rhs.x86_64
glusterfs-cli-3.6.0.34-1.el6rhs.x86_64
glusterfs-3.6.0.34-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.34-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.34-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.34-1.el6rhs.x86_64
vdsm-gluster-4.14.7.2-1.el6rhs.noarch
gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.34-1.el6rhs.x86_64
glusterfs-api-3.6.0.34-1.el6rhs.x86_64
[root@dhcp42-244 core]# 


How reproducible:
Always

Steps to Reproduce:
1. Create a 2*2 dist-rep volume and start it
2. Mount the volume from client (Fuse)
3. Take 2 snapshots(snap1 and snap2) and enable USS
4. Send lookups continuously on .snaps until snapd crashes unexpectedly
   (hit BZ 1168497)
5. Try to stop the volume; it fails with the error "Commit failed on localhost. Please check the log file for more details"
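
The reproduction steps above, condensed into a shell sketch (host names, brick paths, and the mount point are placeholders; the snapd crash itself comes from BZ 1168497 and cannot be scripted deterministically):

# Create and start a 2x2 distributed-replicate volume (placeholder hosts/paths).
gluster volume create testvol replica 2 \
    host1:/rhs/brick1/testvol host2:/rhs/brick2/testvol \
    host3:/rhs/brick3/testvol host4:/rhs/brick4/testvol
gluster volume start testvol

# FUSE-mount the volume on a client.
mkdir -p /mnt/fusemnt
mount -t glusterfs host1:/testvol /mnt/fusemnt

# Take two snapshots and enable USS (this starts snapd on the server nodes).
gluster snapshot create snap1 testvol
gluster snapshot create snap2 testvol
gluster volume set testvol features.uss on

# Drive continuous lookups on the virtual .snaps directory from the client;
# in the reported run snapd crashed during this step (BZ 1168497).
while true; do ls /mnt/fusemnt/.snaps > /dev/null; done &

# With snapd dead, stopping the volume fails with:
# "Commit failed on localhost. Please check the log file for more details."
gluster volume stop testvol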

Please see the "Additional Info" section for more details.

Actual results:
The volume is in the "Stopped" state on one node whereas it is in the "Started" state on another node, which leaves the cluster in an inconsistent state.
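
A quick way to spot the inconsistency is to compare the volume status reported by each peer (a sketch; the host list and volume name are taken from this report):

for h in 10.70.42.244 10.70.43.6 10.70.42.204 10.70.42.10; do
    echo "== $h =="
    ssh "$h" gluster volume info testvol | grep '^Status'
done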

Expected results:
Volume should be stopped/deleted successfully.


Additional info:
Node1:
=====
[root@dhcp42-244 core]# gluster peer status
Number of Peers: 3

Hostname: 10.70.43.6
Uuid: 2c0d5fe8-a014-4978-ace7-c663e4cc8d91
State: Peer in Cluster (Connected)

Hostname: 10.70.42.204
Uuid: 2a2a1b36-37e3-4336-b82a-b09dcc2f745e
State: Peer in Cluster (Connected)

Hostname: 10.70.42.10
Uuid: 77c49bfc-6cb4-44f3-be12-41447a3a452e
State: Peer in Cluster (Connected)
[root@dhcp42-244 core]#


[root@dhcp42-244 ~]# gluster volume info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
features.uss: on
features.barrier: disable
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
 
Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.barrier: disable
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
[root@dhcp42-244 ~]#


[root@dhcp42-244 core]# gluster volume stop testvol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: testvol: failed: Commit failed on localhost. Please check the log file for more details.
[root@dhcp42-244 core]# gluster volume info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
 
Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: failed: Cannot delete Volume testvol ,as it has 2 snapshots. To delete the volume, first delete all the snapshots under it.

[root@dhcp42-244 core]# gluster snapshot list testvol
snap1
snap2

[root@dhcp42-244 core]# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap1: snap removed successfully
[root@dhcp42-244 core]# gluster snapshot delete snap2
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap2: snap removed successfully

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
[root@dhcp42-244 core]#

[root@dhcp42-244 core]# gluster volume info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
 
Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
[root@dhcp42-244 core]#

Node2:
=====
[root@dhcp43-6 ~]# gluster volume info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started ---> "gluster volume info testvol" on Node2 reports the
                      volume as Started whereas Node1 reports it as Stopped,
                      which leaves the cluster in an inconsistent state
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: off
features.barrier: disable
features.uss: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
 


Client Logs:
============
[root@dhcp43-190 fusemnt]# cd .snaps
-bash: cd: .snaps: Transport endpoint is not connected
[root@dhcp43-190 fusemnt]#

Comment 1 Anand Avati 2014-11-28 05:18:02 UTC
REVIEW: http://review.gluster.org/9206 (glusterd/uss: if snapd is not running, return success from glusterd_handle_snapd_option) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 2 Anand Avati 2014-11-28 06:16:32 UTC
REVIEW: http://review.gluster.org/9206 (glusterd/uss: if snapd is not running, return success from glusterd_handle_snapd_option) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 3 Anand Avati 2014-12-01 07:05:30 UTC
COMMIT: http://review.gluster.org/9206 committed in master by Krishnan Parthasarathi (kparthas) 
------
commit 92242ecd1047fe23ca494555edd6033685522c82
Author: Atin Mukherjee <amukherj>
Date:   Fri Nov 28 10:46:20 2014 +0530

    glusterd/uss: if snapd is not running, return success from glusterd_handle_snapd_option
    
    glusterd_handle_snapd_option was returning failure if snapd is not running
    because of which gluster commands were failing.
    
    Change-Id: I22286f4ecf28b57dfb6fb8ceb52ca8bdc66aec5d
    BUG: 1168803
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/9206
    Reviewed-by: Kaushal M <kaushal>
    Reviewed-by: Avra Sengupta <asengupt>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijaikumar Mallikarjuna <vmallika>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Krishnan Parthasarathi <kparthas>
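
With this change, volume-level commands no longer fail merely because snapd is down. A rough post-fix verification sketch follows (the pkill pattern used to simulate a dead snapd is an assumption and may need adjusting to match how snapd appears in the process list):

# Simulate the crashed-snapd condition by killing the snapshot daemon for the
# volume. The process match pattern below is an assumption, not the canonical
# way to identify snapd.
pkill -f 'volfile-id snapd/testvol'

# With the fix, glusterd treats "snapd not running" as success while handling
# the snapd/USS option, so stop and delete go through on all nodes.
gluster volume stop testvol
gluster snapshot delete snap1
gluster snapshot delete snap2
gluster volume delete testvol

# Every peer should now agree that the volume no longer exists.
gluster volume info testvol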

Comment 4 Niels de Vos 2015-05-14 17:28:36 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


