+++ This bug was initially created as a clone of Bug #1168607 +++

Description of problem:
When snapd has crashed, the gluster volume stop/delete operation fails, which leaves the cluster in an inconsistent state.

Version-Release number of selected component (if applicable):

[root@dhcp42-244 core]# rpm -qa | grep gluster
gluster-nagios-common-0.1.3-2.el6rhs.noarch
samba-glusterfs-3.6.509-169.1.el6rhs.x86_64
glusterfs-libs-3.6.0.34-1.el6rhs.x86_64
glusterfs-server-3.6.0.34-1.el6rhs.x86_64
glusterfs-cli-3.6.0.34-1.el6rhs.x86_64
glusterfs-3.6.0.34-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.34-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.34-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.34-1.el6rhs.x86_64
vdsm-gluster-4.14.7.2-1.el6rhs.noarch
gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.34-1.el6rhs.x86_64
glusterfs-api-3.6.0.34-1.el6rhs.x86_64
[root@dhcp42-244 core]#

How reproducible:
Always

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and start it.
2. Mount the volume from a client (FUSE).
3. Take 2 snapshots (snap1 and snap2) and enable USS.
4. Send lookups continuously on .snaps; snapd crashed unexpectedly (hit BZ 1168497).
5. Try to stop the volume; it fails with the error "Commit failed on localhost. Please check the log file for more details."

Please see the "Additional info" section for more details.

Actual results:
The volume is in "Stopped" state on one node while it is in "Started" state on another node, which leaves the cluster in an inconsistent state.

Expected results:
The volume should be stopped/deleted successfully.

Additional info:

Node1:
=====
[root@dhcp42-244 core]# gluster peer status
Number of Peers: 3

Hostname: 10.70.43.6
Uuid: 2c0d5fe8-a014-4978-ace7-c663e4cc8d91
State: Peer in Cluster (Connected)

Hostname: 10.70.42.204
Uuid: 2a2a1b36-37e3-4336-b82a-b09dcc2f745e
State: Peer in Cluster (Connected)

Hostname: 10.70.42.10
Uuid: 77c49bfc-6cb4-44f3-be12-41447a3a452e
State: Peer in Cluster (Connected)
[root@dhcp42-244 core]#

[root@dhcp42-244 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
features.uss: on
features.barrier: disable
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.barrier: disable
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
[root@dhcp42-244 ~]#

[root@dhcp42-244 core]# gluster volume stop testvol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: testvol: failed: Commit failed on localhost. Please check the log file for more details.
[root@dhcp42-244 core]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: failed: Cannot delete Volume testvol ,as it has 2 snapshots. To delete the volume, first delete all the snapshots under it.

[root@dhcp42-244 core]# gluster snapshot list testvol
snap1
snap2

[root@dhcp42-244 core]# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap1: snap removed successfully

[root@dhcp42-244 core]# gluster snapshot delete snap2
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap2: snap removed successfully

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
[root@dhcp42-244 core]#

[root@dhcp42-244 core]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume needs to be stopped before deletion.
[root@dhcp42-244 core]#

Node2:
=====
[root@dhcp43-6 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started   ---> "gluster volume info testvol" on Node2 shows the volume in Started state while Node1 shows it in Stopped state, leaving the cluster in an inconsistent state
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: off
features.barrier: disable
features.uss: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Client Logs:
============
[root@dhcp43-190 fusemnt]# cd .snaps
-bash: cd: .snaps: Transport endpoint is not connected
[root@dhcp43-190 fusemnt]#
REVIEW: http://review.gluster.org/9206 (glusterd/uss: if snapd is not running, return success from glusterd_handle_snapd_option) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9206 (glusterd/uss: if snapd is not running, return success from glusterd_handle_snapd_option) posted (#2) for review on master by Atin Mukherjee (amukherj)
COMMIT: http://review.gluster.org/9206 committed in master by Krishnan Parthasarathi (kparthas)
------
commit 92242ecd1047fe23ca494555edd6033685522c82
Author: Atin Mukherjee <amukherj>
Date:   Fri Nov 28 10:46:20 2014 +0530

    glusterd/uss: if snapd is not running, return success from glusterd_handle_snapd_option

    glusterd_handle_snapd_option was returning failure if snapd is not running
    because of which gluster commands were failing.

    Change-Id: I22286f4ecf28b57dfb6fb8ceb52ca8bdc66aec5d
    BUG: 1168803
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/9206
    Reviewed-by: Kaushal M <kaushal>
    Reviewed-by: Avra Sengupta <asengupt>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijaikumar Mallikarjuna <vmallika>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Krishnan Parthasarathi <kparthas>
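For context, the behavioural change described in the commit message can be sketched as follows. This is a minimal, self-contained C illustration, not the actual glusterd source from review 9206; the names snapd_running, stop_snapd and the simplified return-code handling are assumptions made for the sketch.

/* Illustrative sketch only -- not the real glusterd_handle_snapd_option.
 * It models the commit message: treating "snapd is not running" as a
 * failure made the commit phase of volume stop/delete fail locally,
 * leaving the cluster inconsistent; the fix treats it as success. */
#include <stdio.h>

/* Hypothetical stand-ins for glusterd's snapd management helpers. */
static int snapd_running = 0;               /* snapd has crashed */
static int stop_snapd(void) { return 0; }   /* pretend stop succeeds */

/* Old behaviour: a missing snapd process is reported as an error,
 * so "gluster volume stop" fails on this node while other nodes commit. */
static int handle_snapd_option_old(void)
{
        if (!snapd_running)
                return -1;
        return stop_snapd();
}

/* Fixed behaviour: nothing to stop is not an error; return success so
 * the volume operation completes consistently on all nodes. */
static int handle_snapd_option_fixed(void)
{
        if (!snapd_running)
                return 0;
        return stop_snapd();
}

int main(void)
{
        printf("old behaviour returns %d, fixed behaviour returns %d\n",
               handle_snapd_option_old(), handle_snapd_option_fixed());
        return 0;
}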
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user