Description of problem:
Had a 2x2 setup with three volumes: one tiered, one with snapshots enabled, and one used for backup testing. The volume in question, 'pluto', had snapshots enabled and had undergone snapshot restores a couple of times; the build had also been updated to one of the newer 3.7 nightlies. Volume status shows the volume as stopped, but I have been unable to delete it: the delete errors out saying the volume is still running.

Version-Release number of selected component (if applicable):
GlusterFS 3.7 nightly: glusterfs-3.7dev-0.910.git17827de.el6.x86_64

How reproducible:
1:1

Additional info:
This is what is seen in the logs:

[2015-04-07 10:21:00.619911] I [run.c:190:runner_log] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f4360b3b5e0] (--> /usr/lib64/libglusterfs.so.0(runner_log+0x105)[0x7f4360b8ad05] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x5a0)[0x7f43568cd770] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(+0x562f5)[0x7f43568532f5] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_commit_perform+0x5a)[0x7f435685643a] ))))) 0-management: Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh --volname=pluto --last=no
[2015-04-07 10:21:00.631633] E [run.c:190:runner_log] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f4360b3b5e0] (--> /usr/lib64/libglusterfs.so.0(runner_log+0x105)[0x7f4360b8ad05] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x444)[0x7f43568cd614] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(+0x562f5)[0x7f43568532f5] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_commit_perform+0x5a)[0x7f435685643a] ))))) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh --volname=pluto --last=no
[2015-04-07 10:21:00.631827] I [glusterd-utils.c:1367:glusterd_service_stop] 0-management: brick already stopped
[2015-04-07 10:21:00.632307] I [glusterd-utils.c:1367:glusterd_service_stop] 0-management: brick already stopped
[2015-04-07 10:21:00.657141] E [glusterd-volume-ops.c:2398:glusterd_stop_volume] 0-management: Failed to notify graph change for pluto volume
[2015-04-07 10:21:00.657546] E [glusterd-volume-ops.c:2433:glusterd_op_stop_volume] 0-management: Failed to stop pluto volume
[2015-04-07 10:21:00.657572] E [glusterd-syncop.c:1355:gd_commit_op_phase] 0-management: Commit of operation 'Volume Stop' failed on localhost
[2015-04-07 10:21:00.657819] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick2/pluto/dd on port 49169
[2015-04-07 10:21:00.660557] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick1/pluto/dd on port 49168

[root@dhcp43-48 ~]# gluster v i

Volume Name: nash
Type: Distributed-Replicate
Volume ID: cd66179e-6fda-49cf-b40f-be930bc01f6f
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/rhs/brick1/dd
Brick2: 10.70.42.147:/rhs/brick1/dd
Brick3: 10.70.43.48:/rhs/brick2/dd
Brick4: 10.70.42.147:/rhs/brick2/dd
Options Reconfigured:
changelog.changelog: on
storage.build-pgfid: on

Volume Name: ozone
Type: Tier
Volume ID: 4611c8ba-4f32-409c-8858-81d55d2acc75
Status: Started
Number of Bricks: 6 x 1 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.147:/rhs/thinbrick1/ozone/hdd
Brick2: 10.70.43.48:/rhs/thinbrick1/ozone/hdd
Brick3: 10.70.43.48:/rhs/thinbrick1/ozone/dd
Brick4: 10.70.43.48:/rhs/thinbrick2/ozone/dd
Brick5: 10.70.42.147:/rhs/thinbrick1/ozone/dd
Brick6: 10.70.42.147:/rhs/thinbrick2/ozone/dd
Options Reconfigured:
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
storage.build-pgfid: on

Volume Name: pluto
Type: Distribute
Volume ID: 5656ab65-c1da-44a8-9ff0-46d08c9a8c61
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick1/pluto/dd
Brick2: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick2/pluto/dd
Brick3: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick3/pluto/dd
Brick4: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick4/pluto/dd

[root@dhcp43-48 ~]# cd /rhs/
brick1/  brick2/  ozone/  thinbrick1/  thinbrick2/

[root@dhcp43-48 ~]# gluster snapshot list
snap_5

[root@dhcp43-48 ~]# gluster snapshot delete pluto
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: failed: Snapshot (pluto) does not exist
Snapshot command failed

[root@dhcp43-48 ~]# gluster snapshot delete volume pluto
Volume (pluto) contains 1 snapshot(s). Do you still want to continue and delete them? (y/n) y
snapshot delete: snap_5: snap removed successfully

[root@dhcp43-48 ~]# gluster v stop pluto
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: pluto: failed: Volume pluto is not in the started state

[root@dhcp43-48 ~]# gluster v delete pluto
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: pluto: failed: Staging failed on 10.70.42.147. Error: Volume pluto has been started.Volume needs to be stopped before deletion.
[root@dhcp43-48 ~]# gluster v i

Volume Name: nash
Type: Distributed-Replicate
Volume ID: cd66179e-6fda-49cf-b40f-be930bc01f6f
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/rhs/brick1/dd
Brick2: 10.70.42.147:/rhs/brick1/dd
Brick3: 10.70.43.48:/rhs/brick2/dd
Brick4: 10.70.42.147:/rhs/brick2/dd
Options Reconfigured:
changelog.changelog: on
storage.build-pgfid: on

Volume Name: ozone
Type: Tier
Volume ID: 4611c8ba-4f32-409c-8858-81d55d2acc75
Status: Started
Number of Bricks: 6 x 1 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.147:/rhs/thinbrick1/ozone/hdd
Brick2: 10.70.43.48:/rhs/thinbrick1/ozone/hdd
Brick3: 10.70.43.48:/rhs/thinbrick1/ozone/dd
Brick4: 10.70.43.48:/rhs/thinbrick2/ozone/dd
Brick5: 10.70.42.147:/rhs/thinbrick1/ozone/dd
Brick6: 10.70.42.147:/rhs/thinbrick2/ozone/dd
Options Reconfigured:
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
storage.build-pgfid: on

Volume Name: pluto
Type: Distribute
Volume ID: 5656ab65-c1da-44a8-9ff0-46d08c9a8c61
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick1/pluto/dd
Brick2: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick2/pluto/dd
Brick3: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick3/pluto/dd
Brick4: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick4/pluto/dd

[root@dhcp43-48 ~]# gluster v stop pluto
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: pluto: failed: Volume pluto is not in the started state

[root@dhcp43-48 ~]# gluster v start pluto
volume start: pluto: failed: Staging failed on 10.70.42.147. Error: Volume pluto already started

[root@dhcp43-48 ~]# gluster v stop pluto
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: pluto: failed: Volume pluto is not in the started state

[root@dhcp43-48 ~]# gluster v status pluto
Volume pluto is not started
[root@dhcp43-48 ~]#
Sosreports and glusterd statedumps copied to: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1209484/
Here is what I found on the setup: stopping any of the volumes fails locally because glusterd fails to regenerate the NFS volfile while stopping/restarting the daemons, since ozone's volinfo.volname is corrupted and shows as:

(gdb) p voliter.volname
$15 = "ozone\000cold", '\000' <repeats 245 times>

Assigning it to the tiering team for further investigation.
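For anyone puzzling over that gdb output: this is the classic residue left when a shorter string is copied over a longer one in a fixed-size buffer. A minimal standalone C sketch follows; the name "ozone-cold" and the 255-byte buffer size are assumptions (255 is inferred from the 10 bytes shown plus the 245 repeated NULs), not taken from the glusterd sources.

#include <stdio.h>
#include <string.h>

#define VOLNAME_BUF_LEN 255  /* hypothetical; inferred from the gdb padding */

int main(void)
{
    char volname[VOLNAME_BUF_LEN];

    memset(volname, 0, sizeof(volname));
    strcpy(volname, "ozone-cold");   /* earlier, longer value (assumed) */
    strcpy(volname, "ozone");        /* later overwrite; NUL lands at index 5 */

    /* C string functions see only "ozone" ... */
    printf("strlen = %zu, volname = \"%s\"\n", strlen(volname), volname);

    /* ... but raw memory still carries the stale "cold" tail, which is
     * exactly what gdb shows when printing the whole array. */
    for (int i = 0; i < 12; i++)
        printf("%c", volname[i] ? volname[i] : '.');
    printf("\n");                    /* prints: ozone.cold.. */
    return 0;
}

So the buffer most likely held a tier-related name like "ozone-cold" at some point and was later overwritten with "ozone", leaving the "cold" fragment past the terminator.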
RCA:

Problem 1: When generating the client volfile for a tiered volume, the graph builder expects cold_dist_leaf_count to be greater than 0, but the value was not being calculated properly.
Patch: http://review.gluster.org/10108

Problem 2: The snapd svc was not being initialized in the volinfo update path.
Patch: http://review.gluster.org/10304
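To make Problem 1 concrete, here is an illustrative sketch of the invariant involved; the struct and function names are hypothetical, and this is not the code from the patch above. For a distribute-based cold tier, the number of leaves per distribute subvolume is replica_count * stripe_count, which is 1 for plain distribute and must never be 0; if the counts are read from the wrong part of volinfo (or left uninitialized) the "leaf count > 0" expectation in the graph builder is violated and volfile generation fails, as seen here.

#include <assert.h>

struct tier_counts {                 /* hypothetical stand-in for volinfo */
    int cold_replica_count;
    int cold_stripe_count;
};

static int cold_dist_leaf_count(const struct tier_counts *v)
{
    /* Treat an unset count as 1 so plain distribute yields 1 leaf per
     * subvolume; the result is therefore always >= 1. */
    int replica = v->cold_replica_count ? v->cold_replica_count : 1;
    int stripe  = v->cold_stripe_count  ? v->cold_stripe_count  : 1;
    return replica * stripe;
}

int main(void)
{
    struct tier_counts plain = { 0, 0 };   /* plain distribute cold tier */
    struct tier_counts rep2  = { 2, 0 };   /* 2-way replicated cold tier */

    assert(cold_dist_leaf_count(&plain) == 1);
    assert(cold_dist_leaf_count(&rep2) == 2);
    return 0;
}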
This change should not be in "ON_QA"; the patch posted for this bug is only available in the master branch and not in any release yet. Moving back to MODIFIED until there is a beta release of the next GlusterFS version.
This bug was in ON_QA status, which is not a valid status for the GlusterFS product in Bugzilla. We are closing it as CURRENTRELEASE to indicate the availability of the fix; please reopen if the issue is seen again.