Bug 1209484

Summary: Unable to stop/start a volume
Product: [Community] GlusterFS Reporter: Sweta Anandpara <sanandpa>
Component: tiering    Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: mainline    CC: amukherj, bugs, nchilaka, rkavunga, sankarshan, smohan
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: TIERING
Fixed In Version: glusterfs-4.1.4 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1224111 (view as bug list) Environment:
Last Closed: 2018-10-08 09:52:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On:    
Bug Blocks: 1186580, 1224111, 1260923    

Description Sweta Anandpara 2015-04-07 13:10:04 UTC
Description of problem:

Had a 2x2 setup with three volumes: one was tiered, one had snapshots enabled, and the other was being used for backup testing.

The volume in question, 'pluto', had snapshots enabled and had undergone snapshot restores a couple of times; the build was also updated to one of the newer 3.7 nightlies. Volume status shows the volume as stopped, but I have been unable to delete it, since the delete errors out saying the volume is still running.

Version-Release number of selected component (if applicable):

GlusterFS 3.7 nightly: glusterfs-3.7dev-0.910.git17827de.el6.x86_64

How reproducible: 1:1

Additional info:

This is what is seen in the logs: 

[2015-04-07 10:21:00.619911] I [run.c:190:runner_log] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f4360b3b5e0] (--> /usr/lib64/libglusterfs.so.0(runner_log+0x105)[0x7f4360b8ad05] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x5a0)[0x7f43568cd770] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(+0x562f5)[0x7f43568532f5] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_commit_perform+0x5a)[0x7f435685643a] ))))) 0-management: Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh --volname=pluto --last=no
[2015-04-07 10:21:00.631633] E [run.c:190:runner_log] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f4360b3b5e0] (--> /usr/lib64/libglusterfs.so.0(runner_log+0x105)[0x7f4360b8ad05] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x444)[0x7f43568cd614] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(+0x562f5)[0x7f43568532f5] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_commit_perform+0x5a)[0x7f435685643a] ))))) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh --volname=pluto --last=no
[2015-04-07 10:21:00.631827] I [glusterd-utils.c:1367:glusterd_service_stop] 0-management: brick already stopped
[2015-04-07 10:21:00.632307] I [glusterd-utils.c:1367:glusterd_service_stop] 0-management: brick already stopped
[2015-04-07 10:21:00.657141] E [glusterd-volume-ops.c:2398:glusterd_stop_volume] 0-management: Failed to notify graph change for pluto volume
[2015-04-07 10:21:00.657546] E [glusterd-volume-ops.c:2433:glusterd_op_stop_volume] 0-management: Failed to stop pluto volume
[2015-04-07 10:21:00.657572] E [glusterd-syncop.c:1355:gd_commit_op_phase] 0-management: Commit of operation 'Volume Stop' failed on localhost
[2015-04-07 10:21:00.657819] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick2/pluto/dd on port 49169
[2015-04-07 10:21:00.660557] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick1/pluto/dd on port 49168



[root@dhcp43-48 ~]# gluster v i
 
Volume Name: nash
Type: Distributed-Replicate
Volume ID: cd66179e-6fda-49cf-b40f-be930bc01f6f
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/rhs/brick1/dd
Brick2: 10.70.42.147:/rhs/brick1/dd
Brick3: 10.70.43.48:/rhs/brick2/dd
Brick4: 10.70.42.147:/rhs/brick2/dd
Options Reconfigured:
changelog.changelog: on
storage.build-pgfid: on
 
Volume Name: ozone
Type: Tier
Volume ID: 4611c8ba-4f32-409c-8858-81d55d2acc75
Status: Started
Number of Bricks: 6 x 1 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.147:/rhs/thinbrick1/ozone/hdd
Brick2: 10.70.43.48:/rhs/thinbrick1/ozone/hdd
Brick3: 10.70.43.48:/rhs/thinbrick1/ozone/dd
Brick4: 10.70.43.48:/rhs/thinbrick2/ozone/dd
Brick5: 10.70.42.147:/rhs/thinbrick1/ozone/dd
Brick6: 10.70.42.147:/rhs/thinbrick2/ozone/dd
Options Reconfigured:
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
storage.build-pgfid: on
 
Volume Name: pluto
Type: Distribute
Volume ID: 5656ab65-c1da-44a8-9ff0-46d08c9a8c61
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick1/pluto/dd
Brick2: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick2/pluto/dd
Brick3: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick3/pluto/dd
Brick4: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick4/pluto/dd
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# cd /rhs/
brick1/     brick2/     ozone/      thinbrick1/ thinbrick2/ 
[root@dhcp43-48 ~]#
[root@dhcp43-48 ~]# gluster snapshot list
snap_5
[root@dhcp43-48 ~]#
[root@dhcp43-48 ~]# gluster snapshot delete pluto
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: failed: Snapshot (pluto) does not exist
Snapshot command failed
[root@dhcp43-48 ~]# gluster snapshot delete volume  pluto
Volume (pluto) contains 1 snapshot(s).
Do you still want to continue and delete them?  (y/n) y
snapshot delete: snap_5: snap removed successfully
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# gluster v stop pluto
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: pluto: failed: Volume pluto is not in the started state
[root@dhcp43-48 ~]# gluster v delete pluto
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: pluto: failed: Staging failed on 10.70.42.147. Error: Volume pluto has been started.Volume needs to be stopped before deletion.
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# gluster v i
 
Volume Name: nash
Type: Distributed-Replicate
Volume ID: cd66179e-6fda-49cf-b40f-be930bc01f6f
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/rhs/brick1/dd
Brick2: 10.70.42.147:/rhs/brick1/dd
Brick3: 10.70.43.48:/rhs/brick2/dd
Brick4: 10.70.42.147:/rhs/brick2/dd
Options Reconfigured:
changelog.changelog: on
storage.build-pgfid: on
 
Volume Name: ozone
Type: Tier
Volume ID: 4611c8ba-4f32-409c-8858-81d55d2acc75
Status: Started
Number of Bricks: 6 x 1 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.147:/rhs/thinbrick1/ozone/hdd
Brick2: 10.70.43.48:/rhs/thinbrick1/ozone/hdd
Brick3: 10.70.43.48:/rhs/thinbrick1/ozone/dd
Brick4: 10.70.43.48:/rhs/thinbrick2/ozone/dd
Brick5: 10.70.42.147:/rhs/thinbrick1/ozone/dd
Brick6: 10.70.42.147:/rhs/thinbrick2/ozone/dd
Options Reconfigured:
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
storage.build-pgfid: on
 
Volume Name: pluto
Type: Distribute
Volume ID: 5656ab65-c1da-44a8-9ff0-46d08c9a8c61
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick1/pluto/dd
Brick2: 10.70.43.48:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick2/pluto/dd
Brick3: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick3/pluto/dd
Brick4: 10.70.42.147:/var/run/gluster/snaps/9a209867fb0b4b3f86f49494a6cfc191/brick4/pluto/dd
[root@dhcp43-48 ~]#
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# gluster v stop pluto
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: pluto: failed: Volume pluto is not in the started state
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# gluster v start pluto
volume start: pluto: failed: Staging failed on 10.70.42.147. Error: Volume pluto already started
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# gluster v stop pluto
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: pluto: failed: Volume pluto is not in the started state
[root@dhcp43-48 ~]# 
[root@dhcp43-48 ~]# gluster v status pluto
Volume pluto is not started
[root@dhcp43-48 ~]#

Comment 1 Sweta Anandpara 2015-04-07 13:39:35 UTC
Sosreports and glusterd statedumps copied to: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1209484/

Comment 3 Atin Mukherjee 2015-04-09 11:47:48 UTC
Here is what I found out from the setup:

Stopping any of the volumes on this setup fails locally because glusterd cannot regenerate the NFS volfile while stopping/restarting the daemons: ozone's volinfo.volname is corrupted and shows up as:

(gdb) p voliter.volname
$15 = "ozone\000cold", '\000' <repeats 245 times>

Assigning it to the tiering team for further investigation.
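
For illustration only (this is an assumption about the mechanism, not something the gdb output above proves): a fixed-size name buffer that once held a longer string such as "ozone-cold" and is later overwritten with a plain strcpy() of "ozone" ends up looking exactly like that, because strcpy() stops at its own terminating NUL and leaves the old tail bytes in place. Minimal self-contained sketch (hypothetical buffer handling, not the actual glusterd code):

#include <stdio.h>
#include <string.h>

#define NAME_MAX_LEN 256   /* stand-in for the real volname array size */

int main(void)
{
    char volname[NAME_MAX_LEN];

    /* Suppose the buffer temporarily held a tier subvolume name. */
    snprintf(volname, sizeof(volname), "%s-%s", "ozone", "cold");

    /* Later the plain volume name is copied back with strcpy(); the bytes
     * after the new terminating NUL ("cold") are left behind. */
    strcpy(volname, "ozone");

    /* String functions still see just "ozone" ... */
    printf("strlen = %zu, strcmp(\"ozone\") = %d\n",
           strlen(volname), strcmp(volname, "ozone"));

    /* ... but a raw dump of the array shows the stale tail, which is what
     * gdb's 'p voliter.volname' displays as "ozone\000cold". */
    for (int i = 0; i < 11; i++)
        printf("%c", volname[i] ? volname[i] : '.');
    printf("\n");
    return 0;
}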

Comment 4 Mohammed Rafi KC 2015-05-07 09:03:35 UTC
RCA: 

Problem 1:

While generating the client volfile for a tiered volume, the graph builder expects cold_dist_leaf_count to be greater than 0, but cold_dist_leaf_count was not being calculated correctly.

Patch: http://review.gluster.org/10108
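
A hedged sketch of what that expectation means in practice (the struct and field handling below are simplified stand-ins, not the actual glusterd volgen code): the graph builder needs the number of leaves per cold distribute subvolume to be greater than 0 so it can work out how many cold subvolumes to emit; if the tier fields were never populated on this code path, the derived count is 0 and graph generation has nothing sensible to do.

#include <stdio.h>

/* Illustrative sketch only -- simplified stand-ins for the tier fields in
 * glusterd's volinfo; the real fix is in the patch referenced above. */
struct tier_sketch {
    int cold_brick_count;
    int cold_replica_count;   /* 0 here models "never filled in" */
    int cold_dist_leaf_count; /* leaves per cold distribute subvolume */
};

int main(void)
{
    struct tier_sketch t = { .cold_brick_count = 4 };

    /* Naive path: the replica count was never populated, so the derived
     * leaf count ends up 0 ... */
    t.cold_dist_leaf_count = t.cold_replica_count;

    /* ... and the graph builder, which expects a value larger than 0,
     * cannot compute the number of cold subvolumes. */
    if (t.cold_dist_leaf_count <= 0) {
        fprintf(stderr, "cold_dist_leaf_count is %d, cannot build graph\n",
                t.cold_dist_leaf_count);

        /* What "calculating properly" amounts to conceptually: derive the
         * count from the cold tier's layout and never let it drop below 1
         * for a plain distribute cold tier. */
        t.cold_dist_leaf_count =
            t.cold_replica_count > 0 ? t.cold_replica_count : 1;
    }

    printf("cold subvolumes = %d\n",
           t.cold_brick_count / t.cold_dist_leaf_count);
    return 0;
}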

Problem 2:

The snapd service was not being initialized in the volinfo update path.

Patch: http://review.gluster.org/10304
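
A hedged sketch of the failure shape (the types and names below are illustrative, not glusterd's actual service-management code): a per-volume service handle such as snapd carries state and function pointers that must be set up whenever a volinfo object is created or rebuilt from a peer update; if the update path skips that step, a later call through the handle dereferences an uninitialized pointer.

#include <stdio.h>
#include <string.h>

/* Illustrative, self-contained sketch of a per-volume service handle. */
struct svc_sketch {
    char name[80];
    int (*manage)(struct svc_sketch *svc, int flags);  /* NULL until init */
};

struct volinfo_sketch {
    char volname[64];
    struct svc_sketch snapd;
};

static int snapd_manage(struct svc_sketch *svc, int flags)
{
    printf("managing %s (flags=%d)\n", svc->name, flags);
    return 0;
}

/* What the fix amounts to conceptually: initialize the snapd handle on the
 * volinfo update path as well, not only when the volume is first created. */
static void snapd_init(struct volinfo_sketch *vol)
{
    snprintf(vol->snapd.name, sizeof(vol->snapd.name), "%s-snapd", vol->volname);
    vol->snapd.manage = snapd_manage;
}

int main(void)
{
    struct volinfo_sketch vol;
    memset(&vol, 0, sizeof(vol));
    snprintf(vol.volname, sizeof(vol.volname), "pluto");

    snapd_init(&vol);   /* skipping this leaves vol.snapd.manage == NULL,
                           and the call below would crash */
    return vol.snapd.manage(&vol.snapd, 0);
}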

Comment 5 Niels de Vos 2015-05-15 13:07:33 UTC
This change should not be in ON_QA; the patch posted for this bug is only available on the master branch and is not yet in a release. Moving back to MODIFIED until there is a beta release for the next GlusterFS version.

Comment 6 Amar Tumballi 2018-10-08 09:52:42 UTC
This bug was in ON_QA status, which is not a valid status for the GlusterFS product in Bugzilla. We are closing it as CURRENTRELEASE to indicate the availability of the fix; please reopen if the issue is seen again.