The defrag variable is not reinitialized during glusterd restart. As a result, if glusterd goes down or is restarted while the following processes are running, it does not reconnect to them after coming back up:
- rebalance
- tier
- remove-brick
This results in these processes continuing to run without communicating with glusterd. Therefore, any operation that requires communication between these processes and glusterd fails.
To work around this issue, stop or kill the rebalance, tier, or remove-brick process before restarting glusterd. This ensures that a new process is spawned when glusterd restarts.
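The workaround above can be sketched as a short shell sequence. This is a hedged sketch, not an exact procedure from the report: `nagvol` is the volume name taken from the reporter's setup, the commands are only printed (dry-run) rather than executed, and `systemctl` assumes a systemd-based host; adapt the volume name and service manager to your environment.

```shell
#!/bin/sh
# Workaround sketch: stop the data-migration task before restarting glusterd,
# so a fresh process (with a live glusterd connection) is spawned afterwards.
# Dry-run: the commands are echoed instead of executed.
VOL=nagvol   # volume name from this report; substitute your own

# Stop the running rebalance task (use the matching "stop" for a
# remove-brick or tier task instead, as appropriate):
echo "gluster volume rebalance $VOL stop"

# Only after the task has stopped, restart glusterd:
echo "systemctl restart glusterd"

# Then restart the task so the new process registers with glusterd:
echo "gluster volume rebalance $VOL start"
```

Restarting the task afterwards spawns a new rebalance process that glusterd can communicate with, which is what the workaround relies on.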
Description (Nag Pavan Chilakam, 2016-01-29 07:45:40 UTC)
On my 16-node setup, after about a day, the rebalance status showed the elapsed time reset to zero on 3 nodes. After another 4-5 hours, the timers stopped ticking on all nodes except one, which kept ticking.
As a result, the promote/demote and scanned-files stats stopped being updated.
[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2 0Bytes 35287 0 0 in progress 29986.00
10.70.37.195 0 0Bytes 35281 0 0 in progress 29986.00
10.70.35.155 0 0Bytes 35003 0 0 in progress 29986.00
10.70.35.222 0 0Bytes 35002 0 0 in progress 29986.00
10.70.35.108 0 0Bytes 0 0 0 in progress 29985.00
10.70.35.44 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.89 0 0Bytes 0 0 0 in progress 146477.00
10.70.35.231 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.176 0 0Bytes 35487 0 0 in progress 29986.00
10.70.35.232 0 0Bytes 0 0 0 in progress 0.00
10.70.35.173 0 0Bytes 0 0 0 in progress 0.00
10.70.35.163 0 0Bytes 35314 0 0 in progress 29986.00
10.70.37.101 0 0Bytes 0 0 0 in progress 0.00
10.70.37.69 0 0Bytes 35385 0 0 in progress 29986.00
10.70.37.60 0 0Bytes 35255 0 0 in progress 29986.00
10.70.37.120 0 0Bytes 35250 0 0 in progress 29986.00
volume rebalance: nagvol: success
[root@dhcp37-202 ~]#
[root@dhcp37-202 ~]#
[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2 0Bytes 35287 0 0 in progress 29986.00
10.70.37.195 0 0Bytes 35281 0 0 in progress 29986.00
10.70.35.155 0 0Bytes 35003 0 0 in progress 29986.00
10.70.35.222 0 0Bytes 35002 0 0 in progress 29986.00
10.70.35.108 0 0Bytes 0 0 0 in progress 29985.00
10.70.35.44 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.89 0 0Bytes 0 0 0 in progress 146488.00
10.70.35.231 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.176 0 0Bytes 35487 0 0 in progress 29986.00
10.70.35.232 0 0Bytes 0 0 0 in progress 0.00
10.70.35.173 0 0Bytes 0 0 0 in progress 0.00
10.70.35.163 0 0Bytes 35314 0 0 in progress 29986.00
10.70.37.101 0 0Bytes 0 0 0 in progress 0.00
10.70.37.69 0 0Bytes 35385 0 0 in progress 29986.00
10.70.37.60 0 0Bytes 35255 0 0 in progress 29986.00
10.70.37.120 0 0Bytes 35250 0 0 in progress 29986.00
Also, the tier status shows as below:
[root@dhcp37-202 ~]# gluster v tier nagvol status
Node Promoted files Demoted files Status
--------- --------- --------- ---------
localhost 0 0 in progress
10.70.37.195 0 0 in progress
10.70.35.155 0 0 in progress
10.70.35.222 0 0 in progress
10.70.35.108 0 0 in progress
10.70.35.44 0 0 in progress
10.70.35.89 0 0 in progress
10.70.35.231 0 0 in progress
10.70.35.176 0 0 in progress
10.70.35.232 0 0 in progress
10.70.35.173 0 0 in progress
10.70.35.163 0 0 in progress
10.70.37.101 0 0 in progress
10.70.37.69 0 0 in progress
10.70.37.60 0 0 in progress
10.70.37.120 0 0 in progress
Tiering Migration Functionality: nagvol: success
-> I was running some I/O, but nothing very heavy
-> An NFS problem was also reported: music files stopped playing with a permission-denied error
-> I saw file promotions happening
-> glusterd was restarted on only one of the nodes in the last 2 days
glusterfs-client-xlators-3.7.5-17.el7rhgs.x86_64
glusterfs-server-3.7.5-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.5-17.el7rhgs.x86_64
glusterfs-api-3.7.5-17.el7rhgs.x86_64
glusterfs-cli-3.7.5-17.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-17.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-17.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
python-gluster-3.7.5-16.el7rhgs.noarch
glusterfs-libs-3.7.5-17.el7rhgs.x86_64
glusterfs-fuse-3.7.5-17.el7rhgs.x86_64
glusterfs-rdma-3.7.5-17.el7rhgs.x86_64
sosreports will be attached
RCA
After a glusterd restart, the connection between the rebalance process and glusterd is not re-established. This is a day-1 issue, and it equally affects the remove-brick process.
The impact is more severe for tiering: if a tier pause is issued after a glusterd restart, glusterd cannot talk to the rebalance process, yet the pause is marked as successful.