Red Hat Bugzilla – Bug 1302968
Tiering status and rebalance status stop getting updated
Last modified: 2018-09-11 23:39:58 EDT
On my 16-node setup, after about a day, 3 nodes in the rebalance status showed the elapsed time reset to zero, and after another 4-5 hours all of the nodes stopped ticking except one, which keeps ticking continually. As a result the promote/demote and scanned-files stats have stopped getting updated (a small polling sketch that flags this stall is included after the package list below).

[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node            Rebalanced-files  size    scanned  failures  skipped  status       run time in secs
---------       ----------------  ------  -------  --------  -------  -----------  ----------------
localhost       2                 0Bytes  35287    0         0        in progress  29986.00
10.70.37.195    0                 0Bytes  35281    0         0        in progress  29986.00
10.70.35.155    0                 0Bytes  35003    0         0        in progress  29986.00
10.70.35.222    0                 0Bytes  35002    0         0        in progress  29986.00
10.70.35.108    0                 0Bytes  0        0         0        in progress  29985.00
10.70.35.44     0                 0Bytes  0        0         0        in progress  29986.00
10.70.35.89     0                 0Bytes  0        0         0        in progress  146477.00
10.70.35.231    0                 0Bytes  0        0         0        in progress  29986.00
10.70.35.176    0                 0Bytes  35487    0         0        in progress  29986.00
10.70.35.232    0                 0Bytes  0        0         0        in progress  0.00
10.70.35.173    0                 0Bytes  0        0         0        in progress  0.00
10.70.35.163    0                 0Bytes  35314    0         0        in progress  29986.00
10.70.37.101    0                 0Bytes  0        0         0        in progress  0.00
10.70.37.69     0                 0Bytes  35385    0         0        in progress  29986.00
10.70.37.60     0                 0Bytes  35255    0         0        in progress  29986.00
10.70.37.120    0                 0Bytes  35250    0         0        in progress  29986.00
volume rebalance: nagvol: success

A later run of the same command shows identical counters; only 10.70.35.89 has advanced:

[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node            Rebalanced-files  size    scanned  failures  skipped  status       run time in secs
---------       ----------------  ------  -------  --------  -------  -----------  ----------------
localhost       2                 0Bytes  35287    0         0        in progress  29986.00
10.70.37.195    0                 0Bytes  35281    0         0        in progress  29986.00
10.70.35.155    0                 0Bytes  35003    0         0        in progress  29986.00
10.70.35.222    0                 0Bytes  35002    0         0        in progress  29986.00
10.70.35.108    0                 0Bytes  0        0         0        in progress  29985.00
10.70.35.44     0                 0Bytes  0        0         0        in progress  29986.00
10.70.35.89     0                 0Bytes  0        0         0        in progress  146488.00
10.70.35.231    0                 0Bytes  0        0         0        in progress  29986.00
10.70.35.176    0                 0Bytes  35487    0         0        in progress  29986.00
10.70.35.232    0                 0Bytes  0        0         0        in progress  0.00
10.70.35.173    0                 0Bytes  0        0         0        in progress  0.00
10.70.35.163    0                 0Bytes  35314    0         0        in progress  29986.00
10.70.37.101    0                 0Bytes  0        0         0        in progress  0.00
10.70.37.69     0                 0Bytes  35385    0         0        in progress  29986.00
10.70.37.60     0                 0Bytes  35255    0         0        in progress  29986.00
10.70.37.120    0                 0Bytes  35250    0         0        in progress  29986.00

Also, the tier status shows as below:

[root@dhcp37-202 ~]# gluster v tier nagvol status
Node            Promoted files  Demoted files  Status
---------       --------------  -------------  -----------
localhost       0               0              in progress
10.70.37.195    0               0              in progress
10.70.35.155    0               0              in progress
10.70.35.222    0               0              in progress
10.70.35.108    0               0              in progress
10.70.35.44     0               0              in progress
10.70.35.89     0               0              in progress
10.70.35.231    0               0              in progress
10.70.35.176    0               0              in progress
10.70.35.232    0               0              in progress
10.70.35.173    0               0              in progress
10.70.35.163    0               0              in progress
10.70.37.101    0               0              in progress
10.70.37.69     0               0              in progress
10.70.37.60     0               0              in progress
10.70.37.120    0               0              in progress
Tiering Migration Functionality: nagvol: success

-> I was running some I/O, but not very heavy.
-> There was also an NFS problem reported with music files, which stopped playing with "permission denied".
-> I saw file promotes happening.
-> glusterd was restarted on only one of the nodes in the last 2 days.

Installed packages:
glusterfs-client-xlators-3.7.5-17.el7rhgs.x86_64
glusterfs-server-3.7.5-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.5-17.el7rhgs.x86_64
glusterfs-api-3.7.5-17.el7rhgs.x86_64
glusterfs-cli-3.7.5-17.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-17.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-17.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
python-gluster-3.7.5-16.el7rhgs.noarch
glusterfs-libs-3.7.5-17.el7rhgs.x86_64
glusterfs-fuse-3.7.5-17.el7rhgs.x86_64
glusterfs-rdma-3.7.5-17.el7rhgs.x86_64

sosreports will be attached.
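A simple way to confirm the stall from the CLI is to poll the rebalance status twice and compare the per-node run time. The sketch below is illustrative only: it assumes the column layout shown in the output above, a gluster binary in PATH, and the volume name nagvol from this report; it is not part of any gluster tooling.

# Illustrative sketch: polls `gluster v rebal <vol> status` twice and reports
# nodes whose "run time in secs" did not advance between polls.
import re
import subprocess
import time

VOLUME = "nagvol"  # volume name taken from this report

def runtimes(volume):
    out = subprocess.check_output(
        ["gluster", "v", "rebal", volume, "status"], text=True)
    times = {}
    for line in out.splitlines():
        # match data rows ending in "... in progress   <seconds>"
        m = re.match(r"^\s*(\S+)\s+.*in progress\s+([\d.]+)\s*$", line)
        if m:
            times[m.group(1)] = float(m.group(2))
    return times

first = runtimes(VOLUME)
time.sleep(60)                 # wait long enough for the counter to tick
second = runtimes(VOLUME)

for node, t0 in first.items():
    if second.get(node, t0) <= t0:
        print(f"{node}: run time stuck at {t0:.2f}s")

With the output above, every node except 10.70.35.89 would be reported as stuck.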
RCA: After the glusterd restart, the connection between the rebalance process and glusterd is not re-established. This is a day-1 issue and also affects the rebalance/remove-brick processes. For tiering the impact is more severe: if a tier pause is issued after the glusterd restart, glusterd cannot talk to the rebalance process, yet the pause is still marked as successful.
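To make that failure mode concrete, here is a small toy model of the behaviour described above. It is purely illustrative; the class and method names are hypothetical and do not correspond to actual glusterd internals or to the upstream patch.

# Toy model of the RCA above; all names are hypothetical.
class RebalanceProcess:
    def __init__(self):
        self.paused = False

class GlusterdModel:
    def __init__(self, rebalance):
        self.rebalance_conn = rebalance   # connection to the rebalance process

    def restart(self):
        # Behaviour described in the RCA: the connection to the still-running
        # rebalance process is not re-established after restart.
        self.rebalance_conn = None

    def tier_pause(self):
        if self.rebalance_conn is None:
            # glusterd cannot reach the rebalance process, but the pause is
            # still reported as successful.
            return "success"
        self.rebalance_conn.paused = True
        return "success"

rebal = RebalanceProcess()
glusterd = GlusterdModel(rebal)
glusterd.restart()
print(glusterd.tier_pause())   # reports "success"
print(rebal.paused)            # False: the process was never actually paused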
Upstream patch: http://review.gluster.org/#/c/13319/
Can you please verify the doc text?
Looks good to me.
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune@redhat.com with any questions.
As per comment 14, moving it to ON_QA