On my 16-node setup, after about a day, 3 nodes in the rebalance status showed the elapsed time reset to zero; then, after another 4-5 hours, the run-time counters on all nodes stopped ticking except on one node (10.70.35.89), which is still ticking. As a result, the promote/demote and scanned-files stats have stopped getting updated.

[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node             Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------        ----------------   ------   -------   --------   -------   -----------   ----------------
localhost        2                  0Bytes   35287     0          0         in progress   29986.00
10.70.37.195     0                  0Bytes   35281     0          0         in progress   29986.00
10.70.35.155     0                  0Bytes   35003     0          0         in progress   29986.00
10.70.35.222     0                  0Bytes   35002     0          0         in progress   29986.00
10.70.35.108     0                  0Bytes   0         0          0         in progress   29985.00
10.70.35.44      0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.89      0                  0Bytes   0         0          0         in progress   146477.00
10.70.35.231     0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.176     0                  0Bytes   35487     0          0         in progress   29986.00
10.70.35.232     0                  0Bytes   0         0          0         in progress   0.00
10.70.35.173     0                  0Bytes   0         0          0         in progress   0.00
10.70.35.163     0                  0Bytes   35314     0          0         in progress   29986.00
10.70.37.101     0                  0Bytes   0         0          0         in progress   0.00
10.70.37.69      0                  0Bytes   35385     0          0         in progress   29986.00
10.70.37.60      0                  0Bytes   35255     0          0         in progress   29986.00
10.70.37.120     0                  0Bytes   35250     0          0         in progress   29986.00
volume rebalance: nagvol: success

A later run of the same command is identical except that only 10.70.35.89's run time has advanced (146477.00 -> 146488.00):

[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node             Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------        ----------------   ------   -------   --------   -------   -----------   ----------------
localhost        2                  0Bytes   35287     0          0         in progress   29986.00
10.70.37.195     0                  0Bytes   35281     0          0         in progress   29986.00
10.70.35.155     0                  0Bytes   35003     0          0         in progress   29986.00
10.70.35.222     0                  0Bytes   35002     0          0         in progress   29986.00
10.70.35.108     0                  0Bytes   0         0          0         in progress   29985.00
10.70.35.44      0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.89      0                  0Bytes   0         0          0         in progress   146488.00
10.70.35.231     0                  0Bytes   0         0          0         in progress   29986.00
10.70.35.176     0                  0Bytes   35487     0          0         in progress   29986.00
10.70.35.232     0                  0Bytes   0         0          0         in progress   0.00
10.70.35.173     0                  0Bytes   0         0          0         in progress   0.00
10.70.35.163     0                  0Bytes   35314     0          0         in progress   29986.00
10.70.37.101     0                  0Bytes   0         0          0         in progress   0.00
10.70.37.69      0                  0Bytes   35385     0          0         in progress   29986.00
10.70.37.60      0                  0Bytes   35255     0          0         in progress   29986.00
10.70.37.120     0                  0Bytes   35250     0          0         in progress   29986.00

Also, the tier status shows as below:

[root@dhcp37-202 ~]# gluster v tier nagvol status
Node             Promoted files   Demoted files   Status
---------        --------------   -------------   -----------
localhost        0                0               in progress
10.70.37.195     0                0               in progress
10.70.35.155     0                0               in progress
10.70.35.222     0                0               in progress
10.70.35.108     0                0               in progress
10.70.35.44      0                0               in progress
10.70.35.89      0                0               in progress
10.70.35.231     0                0               in progress
10.70.35.176     0                0               in progress
10.70.35.232     0                0               in progress
10.70.35.173     0                0               in progress
10.70.35.163     0                0               in progress
10.70.37.101     0                0               in progress
10.70.37.69      0                0               in progress
10.70.37.60      0                0               in progress
10.70.37.120     0                0               in progress
Tiering Migration Functionality: nagvol: success

-> I was running some I/O, but nothing very heavy.
-> An NFS problem was also reported regarding music files: playback stopped with "permission denied".
-> I saw file promotes happening.
-> glusterd was restarted on only one of the nodes in the last 2 days.

Installed packages:
glusterfs-client-xlators-3.7.5-17.el7rhgs.x86_64
glusterfs-server-3.7.5-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.5-17.el7rhgs.x86_64
glusterfs-api-3.7.5-17.el7rhgs.x86_64
glusterfs-cli-3.7.5-17.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-17.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-17.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
python-gluster-3.7.5-16.el7rhgs.noarch
glusterfs-libs-3.7.5-17.el7rhgs.x86_64
glusterfs-fuse-3.7.5-17.el7rhgs.x86_64
glusterfs-rdma-3.7.5-17.el7rhgs.x86_64

sosreports will be attached.
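For reference, a simple way to confirm which run-time counters are still ticking is to poll the status command and highlight changes between runs (a minimal sketch; the 60-second interval is arbitrary):

# Re-run the rebalance status every 60 seconds; -d highlights what changed
# since the previous run. On this setup only 10.70.35.89's
# "run time in secs" column keeps advancing.
watch -d -n 60 'gluster v rebal nagvol status'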
RCA: After a glusterd restart, the connection between the rebalance process and glusterd was not re-established. This is a day-1 issue, and it also applies to the rebalance and remove-brick processes. For tiering the impact is more severe: if a tier pause is called after the glusterd restart, glusterd won't be able to talk to the rebalance process, yet the tier pause will be marked as successful.
Upstream patch: http://review.gluster.org/#/c/13319/
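To illustrate the scenario described in the RCA, a minimal sketch using the nagvol volume from this report (the node chosen for the restart is arbitrary; per the RCA, a tier pause issued in this state would also be reported as successful even though glusterd can no longer reach the rebalance process):

# On any one node of the trusted pool:
systemctl restart glusterd       # the rebalance/tier daemon keeps running,
                                 # but glusterd does not re-establish its
                                 # connection to it after the restart
gluster v rebal nagvol status    # this node's stats stop updating in the
                                 # status output from here on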
Can you please verify the doc text?
Looks good to me.
This bug was accidentally moved from POST to MODIFIED by an error in automation; please see mmccune with any questions.
As per comment 14, moving it to ON_QA.