The defrag variable is not reinitialized during glusterd restart. As a result, if glusterd goes down or is restarted while the following processes are running, it does not reconnect to them after coming back up:
- rebalance
- tier
- remove-brick
This results in these processes continuing to run without communicating with glusterd. Therefore, any operation that requires communication between these processes and glusterd fails.
To work around this issue, stop or kill the rebalance, tier, or remove-brick process before restarting glusterd. This ensures that a new process is spawned when glusterd restarts.
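The workaround above can be sketched as a short shell sequence. This is a hedged sketch, not an exact procedure from the report: `nagvol` is the volume name taken from the reporter's setup, the commands are only printed (dry-run) rather than executed, and `systemctl` assumes a systemd-based host; adapt the volume name and service manager to your environment.

```shell
#!/bin/sh
# Workaround sketch: stop the data-migration task before restarting glusterd,
# so a fresh process (with a live glusterd connection) is spawned afterwards.
# Dry-run: the commands are echoed instead of executed.
VOL=nagvol   # volume name from this report; substitute your own

# Stop the running rebalance task (use the matching "stop" for a
# remove-brick or tier task instead, as appropriate):
echo "gluster volume rebalance $VOL stop"

# Only after the task has stopped, restart glusterd:
echo "systemctl restart glusterd"

# Then restart the task so the new process registers with glusterd:
echo "gluster volume rebalance $VOL start"
```

Restarting the task afterwards spawns a new rebalance process that glusterd can communicate with, which is what the workaround relies on.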
Description (Nag Pavan Chilakam, 2016-01-29 07:45:40 UTC)
On my 16-node setup, after about a day, the rebalance status showed the elapsed time reset to zero on 3 nodes. After another 4-5 hours, the timers stopped ticking on all nodes except one, which kept ticking.
As a result, the promote/demote and scanned-files stats stopped being updated.
[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2 0Bytes 35287 0 0 in progress 29986.00
10.70.37.195 0 0Bytes 35281 0 0 in progress 29986.00
10.70.35.155 0 0Bytes 35003 0 0 in progress 29986.00
10.70.35.222 0 0Bytes 35002 0 0 in progress 29986.00
10.70.35.108 0 0Bytes 0 0 0 in progress 29985.00
10.70.35.44 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.89 0 0Bytes 0 0 0 in progress 146477.00
10.70.35.231 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.176 0 0Bytes 35487 0 0 in progress 29986.00
10.70.35.232 0 0Bytes 0 0 0 in progress 0.00
10.70.35.173 0 0Bytes 0 0 0 in progress 0.00
10.70.35.163 0 0Bytes 35314 0 0 in progress 29986.00
10.70.37.101 0 0Bytes 0 0 0 in progress 0.00
10.70.37.69 0 0Bytes 35385 0 0 in progress 29986.00
10.70.37.60 0 0Bytes 35255 0 0 in progress 29986.00
10.70.37.120 0 0Bytes 35250 0 0 in progress 29986.00
volume rebalance: nagvol: success
[root@dhcp37-202 ~]#
[root@dhcp37-202 ~]#
[root@dhcp37-202 ~]# gluster v rebal nagvol status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2 0Bytes 35287 0 0 in progress 29986.00
10.70.37.195 0 0Bytes 35281 0 0 in progress 29986.00
10.70.35.155 0 0Bytes 35003 0 0 in progress 29986.00
10.70.35.222 0 0Bytes 35002 0 0 in progress 29986.00
10.70.35.108 0 0Bytes 0 0 0 in progress 29985.00
10.70.35.44 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.89 0 0Bytes 0 0 0 in progress 146488.00
10.70.35.231 0 0Bytes 0 0 0 in progress 29986.00
10.70.35.176 0 0Bytes 35487 0 0 in progress 29986.00
10.70.35.232 0 0Bytes 0 0 0 in progress 0.00
10.70.35.173 0 0Bytes 0 0 0 in progress 0.00
10.70.35.163 0 0Bytes 35314 0 0 in progress 29986.00
10.70.37.101 0 0Bytes 0 0 0 in progress 0.00
10.70.37.69 0 0Bytes 35385 0 0 in progress 29986.00
10.70.37.60 0 0Bytes 35255 0 0 in progress 29986.00
10.70.37.120 0 0Bytes 35250 0 0 in progress 29986.00
Also, the tier status shows as below:
[root@dhcp37-202 ~]# gluster v tier nagvol status
Node Promoted files Demoted files Status
--------- --------- --------- ---------
localhost 0 0 in progress
10.70.37.195 0 0 in progress
10.70.35.155 0 0 in progress
10.70.35.222 0 0 in progress
10.70.35.108 0 0 in progress
10.70.35.44 0 0 in progress
10.70.35.89 0 0 in progress
10.70.35.231 0 0 in progress
10.70.35.176 0 0 in progress
10.70.35.232 0 0 in progress
10.70.35.173 0 0 in progress
10.70.35.163 0 0 in progress
10.70.37.101 0 0 in progress
10.70.37.69 0 0 in progress
10.70.37.60 0 0 in progress
10.70.37.120 0 0 in progress
Tiering Migration Functionality: nagvol: success
-> I was running some I/O, but nothing very heavy
-> An NFS problem was also reported: music files stopped playing with a permission-denied error
-> I saw file promotions happening
-> glusterd was restarted on only one of the nodes in the last 2 days
glusterfs-client-xlators-3.7.5-17.el7rhgs.x86_64
glusterfs-server-3.7.5-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.5-17.el7rhgs.x86_64
glusterfs-api-3.7.5-17.el7rhgs.x86_64
glusterfs-cli-3.7.5-17.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-17.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-17.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
python-gluster-3.7.5-16.el7rhgs.noarch
glusterfs-libs-3.7.5-17.el7rhgs.x86_64
glusterfs-fuse-3.7.5-17.el7rhgs.x86_64
glusterfs-rdma-3.7.5-17.el7rhgs.x86_64
sosreports will be attached
RCA
After a glusterd restart, the connection between the rebalance process and glusterd is not re-established. This is a day-1 issue, and it equally affects the remove-brick process.
The impact is more severe for tiering: if a tier pause is issued after a glusterd restart, glusterd cannot talk to the rebalance process, yet the pause is marked as successful.