Description of problem:
Rebalance status "split-brain": the CLI reports contradictory states for the rebalance on gv0.

[root@byg612sv160 ~]# gluster volume rebalance gv0 status
volume rebalance: gv0: failed: Rebalance not started for volume gv0.
[root@byg612sv160 ~]# gluster volume rebalance gv0 stop
volume rebalance: gv0: failed: Rebalance not started for volume gv0.
[root@byg612sv160 ~]# gluster volume rebalance gv0 start
volume rebalance: gv0: failed: Rebalance on gv0 is already started
[root@byg612sv160 ~]# gluster volume rebalance gv0 start force
volume rebalance: gv0: failed: Rebalance on gv0 is already started
[root@byg612sv160 ~]#

Version-Release number of selected component (if applicable):
GlusterFS 3.12

How reproducible:
Not sure. Our Gluster storage reached 90% full, so we added two more nodes. After that we ran "gluster volume rebalance gv0 start" and then stopped it in order to add more storage nodes. GlusterFS now refuses to add more nodes, reporting that a rebalance is in progress. According to the gluster log it is doing a fix-layout rebalance very slowly (it probably won't finish for two weeks), and we can no longer stop the rebalance process.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
(In reply to Qigang from comment #0)
> Version-Release number of selected component (if applicable):
> GlusterFS 3.12

Could you provide the complete details of the rpms used?

rpm -qa | grep gluster

Also the platform used and the gluster volume information.
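For the volume information, the output of something like the following should cover it (assuming the volume name gv0 from the description):

gluster volume info gv0
gluster volume status gv0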
[root@byg612sv160 ~]# rpm -qa | grep gluster
glusterfs-rdma-3.12.3-1.el7.x86_64
glusterfs-client-xlators-3.12.3-1.el7.x86_64
glusterfs-3.12.3-1.el7.x86_64
glusterfs-cli-3.12.3-1.el7.x86_64
glusterfs-libs-3.12.3-1.el7.x86_64
glusterfs-fuse-3.12.3-1.el7.x86_64
glusterfs-api-3.12.3-1.el7.x86_64
glusterfs-server-3.12.3-1.el7.x86_64

[root@byg612sv160 ~]# uname -a
Linux byg612sv160 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@byg612sv160 ~]# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick g173:/dfs/brick1/gv0                  49152     0          Y       2683
Brick g174:/dfs/brick1/gv0                  49152     0          Y       2617
Brick g61:/dfs/brick1/gv0                   49152     0          Y       2367
Brick g62:/dfs/brick1/gv0                   49152     0          Y       3236
Brick g121:/dfs/brick1/gv0                  49152     0          Y       2064
Brick g122:/dfs/brick1/gv0                  49152     0          Y       2075
Brick g201:/dfs/brick1/gv0                  49152     0          Y       3034
Brick g202:/dfs/brick1/gv0                  49152     0          Y       2399
Brick g203:/dfs/brick1/gv0                  49152     0          Y       2892
Brick g206:/dfs/brick1/gv0                  49152     0          Y       2485
Brick g150:/dfs/brick1/gv0                  49152     0          Y       3276
Brick g151:/dfs/brick1/gv0                  49152     0          Y       3062
Brick g152:/dfs/brick1/gv0                  49152     0          Y       187895
Brick g153:/dfs/brick1/gv0                  49152     0          Y       61796
Brick g154:/dfs/brick1/gv0                  49152     0          Y       147263
Brick g155:/dfs/brick1/gv0                  49152     0          Y       61524
Brick g156:/dfs/brick1/gv0                  49152     0          Y       253395
Brick g157:/dfs/brick1/gv0                  49152     0          Y       3222
Brick g160:/dfs/brick1/gv0                  49152     0          Y       249217
Brick g161:/dfs/brick1/gv0                  49152     0          Y       192749
Self-heal Daemon on localhost               N/A       N/A        Y       330489
Self-heal Daemon on g206                    N/A       N/A        Y       3652
Self-heal Daemon on g203                    N/A       N/A        Y       67033
Self-heal Daemon on g61                     N/A       N/A        Y       441931
Self-heal Daemon on g152                    N/A       N/A        Y       188517
Self-heal Daemon on g151                    N/A       N/A        Y       72832
Self-heal Daemon on g154                    N/A       N/A        Y       423672
Self-heal Daemon on g155                    N/A       N/A        Y       375165
Self-heal Daemon on g122                    N/A       N/A        Y       442967
Self-heal Daemon on g150                    N/A       N/A        Y       329818
Self-heal Daemon on g153                    N/A       N/A        Y       27126
Self-heal Daemon on g157                    N/A       N/A        Y       102113
Self-heal Daemon on g156                    N/A       N/A        Y       319339
Self-heal Daemon on g202                    N/A       N/A        Y       81427
Self-heal Daemon on g62                     N/A       N/A        Y       108351
Self-heal Daemon on g161                    N/A       N/A        Y       218759
Self-heal Daemon on g121                    N/A       N/A        Y       358359
Self-heal Daemon on g201                    N/A       N/A        Y       32230
Self-heal Daemon on g173                    N/A       N/A        Y       44555
Self-heal Daemon on g174                    N/A       N/A        Y       41594

Task Status of Volume gv0
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 5e50a6d6-1e3b-4468-9b0b-9a9ec48dee3c
Status               : in progress

[root@byg612sv160 ~]#
Is the rebalance process still running on the nodes? You can use "ps ax | grep rebalance" to check. Rebalance will try to finish migrating the files already in its queue before terminating, which may be why it has not stopped.
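For example, on each server node (the rebalance daemon shows up as a glusterfs process):

# list any rebalance daemons still running on this node
ps ax | grep -i rebalance | grep -v grep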
Also, are you using the upstream release bits or the supported RHGS builds?
Yes, the rebalance process is still running, and it has been making very slow progress for almost a week. It does not look like it is migrating files; it is just doing fix-layout. We have over 110 TB of files (many of them small files) in our gluster storage.
The version numbers do not match the downstream RHBZ builds. Moving this to the Community release.
(In reply to Qigang from comment #6)
> Yes, the rebalance process is still running, and it has been making very
> slow progress for almost a week. It does not look like it is migrating
> files; it is just doing fix-layout. We have over 110 TB of files (many of
> them small files) in our gluster storage.

Do you have a lot of directories? If yes, fixing the layout on those will take a lot of time but will not show up in the status.

The problem with the cli commands is probably because of a mismatch in the glusterd node info files. Asking Atin to provide the steps to work around this.
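As a read-only sanity check (assuming the default glusterd working directory under /var/lib/glusterd; this is not the workaround itself), the per-node rebalance state can be compared across peers:

# run on every node and compare the output
grep -i rebalance /var/lib/glusterd/vols/gv0/node_state.info

If the entries disagree between nodes, that would explain why some commands report "not started" while others report "already started".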
If you do not have lookup-optimize enabled on the volume, you can kill the rebalance processes, then perform the steps Atin will provide to clean up the node_state.info files.
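A minimal sketch of that sequence, assuming the volume is gv0 and using the ps check from earlier (the node_state.info cleanup itself should wait for the exact steps from Atin):

# 1. confirm lookup-optimize is not enabled on the volume
gluster volume get gv0 cluster.lookup-optimize

# 2. on each node still running a rebalance daemon, find and stop it
ps ax | grep -i rebalance | grep -v grep
kill <pid-of-rebalance-daemon>   # placeholder pid, repeat per node

# 3. only then clean up /var/lib/glusterd/vols/gv0/node_state.info on the
#    affected nodes, following the steps to be provided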
Yes, we have a lot of directories. The rebalance log file /var/log/glusterfs/gv0-rebalance.log records each folder as it is scanned and so can be read as a rough status report, but it is way too slow and there is no progress indicator, so we have no idea how long it will take.

---- one entry in gv0-rebalance.log ----
[2019-05-13 05:09:10.236068] I [MSGID: 109081] [dht-common.c:4379:dht_setxattr] 0-gv0-dht: fixing the layout of /yangdk2_data/data/meitu/meitu_img/train/gameplaying/954742707
---- one entry in gv0-rebalance.log ----

The rebalance process is only observed on the two newly added node pairs. Our lookup-optimize setting is off. Thank you very much.
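For reference, a rough way to see how far the fix-layout has progressed from that log (just a directory counter, not an ETA) would be:

# each "fixing the layout of <dir>" entry corresponds to one directory processed
grep -c "fixing the layout" /var/log/glusterfs/gv0-rebalance.log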
Closing this bug as there is no activity. Please reopen if you have any new concerns.