Description of problem:
A volume had quota enabled, with limits set on the root of the volume and on the directories underneath it, and I/O in progress. The I/O was running over an NFS mount, in two different directories. An add-brick was invoked, followed by a rebalance. Since the volume held a large amount of data (roughly 2.4 TB), the rebalance ran for a long time, but it eventually ended with status "stopped" on all nodes of the cluster. The rebalance stopped because bricks went down, but the reason the bricks went down could not be found. This remains an open question.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.36rhs

How reproducible:
Seen on one RHS cluster of 4 nodes.

Steps to Reproduce (this is the scenario in which the issue was seen; a sketch of the corresponding CLI commands follows the results below):
1. Create a quota-enabled volume with limits set on "/" and on the directories underneath it.
2. Have data already present inside those directories.
3. In two more directories, keep creating data with two different scripts, both running in parallel.
4. Run add-brick followed by rebalance.

Actual results:

[root@quota5 ~]# gluster volume rebalance dist-rep status
                                    Node  Rebalanced-files        size      scanned    failures     skipped       status   run time in secs
                               ---------       -----------  ----------  -----------  ----------  ----------  -----------   ----------------
                               localhost            149234       1.6GB       477247           5         534      stopped           62895.00
                            10.70.35.191             87293     977.8MB       735462          20      184122      stopped           62894.00
                            10.70.35.108               447       8.7MB       712639           5         235      stopped           62894.00
                            10.70.35.144                 0      0Bytes       712494           5           0      stopped           62892.00
       rhsauto004.lab.eng.blr.redhat.com                 0      0Bytes       713601           5           0      stopped           62893.00

The rebalance stopped because bricks went down:

[root@quota6 ~]# gluster volume status
Status of volume: dist-rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.188:/rhs/brick1/d1r1                     N/A     N       9376
Brick 10.70.35.108:/rhs/brick1/d1r2                     N/A     N       9157
Brick 10.70.35.191:/rhs/brick1/d2r1                     49152   Y       9151
Brick 10.70.35.144:/rhs/brick1/d2r2                     49152   Y       9148
Brick 10.70.35.188:/rhs/brick1/d3r1                     49153   Y       9387
Brick 10.70.35.108:/rhs/brick1/d3r2                     49153   Y       9168
Brick 10.70.35.191:/rhs/brick1/d4r1                     49153   Y       9162
Brick 10.70.35.144:/rhs/brick1/d4r2                     49153   Y       9159
Brick 10.70.35.188:/rhs/brick1/d5r1                     N/A     N       9398
Brick 10.70.35.108:/rhs/brick1/d5r2                     49154   Y       9179
Brick 10.70.35.191:/rhs/brick1/d6r1                     49154   Y       9173
Brick 10.70.35.144:/rhs/brick1/d6r2                     49154   Y       9170
Brick 10.70.35.188:/rhs/brick1/d1r1-add                 49155   Y       11217
Brick 10.70.35.108:/rhs/brick1/d1r2-add                 49155   Y       10092
NFS Server on localhost                                 2049    Y       10104
Self-heal Daemon on localhost                           N/A     Y       10111
Quota Daemon on localhost                               N/A     Y       10118
NFS Server on 10.70.35.191                              2049    Y       10191
Self-heal Daemon on 10.70.35.191                        N/A     Y       10200
Quota Daemon on 10.70.35.191                            N/A     Y       10205
NFS Server on 10.70.35.144                              2049    Y       10086
Self-heal Daemon on 10.70.35.144                        N/A     Y       10095
Quota Daemon on 10.70.35.144                            N/A     Y       10100
NFS Server on rhsauto004.lab.eng.blr.redhat.com         2049    Y       15138
Self-heal Daemon on rhsauto004.lab.eng.blr.redhat.com   N/A     Y       15145
Quota Daemon on rhsauto004.lab.eng.blr.redhat.com       N/A     Y       15153
NFS Server on 10.70.35.188                              2049    Y       11236
Self-heal Daemon on 10.70.35.188                        N/A     Y       11243
Quota Daemon on 10.70.35.188                            N/A     Y       11250

           Task                                      ID         Status
           ----                                      --         ------
      Rebalance    719dd624-b733-47c9-a487-296ec18544c7              2

From the logs it was not possible to determine why the bricks went down.

Expected results:
The bricks should not go down; because they did, the rebalance never reaches a healthy (completed) status. Is this happening because the quota limit had already been reached on some of the directories?
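For context, here is a minimal sketch of the CLI sequence behind the steps above, assuming the volume name dist-rep and the added brick pair visible in the status output; the quota limits are the ones listed under "Additional info" below, and everything else (which directories are written to, how the client I/O is generated) is only illustrative:

# 1-2. Enable quota and set limits on "/" and the directories underneath it
#      (hard-limit values as shown in the quota list under "Additional info")
gluster volume quota dist-rep enable
gluster volume quota dist-rep limit-usage /    2.9TB
gluster volume quota dist-rep limit-usage /qa1 512GB
gluster volume quota dist-rep limit-usage /qa3 100GB

# 3. From the NFS mount, keep creating data in two different directories
#    with two scripts running in parallel (client side; the script contents
#    are not part of this report)

# 4. Add a replica pair, start the rebalance, then poll its status
gluster volume add-brick dist-rep 10.70.35.188:/rhs/brick1/d1r1-add 10.70.35.108:/rhs/brick1/d1r2-add
gluster volume rebalance dist-rep start
gluster volume rebalance dist-rep status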
Additional info:

[root@quota6 ~]# gluster volume quota dist-rep list
                  Path                   Hard-limit  Soft-limit      Used  Available
--------------------------------------------------------------------------------
/                                           2.9TB        80%       1.3TB      1.6TB
/qa1                                      512.0GB        80%     421.5GB     90.5GB
/qa2                                      512.0GB        80%     399.7GB    112.3GB
/qa3                                      100.0GB        80%      83.4GB     16.6GB
/qa4                                      100.0GB        80%      83.3GB     16.7GB
/qa1/dir1                                 500.0GB        80%     337.8GB    162.2GB
/qa2/dir1                                 500.0GB        80%     316.4GB    183.6GB
/qa5                                      500.0GB        80%     361.7GB    138.3GB
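Since the open question is why the bricks went down (d1r1 and d5r1 on 10.70.35.188, d1r2 on 10.70.35.108), here is a sketch of the checks one could run on those nodes, assuming the default glusterfs/RHS log locations; the brick log file name shown is derived from one example brick path and is only illustrative:

# Last messages from the brick log before the brick died
# (log file name is derived from the brick path, e.g. /rhs/brick1/d1r1)
tail -n 100 /var/log/glusterfs/bricks/rhs-brick1-d1r1.log

# Crash signatures or error-level entries in the same brick log
grep -E 'crash|signal received|\] E \[' /var/log/glusterfs/bricks/rhs-brick1-d1r1.log | tail -n 50

# glusterd and rebalance logs on the same node, around the time the brick went offline
tail -n 100 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
tail -n 100 /var/log/glusterfs/dist-rep-rebalance.log

# A core file would indicate the brick process crashed rather than being stopped
# (location depends on the configured kernel.core_pattern; "/" is the usual default here)
ls -l /core*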
Please file a new bug if this issue is still seen in 3.1.x.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.