Description of problem:
=======================
On a 2 x 2 distribute-replicate volume, all the bricks were 100% full. Hence, added new bricks to the volume, changing the volume type to 3 x 2. Started rebalance on the volume to migrate files from the existing bricks to the newly added bricks. Migration of the files was skipped and only directories were created on the newly added bricks. Storage nodes node2 (replicate-0) and node3 (replicate-1) were shut down. Since the migration of files had been skipped, restarted the rebalance process with the "force" option from node1. Checking the rebalance status from each of the online nodes shows only localhost and node1 (the node where rebalance was started); the remaining online nodes are missing from the output.

Actual Result:
================
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Node1 : "gluster volume rebalance <volume_name> status"
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@ip-10-64-69-235 [Nov-28-2013- 9:11:41] >gluster v rebalance vol_rep status
    Node           Rebalanced-files   size      scanned   failures   skipped   status      run time in secs
    ---------      -----------        -------   -------   --------   -------   ---------   ----------------
    localhost      273                266.9GB   2026      0          2         completed   12841.00
volume rebalance: vol_rep: success:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Node4 : "gluster volume rebalance <volume_name> status"
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@ip-10-101-31-43 [Nov-28-2013- 9:17:59] >gluster v rebalance vol_rep status
    Node           Rebalanced-files   size      scanned   failures   skipped   status      run time in secs
    ---------      -----------        -------   -------   --------   -------   ---------   ----------------
    localhost      260                248.6GB   1976      0          0         completed   12804.00
    10.64.69.235   273                266.9GB   2026      0          2         completed   12841.00
volume rebalance: vol_rep: success:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Node5 : "gluster volume rebalance <volume_name> status"
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@ip-10-235-46-241 [Nov-28-2013- 9:18:36] >gluster v rebalance vol_rep status
    Node           Rebalanced-files   size      scanned   failures   skipped   status      run time in secs
    ---------      -----------        -------   -------   --------   -------   ---------   ----------------
    localhost      0                  0Bytes    1753      0          0         completed   19.00
    10.64.69.235   273                266.9GB   2026      0          2         completed   12841.00
volume rebalance: vol_rep: success:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Node6 : "gluster volume rebalance <volume_name> status"
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@ip-10-29-187-33 [Nov-28-2013- 9:19:04] >gluster v rebalance vol_rep status
    Node           Rebalanced-files   size      scanned   failures   skipped   status      run time in secs
    ---------      -----------        -------   -------   --------   -------   ---------   ----------------
    localhost      0                  0Bytes    1753      0          0         completed   20.00
    10.64.69.235   273                266.9GB   2026      0          2         completed   12841.00
volume rebalance: vol_rep: success:

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.44.1u2rhs built on Nov 25 2013 08:17:39

How reproducible:
=================

Steps to Reproduce:
=====================
1. Create a 2 x 2 distribute-replicate volume with 4 storage nodes and 1 brick per storage node.
2. Create a fuse mount. Fill the volume by creating directories and files.
3. Once the volume is filled, add 2 new servers to the cluster.
4. Add bricks from the 2 new servers to the volume.
5. Start rebalance ("gluster volume rebalance <volume_name> start").
6. Once the rebalance is complete, bring down node2 and node3. (Rebalance of the files was skipped; refer to bug https://bugzilla.redhat.com/show_bug.cgi?id=1035647.)
7. Start rebalance with the force option.
8. Check the status of the rebalance with "gluster volume rebalance <volume_name> status". (The whole flow is condensed into a shell session at the end of this comment.)

Expected results:
==================
Expected to see the rebalance status from all the online nodes.

Additional info:
=================
root@ip-10-64-69-235 [Nov-28-2013- 9:47:12] >gluster v info

Volume Name: vol_rep
Type: Distributed-Replicate
Volume ID: 02b066e9-4800-43ca-9556-2b06973d9cdf
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.64.69.235:/rhs/bricks/b1
Brick2: 10.202.206.127:/rhs/bricks/b1_rep1
Brick3: 10.111.67.22:/rhs/bricks/b2
Brick4: 10.101.31.43:/rhs/bricks/b2_rep1
Brick5: 10.235.46.241:/rhs/bricks/b3
Brick6: 10.29.187.33:/rhs/bricks/b3_rep1

root@ip-10-64-69-235 [Nov-28-2013-10:09:25] >gluster v status
Status of volume: vol_rep
Gluster process                            Port    Online   Pid
------------------------------------------------------------------------------
Brick 10.64.69.235:/rhs/bricks/b1          49152   Y        6466
Brick 10.101.31.43:/rhs/bricks/b2_rep1     49152   Y        6286
Brick 10.235.46.241:/rhs/bricks/b3         49152   Y        15112
Brick 10.29.187.33:/rhs/bricks/b3_rep1     49152   Y        15224
NFS Server on localhost                    2049    Y        16350
Self-heal Daemon on localhost              N/A     Y        16357
NFS Server on 10.182.189.213               2049    Y        3414
Self-heal Daemon on 10.182.189.213         N/A     Y        3418
NFS Server on 10.235.52.104                2049    Y        5960
Self-heal Daemon on 10.235.52.104          N/A     Y        5964
NFS Server on 10.235.46.241                2049    Y        15124
Self-heal Daemon on 10.235.46.241          N/A     Y        15131
NFS Server on 10.29.187.33                 2049    Y        15236
Self-heal Daemon on 10.29.187.33           N/A     Y        15243
NFS Server on 10.80.109.197                2049    Y        3460
Self-heal Daemon on 10.80.109.197          N/A     Y        3464
NFS Server on 10.101.31.43                 2049    Y        27670
Self-heal Daemon on 10.101.31.43           N/A     Y        27677

Task Status of Volume vol_rep
------------------------------------------------------------------------------
Task   : Rebalance
ID     : c64d47b9-197a-4805-bf7e-8f3dfe7f191a
Status : completed
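For convenience, the reproduction steps above condensed into a shell session (the hostnames and brick paths are illustrative stand-ins, not the exact ones from this report):

    # 1-2. Create a 2 x 2 distribute-replicate volume, mount it, and fill it
    gluster volume create vol_rep replica 2 \
        node1:/rhs/bricks/b1 node2:/rhs/bricks/b1_rep1 \
        node3:/rhs/bricks/b2 node4:/rhs/bricks/b2_rep1
    gluster volume start vol_rep
    mount -t glusterfs node1:/vol_rep /mnt/vol_rep

    # 3-4. Add two new servers to the cluster and grow the volume to 3 x 2
    gluster peer probe node5
    gluster peer probe node6
    gluster volume add-brick vol_rep node5:/rhs/bricks/b3 node6:/rhs/bricks/b3_rep1

    # 5-7. Rebalance; after it completes and node2/node3 go down,
    #      restart the rebalance with the force option
    gluster volume rebalance vol_rep start
    gluster volume rebalance vol_rep start force

    # 8. Check the status from each of the online nodes
    gluster volume rebalance vol_rep status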
Once again able to recreate the issue, with the following case, on the build: glusterfs 3.4.0.57rhs built on Jan 13 2014 06:59:05
==============================================================
The following case was executed on AWS RHS instances.
1) Create a 2 x 3 distribute-replicate volume. (3 volumes: exporter, importer, ftp. ftp doesn't have any data; only exporter and importer have data.)
2) Fill each brick with 320GB of data.
3) Terminate an instance (NODE2).
4) Replace the terminated instance.
5) Start heal full. (The heal completed successfully.)
6) The brick disks got almost full; 20GB remained out of 840GB.
7) Add 3 nodes to the pool.
8) Add 3 bricks to the volume (exporter, importer).
9) Start rebalance on exporter and importer.
10) While rebalance is in progress, terminate NODE5 and NODE9.

root@domU-12-31-39-0A-99-B2 [Jan-21-2014- 7:24:48] >gluster v rebalance importer status
    Node                                        Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
    ---------                                   -----------        ------   -------   --------   -------   -----------   ----------------
    localhost                                   6540               35.3GB   43532     0          592       in progress   13061.00
    ip-10-234-21-235.ec2.internal               0                  0Bytes   157780    0          0         completed     10310.00
    ip-10-2-34-53.ec2.internal                  6514               35.4GB   61381     0          3092      in progress   13060.00
    ip-10-114-195-155.ec2.internal              0                  0Bytes   157781    0          0         completed     10310.00
    ip-10-159-26-108.ec2.internal               0                  0Bytes   157780    0          0         completed     10310.00
    ip-10-194-111-63.ec2.internal               0                  0Bytes   157781    0          0         completed     10310.00
    domU-12-31-39-07-74-A5.compute-1.internal   0                  0Bytes   157781    0          4506      completed     10680.00
    ip-10-62-118-194.ec2.internal               0                  0Bytes   157781    0          0         completed     10310.00
    ip-10-182-195-170.ec2.internal              0                  0Bytes   157781    0          0         completed     10309.00
volume rebalance: importer: success:

root@domU-12-31-39-0A-99-B2 [Jan-21-2014- 7:24:53] >gluster v rebalance exporter status
    Node                                        Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
    ---------                                   -----------        ------   -------   --------   -------   -----------   ----------------
    localhost                                   4914               26.3GB   40222     0          295       in progress   13180.00
    ip-10-234-21-235.ec2.internal               0                  0Bytes   159575    0          0         completed     11259.00
    ip-10-2-34-53.ec2.internal                  5338               29.0GB   53529     0          2377      in progress   13180.00
    ip-10-114-195-155.ec2.internal              0                  0Bytes   158415    0          0         completed     11636.00
    ip-10-159-26-108.ec2.internal               0                  0Bytes   158406    0          0         completed     11259.00
    ip-10-194-111-63.ec2.internal               0                  0Bytes   158417    0          0         completed     11633.00
    domU-12-31-39-07-74-A5.compute-1.internal   0                  0Bytes   158421    0          1281      completed     11671.00
    ip-10-62-118-194.ec2.internal               0                  0Bytes   158416    0          0         completed     11427.00
    ip-10-182-195-170.ec2.internal              0                  0Bytes   158405    0          0         completed     11260.00
volume rebalance: exporter: success:

root@domU-12-31-39-0A-99-B2 [Jan-21-2014- 7:25:10] >df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1             99G  1.8G   96G   2% /
none                  3.7G     0  3.7G   0% /dev/shm
/dev/md0              840G  771G   69G  92% /rhs/bricks
root@domU-12-31-39-0A-99-B2 [Jan-21-2014- 7:47:03] >gluster v status
Status of volume: exporter
Gluster process                                                         Port    Online   Pid
------------------------------------------------------------------------------
Brick domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/bricks/exporter    49152   Y        19405
Brick ip-10-194-111-63.ec2.internal:/rhs/bricks/exporter                49152   Y        6496
Brick ip-10-234-21-235.ec2.internal:/rhs/bricks/exporter                49152   Y        20226
Brick ip-10-2-34-53.ec2.internal:/rhs/bricks/exporter                   49152   Y        20910
Brick ip-10-159-26-108.ec2.internal:/rhs/bricks/exporter                49152   Y        20196
Brick domU-12-31-39-07-74-A5.compute-1.internal:/rhs/bricks/exporter    49152   Y        6553
Brick ip-10-62-118-194.ec2.internal:/rhs/bricks/exporter                49152   Y        6391
NFS Server on localhost                                                 2049    Y        30972
Self-heal Daemon on localhost                                           N/A     Y        30985
NFS Server on ip-10-234-21-235.ec2.internal                             2049    Y        30866
Self-heal Daemon on ip-10-234-21-235.ec2.internal                       N/A     Y        30873
NFS Server on ip-10-2-34-53.ec2.internal                                2049    Y        1260
Self-heal Daemon on ip-10-2-34-53.ec2.internal                          N/A     Y        1267
NFS Server on ip-10-159-26-108.ec2.internal                             2049    Y        3153
Self-heal Daemon on ip-10-159-26-108.ec2.internal                       N/A     Y        3160
NFS Server on ip-10-194-111-63.ec2.internal                             2049    Y        16623
Self-heal Daemon on ip-10-194-111-63.ec2.internal                       N/A     Y        16630
NFS Server on ip-10-62-118-194.ec2.internal                             2049    Y        6498
Self-heal Daemon on ip-10-62-118-194.ec2.internal                       N/A     Y        6505
NFS Server on domU-12-31-39-07-74-A5.compute-1.internal                 2049    Y        6658
Self-heal Daemon on domU-12-31-39-07-74-A5.compute-1.internal           N/A     Y        6665

Task Status of Volume exporter
------------------------------------------------------------------------------
Task   : Rebalance
ID     : c04d1bda-2482-4930-a69c-d6e0c76a1660
Status : in progress

Status of volume: ftp
Gluster process                                                         Port    Online   Pid
------------------------------------------------------------------------------
Brick domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/bricks/ftp         49154   Y        19541
Brick ip-10-194-111-63.ec2.internal:/rhs/bricks/ftp                     49154   Y        6659
Brick ip-10-234-21-235.ec2.internal:/rhs/bricks/ftp                     49154   Y        20333
Brick ip-10-2-34-53.ec2.internal:/rhs/bricks/ftp                        49154   Y        21023
Brick ip-10-159-26-108.ec2.internal:/rhs/bricks/ftp                     49154   Y        20301
NFS Server on localhost                                                 2049    Y        30972
Self-heal Daemon on localhost                                           N/A     Y        30985
NFS Server on ip-10-234-21-235.ec2.internal                             2049    Y        30866
Self-heal Daemon on ip-10-234-21-235.ec2.internal                       N/A     Y        30873
NFS Server on ip-10-159-26-108.ec2.internal                             2049    Y        3153
Self-heal Daemon on ip-10-159-26-108.ec2.internal                       N/A     Y        3160
NFS Server on ip-10-2-34-53.ec2.internal                                2049    Y        1260
Self-heal Daemon on ip-10-2-34-53.ec2.internal                          N/A     Y        1267
NFS Server on ip-10-194-111-63.ec2.internal                             2049    Y        16623
Self-heal Daemon on ip-10-194-111-63.ec2.internal                       N/A     Y        16630
NFS Server on domU-12-31-39-07-74-A5.compute-1.internal                 2049    Y        6658
Self-heal Daemon on domU-12-31-39-07-74-A5.compute-1.internal           N/A     Y        6665
NFS Server on ip-10-62-118-194.ec2.internal                             2049    Y        6498
Self-heal Daemon on ip-10-62-118-194.ec2.internal                       N/A     Y        6505

Task Status of Volume ftp
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: importer
Gluster process                                                         Port    Online   Pid
------------------------------------------------------------------------------
Brick domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/bricks/importer    49153   Y        19470
Brick ip-10-194-111-63.ec2.internal:/rhs/bricks/importer                49153   Y        6615
Brick ip-10-234-21-235.ec2.internal:/rhs/bricks/importer                49153   Y        20275
Brick ip-10-2-34-53.ec2.internal:/rhs/bricks/importer                   49153   Y        20960
Brick ip-10-159-26-108.ec2.internal:/rhs/bricks/importer                49153   Y        20245
Brick domU-12-31-39-07-74-A5.compute-1.internal:/rhs/bricks/importer    49153   Y        6646
Brick ip-10-62-118-194.ec2.internal:/rhs/bricks/importer                49153   Y        6486
NFS Server on localhost                                                 2049    Y        30972
Self-heal Daemon on localhost                                           N/A     Y        30985
NFS Server on ip-10-234-21-235.ec2.internal                             2049    Y        30866
Self-heal Daemon on ip-10-234-21-235.ec2.internal                       N/A     Y        30873
NFS Server on domU-12-31-39-07-74-A5.compute-1.internal                 2049    Y        6658
Self-heal Daemon on domU-12-31-39-07-74-A5.compute-1.internal           N/A     Y        6665
NFS Server on ip-10-194-111-63.ec2.internal                             2049    Y        16623
Self-heal Daemon on ip-10-194-111-63.ec2.internal                       N/A     Y        16630
NFS Server on ip-10-159-26-108.ec2.internal                             2049    Y        3153
Self-heal Daemon on ip-10-159-26-108.ec2.internal                       N/A     Y        3160
NFS Server on ip-10-2-34-53.ec2.internal                                2049    Y        1260
Self-heal Daemon on ip-10-2-34-53.ec2.internal                          N/A     Y        1267
NFS Server on ip-10-62-118-194.ec2.internal                             2049    Y        6498
Self-heal Daemon on ip-10-62-118-194.ec2.internal                       N/A     Y        6505

Task Status of Volume importer
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 6b5030f9-fdc1-41de-9f05-ea9292a96955
Status : in progress

root@domU-12-31-39-0A-99-B2 [Jan-21-2014- 7:47:07] >gluster v rebalance exporter status
    Node                            Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
    ---------                       -----------        ------   -------   --------   -------   -----------   ----------------
    localhost                       5738               30.8GB   47148     0          295       in progress   14512.00
    ip-10-234-21-235.ec2.internal   0                  0Bytes   159575    0          0         completed     11259.00
    ip-10-2-34-53.ec2.internal      6148               33.6GB   61869     0          2707      in progress   14512.00
volume rebalance: exporter: success:

root@domU-12-31-39-0A-99-B2 [Jan-21-2014- 7:47:12] >gluster v rebalance importer status
    Node                            Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
    ---------                       -----------        ------   -------   --------   -------   -----------   ----------------
    localhost                       7473               40.2GB   51338     0          754       in progress   14413.00
    ip-10-234-21-235.ec2.internal   0                  0Bytes   157780    0          0         completed     10310.00
    ip-10-2-34-53.ec2.internal      7386               40.1GB   70773     0          3270      in progress   14412.00
volume rebalance: importer: success:

Note that after NODE5 and NODE9 were terminated, the rebalance status output lists only 3 of the 9 nodes, even though the other 7 nodes are still online.
Following are the cases to recreate the issue:

Case 1:-
=======
1) Create a 2 x 3 distribute-replicate volume.
2) Create files/dirs from the mount point.
3) Add 3 more bricks to the volume. Start rebalance. Check the rebalance status.
4) Power off a storage node.
5) Check the rebalance status.

Result:-
===========
Incomplete status. The output does not show the rebalance status of some of the online nodes.

Case 2:-
===========
1) Create a 2 x 3 distribute-replicate volume.
2) Create files/dirs from the mount point.
3) Add 3 more bricks to the volume. Start rebalance. Check the rebalance status.
4) Power off a storage node.
5) Check the rebalance status.
6) Replace the brick on the powered-off node with a new node (commit force).
7) Check the rebalance status.

Actual Result:
===============
Even though the powered-off storage node is no longer part of any volume, the rebalance status is still incomplete. It does not show the rebalance status of all the online nodes.

8) Peer detach the powered-off node and then execute "rebalance status" (see the sketch below).

Result:-
==========
Shows the rebalance status of all the online nodes.
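As step 8 of Case 2 shows, once the powered-off node no longer hosts any bricks, detaching it restores the full status output. A minimal sketch of that workaround (the hostname is a placeholder; detaching an unreachable peer may require the force option):

    gluster peer detach <powered_off_node> force
    gluster volume rebalance <volume_name> status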
This happens due to changes done to show the rebalance status output in a consistent sequence. Earlier, the status information returned by different peers was indexed based on the order in which the peers responded, which led to inconsistent ordering of the rebalance status output. To get consistent ordering, each peer is now given a fixed index based on its position in the peer list. With this change, it is now possible to have holes in the indexed status information if a peer is down or is not reachable. But the cli output code hasn't been updated to account for these holes: it stops abruptly at the first hole it hits, even if further status information is available. This wasn't a problem earlier, as there was no chance of holes appearing in the indices.
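A toy model of the problem (this is not the actual cli code, which is part of glusterfs's C sources; the sketch below only models status entries keyed by fixed peer indices, with index 2 left as a hole because that peer is down):

    #!/bin/bash
    # Status entries keyed by each peer's fixed position in the peer list;
    # peer 2 is down, so index 2 was never filled in.
    declare -A status=( [1]="node1  completed" [3]="node3  completed" [4]="node4  completed" )
    peer_count=4

    echo "current (buggy) output, which stops at the first hole:"
    for ((i = 1; i <= peer_count; i++)); do
        [ -z "${status[$i]}" ] && break      # abrupt stop at the hole
        echo "  ${status[$i]}"
    done

    echo "expected output, which skips over holes:"
    for ((i = 1; i <= peer_count; i++)); do
        [ -z "${status[$i]}" ] && continue   # account for the hole and keep going
        echo "  ${status[$i]}"
    done

The first loop prints only node1, mirroring the truncated status output seen above; the second prints all three online peers.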
(In reply to Kaushal from comment #4)
> This happens due to changes done to show the rebalance status output in a
> consistent sequence.

Can you please provide the RFE bug on why this change was introduced?
The change was introduced as a fix for bug 888390.
Test Case:
==================
1. Create a 2 x 2 dis-rep volume and start it (4 storage nodes).
2. Create a fuse mount. Create files/dirs from the mount point.
3. Add bricks to the volume.
4. Start rebalance. While rebalance is in progress, reboot node1 and node3.
5. Check the rebalance status from a node which is part of the cluster but not part of the volume.

Output:
============
Rebalance status before node reboot:
====================================
root@mia [Jul-04-2014-11:58:08] >gluster v rebalance rep status
    Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
    ---------      -----------        ------   -------   --------   -------   -----------   ----------------
    10.70.36.35    587                11.4MB   2927      1          55        in progress   91.00
    rhs-client12   0                  0Bytes   6104      0          0         in progress   91.00
    rhs-client14   0                  0Bytes   6104      0          0         in progress   91.00
    rhs-client13   628                13.1MB   2744      0          0         in progress   91.00
volume rebalance: rep: success:

Rebalance status when the nodes are down:
=========================================
root@mia [Jul-04-2014-12:00:31] >gluster v rebalance rep status
    Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
    ---------      -----------        ------   -------   --------   -------   -----------   ----------------
volume rebalance: rep: success:
Cloning this to 3.1, to be fixed in a future release.