Description of problem:
=======================
If one of the nodes in the cluster is rebooted while a gluster volume remove-brick operation is running, remove-brick status displays two entries for one host.

Version-Release number of selected component (if applicable):
=============================================================
3.4.0.12rhs-1.el6rhs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a distribute volume with 3 bricks

   gluster v create testvol 10.70.34.105:/rhs/brick1/br1 10.70.34.86:/rhs/brick1/br2 10.70.34.85:/rhs/brick1/br3
   volume create: testvol: success: please start the volume to access data

2. Fill the mount point with some files

3. Perform the remove-brick operation

   gluster volume remove-brick testvol 10.70.34.85:/rhs/brick1/br3 start
   volume remove-brick start: success
   ID: 749edc8f-9da5-4d64-b844-a230cb4d820b

4. Check the status

   gluster volume remove-brick testvol 10.70.34.85:/rhs/brick1/br3 status
   Node          Rebalanced-files  size     scanned  failures  status       run-time in secs
   localhost     0                 0Bytes   0        0         not started  0.00
   10.70.34.86   0                 0Bytes   0        0         not started  0.00
   10.70.34.85   17                170.0MB  180      0         in progress  4.00

5. Reboot one of the nodes [10.70.34.86]

6. Check remove-brick status [localhost is displayed twice]

   gluster volume remove-brick testvol 10.70.34.85:/rhs/brick1/br3 status
   Node          Rebalanced-files  size     scanned  failures  status       run-time in secs
   localhost     0                 0Bytes   0        0         not started  0.00
   localhost     0                 0Bytes   0        0         not started  0.00
   10.70.34.85   88                880.0MB  250      0         completed    16.00

Actual results:
===============
Remove-brick status displays localhost twice.

Expected results:
=================
Status should show one entry per host.

Additional info:
================
gluster peer status
Number of Peers: 2

Hostname: 10.70.34.86
Uuid: e33a6ffa-969d-4b84-8e40-1274aab4be80
State: Peer in Cluster (Connected)

Hostname: 10.70.34.85
Uuid: 800e7dbd-2f0d-4d43-af18-16a13142466f
State: Peer in Cluster (Connected)
sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/979376/
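The duplicate-entry symptom is visible directly in the plain-text status table: the same value appears twice in the Node column. As a minimal sketch (the helper name and the embedded sample table are illustrative, not part of gluster), the check can be expressed as:

```python
# Hypothetical helper, not part of gluster: parse the first column of a
# plain-text remove-brick status table and report any node listed more
# than once, reproducing the duplicate-"localhost" symptom from step 6.
from collections import Counter

# Sample output as observed on an affected build (3.4.0.12rhs-1).
STATUS_OUTPUT = """\
Node         Rebalanced-files  size     scanned  failures  status       run-time in secs
localhost    0                 0Bytes   0        0         not started  0.00
localhost    0                 0Bytes   0        0         not started  0.00
10.70.34.85  88                880.0MB  250      0         completed    16.00
"""

def duplicate_nodes(status_text):
    """Return node names that appear more than once in the status table."""
    rows = status_text.strip().splitlines()[1:]  # skip the header row
    counts = Counter(row.split()[0] for row in rows if row.strip())
    return [node for node, n in counts.items() if n > 1]

print(duplicate_nodes(STATUS_OUTPUT))  # ['localhost'] on an affected build
```

On a fixed build the same check returns an empty list, since each host appears at most once.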
Kaushal, can you check whether this is still an issue? I don't remember seeing it in any of the recent builds. If it is no longer seen, can you move the bug to ON_QA?
Changes made for bug 1019846 fix this issue. Moving to ON_QA.
Can you please verify the doc text for technical accuracy?
The doc text looks fine.
Version: glusterfs 3.4.0.55rhs
==============================
Previously, when one of the nodes was rebooted while a remove-brick operation was in progress, checking the remove-brick status showed localhost twice. Now only the node from which the brick was removed is shown.

In my opinion, when a remove-brick operation is started, the status should show the nodes FROM which the data is moving and TO which node, i.e. both the SOURCE and DESTINATION nodes should be shown.

Steps:
======
Created a distribute volume with 3 bricks and started it.
Mounted the volume and created some files.
Removed a brick:

gluster volume remove-brick dist1 10.70.37.111:/rhs/brick1/e1 start
volume remove-brick start: success
ID: 1e6763f0-4f68-41b1-8bda-786befc80a8a

[root@boo ~]# gluster volume remove-brick dist1 10.70.37.111:/rhs/brick1/e1 status
Node          Rebalanced-files  size     scanned  failures  skipped  status       run time in secs
------------  ----------------  -------  -------  --------  -------  -----------  ----------------
10.70.37.111  16                160.0MB  17       0         0        in progress  4.00

Could you please clarify this?
When bricks are removed, the data on them is rebalanced onto the remaining bricks, so there is no single destination. This is unlike replace-brick, where there are an explicit source and destination. In both processes the destinations are passive while the source is active: the destinations need do nothing beyond having a running brick, and all the work is done by the source, which is therefore the one collecting the statistics for the procedure. From the destinations' point of view, they are simply serving requests from another client. Since only the source holds information specific to the process (rebalance/remove-brick/replace-brick), the status command only reports information from the source.
As per comment 3 in https://bugzilla.redhat.com/show_bug.cgi?id=1030932, remove-brick changes the layout of existing directories and migrates data from the non-decommissioned bricks as well, so in this case shouldn't we be showing all the nodes present in the status?
A rebalance process should only be concerned with migrating data from those bricks of the volume that are present on the peer on which the rebalance process is running. In the case of remove-brick, rebalance processes are launched only on the peers that contain the bricks being removed, so they should only be migrating data off those bricks. However, if those peers also contain other bricks belonging to the volume, it appears that the rebalance processes will rebalance the data on those bricks as well (this is incorrect IMO, and is what bug 1030932 implies). Even then, the processes are launched only on the peers containing the bricks being removed, so the output of the status command is still correct.
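The rule above can be sketched as a small model (the function name and brick strings are illustrative, not gluster source code): given the volume's bricks and the bricks named in the remove-brick command, the peers that launch a rebalance process, and hence appear in the status output, are exactly the hosts of the removed bricks.

```python
# Minimal model of the behaviour described above: for a remove-brick,
# rebalance processes run only on the peers hosting the removed bricks,
# so only those peers appear in the status output.

def rebalance_peers(volume_bricks, removed_bricks):
    """Peers that would run a rebalance process for this remove-brick.

    Bricks are 'host:/path' strings, as in the gluster CLI.
    """
    removed = set(removed_bricks)
    return sorted({brick.split(":", 1)[0]
                   for brick in volume_bricks if brick in removed})

bricks = ["10.70.34.105:/rhs/brick1/br1",
          "10.70.34.86:/rhs/brick1/br2",
          "10.70.34.85:/rhs/brick1/br3"]
print(rebalance_peers(bricks, ["10.70.34.85:/rhs/brick1/br3"]))
# ['10.70.34.85'] -- only the peer hosting the removed brick
```

This matches the verified output in the later comment, where only 10.70.37.111 (the host of the removed brick) is listed in the status table.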
Version: glusterfs 3.4.0.55rhs
==============================
Previously, when one of the nodes was rebooted while a remove-brick operation was in progress, checking the remove-brick status showed localhost twice. Now only the node from which the brick was removed is shown.

Marking the bug 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html