Created attachment 807426 [details]
Attaching engine log

Description of problem:
The status dialog hangs when glusterd goes down on one of the nodes while a rebalance is in progress.

Version-Release number of selected component (if applicable):
rhsc-2.1.1-0.0.2.master.el6ev.noarch

How reproducible:
Always.

Steps to Reproduce:
1. Create a distributed volume and start it.
2. Start rebalance on the volume.
3. Go to one of the servers and stop glusterd (steps 1-3 are sketched from the CLI side below).
4. Click on the Status button.

Actual results:
The status dialog opens and says "fetching data". The console hangs until it is reloaded.

Expected results:
The status dialog should open and display the same output as the CLI.

Additional info:
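A minimal sketch of the gluster-side steps (1-3); the volume name, hostnames and brick paths are placeholders, and step 4 is performed from the RHSC console:

    import subprocess

    def gluster(*args):
        # Run a gluster CLI command and raise if it fails.
        subprocess.run(["gluster"] + list(args), check=True)

    # 1. Create a distributed volume and start it (placeholder hosts/bricks).
    gluster("volume", "create", "vol_dis",
            "server1:/bricks/b1", "server2:/bricks/b2")
    gluster("volume", "start", "vol_dis")

    # 2. Start rebalance on the volume.
    gluster("volume", "rebalance", "vol_dis", "start")

    # 3. On one of the other servers, stop glusterd (outside this script), e.g.:
    #    service glusterd stop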
Created attachment 807427 [details]
Attaching vdsm log

Created attachment 807428 [details]
Attaching vdsm node2 log

Created attachment 807429 [details]
Attaching vdsm node3 log

Created attachment 807430 [details]
Attaching vdsm node4 log
1) Even after bringing glusterd back up on the node where it was stopped, the rebalance is still shown as running in the console, even though it has completed according to the gluster CLI.
2) When the user stops rebalance on the volume and clicks on status, the status dialog still hangs.
Thread-3227::DEBUG::2013-10-03 21:33:46,583::BindingXMLRPC::981::vds::(wrapper) return volumeRebalanceStatus with {'status': {'message': 'Done', 'code': 0}, 'hosts': [{'totalSizeMoved': 6291456000, 'status': 'COMPLETED', 'filesSkipped': 0, 'name': 'localhost', 'filesMoved': 6, 'filesFailed': 0, 'filesScanned': 69}, {'totalSizeMoved': 3145728000, 'status': 'COMPLETED', 'filesSkipped': 0, 'name': '10.70.37.80', 'filesMoved': 3, 'filesFailed': 0, 'filesScanned': 63}, {'totalSizeMoved': 7340032000, 'status': 'COMPLETED', 'filesSkipped': 0, 'name': '10.70.37.135', 'filesMoved': 7, 'filesFailed': 0, 'filesScanned': 76}, {'totalSizeMoved': 0, 'status': 'COMPLETED', 'filesSkipped': 0, 'name': '10.70.37.103', 'filesMoved': 0, 'filesFailed': 0, 'filesScanned': 60}], 'summary': {'totalSizeMoved': 16777216000, 'status': 'COMPLETED', 'filesSkipped': 0, 'filesFailed': 0, 'filesMoved': 16, 'filesScanned': 268}}

The above statement from the vdsm log suggests that the reported file sizes are larger than an int can hold, so this looks like an integer overflow error. Can you please check the size of the data on the bricks?
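The arithmetic behind the overflow suspicion, using the 'summary' value from the log above (that the field is a 32-bit signed integer is an assumption here, not confirmed from the code):

    INT32_MAX = 2**31 - 1  # 2147483647

    total_size_moved = 16777216000  # bytes; 'totalSizeMoved' from the summary above
    print(total_size_moved > INT32_MAX)        # True -> too large for a signed 32-bit int
    print(total_size_moved / float(1024**3))   # 15.625 -> roughly 15.6 GiB actually moved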
The data on the mount point was around 60GB.
Resolved in CB3 build
1) The status dialog still hangs when glusterd goes down on one of the nodes.
2) Once glusterd is back up, the rebalance icon and the tasks pane get updated as completed, but clicking on status shows "No rebalance ever happened on this volume".
3) I am able to see the following error in the engine logs:

2013-10-17 20:20:52,180 ERROR [org.ovirt.engine.core.bll.gluster.GetGlusterVolumeRebalanceStatusQuery] (ajp-/127.0.0.1:8702-6) Query GetGlusterVolumeRebalanceStatusQuery failed. Exception message is VdcBLLException: Command execution failed.

Please find the sos reports at the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1015394/
The above exception, "Query GetGlusterVolumeRebalanceStatusQuery failed. Exception message is VdcBLLException: Command execution failed.", indicates that for some reason the query is not going through. Hence, the current build (CB6) handles such failures with a message stating "Could not fetch data". If this is not the expected behaviour, please let me know what the expected behaviour should be.
Seems like this is a glusterfs bug.

[root@localhost ~]# gluster volume rebalance vol_dis status
        Node  Rebalanced-files       size    scanned   failures    skipped      status  run time in secs
   ---------       -----------  ---------  ---------  ---------  ---------  ----------  ----------------
   localhost                 0     0Bytes        150          0          0   completed              1.00
10.70.37.155                 0     0Bytes        150          0          0   completed              1.00
10.70.37.155                 0     0Bytes        150          0          0   completed              1.00
 10.70.37.95                 1   1000.0MB        150          0          0   completed             26.00
volume rebalance: vol_dis: success:

[root@localhost ~]# gluster volume rebalance vol_dis status --xml
[root@localhost ~]# echo $?
2

Because, as can be seen above, there is no status XML returned from glusterfs: the --xml query prints nothing and exits with code 2.
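For illustration, a minimal sketch (not the actual vdsm or engine code) of guarding the rebalance-status XML parse against exactly this behaviour, where the --xml query prints nothing and exits with a non-zero code:

    import subprocess
    import xml.etree.ElementTree as ET

    def rebalance_status_xml(volume):
        # Ask gluster for the rebalance status in XML form.
        cmd = ["gluster", "volume", "rebalance", volume, "status", "--xml"]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # As seen above, the command can print nothing and exit with code 2
        # when glusterd is down on a peer, so fail fast instead of trying to
        # parse empty output.
        if proc.returncode != 0 or not proc.stdout.strip():
            raise RuntimeError("no rebalance status XML returned (exit code %d)"
                               % proc.returncode)
        return ET.fromstring(proc.stdout)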
Works fine with CB10 and the glusterfs build glusterfs-server-3.4.0.47.1u2rhs-1.el6rhs.x86_64. When glusterd goes down on any of the nodes in the cluster, the status dialog does not hang and the status is fetched correctly. Will reopen the bug if this is seen again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html