Bug 1023921
Summary: | Rebalance status does not give the correct output and rebalance starts automatically when glusterd is brought down and brought back up after a while. | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> | |
Component: | glusterfs | Assignee: | Kaushal <kaushal> | |
Status: | CLOSED ERRATA | QA Contact: | Shruti Sampat <ssampat> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 2.1 | CC: | dpati, dtsang, kaushal, mmahoney, pprakash, psriniva, sdharane, ssampat, vbellur, vraman | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 2.1.2 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.4.0.49rhs | Doc Type: | Bug Fix | |
Doc Text: | Previously, the Rebalance process would start automatically when the glusterd service was restarted; as a result, the Rebalance status command displayed incorrect output. With this fix, the Rebalance process is started only when required, and the Rebalance status command works as expected. |||
Story Points: | --- | |
Clone Of: | ||||
: | 1036464 (view as bug list) | Environment: | ||
Last Closed: | 2014-02-25 07:56:27 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1036464 | |||
Bug Blocks: | 1015045, 1021497 |
Description
RamaKasturi
2013-10-28 10:45:41 UTC
Attaching the sos reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1023921/

The above issue is not seen in glusterfs update1.

1) The following is the output when glusterd was brought down and rebalance was run:

```
[root@localhost ~]# gluster vol rebalance vol_dis status
Node          Rebalanced-files   size     scanned   failures   skipped   status   run time in secs
---------     ----------------   ------   -------   --------   -------   ------   ----------------
localhost     0                  0Bytes   0         1          0         failed   0.00
10.70.34.85   0                  0Bytes   0         1          0         failed   0.00
10.70.34.86   0                  0Bytes   0         1          0         failed   0.00
volume rebalance: vol_dis: success:
```

2) The following is the output seen when glusterd is brought back up and the status is checked using the command "gluster vol rebalance vol_dis status":

```
[root@localhost ~]# gluster vol rebalance vol_dis status
Node          Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------     ----------------   ------   -------   --------   -------   -----------   ----------------
localhost     0                  0Bytes   0         1          0         not started   0.00
10.70.37.43   0                  0Bytes   0         1          0         failed        0.00
10.70.37.75   0                  0Bytes   0         1          0         failed        0.00
volume rebalance: vol_dis: success:
```

The following is also seen when doing the steps below (see the command sketch further below):

1. Create a distribute volume with 2 bricks.
2. Now add a brick to the volume.
3. Stop glusterd on one of the nodes and start rebalance.
4. The following is the output seen:

```
[root@localhost ~]# gluster vol rebalance vol_dis status
Node          Rebalanced-files   size     scanned   failures   skipped   status   run time in secs
---------     ----------------   ------   -------   --------   -------   ------   ----------------
localhost     0                  0Bytes   0         1          0         failed   0.00
10.70.37.43   0                  0Bytes   0         1          0         failed   0.00
10.70.37.75   0                  0Bytes   0         1          0         failed   0.00
volume rebalance: vol_dis: success:
```

5. Now start glusterd on the node and check the status. The following output comes:

```
[root@localhost ~]# gluster vol rebalance vol_dis status
Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------      ----------------   ------   -------   --------   -------   -----------   ----------------
localhost      0                  0Bytes   42        0          15        in progress   17.00
10.70.37.43    0                  0Bytes   60        0          2         completed     0.00
10.70.37.75    0                  0Bytes   60        0          0         completed     0.00
10.70.37.108   0                  0Bytes   1         0          0         in progress   17.00
volume rebalance: vol_dis: success:
```

Actual results:
After doing step 5, the rebalance process starts automatically, which it should not.

Needed by RHSC.

Taking the bug under my name as I'm actively working on this right now. I should have done this earlier, but since I was the only one working on the RHSC dependencies at that time, I left it at that. My mistake.

Remove-brick also starts automatically when the following steps are performed:

1. Create a distribute volume of 3 bricks, one on each server in the cluster, and start the volume.
2. Kill glusterd on one of the nodes.
3. Start the remove-brick operation on the volume. Starting remove-brick succeeds, but the remove-brick status shows that it has failed.
4. Start glusterd on the node where it was killed. Check the status of the previous remove-brick operation. It says completed.

Remove-brick should not start automatically when glusterd is brought back up.

Under review at https://code.engineering.redhat.com/gerrit/16799

In glusterfs build 50, when glusterd is brought down and brought back up after a while, rebalance does not start automatically. But when glusterd goes down, there is an inconsistency seen in the status output on the servers. Do I need to log a separate BZ for this, or would it be fixed as part of this? Could you please confirm.
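For reference, a minimal sketch of the reproduction flow described in the report above. The hostnames (server1-server3), the brick path /bricks/brick1, and the use of `service` to manage glusterd are illustrative assumptions; only the volume name vol_dis comes from the report.

```sh
# Assumed: three servers in a trusted pool (server1-server3), brick path /bricks/brick1.
# Run from server1 unless noted otherwise.

# 1. Create and start a 2-brick distribute volume.
gluster volume create vol_dis server1:/bricks/brick1 server2:/bricks/brick1
gluster volume start vol_dis

# 2. Add a third brick to the volume.
gluster volume add-brick vol_dis server3:/bricks/brick1

# 3. Stop glusterd on one node (run on server3), then start rebalance from server1.
service glusterd stop
gluster volume rebalance vol_dis start

# 4. Check the status; it is reported as failed on the reachable nodes.
gluster volume rebalance vol_dis status

# 5. Start glusterd again on server3 and re-check from server1; before the fix,
#    rebalance restarted on its own at this point.
service glusterd start
gluster volume rebalance vol_dis status
```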
Can you please verify the doc text for technical accuracy?

(In reply to RamaKasturi from comment #9)
> In glusterfs build 50, when glusterd is brought down and brought back up after a while, rebalance does not start automatically.
>
> But when glusterd goes down, there is an inconsistency seen in the status output on the servers.
>
> Do I need to log a separate BZ for this, or would it be fixed as part of this? Could you please confirm.

Can you raise another bug with more details?

(In reply to Pavithra from comment #10)
> Can you please verify the doc text for technical accuracy?

The doc text looks okay.

Performed the following steps:

1. Create a distribute volume of 4 bricks, one on each server, start it, and create data on the mount point.
2. Bring glusterd down on one of the nodes.
3. Start rebalance on the volume.
4. Check the status; it says failed on all the other three nodes.
5. Bring back glusterd on the node where it was stopped.
6. Check the rebalance status now; it is the same as that found in step 4.

On performing the following steps (see the command sketch at the end of this report):

1. Create a 2x2 distributed-replicate volume, start it, and create data on the mount point.
2. Bring glusterd down on two nodes that contain bricks that are part of a replica set.
3. Start rebalance on the volume.
4. Check the status; it says failed on the other two nodes.
5. Bring back glusterd on the two nodes, one after the other.

When glusterd is started on the nodes and the rebalance status is checked on them, rebalance is seen to be 'in progress' on these nodes, which means rebalance was started on these nodes when glusterd was brought back up. Moving to ASSIGNED.

Shruti, I tried this out on v3.4.0.53rhs (the latest build AFAIK). In both cases, rebalance didn't start up again, and the status was shown as failed. What I did:

1. Create a 4-node cluster.
2. Create a 2x2 dist volume.
3. Start the volume.
4. Kill glusterd on 2 nodes forming a replica.
5. Start rebalance.
6. Check the rebalance status; it shows as failed on the two up nodes.
7. Bring up the down nodes.
8. Check the rebalance status again; it shows failed on the two up nodes.

From a preliminary look, I also don't see anything wrong in the logs from the sosreport. I need to do a more thorough investigation of the logs. It would be more helpful if I could get access to a live system, if you are able to reproduce it.

Kaushal, I was not able to reproduce it. But I have the setup where I saw it earlier. Let me know if that will help.

I've got the sos-reports, so those should be enough. Since this is not reproducible, I'll be moving the bug back to ON_QA. Please do the verification and move it to the appropriate state.

Made minor corrections to the doc text.

Verified with v3.4.0.53rhs that rebalance and remove-brick do not start automatically when glusterd is brought back up.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
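The 2x2 distributed-replicate verification scenario discussed in the comments above could be driven with commands along these lines. The server names, brick paths, volume name (vol_rep), and mount point are illustrative assumptions, not values from the report.

```sh
# Assumed: four servers (server1-server4); bricks on server1/server2 form one
# replica pair and bricks on server3/server4 form the other.
gluster volume create vol_rep replica 2 \
    server1:/bricks/brick1 server2:/bricks/brick1 \
    server3:/bricks/brick1 server4:/bricks/brick1
gluster volume start vol_rep

# Mount the volume from a client and create some data on it.
mkdir -p /mnt/vol_rep
mount -t glusterfs server1:/vol_rep /mnt/vol_rep
cp -r /etc /mnt/vol_rep/data

# Stop glusterd on both nodes of one replica pair (run on server3 and server4).
service glusterd stop

# From server1: start rebalance and check the status; it shows failed on the up nodes.
gluster volume rebalance vol_rep start
gluster volume rebalance vol_rep status

# Bring glusterd back on server3 and server4, then check the status on those
# nodes; with the fix it should stay failed rather than flip to "in progress".
service glusterd start
gluster volume rebalance vol_rep status
```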