Bug 1023921

Summary: Rebalance status does not give the correct output, and rebalance starts automatically when glusterd is brought down and brought back up after a while.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: glusterfs
Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA
QA Contact: Shruti Sampat <ssampat>
Severity: high
Docs Contact:
Priority: high
Version: 2.1
CC: dpati, dtsang, kaushal, mmahoney, pprakash, psriniva, sdharane, ssampat, vbellur, vraman
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0.49rhs
Doc Type: Bug Fix
Doc Text:
Previously, the Rebalance process would start automatically when the glusterd service was restarted; as a result, the Rebalance status command would display incorrect output. With this fix, the Rebalance process is started only when required, and the Rebalance status command works as expected.
Story Points: ---
Clone Of:
Clones: 1036464 (view as bug list)
Environment:
Last Closed: 2014-02-25 07:56:27 UTC
Type: Bug
Bug Depends On: 1036464    
Bug Blocks: 1015045, 1021497    

Description RamaKasturi 2013-10-28 10:45:41 UTC
Description of problem:
Rebalance status does not give the correct output when glusterd is brought down and brought back up after a while.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.35.1u2rhs-1.el6rhs.x86_64
samba-glusterfs-3.6.9-160.3.el6rhs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Create a distribute volume with 2 bricks.
2. Stop glusterd on one of the nodes.
3. Start rebalance on the volume created.
4. Now check the rebalance status using the command "gluster vol rebalance <vol_Name> status". The following is seen in the output:
[root@localhost ~]# gluster vol rebalance vol_dis status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed               0.00
                             10.70.37.43                0        0Bytes             0             1             0               failed               0.00
                             10.70.37.75                0        0Bytes             0             1             0               failed               0.00
volume rebalance: vol_dis: success: 

5. Now bring glusterd back up on the node where it was stopped.
6. Now check the rebalance status again. The following is seen in the output (a command sketch of these steps follows the output below):

[root@localhost ~]# gluster vol rebalance vol_dis status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes            10             0             0            completed               0.00
                             10.70.37.43                0        0Bytes            10             0             0            completed               0.00
                             10.70.37.75                0        0Bytes            10             0             3            completed               0.00
                            10.70.37.108                0        0Bytes            10             0             2            completed               0.00
volume rebalance: vol_dis: success: 
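For reference, a minimal command sketch of the steps above, assuming hypothetical node names (node1, node2) and brick paths; on these builds glusterd is managed with the service command:

# Steps 1-6 above as commands; node names and brick paths are
# assumptions for illustration only.
gluster volume create vol_dis node1:/rhs/brick1/b1 node2:/rhs/brick1/b2
gluster volume start vol_dis

service glusterd stop                     # step 2, on node2

gluster volume rebalance vol_dis start    # step 3, on node1
gluster volume rebalance vol_dis status   # step 4, on node1

service glusterd start                    # step 5, back on node2
gluster volume rebalance vol_dis status   # step 6, on node1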


Actual results:
The rebalance status shown corresponds to the rebalance run performed before glusterd was brought down.

Expected results:
It should always show the output of the most recent rebalance run.

Additional info:

Comment 2 RamaKasturi 2013-10-28 11:04:04 UTC
Attaching the sos reports

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1023921/

Comment 3 RamaKasturi 2013-10-29 12:40:37 UTC
The above issue is not seen in glusterfs update1.

1) The following is the output when glusterd was brought down and rebalance was run:

[root@localhost ~]# gluster vol rebalance vol_dis status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed               0.00
                             10.70.34.85                0        0Bytes             0             1             0               failed               0.00
                             10.70.34.86                0        0Bytes             0             1             0               failed               0.00
volume rebalance: vol_dis: success: 

2) The following is the output seen after glusterd was brought back up and the status was checked using the command "gluster vol rebalance vol_dis status":

[root@localhost ~]# gluster vol rebalance vol_dis status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               not started               0.00
                             10.70.37.43                0        0Bytes             0             1             0               failed               0.00
                             10.70.37.75                0        0Bytes             0             1             0               failed               0.00

volume rebalance: vol_dis: success:

Comment 4 RamaKasturi 2013-10-29 13:32:58 UTC
The following is also seen when performing the steps below.

1. Create a distribute volume with 2 bricks.
2. Now add a brick to the volume.
3. Stop glusterd on one of the nodes and start rebalance.
4. The following is the output seen:

[root@localhost ~]# gluster vol rebalance vol_dis status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed               0.00
                             10.70.37.43                0        0Bytes             0             1             0               failed               0.00
                             10.70.37.75                0        0Bytes             0             1             0               failed               0.00
volume rebalance: vol_dis: success: 

5. Now start glusterd on the node and check the status. The following output is seen (a command sketch of these steps follows below):

[root@localhost ~]# gluster vol rebalance vol_dis status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes            42             0            15          in progress              17.00
                             10.70.37.43                0        0Bytes            60             0             2            completed               0.00
                             10.70.37.75                0        0Bytes            60             0             0            completed               0.00
                            10.70.37.108                0        0Bytes             1             0             0          in progress              17.00
volume rebalance: vol_dis: success: 
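A command sketch of this add-brick variant, again with hypothetical node names and brick paths:

# Add-brick variant; node names and brick paths are assumptions.
gluster volume create vol_dis node1:/rhs/brick1/b1 node2:/rhs/brick1/b2
gluster volume start vol_dis
gluster volume add-brick vol_dis node3:/rhs/brick1/b3

service glusterd stop                     # step 3, on one node (e.g. node2)
gluster volume rebalance vol_dis start    # on node1
gluster volume rebalance vol_dis status

service glusterd start                    # step 5, back on node2
gluster volume rebalance vol_dis status   # on node1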

Actual results:

After step 5, the rebalance process starts automatically, which it should not.

Comment 5 Dusmant 2013-10-30 10:05:20 UTC
Needed by RHSC

Comment 6 Kaushal 2013-11-28 04:39:01 UTC
Taking the bug under my name as I'm actively working on this right now. I should have done this earlier, but since I was the only one working on the RHSC dependencies at that time, I left it at that. My mistake.

Comment 7 Shruti Sampat 2013-12-04 09:14:47 UTC
Remove-brick also starts automatically when the following steps are performed - 

1. Create a distribute volume of 3 bricks, one on each server in the cluster, and start the volume.
2. Kill glusterd on one of the nodes.
3. Start remove-brick operation on the volume. Starting remove-brick succeeds, but the remove-brick status shows that it has failed.
4. Start glusterd on the node where it was killed. Check the status of the previous remove-brick operation. It says completed.

Remove-brick should not start automatically when glusterd is brought back up (see the command sketch below).
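
A hedged sketch of this remove-brick scenario; the volume name, node names and brick paths are assumptions:

# Remove-brick variant; names and paths are illustrative assumptions.
gluster volume create vol_dis node1:/rhs/brick1/b1 node2:/rhs/brick1/b2 node3:/rhs/brick1/b3
gluster volume start vol_dis

service glusterd stop                                            # step 2, on node3
gluster volume remove-brick vol_dis node2:/rhs/brick1/b2 start   # step 3, on node1
gluster volume remove-brick vol_dis node2:/rhs/brick1/b2 status

service glusterd start                                           # step 4, back on node3
gluster volume remove-brick vol_dis node2:/rhs/brick1/b2 status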

Comment 8 Kaushal 2013-12-09 04:59:32 UTC
Under review at https://code.engineering.redhat.com/gerrit/16799

Comment 9 RamaKasturi 2013-12-19 06:40:00 UTC
In glusterfs build 50, when glusterd is brought down and brought back up after a while, rebalance does not start automatically.

However, when glusterd goes down, there is an inconsistency in the status output across the servers.

Do I need to log a separate BZ for this, or will it be fixed as part of this one? Could you please confirm.

Comment 10 Pavithra 2014-01-03 06:24:24 UTC
Can you please verify the doc text for technical accuracy?

Comment 11 Kaushal 2014-01-03 07:15:34 UTC
(In reply to RamaKasturi from comment #9)
> In glusterfs build 50, when glusterd is brought down and brought back up
> after a while, rebalance does not start automatically.
> 
> However, when glusterd goes down, there is an inconsistency in the status
> output across the servers.
> 
> Do I need to log a separate BZ for this, or will it be fixed as part of
> this one? Could you please confirm.

Can you raise another bug with more details?

(In reply to Pavithra from comment #10)
> Can you please verify the doc text for technical accuracy?
The doc text looks okay.

Comment 12 Shruti Sampat 2014-01-03 17:32:38 UTC
Performed the following steps - 

1. Create a distribute volume of 4 bricks, one on each server, start it, and create data on the mount point.

2. Bring glusterd down on one of the nodes.

3. Start rebalance on the volume.

4. Check the status; it shows failed on the other three nodes.

5. Bring back glusterd on the node where it was stopped.

6. Check the rebalance status now; it is the same as that found in step 4.

On performing the following steps - 

1. Create a 2x2 distributed-replicate volume, start it and create data on the mount point.

2. Bring glusterd down on two nodes that contain bricks that are part of a replica set.

3. Start rebalance on the volume.

4. Check the status; it shows failed on the other two nodes.

5. Bring back glusterd on the two nodes, one after the other. When glusterd is started on these nodes and the rebalance status is checked on them, rebalance is seen to be 'in progress', which means rebalance was started on these nodes when glusterd was brought back up (see the command sketch below).
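
A hedged command sketch of this 2x2 distributed-replicate scenario; node names and brick paths are assumptions:

# 2x2 distributed-replicate variant; names and paths are assumptions.
gluster volume create vol_dr replica 2 \
    node1:/rhs/brick1/b1 node2:/rhs/brick1/b2 \
    node3:/rhs/brick1/b3 node4:/rhs/brick1/b4
gluster volume start vol_dr

service glusterd stop                     # step 2, on node1 and node2 (a replica pair)
gluster volume rebalance vol_dr start     # step 3, on node3
gluster volume rebalance vol_dr status    # step 4, shows failed on node3 and node4

service glusterd start                    # step 5, on node1, then on node2
gluster volume rebalance vol_dr status    # run on node1/node2 after they come back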

Moving to ASSIGNED.

Comment 14 Kaushal 2014-01-06 05:56:16 UTC
Shruti,
I tried this out on v3.4.0.53rhs (the latest build AFAIK). In both cases, rebalance didn't start up again, and the status was shown as failed.

What I did:
1. Create a 4-node cluster.
2. Create a 2x2 distributed-replicate volume.
3. Start the volume.
4. Kill glusterd on the 2 nodes forming a replica pair.
5. Start rebalance.
6. Check the rebalance status; it shows failed on the two nodes that are up.
7. Bring up the down nodes.
8. Check the rebalance status again; it still shows failed on the two nodes that were up (see the additional check sketched below).
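
One additional check that may help while verifying this; a hedged sketch, since the exact process pattern can vary by build (the rebalance daemon is assumed to run as a glusterfs process with "rebalance" in its command line):

# After glusterd is brought back up on the previously down nodes,
# re-check the reported status and confirm no rebalance daemon was
# spawned locally. The grep pattern is an assumption; adjust for the
# build under test.
gluster volume rebalance vol_dr status
ps aux | grep -i '[r]ebalance'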

From a preliminary look, I also don't see anything wrong in the logs from the sosreport. I need to do a more thorough investigation of the logs. It would be more helpful if I could get access to a live system, if you are able to reproduce it.

Comment 15 Shruti Sampat 2014-01-06 09:37:04 UTC
Kaushal,

I was not able to reproduce it. But I have the setup where I saw it earlier. Let me know if that will help.

Comment 16 Kaushal 2014-01-06 10:37:59 UTC
I've got the sos-reports, so those should be enough.
Since this is not reproducible, I'll be moving the bug back to ON_QA. Please do the verification and move it to the appropriate state.

Comment 17 Pavithra 2014-01-07 11:10:23 UTC
Made minor corrections to the doc text.

Comment 18 Shruti Sampat 2014-01-07 12:51:30 UTC
Verified with v3.4.0.53rhs that rebalance and remove-brick do not start automatically when glusterd is brought back up.

Comment 20 errata-xmlrpc 2014-02-25 07:56:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html