Bug 1037505

Summary: BVT: Rebalance skipped files are counted as failures in the status
Product: Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: glusterfs
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: shylesh <shmohan>
Severity: high
Priority: high
Version: 2.1
CC: jturner, lmohanty, vagarwal, vbellur
Keywords: Regression, ZStream
Target Release: RHGS 2.1.2
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0.50rhs
Doc Type: Bug Fix
Last Closed: 2014-02-25 08:07:16 UTC
Type: Bug
Attachments: Rebalance logs (flags: none)

Description shylesh 2013-12-03 09:53:52 UTC
Description of problem:
When rebalance runs on a distributed-replicate volume and a file migration is not performed because of a space constraint, the file is counted under "failures" in the status output rather than under "skipped".

Version-Release number of selected component (if applicable):
Nightly BVT

How reproducible:
Always in BVT runs

Steps to Reproduce:
1. Use a 4x2 distributed-replicate volume on which the automated sanity tests are running.
2. The sanity run creates symlinks on the mount point.
3. Add a brick pair and invoke rebalance:
gluster v rebalance <vol> start
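
For anyone scripting step 3, here is a minimal sketch that only builds the gluster CLI argument lists without running them. The brick paths and host names in the example are hypothetical; the volume name "rebalvol" is taken from the status output in this report. The helper name is made up for illustration.

```python
import shlex


def rebalance_repro_commands(volume, new_bricks):
    """Return the gluster CLI invocations for step 3 as argument lists.

    Nothing is executed here; pass each list to subprocess.run() on a
    node of the trusted pool if you want to actually run it.
    """
    if len(new_bricks) != 2:
        raise ValueError("expected exactly one replica pair (two bricks)")
    return [
        ["gluster", "volume", "add-brick", volume, *new_bricks],
        ["gluster", "volume", "rebalance", volume, "start"],
        ["gluster", "volume", "rebalance", volume, "status"],
    ]


if __name__ == "__main__":
    # Hypothetical brick locations, for illustration only.
    bricks = ["server1:/bricks/b9", "server2:/bricks/b10"]
    for cmd in rebalance_repro_commands("rebalvol", bricks):
        print(shlex.join(cmd))
```

The status command is included because the failures/skipped counters discussed in this bug are only visible there.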
Actual results:

:: [   PASS   ] :: Running 'rhts-sync-block -s rebal_run.70 rhsauto019.lab.eng.blr.redhat.com rhsauto008.lab.eng.blr.redhat.com rhsauto021.lab.eng.blr.redhat.com rhsauto022.lab.eng.blr.redhat.com' (Expected 0, got 0)
:: [ 22:41:42 ] ::  rebal_get_status - Check status of rebalance and looks for errors.
:: [ 22:41:42 ] ::  Machine in recipe is MASTERNODE rhsauto019.lab.eng.blr.redhat.com
:: [ 22:41:42 ] ::  Logging initial status of rebalance:
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               32      717Bytes           233            33             0          in progress               1.00
       rhsauto008.lab.eng.blr.redhat.com               30      674Bytes           240            27             0          in progress               1.00
       rhsauto022.lab.eng.blr.redhat.com                0        0Bytes           325             2             0            completed               0.00
volume rebalance: rebalvol: success: 



from rebalance logs
-------------------

[2013-12-03 03:16:22.993932] I [dht-rebalance.c:672:dht_migrate_file] 0-hosdu-dht: /90: attempting to move from hosdu-replicate-0 to hosdu-replicate-2
[2013-12-03 03:16:23.008892] W [dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data movement attempted from node (hosdu-replicate-0) with higher disk space to a node (hosdu-replicate-2) with lesser disk space (/90)
[2013-12-03 03:16:23.025123] I [dht-rebalance.c:672:dht_migrate_file] 0-hosdu-dht: /11: attempting to move from hosdu-replicate-1 to hosdu-replicate-3
[2013-12-03 03:16:23.038697] W [dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data movement attempted from node (hosdu-replicate-1) with higher disk space to a node (hosdu-replicate-3) with lesser disk space (/11)
[2013-12-03 03:16:23.047018] I [dht-rebalance.c:672:dht_migrate_file] 0-hosdu-dht: /20: attempting to move from hosdu-replicate-1 to hosdu-replicate-3
[2013-12-03 03:16:23.060844] W [dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data movement attempted from node (hosdu-replicate-1) with higher disk space to a node (hosdu-replicate-3) with lesser disk space (/20)
[2013-12-03 03:16:23.071069] I [dht-rebalance.c:672:dht_migrate_file] 0-hosdu-dht: /35: attempting to move from hosdu-replicate-1 to hosdu-replicate-3
[2013-12-03 03:16:23.086840] W [dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data movement attempted from node (hosdu-replicate-1) with higher disk space to a node (hosdu-replicate-3) with lesser disk space (/35)
[2013-12-03 03:16:23.094162] I [dht-rebalance.c:672:dht_migrate_file] 0-hosdu-dht: /46: attempting to move from hosdu-replicate-1 to hosdu-replicate-3
[2013-12-03 03:16:23.113757] W [dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data movement attempted from node (hosdu-replicate-1) with higher disk space to a node (hosdu-replicate-3) with lesser disk space (/46)
[2013-12-03 03:16:23.125096] I [dht-rebalance.c:672:dht_migrate_file] 0-hosdu-dht: /49: attempting to move from hosdu-replicate-1 to hosdu-replicate-3
[2013-12-03 03:16:23.141371] W [dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data movement attempted from node (hosdu-replicate-1) with higher disk space to a node (hosdu-replicate-3) with lesse


The files named in the warning messages above should be counted as "skipped", but they are counted as "failures" instead.
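
To list which files hit the free-space check, a small log filter like the following can help. This is a sketch only: the regular expression is derived from the log excerpt in this report, not from the glusterfs source, so the message format may differ in other builds. Truncated lines (such as the last one in the excerpt) simply do not match and are ignored.

```python
import re

# Pattern derived from the __dht_check_free_space warnings quoted in this
# report; it is an assumption that other builds use the same wording.
SPACE_WARNING = re.compile(
    r"^\[(?P<ts>[^\]]+)\] W \[dht-rebalance\.c:\d+:__dht_check_free_space\] "
    r"\S+: data movement attempted from node \((?P<src>[^)]+)\) with higher "
    r"disk space to a node \((?P<dst>[^)]+)\) with lesser disk space "
    r"\((?P<path>[^)]+)\)"
)


def space_constrained_files(log_lines):
    """Return (path, source, destination) for each free-space warning."""
    hits = []
    for line in log_lines:
        match = SPACE_WARNING.match(line)
        if match:
            hits.append(
                (match.group("path"), match.group("src"), match.group("dst"))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "[2013-12-03 03:16:22.993932] I [dht-rebalance.c:672:dht_migrate_file] "
        "0-hosdu-dht: /90: attempting to move from hosdu-replicate-0 to "
        "hosdu-replicate-2",
        "[2013-12-03 03:16:23.008892] W "
        "[dht-rebalance.c:374:__dht_check_free_space] 0-hosdu-dht: data "
        "movement attempted from node (hosdu-replicate-0) with higher disk "
        "space to a node (hosdu-replicate-2) with lesser disk space (/90)",
    ]
    for path, src, dst in space_constrained_files(sample):
        print(path, src, dst)
```

The number of entries returned is the number of files one would expect to see in the "skipped" column for that node.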

Comment 1 Lalatendu Mohanty 2013-12-03 10:02:15 UTC
This issue appeared when the git branch used for BVT changed from origin/rhs-2.1-u1 to origin/rhs-2.1 in the downstream code.

Comment 2 Lalatendu Mohanty 2013-12-03 13:40:16 UTC
Created attachment 832061
Rebalance logs

Rebalance log during the test

Comment 6 Pranith Kumar K 2013-12-16 12:45:22 UTC
I see the following in the logs when I re-create the issue:

[2013-12-15 17:37:50.710296] D [dht-rebalance.c:1290:gf_defrag_migrate_data] 0-r2-dht: migrate-data skipped for /96 due to space constraints
[2013-12-15 17:37:50.729342] D [dht-rebalance.c:1290:gf_defrag_migrate_data] 0-r2-dht: migrate-data skipped for /97 due to space constraints
[2013-12-15 17:37:50.749216] D [dht-rebalance.c:1290:gf_defrag_migrate_data] 0-r2-dht: migrate-data skipped for /98 due to space constraints
[2013-12-15 17:37:50.768598] D [dht-rebalance.c:1290:gf_defrag_migrate_data] 0-r2-dht: migrate-data skipped for /99 due to space constraints

root@pranithk-vm1 - ~ 
17:46:19 :) ⚡ grep "space constraints" /usr/local/var/log/glusterfs/r2-rebalance.log | wc -l
51

With the fix:
root@pranithk-vm1 - /mnt/r2 
17:54:04 :) ⚡ gluster volume rebalance r2 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               36       67Bytes           236             0            51            completed               2.00
                            10.70.42.237                0        0Bytes           200             0             0            completed               2.00
                            10.70.43.148                0        0Bytes           200             0             0            completed               2.00
volume rebalance: r2: success:
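
For automated checks such as BVT, the status table can be parsed into per-node counters. The following is a sketch under the assumption that the column layout matches the output shown in this report and that the size column contains no internal whitespace (as in "67Bytes"); the names NodeStatus and parse_rebalance_status are illustrative, not part of any gluster tooling.

```python
from typing import NamedTuple


class NodeStatus(NamedTuple):
    node: str
    rebalanced_files: int
    size: str
    scanned: int
    failures: int
    skipped: int
    status: str
    run_time_secs: float


def parse_rebalance_status(text):
    """Parse 'gluster volume rebalance <vol> status' output into rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has 8 fields, or 9 when the status is "in progress".
        if len(fields) < 8:
            continue
        try:
            rows.append(
                NodeStatus(
                    node=fields[0],
                    rebalanced_files=int(fields[1]),
                    size=fields[2],
                    scanned=int(fields[3]),
                    failures=int(fields[4]),
                    skipped=int(fields[5]),
                    status=" ".join(fields[6:-1]),
                    run_time_secs=float(fields[-1]),
                )
            )
        except ValueError:
            # Header and separator lines fail numeric conversion.
            continue
    return rows


if __name__ == "__main__":
    sample = """\
        Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
   ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
   localhost               36       67Bytes           236             0            51            completed               2.00
10.70.42.237                0        0Bytes           200             0             0            completed               2.00
volume rebalance: r2: success:
"""
    for row in parse_rebalance_status(sample):
        print(row.node, row.failures, row.skipped, row.status)
```

With the fixed build, the summed skipped count from the table above (51) agrees with the 51 "space constraints" lines counted by grep in the rebalance log, and failures is 0 on every node.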

Comment 7 shylesh 2013-12-24 11:45:17 UTC
Verified on 3.4.0.52rhs-1.el6rhs.x86_64. Files that are not migrated because of space constraints are now reported in the "skipped" count instead of the "failures" count.

Comment 9 errata-xmlrpc 2014-02-25 08:07:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html