Bug 1286171

Summary: Rebalance : Status lists failures on stopping rebalance while it is in progress
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Susant Kumar Palai <spalai>
Component: distribute Assignee: Barak Sason Rofman <bsasonro>
Status: CLOSED ERRATA QA Contact: Kshithij Iyer <kiyer>
Severity: low Docs Contact:
Priority: low    
Version: rhgs-3.1 CC: bsasonro, kiyer, pprakash, puebele, rhs-bugs, rkothiya, saraut, senaik, sheggodu, storage-qa-internal
Target Milestone: --- Keywords: Triaged, ZStream
Target Release: RHGS 3.5.z Batch Update 3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: dht-rebalance-usability, dht-rca-unknown
Fixed In Version: glusterfs-6.0-49 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1034173
: 1800956 Environment:
Last Closed: 2020-12-17 04:50:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1034173    
Bug Blocks: 1800956    

Comment 2 Nithya Balachandran 2016-01-19 14:51:46 UTC
*** Bug 1286172 has been marked as a duplicate of this bug. ***

Comment 3 Nithya Balachandran 2017-08-11 09:15:00 UTC
Still exists in the latest code:

[2017-08-11 09:08:58.109896] I [MSGID: 109029] [dht-rebalance.c:5186:gf_defrag_stop] 0-: Received stop command on rebalance
[2017-08-11 09:08:58.110144] I [MSGID: 109028] [dht-rebalance.c:5000:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 45.00 secs
[2017-08-11 09:08:58.110193] I [MSGID: 109028] [dht-rebalance.c:5004:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 1181, failures: 0, skipped: 106
[2017-08-11 09:08:58.118572] I [dht-rebalance.c:1513:dht_migrate_file] 0-vol1-dht: /dir-1/dir-2/dir-3/dir-4/dir-5/dir-6/dir-7/dir-8/dir-9/dir-10/dir-11/dir-12/dir-13/dir-14/dir-15/dir-16/dir-17/dir-18/dir-19/dir-20/dir-21/dir-22/file-17: attempting to move from vol1-client-2 to vol1-client-0
[2017-08-11 09:08:58.120142] I [dht-rebalance.c:3123:gf_defrag_process_dir] 0-vol1-dht: migrate data called on /dir-1/dir-2/dir-3/dir-4/dir-5/dir-6/dir-7/dir-8/dir-9/dir-10/dir-11/dir-12/dir-13/dir-14/dir-15/dir-16/dir-17/dir-18/dir-19/dir-20/dir-21/dir-22/dir-23/dir-24/dir-25/dir-26/dir-27
[2017-08-11 09:08:58.128104] W [dht-rebalance.c:3297:gf_defrag_process_dir] 0-vol1-dht: Found error from gf_defrag_get_entry
[2017-08-11 09:12:44.777354] E [MSGID: 109111] [dht-rebalance.c:3600:gf_defrag_fix_layout] 0-vol1-dht: gf_defrag_process_dir failed for directory: /dir-1/dir-2/dir-3/dir-4/dir-5/dir-6/dir-7/dir-8/dir-9/dir-10/dir-11/dir-12/dir-13/dir-14/dir-15/dir-16/dir-17/dir-18/dir-19/dir-20/dir-21/dir-22/dir-23/dir-24/dir-25/dir-26/dir-27

...

[2017-08-11 09:12:44.809059] E [MSGID: 109016] [dht-rebalance.c:3811:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir-1/dir-2/dir-3/dir-4/dir-5/dir-6/dir-7/dir-8/dir-9/dir-10/dir-11/dir-12/dir-13/dir-14/dir-15/dir-16/dir-17/dir-18/dir-19/dir-20/dir-21/dir-22/dir-23/dir-24/dir-25/dir-26/dir-27
[2017-08-11 09:12:44.809158] E [MSGID: 109016] [dht-rebalance.c:3811:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir-1/dir-2/dir-3/dir-4/dir-5/dir-6/dir-7/dir-8/dir-9/dir-10/dir-11/dir-12/dir-13/dir-14/dir-15/dir-16/dir-17/dir-18/dir-19/dir-20/dir-21/dir-22/dir-23/dir-24/dir-25/dir-26
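
The pattern in the excerpt above (a stop command followed by warning and error lines) can be picked out of a rebalance log mechanically. The following is a minimal sketch, not part of glusterfs: the regular expression is inferred from the log lines quoted in this report, and the function name `problems_after_stop` is hypothetical.

```python
import re

# Line layout inferred from the excerpts in this report (an assumption,
# not an official format description):
# [timestamp] LEVEL [MSGID: n] [file:line:function] domain: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^\]]+)\]\s+(?P<domain>\S+):\s*(?P<msg>.*)$"
)


def problems_after_stop(lines):
    """Return (timestamp, level, msgid, message) for every W/E line
    logged after the 'Received stop command' message."""
    stop_seen = False
    found = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        if m["msg"].startswith("Received stop command"):
            stop_seen = True
            continue
        if stop_seen and m["level"] in ("W", "E"):
            found.append((m["ts"], m["level"], m["msgid"], m["msg"]))
    return found


# Three lines copied from the excerpt above.
SAMPLE = [
    "[2017-08-11 09:08:58.109896] I [MSGID: 109029] [dht-rebalance.c:5186:gf_defrag_stop] 0-: Received stop command on rebalance",
    "[2017-08-11 09:08:58.110193] I [MSGID: 109028] [dht-rebalance.c:5004:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 1181, failures: 0, skipped: 106",
    "[2017-08-11 09:08:58.128104] W [dht-rebalance.c:3297:gf_defrag_process_dir] 0-vol1-dht: Found error from gf_defrag_get_entry",
]

if __name__ == "__main__":
    for entry in problems_after_stop(SAMPLE):
        print(entry)
```

Running the same function over a full rebalance log would list every warning and error that was logged only after the stop request was received.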

Comment 14 Barak Sason Rofman 2020-01-28 09:24:59 UTC
Tested with the latest upstream code; could not reproduce the bug.

Test steps:
1) Created a 3x3 volume:

[root@Node1 ~]# gluster volume status
Status of volume: distrep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick Node1:/root/bricks/11                 49152     0          Y       2088 
Brick Node1:/root/bricks/12                 49153     0          Y       2096 
Brick Node1:/root/bricks/13                 49154     0          Y       2105 
Brick Node1:/root/bricks/21                 49155     0          Y       2114 
Brick Node1:/root/bricks/22                 49156     0          Y       2134 
Brick Node1:/root/bricks/23                 49157     0          Y       2127 
Brick Node1:/root/bricks/31                 49158     0          Y       2151 
Brick Node1:/root/bricks/32                 49159     0          Y       2162 
Brick Node1:/root/bricks/33                 49160     0          Y       2169 
Self-heal Daemon on localhost               N/A       N/A        Y       2217 
 
Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

2) Mounted the volume using FUSE
3) Using a script, created a large number of small files through the mount point.
4) Added three more bricks to the volume:

[root@Node1 ~]# gluster volume add-brick distrep Node1:/root/bricks/41 Node1:/root/bricks/42 Node1:/root/bricks/43 force
volume add-brick: success
[root@Node1 ~]# gluster volume status
Status of volume: distrep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick Node1:/root/bricks/11                 49161     0          Y       2754 
Brick Node1:/root/bricks/12                 49162     0          Y       2766 
Brick Node1:/root/bricks/13                 49163     0          Y       2777 
Brick Node1:/root/bricks/21                 49164     0          Y       2786 
Brick Node1:/root/bricks/22                 49165     0          Y       2793 
Brick Node1:/root/bricks/23                 49166     0          Y       2800 
Brick Node1:/root/bricks/31                 49167     0          Y       2811 
Brick Node1:/root/bricks/32                 49168     0          Y       2818 
Brick Node1:/root/bricks/33                 49169     0          Y       2831 
Brick Node1:/root/bricks/41                 49170     0          Y       3026 
Brick Node1:/root/bricks/42                 49171     0          Y       3046 
Brick Node1:/root/bricks/43                 49172     0          Y       3066 
Self-heal Daemon on localhost               N/A       N/A        Y       2854 
 
Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

5) Initiated rebalance:

[root@Node1 ~]# gluster volume rebalance distrep start
volume rebalance: distrep: success: Rebalance on distrep has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 4cd2946f-94c0-4c41-af80-da402c471243

6) Allowed rebalance to run for ~40 seconds:

[root@Node1 ~]# gluster volume rebalance distrep status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              749        49.6KB          7253             0             0          in progress        0:00:35
The estimated time for rebalance to complete will be unavailable for the first 10 minutes.
volume rebalance: distrep: success

7) Stopped rebalance:

[root@Node1 ~]# gluster volume rebalance distrep stop
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              791        50.0KB          7253             0             0            completed        0:00:42
volume rebalance: distrep: success: rebalance process may be in the middle of a file migration.
The process will be fully stopped once the migration of the file is complete.
Please check rebalance process for completion before doing any further brick related tasks on the volume.

8) Checked rebalance status:

[root@Node1 ~]# gluster volume rebalance distrep status
volume rebalance: distrep: failed: Rebalance not started for volume distrep.

End of the rebalance log:

[2020-01-28 08:52:29.943902] I [dht-rebalance.c:1596:dht_migrate_file] 0-distrep-dht: /4720: attempting to move from distrep-replicate-1 to distrep-replicate-0
[2020-01-28 08:52:29.951851] I [MSGID: 109022] [dht-rebalance.c:2231:dht_migrate_file] 0-distrep-dht: completed migration of /2354 from subvolume distrep-replicate-1 to distrep-replicate-0 
[2020-01-28 08:52:30.104331] I [MSGID: 109022] [dht-rebalance.c:2231:dht_migrate_file] 0-distrep-dht: completed migration of /1444 from subvolume distrep-replicate-1 to distrep-replicate-0 
[2020-01-28 08:52:30.187361] I [MSGID: 109022] [dht-rebalance.c:2231:dht_migrate_file] 0-distrep-dht: completed migration of /2201 from subvolume distrep-replicate-1 to distrep-replicate-0 
[2020-01-28 08:52:30.195028] I [MSGID: 109022] [dht-rebalance.c:2231:dht_migrate_file] 0-distrep-dht: completed migration of /4720 from subvolume distrep-replicate-1 to distrep-replicate-0 
[2020-01-28 08:52:30.197315] I [MSGID: 109028] [dht-rebalance.c:5062:gf_defrag_status_get] 0-distrep-dht: Rebalance is completed. Time taken is 42.00 secs 
[2020-01-28 08:52:30.197332] I [MSGID: 109028] [dht-rebalance.c:5064:gf_defrag_status_get] 0-distrep-dht: Files migrated: 791, size: 51212, lookups: 7253, failures: 0, skipped: 0 
[2020-01-28 08:52:30.197603] W [glusterfsd.c:1441:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x94e2) [0x7f486ad4b4e2] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0x95) [0x406b45] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x4b) [0x4069fb] ) 0-: received signum (15), shutting down 

Result: no failures are reported.
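
When repeating the steps above in a loop, the "failures" column can be read from the CLI output instead of by eye. The following is a minimal sketch that assumes the column layout shown in steps 6 and 7; the layout may differ between glusterfs versions, and the names `COLUMNS` and `parse_status_rows` are hypothetical.

```python
import re

# Column order taken from the table printed in steps 6 and 7 (assumed stable).
COLUMNS = ("node", "rebalanced_files", "size", "scanned",
           "failures", "skipped", "status", "run_time")


def parse_status_rows(output):
    """Extract the per-node rows from rebalance status/stop output."""
    rows = []
    for line in output.splitlines():
        # Columns are separated by runs of two or more spaces, which keeps
        # a status such as "in progress" together as one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != len(COLUMNS) or not fields[1].isdigit():
            continue  # header, separator, or message line
        row = dict(zip(COLUMNS, fields))
        for key in ("rebalanced_files", "scanned", "failures", "skipped"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Output of step 7, copied from this comment.
STOP_OUTPUT = """\
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              791        50.0KB          7253             0             0            completed        0:00:42
volume rebalance: distrep: success: rebalance process may be in the middle of a file migration.
"""

if __name__ == "__main__":
    for row in parse_status_rows(STOP_OUTPUT):
        print(row["node"], row["status"], "failures:", row["failures"])
```

A non-zero `failures` value after a stop would indicate that the run hit the behaviour described in this bug.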

Comment 16 Barak Sason Rofman 2020-01-28 14:44:25 UTC
Managed to reproduce with upstream code by creating a large number of nested directories:

[2020-01-28 14:31:42.411679] I [MSGID: 109029] [dht-rebalance.c:5241:gf_defrag_stop] 0-: Received stop command on rebalance 
[2020-01-28 14:31:42.411725] I [MSGID: 109028] [dht-rebalance.c:5062:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 176.00 secs 
[2020-01-28 14:31:42.411733] I [MSGID: 109028] [dht-rebalance.c:5064:gf_defrag_status_get] 0-glusterfs: Files migrated: 2833, size: 28330, lookups: 9218, failures: 0, skipped: 0 
[2020-01-28 14:31:42.448100] I [MSGID: 109022] [dht-rebalance.c:2231:dht_migrate_file] 0-distrep-dht: completed migration of /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27/28/29/30/31/32/313.txt from subvolume distrep-replicate-2 to distrep-replicate-0 
[2020-01-28 14:31:42.452070] W [dht-rebalance.c:3447:gf_defrag_process_dir] 0-distrep-dht: Found error from gf_defrag_get_entry
[2020-01-28 14:31:42.452764] E [MSGID: 109111] [dht-rebalance.c:3971:gf_defrag_fix_layout] 0-distrep-dht: gf_defrag_process_dir failed for directory: /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27/28/29/30/31 
[2020-01-28 14:31:42.453498] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27/28/29/30 
[2020-01-28 14:31:42.454547] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27/28/29 
[2020-01-28 14:31:42.455027] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27/28 
[2020-01-28 14:31:42.455449] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27 
[2020-01-28 14:31:42.456444] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26 
[2020-01-28 14:31:42.457232] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25 
[2020-01-28 14:31:42.457986] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24 
[2020-01-28 14:31:42.459146] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23 
[2020-01-28 14:31:42.460915] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22 
[2020-01-28 14:31:42.461968] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21 
[2020-01-28 14:31:42.463126] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20 
[2020-01-28 14:31:42.464036] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19 
[2020-01-28 14:31:42.464749] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18 
[2020-01-28 14:31:42.466331] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17 
[2020-01-28 14:31:42.467066] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16 
[2020-01-28 14:31:42.467972] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15 
[2020-01-28 14:31:42.468470] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14 
[2020-01-28 14:31:42.469363] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12/13 
[2020-01-28 14:31:42.469960] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11/12 
[2020-01-28 14:31:42.471430] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10/11 
[2020-01-28 14:31:42.472479] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9/10 
[2020-01-28 14:31:42.473932] I [MSGID: 109022] [dht-rebalance.c:2231:dht_migrate_file] 0-distrep-dht: completed migration of /0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23/24/25/26/27/28/29/30/31/32/240.txt from subvolume distrep-replicate-1 to distrep-replicate-3 
[2020-01-28 14:31:42.474395] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8/9 
[2020-01-28 14:31:42.475655] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7/8 
[2020-01-28 14:31:42.476661] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6/7 
[2020-01-28 14:31:42.477752] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5/6 
[2020-01-28 14:31:42.478367] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4/5 
[2020-01-28 14:31:42.479049] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3/4 
[2020-01-28 14:31:42.479645] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2/3 
[2020-01-28 14:31:42.480148] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1/2 
[2020-01-28 14:31:42.480736] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0/1 
[2020-01-28 14:31:42.481243] E [MSGID: 109016] [dht-rebalance.c:3906:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /0 
[2020-01-28 14:31:42.482449] I [MSGID: 109028] [dht-rebalance.c:5062:gf_defrag_status_get] 0-distrep-dht: Rebalance is failed. Time taken is 176.00 secs 
[2020-01-28 14:31:42.482463] I [MSGID: 109028] [dht-rebalance.c:5064:gf_defrag_status_get] 0-distrep-dht: Files migrated: 2835, size: 28350, lookups: 9218, failures: 33, skipped: 0 
[2020-01-28 14:31:42.482749] W [glusterfsd.c:1441:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x94e2) [0x7fbe90bd64e2] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0x95) [0x406b45] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x4b) [0x4069fb] ) 0-: received signum (15), shutting down 

Will proceed to debug the issue.
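
The script used to create the nested directories is not attached to this report. The following is a hypothetical sketch of one way to build a comparable tree, modelled on the paths in the log above (/0/1/2/.../32/313.txt). The function name, the file count per directory, and the 10-byte payload are assumptions; the payload size is only suggested by the log (2833 files migrated, size 28330).

```python
import os
import tempfile


def make_nested_tree(root, depth, files_per_dir, payload=b"0123456789"):
    """Create root/0/1/.../<depth-1> and put small files in every level.

    Returns the deepest directory and the number of files written.
    """
    current = root
    created = 0
    for level in range(depth):
        current = os.path.join(current, str(level))
        os.makedirs(current, exist_ok=True)
        for i in range(files_per_dir):
            # File names follow the "<number>.txt" pattern seen in the log.
            with open(os.path.join(current, f"{i}.txt"), "wb") as fh:
                fh.write(payload)
            created += 1
    return current, created


if __name__ == "__main__":
    # A temporary directory is used here so the sketch runs anywhere;
    # for a reproduction attempt, root would be the FUSE mount point.
    with tempfile.TemporaryDirectory() as root:
        deepest, count = make_nested_tree(root, depth=33, files_per_dir=10)
        print(deepest, count)
```

With `depth=33` the deepest directory is `0/1/.../32`, the same depth as the paths in the log excerpt.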

Comment 42 errata-xmlrpc 2020-12-17 04:50:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5603