Bug 996003

Summary: DHT: Rebalance never finishes in case of a node restart
Product: [Red Hat Storage] Red Hat Gluster Storage    Reporter: shylesh <shmohan>
Component: glusterfs    Assignee: Susant Kumar Palai <spalai>
Status: CLOSED DUPLICATE    QA Contact: shylesh <shmohan>
Severity: high    Docs Contact:
Priority: high
Version: 2.1    CC: nsathyan, rhs-bugs, shmohan, spalai, vbellur
Target Milestone: ---   
Target Release: RHGS 3.0.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-03 06:50:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 923774    

Description shylesh 2013-08-12 07:39:51 UTC
Description of problem:

In a distributed-replicate volume, if one of the nodes is rebooted while rebalance is in progress, the rebalance process runs forever.

Version-Release number of selected component (if applicable):
[root@gqac023 ~]# rpm -qa |grep gluster
glusterfs-fuse-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.17rhs-1.el6rhs.x86_64
gluster-swift-container-1.8.0-6.3.el6rhs.noarch
glusterfs-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.17rhs-1.el6rhs.x86_64
gluster-swift-1.8.0-6.3.el6rhs.noarch
gluster-swift-proxy-1.8.0-6.3.el6rhs.noarch
gluster-swift-account-1.8.0-6.3.el6rhs.noarch
vdsm-gluster-4.10.2-23.0.1.el6rhs.noarch
glusterfs-libs-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.17rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.17rhs-1.el6rhs.x86_64
gluster-swift-object-1.8.0-6.3.el6rhs.noarch
gluster-swift-plugin-1.8.0-2.el6rhs.noarch


How reproducible:


Steps to Reproduce:
1. Created an 11x2 distributed-replicate volume on a 3-node cluster
2. Added a pair of bricks and started fix-layout
3. Rebalance status showed "in progress" on all the nodes
4. Rebooted one of the nodes
5. When the node came back up, rebalance was not started automatically on that node
6. Status on the remaining nodes now shows "in progress" forever

Actual results:

Rebalance runs forever on the remaining nodes of the cluster

Expected results:


Additional info:
=================
Volume Name: di-re
Type: Distributed-Replicate
Volume ID: 93edf2ff-9b12-48eb-8714-e2a2807cca43
Status: Started
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks:
Brick1: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire1
Brick2: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire1
Brick3: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire2
Brick4: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire2
Brick5: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire3
Brick6: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire3
Brick7: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire4
Brick8: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire4
Brick9: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire5
Brick10: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire5
Brick11: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire6
Brick12: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire6
Brick13: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire7
Brick14: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire7
Brick15: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire9
Brick16: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire9
Brick17: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire10
Brick18: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire10
Brick19: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire11
Brick20: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire11
Brick21: gqac022.sbu.lab.eng.bos.redhat.com:/home/dire12
Brick22: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire12
Brick23: gqac024.sbu.lab.eng.bos.redhat.com:/home/dire13
Brick24: gqac023.sbu.lab.eng.bos.redhat.com:/home/dire13


RHS nodes
----------
gqac022.sbu.lab.eng.bos.redhat.com
gqac023.sbu.lab.eng.bos.redhat.com
gqac024.sbu.lab.eng.bos.redhat.com

Node which was brought down
===========================
gqac023.sbu.lab.eng.bos.redhat.com

Command ran from 
================
gqac022.sbu.lab.eng.bos.redhat.com

Mounted on 
==========
gqac024.sbu.lab.eng.bos.redhat.com

Mount point
==========
/mnt


Attached the sosreports.

Comment 3 Scott Haines 2013-09-27 17:08:03 UTC
Targeting for 3.0.0 (Denali) release.

Comment 5 Susant Kumar Palai 2014-03-05 03:00:09 UTC
Shylesh,
   Was not able to reproduce the issue with the above steps.

Created a 2-brick distribute volume:

[root@localhost upstream]# gluster v i
 
Volume Name: test1
Type: Distribute
Volume ID: 84a3db85-9328-4d4d-adc2-045bddb55414
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.42.211:/brick/1
Brick2: 10.70.43.49:/brick/1

Created 10000 directories on mount point
[root@localhost mnt]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                      9.6G  6.7G  2.5G  74% /
tmpfs                1004M     0 1004M   0% /dev/shm
/dev/vda1             485M   32M  428M   7% /boot
/dev/mapper/VolGroup-lv_brick
                      9.8G   33M  9.8G   1% /brick
10.70.42.211:test1     20G   88M   20G   1% /mnt

[root@localhost mnt]# ls
[root@localhost mnt]# mkdir dir{1..10000}



From 10.70.42.211 started fix-layout rebalance
[root@localhost mnt]# gluster volume rebalance test1 fix-layout start
volume rebalance: test1: success: Starting rebalance on volume test1 has been successful.
ID: 074a2a5f-8b60-4b33-8908-569cec40215e


[root@localhost upstream]# gluster volume rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0 fix-layout in progress               2.00
                            10.70.42.211                0        0Bytes             0             0             0 fix-layout in progress               1.00


In the middle of the operation rebooted 10.70.43.49
[root@localhost upstream]# reboot
[root@localhost upstream]# 
Broadcast message from root
	(/dev/pts/0) at 7:31 ...

The system is going down for reboot NOW!
Connection to 10.70.43.49 closed by remote host.
Connection to 10.70.43.49 closed.


After the server came up checked rebalance status

[root@localhost mnt]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0    fix-layout failed              64.00
                             10.70.43.49                0        0Bytes             0             1             0    fix-layout failed               0.00





Found the following in the log: 
[2014-03-04 07:37:28.828568] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir2123
[2014-03-04 07:37:28.838269] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir2124
[2014-03-04 07:37:28.852623] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir2125
[2014-03-04 07:37:28.856090] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir2126
[2014-03-04 07:37:28.865942] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir2127
[2014-03-04 07:38:12.083731] I [dht-rebalance.c:1826:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 56.00 secs
[2014-03-04 07:38:12.083771] I [dht-rebalance.c:1829:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2014-03-04 07:38:20.382415] W [socket.c:522:__socket_rwv] 0-test1-client-1: readv on 10.70.43.49:49156 failed (Connection reset by peer)
[2014-03-04 07:38:20.382698] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x147) [0x7f3d853525ae] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x139) [0x7f3d85351ba6] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f3d853516a3]))) 0-test1-client-1: forced unwinding frame type(GlusterFS 3.3) op(OPENDIR(20)) called at 2014-03-04 07:37:28.872648 (xid=0x361a)
[2014-03-04 07:38:20.382723] W [client-rpc-fops.c:2689:client3_3_opendir_cbk] 0-test1-client-1: remote operation failed: Transport endpoint is not connected. Path: /dir2127 (b520279b-781b-4f3b-90b9-3018557b0360)
[2014-03-04 07:38:20.383338] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x147) [0x7f3d853525ae] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x139) [0x7f3d85351ba6] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f3d853516a3]))) 0-test1-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2014-03-04 07:37:58.574734 (xid=0x361b)
[2014-03-04 07:38:20.383348] W [client-handshake.c:276:client_ping_cbk] 0-test1-client-1: timer must have expired
[2014-03-04 07:38:20.383365] I [client.c:2208:client_rpc_notify] 0-test1-client-1: disconnected from 10.70.43.49:49156. Client process will keep trying to connect to glusterd until brick's port is available
[2014-03-04 07:38:20.383376] W [dht-common.c:5172:dht_notify] 0-test1-dht: Received CHILD_DOWN. Exiting
[2014-03-04 07:38:20.383383] I [dht-rebalance.c:1849:gf_defrag_stop] 0-: Received stop command on rebalance
[2014-03-04 07:38:20.383443] E [dht-rebalance.c:1541:gf_defrag_fix_layout] 0-test1-dht: Fix layout failed for /dir2127
[2014-03-04 07:38:20.383583] I [dht-rebalance.c:1826:gf_defrag_status_get] 0-glusterfs: Rebalance is failed. Time taken is 64.00 secs
[2014-03-04 07:38:20.383595] I [dht-rebalance.c:1829:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
[2014-03-04 07:38:20.383809] W [glusterfsd.c:1128:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3ea50e890d] (-->/lib64/libpthread.so.0() [0x3ea5807851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe4) [0x407e55]))) 0-: received signum (15), shutting down

Will confirm the behaviour with code.

Comment 6 Susant Kumar Palai 2014-03-05 03:17:23 UTC
Shylesh,
     Here is the code snippet showing how a CHILD_DOWN event is handled:

From dht_notify{
        case GF_EVENT_CHILD_DOWN:
                subvol = data; 

                if (conf->assert_no_child_down) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "Received CHILD_DOWN. Exiting");
                        if (conf->defrag) {
/* status is set to GF_DEFRAG_STATUS_FAILED in gf_defrag_stop*/
                                gf_defrag_stop (conf->defrag,
                                                GF_DEFRAG_STATUS_FAILED, NULL);
                        } else {
                                kill (getpid(), SIGTERM);
                        }       
                }       
}

From gf_defrag_status_get (gf_defrag_info_t *defrag, dict_t *dict)
{
 switch (defrag->defrag_status) {
        case GF_DEFRAG_STATUS_NOT_STARTED:
                status = "not started";
                break;
        case GF_DEFRAG_STATUS_STARTED:
                status = "in progress";
                break;
        case GF_DEFRAG_STATUS_STOPPED:
                status = "stopped";
                break;
        case GF_DEFRAG_STATUS_COMPLETE:
                status = "completed";
                break;
        case GF_DEFRAG_STATUS_FAILED:
                status = "failed"; /*Status is set to failed and logged*/
                break;
        default:
                break;
        }

        gf_log (THIS->name, GF_LOG_INFO, "Rebalance is %s. Time taken is %.2f "
                "secs", status, elapsed);
        gf_log (THIS->name, GF_LOG_INFO, "Files migrated: %"PRIu64", size: %"
                PRIu64", lookups: %"PRIu64", failures: %"PRIu64", skipped: "
                "%"PRIu64, files, size, lookup, failures, skipped);


}


Can you confirm the behaviour with the latest build?

Comment 7 shylesh 2014-03-05 04:10:18 UTC
Susant,

The description says the volume type was distributed-replicate, whereas you tried with a distribute volume. Please try with a distributed-replicate volume and let me know.

Comment 8 Susant Kumar Palai 2014-03-05 13:26:01 UTC
Shylesh,
  Able to see one issue with reboot.

Reproducible : Always
Volume:

Volume Name: test1
Type: Distributed-Replicate
Volume ID: 1dcd8dc0-3d7e-4e7f-b561-c3fadd053bf3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.211:/brick/1
Brick2: 10.70.43.49:/brick/1
Brick3: 10.70.42.211:/brick/2
Brick4: 10.70.43.49:/brick/2
[root@localhost mnt]# 

*Created some data
[root@localhost mnt]# ls
dir1   dir11  dir13  dir15  dir17  dir19  dir20  dir4  dir6  dir8  zinkile1   zinkile2  zinkile4  zinkile6  zinkile8
dir10  dir12  dir14  dir16  dir18  dir2   dir3   dir5  dir7  dir9  zinkile10  zinkile3  zinkile5  zinkile7  zinkile9
[root@localhost mnt]# 

*added a pair of bricks and started rebalance
[root@localhost mnt]# gluster v add-brick test1 10.70.42.211:/brick/5 10.70.43.49:/brick/5
volume add-brick: success
[root@localhost mnt]# 

[root@localhost mnt]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes            10             0             3            completed               0.00
                             10.70.43.49                0        0Bytes            10             0             0            completed               0.00
volume rebalance: test1: success: 


Now rebooted 10.70.43.49
[root@localhost ~]# reboot

Broadcast message from root
	(/dev/pts/0) at 18:18 ...

The system is going down for reboot NOW!
[root@localhost ~]# Connection to 10.70.43.49 closed by remote host.
Connection to 10.70.43.49 closed.


Now logged in to 10.70.43.49 and started glusterd
[susant@admin ~]$ vm3
root.43.49's password: 
Last login: Tue Mar  4 13:49:40 2014 from dhcp-0-200.blr.redhat.com
[root@localhost ~]# pgrep glusterd
[root@localhost ~]# glusterd

After glusterd started, rebalance was started again on 10.70.43.49:

[2014-03-04 18:20:08.269956] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir13 took 0.00 secs
[2014-03-04 18:20:08.273536] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir14
[2014-03-04 18:20:08.276153] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir14
[2014-03-04 18:20:08.278087] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir14 took 0.00 secs
[2014-03-04 18:20:08.281345] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir15
[2014-03-04 18:20:08.284134] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir15
[2014-03-04 18:20:08.286188] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir15 took 0.00 secs
[2014-03-04 18:20:08.289099] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir16
[2014-03-04 18:20:08.291804] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir16
[2014-03-04 18:20:08.293951] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir16 took 0.00 secs
[2014-03-04 18:20:08.297058] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir17
[2014-03-04 18:20:08.299245] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir17
[2014-03-04 18:20:08.300983] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir17 took 0.00 secs
[2014-03-04 18:20:08.303776] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir18
[2014-03-04 18:20:08.306262] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir18
[2014-03-04 18:20:08.308070] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir18 took 0.00 secs
[2014-03-04 18:20:08.311072] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir19
[2014-03-04 18:20:08.313643] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir19
[2014-03-04 18:20:08.315300] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir19 took 0.00 secs
[2014-03-04 18:20:08.318545] I [dht-common.c:2601:dht_setxattr] 0-test1-dht: fixing the layout of /dir20
[2014-03-04 18:20:08.321079] I [dht-rebalance.c:1170:gf_defrag_migrate_data] 0-test1-dht: migrate data called on /dir20
[2014-03-04 18:20:08.323018] I [dht-rebalance.c:1393:gf_defrag_migrate_data] 0-test1-dht: Migration operation on dir /dir20 took 0.00 secs
[2014-03-04 18:20:08.327121] I [dht-rebalance.c:1826:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 0.00 secs
[2014-03-04 18:20:08.327163] I [dht-rebalance.c:1829:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 10, failures: 0, skipped: 0



It's a glusterd bug. Will check with Kaushal.

Comment 9 Nagaprasad Sathyanarayana 2014-05-06 11:43:37 UTC
Dev ack to 3.0 RHS BZs

Comment 10 Susant Kumar Palai 2014-05-12 10:38:44 UTC
Hi shylesh,

Here are my findings for this bug.

[root@localhost mnt]# gvi
 
Volume Name: test1
Type: Distribute
Volume ID: 7552e742-d2e8-4282-82b7-c210c8f5ff2c
Status: Started
Snap Volume: no
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 192.168.122.11:/brick/a
Brick2: 192.168.122.100:/brick/b
Brick3: 192.168.122.100:/brick/c

Added two bricks : 

[root@localhost mnt]# gvi
 
Volume Name: test1
Type: Distribute
Volume ID: 7552e742-d2e8-4282-82b7-c210c8f5ff2c
Status: Started
Snap Volume: no
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: 192.168.122.11:/brick/a
Brick2: 192.168.122.100:/brick/b
Brick3: 192.168.122.100:/brick/c
Brick4: 192.168.122.11:/brick/@
Brick5: 192.168.122.100:/brick/@

Started fix-layout , rebooted 192.168.122.100

After the server came up, rebalance was restarted only on 192.168.122.100.

ON  192.168.122.100 checked status after reboot
[root@localhost ~]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0 fix-layout in progress              13.00
                          192.168.122.11                0        0Bytes             0             1             0    fix-layout failed               3.00
volume rebalance: test1: success: 
[root@localhost ~]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0 fix-layout completed              20.00
                          192.168.122.11                0        0Bytes             0             1             0    fix-layout failed               3.00
volume rebalance: test1: success: 
[root@localhost ~]# 


Hence, moving it to ON_QA as the behaviour has changed over time.

If any server goes down in the middle of the rebalance process, then after reboot the rebalance process should restart only on that server.


The issue I mentioned in comment 8 is handled in bug 1046908:
https://bugzilla.redhat.com/show_bug.cgi?id=1046908

Comment 12 Susant Kumar Palai 2014-06-03 06:50:33 UTC

*** This bug has been marked as a duplicate of bug 1046908 ***