Bug 1303269 - After GlusterD restart, Remove-brick commit happening even though data migration not completed.
Summary: After GlusterD restart, Remove-brick commit happening even though data migration not completed.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On: 1303028 1303125 1311041
Blocks: 1310972
 
Reported: 2016-01-30 03:15 UTC by Atin Mukherjee
Modified: 2016-06-16 13:56 UTC
CC List: 3 users

Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Clone Of: 1303125
Clones: 1310972
Environment:
Last Closed: 2016-06-16 13:56:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Atin Mukherjee 2016-01-30 03:15:51 UTC
+++ This bug was initially created as a clone of Bug #1303125 +++

Description of problem:
=======================
Have a two-node cluster with a Distributed-Replicate volume, mounted as FUSE with enough data. Started removing a replica brick set, which triggered a rebalance. While the rebalance was in progress, restarted glusterd on the node from which data migration was happening. After that, tried to commit the remove-brick; the commit succeeded even though data migration had not completed.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-17


How reproducible:
=================
Every time


Steps to Reproduce:
====================
1. Have a two-node cluster with a Distributed-Replicate volume (2 x 2).
2. Mount the volume as FUSE and write enough data.
3. Start removing a replica brick set  // this triggers the data migration
4. Using remove-brick status, identify the node from which data migration is happening.
5. Restart glusterd on the node identified in step 4 while the rebalance is in progress.
6. Try to commit the remove-brick  // the commit succeeds without failing

Actual results:
===============
The remove-brick commit succeeds even though the rebalance has not completed.


Expected results:
=================
The remove-brick commit should be rejected while the rebalance is in progress.

Additional info:

--- Additional comment from Byreddy on 2016-01-29 10:55:45 EST ---

[root@dhcp42-84 ~]# gluster volume status
Status of volume: Dis-Rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.84:/bricks/brick0/smp0       49157     0          Y       18500
Brick 10.70.43.6:/bricks/brick0/smp1        49162     0          Y       19368
Brick 10.70.42.84:/bricks/brick1/smp2       49158     0          Y       18519
Brick 10.70.43.6:/bricks/brick1/smp3        49163     0          Y       19387
NFS Server on localhost                     2049      0          Y       18541
Self-heal Daemon on localhost               N/A       N/A        Y       18546
NFS Server on 10.70.43.6                    2049      0          Y       19409
Self-heal Daemon on 10.70.43.6              N/A       N/A        Y       19414
 
Task Status of Volume Dis-Rep
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.43.6
Uuid: 2f8a267c-7e7c-488f-98b9-f816062aae58
State: Peer in Cluster (Connected)
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 start
volume remove-brick start: success
ID: fd0164f8-2cba-4b25-b881-bbeb7b323695
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               59       351.4KB           417             0             0          in progress               7.00
                              10.70.43.6                0        0Bytes             0             0             0          in progress               7.00
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               93       511.0KB           627             0             0          in progress              11.00
                              10.70.43.6                0        0Bytes             0             0             0          in progress              11.00
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              113       569.2KB           710             0             0          in progress              13.00
                              10.70.43.6                0        0Bytes             0             0             0            completed              12.00
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: failed: use 'force' option as migration is in progress
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# systemctl restart glusterd
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress               0.00
                              10.70.43.6                0        0Bytes             0             0             0            completed              12.00
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume remove-brick Dis-Rep  replica 2 10.70.42.84:/bricks/brick1/smp2  10.70.43.6:/bricks/brick1/smp3 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
Check the removed bricks to ensure all files are migrated.
If files with data are found on the brick path, copy them via a gluster mount point before re-purposing the removed brick. 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# 
[root@dhcp42-84 ~]# gluster volume status
Status of volume: Dis-Rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.84:/bricks/brick0/smp0       49157     0          Y       18500
Brick 10.70.43.6:/bricks/brick0/smp1        49162     0          Y       19368
NFS Server on localhost                     2049      0          Y       19014
Self-heal Daemon on localhost               N/A       N/A        Y       19022
NFS Server on 10.70.43.6                    2049      0          Y       19582
Self-heal Daemon on 10.70.43.6              N/A       N/A        Y       19590
 
Task Status of Volume Dis-Rep
------------------------------------------------------------------------------
There are no active volume tasks

Comment 1 Vijay Bellur 2016-01-30 03:21:00 UTC
REVIEW: http://review.gluster.org/13323 (glusterd: set decommission_is_in_progress flag for inprogress remove-brick op on glusterd restart) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 2 Vijay Bellur 2016-01-30 06:50:12 UTC
REVIEW: http://review.gluster.org/13323 (glusterd: set decommission_is_in_progress flag for inprogress remove-brick op on glusterd restart) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 3 Vijay Bellur 2016-02-02 04:34:07 UTC
REVIEW: http://review.gluster.org/13323 (glusterd: set decommission_is_in_progress flag for inprogress remove-brick op on glusterd restart) posted (#3) for review on master by Atin Mukherjee (amukherj)

Comment 4 Vijay Bellur 2016-02-23 05:42:34 UTC
COMMIT: http://review.gluster.org/13323 committed in master by Atin Mukherjee (amukherj) 
------
commit 3ca140f011faa9d92a4b3889607fefa33ae6de76
Author: Atin Mukherjee <amukherj>
Date:   Sat Jan 30 08:47:35 2016 +0530

    glusterd: set decommission_is_in_progress flag for inprogress remove-brick op on glusterd restart
    
    While remove brick is in progress, if glusterd is restarted since decommission
    flag is not persisted in the store the same value is not retained back resulting
    in glusterd not blocking remove brick commit when rebalance is already in
    progress.
    
    Change-Id: Ibbf12f3792d65ab1293fad1e368568be141a1cd6
    BUG: 1303269
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13323
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Gaurav Kumar Garg <ggarg>
    Reviewed-by: mohammed rafi  kc <rkavunga>
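
For context, here is a minimal standalone C sketch of the idea the commit message above describes: the decommission flag exists only in memory and is not persisted in glusterd's store, so it must be re-derived from the persisted remove-brick/rebalance state when the daemon restarts; otherwise a plain commit is no longer rejected while migration is running. All names in the sketch (volinfo_t, restore_decommission_flag, remove_brick_commit, the defrag_status_t values) are hypothetical stand-ins that only model the logic and do not come from the actual glusterd sources.

/*
 * Standalone sketch (not glusterd code) modelling the fix: re-derive the
 * in-memory decommission flag from persisted state on daemon restart.
 */
#include <stdio.h>
#include <stdbool.h>

typedef enum {
    DEFRAG_STATUS_NOT_STARTED = 0,
    DEFRAG_STATUS_STARTED,              /* remove-brick data migration running */
    DEFRAG_STATUS_COMPLETE,
} defrag_status_t;

typedef struct {
    defrag_status_t rebal_status;              /* persisted in the volume store */
    bool            has_decommissioned_bricks; /* bricks marked for removal     */
    bool            decommission_in_progress;  /* in-memory only (the bug)      */
} volinfo_t;

/* Called while reloading a volume from the store after a restart.  Before the
 * fix this step was missing, so the flag stayed false and a plain
 * "remove-brick ... commit" was no longer rejected. */
static void restore_decommission_flag(volinfo_t *vol) {
    if (vol->has_decommissioned_bricks &&
        vol->rebal_status == DEFRAG_STATUS_STARTED)
        vol->decommission_in_progress = true;
}

/* Commit-path check: refuse a plain commit while migration is running. */
static int remove_brick_commit(const volinfo_t *vol, bool force) {
    if (vol->decommission_in_progress && !force) {
        fprintf(stderr,
                "commit: failed: use 'force' option as migration is in progress\n");
        return -1;
    }
    printf("commit: success\n");
    return 0;
}

int main(void) {
    /* Simulate state reloaded from disk right after a glusterd restart. */
    volinfo_t vol = {
        .rebal_status              = DEFRAG_STATUS_STARTED,
        .has_decommissioned_bricks = true,
        .decommission_in_progress  = false,  /* lost across restart */
    };

    remove_brick_commit(&vol, false);    /* buggy behaviour: succeeds      */
    restore_decommission_flag(&vol);     /* the fix: re-derive the flag    */
    remove_brick_commit(&vol, false);    /* now correctly rejected         */
    return 0;
}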

Comment 5 Niels de Vos 2016-06-16 13:56:06 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

