Bug 1278390

Summary:	Data Tiering:Regression:Detach tier commit is passing when detach tier is in progress
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Vivek Agarwal <vagarwal>
Component:	tier	Assignee:	Dan Lambright <dlambrig>
Status:	CLOSED ERRATA	QA Contact:	Bhaskarakiran <byarlaga>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	rhgs-3.1	CC:	amukherj, asrivast, byarlaga, dlambrig, mzywusko, nchilaka, rhs-bugs, sankarshan, storage-qa-internal
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 3.1.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.7.5-7	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	1264441	Environment:
Last Closed:	2016-03-01 05:52:28 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1264441, 1279637
Bug Blocks:	1260783, 1260923

Description Vivek Agarwal 2015-11-05 11:39:09 UTC

+++ This bug was initially created as a clone of Bug #1264441 +++

When detach tier is in progress, detach tier commit must not be allowed.
This worked correctly in earlier builds, so it is a regression,
possibly caused by the fix for bz#1259694.


glusterfs-server-3.7.4-0.33.git1d02d4b.el7.centos.x86_64


Steps to Reproduce:
====================
1. Create a volume, start it, attach a tier to it, and create a lot of data on the hot tier.
2. Issue detach-tier <vname> start.
3. While it is in progress, issue detach-tier <vname> commit.

--- Additional comment from nchilaka on 2015-09-18 09:21:39 EDT ---

This was found during automation of QE scripts.

--- Additional comment from Dan Lambright on 2015-09-30 15:36:21 EDT ---

I have done an RCA on this and will begin working on a fix.

--- Additional comment from Dan Lambright on 2015-10-01 11:14:45 EDT ---

Submitted fix 12272.

--- Additional comment from Vijay Bellur on 2015-10-28 12:44:33 EDT ---

REVIEW: http://review.gluster.org/12272 (cluster/tier: migration daemon incorrectly signals detach tier done) posted (#3) for review on release-3.7 by Dan Lambright (dlambrig)

--- Additional comment from Vijay Bellur on 2015-10-28 16:31:24 EDT ---

REVIEW: http://review.gluster.org/12272 (cluster/tier: Disallow detach commit when detach in progress) posted (#4) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 3 Bhaskarakiran 2015-11-17 12:30:43 UTC

Checked this on the 3.7.5-6 build. If detach-tier commit is run from a node where the scan is complete, the commit succeeds even though another node is still in progress. If it is run from a node where the scan is in progress, the correct error message is shown and the commit fails. Moving the bug back.



[root@transformers yum.repos.d]# gluster v detach-tier vol1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes        116983             0             0            completed             600.00
                                   ninja                0        0Bytes             0             0             0          in progress               0.00
[root@transformers yum.repos.d]# gluster v detach-tier vol1 commit
Removing tier can result in data loss. Do you want to Continue? (y/n) y
volume detach-tier commit: success
Check the detached bricks to ensure all files are migrated.
If files with data are found on the brick path, copy them via a gluster mount point before re-purposing the removed brick. 
[root@transformers yum.repos.d]#

Comment 4 Atin Mukherjee 2015-11-23 16:39:18 UTC

Bhaskar,

Some background before I get into the real business on this:

Detach start and commit follow the sync-op framework in GlusterD, where in every phase GlusterD accumulates the responses of all the nodes and then decides whether to proceed or fail. The validation we are talking about here is in the staging phase, so even if the detach operation is complete on the local node but not on the other nodes, the command as a whole would fail. That is the guarantee the sync-op framework brings.
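The staging behaviour described above can be illustrated with a minimal Python sketch. This is not GlusterD code (GlusterD is written in C); `StageResult`, `stage_detach_commit` and `run_staging` are hypothetical names, and the status strings are simply the ones shown by the CLI in this bug. It only shows the idea that one failing node fails the whole command.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class StageResult:
    node: str
    ok: bool
    message: str = ""


def stage_detach_commit(node: str, rebalance_status: str) -> StageResult:
    # Hypothetical per-node staging check: reject the commit unless this
    # node reports that its migration scan has completed.
    if rebalance_status != "completed":
        return StageResult(node, False, f"detach-tier is {rebalance_status} on {node}")
    return StageResult(node, True)


def run_staging(statuses: Dict[str, str]) -> Tuple[bool, List[str]]:
    # Collect the response of every node first, then decide: the command
    # proceeds only if no node returned an error.
    results = [stage_detach_commit(node, status) for node, status in statuses.items()]
    errors = [r.message for r in results if not r.ok]
    return (not errors, errors)


if __name__ == "__main__":
    print(run_staging({"transformers": "completed", "ninja": "in progress"}))
```

With the statuses from comment 3 (completed on the originator, in progress on ninja), the sketch rejects the commit, which is the outcome the framework is expected to guarantee.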

I tried to simulate this problem by setting the rebalance status to complete on the originator node (where the CLI command is run) and to started on the other nodes, but couldn't reproduce it, as the CLI throws an error message in that case. Here are the steps I performed before taking control in gdb.

1. Created a dist volume (one brick only) in a 2 node cluster.
2. Performed attach-tier with one more brick.
3. detach start
4. detach commit

Could you provide the steps you performed to reproduce this issue?

Comment 5 Bhaskarakiran 2015-11-24 13:58:03 UTC

I am not sure whether it can be hit with one brick. The setup I configured is an 8+4 EC volume with a 2x2 dist-rep tier volume attached to it.

1. Create an 8+4 EC volume and attach a 2x2 dist-rep tier volume.
2. Start a Linux untar and some parallel writes.
3. Start detach-tier.
4. Once the status shows as completed on any one node, issue the detach-tier commit command; it passes.

If the commit is run on any node that shows in-progress status, it fails with the correct message.

Comment 6 Bhaskarakiran 2015-11-24 13:58:26 UTC

4. Once the status shows as completed on any one node, issue the detach-tier commit command from that node; it passes.

Comment 7 Atin Mukherjee 2015-11-25 04:09:30 UTC

(In reply to Bhaskarakiran from comment #5)
> I am not sure if it gets hit with one brick. The setup i configured is 8+4
> ec volume and 2x2 dist-rep tier volume for it.
Well, I do not think the number of bricks matters here, as the validation is in the staging phase and we do not check how many bricks are configured for the volume at that point. However, I'll try to execute the same steps and see whether the issue persists.
> 
> 1. Create 8+4 ec volume and attach 2x2 dist-rep tier volume
> 2. Start linux untar and some parallel writes
> 3. Start detach-tier
> 4. Once the status shows as completed on any of the node, issue detach-tier
> commit command and it passes. 
Are you sure that the other nodes hadn't finished the ongoing scan at the time you performed the detach commit? How did you guarantee that? It could very well be that when you checked the status on the other nodes it showed detach in progress, but the scan had completed by the time you executed detach commit.
> 
> If the commit is run on any of the node which has in-progress status, it
> would fail with correct message.

Comment 8 Bhaskarakiran 2015-11-27 09:37:10 UTC

To check this behaviour I executed the status and commit commands one after another without any delay. I tried this again on the 3.7.5-6 build and saw the same behaviour.


[root@transformers ~]# gluster v status vol1
Status of volume: vol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick ninja:/rhs/brick2/vol1-tier4          49158     0          Y       29569
Brick vertigo:/rhs/brick2/vol1-tier3        49158     0          Y       849  
Brick ninja:/rhs/brick1/vol1-tier2          49157     0          Y       29551
Brick vertigo:/rhs/brick1/vol1-tier1        49157     0          Y       831  
Cold Bricks:
Brick transformers:/rhs/brick1/b1           49152     0          Y       40735
Brick interstellar:/rhs/brick1/b2           49152     0          Y       30530
Brick transformers:/rhs/brick2/b3           49153     0          Y       40753
Brick interstellar:/rhs/brick2/b4           49153     0          Y       30548
Brick transformers:/rhs/brick3/b5           49154     0          Y       40771
Brick interstellar:/rhs/brick3/b6           49154     0          Y       30566
Brick transformers:/rhs/brick4/b7           49155     0          Y       40789
Brick interstellar:/rhs/brick4/b8           49155     0          Y       30584
Brick transformers:/rhs/brick5/b9           49156     0          Y       40807
Brick interstellar:/rhs/brick5/b10          49156     0          Y       30602
Brick transformers:/rhs/brick6/b11          49157     0          Y       40825
Brick interstellar:/rhs/brick6/b12          49157     0          Y       30622
Snapshot Daemon on localhost                49158     0          Y       41271
NFS Server on localhost                     2049      0          Y       50798
Self-heal Daemon on localhost               N/A       N/A        Y       50806
Quota Daemon on localhost                   N/A       N/A        Y       50814
Snapshot Daemon on vertigo                  49152     0          Y       31813
NFS Server on vertigo                       2049      0          Y       868  
Self-heal Daemon on vertigo                 N/A       N/A        Y       876  
Quota Daemon on vertigo                     N/A       N/A        Y       885  
Snapshot Daemon on ninja                    49152     0          Y       26163
NFS Server on ninja                         2049      0          Y       29588
Self-heal Daemon on ninja                   N/A       N/A        Y       29596
Quota Daemon on ninja                       N/A       N/A        Y       29604
Snapshot Daemon on interstellar.lab.eng.blr
.redhat.com                                 49158     0          Y       30712
NFS Server on interstellar.lab.eng.blr.redh
at.com                                      2049      0          Y       40062
Self-heal Daemon on interstellar.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       40070
Quota Daemon on interstellar.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       40078
 
Task Status of Volume vol1
------------------------------------------------------------------------------
Task                 : Detach tier         
ID                   : 5cc013cd-8002-4716-8d52-469f85053afd
Status               : in progress         
 
[root@transformers ~]# 
[root@transformers ~]# gluster v detach-tier vol1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes           107             0             0          in progress            8785.00
                                   ninja                0        0Bytes             0             0             0          in progress               0.00
                                 vertigo                0        0Bytes             0             0             0          in progress               0.00
[root@transformers ~]# gluster v detach-tier vol1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes           110             0             0          in progress            8859.00
                                   ninja                0        0Bytes             0             0             0          in progress               0.00
                                 vertigo                0        0Bytes             0             0             0          in progress               0.00
[root@transformers ~]# gluster v detach-tier vol1 commi
Usage: volume detach-tier <VOLNAME>  <start|stop|status|commit|force>
Tier command failed
[root@transformers ~]# gluster v detach-tier vol1 commit
Removing tier can result in data loss. Do you want to Continue? (y/n) y
volume detach-tier commit: success
Check the detached bricks to ensure all files are migrated.
If files with data are found on the brick path, copy them via a gluster mount point before re-purposing the removed brick. 
[root@transformers ~]#
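
The manual check in the transcript above (status followed immediately by commit) could be scripted along the following lines. This is only a sketch: it assumes the column layout of the `detach-tier status` output exactly as pasted in this bug, and the helper names `parse_detach_status` and `safe_to_commit` are hypothetical. It does not replace the validation GlusterD is expected to do, and a client-side check like this is itself open to the race discussed in comment 7.

```python
import re
from typing import Dict

# One data row: node, rebalanced-files, size, scanned, failures, skipped,
# status (may contain a space, e.g. "in progress"), run time in secs.
_ROW = re.compile(r"^\s*(\S+)\s+\d+\s+\S+\s+\d+\s+\d+\s+\d+\s+(.+?)\s+[\d.]+\s*$")

# Two rows taken from the status output in comment 3.
SAMPLE = """\
      Node Rebalanced-files     size   scanned  failures   skipped        status   run time in secs
 ---------      -----------  -------   -------  --------   -------  ------------     --------------
 localhost                0   0Bytes    116983         0         0     completed             600.00
     ninja                0   0Bytes         0         0         0   in progress               0.00
"""


def parse_detach_status(output: str) -> Dict[str, str]:
    # Map node name -> status; header and separator lines do not match.
    statuses = {}
    for line in output.splitlines():
        match = _ROW.match(line)
        if match:
            statuses[match.group(1)] = match.group(2)
    return statuses


def safe_to_commit(output: str) -> bool:
    # Only treat the commit as safe when at least one node was parsed
    # and every parsed node reports "completed".
    statuses = parse_detach_status(output)
    return bool(statuses) and all(s == "completed" for s in statuses.values())


if __name__ == "__main__":
    print(parse_detach_status(SAMPLE))
    print(safe_to_commit(SAMPLE))
```

For the sample rows, ninja is still in progress, so the helper reports that the commit should not be issued yet.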

Comment 9 Bhaskarakiran 2015-11-27 09:37:52 UTC

volume status:

[root@transformers ~]# gluster v status vol1
Status of volume: vol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick transformers:/rhs/brick1/b1           49152     0          Y       40735
Brick interstellar:/rhs/brick1/b2           49152     0          Y       30530
Brick transformers:/rhs/brick2/b3           49153     0          Y       40753
Brick interstellar:/rhs/brick2/b4           49153     0          Y       30548
Brick transformers:/rhs/brick3/b5           49154     0          Y       40771
Brick interstellar:/rhs/brick3/b6           49154     0          Y       30566
Brick transformers:/rhs/brick4/b7           49155     0          Y       40789
Brick interstellar:/rhs/brick4/b8           49155     0          Y       30584
Brick transformers:/rhs/brick5/b9           49156     0          Y       40807
Brick interstellar:/rhs/brick5/b10          49156     0          Y       30602
Brick transformers:/rhs/brick6/b11          49157     0          Y       40825
Brick interstellar:/rhs/brick6/b12          49157     0          Y       30622
Snapshot Daemon on localhost                49158     0          Y       41271
NFS Server on localhost                     2049      0          Y       51071
Self-heal Daemon on localhost               N/A       N/A        Y       51079
Quota Daemon on localhost                   N/A       N/A        Y       51099
Snapshot Daemon on vertigo                  49152     0          Y       31813
NFS Server on vertigo                       2049      0          Y       3099 
Self-heal Daemon on vertigo                 N/A       N/A        Y       3107 
Quota Daemon on vertigo                     N/A       N/A        Y       3115 
Snapshot Daemon on ninja                    49152     0          Y       26163
NFS Server on ninja                         2049      0          Y       6418 
Self-heal Daemon on ninja                   N/A       N/A        Y       6426 
Quota Daemon on ninja                       N/A       N/A        Y       6434 
Snapshot Daemon on interstellar.lab.eng.blr
.redhat.com                                 49158     0          Y       30712
NFS Server on interstellar.lab.eng.blr.redh
at.com                                      2049      0          Y       40202
Self-heal Daemon on interstellar.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       40210
Quota Daemon on interstellar.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       40220
 
Task Status of Volume vol1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@transformers ~]#

Comment 10 Atin Mukherjee 2015-11-30 09:16:31 UTC

So here is the latest update on the bug:

QE tested this behaviour with 3.7.5-6, where the fix https://code.engineering.redhat.com/gerrit/61589 had not been pulled in. I am not sure whether this was supposed to be tested on the latest bits or the previous ones. The behaviour is reproducible in 3.7.5-6 but not in 3.7.5-7. A Fixed In Version value would have saved us all this confusion. Hence moving it to ON_QA with the Fixed In Version set.
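
A quick way to confirm that a build under test is at or past the Fixed In Version is sketched below. The helpers `build_key` and `has_fix` are hypothetical, and the comparison is a simplification that only handles numeric version-release strings such as the ones quoted in this bug; it is not a full RPM version comparison.

```python
import re
from typing import Tuple


def build_key(nvr: str) -> Tuple[int, ...]:
    # Turn "3.7.5-6" or "glusterfs-3.7.5-7" into (3, 7, 5, 6) / (3, 7, 5, 7).
    match = re.search(r"(\d+(?:\.\d+)*)-(\d+)", nvr)
    if match is None:
        raise ValueError(f"no version-release found in {nvr!r}")
    version, release = match.groups()
    return tuple(int(part) for part in version.split(".")) + (int(release),)


def has_fix(installed: str, fixed_in: str = "glusterfs-3.7.5-7") -> bool:
    # True when the installed build is the fixed build or a later one.
    return build_key(installed) >= build_key(fixed_in)


if __name__ == "__main__":
    for build in ("3.7.5-6", "3.7.5-7"):
        print(build, has_fix(build))
```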

Comment 11 Bhaskarakiran 2015-12-01 07:51:47 UTC

Checked on 3.7.5-7 and it works correctly. Marking as verified.

Comment 14 errata-xmlrpc 2016-03-01 05:52:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html