Bug 1624444

Summary: Fail volume stop operation in case brick detach request fails
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Atin Mukherjee <amukherj>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA
QA Contact: Rajesh Madaka <rmadaka>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.4
CC: amukherj, apaladug, bugs, nchilaka, rhs-bugs, rmadaka, sanandpa, sankarshan, sheggodu, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-20
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1624440
Environment:
Last Closed: 2018-10-31 08:46:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1624440
Bug Blocks:

Description Atin Mukherjee 2018-08-31 15:24:52 UTC
+++ This bug was initially created as a clone of Bug #1624440 +++

Description of problem:

When a detach request is sent for a brick in brick multiplexing mode and the brick is not connected, glusterd fails to detach the brick. However, because the error code is not handled, glusterd still marks the volume as stopped.
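
For context, a minimal sketch of the setup in which this occurs, assuming a 3-node cluster (n1, n2, n3) and the standard gluster CLI; the volume and brick names below are illustrative, not taken from this report:

# Enable brick multiplexing so bricks of compatible volumes share one brick process.
gluster volume set all cluster.brick-multiplex on

# Create and start a couple of 1x3 volumes; their bricks attach to the same brick process.
gluster volume create rep_test1 replica 3 n1:/bricks/brick1/rep_test1 n2:/bricks/brick1/rep_test1 n3:/bricks/brick1/rep_test1
gluster volume create rep_test2 replica 3 n1:/bricks/brick2/rep_test2 n2:/bricks/brick2/rep_test2 n3:/bricks/brick2/rep_test2
gluster volume start rep_test1
gluster volume start rep_test2

# Stopping a volume in this mode sends a detach request per brick; if a brick is not
# connected, the detach fails, yet without the fix the volume is still marked as stopped.
gluster volume stop rep_test2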

Version-Release number of selected component (if applicable):
mainline.



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2018-08-31 11:14:28 EDT ---

REVIEW: https://review.gluster.org/21055 (glusterd: fail volume stop operation if brick detach fails) posted (#1) for review on master by Atin Mukherjee

Comment 7 Atin Mukherjee 2018-10-08 12:20:07 UTC
Steps to verify:

Create multiple 1x3 volumes with brick multiplexing enabled.

1. Make sure the parent brick (the first brick instance with which the brick process was spawned) is disconnected from glusterd.
2. Keep sending volume stop requests for the other volumes.
3. Without the fix, the volumes will be stopped, whereas they ideally should not be, since some brick instances have not been detached. This leaves stale brick instances behind, which can be confirmed by unmounting the brick, which will fail. With the fix, the volume stop will fail. A rough sketch of this check follows.
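
A minimal sketch of the check above, assuming brick multiplexing is on and the parent brick of the shared brick process has been disconnected from glusterd; volume and brick names are assumptions:

# With the parent brick disconnected but still marked as started in glusterd:
gluster volume stop rep_test2
# Without the fix this reports success even though the brick was never detached,
# leaving a stale brick instance behind; unmounting that brick therefore fails:
umount /bricks/brick2/rep_test2
# With the fix, the stop operation itself fails and the volume remains started:
gluster volume status rep_test2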

Comment 8 Rajesh Madaka 2018-10-09 10:51:27 UTC
I followed the above steps to verify this bug.

Below are the scenarios that were tried.

Scenario 1:

1. Created 5 replica (1x3) volumes.
2. Detached a brick of the first volume on the first node using "gf_attach -d /var/run/gluster/3f637a11d12aa168.socket /bricks/brick4/rep_test1".
3. That particular brick then went offline in the first volume.
4. Tried to stop the other volumes; they stopped successfully, but with the fix the volume stop should fail.

Scenario 2:

1. Created 5 replica (1x3) volumes.
2. Detached a brick of the first volume on all 3 nodes using "gf_attach -d /var/run/gluster/3f637a11d12aa168.socket /bricks/brick4/rep_test1" (the command is sketched below).
3. All three bricks of the first volume then went offline.
4. Tried to stop the other volumes; they stopped successfully, but with the fix the volume stop should fail.
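
For reference, the detach attempt used in both scenarios looks roughly as follows; the socket file name is node-specific, and the one shown is the one quoted in this report:

# gf_attach -d asks the multiplexed brick process, via its UNIX socket under
# /var/run/gluster/, to gracefully detach the given brick path:
gf_attach -d /var/run/gluster/3f637a11d12aa168.socket /bricks/brick4/rep_test1
# The detached brick then shows as offline for that volume:
gluster volume status rep_test1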

Build: glusterfs-3.12.2-21

Comment 9 Rajesh Madaka 2018-10-10 07:43:21 UTC
Can I move this bug back to the ASSIGNED state? The behavior is not as expected.

Comment 10 Atin Mukherjee 2018-10-10 14:36:01 UTC
The test done here wasn't correct. gf_attach -d gracefully detaches the brick, which is not what we want. What we want is for the base brick to get into a disconnected state in glusterd while its status is still marked as started, so it is basically a disconnect simulation.

One way to verify this would be to kill the brick process, attach gdb to the glusterd process, put a breakpoint in glusterd_volume_stop_glusterfs (), set brickinfo->status = GF_BRICK_STARTED when the breakpoint hits, and then continue. With that you should see the volume stop failing in the CLI, whereas the same will not happen in RHGS 3.4.0.
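
A rough sketch of that gdb session, assuming the volume name used later in this report; the placeholder PIDs are illustrative:

# On the node whose brick was killed, attach gdb to the running glusterd:
gdb -p <glusterd PID>
(gdb) break glusterd_volume_stop_glusterfs
(gdb) continue
# From a second terminal on the same node, trigger the breakpoint:
gluster volume stop rep_test5
# Back in gdb, simulate a disconnected brick that is still marked as started, then resume:
(gdb) print brickinfo->status = GF_BRICK_STARTED
(gdb) continue
# With the fix, the CLI reports a failure such as:
# volume stop: rep_test5: failed: Commit failed on localhost. Please check the log file for more details.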

Comment 11 Rajesh Madaka 2018-10-14 10:15:00 UTC
Followed the steps below to verify this bug:

1. Have a 3-node (n1, n2, n3) cluster.
2. Enable brick multiplexing.
3. Create 5 replica (1x3) volumes.
4. Kill a brick on the first node n1 using "kill -9 <brick PID>".
5. Attach gdb to the glusterd process on node n1 using "gdb -p <glusterd PID>".
6. Put a breakpoint on the function glusterd_volume_stop_glusterfs using "b glusterd_volume_stop_glusterfs".
7. Open another terminal on the same node n1 and try to stop a volume.
8. When the breakpoint is hit, set brickinfo->status = GF_BRICK_STARTED using "p brickinfo->status = GF_BRICK_STARTED".
9. The volume stop fails with the error below:
   volume stop: rep_test5: failed: Commit failed on localhost. Please check the log file for more details.


Build version:
glusterfs-3.12.2-22

Comment 13 errata-xmlrpc 2018-10-31 08:46:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432