Bug 1624444 - Fail volume stop operation in case brick detach request fails
Summary: Fail volume stop operation in case brick detach request fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 1
Assignee: Atin Mukherjee
QA Contact: Rajesh Madaka
URL:
Whiteboard:
Depends On: 1624440
Blocks:
 
Reported: 2018-08-31 15:24 UTC by Atin Mukherjee
Modified: 2022-07-09 10:10 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.12.2-20
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1624440
Environment:
Last Closed: 2018-10-31 08:46:14 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2018:3432 (last updated 2018-10-31 08:47:58 UTC)

Description Atin Mukherjee 2018-08-31 15:24:52 UTC
+++ This bug was initially created as a clone of Bug #1624440 +++

Description of problem:

When glusterd sends a detach request for a brick in brick multiplexing mode and the brick is not connected, the detach fails; but because the error code is not handled, glusterd still marks the volume as stopped.
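
A minimal sketch of the resulting behaviour without the fix, assuming an affected volume named rep_test2 with a brick at /bricks/brick2/rep_test2 (both names are illustrative):

    # The stop is reported as successful even though the brick could not be
    # detached from the multiplexed brick process:
    gluster volume stop rep_test2
    gluster volume info rep_test2 | grep Status    # shows "Stopped"

    # The stale brick instance keeps the brick in use, so unmounting the
    # brick filesystem fails (see the verification notes in comment 7):
    umount /bricks/brick2/rep_test2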

Version-Release number of selected component (if applicable):
mainline.



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2018-08-31 11:14:28 EDT ---

REVIEW: https://review.gluster.org/21055 (glusterd: fail volume stop operation if brick detach fails) posted (#1) for review on master by Atin Mukherjee

Comment 7 Atin Mukherjee 2018-10-08 12:20:07 UTC
Steps to verify:

Create multiple 1 x 3 (replica 3) volumes with brick multiplexing enabled; a setup sketch follows these steps.

1. Make sure the parent brick (the first brick instance with which the brick process is spawned) is disconnected from glusterd.
2. Keep sending volume stop requests for the other volumes.
3. Without the fix, the volumes will be stopped even though they ideally should not be, since brick instances that were not detached are left behind as stale brick instances; this can be proved by unmounting such a brick, which will fail. With the fix, the volume stop will fail.
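
A minimal setup sketch for the steps above, assuming three peers n1, n2 and n3 and brick paths under /bricks (volume names, count and paths are illustrative):

    # Enable brick multiplexing cluster-wide.
    gluster volume set all cluster.brick-multiplex on

    # Create and start a few 1 x 3 volumes so their bricks are multiplexed
    # into a single brick process per node.
    for i in 1 2 3; do
        gluster volume create rep_test$i replica 3 \
            n1:/bricks/brick$i/rep_test$i \
            n2:/bricks/brick$i/rep_test$i \
            n3:/bricks/brick$i/rep_test$i force
        gluster volume start rep_test$i
    done

    # With the parent brick disconnected from glusterd (step 1 above), keep
    # stopping the other volumes and check whether the stop fails:
    gluster volume stop rep_test2
    gluster volume stop rep_test3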

Comment 8 Rajesh Madaka 2018-10-09 10:51:27 UTC
I have followed the above steps to verify this bug.

Below are the scenarios tried.

Scenario 1:

1. Created 5 replica (1x3) volumes.
2. Detached a brick from the first volume on the first node using "gf_attach -d /var/run/gluster/3f637a11d12aa168.socket /bricks/brick4/rep_test1".
3. That brick then went offline in the first volume.
4. Tried to stop the other volumes. The volumes stopped successfully, but with the fix the volume stop should fail.

Scenario 2:

1. Created 5 replica (1x3) volumes.
2. Detached a brick from the first volume on all 3 nodes using "gf_attach -d /var/run/gluster/3f637a11d12aa168.socket /bricks/brick4/rep_test1".
3. Three bricks then went offline in the first volume.
4. Tried to stop the other volumes. The volumes stopped successfully, but with the fix the volume stop should fail.

Build: glusterfs-3.12.2-21

Comment 9 Rajesh Madaka 2018-10-10 07:43:21 UTC
Can I move this bug back to the ASSIGNED state? The behavior is not as expected.

Comment 10 Atin Mukherjee 2018-10-10 14:36:01 UTC
The test done here wasn't correct. gf_attach -d (as used above) will gracefully detach the brick, which is not what we want. What we want is for the base brick to get into a disconnected state in glusterd while its brick status is still marked as started, so it is basically a disconnect simulation.

One way to verify this would be to kill the brick process, attach gdb to the glusterd process, put a breakpoint in glusterd_volume_stop_glusterfs(), set brickinfo->status = GF_BRICK_STARTED, and then continue. With that you should see the volume stop failing in the CLI, whereas the same will not happen in RHGS 3.4.0. A condensed sketch of this simulation follows.
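
A condensed sketch of that simulation, with the PIDs and the volume name as placeholders:

    # On the node whose parent brick should appear disconnected:
    kill -9 <brick PID>

    gdb -p <glusterd PID>
    (gdb) break glusterd_volume_stop_glusterfs
    (gdb) continue

    # In a second terminal on the same node:
    #     gluster volume stop <volume>

    # Back in gdb, once the breakpoint is hit:
    (gdb) print brickinfo->status = GF_BRICK_STARTED
    (gdb) continue

    # With the fix, the CLI reports a commit failure instead of stopping the
    # volume; without it (RHGS 3.4.0), the stop succeeds.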

Comment 11 Rajesh Madaka 2018-10-14 10:15:00 UTC
Followed the below steps to verify this bug:

1. Have a 3-node (n1, n2, n3) cluster.
2. Enable brick multiplexing.
3. Create 5 replica (1x3) volumes.
4. Kill a brick on the first node n1 using "kill -9 <brick PID>".
5. Attach gdb to the glusterd process on node n1 using "gdb -p <glusterd PID>".
6. Put a breakpoint on the function glusterd_volume_stop_glusterfs using "b glusterd_volume_stop_glusterfs".
7. Open another terminal on the same node n1 and try to stop the volume.
8. Set brickinfo->status = GF_BRICK_STARTED using "p brickinfo->status = GF_BRICK_STARTED".
9. The volume stop fails with the error below:
   volume stop: rep_test5: failed: Commit failed on localhost. Please check the log file for more details.


Build version: glusterfs-3.12.2-22

Comment 13 errata-xmlrpc 2018-10-31 08:46:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432

