Bug 1641489

Summary:	[Brick-Mux] gluster vol stop fails with Error : Request timed out.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Kshithij Iyer <kiyer>
Component:	glusterd	Assignee:	Milind Changire <mchangir>
Status:	CLOSED ERRATA	QA Contact:	Bala Konda Reddy M <bmekala>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.4	CC:	kaushal, kiyer, mchangir, rcyriac, rhs-bugs, sanandpa, sankarshan, sheggodu, srakonde, storage-qa-internal, ubansal, vbellur
Target Milestone:	---	Keywords:	Regression, ZStream
Target Release:	RHGS 3.4.z Batch Update 1
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.12.2-24	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-31 08:46:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1631372

Description Kshithij Iyer 2018-10-22 06:55:20 UTC

Description of problem:
On a three node cluster(N1,N2,N3) set cluster.brick-multiplex to enable and storage.fips-mode-rchecksum to enable for three 1x3 volumes(vol1,vol2,vol3). Now start all volumes and stop all volumes one by one. The first volume(vol1) stops normally after which if you try to stop the other volumes the volume stop command fails with "Error : Request timed out". After which the cluster goes into an inconsistent state where the volumes(vol2 and vol3) appears to be stopped from a node and from the another node it appears to be running.

--------------------------------------------------------------------------------
# gluster vol set vol1 storage.fips-mode-rchecksum enable
volume set: success
# gluster vol set vol2 storage.fips-mode-rchecksum enable
volume set: success
# gluster vol set vol3 storage.fips-mode-rchecksum enable
volume set: success

# for number in `seq 1 3`; do gluster vol start vol$number; done
volume start: vol1: success
volume start: vol2: success
volume start: vol3: success

#gluster vol stop vol1
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: vol1: success
# gluster vol stop vol2
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Error : Request timed out
# gluster vol stop vol3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Error : Request timed out
--------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):
3.12.2.23

OS Version:
Red Hat Enterprise Linux 7.6

How reproducible:
Always(With and without fips)

Steps to Reproduce:
1.On a three node cluster set cluster.brick-multiplex to enable.
gluster vol set all cluster.brick-multiplex enable
2.Create 3 volumes of 1x3 type say vol1, vol2 and vol3.
3.Enable fips on for all the three volumes.
gluster vol set vol3 storage.fips-mode-rchecksum enable
4.Start all the three volumes.
for number in `seq 1 3`; do gluster vol start vol$number; done
5.Stop all the three volumes one by one.

Actual results:
Volume stop fails with "Error : Request timed out" and cluster goes into inconsistent state.

Expected results:
All volumes should stop successfully.

Additional info:
In the builds 3.12.2.21 and 3.12.2.22 this bug was not found. I had started and stopped 100+ volumes on build 3.12.2.22 and haven't observed this bug.

Comment 10 Sanju 2018-10-23 06:30:32 UTC

I am able to reproduce this issue in my local setup. Steps I've followed reproduce this issue is:
1. enable brick-mux
2. create 3 volumes
3. start all the 3 volumes
4. stop 1st volumes, It will be success
5. stop 2nd volume, it will time out.
 
While debugging, I've observed that glusterd is hunging in gd_lock_op_phase() during gd_synctask_barrier_wait(). In glusterd log, we are not seeing any messages regarding 2nd volume stop transaction since glusterd is hung before adding any logs.

In brick logs we see the below errors:
[2018-10-22 18:26:16.395583] I [glusterfsd-mgmt.c:367:glusterfs_handle_terminate] 0-glusterfs: detaching not-only child /bricks/brick1/test_1
[2018-10-22 18:1588:6.395931] I [server.c:480:server_call_xlator_mem_cleanup] 0-test_1-server: Create graph janitor thread for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396138] E [xlator.c:1223:glusterfs_delete_volfile_checksum] 0-test_1-server: failed to get volfile checksum for volfile id test_1.server3.bricks-brick1-test_1.
[2018-10-22 18:26:16.396190] I [index.c:2587:notify] 0-test_1-index: Notify GF_EVENT_PARENT_DOWN for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396233] I [io-threads.c:1102:notify] 0-test_1-io-threads: Notify GF_EVENT_PARENT_DOWN for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396276] I [posix.c:7131:notify] 0-test_1-posix: Sending CHILD_DOWN for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396306] I [server.c:1652:notify] 0-test_1-server: Getting CHILD_DOWN event for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396335] I [server.c:756:server_graph_janitor_threads] 0-test_1-server: Start call fini for brick /bricks/brick1/test_1 stack
[2018-10-22 18:26:16.398957] E [rpcsvc.c:1424:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
[2018-10-22 18:26:16.399597] E [rpcsvc.c:1551:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
[2018-10-22 18:26:16.399631] E [rpcsvc.c:26:1rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: GlusterFS Changelog, Num: 1885957735, Ver: 1, Port: 0
[2018-10-22 18:26:16.405603] I [barrier.c:662:fini] 0-test_1-barrier: Disabling barriering and dequeuing all the queued fops
[2018-10-22 18:26:16.405889] I [io-stats.c:4100:fini] 0-test_1-io-stats: io-stats translator unloaded

These errors are observed in brick logs while stopping the 1st volume. I believe that, further volume stop transaction are affected by the 1st volume stop transaction.

I'm still working on this to figure out the RCA, I will update the BZ with my findings.

Thanks,
Sanju

Comment 11 Bala Konda Reddy M 2018-10-23 06:48:57 UTC

(In reply to Kshithij Iyer from comment #0)
> # for number in `seq 1 3`; do gluster vol start vol$number; done
> volume start: vol1: success
> volume start: vol2: success
> volume start: vol3: success
Tried with enough sleep(3 seconds) for each volume start.
> 
> #gluster vol stop vol1
> Stopping volume will make its data inaccessible. Do you want to continue?
> (y/n) y
> volume stop: vol1: success
> # gluster vol stop vol2
> Stopping volume will make its data inaccessible. Do you want to continue?
> (y/n) y
> Error : Request timed out
> # gluster vol stop vol3
> Stopping volume will make its data inaccessible. Do you want to continue?
> (y/n) y
> Error : Request timed out
> -----------------------------------------------------------------------------
> ---
After each volume stop had enough sleep for next operation.

Comment 12 Kaushal 2018-10-23 07:29:06 UTC

Milind and I were able to figure out the root cause.

TL;DR, Backporting issue caused stale lock to be held.

The brick processes were deadlocked on `ctx->volfile_lock`. Detaching a graph requires that this lock be held. xlator_mem_cleanup, which frees up memory held by xlators also requires this lock.

Commit fad234b5a (https://code.engineering.redhat.com/gerrit/#/c/152908/), a backport from upstream, required manual patching to xlator_mem_cleanup (different code location from upstream). An UNLOCK on ctx->volfile_lock was missed during the backport.

The first brick detach request would go through successfully, and would trigger the first xlator_mem_cleanup. The subsequent detach requests would timeout waiting for the ctx->volfile_lock, which wasn't released by xlator_mem_cleanup.

Comment 17 errata-xmlrpc 2018-10-31 08:46:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432