Summary: | [Brick-Mux] gluster vol stop fails with Error : Request timed out. | |
---|---|---|---
Product: | Red Hat Gluster Storage | Reporter: | Kshithij Iyer <kiyer>
Component: | glusterd | Assignee: | Milind Changire <mchangir>
Status: | CLOSED ERRATA | QA Contact: | Bala Konda Reddy M <bmekala>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.4 | CC: | kaushal, kiyer, mchangir, rcyriac, rhs-bugs, sanandpa, sankarshan, sheggodu, srakonde, storage-qa-internal, ubansal, vbellur
Target Milestone: | --- | Keywords: | Regression, ZStream
Target Release: | RHGS 3.4.z Batch Update 1 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.12.2-24 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-10-31 08:46:58 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Bug Depends On: | | |
Bug Blocks: | 1631372 | |
Description
Kshithij Iyer
2018-10-22 06:55:20 UTC
I am able to reproduce this issue in my local setup. The steps I've followed to reproduce it are:

1. Enable brick-mux.
2. Create 3 volumes.
3. Start all 3 volumes.
4. Stop the 1st volume; it succeeds.
5. Stop the 2nd volume; it times out.

While debugging, I've observed that glusterd is hanging in gd_lock_op_phase() during gd_synctask_barrier_wait(). In the glusterd log we don't see any messages regarding the 2nd volume stop transaction, since glusterd hangs before adding any logs. In the brick logs we see the errors below:

[2018-10-22 18:26:16.395583] I [glusterfsd-mgmt.c:367:glusterfs_handle_terminate] 0-glusterfs: detaching not-only child /bricks/brick1/test_1
[2018-10-22 18:26:16.395931] I [server.c:480:server_call_xlator_mem_cleanup] 0-test_1-server: Create graph janitor thread for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396138] E [xlator.c:1223:glusterfs_delete_volfile_checksum] 0-test_1-server: failed to get volfile checksum for volfile id test_1.server3.bricks-brick1-test_1.
[2018-10-22 18:26:16.396190] I [index.c:2587:notify] 0-test_1-index: Notify GF_EVENT_PARENT_DOWN for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396233] I [io-threads.c:1102:notify] 0-test_1-io-threads: Notify GF_EVENT_PARENT_DOWN for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396276] I [posix.c:7131:notify] 0-test_1-posix: Sending CHILD_DOWN for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396306] I [server.c:1652:notify] 0-test_1-server: Getting CHILD_DOWN event for brick /bricks/brick1/test_1
[2018-10-22 18:26:16.396335] I [server.c:756:server_graph_janitor_threads] 0-test_1-server: Start call fini for brick /bricks/brick1/test_1 stack
[2018-10-22 18:26:16.398957] E [rpcsvc.c:1424:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
[2018-10-22 18:26:16.399597] E [rpcsvc.c:1551:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
[2018-10-22 18:26:16.399631] E [rpcsvc.c:rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: GlusterFS Changelog, Num: 1885957735, Ver: 1, Port: 0
[2018-10-22 18:26:16.405603] I [barrier.c:662:fini] 0-test_1-barrier: Disabling barriering and dequeuing all the queued fops
[2018-10-22 18:26:16.405889] I [io-stats.c:4100:fini] 0-test_1-io-stats: io-stats translator unloaded

These errors are observed in the brick logs while stopping the 1st volume. I believe the subsequent volume stop transactions are affected by the 1st volume stop transaction. I'm still working on this to figure out the RCA; I will update the BZ with my findings.

Thanks,
Sanju

(In reply to Kshithij Iyer from comment #0)
> # for number in `seq 1 3`; do gluster vol start vol$number; done
> volume start: vol1: success
> volume start: vol2: success
> volume start: vol3: success

This was tried with enough sleep (3 seconds) between each volume start.

> # gluster vol stop vol1
> Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
> volume stop: vol1: success
> # gluster vol stop vol2
> Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
> Error : Request timed out
> # gluster vol stop vol3
> Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
> Error : Request timed out
> -----------------------------------------------------------------------------

There was enough sleep after each volume stop before the next operation.

Milind and I were able to figure out the root cause. TL;DR: a backporting issue caused a stale lock to be held.
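For context on why the second stop reports "Error : Request timed out" rather than failing immediately: gd_synctask_barrier_wait(), where glusterd was seen hanging above, blocks the transaction until every outstanding reply for that phase has arrived. The snippet below is not the glusterd implementation; it is a minimal pthreads sketch (the struct and function names are invented) of that wait pattern, illustrating how a single missing reply keeps the waiter blocked until the CLI gives up.

```c
/* Minimal sketch (not the actual glusterd code) of a barrier wait:
 * the caller blocks until every expected reply has checked in. If one
 * responder never replies (e.g. because it is deadlocked), the waiter
 * blocks forever and the CLI eventually reports a timeout. */
#include <pthread.h>

struct reply_barrier {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             pending;   /* replies still expected */
};

/* Called from each reply (or error) callback as it arrives. */
void
reply_barrier_wake(struct reply_barrier *b)
{
        pthread_mutex_lock(&b->lock);
        if (--b->pending == 0)
                pthread_cond_signal(&b->cond);
        pthread_mutex_unlock(&b->lock);
}

/* Called by the transaction; returns only after all replies arrive. */
void
reply_barrier_wait(struct reply_barrier *b)
{
        pthread_mutex_lock(&b->lock);
        while (b->pending > 0)
                pthread_cond_wait(&b->cond, &b->lock);
        pthread_mutex_unlock(&b->lock);
}
```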
The brick processes were deadlocked on `ctx->volfile_lock`. Detaching a graph requires that this lock be held. xlator_mem_cleanup, which frees up memory held by xlators, also requires this lock.

Commit fad234b5a (https://code.engineering.redhat.com/gerrit/#/c/152908/), a backport from upstream, required manual patching of xlator_mem_cleanup (the code lives in a different location than it does upstream). An UNLOCK on `ctx->volfile_lock` was missed during the backport. The first brick detach request would go through successfully and would trigger the first xlator_mem_cleanup. The subsequent detach requests would time out waiting for `ctx->volfile_lock`, which was never released by xlator_mem_cleanup.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432
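To make the missed-UNLOCK scenario concrete, here is a simplified, hypothetical sketch of the pattern described above. It uses plain pthreads rather than GlusterFS's own locking macros, and the function names and bodies are invented for illustration only:

```c
#include <pthread.h>

/* Hypothetical stand-in for the context lock discussed above. */
pthread_mutex_t volfile_lock = PTHREAD_MUTEX_INITIALIZER;

/* Buggy cleanup: takes the lock but has no matching unlock, so the
 * lock stays held after the first detach completes. */
void
mem_cleanup_buggy(void)
{
        pthread_mutex_lock(&volfile_lock);
        /* ... free per-xlator resources ... */
        /* BUG: missing pthread_mutex_unlock(&volfile_lock); */
}

/* Fixed cleanup: every exit path releases the lock. */
void
mem_cleanup_fixed(void)
{
        pthread_mutex_lock(&volfile_lock);
        /* ... free per-xlator resources ... */
        pthread_mutex_unlock(&volfile_lock);
}

/* A later detach request also needs the lock; with the buggy cleanup
 * this call never returns. */
void
handle_detach(void)
{
        pthread_mutex_lock(&volfile_lock);
        /* ... detach the brick graph ... */
        pthread_mutex_unlock(&volfile_lock);
}
```

With the buggy variant, the first detach completes but leaves the lock held, so every subsequent handle_detach() call blocks forever, which matches the hang seen when stopping the second volume.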