Bug 1499509
| Field | Value |
|---|---|
| Summary | Brick Multiplexing: `gluster volume start force` fails with "Error : Request timed out" when there are multiple volumes |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | unspecified |
| Version | mainline |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | brick-multiplexing |
| Fixed In Version | glusterfs-3.13.0 |
| Reporter | Samikshan Bairagya <sbairagy> |
| Assignee | Sanju <srakonde> |
| CC | amukherj, bmekala, bugs, nchilaka, rhs-bugs, sbairagy, srakonde, storage-qa-internal, vbellur |
| Clone Of | 1459895 |
| Clones | 1501154 |
| Type | Bug |
| Last Closed | 2017-12-08 17:43:00 UTC |
| Bug Blocks | 1459895, 1526373 |
Description (Samikshan Bairagya, 2017-10-08 07:52:40 UTC)
REVIEW: https://review.gluster.org/18444 (glusterd: disconnect event in brick multiplexing) posted (#4) for review on master by Sanju Rakonde (srakonde)

REVIEW: https://review.gluster.org/18444 (glusterd: Marking all the brick status as stopped when a process goes down in brick multiplexing) posted (#5) for review on master by Sanju Rakonde (srakonde)

REVIEW: https://review.gluster.org/18444 (glusterd: Marking all the brick status as stopped when a process goes down in brick multiplexing) posted (#6) for review on master by Sanju Rakonde (srakonde)

REVIEW: https://review.gluster.org/18444 (glusterd: Marking all the brick status as stopped when a process goes down in brick multiplexing) posted (#7) for review on master by Sanju Rakonde (srakonde)

REVIEW: https://review.gluster.org/18444 (glusterd: Marking all the brick status as stopped when a process goes down in brick multiplexing) posted (#8) for review on master by Sanju Rakonde (srakonde)

COMMIT: https://review.gluster.org/18444 committed in master by Atin Mukherjee (amukherj)

------

commit 9422446d72bc054962d72ace9912ecb885946d49
Author: Sanju Rakonde <srakonde>
Date: Sat Oct 7 03:33:40 2017 +0530

glusterd: Mark all brick statuses as stopped when a process goes down in brick multiplexing

In a brick-multiplexing environment, if a brick process goes down, for example because it is killed with SIGKILL, only the status of the brick for which the process originally came up changes to stopped; all the other brick statuses remain started. This happens because the process is killed abruptly, so its signal handler is never invoked and no further cleanup is triggered. A subsequent "gluster volume start force" then fails with "Request timed out": since every brickinfo->status is still in the started state, glusterd waits for one of the brick processes to come up, which will never happen because the process was killed.

To resolve this, on the disconnect event we check all brick processes to find the one that the disconnected brick belongs to. Once we find that process, we call glusterd_mark_bricks_stopped_by_proc(), passing the brick_proc_t object as an argument. From the glusterd_brick_proc_t we can get all the bricks attached to that process, but these are duplicated copies; to reach the original brickinfo we read the volinfo from each brick, since volinfo holds the original brickinfo entries. We then change brickinfo->status to stopped for all of those bricks.

Change-Id: Ifb9054b3ee081ef56b39b2903ae686984fe827e7
BUG: 1499509
Signed-off-by: Sanju Rakonde <srakonde>

REVIEW: https://review.gluster.org/18499 (glusterd: Marking all the brick status as stopped when a process goes down in brick multiplexing) posted (#2) for review on release-3.12 by Sanju Rakonde (srakonde)

This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
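For readers who want to see the shape of the fix, here is a minimal C sketch of the disconnect-path marking logic described in the commit message above. The struct layouts and the find_volinfo() helper are simplified stand-ins invented for illustration; they are not glusterd's real types or APIs, and this is not the actual patch.

```c
/*
 * Simplified sketch of the fix from commit 9422446: when a multiplexed
 * brick process disconnects, mark every brick attached to that process
 * as stopped. All types and helpers here are illustrative stand-ins,
 * not glusterd's real definitions.
 */
#include <string.h>

#define MAX_BRICKS 64

typedef enum {
    GF_BRICK_STOPPED,
    GF_BRICK_STARTED,
} brick_status_t;

/* Stand-in for glusterd_brickinfo_t: one brick of one volume. */
typedef struct brickinfo {
    char           volname[64];
    char           path[256];
    brick_status_t status;
} brickinfo_t;

/* Stand-in for glusterd_volinfo_t: a volume and its original brickinfos. */
typedef struct volinfo {
    char         volname[64];
    brickinfo_t *bricks[MAX_BRICKS];
    int          brick_count;
} volinfo_t;

/* Stand-in for glusterd_brick_proc_t: one brick process and the
 * (duplicated) brickinfo entries attached to it via multiplexing. */
typedef struct brick_proc {
    int          pid;
    brickinfo_t *bricks[MAX_BRICKS];
    int          brick_count;
} brick_proc_t;

/* Hypothetical lookup: find the volume a brick belongs to. */
extern volinfo_t *find_volinfo(const char *volname);

/* Mirrors the role of glusterd_mark_bricks_stopped_by_proc() in the
 * actual patch: for every brick attached to the dead process, mark the
 * *original* brickinfo (the copy held in volinfo) as stopped. */
static void
mark_bricks_stopped_by_proc(brick_proc_t *proc)
{
    for (int i = 0; i < proc->brick_count; i++) {
        brickinfo_t *dup = proc->bricks[i];

        /* The entries in brick_proc are duplicates; resolve the
         * original brickinfo through the volume's volinfo. */
        volinfo_t *vol = find_volinfo(dup->volname);
        if (!vol)
            continue;

        for (int j = 0; j < vol->brick_count; j++) {
            brickinfo_t *orig = vol->bricks[j];
            if (strcmp(orig->path, dup->path) == 0)
                orig->status = GF_BRICK_STOPPED;
        }
    }
}
```

With every affected brickinfo->status reset to stopped, a later `gluster volume start <volname> force` can simply restart the bricks instead of waiting for a process that will never reconnect, which is what produced the "Request timed out" error.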