Description of problem:
=======================
Brick processes on a node did not come up after a glusterd stop/start.

Note: Two volumes are configured on this setup, but the issue was seen on only one of them.

Version-Release number of selected component (if applicable):
3.8.4-22.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Enable the brick-multiplexing volume option and create two distributed-replicated volumes (say 10x2 and 4x3).
2) FUSE-mount both volumes on multiple clients.
3) Start I/O from the mount points.
4) While I/O is running, add a few bricks to the 10x2 volume and trigger a rebalance.
5) Rebalance completes successfully.
6) Reboot Node1, wait for some time, and check gluster v status --> all brick processes are up and running.
7) Now stop/start glusterd on Node1 and check gluster v status.

Actual results:
===============
The brick processes on Node1 did not come up after the glusterd stop/start.

Expected results:
=================
After a glusterd stop/start, all brick processes for all available volumes should be up and running.
I have an update on this now. The root cause of this bug is the same as that of BZ 1443972 and BZ 1442787. What I found is that when glusterd is restarted, for the first volume glusterd connects to the bricks and an RPC_CLNT_CONNECT event is received for both bricks, with no RPC_CLNT_DISCONNECT events afterwards. For the other volumes (where brick multiplexing attached the bricks to an already running process), the brick connect succeeds, but a constant series of CONNECT followed by DISCONNECT events is received. Because of this, the status of these bricks keeps toggling between STARTED and STOPPED, and hence gluster volume status shows them as offline.

root@ac02862b160d:/home/rhs-glusterfs# gdb -p $(pidof glusterd)
GNU gdb (GDB) Fedora 7.11.1-75.fc24
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 8815
[New LWP 8816]
[New LWP 8817]
[New LWP 8818]
[New LWP 8819]
[New LWP 8820]
[New LWP 9048]
[New LWP 9049]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007faeeeaf86ad in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: dnf debuginfo-install device-mapper-event-libs-1.02.122-2.fc24.x86_64 device-mapper-libs-1.02.122-2.fc24.x86_64 glibc-2.23.1-7.fc24.x86_64 keyutils-libs-1.5.9-8.fc24.x86_64 krb5-libs-1.14.1-6.fc24.x86_64 libattr-2.4.47-16.fc24.x86_64 libblkid-2.28-2.fc24.x86_64 libcap-2.24-9.fc24.x86_64 libcom_err-1.42.13-4.fc24.x86_64 libgcc-6.1.1-3.fc24.x86_64 libselinux-2.5-3.fc24.x86_64 libsepol-2.5-3.fc24.x86_64 libuuid-2.28-2.fc24.x86_64 libxml2-2.9.3-3.fc24.x86_64 lvm2-libs-2.02.150-2.fc24.x86_64 openssl-libs-1.0.2h-1.fc24.x86_64 pcre-8.38-11.fc24.x86_64 systemd-libs-229-8.fc24.x86_64 userspace-rcu-0.8.6-2.fc24.x86_64 xz-libs-5.2.2-2.fc24.x86_64 zlib-1.2.8-10.fc24.x86_64
(gdb) b __glusterd_brick_rpc_notify
Breakpoint 1 at 0x7faeeac69e40: file glusterd-handler.c, line 5594.
(gdb) c
Continuing.
[Switching to Thread 0x7faee69d3700 (LWP 9049)]

Thread 8 "glusterd" hit Breakpoint 1, __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, mydata=mydata@entry=0x7faee000d530, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0) at glusterd-handler.c:5594
5594    {
(gdb) bt
#0  __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, mydata=mydata@entry=0x7faee000d530, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0) at glusterd-handler.c:5594
#1  0x00007faeeac6c889 in glusterd_big_locked_notify (rpc=0x7faee000d5c0, mydata=0x7faee000d530, event=RPC_CLNT_CONNECT, data=0x0, notify_fn=0x7faeeac69e40 <__glusterd_brick_rpc_notify>) at glusterd-handler.c:69
#2  0x00007faeefa67dcc in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faee000d5f0, event=<optimized out>, data=0x7faee000d7c0) at rpc-clnt.c:1020
#3  0x00007faeefa643a3 in rpc_transport_notify (this=this@entry=0x7faee000d7c0, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x7faee000d7c0) at rpc-transport.c:538
#4  0x00007faee8548d89 in socket_connect_finish (this=this@entry=0x7faee000d7c0) at socket.c:2353
#5  0x00007faee854d0b7 in socket_event_handler (fd=<optimized out>, idx=2, data=0x7faee000d7c0, poll_in=0, poll_out=4, poll_err=16) at socket.c:2400
#6  0x00007faeefced18a in event_dispatch_epoll_handler (event=0x7faee69d2e90, event_pool=0x175ce40) at event-epoll.c:572
#7  event_dispatch_epoll_worker (data=0x17c64b0) at event-epoll.c:675
#8  0x00007faeeeaf75ba in start_thread () from /lib64/libpthread.so.0
#9  0x00007faeee3d07cd in clone () from /lib64/libc.so.6
(gdb) c
Continuing.

Thread 8 "glusterd" hit Breakpoint 1, __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, mydata=mydata@entry=0x7faee000d530, event=event@entry=RPC_CLNT_DISCONNECT, data=data@entry=0x0) at glusterd-handler.c:5594
5594    {
(gdb) bt
#0  __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, mydata=mydata@entry=0x7faee000d530, event=event@entry=RPC_CLNT_DISCONNECT, data=data@entry=0x0) at glusterd-handler.c:5594
#1  0x00007faeeac6c889 in glusterd_big_locked_notify (rpc=0x7faee000d5c0, mydata=0x7faee000d530, event=RPC_CLNT_DISCONNECT, data=0x0, notify_fn=0x7faeeac69e40 <__glusterd_brick_rpc_notify>) at glusterd-handler.c:69
#2  0x00007faeefa67c4b in rpc_clnt_handle_disconnect (conn=0x7faee000d5f0, clnt=0x7faee000d5c0) at rpc-clnt.c:892
#3  rpc_clnt_notify (trans=<optimized out>, mydata=0x7faee000d5f0, event=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-clnt.c:955
#4  0x00007faeefa643a3 in rpc_transport_notify (this=this@entry=0x7faee000d7c0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7faee000d7c0) at rpc-transport.c:538
#5  0x00007faee854d177 in socket_event_poll_err (this=0x7faee000d7c0) at socket.c:1184
#6  socket_event_handler (fd=<optimized out>, idx=2, data=0x7faee000d7c0, poll_in=0, poll_out=4, poll_err=<optimized out>) at socket.c:2418
#7  0x00007faeefced18a in event_dispatch_epoll_handler (event=0x7faee69d2e90, event_pool=0x175ce40) at event-epoll.c:572
#8  event_dispatch_epoll_worker (data=0x17c64b0) at event-epoll.c:675
#9  0x00007faeeeaf75ba in start_thread () from /lib64/libpthread.so.0
#10 0x00007faeee3d07cd in clone () from /lib64/libc.so.6

Also, the following series of log entries is constantly seen in the glusterd log:

[2017-04-21 06:32:01.529310] I [socket.c:2417:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-04-21 06:32:01.529779] I [MSGID: 106005] [glusterd-handler.c:5682:__glusterd_brick_rpc_notify] 0-management: Brick 172.17.0.2:/tmp/b3 has disconnected from glusterd.
[2017-04-21 06:32:01.530531] I [socket.c:2417:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-04-21 06:32:01.530957] I [MSGID: 106005] [glusterd-handler.c:5682:__glusterd_brick_rpc_notify] 0-management: Brick 172.17.0.2:/tmp/b4 has disconnected from glusterd.

We need to find out why we are getting an EPOLLERR here.
Upstream patch: https://review.gluster.org/#/c/17101/
Upstream patches: https://review.gluster.org/#/q/topic:bug-1444596

Downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/105595/
https://code.engineering.redhat.com/gerrit/#/c/105596/
Verified this BZ against glusterfs version 3.8.4-27.el7rhgs.x86_64. Followed the same steps as in the description; with the fix, the brick processes come up after a glusterd stop/start. Hence, moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774