Bug 1443991 - [Brick Multiplexing] Brick process on a node didn't come up after glusterd stop/start
Summary: [Brick Multiplexing] Brick process on a node didn't come up after glusterd st...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Mohit Agrawal
QA Contact: Prasad Desala
URL:
Whiteboard: brick-multiplexing
Depends On:
Blocks: 1417151
 
Reported: 2017-04-20 11:49 UTC by Prasad Desala
Modified: 2017-09-21 04:39 UTC (History)
5 users

Fixed In Version: glusterfs-3.8.4-25
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-21 04:39:40 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1442787 0 unspecified CLOSED Brick Multiplexing: During Remove brick when glusterd of a node is stopped, the brick process gets disconnected from glu... 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1443972 0 unspecified CLOSED [Brick Multiplexing] : Bricks for multiple volumes going down after glusterd restart and not coming back up after volume... 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Internal Links: 1442787 1443972

Description Prasad Desala 2017-04-20 11:49:44 UTC
Description of problem:
=======================
Brick process on a node didn't come up after glusterd stop/start. 
Note: I have two volumes configured on my setup but I have seen this issue only on one volume.   

Version-Release number of selected component (if applicable):
3.8.4-22.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Enable the brick multiplexing volume option and create two distributed-replicated volumes (say 10x2 and 4x3).
2) FUSE mount both volumes on multiple clients.
3) Start I/O from the mount points.
4) While I/O is running, add a few bricks to the 10x2 volume and trigger rebalance.
5) Wait for the rebalance to complete successfully.
6) Reboot Node1. Wait for some time and check gluster v status --> all the brick processes are up and running.
7) Now stop/start glusterd on Node1 and check gluster v status.

Actual results:
==============
Brick processes on Node1 didn't come up after glusterd stop/start.

Expected results:
================
After glusterd stop/start, all brick processes for all available volumes should be up and running.

Comment 6 Atin Mukherjee 2017-04-21 06:32:55 UTC
I have an update on this now. The root cause of this bug is the same as that of BZ 1443972 and BZ 1442787.

What I could find here is that when glusterd is restarted, glusterd could connect to the bricks of the first volume: an RPC_CLNT_CONNECT event was received for both of its bricks, with no RPC_CLNT_DISCONNECT events following. For the other volume (where brick multiplexing attached the bricks to an already running process), however, glusterd keeps receiving a constant series of CONNECT events, each followed by a DISCONNECT. Because of this, the status of these bricks toggles between STARTED and STOPPED, and hence gluster volume status shows them as offline.
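
To illustrate the toggle, here is a minimal sketch of the behaviour described above. It is a simplified model, not the actual glusterd-handler.c code; all type and function names below are made up for illustration.

/* Simplified model of the state toggle described above.  The names here
 * are illustrative only, not the actual glusterd structures or enums. */

#include <stdio.h>

enum rpc_event   { RPC_CLNT_CONNECT, RPC_CLNT_DISCONNECT };
enum brick_state { BRICK_STOPPED, BRICK_STARTED };

struct brick_info {
        enum brick_state status;   /* what "gluster volume status" reports */
};

/* Called once per RPC transport event on the brick's management connection. */
static void
brick_rpc_notify (struct brick_info *brick, enum rpc_event event)
{
        if (event == RPC_CLNT_CONNECT)
                brick->status = BRICK_STARTED;   /* shown as online  */
        else
                brick->status = BRICK_STOPPED;   /* shown as offline */
}

int
main (void)
{
        struct brick_info brick = { BRICK_STOPPED };

        /* Healthy case: one CONNECT, no DISCONNECT afterwards. */
        brick_rpc_notify (&brick, RPC_CLNT_CONNECT);

        /* Buggy case: every CONNECT is immediately followed by a
         * DISCONNECT, so the recorded status keeps flipping and is
         * STOPPED whenever volume status is queried. */
        for (int i = 0; i < 3; i++) {
                brick_rpc_notify (&brick, RPC_CLNT_CONNECT);
                brick_rpc_notify (&brick, RPC_CLNT_DISCONNECT);
        }

        printf ("final status: %s\n",
                brick.status == BRICK_STARTED ? "STARTED" : "STOPPED");
        return 0;
}

With a steady CONNECT/DISCONNECT stream the last event wins, which is why these bricks show up as offline in gluster volume status.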

root@ac02862b160d:/home/rhs-glusterfs# gdb -p $(pidof glusterd)
GNU gdb (GDB) Fedora 7.11.1-75.fc24
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 8815
[New LWP 8816]
[New LWP 8817]
[New LWP 8818]
[New LWP 8819]
[New LWP 8820]
[New LWP 9048]
[New LWP 9049]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007faeeeaf86ad in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: dnf debuginfo-install device-mapper-event-libs-1.02.122-2.fc24.x86_64 device-mapper-libs-1.02.122-2.fc24.x86_64 glibc-2.23.1-7.fc24.x86_64 keyutils-libs-1.5.9-8.fc24.x86_64 krb5-libs-1.14.1-6.fc24.x86_64 libattr-2.4.47-16.fc24.x86_64 libblkid-2.28-2.fc24.x86_64 libcap-2.24-9.fc24.x86_64 libcom_err-1.42.13-4.fc24.x86_64 libgcc-6.1.1-3.fc24.x86_64 libselinux-2.5-3.fc24.x86_64 libsepol-2.5-3.fc24.x86_64 libuuid-2.28-2.fc24.x86_64 libxml2-2.9.3-3.fc24.x86_64 lvm2-libs-2.02.150-2.fc24.x86_64 openssl-libs-1.0.2h-1.fc24.x86_64 pcre-8.38-11.fc24.x86_64 systemd-libs-229-8.fc24.x86_64 userspace-rcu-0.8.6-2.fc24.x86_64 xz-libs-5.2.2-2.fc24.x86_64 zlib-1.2.8-10.fc24.x86_64
(gdb) b __glusterd_brick_rpc_notify
Breakpoint 1 at 0x7faeeac69e40: file glusterd-handler.c, line 5594.
(gdb) c
Continuing.
[Switching to Thread 0x7faee69d3700 (LWP 9049)]

Thread 8 "glusterd" hit Breakpoint 1, __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, 
    mydata=mydata@entry=0x7faee000d530, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0)
    at glusterd-handler.c:5594
5594	{
(gdb) bt
#0  __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, mydata=mydata@entry=0x7faee000d530, 
    event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0) at glusterd-handler.c:5594
#1  0x00007faeeac6c889 in glusterd_big_locked_notify (rpc=0x7faee000d5c0, mydata=0x7faee000d530, 
    event=RPC_CLNT_CONNECT, data=0x0, notify_fn=0x7faeeac69e40 <__glusterd_brick_rpc_notify>)
    at glusterd-handler.c:69
#2  0x00007faeefa67dcc in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faee000d5f0, event=<optimized out>, 
    data=0x7faee000d7c0) at rpc-clnt.c:1020
#3  0x00007faeefa643a3 in rpc_transport_notify (this=this@entry=0x7faee000d7c0, 
    event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x7faee000d7c0) at rpc-transport.c:538
#4  0x00007faee8548d89 in socket_connect_finish (this=this@entry=0x7faee000d7c0) at socket.c:2353
#5  0x00007faee854d0b7 in socket_event_handler (fd=<optimized out>, idx=2, data=0x7faee000d7c0, poll_in=0, 
    poll_out=4, poll_err=16) at socket.c:2400
#6  0x00007faeefced18a in event_dispatch_epoll_handler (event=0x7faee69d2e90, event_pool=0x175ce40)
    at event-epoll.c:572
#7  event_dispatch_epoll_worker (data=0x17c64b0) at event-epoll.c:675
#8  0x00007faeeeaf75ba in start_thread () from /lib64/libpthread.so.0
#9  0x00007faeee3d07cd in clone () from /lib64/libc.so.6
(gdb) c
Continuing.

Thread 8 "glusterd" hit Breakpoint 1, __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, 
    mydata=mydata@entry=0x7faee000d530, event=event@entry=RPC_CLNT_DISCONNECT, data=data@entry=0x0)
    at glusterd-handler.c:5594
5594	{
(gdb) bt
#0  __glusterd_brick_rpc_notify (rpc=rpc@entry=0x7faee000d5c0, mydata=mydata@entry=0x7faee000d530, 
    event=event@entry=RPC_CLNT_DISCONNECT, data=data@entry=0x0) at glusterd-handler.c:5594
#1  0x00007faeeac6c889 in glusterd_big_locked_notify (rpc=0x7faee000d5c0, mydata=0x7faee000d530, 
    event=RPC_CLNT_DISCONNECT, data=0x0, notify_fn=0x7faeeac69e40 <__glusterd_brick_rpc_notify>)
    at glusterd-handler.c:69
#2  0x00007faeefa67c4b in rpc_clnt_handle_disconnect (conn=0x7faee000d5f0, clnt=0x7faee000d5c0) at rpc-clnt.c:892
#3  rpc_clnt_notify (trans=<optimized out>, mydata=0x7faee000d5f0, event=RPC_TRANSPORT_DISCONNECT, 
    data=<optimized out>) at rpc-clnt.c:955
#4  0x00007faeefa643a3 in rpc_transport_notify (this=this@entry=0x7faee000d7c0, 
    event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7faee000d7c0) at rpc-transport.c:538
#5  0x00007faee854d177 in socket_event_poll_err (this=0x7faee000d7c0) at socket.c:1184
#6  socket_event_handler (fd=<optimized out>, idx=2, data=0x7faee000d7c0, poll_in=0, poll_out=4, 
    poll_err=<optimized out>) at socket.c:2418
#7  0x00007faeefced18a in event_dispatch_epoll_handler (event=0x7faee69d2e90, event_pool=0x175ce40)
    at event-epoll.c:572
#8  event_dispatch_epoll_worker (data=0x17c64b0) at event-epoll.c:675
#9  0x00007faeeeaf75ba in start_thread () from /lib64/libpthread.so.0
#10 0x00007faeee3d07cd in clone () from /lib64/libc.so.6

Also, in the glusterd log the following series of log entries is seen constantly:

[2017-04-21 06:32:01.529310] I [socket.c:2417:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-04-21 06:32:01.529779] I [MSGID: 106005] [glusterd-handler.c:5682:__glusterd_brick_rpc_notify] 0-management:    Brick 172.17.0.2:/tmp/b3 has disconnected from glusterd.
[2017-04-21 06:32:01.530531] I [socket.c:2417:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-04-21 06:32:01.530957] I [MSGID: 106005] [glusterd-handler.c:5682:__glusterd_brick_rpc_notify] 0-management:    Brick 172.17.0.2:/tmp/b4 has disconnected from glusterd. 

We need to see why we are getting an EPOLL error here.
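
For reference, the dispatch visible in the backtraces above can be summarised with the following sketch. It is a simplified model, not the real socket.c/event-epoll.c code; the function and event names are assumptions.

/* Simplified model of the epoll dispatch seen above: a connecting socket
 * that becomes writable reports CONNECT, and an error flag on the same
 * (or a later) wakeup reports DISCONNECT.  If the connection attempt keeps
 * failing, every wakeup yields the CONNECT/DISCONNECT pair that shows up
 * in the glusterd log.  Names are illustrative only. */

#include <stdio.h>
#include <sys/epoll.h>

enum transport_event { TRANSPORT_CONNECT, TRANSPORT_DISCONNECT };

static void
transport_notify (enum transport_event ev)
{
        /* In glusterd this would end up in the brick RPC notify path above. */
        printf ("%s\n", ev == TRANSPORT_CONNECT ? "CONNECT" : "DISCONNECT");
}

static void
socket_event_handler_sketch (int connecting, unsigned int events)
{
        if (connecting && (events & EPOLLOUT))
                transport_notify (TRANSPORT_CONNECT);     /* connect finished */

        if (events & (EPOLLERR | EPOLLHUP))
                transport_notify (TRANSPORT_DISCONNECT);  /* "EPOLLERR - disconnecting now" */
}

int
main (void)
{
        /* One wakeup carrying both a writable and an error indication. */
        socket_event_handler_sketch (1, EPOLLOUT | EPOLLERR);
        return 0;
}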

Comment 7 Atin Mukherjee 2017-04-24 03:53:19 UTC
upstream patch : https://review.gluster.org/#/c/17101/

Comment 12 Prasad Desala 2017-06-09 13:44:54 UTC
Verified this BZ against glusterfs version 3.8.4-27.el7rhgs.x86_64. Followed the same steps as in the description; after the fix, the brick processes come up after glusterd stop/start.

Hence, moving this BZ to Verified.

Comment 14 errata-xmlrpc 2017-09-21 04:39:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

