Bug 1461695

Summary: glusterd crashed and core dumped, when the network interface is down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: medium
Docs Contact:
Priority: high
Version: rhgs-3.2
CC: rhinduja, rhs-bugs, sheggodu, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:32:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1503134
Attachments:
glusterd coredump from the node (Flags: none)

Description SATHEESARAN 2017-06-15 08:31:42 UTC
Description of problem:
-----------------------
On a node that hosts bricks for gluster volumes, if the network interface is down and glusterd is restarted, glusterd crashes and dumps core.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3 ( glusterfs-3.7.9-12.elrhgs )
RHGS 3.2.0 ( glusterfs-3.8.4-18.el7rhgs )
RHGS 3.2.0 async ( glusterfs-3.8.4-18.4.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a Trusted Storage Pool ( gluster cluster )
2. Create a volume of any type
3. Select a node in the cluster that hosts a brick
4. Using console access to that node, bring down its network interface
5. Restart glusterd on that node
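
The steps above can be sketched as a shell script. This is only an illustration under assumptions: the host names, volume name, brick path and interface name are placeholders that do not come from this report, and the script prints the commands (dry run) unless DRY_RUN=0 is set. Run it from the console of the node, since bringing the interface down drops any SSH session.

```shell
#!/bin/sh
# Reproduction sketch. All values below are hypothetical placeholders.
PEER="${PEER:-node2.example.com}"   # another node to form the pool with
NODE="${NODE:-node1.example.com}"   # the node that hosts the brick
VOL="${VOL:-reprovol}"              # volume name
BRICK="${BRICK:-/bricks/b1}"        # brick directory on NODE
IFACE="${IFACE:-eth0}"              # interface to bring down
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run gluster peer probe "$PEER"                    # step 1: form the pool
    run gluster volume create "$VOL" "$NODE:$BRICK"   # step 2: any volume type
    run ip link set dev "$IFACE" down                 # step 4: interface down
    run systemctl restart glusterd                    # step 5: restart glusterd
}

reproduce
```

With the default DRY_RUN=1 the script only lists the four commands, which makes it safe to review before running it on a test node.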

Actual results:
---------------
glusterd crashed and coredumped

Expected results:
-----------------
glusterd should not crash on restart when the network interface is down

Comment 1 SATHEESARAN 2017-06-15 08:47:52 UTC
gdb backtrace
--------------

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f91c95eee00, context=0x7f91c95eee00, fs=0x7f91c95eeef0) at ./md-unwind-support.h:58
58	  if (*(unsigned char *)(pc+0) == 0x48


gdb backtrace from all threads
------------------------------
(gdb) t a a bt

Thread 7 (Thread 0x7f91d9e91780 (LWP 14715)):
#0  0x00007f91d882143d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91d99e1475 in sys_write (fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at syscall.c:270
#2  0x00007f91d9eaf539 in glusterfs_process_volfp (ctx=ctx@entry=0x7f91dac5b010, fp=fp@entry=0x7f91daca52d0) at glusterfsd.c:2299
#3  0x00007f91d9eaf69d in glusterfs_volumes_init (ctx=ctx@entry=0x7f91dac5b010) at glusterfsd.c:2336
#4  0x00007f91d9eabace in main (argc=5, argv=<optimized out>) at glusterfsd.c:2448

Thread 6 (Thread 0x7f91cec83700 (LWP 14719)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f91d99f2538 in syncenv_task (proc=proc@entry=0x7f91daca1530) at syncop.c:603
#2  0x00007f91d99f3380 in syncenv_processor (thdata=0x7f91daca1530) at syncop.c:695
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cec83700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f91c9df3700 (LWP 14934)):
#0  0x00007f91d8154e2d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91cb59dda9 in poll (__timeout=-1, __nfds=2, __fds=0x7f91c9df2e80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7f91dad607d0) at socket.c:2500
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91c9df3700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f91d0486700 (LWP 14716)):
#0  0x00007f91d8821bdd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91d99c6fe6 in gf_timer_proc (data=0x7f91daca0b70) at timer.c:176
#2  0x00007f91d881adc5 in start_thread (arg=0x7f91d0486700) at pthread_create.c:308
#3  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f91cfc85700 (LWP 14717)):
#0  0x00007f91d8822101 in do_sigwait (sig=0x7f91cfc84e1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1  __sigwait (set=set@entry=0x7f91cfc84e20, sig=sig@entry=0x7f91cfc84e1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2  0x00007f91d9eaebfb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cfc85700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f91cf484700 (LWP 14718)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f91d99f2538 in syncenv_task (proc=proc@entry=0x7f91daca1170) at syncop.c:603
#2  0x00007f91d99f3380 in syncenv_processor (thdata=0x7f91daca1170) at syncop.c:695
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cf484700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f91c95f2700 (LWP 15480)):
#0  x86_64_fallback_frame_state (context=0x7f91c95eee00, context=0x7f91c95eee00, fs=0x7f91c95eeef0) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f91c95eee00, fs=fs@entry=0x7f91c95eeef0) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f91cc50a019 in _Unwind_Backtrace (trace=0x7f91d81734f0 <backtrace_helper>, trace_argument=0x7f91c95ef0b0) at ../../../libgcc/unwind.inc:290
#3  0x00007f91d8173666 in __GI___backtrace (array=array@entry=0x7f91c95ef0f0, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f91d99b9ce2 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1094
#5  0x00007f91d99c3884 in gf_print_trace (signum=<optimized out>, ctx=<optimized out>) at common-utils.c:755
#6  <signal handler called>
#7  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#8  0x00007f91d80af1c2 in __find_specmb (format=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>) at printf-parse.h:109
#9  _IO_vfprintf_internal (s=s@entry=0x7f91c95f07e0, format=format@entry=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, ap=ap@entry=0x7f91c95f09d8) at vfprintf.c:1308
#10 0x00007f91d8176a45 in __GI___vasprintf_chk (result_ptr=result_ptr@entry=0x7f91c95f09b8, flags=flags@entry=1, 
    format=format@entry=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, args=args@entry=0x7f91c95f09d8) at vasprintf_chk.c:66
#11 0x00007f91d99bad54 in vasprintf (__ap=0x7f91c95f09d8, __fmt=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, __ptr=0x7f91c95f09b8) at /usr/include/bits/stdio2.h:210
#12 _gf_msg (domain=0x7f91dacaa4c0 "management", file=0x7f91ce253f3a <Address 0x7f91ce253f3a out of bounds>, function=0x7f91ce2543b0 <Address 0x7f91ce2543b0 out of bounds>, line=664, 
    level=GF_LOG_ERROR, errnum=22, trace=1, msgid=101172, fmt=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>) at logging.c:2069
#13 0x00007f91ce20f3ae in ?? ()
#14 0x00007f9100000001 in ?? ()
#15 0x0000000000018b34 in ?? ()
#16 0x00007f91ce232210 in ?? ()
#17 0x0000000000000000 in ?? ()
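
A backtrace from all threads, like the one above, can also be collected non-interactively; gdb's batch mode disables the pager, so the output has no "Type <return> to continue" prompts. The sketch below is an assumption-laden illustration: the core file path is a hypothetical placeholder, and the command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: dump a backtrace of every thread from a glusterd core file.
GLUSTERD_BIN="${GLUSTERD_BIN:-/usr/sbin/glusterd}"
CORE_FILE="${CORE_FILE:-/path/to/core}"   # hypothetical placeholder path
DRY_RUN="${DRY_RUN:-1}"                   # 1 = only print the command

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

collect_backtrace() {
    # "thread apply all bt" is the long form of the "t a a bt" command used above.
    run gdb -batch -ex "thread apply all bt" "$GLUSTERD_BIN" "$CORE_FILE"
}

collect_backtrace
```

Redirecting the output to a file (for example `DRY_RUN=0 sh script.sh > bt.txt`) gives a trace that can be attached as-is; matching debuginfo packages are needed for frames to resolve to symbols.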

Comment 2 SATHEESARAN 2017-06-15 08:48:55 UTC
snip from glusterd logs:
------------------------
[2017-06-14 08:22:10.739716] I [MSGID: 106004] [glusterd-handler.c:5808:__glusterd_peer_rpc_notify] 0-management: Peer <10.70.36.74> (<0c2f8929-3a24-4b33-95ea-9810b98f0027>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-06-14 08:22:10.740199] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume AppDisksVol. Stopping local bricks.
[2017-06-14 08:22:10.831748] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume BootDisksVol. Stopping local bricks.
[2017-06-14 08:22:10.831815] E [MSGID: 106187] [glusterd-store.c:4417:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2017-06-14 08:22:10.831871] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2017-06-14 08:22:10.831885] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2017-06-14 08:22:10.831896] E [MSGID: 101176] [graph.c:673:glusterfs_graph_activate] 0-graph: init failed
[2017-06-14 08:22:10.831919] E [glusterd-peer-utils.c:153:glusterd_hostname_to_uuid] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x554a4) [0x7f3f69c9f4a4] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x43be0) [0x7f3f69c8dbe0] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x102ad6) [0x7f3f69d4cad6] ) 0-: Assertion failed: priv
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2017-06-14 08:22:10
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
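
To pull only the error (E) and critical (C) entries out of a glusterd log like the snippet above, a small filter can help. This is a sketch that assumes each entry starts with a bracketed timestamp followed by a one-letter level, as in the lines shown here; the function name is made up for illustration and the log file location is not taken from this report.

```shell
#!/bin/sh
# Sketch: keep only E (error) and C (critical) entries from a glusterd log.
# Reads log text on stdin and writes the matching lines to stdout.
filter_severe() {
    grep -E '^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:.]+\] [EC] '
}

# Usage (LOG is a placeholder for the glusterd log file on the node):
#   filter_severe < "$LOG"
```

Lines that do not start with a timestamp, such as the crash banner ("pending frames:", "signal received: 11"), are dropped by this filter, so it complements reading the full log and does not replace it.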

Comment 3 SATHEESARAN 2017-06-15 08:53:08 UTC
Created attachment 1287961 [details]
glusterd coredump from the node

Comment 6 Gaurav Yadav 2017-07-31 06:34:15 UTC
As Atin has already mentioned, glusterd is not able to resolve the bricks on restart when the network interface is down.

A similar bug, 1472267, has already been addressed; the upstream patch for it is below.
Upstream patch: https://review.gluster.org/#/c/17813/

Comment 9 SATHEESARAN 2018-08-28 10:26:28 UTC
Tested with RHGS 3.4.0 nightly build ( glusterfs-3.12.2-17.el7rhgs ) with the following steps:

1. Created a 3-node RHGS trusted storage pool
2. Created 3 volumes of type replicate
3. Started the volumes
4. Got a console connection to one of the RHGS server nodes and brought down the interface(s)
5. Restarted glusterd

Observed no glusterd crashes.

Comment 10 errata-xmlrpc 2018-09-04 06:32:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607