Description of problem:
-----------------------
On a node that hosts bricks for gluster volumes, if the network interface is down and glusterd is restarted, glusterd crashes and dumps core.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3 ( glusterfs-3.7.9-12.elrhgs )
RHGS 3.2.0 ( glusterfs-3.8.4-18.el7rhgs )
RHGS 3.2.0 async ( glusterfs-3.8.4-18.4.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a Trusted Storage Pool ( gluster cluster )
2. Create a volume of any type
3. Select the node in the cluster that hosts the brick
4. Using console access to that node, bring down its network interface
5. Restart glusterd on that node

Actual results:
---------------
glusterd crashed and dumped core

Expected results:
-----------------
glusterd should not crash on restart while the network interface is down
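The steps above can be sketched as a shell session. This is a hypothetical illustration: the peer hostnames, volume name, brick path, and interface name are assumptions, not taken from the report.

```shell
# Hypothetical 3-node reproduction sketch; server1-3, testvol, the
# brick path, and eth0 are illustrative assumptions.
VOLNAME=testvol
IFACE=eth0

# 1. Form the Trusted Storage Pool (run on server1)
gluster peer probe server2
gluster peer probe server3

# 2. Create and start a volume of any type (replica 3 shown here)
gluster volume create "$VOLNAME" replica 3 \
    server1:/bricks/brick1/"$VOLNAME" \
    server2:/bricks/brick1/"$VOLNAME" \
    server3:/bricks/brick1/"$VOLNAME"
gluster volume start "$VOLNAME"

# 3-5. On the brick-hosting node, via console access (SSH would be
# cut off once the interface goes down):
ip link set "$IFACE" down
systemctl restart glusterd   # crashed with SIGSEGV on affected builds
```

Console access in step 4 matters because bringing the interface down severs any remote session to the node.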
gdb backtrace
--------------
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f91c95eee00, context=0x7f91c95eee00, fs=0x7f91c95eeef0) at ./md-unwind-support.h:58
58        if (*(unsigned char *)(pc+0) == 0x48

gdb backtrace from all threads
------------------------------
(gdb) t a a bt

Thread 7 (Thread 0x7f91d9e91780 (LWP 14715)):
#0  0x00007f91d882143d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91d99e1475 in sys_write (fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at syscall.c:270
#2  0x00007f91d9eaf539 in glusterfs_process_volfp (ctx=ctx@entry=0x7f91dac5b010, fp=fp@entry=0x7f91daca52d0) at glusterfsd.c:2299
#3  0x00007f91d9eaf69d in glusterfs_volumes_init (ctx=ctx@entry=0x7f91dac5b010) at glusterfsd.c:2336
#4  0x00007f91d9eabace in main (argc=5, argv=<optimized out>) at glusterfsd.c:2448

Thread 6 (Thread 0x7f91cec83700 (LWP 14719)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f91d99f2538 in syncenv_task (proc=proc@entry=0x7f91daca1530) at syncop.c:603
#2  0x00007f91d99f3380 in syncenv_processor (thdata=0x7f91daca1530) at syncop.c:695
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cec83700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f91c9df3700 (LWP 14934)):
#0  0x00007f91d8154e2d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91cb59dda9 in poll (__timeout=-1, __nfds=2, __fds=0x7f91c9df2e80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7f91dad607d0) at socket.c:2500
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91c9df3700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f91d0486700 (LWP 14716)):
#0  0x00007f91d8821bdd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91d99c6fe6 in gf_timer_proc (data=0x7f91daca0b70) at timer.c:176
#2  0x00007f91d881adc5 in start_thread (arg=0x7f91d0486700) at pthread_create.c:308
#3  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f91cfc85700 (LWP 14717)):
#0  0x00007f91d8822101 in do_sigwait (sig=0x7f91cfc84e1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1  __sigwait (set=set@entry=0x7f91cfc84e20, sig=sig@entry=0x7f91cfc84e1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2  0x00007f91d9eaebfb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cfc85700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f91cf484700 (LWP 14718)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f91d99f2538 in syncenv_task (proc=proc@entry=0x7f91daca1170) at syncop.c:603
#2  0x00007f91d99f3380 in syncenv_processor (thdata=0x7f91daca1170) at syncop.c:695
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cf484700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f91c95f2700 (LWP 15480)):
#0  x86_64_fallback_frame_state (context=0x7f91c95eee00, context=0x7f91c95eee00, fs=0x7f91c95eeef0) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f91c95eee00, fs=fs@entry=0x7f91c95eeef0) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f91cc50a019 in _Unwind_Backtrace (trace=0x7f91d81734f0 <backtrace_helper>, trace_argument=0x7f91c95ef0b0) at ../../../libgcc/unwind.inc:290
#3  0x00007f91d8173666 in __GI___backtrace (array=array@entry=0x7f91c95ef0f0, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f91d99b9ce2 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1094
#5  0x00007f91d99c3884 in gf_print_trace (signum=<optimized out>, ctx=<optimized out>) at common-utils.c:755
#6  <signal handler called>
#7  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#8  0x00007f91d80af1c2 in __find_specmb (format=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>) at printf-parse.h:109
#9  _IO_vfprintf_internal (s=s@entry=0x7f91c95f07e0, format=format@entry=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, ap=ap@entry=0x7f91c95f09d8) at vfprintf.c:1308
#10 0x00007f91d8176a45 in __GI___vasprintf_chk (result_ptr=result_ptr@entry=0x7f91c95f09b8, flags=flags@entry=1, format=format@entry=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, args=args@entry=0x7f91c95f09d8) at vasprintf_chk.c:66
#11 0x00007f91d99bad54 in vasprintf (__ap=0x7f91c95f09d8, __fmt=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, __ptr=0x7f91c95f09b8) at /usr/include/bits/stdio2.h:210
#12 _gf_msg (domain=0x7f91dacaa4c0 "management", file=0x7f91ce253f3a <Address 0x7f91ce253f3a out of bounds>, function=0x7f91ce2543b0 <Address 0x7f91ce2543b0 out of bounds>, line=664, level=GF_LOG_ERROR, errnum=22, trace=1, msgid=101172, fmt=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>) at logging.c:2069
#13 0x00007f91ce20f3ae in ?? ()
#14 0x00007f9100000001 in ?? ()
#15 0x0000000000018b34 in ?? ()
#16 0x00007f91ce232210 in ?? ()
#17 0x0000000000000000 in ?? ()
snip from glusterd logs:
------------------------
[2017-06-14 08:22:10.739716] I [MSGID: 106004] [glusterd-handler.c:5808:__glusterd_peer_rpc_notify] 0-management: Peer <10.70.36.74> (<0c2f8929-3a24-4b33-95ea-9810b98f0027>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-06-14 08:22:10.740199] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume AppDisksVol. Stopping local bricks.
[2017-06-14 08:22:10.831748] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume BootDisksVol. Stopping local bricks.
[2017-06-14 08:22:10.831815] E [MSGID: 106187] [glusterd-store.c:4417:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2017-06-14 08:22:10.831871] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2017-06-14 08:22:10.831885] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2017-06-14 08:22:10.831896] E [MSGID: 101176] [graph.c:673:glusterfs_graph_activate] 0-graph: init failed
[2017-06-14 08:22:10.831919] E [glusterd-peer-utils.c:153:glusterd_hostname_to_uuid] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x554a4) [0x7f3f69c9f4a4] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x43be0) [0x7f3f69c8dbe0] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x102ad6) [0x7f3f69d4cad6] ) 0-: Assertion failed: priv

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-06-14 08:22:10
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
Created attachment 1287961 [details] glusterd coredump from the node
As Atin has already mentioned, glusterd is unable to resolve the bricks on restart while the network interface is down. We have a similar bug, bug 1472267, which has already been addressed; the upstream patch for it is below.

Upstream patch: https://review.gluster.org/#/c/17813/
Tested with the RHGS 3.4.0 nightly build ( glusterfs-3.12.2-17.el7rhgs ) using the following steps:

1. Create a 3-node RHGS trusted storage pool
2. Create 3 volumes of type replicate
3. Start the volumes
4. Get a console connection to one of the RHGS server nodes and bring down the interface(s)
5. Restart glusterd

No glusterd crashes were observed.
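The check in step 5 can be scripted along these lines. This is a sketch, not part of the original verification run; the interface name and the 5-second settle time are assumptions.

```shell
# After restarting glusterd with the interface down, confirm the
# daemon is still running. eth0 and the sleep interval are assumptions.
IFACE=eth0

ip link set "$IFACE" down
systemctl restart glusterd
sleep 5   # give glusterd a moment to finish (or fail) initialization

if systemctl is-active --quiet glusterd; then
    echo "PASS: glusterd survived restart with interface down"
else
    echo "FAIL: glusterd is not running; inspect logs and any cores"
    journalctl -u glusterd --no-pager | tail -n 20
fi
```

`systemctl is-active --quiet` exits 0 only when the unit is active, so the script's PASS/FAIL branch mirrors what was checked manually on the fixed build.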
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607