Description of problem:
Had a three-node cluster and the network on the other two nodes went down. Restarted the network on the other two nodes and restarted glusterd on the first node, and I see that glusterd crashed and generated a core dump.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-2.el7rhgs.x86_64

How reproducible:
Seen it once.

Steps to Reproduce:
1. Had a three-node cluster.
2. Restarted glusterd on the first node so that it connects with the two other nodes on which the network was down.

Actual results:
glusterd crashed and dumped core.

Expected results:
glusterd should not crash.

Additional info:
bt from the core dump file:
===============================
[root@rhsqa-grafton4 home]# gdb -c core.1325.1476277725.dump.1
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
[New LWP 1539]
[New LWP 1328]
[New LWP 1326]
[New LWP 1327]
[New LWP 1329]
[New LWP 1325]

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.

warning: .dynamic section for "/usr/lib64/libnss_myhostname.so.2" is not at the expected address (wrong library or version mismatch?)

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f6649c42840, context=0x7f6649c42840, fs=0x7f6649c42930) at ./md-unwind-support.h:58
58        if (*(unsigned char *)(pc+0) == 0x48

(gdb) bt
#0  x86_64_fallback_frame_state (context=0x7f6649c42840, context=0x7f6649c42840, fs=0x7f6649c42930) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f6649c42840, fs=fs@entry=0x7f6649c42930) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f664c2fd019 in _Unwind_Backtrace (trace=0x7f6657f18f50 <backtrace_helper>, trace_argument=0x7f6649c42af0) at ../../../libgcc/unwind.inc:290
#3  0x00007f6657f190c6 in __GI___backtrace (array=array@entry=0x7f6649c42b30, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f665975d8d2 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1094
#5  0x00007f6659767364 in gf_print_trace (signum=11, ctx=0x7f665a5ab010) at common-utils.c:750
#6  <signal handler called>
#7  0x00007f664dfb88ad in ?? ()
#8  0x0000000000000000 in ?? ()
(gdb) quit
The backtrace seems to be from the glibc stack; going forward, please capture 'thread apply all bt'. I'll take a look at this bug on Monday.
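For reference, a command sequence along the following lines captures the per-thread backtraces from the core to a file (the core path is a placeholder; it assumes the glusterfs debuginfo packages matching the installed build are present):

# gdb /usr/sbin/glusterfsd -c /path/to/core
(gdb) set pagination off
(gdb) set logging file gdb-threads.txt
(gdb) set logging on
(gdb) thread apply all bt full
(gdb) set logging off
(gdb) quit

Attaching the resulting gdb-threads.txt to the bug would give us the per-thread state in one shot.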
Here is the full backtrace:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f6649c42840, context=0x7f6649c42840, fs=0x7f6649c42930) at ./md-unwind-support.h:58
58        if (*(unsigned char *)(pc+0) == 0x48

(gdb) t a a bt

Thread 6 (Thread 0x7f6659c37780 (LWP 1325)):
#0  0x00007f6657ef80b7 in unlink () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6659784925 in sys_unlink (pathname=<optimized out>) at syscall.c:137
#2  0x00007f6659c54a09 in glusterfs_pidfile_cleanup (ctx=ctx@entry=0x7f665a5ab010) at glusterfsd.c:1980
#3  0x00007f6659c54b0a in cleanup_and_exit (signum=1) at glusterfsd.c:1316
#4  0x00007f6659c55542 in glusterfs_process_volfp (ctx=ctx@entry=0x7f665a5ab010, fp=fp@entry=0x7f665a5f52d0) at glusterfsd.c:2300
#5  0x00007f6659c5569d in glusterfs_volumes_init (ctx=ctx@entry=0x7f665a5ab010) at glusterfsd.c:2336
#6  0x00007f6659c51ace in main (argc=5, argv=<optimized out>) at glusterfsd.c:2448

Thread 5 (Thread 0x7f664ea2c700 (LWP 1329)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f6659795b08 in syncenv_task (proc=proc@entry=0x7f665a5f1530) at syncop.c:603
#2  0x00007f6659796950 in syncenv_processor (thdata=0x7f665a5f1530) at syncop.c:695
#3  0x00007f66585bedc5 in start_thread (arg=0x7f664ea2c700) at pthread_create.c:308
#4  0x00007f6657f051cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f664fa2e700 (LWP 1327)):
#0  0x00007f66585c5e91 in do_sigwait (sig=0x7f664fa2de1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1  __sigwait (set=set@entry=0x7f664fa2de20, sig=sig@entry=0x7f664fa2de1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2  0x00007f6659c54bfb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#3  0x00007f66585bedc5 in start_thread (arg=0x7f664fa2e700) at pthread_create.c:308
#4  0x00007f6657f051cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f665022f700 (LWP 1326)):
#0  0x00007f66585c596d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f665976a926 in gf_timer_proc (data=0x7f665a5f0b70) at timer.c:176
#2  0x00007f66585bedc5 in start_thread (arg=0x7f665022f700) at pthread_create.c:308
#3  0x00007f6657f051cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f664f22d700 (LWP 1328)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f6659795b08 in syncenv_task (proc=proc@entry=0x7f665a5f1170) at syncop.c:603
#2  0x00007f6659796950 in syncenv_processor (thdata=0x7f665a5f1170) at syncop.c:695
#3  0x00007f66585bedc5 in start_thread (arg=0x7f664f22d700) at pthread_create.c:308
#4  0x00007f6657f051cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f6649c44700 (LWP 1539)):
#0  x86_64_fallback_frame_state (context=0x7f6649c42840, context=0x7f6649c42840, fs=0x7f6649c42930) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f6649c42840, fs=fs@entry=0x7f6649c42930) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f664c2fd019 in _Unwind_Backtrace (trace=0x7f6657f18f50 <backtrace_helper>, trace_argument=0x7f6649c42af0) at ../../../libgcc/unwind.inc:290
#3  0x00007f6657f190c6 in __GI___backtrace (array=array@entry=0x7f6649c42b30, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f665975d8d2 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1094
#5  0x00007f6659767364 in gf_print_trace (signum=<optimized out>, ctx=<optimized out>) at common-utils.c:750
#6  <signal handler called>
#7  0x00007f664dfb88ad in ?? ()
#8  0x0000000000000000 in ?? ()

<log snippet>
[2016-10-12 13:08:45.874724] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume data. Stopping local bricks.
[2016-10-12 13:08:45.876004] I [MSGID: 106544] [glusterd.c:156:glusterd_uuid_init] 0-management: retrieved UUID: a6bdd555-29b2-4121-ad28-e37dfff9b898
[2016-10-12 13:08:45.876848] E [socket.c:3087:socket_connect] 0-management: connection attempt on failed, (Network is unreachable)
[2016-10-12 13:08:45.876943] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume engine. Stopping local bricks.
[2016-10-12 13:08:45.878073] E [MSGID: 106187] [glusterd-store.c:4409:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2016-10-12 13:08:45.878099] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-10-12 13:08:45.878107] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2016-10-12 13:08:45.878114] E [MSGID: 101176] [graph.c:673:glusterfs_graph_activate] 0-graph: init failed
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2016-10-12 13:08:45
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
[2016-10-12 13:08:45.879137] W [glusterfsd.c:1288:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7f6659c5569d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x172) [0x7f6659c55542] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x7f6659c54abb] ) 0-: received signum (1), shutting down

It looks like the glusterd instance couldn't come up at that time because it was unable to resolve the brick addresses, which is only possible if the network is not reachable. There is a log entry that supports this:

[2016-10-12 13:08:45.876848] E [socket.c:3087:socket_connect] 0-management: connection attempt on failed, (Network is unreachable)

Thread 6 indicates that cleanup_and_exit was triggered because glusterd failed to come up. glusterd then crashed, and the stack ended up corrupted (frames 7 & 8 of thread 1) because the print-trace code itself crashed, probably on an invalid memory access. In its current form the backtrace is very difficult to debug, and since the crash is on the cleanup path there is no functional impact. If you happen to see the crash again, please update the bug; a consistent reproducer would certainly help us narrow down the issue. For now, I am moving this BZ to 3.2.0-beyond.
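To illustrate the mechanism described above (this is a sketch, not the actual glusterd code): gf_print_trace() runs inside the SIGSEGV handler and calls backtrace(), which asks the libgcc unwinder to walk the interrupted stack; if the faulting frame's program counter is already garbage (frame #7 above), the unwinder dereferences that bad PC in x86_64_fallback_frame_state and faults a second time, which is exactly what frames 0-5 of thread 1 show. A minimal self-contained C sketch of that pattern follows; all names and the bogus address are hypothetical.

/* Sketch of the crash pattern in thread 1: a SIGSEGV handler that logs a
 * backtrace the way gf_print_trace() does.  Not glusterfs code. */
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void crash_handler(int signum)
{
    void *frames[200];

    /* backtrace() walks the interrupted thread's stack through the libgcc
     * unwinder (_Unwind_Backtrace).  If the interrupted frame's PC is
     * invalid -- e.g. the thread jumped through a stale function pointer --
     * the unwinder reads through that bad PC and the handler itself
     * segfaults, matching frames 0-3 of thread 1 above. */
    int n = backtrace(frames, 200);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + signum);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = crash_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* Mimic frame #7 ("?? ()" at an unmapped address): call through an
     * invalid function pointer so the faulting PC is not in any mapped
     * code region.  The address is arbitrary. */
    void (*bogus)(void) = (void (*)(void))0x7f664dfb88adUL;
    bogus();
    return 0;
}

The point of the sketch is only to show why the crash reporter itself appears at the top of thread 1: once the stack is corrupted, the handler's attempt to describe the original fault can fault again, leaving a backtrace that says little about the original bug.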
We have a similar bug, bug 1472267, which has been addressed; the upstream patch for it is below.

Upstream patch: https://review.gluster.org/#/c/17813/
Build: 3.12.2-4

Followed the steps and haven't seen any glusterd crash. Hence marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607