Description of problem:
-----------------------
On a node that hosts bricks for gluster volumes, if the network interface is down and glusterd is restarted, glusterd crashes and dumps core.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3 ( glusterfs-3.7.9-12.elrhgs )
RHGS 3.2.0 ( glusterfs-3.8.4-18.el7rhgs )
RHGS 3.2.0 async ( glusterfs-3.8.4-18.4.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a Trusted Storage Pool ( gluster cluster )
2. Create a volume of any type
3. Select the node in the cluster that hosts the brick
4. Using console access to that node, bring down its network interface
5. Restart glusterd on that node

Actual results:
---------------
glusterd crashed and dumped core

Expected results:
-----------------
glusterd should not crash on restart while the network interface is down
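The steps above can be sketched as a shell session. This is a hypothetical illustration: the peer hostnames, volume name, brick path, and interface name are assumptions, not taken from the report.

```shell
# Hypothetical 3-node reproduction sketch; server1-3, testvol, the
# brick path, and eth0 are illustrative assumptions.
VOLNAME=testvol
IFACE=eth0

# 1. Form the Trusted Storage Pool (run on server1)
gluster peer probe server2
gluster peer probe server3

# 2. Create and start a volume of any type (replica 3 shown here)
gluster volume create "$VOLNAME" replica 3 \
    server1:/bricks/brick1/"$VOLNAME" \
    server2:/bricks/brick1/"$VOLNAME" \
    server3:/bricks/brick1/"$VOLNAME"
gluster volume start "$VOLNAME"

# 3-5. On the brick-hosting node, via console access (SSH would be
# cut off once the interface goes down):
ip link set "$IFACE" down
systemctl restart glusterd   # crashed with SIGSEGV on affected builds
```

Console access in step 4 matters because bringing the interface down severs any remote session to the node.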
gdb backtrace
--------------
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f91c95eee00, context=0x7f91c95eee00, fs=0x7f91c95eeef0) at ./md-unwind-support.h:58
58        if (*(unsigned char *)(pc+0) == 0x48

gdb backtrace from all threads
------------------------------
(gdb) t a a bt

Thread 7 (Thread 0x7f91d9e91780 (LWP 14715)):
#0  0x00007f91d882143d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91d99e1475 in sys_write (fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at syscall.c:270
#2  0x00007f91d9eaf539 in glusterfs_process_volfp (ctx=ctx@entry=0x7f91dac5b010, fp=fp@entry=0x7f91daca52d0) at glusterfsd.c:2299
#3  0x00007f91d9eaf69d in glusterfs_volumes_init (ctx=ctx@entry=0x7f91dac5b010) at glusterfsd.c:2336
#4  0x00007f91d9eabace in main (argc=5, argv=<optimized out>) at glusterfsd.c:2448

Thread 6 (Thread 0x7f91cec83700 (LWP 14719)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f91d99f2538 in syncenv_task (proc=proc@entry=0x7f91daca1530) at syncop.c:603
#2  0x00007f91d99f3380 in syncenv_processor (thdata=0x7f91daca1530) at syncop.c:695
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cec83700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f91c9df3700 (LWP 14934)):
#0  0x00007f91d8154e2d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91cb59dda9 in poll (__timeout=-1, __nfds=2, __fds=0x7f91c9df2e80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7f91dad607d0) at socket.c:2500
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91c9df3700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f91d0486700 (LWP 14716)):
#0  0x00007f91d8821bdd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f91d99c6fe6 in gf_timer_proc (data=0x7f91daca0b70) at timer.c:176
#2  0x00007f91d881adc5 in start_thread (arg=0x7f91d0486700) at pthread_create.c:308
#3  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f91cfc85700 (LWP 14717)):
#0  0x00007f91d8822101 in do_sigwait (sig=0x7f91cfc84e1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1  __sigwait (set=set@entry=0x7f91cfc84e20, sig=sig@entry=0x7f91cfc84e1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2  0x00007f91d9eaebfb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cfc85700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f91cf484700 (LWP 14718)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f91d99f2538 in syncenv_task (proc=proc@entry=0x7f91daca1170) at syncop.c:603
#2  0x00007f91d99f3380 in syncenv_processor (thdata=0x7f91daca1170) at syncop.c:695
#3  0x00007f91d881adc5 in start_thread (arg=0x7f91cf484700) at pthread_create.c:308
#4  0x00007f91d815f76d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f91c95f2700 (LWP 15480)):
#0  x86_64_fallback_frame_state (context=0x7f91c95eee00, context=0x7f91c95eee00, fs=0x7f91c95eeef0) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f91c95eee00, fs=fs@entry=0x7f91c95eeef0) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f91cc50a019 in _Unwind_Backtrace (trace=0x7f91d81734f0 <backtrace_helper>, trace_argument=0x7f91c95ef0b0) at ../../../libgcc/unwind.inc:290
#3  0x00007f91d8173666 in __GI___backtrace (array=array@entry=0x7f91c95ef0f0, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f91d99b9ce2 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1094
#5  0x00007f91d99c3884 in gf_print_trace (signum=<optimized out>, ctx=<optimized out>) at common-utils.c:755
#6  <signal handler called>
#7  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#8  0x00007f91d80af1c2 in __find_specmb (format=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>) at printf-parse.h:109
#9  _IO_vfprintf_internal (s=s@entry=0x7f91c95f07e0, format=format@entry=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, ap=ap@entry=0x7f91c95f09d8) at vfprintf.c:1308
#10 0x00007f91d8176a45 in __GI___vasprintf_chk (result_ptr=result_ptr@entry=0x7f91c95f09b8, flags=flags@entry=1, format=format@entry=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, args=args@entry=0x7f91c95f09d8) at vasprintf_chk.c:66
#11 0x00007f91d99bad54 in vasprintf (__ap=0x7f91c95f09d8, __fmt=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>, __ptr=0x7f91c95f09b8) at /usr/include/bits/stdio2.h:210
#12 _gf_msg (domain=0x7f91dacaa4c0 "management", file=0x7f91ce253f3a <Address 0x7f91ce253f3a out of bounds>, function=0x7f91ce2543b0 <Address 0x7f91ce2543b0 out of bounds>, line=664, level=GF_LOG_ERROR, errnum=22, trace=1, msgid=101172, fmt=0x7f91ce232210 <Address 0x7f91ce232210 out of bounds>) at logging.c:2069
#13 0x00007f91ce20f3ae in ?? ()
#14 0x00007f9100000001 in ?? ()
#15 0x0000000000018b34 in ?? ()
#16 0x00007f91ce232210 in ?? ()
#17 0x0000000000000000 in ?? ()
snip from glusterd logs:
------------------------
[2017-06-14 08:22:10.739716] I [MSGID: 106004] [glusterd-handler.c:5808:__glusterd_peer_rpc_notify] 0-management: Peer <10.70.36.74> (<0c2f8929-3a24-4b33-95ea-9810b98f0027>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-06-14 08:22:10.740199] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume AppDisksVol. Stopping local bricks.
[2017-06-14 08:22:10.831748] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume BootDisksVol. Stopping local bricks.
[2017-06-14 08:22:10.831815] E [MSGID: 106187] [glusterd-store.c:4417:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2017-06-14 08:22:10.831871] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2017-06-14 08:22:10.831885] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2017-06-14 08:22:10.831896] E [MSGID: 101176] [graph.c:673:glusterfs_graph_activate] 0-graph: init failed
[2017-06-14 08:22:10.831919] E [glusterd-peer-utils.c:153:glusterd_hostname_to_uuid] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x554a4) [0x7f3f69c9f4a4] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x43be0) [0x7f3f69c8dbe0] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x102ad6) [0x7f3f69d4cad6] ) 0-: Assertion failed: priv

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-06-14 08:22:10
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
Created attachment 1287961 [details] glusterd coredump from the node
As Atin has already mentioned, glusterd is unable to resolve the bricks on restart while the network interface is down. We have a similar bug, bug 1472267, which has already been addressed; the upstream patch for it is below.

Upstream patch: https://review.gluster.org/#/c/17813/
Tested with the RHGS 3.4.0 nightly build ( glusterfs-3.12.2-17.el7rhgs ) using the following steps:

1. Create a 3-node RHGS trusted storage pool
2. Create 3 volumes of type replicate
3. Start the volumes
4. Get a console connection to one of the RHGS server nodes and bring down the interface(s)
5. Restart glusterd

No glusterd crashes were observed.
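The check in step 5 can be scripted along these lines. This is a sketch, not part of the original verification run; the interface name and the 5-second settle time are assumptions.

```shell
# After restarting glusterd with the interface down, confirm the
# daemon is still running. eth0 and the sleep interval are assumptions.
IFACE=eth0

ip link set "$IFACE" down
systemctl restart glusterd
sleep 5   # give glusterd a moment to finish (or fail) initialization

if systemctl is-active --quiet glusterd; then
    echo "PASS: glusterd survived restart with interface down"
else
    echo "FAIL: glusterd is not running; inspect logs and any cores"
    journalctl -u glusterd --no-pager | tail -n 20
fi
```

`systemctl is-active --quiet` exits 0 only when the unit is active, so the script's PASS/FAIL branch mirrors what was checked manually on the fixed build.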
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607