+++ This bug was initially created as a clone of Bug #1598733 +++

Description of problem:
-----------------------
Created a RHHI setup where there are 2 hypervisors ( RHVH 4.2.5-1 ) in the gluster cluster. glusterd on one of the nodes has crashed. While checking the logs, I could observe the following indications:
1. Some other node tried to include this node in its cluster for a long time
2. There were momentary DNS address resolution failures ( getaddrinfo fails )

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.3.1 ( glusterfs-3.8.4-54.13.el7rhgs ) [ not yet publicly available ]

How reproducible:
------------------
Seen a couple of times

Steps to Reproduce:
-------------------
1. Create a RHHI setup with 3 nodes. 3 volumes are available
2. Leave the setup for a couple of days

Actual results:
---------------
glusterd crashed

Expected results:
-----------------
glusterd should not crash

Additional info:
I have noticed the following hints from the logs:

1. Some unknown host ( 10.70.37.122 ) from outside the trusted storage pool tries to probe this host ( 10.70.37.200 )

<snip>
[2018-06-30 22:02:58.909285] E [MSGID: 106170] [glusterd-handshake.c:1128:gd_validate_mgmt_hndsk_req] 0-management: Rejecting management handshake request from unknown peer 10.70.37.122:49141
</snip>

[root@localhost ~]# grep -i rejecting /var/log/glusterfs/glusterd.log* | wc -l
148160

2. I see DNS resolution failures ( a minimal sketch of this getaddrinfo error path is included after the backtraces below )

<snip1>
[2018-07-06 07:52:23.483572] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host rhsqa-grafton4-nic2.lab.eng.blr.redhat.com
[2018-07-06 07:52:23.484206] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host rhsqa-grafton6-nic2.lab.eng.blr.redhat.com
</snip1>

<snip2>
[2018-07-06 07:44:57.602837] E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
The message "E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)" repeated 79 times between [2018-07-06 07:44:57.602837] and [2018-07-06 07:46:54.671084]
</snip2>

[root@localhost ~]# grep -i "DNS resolution failed" /var/log/glusterfs/glusterd.log* | wc -l
1834

3. Traces of the crash as seen in the glusterd log
--------------------------------------------------
<snip>
[2018-07-06 07:52:23.483677] I [MSGID: 106004] [glusterd-handler.c:6366:__glusterd_peer_rpc_notify] 0-management: Peer <rhsqa-grafton4-nic2.lab.eng.blr.redhat.com> (<f5aa6117-4f10-4aa8-b46a-02f1d4fa1a28>), in state <Peer in Cluster>, has disconnected from glusterd.
[2018-07-06 07:52:23.483818] C [MSGID: 106002] [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume data. Stopping local bricks.
[2018-07-06 07:52:23.484185] E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2018-07-06 07:52:23.484206] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host rhsqa-grafton6-nic2.lab.eng.blr.redhat.com
[2018-07-06 07:52:23.484253] I [MSGID: 106544] [glusterd.c:157:glusterd_uuid_init] 0-management: retrieved UUID: 0e8bb8d4-dc18-4d7b-897d-931cd90e55af
[2018-07-06 07:52:23.487709] C [MSGID: 106002] [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume engine. Stopping local bricks.
[2018-07-06 07:52:23.487904] E [MSGID: 106187] [glusterd-store.c:4451:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2018-07-06 07:52:23.487940] E [MSGID: 101019] [xlator.c:486:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-07-06 07:52:23.487953] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2018-07-06 07:52:23.487961] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed

pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2018-07-06 07:52:23
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4

[2018-07-06 07:52:23.488443] W [glusterfsd.c:1300:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55f21da6ab5d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x1b1) [0x55f21da6aa01] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55f21da69eeb] ) 0-: received signum (1), shutting down
</snip>

4. Backtrace
--------------
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f69527d2740, context=0x7f69527d2740, fs=0x7f69527d2830) at ./md-unwind-support.h:58
58        if (*(unsigned char *)(pc+0) == 0x48
(gdb)

5. All threads backtrace
-------------------------
(gdb) t a a bt

Thread 8 (Thread 0x7f6962988780 (LWP 2402)):
#0  0x00007f6960bc7084 in do_fcntl (arg=<optimized out>, cmd=6, fd=6) at ../sysdeps/unix/sysv/linux/fcntl.c:39
#1  __GI___libc_fcntl (fd=6, cmd=6) at ../sysdeps/unix/sysv/linux/fcntl.c:88
#2  0x00007f6960bc7225 in lockf (fd=<optimized out>, cmd=<optimized out>, cmd@entry=0, len=len@entry=0) at lockf.c:80
#3  0x000055f21da69e54 in glusterfs_pidfile_cleanup (ctx=ctx@entry=0x55f21f2b3370) at glusterfsd.c:2008
#4  0x000055f21da69f73 in cleanup_and_exit (signum=1) at glusterfsd.c:1341
#5  0x000055f21da6aa01 in glusterfs_process_volfp (ctx=ctx@entry=0x55f21f2b3370, fp=fp@entry=0x55f21f2f4800) at glusterfsd.c:2329
#6  0x000055f21da6ab5d in glusterfs_volumes_init (ctx=ctx@entry=0x55f21f2b3370) at glusterfsd.c:2365
#7  0x000055f21da66e8f in main (argc=5, argv=<optimized out>) at glusterfsd.c:2485

Thread 7 (Thread 0x7f69592ac700 (LWP 2404)):
#0  0x00007f6961314411 in do_sigwait (sig=0x7f69592abe1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1  __sigwait (set=set@entry=0x7f69592abe20, sig=sig@entry=0x7f69592abe1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2  0x000055f21da6a07b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2079
#3  0x00007f696130cdd5 in start_thread (arg=0x7f69592ac700) at pthread_create.c:308
#4  0x00007f6960bd5b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 6 (Thread 0x7f6959aad700 (LWP 2403)):
#0  0x00007f6961313eed in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f69624b9f2e in gf_timer_proc (data=0x55f21f2efe50) at timer.c:176
#2  0x00007f696130cdd5 in start_thread (arg=0x7f6959aad700) at pthread_create.c:308
#3  0x00007f6960bd5b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f6958aab700 (LWP 2405)):
#0  0x00007f6960b9c4fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6960b9c394 in __sleep (seconds=0, seconds@entry=30) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x00007f69624d340d in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3  0x00007f696130cdd5 in start_thread (arg=0x7f6958aab700) at pthread_create.c:308
#4  0x00007f6960bd5b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f6957aa9700 (LWP 2407)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f69624e59e8 in syncenv_task (proc=proc@entry=0x55f21f2f0a60) at syncop.c:603
#2  0x00007f69624e6830 in syncenv_processor (thdata=0x55f21f2f0a60) at syncop.c:695
#3  0x00007f696130cdd5 in start_thread (arg=0x7f6957aa9700) at pthread_create.c:308
#4  0x00007f6960bd5b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f69582aa700 (LWP 2406)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f69624e59e8 in syncenv_task (proc=proc@entry=0x55f21f2f06a0) at syncop.c:603
#2  0x00007f69624e6830 in syncenv_processor (thdata=0x55f21f2f06a0) at syncop.c:695
#3  0x00007f696130cdd5 in start_thread (arg=0x7f69582aa700) at pthread_create.c:308
#4  0x00007f6960bd5b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f6951fd3700 (LWP 2999)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f69624e35ab in __synclock_lock (lock=lock@entry=0x7f6962820818) at syncop.c:924
#2  0x00007f69624e6ae6 in synclock_lock (lock=0x7f6962820818) at syncop.c:950
#3  0x00007f6956f49bdb in ?? ()
#4  0x0000000000000008 in ?? ()
#5  0x000055f21f3a7160 in ?? ()
#6  0x000055f21f3a7130 in ?? ()
#7  0x000055f21f2f8df0 in ?? ()
#8  0x00000000ffffffff in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f69527d4700 (LWP 2997)):
#0  x86_64_fallback_frame_state (context=0x7f69527d2740, context=0x7f69527d2740, fs=0x7f69527d2830) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f69527d2740, fs=fs@entry=0x7f69527d2830) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f6955109fb9 in _Unwind_Backtrace (trace=0x7f6960beba30 <backtrace_helper>, trace_argument=0x7f69527d29f0) at ../../../libgcc/unwind.inc:290
#3  0x00007f6960bebba6 in __GI___backtrace (array=array@entry=0x7f69527d2a30, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f69624ac822 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1094
#5  0x00007f69624b6354 in gf_print_trace (signum=<optimized out>, ctx=<optimized out>) at common-utils.c:757
#6  <signal handler called>
#7  0x00007f6957032358 in ?? ()
#8  0x000055f21f378f10 in ?? ()
#9  0x000055f21f385ce8 in ?? ()
#10 0x00007f69569b8123 in ?? ()
#11 0x00007f695703218d in ?? ()
#12 0x0000000000000000 in ?? ()
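
6. Note on the getaddrinfo failures ( point 2 above )
------------------------------------------------------
The "getaddrinfo failed (Name or service not known)" message from gf_resolve_ip6 corresponds to glibc's getaddrinfo() returning EAI_NONAME. The following is a minimal standalone sketch of that error path, not glusterfs code; the hostname is a placeholder:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int
main (void)
{
        struct addrinfo hints, *res = NULL;
        /* placeholder name, stands in for a peer hostname that
         * momentarily fails to resolve */
        const char *host = "nonexistent-host.example.com";
        int ret;

        memset (&hints, 0, sizeof (hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        ret = getaddrinfo (host, NULL, &hints, &res);
        if (ret != 0) {
                /* a name that does not resolve surfaces here as
                 * EAI_NONAME; a DNS server outage may instead give
                 * EAI_AGAIN */
                fprintf (stderr, "getaddrinfo failed (%s)\n",
                         gai_strerror (ret));
                return 1;
        }

        freeaddrinfo (res);
        return 0;
}

Running this against a non-resolvable name prints the same "Name or service not known" text seen in the glusterd log; a temporary DNS server outage can instead surface as "Temporary failure in name resolution" ( EAI_AGAIN ).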
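7. Note on thread 1 of the core
--------------------------------
The frame the core stops in ( x86_64_fallback_frame_state ) appears to be inside libgcc's unwinder, which was invoked from the SIGSEGV handler path gf_print_trace -> _gf_msg_backtrace_nomem -> backtrace(). Frames #7-#12 below <signal handler called> could not be resolved, which suggests the original faulting stack was already corrupt when the handler tried to walk it. A minimal sketch of that handler pattern ( illustrative only, assuming a plain execinfo-based crash handler, not the actual glusterfs implementation ):

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void
crash_handler (int signum)
{
        void *frames[200];
        int   count;

        /* backtrace() walks the stack of the thread that received the
         * signal; if that stack is corrupt, the unwinder itself can
         * fault while reading a saved program counter */
        count = backtrace (frames, 200);
        backtrace_symbols_fd (frames, count, STDERR_FILENO);

        /* restore the default action and re-raise so a core is kept */
        signal (signum, SIG_DFL);
        raise (signum);
}

int
main (void)
{
        struct sigaction sa;

        memset (&sa, 0, sizeof (sa));
        sa.sa_handler = crash_handler;
        sigemptyset (&sa.sa_mask);
        sigaction (SIGSEGV, &sa, NULL);

        /* deliberate null dereference to exercise the handler */
        *(volatile int *) 0 = 0;
        return 0;
}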
Created attachment 1457012 [details] glusterd log file
Sas, were you able to observe this crash apart from this instance?
Sas - we'd need your help in getting this replicated; otherwise we won't be able to take any action against this bug and will have to close it as not reproducible. The current core file didn't give us much of a clue.
We tried to reproduce this crash multiple times and couldn't. In case this is seen again, please feel free to reopen. At this point in time, we don't have enough data points to work on this bug.