Bug 1736848

Summary: Running "gluster peer probe invalid_hostname" causes a thread deadlock or crashes the glusterd process
Product: [Community] GlusterFS Reporter: xlfy555
Component: glusterd Assignee: Sanju <srakonde>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: high    
Version: 6 CC: amukherj, bugs, moagrawa
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2020-01-31 11:38:36 UTC Type: Bug

Description xlfy555 2019-08-02 07:56:17 UTC
Description of problem:
After glusterd starts, running the command "gluster peer probe invalid_hostname" produces different results on different machines: on some machines glusterd crashes and produces a core file, and on others the glusterd process accumulates many additional child threads.

Version-Release number of selected component (if applicable):
release-6

How reproducible:


Steps to Reproduce:
Case 1
1. glusterd
2. gluster peer probe invalid_hostname

Case 2
1. glusterd
2. gluster peer probe invalid_hostname
3. gluster peer probe invalid_hostname
4. gluster peer probe invalid_hostname (repeat a few more times)
5. ps -aux | grep glusterd
6. gdb attach glusterd-pid
7. info thr (many child threads will be shown waiting in "__lll_lock_wait()")
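
The repeated probes in Case 2 can be scripted. The following is only a rough sketch and not part of the original report: the function name probe_repeatedly and the GLUSTER override variable are hypothetical, and the default of 5 iterations is an arbitrary choice.

```shell
# Run "gluster peer probe invalid_hostname" N times (default 5).
# GLUSTER is a hypothetical override so the loop can be dry-run,
# for example with GLUSTER=echo; it is not a gluster feature.
probe_repeatedly() {
    count="${1:-5}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # The probe is expected to fail for an invalid hostname,
        # so its exit status is deliberately ignored.
        "${GLUSTER:-gluster}" peer probe invalid_hostname || true
        i=$((i + 1))
    done
}
```

With glusterd already running, calling probe_repeatedly 5 would stand in for steps 2 to 4 before inspecting the process with gdb.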

Actual results:
Case 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `glusterd'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fef4bd208ff in rpc_clnt_handle_disconnect (conn=0x7fef34007890, clnt=0x7fef34007860) at rpc-clnt.c:832
832	        if (!conn->rpc_clnt->disabled && (conn->reconnect == NULL)) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 openssl-libs-1.0.1e-60.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7.x86_64 userspace-rcu-0.7.16-1.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fef4bd208ff in rpc_clnt_handle_disconnect (conn=0x7fef34007890, clnt=0x7fef34007860) at rpc-clnt.c:832
#1  rpc_clnt_notify (trans=0x7fef34007be0, mydata=0x7fef34007890, event=<optimized out>, data=<optimized out>) at rpc-clnt.c:878
#2  0x00007fef4bd1d4e3 in rpc_transport_notify (this=<optimized out>, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-transport.c:542
#3  0x00007fef3f3634d7 in socket_connect_error_cbk (opaque=0x7fef34007190) at socket.c:3239
#4  0x00007fef4adb6dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fef4a6fb73d in clone () from /usr/lib64/libc.so.6
(gdb) p conn->rpc_clnt
$1 = (struct rpc_clnt *) 0x14860
(gdb) p conn->rpc_clnt->disabled
Cannot access memory at address 0x149a0



Case 2
(gdb) info thr
  Id   Target Id         Frame 
  16   Thread 0x7ff384f45700 (LWP 18259) "glfs_timer" 0x00007ff38c728bdd in nanosleep () from /usr/lib64/libpthread.so.0
  15   Thread 0x7ff384744700 (LWP 18260) "glfs_sigwait" 0x00007ff38c729101 in sigwait () from /usr/lib64/libpthread.so.0
  14   Thread 0x7ff383f43700 (LWP 18261) "glfs_memsweep" 0x00007ff38c02d66d in nanosleep () from /usr/lib64/libc.so.6
  13   Thread 0x7ff383742700 (LWP 18262) "glfs_sproc0" 0x00007ff38c725a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  12   Thread 0x7ff382f41700 (LWP 18263) "glfs_sproc1" 0x00007ff38c725a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  11   Thread 0x7ff382740700 (LWP 18264) "glusterd" 0x00007ff38c05dba3 in select () from /usr/lib64/libc.so.6
  10   Thread 0x7ff37f2c1700 (LWP 18290) "glfs_gdhooks" 0x00007ff38c7256d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  9    Thread 0x7ff37eac0700 (LWP 18291) "glfs_epoll000" 0x00007ff38c066d13 in epoll_wait () from /usr/lib64/libc.so.6
  8    Thread 0x7ff37d216700 (LWP 18306) "glfs_scleanup" 0x00007ff38c7281bd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
  7    Thread 0x7ff37ca15700 (LWP 18307) "glfs_scleanup" 0x00007ff38c060bf9 in syscall () from /usr/lib64/libc.so.6
  6    Thread 0x7ff367fff700 (LWP 18315) "glfs_scleanup" 0x00007ff38c7281bd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
  5    Thread 0x7ff3677fe700 (LWP 18323) "glfs_scleanup" 0x00007ff38c7281bd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
  4    Thread 0x7ff366ffd700 (LWP 18331) "glfs_scleanup" 0x00007ff38c7281bd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
  3    Thread 0x7ff3667fc700 (LWP 18339) "glfs_scleanup" 0x00007ff38c7281bd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
  2    Thread 0x7ff365ffb700 (LWP 18347) "glfs_scleanup" 0x00007ff38c7281bd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
* 1    Thread 0x7ff38de22480 (LWP 18258) "glusterd" 0x00007ff38c722ef7 in pthread_join () from /usr/lib64/libpthread.so.0
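
To compare machines, it may help to count how many threads are sitting in __lll_lock_wait (six in the listing above). The following is a minimal sketch under stated assumptions: the function name count_lock_waiters is hypothetical, and the commented usage line assumes pgrep is available and that a single glusterd process is the one of interest.

```shell
# Count lines of gdb "info threads" output whose frame is __lll_lock_wait.
# Reads the gdb output on stdin and prints the count.
count_lock_waiters() {
    # grep -c exits non-zero when nothing matches; still print 0 and succeed.
    grep -c '__lll_lock_wait' || true
}

# Hypothetical usage against a live process (pid lookup via pgrep is an assumption):
# gdb -p "$(pgrep -o -x glusterd)" -batch -ex 'info threads' 2>/dev/null | count_lock_waiters
```

A count that grows with each additional failed probe would match the behaviour described in Case 2.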

Expected results:


Additional info:

Comment 1 Sanju 2020-01-31 11:38:36 UTC
I can no longer reproduce this on the latest master head, even after a couple of tries, so I am closing this as CURRENTRELEASE.