Description of problem:
----------------------
My setup consisted of 4 servers and 4 clients (with both management and I/O encryption enabled). Each client mounts from a different server via FUSE. I tried to create 2500 files from each client. Almost 15 minutes into my workload, I saw an I/O hang on one of my clients. A normal Ctrl+C/Ctrl+Z could not break out of it, so the process had to be killed from another terminal.

Version-Release number of selected component (if applicable):
------------------------------------------------------------
[root@gqas013 ~]# rpm -qa|grep gluster
tmp-rhs-tests-beaker-rhs-gluster-qe-libs-dev-bturner-3.0-0.noarch
glusterfs-api-3.7.9-5.el6rhs.x86_64
python-gluster-3.7.9-5.el6rhs.noarch
samba-vfs-glusterfs-4.4.3-4.el6rhs.x86_64
glusterfs-libs-3.7.9-5.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-server-3.7.9-5.el6rhs.x86_64
glusterfs-fuse-3.7.9-5.el6rhs.x86_64
glusterfs-cli-3.7.9-5.el6rhs.x86_64
gluster-nagios-addons-0.2.7-1.el6rhs.x86_64
vdsm-gluster-4.16.30-1.4.el6rhs.noarch
glusterfs-3.7.9-5.el6rhs.x86_64
glusterfs-rdma-3.7.9-5.el6rhs.x86_64
glusterfs-geo-replication-3.7.9-5.el6rhs.x86_64
glusterfs-client-xlators-3.7.9-5.el6rhs.x86_64

[root@gqas013 ~]# rpm -qa|grep ssl
openssl-1.0.1e-48.el6.x86_64
openssl-devel-1.0.1e-48.el6.x86_64
python-backports-ssl_match_hostname-3.4.0.2-2.el6.noarch

How reproducible:
----------------
Reporting the first occurrence.

Steps to Reproduce:
------------------
1. Enable management and I/O encryption on the volume and mount it via FUSE.
2. Run dd from multiple clients.

Actual results:
---------------
I/O hangs after a while on one of the clients.

Expected results:
-----------------
There should be no I/O hangs.

Additional info:
---------------
-
Whenever a hang is seen, it is advised to capture backtraces of the relevant processes, as they help developers figure out what caused the hang: the backtrace should reveal where the process is stuck. The information provided so far doesn't give us any clue about the problem until it is reproduced again. Could you please re-execute this scenario and provide the details sought?
I hit an I/O hang when I did the following. I was testing the fix for bug 1279628 ("[GSS] - gluster v heal volname info does not work with enabled ssl/tls"), so I did the following:

1) Had a 6-node cluster setup.
2) Had a FUSE client identified.
3) There were already about 3 volumes.
4) Now I had to set up SSL for this cluster and the client, hence followed the documentation at "https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html".
5) Once SSL was enabled on both client and servers, I created a new volume called "newssl", a 3x2 distributed-replicate volume:

Volume Name: newssl
Type: Distributed-Replicate
Volume ID: 467962bd-3d4e-4b56-adb2-7b2b0b34634e
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.191:/rhs/brick3/newssl
Brick2: 10.70.35.27:/rhs/brick3/newssl
Brick3: 10.70.35.98:/rhs/brick3/newssl
Brick4: 10.70.35.64:/rhs/brick3/newssl
Brick5: 10.70.35.44:/rhs/brick3/newssl
Brick6: 10.70.35.114:/rhs/brick3/newssl
Options Reconfigured:
cluster.locking-scheme: granular
client.ssl: on
server.ssl: on
auth.ssl-allow: 10.70.35.191,10.70.35.27,10.70.35.98,10.70.35.64,10.70.35.44,10.70.35.114,10.70.35.126
performance.readdir-ahead: on

[root@dhcp35-191 ~]# gluster v status newssl
Status of volume: newssl
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.191:/rhs/brick3/newssl       49157     0          Y       31539
Brick 10.70.35.27:/rhs/brick3/newssl        49156     0          Y       26589
Brick 10.70.35.98:/rhs/brick3/newssl        49157     0          Y       29210
Brick 10.70.35.64:/rhs/brick3/newssl        49158     0          Y       5333
Brick 10.70.35.44:/rhs/brick3/newssl        49156     0          Y       14494
Brick 10.70.35.114:/rhs/brick3/newssl       49157     0          Y       5145
NFS Server on localhost                     2049      0          Y       32315
Self-heal Daemon on localhost               N/A       N/A        Y       32338
NFS Server on 10.70.35.98                   2049      0          Y       29860
Self-heal Daemon on 10.70.35.98             N/A       N/A        Y       29870
NFS Server on 10.70.35.27                   2049      0          Y       27180
Self-heal Daemon on 10.70.35.27             N/A       N/A        Y       27192
NFS Server on 10.70.35.114                  2049      0          Y       5818
Self-heal Daemon on 10.70.35.114            N/A       N/A        Y       5837
NFS Server on 10.70.35.64                   2049      0          Y       6085
Self-heal Daemon on 10.70.35.64             N/A       N/A        Y       6099
NFS Server on 10.70.35.44                   2049      0          Y       15120
Self-heal Daemon on 10.70.35.44             N/A       N/A        Y       15134

Task Status of Volume newssl
------------------------------------------------------------------------------
There are no active volume tasks

6) I then mounted the volume on the SSL-enabled client (10.70.35.126) at two different mount points:
   A) Created a folder /mnt/newssl and mounted the volume: "10.70.35.191:newssl on /mnt/newssl type fuse.glusterfs"
   B) Created another folder /mnt/newssl-dup and mounted the volume: "10.70.35.27:newssl on /mnt/newssl-dup type fuse.glusterfs"
7) From both mount points I started to dd a file of about 10GB size:
   1) from newssl-dup, dd'ed a file called file1 in a directory dir1
   2) from newssl, dd'ed a file called file7 in the root of the volume itself
8) While this was happening, I issued "watch gluster v heal newssl info" on all 6 nodes.
9) I then set the volume option "cluster.locking-scheme: granular" (as part of bug 1311839, "False positives in heal info").

I then saw that the dd I/O was hung forever, and "heal info" was hung as well, not giving me any output. Showed the setup to Pranith; the gdb bt details from the client follow.

#############################################################################
0x00007fa2f40b5ef7 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.4.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libuuid-2.23.2-26.el7.x86_64 openssl-libs-1.0.1e-42.el7_1.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) thread apply all bt

Thread 15 (Thread 0x7fa2ebec0700 (LWP 24958)):
#0  0x00007fa2f40b5ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fa2e9ca81de in _socket_reap_own_threads () at socket.c:223
#2  socket_thread_reaper () at socket.c:251
#3  0x00007fa2f5274c3b in gf_timer_proc (ctx=0x7fa2f601f010) at timer.c:184
#4  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fa2eb6bf700 (LWP 24959)):
#0  0x00007fa2f40bbe91 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fa2f572e8cb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2021
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fa2eaebe700 (LWP 24960)):
#0  0x00007fa2f40b8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa2f529c288 in syncenv_task (proc=proc@entry=0x7fa2f605fd10) at syncop.c:607
#2  0x00007fa2f529cfc0 in syncenv_processor (thdata=0x7fa2f605fd10) at syncop.c:699
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fa2ea6bd700 (LWP 24961)):
#0  0x00007fa2f40b8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa2f529c288 in syncenv_task (proc=proc@entry=0x7fa2f60600d0) at syncop.c:607
#2  0x00007fa2f529cfc0 in syncenv_processor (thdata=0x7fa2f60600d0) at syncop.c:699
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fa2e81de700 (LWP 24963)):
#0  0x00007fa2f39fb863 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fa2f52ba9b0 in event_dispatch_epoll_worker (data=0x7fa2f60a55a0) at event-epoll.c:668
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fa2e49cf700 (LWP 24964)):
#0  0x00007fa2f39fb863 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fa2f52ba9b0 in event_dispatch_epoll_worker (data=0x7fa2e009dcb0) at event-epoll.c:668
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fa2dd7fa700 (LWP 24970)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2dd7f9e80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e01b45e0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fa2c3fff700 (LWP 24974)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c3ffee80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e01840b0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fa2c37fe700 (LWP 24975)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c37fde80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e0123580) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fa2c2ffd700 (LWP 24976)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c2ffce80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e0153b50) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fa2c27fc700 (LWP 24977)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c27fbe80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e00f30b0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fa2c1ffb700 (LWP 24978)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c1ffae80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e00c22f0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fa2c16f9700 (LWP 24979)):
#0  0x00007fa2f39f23a0 in readv () from /lib64/libc.so.6
#1  0x00007fa2ebee4fec in fuse_thread_proc (data=0x7fa2f604c770) at fuse-bridge.c:4896
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fa2c0ef8700 (LWP 24980)):
#0  0x00007fa2f40b86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa2ebed1fc3 in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:3873
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa2f5714780 (LWP 24957)):
#0  0x00007fa2f40b5ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fa2f52baeb8 in event_dispatch_epoll (event_pool=0x7fa2f603dd30) at event-epoll.c:762
#2  0x00007fa2f572b877 in main (argc=4, argv=0x7ffd34cddc88) at glusterfsd.c:2370
#################################################################

Will also add the sosreport location info.
[root@dhcp35-126 gluster]# ps aux | grep gluster
root     24873  3.7  1.5 1131328 61560 ?   Ssl  11:48  1:33 /usr/sbin/glusterfs --volfile-server=10.70.35.191 --volfile-id=newssl /mnt/newssl
root     24957 11.5  1.6 1131328 62792 ?   Ssl  11:49  4:43 /usr/sbin/glusterfs --volfile-server=10.70.35.27 --volfile-id=newssl /mnt/newssl-dup

I have put all the sosreports at:

[nchilaka@rhsqe-repo bug.1337863]$ pwd
/home/repo/sosreports/nchilaka/bug.1337863

I have also put there the statedumps collected by Pranith during debugging. The client sosreport was not completing, hence I took a tar of /var/log and copied it to the client directory in the sosreport location.
Hitting the reported failure on 3.1.2 as well.
From the backtrace it looks like pthread_join is waiting in _socket_reap_own_threads (). This function was introduced recently, in rhgs-3.1.3, with https://code.engineering.redhat.com/gerrit/#/c/73681, and I was suspecting that this patch could have regressed the functionality. However, Comment 7 makes me think the other way now. It'd be really helpful if we could get a backtrace on rhgs-3.1.2 for the same scenario. In the meantime, could you shed some light on this, Jeff?
(In reply to Atin Mukherjee from comment #8)
> However Comment 7 makes me think the other way now. It'd really helpful if
> we can get back trace on rhgs-3.1.2 for the same scenario.

The fact that it occurs in 3.1.2 indicates pretty conclusively that it's not a problem with the reap_own_threads code.

Looking at the logs, it looks like glusterd on at least one of the servers reports several thousand SSL errors ("wrong version number") in a constant stream, about 1.5 seconds apart. This usually means that glusterd is set up to use SSL but clients are trying to make non-SSL connections. The "common name" fields being reported for clients also seem a bit strange. It would be useful to see what "Enable mgt and I/O encryption on the volume and mount it via FUSE" in the original report actually means in detail.
Raising the needinfo on Ambarish, as we need a few clarifications as mentioned in Comment 10.
I usually check the gluster mount log on the client to see if the encryptions are correctly enabled, just to rule out config-related issues. I could see that I/O and glusterd encryption were enabled, as Jeff mentioned.
Hi,

I have tried to reproduce the issue in my VM environment. I am not able to reproduce it after following the steps below:

1) Set up an environment with 4 server (2x2) and 4 client nodes.
2) Create two different mount points from two different servers on one client.
3) Run a loop on all clients simultaneously to generate the I/O load:

dd if=/dev/zero of=/mnt/file1.txt bs=5000k count=1000
dd if=/dev/zero of=/mnt1/file2.txt bs=5000k count=1000

gluster v info

Volume Name: restvol
Type: Distributed-Replicate
Volume ID: 946aee84-d470-4270-97f3-e1a53ea53c4b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.65.7.151:/dist1/brick0
Brick2: 10.65.6.244:/dist1/brick1
Brick3: 10.65.7.148:/dist1/brick2
Brick4: 10.65.7.154:/dist1/brick3
Options Reconfigured:
server.allow-insecure: on
server.ssl: on
client.ssl: on
performance.readdir-ahead: on

Can you please share the output of the below command, taken on the client while the glusterfs process is hung:

gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" -p <glusterfs-process-id>

Regards,
Mohit Agrawal
Hi,

After executing one more command on the server ("gluster v heal info" in a while loop), as mentioned in comment 5, I am able to reproduce the issue. The glusterfs process for one specific mount point hangs on the client.

To find the root cause of the hang I generated a backtrace of the glusterfs process. In the backtrace below, thread 13 (_socket_reap_own_threads) is trying to join tid 2791, which belongs to a poller thread. As per the socket.c code, socket_reap_own_threads is called by the timer thread to reap threads by calling pthread_join. socket_poller calls socket_thread_reaper_add to add its thread id to the reap queue at the very end, after it gets an error in the polling loop or the thread is dying. That means by the time socket_reap_own_threads runs, the queued thread should no longer be running; but the traceback below confirms that thread 2791 (socket_poller) is still waiting in its polling loop. In other words, a new thread with the same thread id (2791) was spawned for a new RPC connection, but somehow socket_reap_own_threads was not called by the timer thread before the new thread reused that tid, so the join waits on a live thread blocked in poll and hangs.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" -p 2777
[New LWP 2796]
[New LWP 2795]
[New LWP 2794]
[New LWP 2793]
[New LWP 2792]
[New LWP 2791]
[New LWP 2784]
[New LWP 2783]
[New LWP 2781]
[New LWP 2780]
[New LWP 2779]
[New LWP 2778]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f1017023ef7 in pthread_join (threadid=139706882123520, thread_return=thread_return@entry=0x0) at pthread_join.c:92 92 lll_wait_tid (pd->tid); Thread 13 (Thread 0x7f100ee33700 (LWP 2778)): #0 0x00007f1017023ef7 in pthread_join (threadid=139706720134912, thread_return=thread_return@entry=0x0) at pthread_join.c:92 _tid = 2791 _buffer = {__routine = 0x7f1017023e30 <cleanup>, __arg = 0x7f10016d2d28, __canceltype = 407298560, __prev = 0x0} oldtype = 0 pd = 0x7f10016d2700 self = 0x7f100ee33700 result = 0 #1 0x00007f100cc1b1ce in _socket_reap_own_threads () at socket.c:223 node = 0x7f0ff4001080 tmp = 0x7f0ff8001080 i = 0 #2 socket_thread_reaper () at socket.c:251 __FUNCTION__ = "socket_thread_reaper" #3 0x00007f10181e2c4b in gf_timer_proc (ctx=0x7f1018c79010) at timer.c:184 at = <optimized out> need_cbk = 1 '\001' now = 103110752180 now_ts = {tv_sec = 103, tv_nsec = 110752180} reg = 0x7f1018cb96d0 sleepts = {tv_sec = 1, tv_nsec = 0} event = 0x7f100412f0a0 old_THIS = 0x7f101846e200 <global_xlator> __FUNCTION__ = "gf_timer_proc" #4 0x00007f1017022dc5 in start_thread (arg=0x7f100ee33700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f100ee33700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706945976064, 1630556678171971743, 0, 139706945976768, 139706945976064, 139707111936016, -1693077006916988769, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #5 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. 
Thread 12 (Thread 0x7f100e632700 (LWP 2779)): #0 0x00007f1017029e91 in do_sigwait (sig=0x7f100e631e1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61 __arg4 = 8 __arg2 = 0 _a3 = 0 _a1 = 139706937581088 resultvar = <optimized out> __arg3 = 0 __arg1 = 139706937581088 _a4 = 8 _a2 = 0 ret = <optimized out> tmpset = {__val = {2822930839, 139707082203098, 4294967295, 0, 139707074193632, 139707105687952, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} #1 __sigwait (set=set@entry=0x7f100e631e20, sig=sig@entry=0x7f100e631e1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99 oldtype = 0 result = 128 #2 0x00007f101869c8cb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2021 set = {__val = {18947, 0 <repeats 15 times>}} ret = <optimized out> sig = 0 #3 0x00007f1017022dc5 in start_thread (arg=0x7f100e632700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f100e632700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706937583360, 1630556678171971743, 0, 139706937584064, 139706937583360, 140735856486139, -1693078105891745633, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 11 (Thread 0x7f100de31700 (LWP 2780)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238 No locals. 
#1 0x00007f101820a2c8 in syncenv_task (proc=proc@entry=0x7f1018cb9cd0) at syncop.c:607 env = 0x7f1018cb9cd0 task = 0x0 sleep_till = {tv_sec = 1469508466, tv_nsec = 0} ret = <optimized out> #2 0x00007f101820b000 in syncenv_processor (thdata=0x7f1018cb9cd0) at syncop.c:699 env = 0x7f1018cb9cd0 proc = 0x7f1018cb9cd0 task = <optimized out> #3 0x00007f1017022dc5 in start_thread (arg=0x7f100de31700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f100de31700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706929190656, 1630556678171971743, 0, 139706929191360, 139706929190656, 139707112201424, -1693074806819991393, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 10 (Thread 0x7f100d630700 (LWP 2781)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238 No locals. 
#1 0x00007f101820a2c8 in syncenv_task (proc=proc@entry=0x7f1018cba090) at syncop.c:607 env = 0x7f1018cb9cd0 task = 0x0 sleep_till = {tv_sec = 1469508466, tv_nsec = 0} ret = <optimized out> #2 0x00007f101820b000 in syncenv_processor (thdata=0x7f1018cba090) at syncop.c:699 env = 0x7f1018cb9cd0 proc = 0x7f1018cba090 task = <optimized out> #3 0x00007f1017022dc5 in start_thread (arg=0x7f100d630700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f100d630700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706920797952, 1630556678171971743, 0, 139706920798656, 139706920797952, 139707112202384, -1693075914384682849, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 9 (Thread 0x7f100b14e700 (LWP 2783)): #0 0x00007f10169682c3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81 No locals. 
#1 0x00007f10182289f0 in event_dispatch_epoll_worker (data=0x7f1018cfbc40) at event-epoll.c:668 event = {events = 1073741827, data = {ptr = 0x100000000, fd = 0, u32 = 0, u64 = 4294967296}} ret = <optimized out> ev_data = 0x7f1018cfbc40 event_pool = 0x7f1018c97d30 myindex = 1 timetodie = 0 __FUNCTION__ = "event_dispatch_epoll_worker" #2 0x00007f1017022dc5 in start_thread (arg=0x7f100b14e700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f100b14e700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706882123520, 1630556678171971743, 0, 139706882124224, 139706882123520, 139707112471616, -1693089021551128417, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #3 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 8 (Thread 0x7f1003fff700 (LWP 2784)): #0 0x00007f10169682c3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81 No locals. 
#1 0x00007f10182289f0 in event_dispatch_epoll_worker (data=0x7f100407a440) at event-epoll.c:668 event = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}} ret = <optimized out> ev_data = 0x7f100407a440 event_pool = 0x7f1018c97d30 myindex = 2 timetodie = 0 __FUNCTION__ = "event_dispatch_epoll_worker" #2 0x00007f1017022dc5 in start_thread (arg=0x7f1003fff700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f1003fff700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706763310848, 1630556678171971743, 0, 139706763311552, 139706763310848, 139707112062256, -1693105695151042401, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #3 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 7 (Thread 0x7f10016d2700 (LWP 2791)): #0 0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81 No locals. #1 0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f10016d1e80) at /usr/include/bits/poll2.h:46 No locals. 
#2 socket_poller (ctx=0x7f10041278a0) at socket.c:2558 this = 0x7f10041278a0 priv = 0x7f1004128450 pfd = {{fd = 11, events = 59, revents = 0}, {fd = 10, events = 59, revents = 0}} to_write = (unknown: 254) ret = <optimized out> gen = 14 cname = <optimized out> __FUNCTION__ = "socket_poller" #3 0x00007f1017022dc5 in start_thread (arg=0x7f10016d2700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f10016d2700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706720134912, 1630556678171971743, 0, 139706720135616, 139706720134912, 139706764523680, -1693102277967687521, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 6 (Thread 0x7f1000ed1700 (LWP 2792)): #0 0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81 No locals. #1 0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f1000ed0e80) at /usr/include/bits/poll2.h:46 No locals. 
#2 socket_poller (ctx=0x7f10040fa070) at socket.c:2558 this = 0x7f10040fa070 priv = 0x7f10040fac20 pfd = {{fd = 24, events = 59, revents = 0}, {fd = 15, events = 59, revents = 0}} to_write = (unknown: 254) ret = <optimized out> gen = 14 cname = <optimized out> __FUNCTION__ = "socket_poller" #3 0x00007f1017022dc5 in start_thread (arg=0x7f1000ed1700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f1000ed1700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706711742208, 1630556678171971743, 0, 139706711742912, 139706711742208, 139706764337264, -1693098978895933281, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 5 (Thread 0x7f0febfff700 (LWP 2793)): #0 0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81 No locals. #1 0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f0febffee80) at /usr/include/bits/poll2.h:46 No locals. 
#2 socket_poller (ctx=0x7f100409ea90) at socket.c:2558 this = 0x7f100409ea90 priv = 0x7f100409f740 pfd = {{fd = 26, events = 59, revents = 0}, {fd = 21, events = 59, revents = 0}} to_write = (unknown: 254) ret = <optimized out> gen = 14 cname = <optimized out> __FUNCTION__ = "socket_poller" #3 0x00007f1017022dc5 in start_thread (arg=0x7f0febfff700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f0febfff700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706360657664, 1630556678171971743, 0, 139706360658368, 139706360657664, 139706763963024, -1675566285664759649, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 4 (Thread 0x7f0feb7fe700 (LWP 2794)): #0 0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81 No locals. #1 0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f0feb7fde80) at /usr/include/bits/poll2.h:46 No locals. 
#2 socket_poller (ctx=0x7f10040cc7e0) at socket.c:2558 this = 0x7f10040cc7e0 priv = 0x7f10040cd420 pfd = {{fd = 28, events = 59, revents = 0}, {fd = 18, events = 59, revents = 0}} to_write = (unknown: 254) ret = <optimized out> gen = 14 cname = <optimized out> __FUNCTION__ = "socket_poller" #3 0x00007f1017022dc5 in start_thread (arg=0x7f0feb7fe700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f0feb7fe700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706352264960, 1630556678171971743, 0, 139706352265664, 139706352264960, 139706764150752, -1675567384639516513, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #4 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 3 (Thread 0x7f0fead5c700 (LWP 2795)): #0 0x00007f101695ee00 in __libc_readv (fd=8, vector=vector@entry=0x7f0fead5be70, count=count@entry=2) at ../sysdeps/unix/sysv/linux/readv.c:56 resultvar = 18446744073709551104 oldtype = 0 result = <optimized out> #1 0x00007f100ee5101c in fuse_thread_proc (data=0x7f1018ca6730) at fuse-bridge.c:4896 mount_point = 0x0 this = 0x7f1018ca6730 priv = 0x7f1018caa7a0 res = <optimized out> iobuf = 0x7f1018c974c0 finh = <optimized out> iov_in = {{iov_base = 0x7f0fe4049030, iov_len = 80}, {iov_base = 0x7f10163f1000, iov_len = 131072}} msg = <optimized out> fuse_ops = 0x7f100f0607a0 <fuse_std_ops> pfd = {{fd = 6, events = 25, revents = 1}, {fd = 8, events = 25, revents = 1}} mount_finished = _gf_true __FUNCTION__ = "fuse_thread_proc" #2 0x00007f1017022dc5 in start_thread (arg=0x7f0fead5c700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f0fead5c700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706341115648, 1630556678171971743, 0, 139706341116352, 139706341115648, 139707112122160, 
-1675564445808144225, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #3 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. Thread 2 (Thread 0x7f0fea55b700 (LWP 2796)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 No locals. #1 0x00007f100ee3dfc3 in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:3873 len = 40 rv = <optimized out> this = <optimized out> priv = 0x7f1018caa7a0 node = <optimized out> tmp = 0x0 pfoh = <optimized out> __FUNCTION__ = "notify_kernel_loop" #2 0x00007f1017022dc5 in start_thread (arg=0x7f0fea55b700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f0fea55b700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706332722944, 1630556678171971743, 0, 139706332723648, 139706332722944, 139707112122160, -1675565544782901089, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> #3 0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 No locals. 
Thread 1 (Thread 0x7f101867f780 (LWP 2777)):
#0  0x00007f1017023ef7 in pthread_join (threadid=139706882123520, thread_return=thread_return@entry=0x0) at pthread_join.c:92
        _tid = 2783
        _buffer = {__routine = 0x7f1017023e30 <cleanup>, __arg = 0x7f100b14ed28, __canceltype = -1631877440, __prev = 0x0}
        oldtype = 0
        pd = 0x7f100b14e700
        self = 0x7f101867f780
        result = 0
#1  0x00007f1018228ef8 in event_dispatch_epoll (event_pool=0x7f1018c97d30) at event-epoll.c:762
        i = <optimized out>
        t_id = 139706882123520
        pollercount = 1
        ret = 0
        ev_data = <optimized out>
        __FUNCTION__ = "event_dispatch_epoll"
#2  0x00007f1018699877 in main (argc=4, argv=0x7fff9ebb9878) at glusterfsd.c:2367
        ctx = 0x7f1018c79010
        ret = 0
        cmdlinestr = "/usr/sbin/glusterfs --volfile-server=10.65.7.148 --volfile-id=/restvol /mnt1", '\000' <repeats 4019 times>
        cmd = 0x7f1018c79010
        __FUNCTION__ = "main"

A debugging session is active.
        Inferior 1 [process 2777] will be detached.
Quit anyway? (y or n) [answered Y; input not from terminal]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I tried the same steps on the latest patch (http://review.gluster.org/#/c/14886/5), with the reap code removed, and it works fine.

Regards,
Mohit Agrawal
Since the patch is already posted upstream, moving the state to POST.
Upstream mainline : http://review.gluster.org/14694 Upstream 3.8 : http://review.gluster.org/14891 And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.
Tested on 3.8.4-2 with 4 workload-generating clients. Ran the same workload I ran while opening this BZ. Also ran multi-threaded iozone (with 4 and 8 threads per client) as BenT reported in BZ#1249166. No hangs were seen. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html