Bug 1337863 - [SSL] : I/O hangs when run from multiple clients on an SSL enabled volume
Summary: [SSL] : I/O hangs when run from multiple clients on an SSL enabled volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: fuse
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Mohit Agrawal
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks: 1351522 1351530
 
Reported: 2016-05-20 09:45 UTC by Ambarish
Modified: 2017-03-23 05:32 UTC
9 users

Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Doc Text:
Previously, when SSL was enabled for a Red Hat Gluster Storage volume, thread cleanup code did not always finish running before a new thread with the same ID was generated. When this happened, clients with I/O in progress experienced hung behavior. New threads are now generated after the detach flag is enabled during thread creation, so that these race conditions are avoided, and I/O no longer hangs for clients.
Clone Of:
Environment:
Last Closed: 2017-03-23 05:32:29 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Ambarish 2016-05-20 09:45:48 UTC
Description of problem:
----------------------

My setup consisted of 4 servers and 4 clients (with both mgt and I/O encryption enabled). Each client mounts from a different server via FUSE.
I tried to create 2500 files from each client. Almost 15 minutes into my workload, I could see an I/O hang on one of my clients.
A normal Ctrl+C or Ctrl+Z could not break out of it, so the process had to be killed from another terminal.

Version-Release number of selected component (if applicable):
------------------------------------------------------------

[root@gqas013 ~]# rpm -qa|grep gluster
tmp-rhs-tests-beaker-rhs-gluster-qe-libs-dev-bturner-3.0-0.noarch
glusterfs-api-3.7.9-5.el6rhs.x86_64
python-gluster-3.7.9-5.el6rhs.noarch
samba-vfs-glusterfs-4.4.3-4.el6rhs.x86_64
glusterfs-libs-3.7.9-5.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-server-3.7.9-5.el6rhs.x86_64
glusterfs-fuse-3.7.9-5.el6rhs.x86_64
glusterfs-cli-3.7.9-5.el6rhs.x86_64
gluster-nagios-addons-0.2.7-1.el6rhs.x86_64
vdsm-gluster-4.16.30-1.4.el6rhs.noarch
glusterfs-3.7.9-5.el6rhs.x86_64
glusterfs-rdma-3.7.9-5.el6rhs.x86_64
glusterfs-geo-replication-3.7.9-5.el6rhs.x86_64
glusterfs-client-xlators-3.7.9-5.el6rhs.x86_64

[root@gqas013 ~]# rpm -qa|grep  ssl
openssl-1.0.1e-48.el6.x86_64
openssl-devel-1.0.1e-48.el6.x86_64
python-backports-ssl_match_hostname-3.4.0.2-2.el6.noarch

How reproducible:
----------------

Reporting the first occurrence.

Steps to Reproduce:
------------------

1. Enable mgt and I/O encryption on the volume and mount it via FUSE.

2. Run dd workloads from multiple clients.


Actual results:
---------------

I/O hangs after a while on one of the clients.


Expected results:
-----------------

There should be no I/O hangs.

Additional info:
---------------

-

Comment 4 Atin Mukherjee 2016-05-26 13:59:18 UTC
Whenever a hang is seen, it's advised to take backtraces of the relevant processes; the backtrace should reveal where the process is stuck and help developers figure out what might have caused the hang. The current set of information provided here doesn't give us any clue about the problem until we reproduce it again.

Could you please re-execute this scenario and provide the requested details?

Comment 5 Nag Pavan Chilakam 2016-05-27 09:29:31 UTC
I hit an I/O hang when I did the following:
I was testing the fix for 1279628 - [GSS]-gluster v heal volname info does not work with enabled ssl/tls
So I did the following:
1) Had a 6-node cluster setup
2) Had a FUSE client identified
3) There were already about 3 volumes
4) I then had to set up SSL for this cluster and the client,
   so I followed the documentation "https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html"
5) Once SSL was enabled on both the client and the servers, I created a new volume called "newssl", a 3x2 distributed-replicate volume
 
Volume Name: newssl
Type: Distributed-Replicate
Volume ID: 467962bd-3d4e-4b56-adb2-7b2b0b34634e
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.191:/rhs/brick3/newssl
Brick2: 10.70.35.27:/rhs/brick3/newssl
Brick3: 10.70.35.98:/rhs/brick3/newssl
Brick4: 10.70.35.64:/rhs/brick3/newssl
Brick5: 10.70.35.44:/rhs/brick3/newssl
Brick6: 10.70.35.114:/rhs/brick3/newssl
Options Reconfigured:
cluster.locking-scheme: granular
client.ssl: on
server.ssl: on
auth.ssl-allow: 10.70.35.191,10.70.35.27,10.70.35.98,10.70.35.64,10.70.35.44,10.70.35.114,10.70.35.126
performance.readdir-ahead: on

[root@dhcp35-191 ~]# gluster v status newssl
Status of volume: newssl
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.191:/rhs/brick3/newssl       49157     0          Y       31539
Brick 10.70.35.27:/rhs/brick3/newssl        49156     0          Y       26589
Brick 10.70.35.98:/rhs/brick3/newssl        49157     0          Y       29210
Brick 10.70.35.64:/rhs/brick3/newssl        49158     0          Y       5333 
Brick 10.70.35.44:/rhs/brick3/newssl        49156     0          Y       14494
Brick 10.70.35.114:/rhs/brick3/newssl       49157     0          Y       5145 
NFS Server on localhost                     2049      0          Y       32315
Self-heal Daemon on localhost               N/A       N/A        Y       32338
NFS Server on 10.70.35.98                   2049      0          Y       29860
Self-heal Daemon on 10.70.35.98             N/A       N/A        Y       29870
NFS Server on 10.70.35.27                   2049      0          Y       27180
Self-heal Daemon on 10.70.35.27             N/A       N/A        Y       27192
NFS Server on 10.70.35.114                  2049      0          Y       5818 
Self-heal Daemon on 10.70.35.114            N/A       N/A        Y       5837 
NFS Server on 10.70.35.64                   2049      0          Y       6085 
Self-heal Daemon on 10.70.35.64             N/A       N/A        Y       6099 
NFS Server on 10.70.35.44                   2049      0          Y       15120
Self-heal Daemon on 10.70.35.44             N/A       N/A        Y       15134
 
Task Status of Volume newssl
------------------------------------------------------------------------------
There are no active volume tasks
 
6) I then mounted the volume on the SSL-enabled client (10.70.35.126) at two different mount points, as below:
 A) Created a folder /mnt/newssl and mounted the volume as "10.70.35.191:newssl on /mnt/newssl type fuse.glusterfs"
 B) Created another folder /mnt/newssl-dup and mounted the volume as "10.70.35.27:newssl on /mnt/newssl-dup type fuse.glusterfs"

7) From both mount points I then started dd'ing a file of about 10GB:
  1) From newssl-dup, dd'ed a file called file1 in a dir dir1
  2) From newssl, dd'ed a file called file7 in the root of the volume itself

8) While this was happening, I issued "watch gluster v heal newssl info" on all 6 nodes
9) I then set the volume option "cluster.locking-scheme: granular" (as part of 1311839 - False positives in heal info)

I then saw that the dd I/O was hung forever, and heal info was not giving any output and was hung as well.


I showed the setup to Pranith; the following are the gdb bt details from the client:

#############################################################################
0x00007fa2f40b5ef7 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.4.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libuuid-2.23.2-26.el7.x86_64 openssl-libs-1.0.1e-42.el7_1.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) thread apply all bt

Thread 15 (Thread 0x7fa2ebec0700 (LWP 24958)):
#0  0x00007fa2f40b5ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fa2e9ca81de in _socket_reap_own_threads () at socket.c:223
#2  socket_thread_reaper () at socket.c:251
#3  0x00007fa2f5274c3b in gf_timer_proc (ctx=0x7fa2f601f010) at timer.c:184
#4  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fa2eb6bf700 (LWP 24959)):
#0  0x00007fa2f40bbe91 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fa2f572e8cb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2021
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fa2eaebe700 (LWP 24960)):
#0  0x00007fa2f40b8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa2f529c288 in syncenv_task (proc=proc@entry=0x7fa2f605fd10) at syncop.c:607
#2  0x00007fa2f529cfc0 in syncenv_processor (thdata=0x7fa2f605fd10) at syncop.c:699
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fa2ea6bd700 (LWP 24961)):
#0  0x00007fa2f40b8a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa2f529c288 in syncenv_task (proc=proc@entry=0x7fa2f60600d0) at syncop.c:607
#2  0x00007fa2f529cfc0 in syncenv_processor (thdata=0x7fa2f60600d0) at syncop.c:699
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fa2e81de700 (LWP 24963)):
#0  0x00007fa2f39fb863 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fa2f52ba9b0 in event_dispatch_epoll_worker (data=0x7fa2f60a55a0) at event-epoll.c:668
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fa2e49cf700 (LWP 24964)):
#0  0x00007fa2f39fb863 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fa2f52ba9b0 in event_dispatch_epoll_worker (data=0x7fa2e009dcb0) at event-epoll.c:668
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fa2dd7fa700 (LWP 24970)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2dd7f9e80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e01b45e0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fa2c3fff700 (LWP 24974)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c3ffee80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e01840b0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fa2c37fe700 (LWP 24975)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c37fde80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e0123580) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fa2c2ffd700 (LWP 24976)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c2ffce80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e0153b50) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fa2c27fc700 (LWP 24977)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c27fbe80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e00f30b0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fa2c1ffb700 (LWP 24978)):
#0  0x00007fa2f39f0c3d in poll () from /lib64/libc.so.6
#1  0x00007fa2e9cadb65 in poll (__timeout=-1, __nfds=2, __fds=0x7fa2c1ffae80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7fa2e00c22f0) at socket.c:2558
#3  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fa2c16f9700 (LWP 24979)):
#0  0x00007fa2f39f23a0 in readv () from /lib64/libc.so.6
#1  0x00007fa2ebee4fec in fuse_thread_proc (data=0x7fa2f604c770) at fuse-bridge.c:4896
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fa2c0ef8700 (LWP 24980)):
#0  0x00007fa2f40b86d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa2ebed1fc3 in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:3873
#2  0x00007fa2f40b4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fa2f39fb28d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa2f5714780 (LWP 24957)):
#0  0x00007fa2f40b5ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fa2f52baeb8 in event_dispatch_epoll (event_pool=0x7fa2f603dd30) at event-epoll.c:762
#2  0x00007fa2f572b877 in main (argc=4, argv=0x7ffd34cddc88) at glusterfsd.c:2370

#################################################################

Will also add the sosreport location info

Comment 6 Nag Pavan Chilakam 2016-05-27 09:54:49 UTC
[root@dhcp35-126 gluster]# ps aux | grep gluster
root     24873  3.7  1.5 1131328 61560 ?       Ssl  11:48   1:33 /usr/sbin/glusterfs --volfile-server=10.70.35.191 --volfile-id=newssl /mnt/newssl
root     24957 11.5  1.6 1131328 62792 ?       Ssl  11:49   4:43 /usr/sbin/glusterfs --volfile-server=10.70.35.27 --volfile-id=newssl /mnt/newssl-dup



I have put all sosreports at nchilaka@rhsqe-repo:
[nchilaka@rhsqe-repo bug.1337863]$ pwd
/home/repo/sosreports/nchilaka/bug.1337863

I have also put the statedumps collected by Pranith during debugging there.

The client sosreport was not completing, hence I took a tar of /var/log and copied it to the client dir in the sosreport location.

Comment 7 Ambarish 2016-05-28 17:26:01 UTC
Hitting the reported failure on 3.1.2 as well.

Comment 8 Atin Mukherjee 2016-05-30 07:14:20 UTC
From the backtrace it looks like pthread_join is waiting in _socket_reap_own_threads(). This function was introduced recently in rhgs-3.1.3 with https://code.engineering.redhat.com/gerrit/#/c/73681, and I was suspecting that this patch could have regressed the functionality.

However, Comment 7 makes me think the other way now. It'd be really helpful if we can get a backtrace on rhgs-3.1.2 for the same scenario.

In the meantime, could you shed some light on this, Jeff?

Comment 10 Jeff Darcy 2016-05-31 17:49:13 UTC
(In reply to Atin Mukherjee from comment #8)
> However Comment 7 makes me think the other way now. It'd really helpful if
> we can get back trace on rhgs-3.1.2 for the same scenario.

The fact that it occurs in 3.1.2 indicates pretty conclusively that it's not a problem with the reap_own_threads code.  Looking at the logs, it looks like glusterd on at least one of the servers reports several thousand SSL errors ("wrong version number") in a constant stream about 1.5 seconds apart.  This usually means that glusterd is set up to use SSL but clients are trying to make non-SSL connections.  The "common name" fields being reported for clients also seem a bit strange.  It would be useful to see what "Enable mgt and I/O encryption on the volume and mount it via FUSE" in the original report actually means in detail.

Comment 11 Atin Mukherjee 2016-06-01 04:57:43 UTC
Raising the needinfo on Ambarish as we need a few clarifications, as mentioned in Comment 10.

Comment 12 Ambarish 2016-06-01 12:06:29 UTC
I usually check the gluster mount log on the client to see if encryption is correctly enabled, just to rule out config-related issues.

I could see that both I/O and glusterd encryption were enabled, as Jeff mentioned.

Comment 13 Mohit Agrawal 2016-07-25 07:35:41 UTC
Hi,

  I have tried to reproduce the issue on my VM environment. I am not able to reproduce the issue after following the steps below.

  1) Set up an environment with 4 server (2x2) and 4 client nodes
  2) Create two different mount points from two different servers on one client
  3) Run a loop on all clients simultaneously to generate the I/O load
     dd if=/dev/zero of=/mnt/file1.txt bs=5000k count=1000
     dd if=/dev/zero of=/mnt1/file2.txt bs=5000k count=1000
  

gluster v info
 
Volume Name: restvol
Type: Distributed-Replicate
Volume ID: 946aee84-d470-4270-97f3-e1a53ea53c4b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.65.7.151:/dist1/brick0
Brick2: 10.65.6.244:/dist1/brick1
Brick3: 10.65.7.148:/dist1/brick2
Brick4: 10.65.7.154:/dist1/brick3
Options Reconfigured:
server.allow-insecure: on
server.ssl: on
client.ssl: on
performance.readdir-ahead: on
 

  Can you please share the output of the below command from the client at the time the glusterfs process is hung:
  gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" -p <glusterfs-process-id>

  

Regards
Mohit Agrawal

Comment 14 Mohit Agrawal 2016-07-26 05:30:56 UTC
Hi,

 After executing one more command (gluster v heal info in a while loop) on the server, as mentioned in comment 5, I am able to reproduce the issue.
The glusterfs process for one specific mount point is hung on the client.

To find the root cause of the hang I have generated a backtrace of the glusterfs process.

As per the backtrace below, thread 13 (_socket_reap_own_threads) is trying to join tid 2791, which corresponds to a poller thread.

As per the socket.c code, socket_reap_own_threads is called by the timer thread to reap threads by calling pthread_join on them.

socket_poller calls socket_thread_reaper_add to add its thread id to the reap queue at the very end, after it gets an error in the polling loop or the thread is dying.

That means by the time socket_reap_own_threads is called, that thread should no longer be running, but the backtrace below confirms that thread 2791 (socket_poller) is still waiting in its polling loop.

In other words, a new thread for a new RPC connection was created and reused the same thread id 2791 before the timer thread had called socket_reap_own_threads for the old one; since the new thread is waiting in poll, the pthread_join on that tid hangs.
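
To make the above mechanism easier to follow, here is a minimal C sketch (an illustration only, not the actual socket.c code; thread_reaper_add and reap_own_threads are hypothetical stand-ins for socket_thread_reaper_add and _socket_reap_own_threads): each poller thread queues its own thread id just before it exits, and a timer-driven reaper later joins the queued ids. If a queued id ends up referring to a thread that is still alive (for example because the id value was recycled for a new connection's poller before the reaper ran), the pthread_join blocks indefinitely, which matches the hang seen in thread 13.

#include <pthread.h>
#include <stdlib.h>

struct reap_node {
    pthread_t         tid;
    struct reap_node *next;
};

static struct reap_node *reap_queue;
static pthread_mutex_t   reap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by a poller thread just before it exits
 * (stand-in for socket_thread_reaper_add). */
static void thread_reaper_add(pthread_t tid)
{
    struct reap_node *node = malloc(sizeof(*node));

    node->tid = tid;
    pthread_mutex_lock(&reap_lock);
    node->next = reap_queue;
    reap_queue = node;
    pthread_mutex_unlock(&reap_lock);
}

/* Called periodically from the timer thread
 * (stand-in for _socket_reap_own_threads). */
static void reap_own_threads(void)
{
    struct reap_node *node;

    pthread_mutex_lock(&reap_lock);
    node = reap_queue;
    reap_queue = NULL;
    pthread_mutex_unlock(&reap_lock);

    while (node) {
        struct reap_node *next = node->next;

        /* Blocks forever if node->tid no longer refers to the exited poller
         * but to a live thread that is still sitting in poll(). */
        pthread_join(node->tid, NULL);
        free(node);
        node = next;
    }
}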



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" -p 2777 

[New LWP 2796]
[New LWP 2795]
[New LWP 2794]
[New LWP 2793]
[New LWP 2792]
[New LWP 2791]
[New LWP 2784]
[New LWP 2783]
[New LWP 2781]
[New LWP 2780]
[New LWP 2779]
[New LWP 2778]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f1017023ef7 in pthread_join (threadid=139706882123520, thread_return=thread_return@entry=0x0) at pthread_join.c:92
92	    lll_wait_tid (pd->tid);

Thread 13 (Thread 0x7f100ee33700 (LWP 2778)):
#0  0x00007f1017023ef7 in pthread_join (threadid=139706720134912, thread_return=thread_return@entry=0x0) at pthread_join.c:92
        _tid = 2791
        _buffer = {__routine = 0x7f1017023e30 <cleanup>, __arg = 0x7f10016d2d28, __canceltype = 407298560, __prev = 0x0}
        oldtype = 0
        pd = 0x7f10016d2700
        self = 0x7f100ee33700
        result = 0
#1  0x00007f100cc1b1ce in _socket_reap_own_threads () at socket.c:223
        node = 0x7f0ff4001080
        tmp = 0x7f0ff8001080
        i = 0
#2  socket_thread_reaper () at socket.c:251
        __FUNCTION__ = "socket_thread_reaper"
#3  0x00007f10181e2c4b in gf_timer_proc (ctx=0x7f1018c79010) at timer.c:184
        at = <optimized out>
        need_cbk = 1 '\001'
        now = 103110752180
        now_ts = {tv_sec = 103, tv_nsec = 110752180}
        reg = 0x7f1018cb96d0
        sleepts = {tv_sec = 1, tv_nsec = 0}
        event = 0x7f100412f0a0
        old_THIS = 0x7f101846e200 <global_xlator>
        __FUNCTION__ = "gf_timer_proc"
#4  0x00007f1017022dc5 in start_thread (arg=0x7f100ee33700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f100ee33700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706945976064, 1630556678171971743, 0, 139706945976768, 139706945976064, 139707111936016, -1693077006916988769, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#5  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 12 (Thread 0x7f100e632700 (LWP 2779)):
#0  0x00007f1017029e91 in do_sigwait (sig=0x7f100e631e1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
        __arg4 = 8
        __arg2 = 0
        _a3 = 0
        _a1 = 139706937581088
        resultvar = <optimized out>
        __arg3 = 0
        __arg1 = 139706937581088
        _a4 = 8
        _a2 = 0
        ret = <optimized out>
        tmpset = {__val = {2822930839, 139707082203098, 4294967295, 0, 139707074193632, 139707105687952, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
#1  __sigwait (set=set@entry=0x7f100e631e20, sig=sig@entry=0x7f100e631e1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
        oldtype = 0
        result = 128
#2  0x00007f101869c8cb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2021
        set = {__val = {18947, 0 <repeats 15 times>}}
        ret = <optimized out>
        sig = 0
#3  0x00007f1017022dc5 in start_thread (arg=0x7f100e632700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f100e632700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706937583360, 1630556678171971743, 0, 139706937584064, 139706937583360, 140735856486139, -1693078105891745633, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 11 (Thread 0x7f100de31700 (LWP 2780)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
No locals.
#1  0x00007f101820a2c8 in syncenv_task (proc=proc@entry=0x7f1018cb9cd0) at syncop.c:607
        env = 0x7f1018cb9cd0
        task = 0x0
        sleep_till = {tv_sec = 1469508466, tv_nsec = 0}
        ret = <optimized out>
#2  0x00007f101820b000 in syncenv_processor (thdata=0x7f1018cb9cd0) at syncop.c:699
        env = 0x7f1018cb9cd0
        proc = 0x7f1018cb9cd0
        task = <optimized out>
#3  0x00007f1017022dc5 in start_thread (arg=0x7f100de31700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f100de31700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706929190656, 1630556678171971743, 0, 139706929191360, 139706929190656, 139707112201424, -1693074806819991393, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 10 (Thread 0x7f100d630700 (LWP 2781)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
No locals.
#1  0x00007f101820a2c8 in syncenv_task (proc=proc@entry=0x7f1018cba090) at syncop.c:607
        env = 0x7f1018cb9cd0
        task = 0x0
        sleep_till = {tv_sec = 1469508466, tv_nsec = 0}
        ret = <optimized out>
#2  0x00007f101820b000 in syncenv_processor (thdata=0x7f1018cba090) at syncop.c:699
        env = 0x7f1018cb9cd0
        proc = 0x7f1018cba090
        task = <optimized out>
#3  0x00007f1017022dc5 in start_thread (arg=0x7f100d630700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f100d630700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706920797952, 1630556678171971743, 0, 139706920798656, 139706920797952, 139707112202384, -1693075914384682849, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 9 (Thread 0x7f100b14e700 (LWP 2783)):
#0  0x00007f10169682c3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f10182289f0 in event_dispatch_epoll_worker (data=0x7f1018cfbc40) at event-epoll.c:668
        event = {events = 1073741827, data = {ptr = 0x100000000, fd = 0, u32 = 0, u64 = 4294967296}}
        ret = <optimized out>
        ev_data = 0x7f1018cfbc40
        event_pool = 0x7f1018c97d30
        myindex = 1
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#2  0x00007f1017022dc5 in start_thread (arg=0x7f100b14e700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f100b14e700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706882123520, 1630556678171971743, 0, 139706882124224, 139706882123520, 139707112471616, -1693089021551128417, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#3  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 8 (Thread 0x7f1003fff700 (LWP 2784)):
#0  0x00007f10169682c3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f10182289f0 in event_dispatch_epoll_worker (data=0x7f100407a440) at event-epoll.c:668
        event = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
        ret = <optimized out>
        ev_data = 0x7f100407a440
        event_pool = 0x7f1018c97d30
        myindex = 2
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#2  0x00007f1017022dc5 in start_thread (arg=0x7f1003fff700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f1003fff700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706763310848, 1630556678171971743, 0, 139706763311552, 139706763310848, 139707112062256, -1693105695151042401, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#3  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 7 (Thread 0x7f10016d2700 (LWP 2791)):
#0  0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f10016d1e80) at /usr/include/bits/poll2.h:46
No locals.
#2  socket_poller (ctx=0x7f10041278a0) at socket.c:2558
        this = 0x7f10041278a0
        priv = 0x7f1004128450
        pfd = {{fd = 11, events = 59, revents = 0}, {fd = 10, events = 59, revents = 0}}
        to_write = (unknown: 254)
        ret = <optimized out>
        gen = 14
        cname = <optimized out>
        __FUNCTION__ = "socket_poller"
#3  0x00007f1017022dc5 in start_thread (arg=0x7f10016d2700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f10016d2700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706720134912, 1630556678171971743, 0, 139706720135616, 139706720134912, 139706764523680, -1693102277967687521, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 6 (Thread 0x7f1000ed1700 (LWP 2792)):
#0  0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f1000ed0e80) at /usr/include/bits/poll2.h:46
No locals.
#2  socket_poller (ctx=0x7f10040fa070) at socket.c:2558
        this = 0x7f10040fa070
        priv = 0x7f10040fac20
        pfd = {{fd = 24, events = 59, revents = 0}, {fd = 15, events = 59, revents = 0}}
        to_write = (unknown: 254)
        ret = <optimized out>
        gen = 14
        cname = <optimized out>
        __FUNCTION__ = "socket_poller"
#3  0x00007f1017022dc5 in start_thread (arg=0x7f1000ed1700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f1000ed1700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706711742208, 1630556678171971743, 0, 139706711742912, 139706711742208, 139706764337264, -1693098978895933281, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 5 (Thread 0x7f0febfff700 (LWP 2793)):
#0  0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f0febffee80) at /usr/include/bits/poll2.h:46
No locals.
#2  socket_poller (ctx=0x7f100409ea90) at socket.c:2558
        this = 0x7f100409ea90
        priv = 0x7f100409f740
        pfd = {{fd = 26, events = 59, revents = 0}, {fd = 21, events = 59, revents = 0}}
        to_write = (unknown: 254)
        ret = <optimized out>
        gen = 14
        cname = <optimized out>
        __FUNCTION__ = "socket_poller"
#3  0x00007f1017022dc5 in start_thread (arg=0x7f0febfff700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f0febfff700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706360657664, 1630556678171971743, 0, 139706360658368, 139706360657664, 139706763963024, -1675566285664759649, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 4 (Thread 0x7f0feb7fe700 (LWP 2794)):
#0  0x00007f101695d69d in poll () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f100cc20b55 in poll (__timeout=-1, __nfds=2, __fds=0x7f0feb7fde80) at /usr/include/bits/poll2.h:46
No locals.
#2  socket_poller (ctx=0x7f10040cc7e0) at socket.c:2558
        this = 0x7f10040cc7e0
        priv = 0x7f10040cd420
        pfd = {{fd = 28, events = 59, revents = 0}, {fd = 18, events = 59, revents = 0}}
        to_write = (unknown: 254)
        ret = <optimized out>
        gen = 14
        cname = <optimized out>
        __FUNCTION__ = "socket_poller"
#3  0x00007f1017022dc5 in start_thread (arg=0x7f0feb7fe700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f0feb7fe700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706352264960, 1630556678171971743, 0, 139706352265664, 139706352264960, 139706764150752, -1675567384639516513, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#4  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 3 (Thread 0x7f0fead5c700 (LWP 2795)):
#0  0x00007f101695ee00 in __libc_readv (fd=8, vector=vector@entry=0x7f0fead5be70, count=count@entry=2) at ../sysdeps/unix/sysv/linux/readv.c:56
        resultvar = 18446744073709551104
        oldtype = 0
        result = <optimized out>
#1  0x00007f100ee5101c in fuse_thread_proc (data=0x7f1018ca6730) at fuse-bridge.c:4896
        mount_point = 0x0
        this = 0x7f1018ca6730
        priv = 0x7f1018caa7a0
        res = <optimized out>
        iobuf = 0x7f1018c974c0
        finh = <optimized out>
        iov_in = {{iov_base = 0x7f0fe4049030, iov_len = 80}, {iov_base = 0x7f10163f1000, iov_len = 131072}}
        msg = <optimized out>
        fuse_ops = 0x7f100f0607a0 <fuse_std_ops>
        pfd = {{fd = 6, events = 25, revents = 1}, {fd = 8, events = 25, revents = 1}}
        mount_finished = _gf_true
        __FUNCTION__ = "fuse_thread_proc"
#2  0x00007f1017022dc5 in start_thread (arg=0x7f0fead5c700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f0fead5c700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706341115648, 1630556678171971743, 0, 139706341116352, 139706341115648, 139707112122160, -1675564445808144225, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#3  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 2 (Thread 0x7f0fea55b700 (LWP 2796)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x00007f100ee3dfc3 in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:3873
        len = 40
        rv = <optimized out>
        this = <optimized out>
        priv = 0x7f1018caa7a0
        node = <optimized out>
        tmp = 0x0
        pfoh = <optimized out>
        __FUNCTION__ = "notify_kernel_loop"
#2  0x00007f1017022dc5 in start_thread (arg=0x7f0fea55b700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f0fea55b700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139706332722944, 1630556678171971743, 0, 139706332723648, 139706332722944, 139707112122160, -1675565544782901089, -1693132815617568609}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#3  0x00007f1016967ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 1 (Thread 0x7f101867f780 (LWP 2777)):
#0  0x00007f1017023ef7 in pthread_join (threadid=139706882123520, thread_return=thread_return@entry=0x0) at pthread_join.c:92
        _tid = 2783
        _buffer = {__routine = 0x7f1017023e30 <cleanup>, __arg = 0x7f100b14ed28, __canceltype = -1631877440, __prev = 0x0}
        oldtype = 0
        pd = 0x7f100b14e700
        self = 0x7f101867f780
        result = 0
#1  0x00007f1018228ef8 in event_dispatch_epoll (event_pool=0x7f1018c97d30) at event-epoll.c:762
        i = <optimized out>
        t_id = 139706882123520
        pollercount = 1
        ret = 0
        ev_data = <optimized out>
        __FUNCTION__ = "event_dispatch_epoll"
#2  0x00007f1018699877 in main (argc=4, argv=0x7fff9ebb9878) at glusterfsd.c:2367
        ctx = 0x7f1018c79010
        ret = 0
        cmdlinestr = "/usr/sbin/glusterfs --volfile-server=10.65.7.148 --volfile-id=/restvol /mnt1", '\000' <repeats 4019 times>
        cmd = 0x7f1018c79010
        __FUNCTION__ = "main"
A debugging session is active.

	Inferior 1 [process 2777] will be detached.

Quit anyway? (y or n) [answered Y; input not from terminal]

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I have tried the same steps on the latest patch (http://review.gluster.org/#/c/14886/5), which removes the reap code, and it is working fine.

Regards
Mohit Agrawal

Comment 15 Atin Mukherjee 2016-07-26 05:40:32 UTC
Since the patch is already posted upstream, moving the state to POST.

Comment 19 Atin Mukherjee 2016-09-17 14:22:12 UTC
Upstream mainline : http://review.gluster.org/14694
Upstream 3.8 : http://review.gluster.org/14891

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.
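
For illustration, a minimal pthreads sketch of the approach described in the Doc Text above (a sketch only, not the actual upstream patch; socket_poller_stub and spawn_detached_poller are hypothetical names): the per-connection poller thread is created with the detach attribute set, so its resources are released automatically when it exits and no pthread_join-based reaping of thread ids is needed.

#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-connection poller routine. */
static void *socket_poller_stub(void *arg)
{
    (void)arg;      /* ... poll the SSL connection until it is torn down ... */
    return NULL;    /* detached: resources are reclaimed automatically here  */
}

/* Create the poller thread already detached, so no later join/reap is needed. */
static int spawn_detached_poller(pthread_t *tid, void *conn)
{
    pthread_attr_t attr;
    int            ret;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    ret = pthread_create(tid, &attr, socket_poller_stub, conn);
    pthread_attr_destroy(&attr);
    return ret;
}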

Comment 22 Ambarish 2016-10-12 02:58:54 UTC
Tested on 3.8.4-2 with 4 workload-generating clients.

Ran the same workload I ran while opening this BZ.

Ran multi-threaded iozone (with 4 and 8 threads per client) as BenT reported in BZ#1249166.

No hangs were seen. Verified.

Comment 29 errata-xmlrpc 2017-03-23 05:32:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

