Bug 1415101 - glustershd process crashed on systemic setup
Summary: glustershd process crashed on systemic setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: RHGS 3.2.0
Assignee: Ravishankar N
QA Contact: nchilaka
URL:
Whiteboard:
Depends On: 1418629
Blocks: 1351528
TreeView+ depends on / blocked
 
Reported: 2017-01-20 09:28 UTC by nchilaka
Modified: 2017-03-23 06:03 UTC (History)
6 users (show)

Fixed In Version: glusterfs-3.8.4-15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1418629 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:03:54 UTC


Attachments (Terms of Use)
list of entries in splitbrain (22.44 KB, text/plain)
2017-01-20 11:56 UTC, nchilaka
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description nchilaka 2017-01-20 09:28:55 UTC
Description of problem:
=======================
On my systemic setup , glustershd crashed

for the exact testing information kindly refer to
https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=1917869133

root@dhcp37-154 ~]# file /core.3893 
/core.3893: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/gl', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfs', platform: 'x86_64'
[root@dhcp37-154 ~]# 



[root@dhcp37-154 ~]# 
[root@dhcp37-154 ~]# 
[root@dhcp37-154 ~]# 
[root@dhcp37-154 ~]# 
[root@dhcp37-154 ~]# gdb /usr/sbin/glusterfs /core.3893
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 3914]
[New LWP 13648]
[New LWP 3894]
[New LWP 3895]
[New LWP 13826]
[New LWP 3898]
[New LWP 3897]
[New LWP 3896]
[New LWP 3893]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/gl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
314	        GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 openssl-libs-1.0.1e-60.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
#1  0x00007fd8b004746b in client_fdctx_destroy (this=<optimized out>, fdctx=0x7fd8ac21f630)
    at client-rpc-fops.c:3304
#2  0x00007fd8b0049946 in client_child_up_reopen_done (fdctx=0x7fd8ac21f630, this=0x7fd8ac00f7a0)
    at client-handshake.c:672
#3  0x00007fd8b0048811 in client3_3_reopendir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7fd8bb53bb30) at client-handshake.c:832
#4  0x00007fd8bd82b6e0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd8ac07b8f0, pollin=pollin@entry=0x7fd8a0002750)
    at rpc-clnt.c:791
#5  0x00007fd8bd82b9cf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd8ac07b948, event=<optimized out>, 
    data=0x7fd8a0002750) at rpc-clnt.c:971
#6  0x00007fd8bd827893 in rpc_transport_notify (this=this@entry=0x7fd8ac08b5d0, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fd8a0002750) at rpc-transport.c:538
#7  0x00007fd8b23192d4 in socket_event_poll_in (this=this@entry=0x7fd8ac08b5d0) at socket.c:2267
#8  0x00007fd8b231b785 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7fd8ac08b5d0, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007fd8bdabb5b0 in event_dispatch_epoll_handler (event=0x7fd8a946de80, event_pool=0x7fd8be59df00)
    at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7fd8ac03dfc0) at event-epoll.c:674
#11 0x00007fd8bc8c2dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd8bc20773d in clone () from /lib64/libc.so.6
(gdb) 
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
#1  0x00007fd8b004746b in client_fdctx_destroy (this=<optimized out>, fdctx=0x7fd8ac21f630)
    at client-rpc-fops.c:3304
#2  0x00007fd8b0049946 in client_child_up_reopen_done (fdctx=0x7fd8ac21f630, this=0x7fd8ac00f7a0)
    at client-handshake.c:672
#3  0x00007fd8b0048811 in client3_3_reopendir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7fd8bb53bb30) at client-handshake.c:832
#4  0x00007fd8bd82b6e0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd8ac07b8f0, pollin=pollin@entry=0x7fd8a0002750)
    at rpc-clnt.c:791
#5  0x00007fd8bd82b9cf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd8ac07b948, event=<optimized out>, 
    data=0x7fd8a0002750) at rpc-clnt.c:971
#6  0x00007fd8bd827893 in rpc_transport_notify (this=this@entry=0x7fd8ac08b5d0, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fd8a0002750) at rpc-transport.c:538
#7  0x00007fd8b23192d4 in socket_event_poll_in (this=this@entry=0x7fd8ac08b5d0) at socket.c:2267
#8  0x00007fd8b231b785 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7fd8ac08b5d0, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007fd8bdabb5b0 in event_dispatch_epoll_handler (event=0x7fd8a946de80, event_pool=0x7fd8be59df00)
    at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7fd8ac03dfc0) at event-epoll.c:674
#11 0x00007fd8bc8c2dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd8bc20773d in clone () from /lib64/libc.so.6
(gdb) 
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
#1  0x00007fd8b004746b in client_fdctx_destroy (this=<optimized out>, fdctx=0x7fd8ac21f630)
    at client-rpc-fops.c:3304
#2  0x00007fd8b0049946 in client_child_up_reopen_done (fdctx=0x7fd8ac21f630, this=0x7fd8ac00f7a0)
    at client-handshake.c:672
#3  0x00007fd8b0048811 in client3_3_reopendir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7fd8bb53bb30) at client-handshake.c:832
#4  0x00007fd8bd82b6e0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd8ac07b8f0, pollin=pollin@entry=0x7fd8a0002750)
    at rpc-clnt.c:791
#5  0x00007fd8bd82b9cf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd8ac07b948, event=<optimized out>, 
    data=0x7fd8a0002750) at rpc-clnt.c:971
#6  0x00007fd8bd827893 in rpc_transport_notify (this=this@entry=0x7fd8ac08b5d0, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fd8a0002750) at rpc-transport.c:538
#7  0x00007fd8b23192d4 in socket_event_poll_in (this=this@entry=0x7fd8ac08b5d0) at socket.c:2267
#8  0x00007fd8b231b785 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7fd8ac08b5d0, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007fd8bdabb5b0 in event_dispatch_epoll_handler (event=0x7fd8a946de80, event_pool=0x7fd8be59df00)
    at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7fd8ac03dfc0) at event-epoll.c:674
#11 0x00007fd8bc8c2dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd8bc20773d in clone () from /lib64/libc.so.6
(gdb) 
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
#1  0x00007fd8b004746b in client_fdctx_destroy (this=<optimized out>, fdctx=0x7fd8ac21f630)
    at client-rpc-fops.c:3304
#2  0x00007fd8b0049946 in client_child_up_reopen_done (fdctx=0x7fd8ac21f630, this=0x7fd8ac00f7a0)
    at client-handshake.c:672
#3  0x00007fd8b0048811 in client3_3_reopendir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7fd8bb53bb30) at client-handshake.c:832
#4  0x00007fd8bd82b6e0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd8ac07b8f0, pollin=pollin@entry=0x7fd8a0002750)
    at rpc-clnt.c:791
#5  0x00007fd8bd82b9cf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd8ac07b948, event=<optimized out>, 
    data=0x7fd8a0002750) at rpc-clnt.c:971
#6  0x00007fd8bd827893 in rpc_transport_notify (this=this@entry=0x7fd8ac08b5d0, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fd8a0002750) at rpc-transport.c:538
#7  0x00007fd8b23192d4 in socket_event_poll_in (this=this@entry=0x7fd8ac08b5d0) at socket.c:2267
#8  0x00007fd8b231b785 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7fd8ac08b5d0, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007fd8bdabb5b0 in event_dispatch_epoll_handler (event=0x7fd8a946de80, event_pool=0x7fd8be59df00)
    at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7fd8ac03dfc0) at event-epoll.c:674
#11 0x00007fd8bc8c2dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd8bc20773d in clone () from /lib64/libc.so.6
(gdb) 
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
#1  0x00007fd8b004746b in client_fdctx_destroy (this=<optimized out>, fdctx=0x7fd8ac21f630)
    at client-rpc-fops.c:3304
#2  0x00007fd8b0049946 in client_child_up_reopen_done (fdctx=0x7fd8ac21f630, this=0x7fd8ac00f7a0)
    at client-handshake.c:672
#3  0x00007fd8b0048811 in client3_3_reopendir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7fd8bb53bb30) at client-handshake.c:832
#4  0x00007fd8bd82b6e0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd8ac07b8f0, pollin=pollin@entry=0x7fd8a0002750)
    at rpc-clnt.c:791
#5  0x00007fd8bd82b9cf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd8ac07b948, event=<optimized out>, 
    data=0x7fd8a0002750) at rpc-clnt.c:971
#6  0x00007fd8bd827893 in rpc_transport_notify (this=this@entry=0x7fd8ac08b5d0, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fd8a0002750) at rpc-transport.c:538
#7  0x00007fd8b23192d4 in socket_event_poll_in (this=this@entry=0x7fd8ac08b5d0) at socket.c:2267
#8  0x00007fd8b231b785 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7fd8ac08b5d0, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007fd8bdabb5b0 in event_dispatch_epoll_handler (event=0x7fd8a946de80, event_pool=0x7fd8be59df00)
    at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7fd8ac03dfc0) at event-epoll.c:674
#11 0x00007fd8bc8c2dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd8bc20773d in clone () from /lib64/libc.so.6
(gdb) 


Version-Release number of selected component (if applicable):
=====================================
3.8.4-12

Comment 2 nchilaka 2017-01-20 09:30:28 UTC
Volume Name: systemic
Type: Distributed-Replicate
Volume ID: 8c0f9f83-65a0-49df-a8e8-d3f8ca08bd9a
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/systemic
Brick2: 10.70.37.86:/rhs/brick1/systemic
Brick3: 10.70.35.156:/rhs/brick1/systemic
Brick4: 10.70.37.154:/rhs/brick1/systemic
Brick5: 10.70.35.20:/rhs/brick2/systemic
Brick6: 10.70.37.86:/rhs/brick2/systemic
Brick7: 10.70.35.156:/rhs/brick2/systemic
Brick8: 10.70.37.154:/rhs/brick2/systemic
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Comment 3 nchilaka 2017-01-20 10:13:05 UTC
core was found on node#4 10.70.37.156

core and sosreports are available at 
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1415101/

Comment 4 nchilaka 2017-01-20 11:55:27 UTC
More information:
It is a 4x2 volume spanning across 4 nodes, each hosting 2 bricks 
node 1 and 2 which host b1,b2,b5,b6 have no heal pendings or files in splitbrains
node 3 and 4 which host b3,b4,b7,b8 have lot of files in pending heals and lot of directories in split-brains 
Note: n4 is the node where shd crashed

following is heal info pending information:
[root@dhcp35-20 glusterfs]# gluster  v heal systemic info|grep ntries
Number of entries: 0
Number of entries: 0
Number of entries: 6753
Number of entries: 15934
Number of entries: 0
Number of entries: 0
Number of entries: 1084
Number of entries: 8011

split brain details are attached

Comment 5 nchilaka 2017-01-20 11:56:21 UTC
Created attachment 1242730 [details]
list of entries in splitbrain

Comment 6 nchilaka 2017-01-20 11:57:31 UTC
note i am unable to post the output of volume status due to following consistent problem


[root@dhcp35-20 glusterfs]# gluster v status systemic
Locking failed on 10.70.37.86. Please check log file for details.


root@dhcp35-20 glusterfs]# tail glusterd.log
[2017-01-20 10:21:25.123623] I [MSGID: 106499] [glusterd-handler.c:4362:__glusterd_handle_status_volume] 0-management: Received status volume req for volume systemic
[2017-01-20 10:21:25.124890] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 10:21:25.128200] E [MSGID: 106151] [glusterd-syncop.c:1889:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2017-01-20 10:21:25.128720] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 10:21:25.136358] E [MSGID: 106152] [glusterd-syncop.c:1590:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
[2017-01-20 11:00:49.929280] I [MSGID: 106499] [glusterd-handler.c:4362:__glusterd_handle_status_volume] 0-management: Received status volume req for volume systemic
[2017-01-20 11:00:49.931730] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 11:00:49.932039] E [MSGID: 106151] [glusterd-syncop.c:1889:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2017-01-20 11:00:49.932734] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 11:00:49.932884] E [MSGID: 106152] [glusterd-syncop.c:1590:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)

Comment 7 nchilaka 2017-01-20 12:01:38 UTC
almost all (except 3 files which were logging directory creations) split=brain entries are directories, which were being created from 3 different clients in parallel


node 3 has seen  a few kernel hang information in dmesg, this could be that one of the client creating directory couldn't successfully create over here due to that momentary hang, while the other client successfully wrote to this node first and hence may be this caused the splitbrain

However, I did validate a couple of directories in the xattrs to see if they were due to gfid mismatch, but rather found them blaming each other in entries bits

Comment 8 nchilaka 2017-01-20 12:28:04 UTC
kernel hangs in dmesg of node3 ....other nodes have none
[root@dhcp35-156 ~]# dmesg -T
[Thu Jan 19 22:18:13 2017] INFO: task xfsaild/dm-0:487 blocked for more than 120 seconds.
[Thu Jan 19 22:18:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:18:13 2017] xfsaild/dm-0    D ffff88044fb87000     0   487      2 0x00000000
[Thu Jan 19 22:18:13 2017]  ffff880450293d58 0000000000000046 ffff88044fe31f60 ffff880450293fd8
[Thu Jan 19 22:18:13 2017]  ffff880450293fd8 ffff880450293fd8 ffff88044fe31f60 ffff88045022ac00
[Thu Jan 19 22:18:13 2017]  0000000000000000 ffff88044fe31f60 ffff880466014128 ffff88044fb87000
[Thu Jan 19 22:18:13 2017] Call Trace:
[Thu Jan 19 22:18:13 2017]  [<ffffffff8168b799>] schedule+0x29/0x70
[Thu Jan 19 22:18:13 2017]  [<ffffffffa029497d>] _xfs_log_force+0x1bd/0x2b0 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffff810c4fe0>] ? wake_up_state+0x20/0x20
[Thu Jan 19 22:18:13 2017]  [<ffffffffa0294a96>] xfs_log_force+0x26/0x80 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffffa02a0360>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffffa02a04ba>] xfsaild+0x15a/0x660 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffffa02a0360>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffff810b064f>] kthread+0xcf/0xe0
[Thu Jan 19 22:18:13 2017]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[Thu Jan 19 22:18:13 2017]  [<ffffffff81696618>] ret_from_fork+0x58/0x90
[Thu Jan 19 22:18:13 2017]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[Thu Jan 19 22:24:13 2017] INFO: task glusterd:1323 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterd        D ffff8804505f7968     0  1323      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff8800bb843c70 0000000000000082 ffff88044bb23ec0 ffff8800bb843fd8
[Thu Jan 19 22:24:13 2017]  ffff8800bb843fd8 ffff8800bb843fd8 ffff88044bb23ec0 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88044bb23ec0 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ccff>] do_last+0x28f/0x12a0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
[Thu Jan 19 22:24:13 2017]  [<ffffffff811b0f21>] ? handle_mm_fault+0x6b1/0xfe0
[Thu Jan 19 22:24:13 2017]  [<ffffffff810f5090>] ? futex_wake+0x80/0x160
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4048 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4048      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff8804522cfdf0 0000000000000086 ffff88043a3c1f60 ffff8804522cffd8
[Thu Jan 19 22:24:13 2017]  ffff8804522cffd8 ffff8804522cffd8 ffff88043a3c1f60 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043a3c1f60 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4049 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4049      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff880451b4bc70 0000000000000086 ffff88043a3c4e70 ffff880451b4bfd8
[Thu Jan 19 22:24:13 2017]  ffff880451b4bfd8 ffff880451b4bfd8 ffff88043a3c4e70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043a3c4e70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ccff>] do_last+0x28f/0x12a0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
[Thu Jan 19 22:24:13 2017]  [<ffffffff810f5090>] ? futex_wake+0x80/0x160
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4380 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4380      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff880417a3bdf0 0000000000000086 ffff88043eefaf10 ffff880417a3bfd8
[Thu Jan 19 22:24:13 2017]  ffff880417a3bfd8 ffff880417a3bfd8 ffff88043eefaf10 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043eefaf10 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4442 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4442      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff8803c960bdf0 0000000000000086 ffff88043c225e20 ffff8803c960bfd8
[Thu Jan 19 22:24:13 2017]  ffff8803c960bfd8 ffff8803c960bfd8 ffff88043c225e20 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043c225e20 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4994 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4994      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88000b1c7df0 0000000000000086 ffff88043c220fb0 ffff88000b1c7fd8
[Thu Jan 19 22:24:13 2017]  ffff88000b1c7fd8 ffff88000b1c7fd8 ffff88043c220fb0 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043c220fb0 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811b4463>] ? remove_vma+0x63/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4996 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4996      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88000b1afc70 0000000000000086 ffff88044037ce70 ffff88000b1affd8
[Thu Jan 19 22:24:13 2017]  ffff88000b1affd8 ffff88000b1affd8 ffff88044037ce70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88044037ce70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ccff>] do_last+0x28f/0x12a0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:6159 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  6159      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88009db83df0 0000000000000086 ffff880452214e70 ffff88009db83fd8
[Thu Jan 19 22:24:13 2017]  ffff88009db83fd8 ffff88009db83fd8 ffff880452214e70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff880452214e70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811b4463>] ? remove_vma+0x63/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:6276 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  6276      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88000618fdf0 0000000000000086 ffff88044debce70 ffff88000618ffd8
[Thu Jan 19 22:24:13 2017]  ffff88000618ffd8 ffff88000618ffd8 ffff88044debce70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88044debce70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Fri Jan 20 15:22:26 2017] nr_pdflush_threads exported in /proc is scheduled for removal
[Fri Jan 20 15:27:08 2017] warning: `turbostat' uses 32-bit capabilities (legacy support in use)
[root@dhcp35-156 ~]#

Comment 9 nchilaka 2017-01-20 12:38:26 UTC
client logs which were creating parallely directories 
/var/www/html/sosreports/nchilaka/bug.1415101/clients/same-dir-create-clients

Comment 11 nchilaka 2017-01-20 12:41:20 UTC
one kernel hang on 2nd client of the same-dir-create set ie rhs-client23
[root@rhs-client23 tmp]# dmesg 
[990206.629702] INFO: task mkdir:16145 blocked for more than 120 seconds.
[990206.629742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[990206.629775] mkdir           D ffff8804719280b0     0 16145  10981 0x00000080
[990206.629780]  ffff88043b523d90 0000000000000086 ffff88043b495c00 ffff88043b523fd8
[990206.629785]  ffff88043b523fd8 ffff88043b523fd8 ffff88043b495c00 ffff8804719280a8
[990206.629788]  ffff8804719280ac ffff88043b495c00 00000000ffffffff ffff8804719280b0
[990206.629791] Call Trace:
[990206.629802]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[990206.629806]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[990206.629809]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[990206.629814]  [<ffffffff811ee085>] filename_create+0x85/0x180
[990206.629819]  [<ffffffff811c0e65>] ? kmem_cache_alloc+0x35/0x1d0
[990206.629821]  [<ffffffff811eec0f>] ? getname_flags+0x4f/0x1a0
[990206.629823]  [<ffffffff811eec84>] ? getname_flags+0xc4/0x1a0
[990206.629825]  [<ffffffff811eef31>] user_path_create+0x41/0x60
[990206.629829]  [<ffffffff811f0246>] SyS_mkdirat+0x46/0xe0
[990206.629833]  [<ffffffff81641113>] ? do_page_fault+0x23/0x80
[990206.629835]  [<ffffffff811f02f9>] SyS_mkdir+0x19/0x20
[990206.629838]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[1003985.918712] SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
[1003990.523134] nr_pdflush_threads exported in /proc is scheduled for removal
[1003994.086202] warning: `turbostat' uses 32-bit capabilities (legacy support in use)

Comment 12 nchilaka 2017-01-20 13:32:37 UTC
raised a seperate bug for rpc ping time out issue
https://bugzilla.redhat.com/show_bug.cgi?id=1415178  - systemic testing: seeing lot of ping time outs which would lead to splitbrains

Comment 14 Ravishankar N 2017-02-02 11:41:15 UTC
Upstream patch: https://review.gluster.org/#/c/16521/

Comment 15 Atin Mukherjee 2017-02-13 12:00:09 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/97481/

Comment 17 nchilaka 2017-02-28 13:46:10 UTC
Ravi, 
Can you help me with a set of functional steps or cases to verify this bug.
Given that the fix was submitted with 3.8.4-15, I am not in a position to try on soak the fix on my systemic testing as it is running on 3.8.4-14, and upgrading my systemic setup to this build is not feasible at this time of the release.
We can sit together and using gdb validate to confirm this problem is fixed

Comment 18 Ravishankar N 2017-02-28 16:39:13 UTC
Nag, it is not easy to reproduce this even using gdb. The fix was done based on code-reading. If you want, I can explain how the changes made in the patch fixes the crash. Let me know if that is okay.

Comment 19 nchilaka 2017-03-02 09:00:41 UTC
I have run some stress testing on  3.8.4-15 and i haven't hit any crash.
As there is no straigtforward way of verifying the fix, I am moving it to verified after discussing with dev and based on comment#18

Comment 21 errata-xmlrpc 2017-03-23 06:03:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html


Note You need to log in before you can comment on or make changes to this bug.