Description of problem:
=======================
On my systemic setup, glustershd crashed. For the exact testing information, kindly refer to https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=1917869133

[root@dhcp37-154 ~]# file /core.3893
/core.3893: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/gl', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfs', platform: 'x86_64'
[root@dhcp37-154 ~]# gdb /usr/sbin/glusterfs /core.3893
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 3914]
[New LWP 13648]
[New LWP 3894]
[New LWP 3895]
[New LWP 13826]
[New LWP 3898]
[New LWP 3897]
[New LWP 3896]
[New LWP 3893]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/gl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
314             GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 openssl-libs-1.0.1e-60.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fd8bda87ce5 in __gf_free (free_ptr=0x7fd8ac21f630) at mem-pool.c:314
#1  0x00007fd8b004746b in client_fdctx_destroy (this=<optimized out>, fdctx=0x7fd8ac21f630) at client-rpc-fops.c:3304
#2  0x00007fd8b0049946 in client_child_up_reopen_done (fdctx=0x7fd8ac21f630, this=0x7fd8ac00f7a0) at client-handshake.c:672
#3  0x00007fd8b0048811 in client3_3_reopendir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fd8bb53bb30) at client-handshake.c:832
#4  0x00007fd8bd82b6e0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd8ac07b8f0, pollin=pollin@entry=0x7fd8a0002750) at rpc-clnt.c:791
#5  0x00007fd8bd82b9cf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd8ac07b948, event=<optimized out>, data=0x7fd8a0002750) at rpc-clnt.c:971
#6  0x00007fd8bd827893 in rpc_transport_notify (this=this@entry=0x7fd8ac08b5d0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fd8a0002750) at rpc-transport.c:538
#7  0x00007fd8b23192d4 in socket_event_poll_in (this=this@entry=0x7fd8ac08b5d0) at socket.c:2267
#8  0x00007fd8b231b785 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7fd8ac08b5d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007fd8bdabb5b0 in event_dispatch_epoll_handler (event=0x7fd8a946de80, event_pool=0x7fd8be59df00) at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7fd8ac03dfc0) at event-epoll.c:674
#11 0x00007fd8bc8c2dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd8bc20773d in clone () from /lib64/libc.so.6
Version-Release number of selected component (if applicable):
=============================================================
3.8.4-12
Volume Name: systemic
Type: Distributed-Replicate
Volume ID: 8c0f9f83-65a0-49df-a8e8-d3f8ca08bd9a
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/systemic
Brick2: 10.70.37.86:/rhs/brick1/systemic
Brick3: 10.70.35.156:/rhs/brick1/systemic
Brick4: 10.70.37.154:/rhs/brick1/systemic
Brick5: 10.70.35.20:/rhs/brick2/systemic
Brick6: 10.70.37.86:/rhs/brick2/systemic
Brick7: 10.70.35.156:/rhs/brick2/systemic
Brick8: 10.70.37.154:/rhs/brick2/systemic
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
The core was found on node #4 (10.70.37.156). The core and sosreports are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1415101/
More information:
It is a 4x2 volume spanning 4 nodes, each hosting 2 bricks.
Nodes 1 and 2, which host b1, b2, b5 and b6, have no pending heals or files in split-brain.
Nodes 3 and 4, which host b3, b4, b7 and b8, have a lot of files with pending heals and a lot of directories in split-brain.
Note: n4 is the node where shd crashed.

Following is the pending-heal information:

[root@dhcp35-20 glusterfs]# gluster v heal systemic info|grep ntries
Number of entries: 0
Number of entries: 0
Number of entries: 6753
Number of entries: 15934
Number of entries: 0
Number of entries: 0
Number of entries: 1084
Number of entries: 8011

Split-brain details are attached.
Created attachment 1242730 [details] list of entries in splitbrain
Note: I am unable to post the output of volume status because of the following persistent problem:

[root@dhcp35-20 glusterfs]# gluster v status systemic
Locking failed on 10.70.37.86. Please check log file for details.

[root@dhcp35-20 glusterfs]# tail glusterd.log
[2017-01-20 10:21:25.123623] I [MSGID: 106499] [glusterd-handler.c:4362:__glusterd_handle_status_volume] 0-management: Received status volume req for volume systemic
[2017-01-20 10:21:25.124890] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 10:21:25.128200] E [MSGID: 106151] [glusterd-syncop.c:1889:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2017-01-20 10:21:25.128720] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 10:21:25.136358] E [MSGID: 106152] [glusterd-syncop.c:1590:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
[2017-01-20 11:00:49.929280] I [MSGID: 106499] [glusterd-handler.c:4362:__glusterd_handle_status_volume] 0-management: Received status volume req for volume systemic
[2017-01-20 11:00:49.931730] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 11:00:49.932039] E [MSGID: 106151] [glusterd-syncop.c:1889:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2017-01-20 11:00:49.932734] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.37.86. Please check log file for details.
[2017-01-20 11:00:49.932884] E [MSGID: 106152] [glusterd-syncop.c:1590:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
Almost all of the split-brain entries (except 3 files, which were logging directory creations) are directories, which were being created from 3 different clients in parallel. Node 3 has seen a few kernel hangs in dmesg; it could be that one of the clients creating a directory could not successfully create it on this node because of that momentary hang, while another client successfully wrote to this node first, and that this is what caused the split-brain. However, when I validated the xattrs of a couple of the directories to check whether they were gfid mismatches, I instead found the replicas blaming each other in the entry bits.
Kernel hangs in dmesg of node 3; the other nodes have none.

[root@dhcp35-156 ~]# dmesg -T
[Thu Jan 19 22:18:13 2017] INFO: task xfsaild/dm-0:487 blocked for more than 120 seconds.
[Thu Jan 19 22:18:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:18:13 2017] xfsaild/dm-0    D ffff88044fb87000     0   487      2 0x00000000
[Thu Jan 19 22:18:13 2017]  ffff880450293d58 0000000000000046 ffff88044fe31f60 ffff880450293fd8
[Thu Jan 19 22:18:13 2017]  ffff880450293fd8 ffff880450293fd8 ffff88044fe31f60 ffff88045022ac00
[Thu Jan 19 22:18:13 2017]  0000000000000000 ffff88044fe31f60 ffff880466014128 ffff88044fb87000
[Thu Jan 19 22:18:13 2017] Call Trace:
[Thu Jan 19 22:18:13 2017]  [<ffffffff8168b799>] schedule+0x29/0x70
[Thu Jan 19 22:18:13 2017]  [<ffffffffa029497d>] _xfs_log_force+0x1bd/0x2b0 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffff810c4fe0>] ? wake_up_state+0x20/0x20
[Thu Jan 19 22:18:13 2017]  [<ffffffffa0294a96>] xfs_log_force+0x26/0x80 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffffa02a0360>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffffa02a04ba>] xfsaild+0x15a/0x660 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffffa02a0360>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Thu Jan 19 22:18:13 2017]  [<ffffffff810b064f>] kthread+0xcf/0xe0
[Thu Jan 19 22:18:13 2017]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[Thu Jan 19 22:18:13 2017]  [<ffffffff81696618>] ret_from_fork+0x58/0x90
[Thu Jan 19 22:18:13 2017]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[Thu Jan 19 22:24:13 2017] INFO: task glusterd:1323 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterd        D ffff8804505f7968     0  1323      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff8800bb843c70 0000000000000082 ffff88044bb23ec0 ffff8800bb843fd8
[Thu Jan 19 22:24:13 2017]  ffff8800bb843fd8 ffff8800bb843fd8 ffff88044bb23ec0 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88044bb23ec0 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ccff>] do_last+0x28f/0x12a0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
[Thu Jan 19 22:24:13 2017]  [<ffffffff811b0f21>] ? handle_mm_fault+0x6b1/0xfe0
[Thu Jan 19 22:24:13 2017]  [<ffffffff810f5090>] ? futex_wake+0x80/0x160
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4048 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4048      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff8804522cfdf0 0000000000000086 ffff88043a3c1f60 ffff8804522cffd8
[Thu Jan 19 22:24:13 2017]  ffff8804522cffd8 ffff8804522cffd8 ffff88043a3c1f60 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043a3c1f60 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4049 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4049      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff880451b4bc70 0000000000000086 ffff88043a3c4e70 ffff880451b4bfd8
[Thu Jan 19 22:24:13 2017]  ffff880451b4bfd8 ffff880451b4bfd8 ffff88043a3c4e70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043a3c4e70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ccff>] do_last+0x28f/0x12a0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
[Thu Jan 19 22:24:13 2017]  [<ffffffff810f5090>] ? futex_wake+0x80/0x160
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4380 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4380      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff880417a3bdf0 0000000000000086 ffff88043eefaf10 ffff880417a3bfd8
[Thu Jan 19 22:24:13 2017]  ffff880417a3bfd8 ffff880417a3bfd8 ffff88043eefaf10 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043eefaf10 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4442 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4442      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff8803c960bdf0 0000000000000086 ffff88043c225e20 ffff8803c960bfd8
[Thu Jan 19 22:24:13 2017]  ffff8803c960bfd8 ffff8803c960bfd8 ffff88043c225e20 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043c225e20 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4994 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4994      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88000b1c7df0 0000000000000086 ffff88043c220fb0 ffff88000b1c7fd8
[Thu Jan 19 22:24:13 2017]  ffff88000b1c7fd8 ffff88000b1c7fd8 ffff88043c220fb0 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88043c220fb0 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811b4463>] ? remove_vma+0x63/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:4996 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  4996      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88000b1afc70 0000000000000086 ffff88044037ce70 ffff88000b1affd8
[Thu Jan 19 22:24:13 2017]  ffff88000b1affd8 ffff88000b1affd8 ffff88044037ce70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88044037ce70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ccff>] do_last+0x28f/0x12a0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:6159 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  6159      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88009db83df0 0000000000000086 ffff880452214e70 ffff88009db83fd8
[Thu Jan 19 22:24:13 2017]  ffff88009db83fd8 ffff88009db83fd8 ffff880452214e70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff880452214e70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff811b4463>] ? remove_vma+0x63/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Thu Jan 19 22:24:13 2017] INFO: task glusterfsd:6276 blocked for more than 120 seconds.
[Thu Jan 19 22:24:13 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jan 19 22:24:13 2017] glusterfsd      D ffff8804505f7968     0  6276      1 0x00000080
[Thu Jan 19 22:24:13 2017]  ffff88000618fdf0 0000000000000086 ffff88044debce70 ffff88000618ffd8
[Thu Jan 19 22:24:13 2017]  ffff88000618ffd8 ffff88000618ffd8 ffff88044debce70 ffff8804505f7960
[Thu Jan 19 22:24:13 2017]  ffff8804505f7964 ffff88044debce70 00000000ffffffff ffff8804505f7968
[Thu Jan 19 22:24:13 2017] Call Trace:
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168c889>] schedule_preempt_disabled+0x29/0x70
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168a4e5>] __mutex_lock_slowpath+0xc5/0x1c0
[Thu Jan 19 22:24:13 2017]  [<ffffffff8168994f>] mutex_lock+0x1f/0x2f
[Thu Jan 19 22:24:13 2017]  [<ffffffff8120f40b>] do_unlinkat+0x13b/0x2b0
[Thu Jan 19 22:24:13 2017]  [<ffffffff81210486>] SyS_unlink+0x16/0x20
[Thu Jan 19 22:24:13 2017]  [<ffffffff816966c9>] system_call_fastpath+0x16/0x1b
[Fri Jan 20 15:22:26 2017] nr_pdflush_threads exported in /proc is scheduled for removal
[Fri Jan 20 15:27:08 2017] warning: `turbostat' uses 32-bit capabilities (legacy support in use)
[root@dhcp35-156 ~]#
Client logs from the clients that were creating directories in parallel: /var/www/html/sosreports/nchilaka/bug.1415101/clients/same-dir-create-clients
Web link: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1415101/clients/same-dir-create-clients/
One kernel hang on the 2nd client of the same-dir-create set, i.e. rhs-client23:

[root@rhs-client23 tmp]# dmesg
[990206.629702] INFO: task mkdir:16145 blocked for more than 120 seconds.
[990206.629742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[990206.629775] mkdir           D ffff8804719280b0     0 16145  10981 0x00000080
[990206.629780]  ffff88043b523d90 0000000000000086 ffff88043b495c00 ffff88043b523fd8
[990206.629785]  ffff88043b523fd8 ffff88043b523fd8 ffff88043b495c00 ffff8804719280a8
[990206.629788]  ffff8804719280ac ffff88043b495c00 00000000ffffffff ffff8804719280b0
[990206.629791] Call Trace:
[990206.629802]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[990206.629806]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[990206.629809]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[990206.629814]  [<ffffffff811ee085>] filename_create+0x85/0x180
[990206.629819]  [<ffffffff811c0e65>] ? kmem_cache_alloc+0x35/0x1d0
[990206.629821]  [<ffffffff811eec0f>] ? getname_flags+0x4f/0x1a0
[990206.629823]  [<ffffffff811eec84>] ? getname_flags+0xc4/0x1a0
[990206.629825]  [<ffffffff811eef31>] user_path_create+0x41/0x60
[990206.629829]  [<ffffffff811f0246>] SyS_mkdirat+0x46/0xe0
[990206.629833]  [<ffffffff81641113>] ? do_page_fault+0x23/0x80
[990206.629835]  [<ffffffff811f02f9>] SyS_mkdir+0x19/0x20
[990206.629838]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[1003985.918712] SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
[1003990.523134] nr_pdflush_threads exported in /proc is scheduled for removal
[1003994.086202] warning: `turbostat' uses 32-bit capabilities (legacy support in use)
Raised a separate bug for the RPC ping-timeout issue: https://bugzilla.redhat.com/show_bug.cgi?id=1415178 - "systemic testing: seeing lot of ping time outs which would lead to splitbrains"
Upstream patch: https://review.gluster.org/#/c/16521/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97481/
Ravi, can you help me with a set of functional steps or test cases to verify this bug? Given that the fix was submitted with 3.8.4-15, I am not in a position to soak-test the fix in my systemic testing, as that setup is running 3.8.4-14 and upgrading it to this build is not feasible at this point in the release. We can sit together and use gdb to validate and confirm that this problem is fixed.
Nag, it is not easy to reproduce this, even using gdb. The fix was made based on code reading. If you want, I can explain how the changes made in the patch fix the crash. Let me know if that is okay.
I have run some stress testing on 3.8.4-15 and have not hit any crash. As there is no straightforward way of verifying the fix, I am moving this to VERIFIED after discussing with dev, based on comment#18.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html