Description of problem:

During a random I/O test, IOZone deadlocked:

Mar 19 14:39:42 gqac022 kernel: INFO: task iozone:13184 blocked for more than 120 seconds.
Mar 19 14:39:42 gqac022 kernel: Not tainted 2.6.32-431.el6.x86_64 #1
Mar 19 14:39:42 gqac022 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 19 14:39:42 gqac022 kernel: iozone D 0000000000000000 0 13184 13183 0x00000084
Mar 19 14:39:42 gqac022 kernel: ffff880be811fda8 0000000000000086 0000000000000000 0000000000000000
Mar 19 14:39:42 gqac022 kernel: ffff880be6fa0a80 ffff880be839f980 ffff880be811fdf8 ffffffffa02fc78b
Mar 19 14:39:42 gqac022 kernel: ffff880be80db058 ffff880be811ffd8 000000000000fbc8 ffff880be80db058
Mar 19 14:39:42 gqac022 kernel: Call Trace:
Mar 19 14:39:42 gqac022 kernel: [<ffffffffa02fc78b>] ? fuse_dentry_revalidate+0x2fb/0x340 [fuse]
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8152907e>] __mutex_lock_slowpath+0x13e/0x180
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8122752f>] ? security_inode_permission+0x1f/0x30
Mar 19 14:39:42 gqac022 kernel: [<ffffffff81528f1b>] mutex_lock+0x2b/0x50
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8119796c>] vfs_unlink+0x5c/0xf0
Mar 19 14:39:42 gqac022 kernel: [<ffffffff81196c1a>] ? lookup_hash+0x3a/0x50
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8119a723>] do_unlinkat+0x163/0x260
Mar 19 14:39:42 gqac022 kernel: [<ffffffff810e2067>] ? audit_syscall_entry+0x1d7/0x200
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8119a836>] sys_unlink+0x16/0x20
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.53

How reproducible:
This is the first time I have hit it in about 20 runs on this build.

Steps to Reproduce:
1. Run the IOZone random I/O test

Actual results:
IOZone hangs due to what I assume is a lost lock.

Expected results:
Normal operation.

Additional info:
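For context, the 120-second threshold in the message above is the kernel's hung-task watchdog, and the message itself names the sysctl that controls it. A minimal sketch of inspecting or tuning it on a RHEL 6 host (requires root; the 300s value is illustrative). Note this only affects the reporting; the iozone task would still be stuck on the inode mutex taken in vfs_unlink:

```shell
# Show the current hung-task warning threshold (120s by default on RHEL 6)
cat /proc/sys/kernel/hung_task_timeout_secs

# Disable the warning entirely, as the kernel message suggests
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Or raise it (illustrative value) to cut noise during long I/O stalls
sysctl -w kernel.hung_task_timeout_secs=300
```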
Client logs from the event:

[2015-03-19 14:10:35.136562] I [afr-self-heal-common.c:1398:afr_sh_missing_entry_call_impunge_recreate] 0-testvol-replicate-0: no missing files - /network_shared/stonewall.tmp. proceeding to metadata check
[2015-03-19 14:10:35.148003] I [afr-self-heal-common.c:2874:afr_log_self_heal_completion_status] 0-testvol-replicate-0: gfid or missing entry self heal is started, on /network_shared/stonewall.tmp
[2015-03-19 17:42:39.689155] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server 192.168.61.101:49152 has not responded in the last 42 seconds, disconnecting.
[2015-03-19 17:42:39.757626] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.085874 (xid=0x20c7efe)
[2015-03-19 17:42:39.757643] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 17:42:39.757713] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.086589 (xid=0x20c7f0a)
[2015-03-19 17:42:39.757722] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 17:42:39.757766] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.088977 (xid=0x20c7f31)
[2015-03-19 17:42:39.757775] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 17:42:39.757827] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.089762 (xid=0x20c7f32)
[2015-03-19 17:42:39.757837] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
And the server:

[2015-03-19 17:42:39.779725] I [server-handshake.c:578:server_setvolume] 0-testvol-server: accepted client from gqac022.sbu.lab.eng.bos.redhat.com-1853-2015/03/18-21:00:28:131978-testvol-client-1-0-1 (version: 3.6.0.53)
[2015-03-19 17:42:39.780749] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/80/b7/80b7bf90-9d40-4b98-843a-8654703022e8: No such file or directory
[2015-03-19 17:42:39.780814] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/ac/51/ac51c87e-9c97-4c78-a542-bc6313c33328: No such file or directory
[2015-03-19 17:42:39.780937] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/25/99/259929ed-965b-4615-86f2-e94c68938069: No such file or directory
[2015-03-19 17:42:39.781033] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/52/7a/527ab0c5-53fc-4d54-8c03-a5aa759217f8: No such file or directory
[2015-03-19 17:42:39.782572] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks: Matching lock not found for unlock 8749056-8765439, by ecd507a4b47f0000 on 0x7f58dc0763b0
[2015-03-19 17:42:39.782609] E [server-rpc-fops.c:279:server_finodelk_cbk] 0-testvol-server: 34373742: FINODELK -2 (17d8210e-e60b-4f0d-ac4e-1b5fb4d3334f) ==> (Invalid argument)
[2015-03-19 17:42:39.782622] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks: Matching lock not found for unlock 515686400-515702783, by ecd507a4b47f0000 on 0x7f58dc0763b0
[2015-03-19 17:42:39.782662] E [server-rpc-fops.c:279:server_finodelk_cbk] 0-testvol-server: 34373745: FINODELK -2 (17d8210e-e60b-4f0d-ac4e-1b5fb4d3334f) ==> (Invalid argument)
[2015-03-19 17:42:39.782785] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks: Matching lock not found for unlock 35913728-35930111, by ecd507a4b47f0000 on 0x7f58dc0763b0
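The "has not responded in the last 42 seconds" disconnect in the client log is GlusterFS's network.ping-timeout expiring; the forced frame unwinding and, after reconnect, the server-side "Matching lock not found for unlock" errors follow from that disconnect. As a sketch only (volume name taken from the logs; 60 is an illustrative value, not a recommendation), the timeout can be inspected and raised per volume:

```shell
# Show reconfigured options for the volume (network.ping-timeout defaults to 42s)
gluster volume info testvol | grep -i ping

# Illustrative: raise the ping timeout to 60s. This can ride out transient
# server-side stalls, but it also delays failover after a genuine brick failure.
gluster volume set testvol network.ping-timeout 60
```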
Created attachment 1005663 [details]
Email thread on this issue

Capturing the e-mail thread we had on this issue between 03/20/2015 (started by Sayan) and 03/24/2015 (last updated by Vivek).
*** Bug 1204580 has been marked as a duplicate of this bug. ***
Hi Shyam, I read in an email chain that this bug is a known issue for 3.0.4. Could you please fill out the doc text?
Hang on the NFS-client gqac005 at "Mar 26 14:27:16 EDT":

INFO: task iozone:23018 blocked for more than 120 seconds.
Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone D 0000000000000011 0 23018 1 0x00000080
ffff880385fe7cc8 0000000000000086 0000000000000000 ffffffffa0247940
ffff880385fe7c38 ffffffff810149b9 ffff880385fe7c78 ffffffff810a70a1
ffff8806234405f8 ffff880385fe7fd8 000000000000fbc8 ffff8806234405f8
Call Trace:
[<ffffffffa0247940>] ? rpc_execute+0x50/0xa0 [sunrpc]
[<ffffffff810149b9>] ? read_tsc+0x9/0x20
[<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff8111f930>] ? sync_page+0x0/0x50
[<ffffffff815280a3>] io_schedule+0x73/0xc0
[<ffffffff8111f96d>] sync_page+0x3d/0x50
[<ffffffff81528b6f>] __wait_on_bit+0x5f/0x90
[<ffffffff8111fba3>] wait_on_page_bit+0x73/0x80
[<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
[<ffffffff81135bf5>] ? pagevec_lookup_tag+0x25/0x40
[<ffffffff8111ffcb>] wait_on_page_writeback_range+0xfb/0x190
[<ffffffff81120198>] filemap_write_and_wait_range+0x78/0x90
[<ffffffff811baa3e>] vfs_fsync_range+0x7e/0x100
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff811bab2d>] vfs_fsync+0x1d/0x20
[<ffffffff811bab6e>] do_fsync+0x3e/0x60
[<ffffffff811babc0>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Related entries in the nfs.log on gqas006 (time in UTC):

[2015-03-26 18:17:12.188198] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-testvol-client-2: server 192.168.79.102:49152 has not responded in the last 42 seconds, disconnecting.
[2015-03-26 18:17:12.209865] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7f8a589e3d3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7f8a589e3771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f8a589e36be]))) 0-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2015-03-26 18:14:54.129253 (xid=0xbb5)
[2015-03-26 18:17:12.209896] W [client-rpc-fops.c:970:client3_3_fsync_cbk] 0-testvol-client-2: remote operation failed: Transport endpoint is not connected
[2015-03-26 18:17:12.209927] W [afr-transaction.c:1710:afr_changelog_fsync_cbk] 0-testvol-replicate-1: fsync(6ec3e93f-d6be-430a-8ba2-fd18396c308b) failed on subvolume testvol-client-2. Transaction was WRITE
[2015-03-26 18:17:12.238123] I [socket.c:3176:socket_submit_request] 0-testvol-client-2: not connected (priv->connected = 0)
[2015-03-26 18:17:12.238148] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-testvol-client-2: failed to submit rpc-request (XID: 0x2b8c Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (testvol-client-2)
[2015-03-26 18:17:12.238169] W [client-rpc-fops.c:1804:client3_3_fxattrop_cbk] 0-testvol-client-2: remote operation failed: Transport endpoint is not connected

First errors in the brick-log on gqas012 (timestamps in UTC):

[2015-03-26 18:17:18.517324] W [socket.c:546:__socket_rwv] 0-tcp.testvol-server: writev on 192.168.79.101:996 failed (Connection reset by peer)
[2015-03-26 18:17:18.518800] I [socket.c:2290:socket_event_handler] 0-transport: disconnecting now
[2015-03-26 18:17:18.519205] I [server.c:519:server_rpc_notify] 0-testvol-server: disconnecting connectionfrom gqas006.sbu.lab.eng.bos.redhat.com-8646-2015/03/26-17:01:03:430363-testvol-client-2-0-0
[2015-03-26 18:17:18.528667] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-testvol-server: releasing lock on 6ec3e93f-d6be-430a-8ba2-fd18396c308b held by {client=0x7fa64c008a60, pid=0 lk-owner=481402448a7f0000}
[2015-03-26 18:17:18.528707] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks: Matching lock not found for unlock 0-9223372036854775807, by 481402448a7f0000 on 0x7fa64c045ec0
[2015-03-26 18:17:18.528825] E [server-rpc-fops.c:279:server_finodelk_cbk] 0-testvol-server: 11198: FINODELK -2 (6ec3e93f-d6be-430a-8ba2-fd18396c308b) ==> (Invalid argument)

Some sar stats from the GlusterFS server (gqas012) that hosts the brick involved in the disconnects and from which this client mounts the export (/var/log/sa/sar26, time in EDT):

              kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit  %commit
02:10:01 PM    40739496   8805104     17.77      2108   7898512    833792     1.12
02:11:01 PM    34307572  15237028     30.75      2260  14095660    829540     1.12
02:12:01 PM    33989668  15554932     31.40      2284  14399876    829720     1.12
02:13:01 PM    33988424  15556176     31.40      2300  14399888    829540     1.12
02:14:02 PM    48767980    776620      1.57      1788     23420    830796     1.12
02:15:01 PM    36045192  13499408     27.25      2076  12313600    880964     1.19
02:16:01 PM    22940780  26603820     53.70      2092  25013844    919696     1.24
02:17:01 PM    21328164  28216436     56.95      2108  26559012    923972     1.24
02:18:01 PM    14682580  34862020     70.36      2160  33042012    905540     1.22
02:19:01 PM      236748  49307852     99.52      1028  47003300    872296     1.17
02:20:01 PM      266936  49277664     99.46      1044  47041912    860080     1.16
02:21:01 PM      217616  49326984     99.56       820  47174764    840004     1.13
02:22:02 PM    47125144   2419456      4.88      1784   1646388    839656     1.13
02:23:01 PM    29545200  19999400     40.37      2056  18633168    876868     1.18
02:24:01 PM    20712976  28831624     58.19      2084  27172612    917828     1.24
02:25:01 PM    20361348  29183252     58.90      2224  27549120    944316     1.27
02:26:01 PM    10397332  39147268     79.01      2404  37262500    936260     1.26
02:27:01 PM     3987912  45556688     91.95      2428  43477164    936260     1.26
02:28:01 PM     1246492  48298108     97.48      2452  46194420    901444     1.21
02:29:01 PM     1247860  48296740     97.48      2468  46194440    901444     1.21
02:30:01 PM     1250996  48293604     97.48      2484  46194452    903764     1.22
02:30:01 PM     1250996  48293604     97.48      2500  46194436    901448     1.21
02:31:01 PM     1252928  48291672     97.47      2516  46194476    901280     1.21
02:32:01 PM     1254628  48289972     97.47      2532  46194492    901280     1.21
02:33:01 PM     1252428  48292172     97.47      2556  46194504    901444     1.21
02:34:01 PM     1252136  48292464     97.47      2572  46194516    901100     1.21
02:35:01 PM     1252512  48292088     97.47      2588  46194528    901280     1.21
02:36:01 PM     1252720  48291880     97.47      2604  46194536    901444     1.21
02:37:01 PM     1252544  48292056     97.47      2636  46194544    901280     1.21
02:38:01 PM     1252388  48292212     97.47      2660  46194568    901280     1.21

              runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:10:01 PM         1       629      2.71      4.31      5.21
02:11:01 PM         0       623      2.01      3.81      4.98
02:12:01 PM         1       623      6.04      4.61      5.18
02:13:01 PM         0       623      7.79      5.36      5.40
02:14:02 PM         0       623      4.45      4.95      5.26
02:15:01 PM         9       632     10.55      6.34      5.70
02:16:01 PM         9       632     12.02      7.43      6.11
02:17:01 PM         9       632     16.92      9.69      6.97
02:18:01 PM         0       632     17.87     11.33      7.70
02:19:01 PM         2       630     11.04     10.72      7.74
02:20:01 PM         1       632      6.61      9.55      7.53
02:21:01 PM         0       630      4.95      8.49      7.29
02:22:02 PM         1       632      3.03      7.37      6.98
02:23:01 PM         3       632      3.47      6.71      6.77
02:24:01 PM         6       632      6.08      6.89      6.83
02:25:01 PM         0       634     15.05      9.26      7.64
02:26:01 PM         1       632     14.86     10.14      8.04
02:27:01 PM         2       630     16.28     11.36      8.58
02:28:01 PM         1       630      9.77     10.61      8.51
02:29:01 PM         1       630      4.22      8.86      8.04
02:30:01 PM         2       618      2.18      7.42      7.60

There probably are more interesting stats in /var/log/sa/sar26 on gqas012.
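The tables above can be regenerated from the binary sar data file mentioned. A sketch using standard sysstat options (the file path is from the comment; the time window is illustrative):

```shell
# Memory utilization (the kbmemfree/kbcached table above)
sar -r -f /var/log/sa/sar26

# Run queue length and load averages (the second table)
sar -q -f /var/log/sa/sar26

# Illustrative: restrict output to a window around the disconnect (local time)
sar -r -f /var/log/sa/sar26 -s 14:00:00 -e 14:40:00
```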
Here is the associated KCS article, which will be linked from the release notes: https://access.redhat.com/solutions/1397293
Already fixed as part of the rebase against bug 1216310.
@Pranith: This is not fixed as part of 1216310; rather, it is fixed as part of 1227167 and hence still needs a backport to 3.1, as it was merged into the 3.7 branch only today.
*** This bug has been marked as a duplicate of bug 1203901 ***