Bug 1203850 - Perf: IOZone deadlocks during random I/O test.
Summary: Perf: IOZone deadlocks during random I/O test.
Keywords:
Status: CLOSED DUPLICATE of bug 1203901
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: krishnan parthasarathi
QA Contact: Ben Turner
URL:
Whiteboard:
Duplicates: 1204580
Depends On:
Blocks: 1191838 1204580
 
Reported: 2015-03-19 18:44 UTC by Ben Turner
Modified: 2016-09-17 14:42 UTC (History)
13 users

Fixed In Version: glusterfs-3.7.1-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-07 17:07:21 UTC
Embargoed:


Attachments (Terms of Use)
Email thread on this issue. (83.35 KB, application/pdf)
2015-03-24 06:52 UTC, Nagaprasad Sathyanarayana

Description Ben Turner 2015-03-19 18:44:26 UTC
Description of problem:

During a random I/O test IOZone deadlocked:

Mar 19 14:39:42 gqac022 kernel: INFO: task iozone:13184 blocked for more than 120 seconds.
Mar 19 14:39:42 gqac022 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1
Mar 19 14:39:42 gqac022 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 19 14:39:42 gqac022 kernel: iozone        D 0000000000000000     0 13184  13183 0x00000084
Mar 19 14:39:42 gqac022 kernel: ffff880be811fda8 0000000000000086 0000000000000000 0000000000000000
Mar 19 14:39:42 gqac022 kernel: ffff880be6fa0a80 ffff880be839f980 ffff880be811fdf8 ffffffffa02fc78b
Mar 19 14:39:42 gqac022 kernel: ffff880be80db058 ffff880be811ffd8 000000000000fbc8 ffff880be80db058
Mar 19 14:39:42 gqac022 kernel: Call Trace:
Mar 19 14:39:42 gqac022 kernel: [<ffffffffa02fc78b>] ? fuse_dentry_revalidate+0x2fb/0x340 [fuse]
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8152907e>] __mutex_lock_slowpath+0x13e/0x180
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8122752f>] ? security_inode_permission+0x1f/0x30
Mar 19 14:39:42 gqac022 kernel: [<ffffffff81528f1b>] mutex_lock+0x2b/0x50
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8119796c>] vfs_unlink+0x5c/0xf0
Mar 19 14:39:42 gqac022 kernel: [<ffffffff81196c1a>] ? lookup_hash+0x3a/0x50
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8119a723>] do_unlinkat+0x163/0x260
Mar 19 14:39:42 gqac022 kernel: [<ffffffff810e2067>] ? audit_syscall_entry+0x1d7/0x200
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8119a836>] sys_unlink+0x16/0x20
Mar 19 14:39:42 gqac022 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):

glusterfs-3.6.0.53

How reproducible:

This is the first time I have hit it in about 20 runs on this build.

Steps to Reproduce:
1. Run the IOZone random I/O test.

Actual results:

IOZone hangs due to what I assume is a lost lock.

Expected results:

Normal operation.

Additional info:

Comment 1 Ben Turner 2015-03-19 18:51:12 UTC
Client logs from event:

[2015-03-19 14:10:35.136562] I [afr-self-heal-common.c:1398:afr_sh_missing_entry_call_impunge_recreate] 0-testvol-replicate-0: no missing files - /network_shared/stonewall.tmp. proceeding to metadata check
[2015-03-19 14:10:35.148003] I [afr-self-heal-common.c:2874:afr_log_self_heal_completion_status] 0-testvol-replicate-0:  gfid or missing entry self heal  is started,   on /network_shared/stonewall.tmp
[2015-03-19 17:42:39.689155] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server 192.168.61.101:49152 has not responded in the last 42 seconds, disconnecting.
[2015-03-19 17:42:39.757626] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.085874 (xid=0x20c7efe)
[2015-03-19 17:42:39.757643] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 17:42:39.757713] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.086589 (xid=0x20c7f0a)
[2015-03-19 17:42:39.757722] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 17:42:39.757766] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.088977 (xid=0x20c7f31)
[2015-03-19 17:42:39.757775] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 17:42:39.757827] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fb4b34ddd3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fb4b34dd771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fb4b34dd6be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2015-03-19 17:41:52.089762 (xid=0x20c7f32)
[2015-03-19 17:42:39.757837] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
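The "forced unwinding frame" lines above mean that when the ping timer expired, the client tore the connection down and locally failed every RPC still in flight (the pending FINODELK calls), which produces the burst of "Transport endpoint is not connected" errors. A minimal illustrative model of this behavior (a simulation, not actual GlusterFS code; all names are hypothetical):

```python
# Illustrative model of saved-frames unwinding on disconnect: every RPC
# still outstanding on the dead connection is answered locally with an
# ENOTCONN-style error instead of a real server reply.

ENOTCONN = "Transport endpoint is not connected"

class Connection:
    def __init__(self):
        self.saved_frames = {}      # xid -> (op name, callback)

    def submit(self, xid, op, callback):
        # Frame is parked until the server replies (or the link dies).
        self.saved_frames[xid] = (op, callback)

    def cleanup(self):
        # Forced unwinding: fail every outstanding frame, oldest xid first.
        for xid in sorted(self.saved_frames):
            op, cb = self.saved_frames[xid]
            cb(op, ENOTCONN)
        self.saved_frames.clear()

errors = []
conn = Connection()
conn.submit(0x20c7efe, "FINODELK", lambda op, err: errors.append((op, err)))
conn.submit(0x20c7f0a, "FINODELK", lambda op, err: errors.append((op, err)))

conn.cleanup()                      # ping timeout -> disconnect
print(errors)                       # both FINODELKs fail with ENOTCONN
```

The consequence is that the client-side lock state is now out of sync with the server, which is what the server log in the next comment shows.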

Comment 2 Ben Turner 2015-03-19 19:28:06 UTC
And the server:

[2015-03-19 17:42:39.779725] I [server-handshake.c:578:server_setvolume] 0-testvol-server: accepted client from gqac022.sbu.lab.eng.bos.redhat.com-1853-2015/03/18-21:00:28:131978-testvol-client-1-0-1 (version: 3.6.0.53)
[2015-03-19 17:42:39.780749] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/80/b7/80b7bf90-9d40-4b98-843a-8654703022e8: No such file or directory
[2015-03-19 17:42:39.780814] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/ac/51/ac51c87e-9c97-4c78-a542-bc6313c33328: No such file or directory
[2015-03-19 17:42:39.780937] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/25/99/259929ed-965b-4615-86f2-e94c68938069: No such file or directory
[2015-03-19 17:42:39.781033] E [posix.c:2508:posix_open] 0-testvol-posix: open on /rhs/brick1/.glusterfs/52/7a/527ab0c5-53fc-4d54-8c03-a5aa759217f8: No such file or directory
[2015-03-19 17:42:39.782572] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks:  Matching lock not found for unlock 8749056-8765439, by ecd507a4b47f0000 on 0x7f58dc0763b0
[2015-03-19 17:42:39.782609] E [server-rpc-fops.c:279:server_finodelk_cbk] 0-testvol-server: 34373742: FINODELK -2 (17d8210e-e60b-4f0d-ac4e-1b5fb4d3334f) ==> (Invalid argument)
[2015-03-19 17:42:39.782622] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks:  Matching lock not found for unlock 515686400-515702783, by ecd507a4b47f0000 on 0x7f58dc0763b0
[2015-03-19 17:42:39.782662] E [server-rpc-fops.c:279:server_finodelk_cbk] 0-testvol-server: 34373745: FINODELK -2 (17d8210e-e60b-4f0d-ac4e-1b5fb4d3334f) ==> (Invalid argument)
[2015-03-19 17:42:39.782785] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks:  Matching lock not found for unlock 35913728-35930111, by ecd507a4b47f0000 on 0x7f58dc0763b0
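The "Matching lock not found for unlock" errors above fit a lost-lock pattern: on disconnect the server releases all byte-range locks held by that connection, but the client still believes it holds them, so its unlock after reconnecting fails with EINVAL and the transaction the unlink is queued behind never completes. A minimal sketch of that race (a simulation under stated assumptions, not GlusterFS code; all names are hypothetical):

```python
# Minimal model of the suspected lost-lock race. The server drops a
# client's locks during disconnect cleanup; the client's later unlock
# for one of those ranges then finds no matching lock and fails, which
# is the server-side "Matching lock not found for unlock" / EINVAL.

class Server:
    def __init__(self):
        self.locks = set()          # (owner, start, end) byte-range locks

    def lock(self, owner, start, end):
        self.locks.add((owner, start, end))

    def disconnect_cleanup(self, owner):
        # On disconnect the server releases every lock held by the client.
        self.locks = {l for l in self.locks if l[0] != owner}

    def unlock(self, owner, start, end):
        if (owner, start, end) not in self.locks:
            return "EINVAL"         # "Matching lock not found for unlock"
        self.locks.remove((owner, start, end))
        return "OK"

server = Server()
server.lock("client-1", 8749056, 8765439)

# Ping timer expires -> connection torn down -> server drops the lock.
server.disconnect_cleanup("client-1")

# Client reconnects and sends the unlock it still thinks is pending.
result = server.unlock("client-1", 8749056, 8765439)
print(result)                       # EINVAL
```

In this model the unlock failure leaves the higher-level transaction incomplete, matching the hung unlink in the kernel trace from the description.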

Comment 6 Nagaprasad Sathyanarayana 2015-03-24 06:52:04 UTC
Created attachment 1005663 [details]
Email thread on this issue.

This attachment captures the e-mail thread on this issue, between 03/20/2015 (started by Sayan) and 03/24/2015 (last updated by Vivek).

Comment 7 Vivek Agarwal 2015-03-24 12:52:08 UTC
*** Bug 1204580 has been marked as a duplicate of this bug. ***

Comment 8 Pavithra 2015-03-26 06:04:45 UTC
Hi Shyam,

I read in an email chain that this bug is a known issue in 3.0.4. Could you please fill out the doc text?

Comment 9 Niels de Vos 2015-03-27 16:35:49 UTC
Hang on the NFS-client gqac005 at "Mar 26 14:27:16 EDT"

INFO: task iozone:23018 blocked for more than 120 seconds.
     Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000011     0 23018      1 0x00000080
ffff880385fe7cc8 0000000000000086 0000000000000000 ffffffffa0247940
ffff880385fe7c38 ffffffff810149b9 ffff880385fe7c78 ffffffff810a70a1
ffff8806234405f8 ffff880385fe7fd8 000000000000fbc8 ffff8806234405f8
Call Trace:
[<ffffffffa0247940>] ? rpc_execute+0x50/0xa0 [sunrpc]
[<ffffffff810149b9>] ? read_tsc+0x9/0x20
[<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff8111f930>] ? sync_page+0x0/0x50
[<ffffffff815280a3>] io_schedule+0x73/0xc0
[<ffffffff8111f96d>] sync_page+0x3d/0x50
[<ffffffff81528b6f>] __wait_on_bit+0x5f/0x90
[<ffffffff8111fba3>] wait_on_page_bit+0x73/0x80
[<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
[<ffffffff81135bf5>] ? pagevec_lookup_tag+0x25/0x40
[<ffffffff8111ffcb>] wait_on_page_writeback_range+0xfb/0x190
[<ffffffff81120198>] filemap_write_and_wait_range+0x78/0x90
[<ffffffff811baa3e>] vfs_fsync_range+0x7e/0x100
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff811bab2d>] vfs_fsync+0x1d/0x20
[<ffffffff811bab6e>] do_fsync+0x3e/0x60
[<ffffffff811babc0>] sys_fsync+0x10/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


Related entries in the nfs.log on gqas006 (time in UTC):

[2015-03-26 18:17:12.188198] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-testvol-client-2: server 192.168.79.102:49152 has not responded in the last 42 seconds, disconnecting.
[2015-03-26 18:17:12.209865] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7f8a589e3d3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7f8a589e3771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f8a589e36be]))) 0-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2015-03-26 18:14:54.129253 (xid=0xbb5)
[2015-03-26 18:17:12.209896] W [client-rpc-fops.c:970:client3_3_fsync_cbk] 0-testvol-client-2: remote operation failed: Transport endpoint is not connected
[2015-03-26 18:17:12.209927] W [afr-transaction.c:1710:afr_changelog_fsync_cbk] 0-testvol-replicate-1: fsync(6ec3e93f-d6be-430a-8ba2-fd18396c308b) failed on subvolume testvol-client-2. Transaction was WRITE
[2015-03-26 18:17:12.238123] I [socket.c:3176:socket_submit_request] 0-testvol-client-2: not connected (priv->connected = 0)
[2015-03-26 18:17:12.238148] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-testvol-client-2: failed to submit rpc-request (XID: 0x2b8c Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (testvol-client-2)
[2015-03-26 18:17:12.238169] W [client-rpc-fops.c:1804:client3_3_fxattrop_cbk] 0-testvol-client-2: remote operation failed: Transport endpoint is not connected


First errors in the brick-log on gqas012 (timestamps in UTC):

[2015-03-26 18:17:18.517324] W [socket.c:546:__socket_rwv] 0-tcp.testvol-server: writev on 192.168.79.101:996 failed (Connection reset by peer)
[2015-03-26 18:17:18.518800] I [socket.c:2290:socket_event_handler] 0-transport: disconnecting now
[2015-03-26 18:17:18.519205] I [server.c:519:server_rpc_notify] 0-testvol-server: disconnecting connectionfrom gqas006.sbu.lab.eng.bos.redhat.com-8646-2015/03/26-17:01:03:430363-testvol-client-2-0-0
[2015-03-26 18:17:18.528667] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-testvol-server: releasing lock on 6ec3e93f-d6be-430a-8ba2-fd18396c308b held by {client=0x7fa64c008a60, pid=0 lk-owner=481402448a7f0000}
[2015-03-26 18:17:18.528707] E [inodelk.c:309:__inode_unlock_lock] 0-testvol-locks:  Matching lock not found for unlock 0-9223372036854775807, by 481402448a7f0000 on 0x7fa64c045ec0
[2015-03-26 18:17:18.528825] E [server-rpc-fops.c:279:server_finodelk_cbk] 0-testvol-server: 11198: FINODELK -2 (6ec3e93f-d6be-430a-8ba2-fd18396c308b) ==> (Invalid argument)


Some sar stats from the GlusterFS server (gqas012) hosting the brick mentioned in the disconnects, from which this client mounts the export (/var/log/sa/sar26, time in EDT):

........... kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit %commit
02:10:01 PM  40739496   8805104     17.77      2108   7898512    833792    1.12
02:11:01 PM  34307572  15237028     30.75      2260  14095660    829540    1.12
02:12:01 PM  33989668  15554932     31.40      2284  14399876    829720    1.12
02:13:01 PM  33988424  15556176     31.40      2300  14399888    829540    1.12
02:14:02 PM  48767980    776620      1.57      1788     23420    830796    1.12
02:15:01 PM  36045192  13499408     27.25      2076  12313600    880964    1.19
02:16:01 PM  22940780  26603820     53.70      2092  25013844    919696    1.24
02:17:01 PM  21328164  28216436     56.95      2108  26559012    923972    1.24
02:18:01 PM  14682580  34862020     70.36      2160  33042012    905540    1.22
02:19:01 PM    236748  49307852     99.52      1028  47003300    872296    1.17
02:20:01 PM    266936  49277664     99.46      1044  47041912    860080    1.16
02:21:01 PM    217616  49326984     99.56       820  47174764    840004    1.13
02:22:02 PM  47125144   2419456      4.88      1784   1646388    839656    1.13
02:23:01 PM  29545200  19999400     40.37      2056  18633168    876868    1.18
02:24:01 PM  20712976  28831624     58.19      2084  27172612    917828    1.24
02:25:01 PM  20361348  29183252     58.90      2224  27549120    944316    1.27
02:26:01 PM  10397332  39147268     79.01      2404  37262500    936260    1.26
02:27:01 PM   3987912  45556688     91.95      2428  43477164    936260    1.26
02:28:01 PM   1246492  48298108     97.48      2452  46194420    901444    1.21
02:29:01 PM   1247860  48296740     97.48      2468  46194440    901444    1.21
02:30:01 PM   1250996  48293604     97.48      2484  46194452    903764    1.22
02:30:01 PM   1250996  48293604     97.48      2500  46194436    901448    1.21
02:31:01 PM   1252928  48291672     97.47      2516  46194476    901280    1.21
02:32:01 PM   1254628  48289972     97.47      2532  46194492    901280    1.21
02:33:01 PM   1252428  48292172     97.47      2556  46194504    901444    1.21
02:34:01 PM   1252136  48292464     97.47      2572  46194516    901100    1.21
02:35:01 PM   1252512  48292088     97.47      2588  46194528    901280    1.21
02:36:01 PM   1252720  48291880     97.47      2604  46194536    901444    1.21
02:37:01 PM   1252544  48292056     97.47      2636  46194544    901280    1.21
02:38:01 PM   1252388  48292212     97.47      2660  46194568    901280    1.21

...........   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:10:01 PM         1       629      2.71      4.31      5.21
02:11:01 PM         0       623      2.01      3.81      4.98
02:12:01 PM         1       623      6.04      4.61      5.18
02:13:01 PM         0       623      7.79      5.36      5.40
02:14:02 PM         0       623      4.45      4.95      5.26
02:15:01 PM         9       632     10.55      6.34      5.70
02:16:01 PM         9       632     12.02      7.43      6.11
02:17:01 PM         9       632     16.92      9.69      6.97
02:18:01 PM         0       632     17.87     11.33      7.70
02:19:01 PM         2       630     11.04     10.72      7.74
02:20:01 PM         1       632      6.61      9.55      7.53
02:21:01 PM         0       630      4.95      8.49      7.29
02:22:02 PM         1       632      3.03      7.37      6.98
02:23:01 PM         3       632      3.47      6.71      6.77
02:24:01 PM         6       632      6.08      6.89      6.83
02:25:01 PM         0       634     15.05      9.26      7.64
02:26:01 PM         1       632     14.86     10.14      8.04
02:27:01 PM         2       630     16.28     11.36      8.58
02:28:01 PM         1       630      9.77     10.61      8.51
02:29:01 PM         1       630      4.22      8.86      8.04
02:30:01 PM         2       618      2.18      7.42      7.60


There are probably more interesting stats in /var/log/sa/sar26 on gqas012.

Comment 11 Pavithra 2015-04-01 07:06:26 UTC
Here is the associated KCS article, which will be linked from the release notes:
https://access.redhat.com/solutions/1397293

Comment 16 Pranith Kumar K 2015-06-04 11:35:28 UTC
Already fixed as part of rebase against: 1216310

Comment 17 Shyamsundar 2015-06-04 17:19:18 UTC
@Pranith: This is not fixed as part of 1216310; rather, it is fixed as part of 1227167 and hence still needs a backport to 3.1, as it was merged into the 3.7 branch today.

Comment 20 Ben Turner 2015-07-07 17:07:21 UTC

*** This bug has been marked as a duplicate of bug 1203901 ***
