Bug 1014867
| Summary: | xfssyncd and flush device threads hang in xlog_grant_head_wait | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Brad Hubbard <bhubbard> |
| Component: | kernel | Assignee: | Dave Chinner <dchinner> |
| Status: | CLOSED ERRATA | QA Contact: | Boris Ranto <branto> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.4 | CC: | aquini, branto, clober, dchinner, dhoward, djeffery, dwoodruf, dwysocha, eguan, esandeen, hamiller, jcpunk, lmcilroy, mgoodwin, msvoboda, ndevos, pdemmers, rwheeler, sauchter, tlavigne, vkaigoro |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | 6.5 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-432.el6 | Doc Type: | Bug Fix |
| Doc Text: | When creating an XFS file system, an attempt to cover an active XFS log could, under certain circumstances, result in a deadlock between the xfssyncd and xfsbufd daemons. Consequently, several kernel threads became unresponsive and the XFS file system could not be created successfully, leading to a kernel oops. A patch has been applied to prevent this situation by forcing the active XFS log to disk. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1032144 (view as bug list) | Environment: | |
| Last Closed: | 2013-11-21 20:32:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1025439, 1032147, 1032150 | | |
| Bug Blocks: | 1021813, 1032144, 1032688 | | |
Description (Brad Hubbard, 2013-10-03 00:31:36 UTC)
There is no IO on any device reported by 'dev -d' and the kernel version is 2.6.32-358.6.1.el6.x86_64. Looking at the vmcore:

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.6.1.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/418041799/crash/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Tue Oct 1 11:36:52 2013
      UPTIME: 1 days, 05:47:50
LOAD AVERAGE: 64.17, 63.52, 59.00
       TASKS: 2768
    NODENAME: <nodename>
     RELEASE: 2.6.32-358.6.1.el6.x86_64
     VERSION: #1 SMP Fri Mar 29 16:51:51 EDT 2013
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 128 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)

crash> ps | grep UN | wc -l
64

The load is made up entirely of uninterruptible processes.

crash> ps | grep UN | head
   1308   1282  21  ffff88200b50a040  UN   0.0   15400   1024  find
   1389   1365  25  ffff881d3fea4aa0  UN   0.0   15400   1024  find
   1482   1481  26  ffff880d30c7e080  UN   0.0   15408   1036  find
   1557   1556  28  ffff881a780ca040  UN   0.0   15408   1032  find
   2775   2774  30  ffff880d246f1540  UN   0.0  108456   1876  bash
   3207      2  17  ffff88106684aae0  UN   0.0       0      0  [xfssyncd/dm-26]
   3208      2  24  ffff88106390b500  UN   0.0       0      0  [xfsaild/dm-26]
   3445   3427  27  ffff882068454ae0  UN   0.0   15400   1028  find
   3670   3656  17  ffff881d3fea6ae0  UN   0.0   15400   1024  find
   5860   5834  17  ffff880ffa126040  UN   0.0   15400   1024  find

crash> set 3207
    PID: 3207
COMMAND: "xfssyncd/dm-26"
   TASK: ffff88106684aae0  [THREAD_INFO: ffff881064e8e000]
    CPU: 17
  STATE: TASK_UNINTERRUPTIBLE

crash> bt
PID: 3207   TASK: ffff88106684aae0  CPU: 17  COMMAND: "xfssyncd/dm-26"
 #0 [ffff881064e8fc10] schedule at ffffffff8150db42
 #1 [ffff881064e8fcd8] xlog_grant_head_wait at ffffffffa02f7b99 [xfs]
 #2 [ffff881064e8fd38] xlog_grant_head_check at ffffffffa02f7d5e [xfs]
 #3 [ffff881064e8fd78] xfs_log_reserve at ffffffffa02f7e50 [xfs]
 #4 [ffff881064e8fdb8] xfs_trans_reserve at ffffffffa03059d7 [xfs]
 #5 [ffff881064e8fe18] xfs_fs_log_dummy at ffffffffa02e9273 [xfs]
 #6 [ffff881064e8fe48] xfs_sync_worker at ffffffffa031b8d0 [xfs]
 #7 [ffff881064e8fe68] xfssyncd at ffffffffa031b7fe [xfs]
 #8 [ffff881064e8fee8] kthread at ffffffff81096936
 #9 [ffff881064e8ff48] kernel_thread at ffffffff8100c0ca

It's blocked on the xfs log waiting for space to become available. For how long?

crash> task -R sched_info
PID: 3207   TASK: ffff88106684aae0  CPU: 17  COMMAND: "xfssyncd/dm-26"
  sched_info = {
    pcount = 6362,
    run_delay = 22708276,
    last_arrival = 98453000519172,
    last_queued = 0,
    bkl_count = 0
  },

crash> runq | grep "CPU 17"
CPU 17 RUNQUEUE: ffff880028336700
crash> rq.clock ffff880028336700
  clock = 107200042291679
crash> pd (107200042291679-98453000519172)
$5 = 8747041772507
crash> pd (107200042291679-98453000519172)/1000000000
$6 = 8747
crash> pd 8747/60
$7 = 145

145 minutes. That's definitely stuck. The first thing we need is how many bytes it is asking for.

crash> whatis xlog_grant_head_wait
int xlog_grant_head_wait(struct log *, struct xlog_grant_head *, struct xlog_ticket *, int);

It's the last argument to the function.

crash> dis -rl ffffffffa02f7d5e
...
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 290
0xffffffffa02f7d4d <xlog_grant_head_check+189>: mov    (%r14),%ecx
0xffffffffa02f7d50 <xlog_grant_head_check+192>: mov    %r13,%rdx
0xffffffffa02f7d53 <xlog_grant_head_check+195>: mov    %rbx,%rsi
0xffffffffa02f7d56 <xlog_grant_head_check+198>: mov    %r12,%rdi
0xffffffffa02f7d59 <xlog_grant_head_check+201>: callq  0xffffffffa02f7af0 <xlog_grant_head_wait>
0xffffffffa02f7d5e <xlog_grant_head_check+206>: mov    %eax,%edx

The 1st argument is passed in %rdi, the 2nd in %rsi, the 3rd in %rdx and the 4th in %rcx.
crash> dis -rl ffffffffa02f7b99
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 214
0xffffffffa02f7af0 <xlog_grant_head_wait>:      push   %rbp
0xffffffffa02f7af1 <xlog_grant_head_wait+1>:    mov    %rsp,%rbp
0xffffffffa02f7af4 <xlog_grant_head_wait+4>:    push   %r15
0xffffffffa02f7af6 <xlog_grant_head_wait+6>:    push   %r14
0xffffffffa02f7af8 <xlog_grant_head_wait+8>:    push   %r13
0xffffffffa02f7afa <xlog_grant_head_wait+10>:   push   %r12
0xffffffffa02f7afc <xlog_grant_head_wait+12>:   push   %rbx
...

crash> bt -f
PID: 3207   TASK: ffff88106684aae0  CPU: 17  COMMAND: "xfssyncd/dm-26"
...
 #1 [ffff881064e8fcd8] xlog_grant_head_wait at ffffffffa02f7b99 [xfs]
    ffff881064e8fce0: ffff8810586a1a00 0000000000000280
    ffff881064e8fcf0: ffff8820681aef58 ffff88106684aae0
    ffff881064e8fd00: ffff881064e8fd20 ffff8820681aef40
    ffff881064e8fd10: ffff8820681aee00 ffff880f6ce57b50
    ffff881064e8fd20: ffff881064e8fd8c 0000000000000280
    ffff881064e8fd30: ffff881064e8fd70 ffffffffa02f7d5e
 #2 [ffff881064e8fd38] xlog_grant_head_check at ffffffffa02f7d5e [xfs]
...

So the log is at:

crash> struct log ffff8820681aee00
struct log {
  l_mp = 0xffff8820694fd800,
  l_ailp = 0xffff882068b74cc0,
  l_cilp = 0xffff882068b74dc0,
  l_xbuf = 0xffff882065f41680,
  l_targ = 0xffff8820680df9c0,
...

And need_bytes is at:

crash> rd ffff881064e8fd8c
ffff881064e8fd8c:  c621b5e000000ab4                    ......!.
crash> pd 0xab4
$4 = 2740

So it needs 2740 bytes. The function arguments to xlog_space_left are the log and &head->grant:

crash> xlog_grant_head.grant ffff8820681aef40
  grant = {
    counter = 0xfd0095c2f0
  }

head_cycle = 0xfd
head_bytes = 0x95c2f0

crash> struct log.l_tail_lsn ffff8820681aee00
  l_tail_lsn = {
    counter = 0xfc00004e4f
  }

tail_cycle = 0xfc
tail_bytes = 0x9c9e00

free_bytes = tail_bytes - head_bytes = 449296 bytes. So there is plenty of free space.
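As a cross-check, the free-space arithmetic can be reproduced in a few lines of C. The grant head packs (cycle, byte offset) into one 64-bit counter, while the tail LSN packs (cycle, 512-byte basic block). Here the head is exactly one cycle ahead of the tail, so the free region runs from the head forward to the tail. This is a minimal standalone sketch using the values from this vmcore, not the actual xlog_space_left() code:

    #include <stdint.h>
    #include <stdio.h>

    #define BBSHIFT 9   /* XFS basic blocks are 512 bytes */

    int main(void)
    {
        uint64_t grant    = 0xfd0095c2f0;   /* xlog_grant_head.grant read above */
        uint64_t tail_lsn = 0xfc00004e4f;   /* log->l_tail_lsn read above */

        uint32_t head_cycle = grant >> 32;
        uint32_t head_bytes = grant & 0xffffffff;       /* byte offset */
        uint32_t tail_cycle = tail_lsn >> 32;
        uint32_t tail_bytes = (tail_lsn & 0xffffffff) << BBSHIFT; /* blocks -> bytes */

        /* Head one cycle ahead of the tail: the free region wraps,
         * running from the head up to the tail. */
        if (head_cycle == tail_cycle + 1)
            printf("free = %u bytes\n", tail_bytes - head_bytes); /* 449296 */
        return 0;
    }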
Did the xfssyncd thread not get woken up?

crash> set 1169
    PID: 1169
COMMAND: "xfsaild/dm-30"
   TASK: ffff8801f78f4aa0  [THREAD_INFO: ffff8801fb526000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE|TASK_WAKEKILL

crash> bt
PID: 1169   TASK: ffff8801f78f4aa0  CPU: 1  COMMAND: "xfsaild/dm-30"
 #0 [ffff8801fb527d10] schedule at ffffffff8150db42
 #1 [ffff8801fb527dd8] schedule_timeout at ffffffff8150e982
 #2 [ffff8801fb527e88] xfsaild at ffffffffa0239933 [xfs]
 #3 [ffff8801fb527ee8] kthread at ffffffff81096936
 #4 [ffff8801fb527f48] kernel_thread at ffffffff8100c0ca

crash> task -R sched_info
PID: 1169   TASK: ffff8801f78f4aa0  CPU: 1  COMMAND: "xfsaild/dm-30"
  sched_info = {
    pcount = 77183915,
    run_delay = 973329148430,
    last_arrival = 3824697247325310,
    last_queued = 0,
    bkl_count = 0
  },

crash> runq | grep "CPU 1"
CPU 1 RUNQUEUE: ffff880028316700
crash> rq.clock ffff880028316700
  clock = 3824697253567538
crash> pd (3824697253567538-3824697247325310)
$15 = 6242228
crash> pd (3824697253567538-3824697247325310)/1000000000
$16 = 0

Okay, that's not stuck.

crash> dis -rl ffffffffa0239933
...
0xffffffffa0239923 <xfsaild+0x63>:      mov    -0x48(%rbp),%edi
0xffffffffa0239926 <xfsaild+0x66>:      callq  0xffffffff81074a20 <msecs_to_jiffies>
0xffffffffa023992b <xfsaild+0x6b>:      mov    %rax,%rdi
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/linux-2.6/xfs_super.c: 803
0xffffffffa023992e <xfsaild+0x6e>:      callq  0xffffffff8150e7f0 <schedule_timeout>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/arch/x86/include/asm/bitops.h: 359
0xffffffffa0239933 <xfsaild+0x73>:      mov    0x8(%r12),%rax

'tout' is in %rbp-0x48.

crash> dis -rl ffffffff8150e982
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/kernel/timer.c: 1357
0xffffffff8150e7f0 <schedule_timeout>:  push   %rbp
...

crash> bt -f
PID: 1169   TASK: ffff8801f78f4aa0  CPU: 1  COMMAND: "xfsaild/dm-30"
...
 #1 [ffff8801fb527dd8] schedule_timeout at ffffffff8150e982
    ffff8801fb527de0: 0000000000000084 0000000000000286
    ffff8801fb527df0: ffff8801fc8e8ac0 ffff8801fbbbbdf0
    ffff8801fb527e00: 00000001e40120aa ffffffff810810e0
    ffff8801fb527e10: ffff8801f78f4aa0 ffff8801fc8e8000
    ffff8801fb527e20: 0000000000000000 0000000000000000
    ffff8801fb527e30: 0000000000000000 ffff8801ffffffff
    ffff8801fb527e40: 00001846000022cc 0000000042b70c2d
    ffff8801fb527e50: 0000000000000282 ffff8801f89ae8c0
    ffff8801fb527e60: ffff8801f78f4aa0 ffff8801f78f4aa0
    ffff8801fb527e70: ffff8801f78f4aa0 ffff8801fb527ea8
    ffff8801fb527e80: ffff8801fb527ee0 ffffffffa0239933
 #2 [ffff8801fb527e88] xfsaild at ffffffffa0239933 [xfs]
...

crash> px 0xffff8801fb527ee0-0x48
$11 = 0xffff8801fb527e98
crash> rd 0xffff8801fb527e98
ffff8801fb527e98:  000000000000000a                    ........

So tout is 10 (0xa).

crash> dis xfsaild
...
0xffffffffa02398df <xfsaild+31>:        lea    -0x38(%rbp),%r15
...
0xffffffffa023994f <xfsaild+143>:       mov    %r15,%rsi
0xffffffffa0239952 <xfsaild+146>:       mov    %rbx,%rdi
0xffffffffa0239955 <xfsaild+149>:       callq  0xffffffffa0225c70 <xfsaild_push>
0xffffffffa023995a <xfsaild+154>:       mov    %rax,-0x48(%rbp)

'last_pushed_lsn' is in %rbp-0x38.

crash> px 0xffff8801fb527ee0-0x38
$17 = 0xffff8801fb527ea8
crash> rd 0xffff8801fb527ea8
ffff8801fb527ea8:  00001846000022cc                    ."..F...

last_cycle = 0x1846
last_bytes = 0x459800

This is the same as log->l_tail_lsn, so xfsaild has pushed the tail up to a point and doesn't seem to need to push it any further.
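For context, the disassembly above is consistent with the shape of the xfsaild main loop: push the AIL, get a timeout back, sleep that many milliseconds, repeat. A tout of 10 with last_pushed_lsn equal to the tail LSN just means the thread is cycling normally. A compilable sketch of that loop shape, with a hypothetical stub in place of the real xfsaild_push():

    #include <stdio.h>

    /* Hypothetical stub for xfsaild_push(): reports the LSN it pushed
     * up to and how soon (in ms) it wants to run again. */
    static long xfsaild_push_stub(unsigned long long *last_pushed_lsn)
    {
        *last_pushed_lsn = 0x00001846000022ccULL; /* value recovered above */
        return 10;                                /* tout found on the stack */
    }

    int main(void)
    {
        unsigned long long last_pushed_lsn = 0;

        /* One turn of the loop visible in the disassembly: push, then
         * sleep tout ms (msecs_to_jiffies + schedule_timeout), repeat. */
        long tout = xfsaild_push_stub(&last_pushed_lsn);
        printf("sleep %ld ms; pushed up to lsn 0x%llx\n",
               tout, last_pushed_lsn);
        return 0;
    }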
crash> struct log.l_reserve_head ffff8801f7934c00
  l_reserve_head = {
    lock = {
      raw_lock = {
        slock = 3543192368
      }
    },
    waiters = {
      next = 0xffff88015e60e140,
      prev = 0xffff88017ca1ed78
    },
    grant = {
      counter = 26693225856316
    }
  }

crash> list -H ffff8801f7934d48 -o xlog_ticket.t_queue -s xlog_ticket.t_task,t_curr_res
ffff88015e60e140
  t_task = 0xffff880062592aa0
  t_curr_res = 224820
ffff88017ca1e9e0
  t_task = 0xffff8801f78f5500
  t_curr_res = 2740
ffff8801408291f8
  t_task = 0xffff8801eb794ae0
  t_curr_res = 112144
ffff880135b3eb50
  t_task = 0xffff880146f48080
  t_curr_res = 2740
ffff8801325fecc0
  t_task = 0xffff8801f8ccc080
  t_curr_res = 162588
ffff88012a782368
  t_task = 0xffff88011d5daaa0
  t_curr_res = 162588
ffff88012a782cc0
  t_task = 0xffff88011154eae0
  t_curr_res = 2740
ffff8801408294d8
  t_task = 0xffff8801b38b4040
  t_curr_res = 2740
ffff88012749fc08
  t_task = 0xffff880111574040
  t_curr_res = 2740
ffff8801a5f0ce30
  t_task = 0xffff880110da0ae0
  t_curr_res = 2740
ffff88017ca1e648
  t_task = 0xffff88010e82c040
  t_curr_res = 2740
ffff88017ca1ed78
  t_task = 0xffff880090928ae0
  t_curr_res = 224820

Pick the first task off the queue:

crash> set 0xffff880062592aa0
    PID: 18869
COMMAND: "java"
   TASK: ffff880062592aa0  [THREAD_INFO: ffff8801f9182000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE

crash> bt
PID: 18869  TASK: ffff880062592aa0  CPU: 0   COMMAND: "java"
 #0 [ffff8801f9183b98] schedule at ffffffff8150db42
 #1 [ffff8801f9183c60] xlog_grant_head_wait at ffffffffa0216b99 [xfs]
 #2 [ffff8801f9183cc0] xlog_grant_head_check at ffffffffa0216d5e [xfs]
 #3 [ffff8801f9183d00] xfs_log_reserve at ffffffffa0216e50 [xfs]
 #4 [ffff8801f9183d40] xfs_trans_reserve at ffffffffa02249d7 [xfs]
 #5 [ffff8801f9183da0] xfs_inactive at ffffffffa022a2d3 [xfs]
 #6 [ffff8801f9183df0] xfs_fs_clear_inode at ffffffffa0237930 [xfs]
 #7 [ffff8801f9183e10] clear_inode at ffffffff8119cfac
 #8 [ffff8801f9183e30] generic_delete_inode at ffffffff8119d766
 #9 [ffff8801f9183e60] generic_drop_inode at ffffffff8119d805
#10 [ffff8801f9183e80] iput at ffffffff8119c652
#11 [ffff8801f9183ea0] do_unlinkat at ffffffff81191eb4
#12 [ffff8801f9183f70] sys_unlink at ffffffff81191f56
#13 [ffff8801f9183f80] system_call_fastpath at ffffffff8100b072
    RIP: 00000034240dc7d7  RSP: 00007f2881a796b8  RFLAGS: 00210246
    RAX: 0000000000000057  RBX: ffffffff8100b072  RCX: 0000000003cf5848
    RDX: 00007f28839e8580  RSI: 0000000000000000  RDI: 00007f28839e8580
    RBP: 00007f2881a79600   R8: 00007f28839e8580   R9: 000000001bf6e2a0
    R10: 00007f288489db60  R11: 0000000000200202  R12: ffffffff81191f56
    R13: ffff8801f9183f78  R14: 00007f28848a13c0  R15: 0000000000000000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b

crash> bt -f
PID: 18869  TASK: ffff880062592aa0  CPU: 0   COMMAND: "java"
...
 #1 [ffff8801f9183c60] xlog_grant_head_wait at ffffffffa0216b99 [xfs]
    ffff8801f9183c68: ffff8801fb072b40 00000000000359b8
    ffff8801f9183c78: ffff8801f7934d58 ffff880062592aa0
    ffff8801f9183c88: ffff8801f9183ca8 ffff8801f7934d40
    ffff8801f9183c98: ffff8801f7934c00 ffff88015e60e140
    ffff8801f9183ca8: ffff8801f9183d14 00000000000359b8
    ffff8801f9183cb8: ffff8801f9183cf8 ffffffffa0216d5e
 #2 [ffff8801f9183cc0] xlog_grant_head_check at ffffffffa0216d5e [xfs]
...

crash> rd ffff8801f9183d14
ffff8801f9183d14:  1d4b5ea00006dc68                    h....^K.
crash> pd 0x6dc68
$25 = 449640

Ah, this task wants 449640 bytes and there is only 449220 free in the log. All the other tasks blocked in xlog_grant_head_wait() are waiting for this one.
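The FIFO nature of the waiter queue is what turns one oversized reservation into a pile-up: waiters are woken in order, and waking stops at the first ticket whose need does not fit in the available space. A minimal sketch of that wake-up logic, using the numbers from this dump (upstream the real loop is xlog_grant_head_wake()):

    #include <stdio.h>

    int main(void)
    {
        /* Head of the waiter queue: need_bytes of the java task
         * (449640), then the next few t_curr_res values listed above. */
        int need[] = { 449640, 2740, 112144, 2740 };
        int free_bytes = 449220;   /* space left in this log */

        for (int i = 0; i < 4; i++) {
            if (need[i] > free_bytes) {
                printf("waiter %d needs %d but only %d is free: "
                       "everything behind it stays blocked\n",
                       i, need[i], free_bytes);
                break;              /* wake-up stops at the first misfit */
            }
            free_bytes -= need[i];  /* would be woken and granted space */
        }
        return 0;
    }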
Lachlan,

Your analysis matches what I found late yesterday. A problem with my laptop meant it took all day to get back online to post the information.

What I'd also looked at was that I'd started walking the AIL to find out what pinned the tail of the log - it is an inode of no particular note:

p /x *(struct xfs_inode_log_item *)0xffff880d246436d8
$16 = {
  ili_item = {
    li_ail = {
      next = 0xffff881fe4a16518,
      prev = 0xffff882068b74cc8
    },
    li_lsn = 0xfc00004e4f,
    li_desc = 0x0,
    li_mountp = 0xffff8820694fd800,
    li_ailp = 0xffff882068b74cc0,
    li_type = 0x123b,
    li_flags = 0x1,
    li_bio_list = 0x0,
    li_cb = 0xffffffffa02f27b0,
    li_ops = 0xffffffffa0339d60,
    li_cil = {
      next = 0xffff880d24643728,
      prev = 0xffff880d24643728
    },
    li_lv = 0x0,
    li_seq = 0xbf5
  },
  ili_inode = 0xffff880ae69e4000,
  ili_flush_lsn = 0xfc00004e4f,
  ili_last_lsn = 0xc3d,
  ili_lock_flags = 0x0,
  ili_logged = 0x1,
  ili_last_fields = 0x5,
  ili_fields = 0x0,
  ili_extents_buf = 0x0,
  ili_aextents_buf = 0x0,
  ili_format = {
    ilf_type = 0x123b,
    ilf_size = 0x3,
    ilf_fields = 0x5,
    ilf_asize = 0x2c,
    ilf_dsize = 0x10,
    ilf_ino = 0x436029,
    ilf_u = {
      ilfu_rdev = 0x0,
      ilfu_uuid = {
        __u_bits = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
      }
    },
    ilf_blkno = 0x436020,
    ilf_len = 0x10,
    ilf_boffset = 0x1200
  }
}

The only thing really interesting about this inode is the ili_flush_lsn, which matches the log tail_lsn. That implies the inode has been flushed to its backing buffer, but the backing buffer has not yet been flushed. That ili_logged = 1 and ili_fields = 0 strongly implies that the inode has been flushed since the last time it was logged.

Anyway, the next thing to look at is the inode itself, to determine whether the flush lock is held. If it is held, then the inode is waiting for buffer IO completion to occur; completion will remove the inode from the AIL, freeing up log space. In that case, we need to find the buffer containing the inode. If the flush lock is not held, then this inode should not be in the AIL....

I started looking for the inode buffer in the cache, but at this point my laptop battery died and wouldn't take charge, and I've only just got it back, so I have no further information to add yet.

-Dave.

The inode flush lock:

p /x *(struct xfs_inode *)0xffff880ae69e4000
$2 = {
  ....
  i_flush = {
    done = 0x0,
    wait = {
      lock = {
        raw_lock = {
          slock = 0x1a001a
        }
      },
      task_list = {
        next = 0xffff880ae69e40d8,
        prev = 0xffff880ae69e40d8
      }
    }
  },
  ....

indicates that it is currently held (done = 0). So now I need to find the buffer that it's been written to....

-Dave.
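An aside on why a single inode can pin the whole log: the AIL is kept sorted by LSN, and the log tail is taken from the oldest item still on the list, which is exactly why this item's li_lsn equals l_tail_lsn. A minimal sketch of that invariant with hypothetical simplified types (upstream the lookup is xfs_ail_min() feeding xlog_assign_tail_lsn()):

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified AIL: log items sorted by commit LSN, oldest first. */
    struct ail_item { uint64_t li_lsn; };

    /* The log tail is pinned at the LSN of the oldest AIL item. */
    static uint64_t assign_tail_lsn(struct ail_item *ail, int n)
    {
        return n ? ail[0].li_lsn : 0;  /* 0: empty AIL, nothing pins the log */
    }

    int main(void)
    {
        struct ail_item ail[] = {
            { 0xfc00004e4f },   /* the stuck inode log item above */
            { 0xfd000030b7 },   /* e.g. the buffer log item printed further down */
        };
        printf("tail lsn = 0x%llx\n",
               (unsigned long long)assign_tail_lsn(ail, 2));
        return 0;
    }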
Ok, I just walked AG 1's buffer cache rbtree index to find the buffer:

p *(struct radix_tree_node *)0xffff88206744b6e0
$29 = {
  height = 1,
  count = 4,
  rcu_head = {
    next = 0xffff88206744cb48,
    func = 0xffffffff8127c5c0 <radix_tree_node_rcu_free>
  },
  slots = {0xffff8820680df780, 0xffff8820680df6c0, 0xffff8820680df600, 0xffff8820680df540,

crash> p *((struct xfs_perag *)0xffff8820680df540)->pag_buf_tree.rb_node
$36 = {
  rb_parent_color = 1,
  rb_right = 0xffff882043310e00,
  rb_left = 0xffff882057028080
}
crash> p /x *(struct xfs_buf *)0xffff882043310e00 |grep b_bn
  b_bn = 0xc3c3f0,
crash> p *((struct xfs_perag *)0xffff8820680df6c0)->pag_buf_tree.rb_node
$38 = {
  rb_parent_color = 1,
  rb_right = 0xffff881045670980,
  rb_left = 0xffff8810638106c0
}
crash> p /x *(struct xfs_buf *)0xffff881045670980 |grep b_bn
  b_bn = 0x415098,
crash> p /x *(struct xfs_buf *)0xffff881045670980 |grep right
    rb_right = 0xffff881eda7bf540,
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep b_bn
  b_bn = 0x44c798,
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep right
    rb_right = 0xffff881cbea63200,
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep left
    rb_left = 0xffff880ccd8ff500
crash> p /x *(struct xfs_buf *)0xffff880ccd8ff500 |grep b_bn
  b_bn = 0x41e700,
crash> p /x *(struct xfs_buf *)0xffff880ccd8ff500 |grep right
    rb_right = 0xffff88101f3bd3c0,
crash> p /x *(struct xfs_buf *)0xffff88101f3bd3c0 |grep b_bn
  b_bn = 0x41f7b0,
crash> p /x *(struct xfs_buf *)0xffff88101f3bd3c0 |grep right
    rb_right = 0xffff880d30db3e00,
crash> p /x *(struct xfs_buf *)0xffff880d30db3e00 |grep b_bn
  b_bn = 0x41ff80,
crash> p /x *(struct xfs_buf *)0xffff880d30db3e00 |grep right
    rb_right = 0xffff880ae4962c80,
crash> p /x *(struct xfs_buf *)0xffff880ae4962c80 |grep b_bn
  b_bn = 0x434070,
crash> p /x *(struct xfs_buf *)0xffff880ae4962c80 |grep right
    rb_right = 0xffff880d2458ee00,
crash> p /x *(struct xfs_buf *)0xffff880d2458ee00 |grep b_bn
  b_bn = 0x435e60,
crash> p /x *(struct xfs_buf *)0xffff880d2458ee00 |grep right
    rb_right = 0xffff880d2464d6c0,
crash> p /x *(struct xfs_buf *)0xffff880d2464d6c0 |grep b_bn
  b_bn = 0x436060,
crash> p /x *(struct xfs_buf *)0xffff880d2464d6c0 |grep left
    rb_left = 0xffff881ed62c2200
crash> p /x *(struct xfs_buf *)0xffff881ed62c2200 |grep b_bn
  b_bn = 0x435ea8,
crash> p /x *(struct xfs_buf *)0xffff881ed62c2200 |grep right
    rb_right = 0xffff880d24624800,
crash> p /x *(struct xfs_buf *)0xffff880d24624800 |grep b_bn
  b_bn = 0x436010,
crash> p /x *(struct xfs_buf *)0xffff880d24624800 |grep right
    rb_right = 0xffff880d24624680,
crash> p /x *(struct xfs_buf *)0xffff880d24624680 |grep b_bn
  b_bn = 0x436020,

crash> p /x *(struct xfs_buf *)0xffff880d24624680
$61 = {
  b_rbnode = {
    rb_parent_color = 0xffff880d24624801,
    rb_right = 0x0,
    rb_left = 0x0
  },
  b_file_offset = 0x86c04000,
  b_buffer_length = 0x2000,
  b_hold = {
    counter = 0x3
  },
  b_lru_ref = {
    counter = 0x2
  },
  b_flags = 0x240070,

#define XBF_ASYNC       (1 << 4)  /* initiator will not wait for completion */
#define XBF_DONE        (1 << 5)  /* all pages in the buffer uptodate */
#define XBF_DELWRI      (1 << 6)  /* buffer has dirty pages */
#define _XBF_PAGES      (1 << 18) /* backed by refcounted pages */
#define _XBF_DELWRI_Q   (1 << 21) /* buffer on delwri queue */

And it's as I suspected. The inode buffer (b_bn = 0x436020, matching ilf_blkno above) is on a delayed write queue, and has not been written to disk yet.
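The hand-walk above is an ordinary binary search: each AG's buffer cache is a red-black tree keyed on b_bn, so you descend right when the target block number is greater than the node's, left when it is smaller. A minimal sketch with a hypothetical node type (the real walk is the rbtree loop in _xfs_buf_find()), using the last three nodes visited above:

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified per-AG buffer cache node, keyed on disk block number. */
    struct buf {
        uint64_t    b_bn;
        struct buf *rb_left;
        struct buf *rb_right;
    };

    /* Descend exactly as done by hand above: right when the target
     * block is past this node, left when it is before it. */
    static struct buf *find_buf(struct buf *node, uint64_t bn)
    {
        while (node) {
            if (bn < node->b_bn)
                node = node->rb_left;
            else if (bn > node->b_bn)
                node = node->rb_right;
            else
                return node;    /* e.g. the buffer at b_bn 0x436020 */
        }
        return NULL;            /* not cached */
    }

    int main(void)
    {
        /* The final three steps of the walk: 0x435ea8 -> 0x436010 -> 0x436020. */
        struct buf target = { 0x436020, NULL, NULL };
        struct buf right  = { 0x436010, NULL, &target };
        struct buf root   = { 0x435ea8, NULL, &right };
        return find_buf(&root, 0x436020) == &target ? 0 : 1;
    }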
The callback is xfs_buf_iodone_callbacks(), and the xfs_buf_log_item is:

p /x *(struct xfs_buf_log_item *)0xffff8801097d95b8
$64 = {
  bli_item = {
    li_ail = {
      next = 0xffff881d3fc1f518,
      prev = 0xffff881d3fc1f5f8
    },
    li_lsn = 0xfd000030b7,
    li_desc = 0x0,
    li_mountp = 0xffff8820694fd800,
    li_ailp = 0xffff882068b74cc0,
    li_type = 0x123c,
    li_flags = 0x1,
    li_bio_list = 0xffff881d3fc1f518,
    li_cb = 0xffffffffa02d90a0,
    li_ops = 0xffffffffa0339c20,
    li_cil = {
      next = 0xffff8801097d9608,
      prev = 0xffff8801097d9608
    },
    li_lv = 0x0,
    li_seq = 0xc42
  },
  bli_buf = 0xffff880d24624680,
  bli_flags = 0x2,
  bli_recur = 0x1,
  bli_refcount = {
    counter = 0x1
  },
  bli_format = {
    blf_type = 0x123c,
    blf_size = 0xe,
    blf_flags = 0x1,
    blf_len = 0x10,
    blf_blkno = 0x436020,
    blf_map_size = 0x3,
    blf_data_map = {0x11111111}
  }
}

Here's the interesting bit:

    li_bio_list = 0xffff881d3fc1f518

That points to the second inode in the AIL - another inode in the same cluster as the one at the tail above. And if I walk the li_bio_list, I find the inode log item for the inode at the tail of the log that I printed above. So that confirms that we have an inode buffer that has not been written to disk, but is queued for IO (i.e. on the delwri queue). The next step is to work out why it hasn't been flushed.

-Dave.

crash> p jiffies
jiffies = $80 = 4401938293
crash> p *(struct xfs_buf *)0xffff880d24624680 |grep queuetime
  b_queuetime = 4393177144,
crash> p 4401938293-4393177144
$82 = 8761149

So it was queued 8761149 jiffies ago - at HZ=1000 that is roughly 8761 seconds, or about 146 minutes, which matches how long xfssyncd has been stuck.

crash> p xfs_params.xfs_buf_age.val
$83 = 1500

So the buffer flush age hasn't been fiddled with. The xfsbufd is not stuck, so the buffer must be locked or pinned, and the delwri code is skipping it. The b_sema count is 1, which means that the buffer is unlocked. However, the pin count is 1, which means the buffer is pinned and cannot be flushed until the log is forced. And that's the reason for the deadlock.

So, the short story is:

- the inode pins the tail of the log
- the inode is flushed to its backing buffer
- the backing buffer is marked for delayed write
- xfsbufd can't flush the buffer because the buffer is pinned
- xfssyncd can't force the log because it is stuck waiting for log space

Time to think about a solution....

-Dave.

Looks like Dave posted a patch for review: http://oss.sgi.com/pipermail/xfs/2013-October/030641.html

There is no workaround for the problem. The log size cannot be changed after the filesystem has been made, and logbsize does not affect the amount of space transactions use. Even if you increase the size of the log, that doesn't prevent the problem from occurring, and probably won't even change the chances of hitting the deadlock significantly if the workload is sustained for any period of time....

Patch posted: http://post-office.corp.redhat.com/archives/rhkernel-list/2013-October/msg01087.html

Brew build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=6394089
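For the record, the shape of the fix (as described by the doc text and the patch posted for review) is to stop the periodic sync worker from unconditionally reserving log space for a dummy transaction; it only covers the log when that is actually needed, and otherwise forces the log so pinned buffers get unpinned and written, letting the tail move. A compilable sketch of that decision with stub functions standing in for the real kernel calls; an approximation of the posted patch, not the literal RHEL 6 diff:

    #include <stdbool.h>
    #include <stdio.h>

    /* Stubs standing in for the real kernel calls (hypothetical here). */
    static bool xfs_log_need_covered(void)
    {
        /* Upstream, this is only true once the AIL has emptied. */
        return false;
    }

    static void xfs_fs_log_dummy(void)
    {
        puts("dummy transaction: covers the log, but needs log space");
    }

    static void xfs_log_force(void)
    {
        puts("log force: needs no log space, unpins buffers so the tail can move");
    }

    /* The periodic worker after the fix: only take the space-consuming
     * path when covering the log is actually possible. */
    static void sync_worker(void)
    {
        if (xfs_log_need_covered())
            xfs_fs_log_dummy();
        else
            xfs_log_force();
    }

    int main(void)
    {
        sync_worker();   /* with a busy log this takes the log-force path */
        return 0;
    }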
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Patch(es) available on kernel-2.6.32-424.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1645.html