Bug 1014867
| Summary: | xfssyncd and flush device threads hang in xlog_grant_head_wait | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Brad Hubbard <bhubbard> |
| Component: | kernel | Assignee: | Dave Chinner <dchinner> |
| Status: | CLOSED ERRATA | QA Contact: | Boris Ranto <branto> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.4 | CC: | aquini, branto, clober, dchinner, dhoward, djeffery, dwoodruf, dwysocha, eguan, esandeen, hamiller, jcpunk, lmcilroy, mgoodwin, msvoboda, ndevos, pdemmers, rwheeler, sauchter, tlavigne, vkaigoro |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | 6.5 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-432.el6 | Doc Type: | Bug Fix |
| Doc Text: | When creating an XFS file system, an attempt to cover an active XFS log could, under certain circumstances, result in a deadlock between the xfssyncd and xfsbufd daemons. Consequently, several kernel threads became unresponsive and the XFS file system could not be successfully created, leading to a kernel oops. A patch has been applied to prevent this situation by forcing the active XFS log to disk. | Story Points: | --- |
| Clone Of: | | | |
| : | 1032144 (view as bug list) | Environment: | |
| Last Closed: | 2013-11-21 20:32:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1025439, 1032147, 1032150 | | |
| Bug Blocks: | 1021813, 1032144, 1032688 | | |
There is no IO on any device reported by 'dev -d', and the kernel version is 2.6.32-358.6.1.el6.x86_64. Looking at the vmcore:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.6.1.el6.x86_64/vmlinux
DUMPFILE: /cores/retrace/tasks/418041799/crash/vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Tue Oct 1 11:36:52 2013
UPTIME: 1 days, 05:47:50
LOAD AVERAGE: 64.17, 63.52, 59.00
TASKS: 2768
NODENAME: <nodename>
RELEASE: 2.6.32-358.6.1.el6.x86_64
VERSION: #1 SMP Fri Mar 29 16:51:51 EDT 2013
MACHINE: x86_64 (2700 Mhz)
MEMORY: 128 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
crash> ps | grep UN | wc -l
64
The load is made up entirely of uninterruptible processes.
crash> ps | grep UN | head
1308 1282 21 ffff88200b50a040 UN 0.0 15400 1024 find
1389 1365 25 ffff881d3fea4aa0 UN 0.0 15400 1024 find
1482 1481 26 ffff880d30c7e080 UN 0.0 15408 1036 find
1557 1556 28 ffff881a780ca040 UN 0.0 15408 1032 find
2775 2774 30 ffff880d246f1540 UN 0.0 108456 1876 bash
3207 2 17 ffff88106684aae0 UN 0.0 0 0 [xfssyncd/dm-26]
3208 2 24 ffff88106390b500 UN 0.0 0 0 [xfsaild/dm-26]
3445 3427 27 ffff882068454ae0 UN 0.0 15400 1028 find
3670 3656 17 ffff881d3fea6ae0 UN 0.0 15400 1024 find
5860 5834 17 ffff880ffa126040 UN 0.0 15400 1024 find
crash> set 3207
PID: 3207
COMMAND: "xfssyncd/dm-26"
TASK: ffff88106684aae0 [THREAD_INFO: ffff881064e8e000]
CPU: 17
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 3207 TASK: ffff88106684aae0 CPU: 17 COMMAND: "xfssyncd/dm-26"
#0 [ffff881064e8fc10] schedule at ffffffff8150db42
#1 [ffff881064e8fcd8] xlog_grant_head_wait at ffffffffa02f7b99 [xfs]
#2 [ffff881064e8fd38] xlog_grant_head_check at ffffffffa02f7d5e [xfs]
#3 [ffff881064e8fd78] xfs_log_reserve at ffffffffa02f7e50 [xfs]
#4 [ffff881064e8fdb8] xfs_trans_reserve at ffffffffa03059d7 [xfs]
#5 [ffff881064e8fe18] xfs_fs_log_dummy at ffffffffa02e9273 [xfs]
#6 [ffff881064e8fe48] xfs_sync_worker at ffffffffa031b8d0 [xfs]
#7 [ffff881064e8fe68] xfssyncd at ffffffffa031b7fe [xfs]
#8 [ffff881064e8fee8] kthread at ffffffff81096936
#9 [ffff881064e8ff48] kernel_thread at ffffffff8100c0ca
It's blocked on the xfs log waiting for space to become available. For how long?
crash> task -R sched_info
PID: 3207 TASK: ffff88106684aae0 CPU: 17 COMMAND: "xfssyncd/dm-26"
sched_info = {
pcount = 6362,
run_delay = 22708276,
last_arrival = 98453000519172,
last_queued = 0,
bkl_count = 0
},
crash> runq | grep "CPU 17"
CPU 17 RUNQUEUE: ffff880028336700
crash> rq.clock ffff880028336700
clock = 107200042291679
crash> pd (107200042291679-98453000519172)
$5 = 8747041772507
crash> pd (107200042291679-98453000519172)/1000000000
$6 = 8747
crash> pd 8747/60
$7 = 145
145 minutes. That's definitely stuck.
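The nanosecond arithmetic above, collected in one place for reference (plain Python operating on the scheduler-clock values from the dump; the clock values are in nanoseconds):

```python
# Scheduler clock values are nanoseconds since boot (from the crash output above).
rq_clock     = 107200042291679   # rq.clock for CPU 17's runqueue
last_arrival = 98453000519172    # sched_info.last_arrival for PID 3207

stuck_ns = rq_clock - last_arrival
stuck_s  = stuck_ns // 10**9     # integer division, matching crash's pd output
print(stuck_ns, stuck_s, stuck_s // 60)  # 8747041772507 8747 145
```

The same three numbers crash's `pd` produced, so the thread last ran about 145 minutes before the dump was taken.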
First thing we need is how many bytes it is asking for.
crash> whatis xlog_grant_head_wait
int xlog_grant_head_wait(struct log *, struct xlog_grant_head *, struct xlog_ticket *, int);
It's in the last argument to the function.
crash> dis -rl ffffffffa02f7d5e
...
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 290
0xffffffffa02f7d4d <xlog_grant_head_check+189>: mov (%r14),%ecx
0xffffffffa02f7d50 <xlog_grant_head_check+192>: mov %r13,%rdx
0xffffffffa02f7d53 <xlog_grant_head_check+195>: mov %rbx,%rsi
0xffffffffa02f7d56 <xlog_grant_head_check+198>: mov %r12,%rdi
0xffffffffa02f7d59 <xlog_grant_head_check+201>: callq 0xffffffffa02f7af0 <xlog_grant_head_wait>
0xffffffffa02f7d5e <xlog_grant_head_check+206>: mov %eax,%edx
The 1st argument is passed in %rdi, 2nd in %rsi, 3rd in %rdx and 4th in %rcx.
crash> dis -rl ffffffffa02f7b99
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 214
0xffffffffa02f7af0 <xlog_grant_head_wait>: push %rbp
0xffffffffa02f7af1 <xlog_grant_head_wait+1>: mov %rsp,%rbp
0xffffffffa02f7af4 <xlog_grant_head_wait+4>: push %r15
0xffffffffa02f7af6 <xlog_grant_head_wait+6>: push %r14
0xffffffffa02f7af8 <xlog_grant_head_wait+8>: push %r13
0xffffffffa02f7afa <xlog_grant_head_wait+10>: push %r12
0xffffffffa02f7afc <xlog_grant_head_wait+12>: push %rbx
...
crash> bt -f
PID: 3207 TASK: ffff88106684aae0 CPU: 17 COMMAND: "xfssyncd/dm-26"
...
#1 [ffff881064e8fcd8] xlog_grant_head_wait at ffffffffa02f7b99 [xfs]
ffff881064e8fce0: ffff8810586a1a00 0000000000000280
ffff881064e8fcf0: ffff8820681aef58 ffff88106684aae0
ffff881064e8fd00: ffff881064e8fd20 ffff8820681aef40
ffff881064e8fd10: ffff8820681aee00 ffff880f6ce57b50
ffff881064e8fd20: ffff881064e8fd8c 0000000000000280
ffff881064e8fd30: ffff881064e8fd70 ffffffffa02f7d5e
#2 [ffff881064e8fd38] xlog_grant_head_check at ffffffffa02f7d5e [xfs]
...
So the log is at:
crash> struct log ffff8820681aee00
struct log {
l_mp = 0xffff8820694fd800,
l_ailp = 0xffff882068b74cc0,
l_cilp = 0xffff882068b74dc0,
l_xbuf = 0xffff882065f41680,
l_targ = 0xffff8820680df9c0,
...
And need_bytes is at:
crash> rd ffff881064e8fd8c
ffff881064e8fd8c: c621b5e000000ab4 ......!.
crash> pd 0xab4
$4 = 2740
So it needs 2740 bytes.
The function arguments to xlog_space_left are the log and &head->grant:
crash> xlog_grant_head.grant ffff8820681aef40
grant = {
counter = 0xfd0095c2f0
}
head_cycle = 0xfd
head_bytes = 0x95c2f0
crash> struct log.l_tail_lsn ffff8820681aee00
l_tail_lsn = {
counter = 0xfc00004e4f
}
tail_cycle = 0xfc
tail_bytes = 0x9c9e00
free_bytes = tail_bytes - head_bytes = 449296 bytes.
So there is plenty of free space. Did the xfssyncd thread not get woken up?
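The free-space calculation above can be reproduced with a small sketch modeled on xlog_space_left(). Assumptions, matching the values decoded above: the grant head packs (cycle << 32 | bytes), the tail LSN packs (cycle << 32 | basic-block number), and basic blocks are 512 bytes. The log-size argument is a placeholder; it is only used when head and tail are in the same cycle, which is not the case here:

```python
BBSHIFT = 9  # XFS basic blocks are 512 bytes

def space_left(grant_counter, tail_lsn, log_size):
    # grant head: (cycle << 32 | bytes); tail LSN: (cycle << 32 | block number)
    head_cycle, head_bytes = grant_counter >> 32, grant_counter & 0xffffffff
    tail_cycle, tail_block = tail_lsn >> 32, tail_lsn & 0xffffffff
    tail_bytes = tail_block << BBSHIFT
    if tail_cycle == head_cycle:
        # head and tail in the same cycle: free space wraps around the log end
        return log_size - (head_bytes - tail_bytes)
    if tail_cycle + 1 == head_cycle:
        # head has wrapped one cycle past the tail
        return tail_bytes - head_bytes
    raise ValueError("inconsistent grant head / tail LSN")

# values from the dm-26 dump above; head cycle 0xfd, tail cycle 0xfc,
# so the wrapped branch is taken and log_size is unused
print(space_left(0xfd0095c2f0, 0xfc00004e4f, 10 * 1024 * 1024))  # 449296
```

This reproduces the 449296 bytes of free space computed by hand above.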
crash> set 1169
PID: 1169
COMMAND: "xfsaild/dm-30"
TASK: ffff8801f78f4aa0 [THREAD_INFO: ffff8801fb526000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE|TASK_WAKEKILL
crash> bt
PID: 1169 TASK: ffff8801f78f4aa0 CPU: 1 COMMAND: "xfsaild/dm-30"
#0 [ffff8801fb527d10] schedule at ffffffff8150db42
#1 [ffff8801fb527dd8] schedule_timeout at ffffffff8150e982
#2 [ffff8801fb527e88] xfsaild at ffffffffa0239933 [xfs]
#3 [ffff8801fb527ee8] kthread at ffffffff81096936
#4 [ffff8801fb527f48] kernel_thread at ffffffff8100c0ca
crash> task -R sched_info
PID: 1169 TASK: ffff8801f78f4aa0 CPU: 1 COMMAND: "xfsaild/dm-30"
sched_info = {
pcount = 77183915,
run_delay = 973329148430,
last_arrival = 3824697247325310,
last_queued = 0,
bkl_count = 0
},
crash> runq | grep "CPU 1"
CPU 1 RUNQUEUE: ffff880028316700
crash> rq.clock ffff880028316700
clock = 3824697253567538
crash> pd (3824697253567538-3824697247325310)
$15 = 6242228
crash> pd (3824697253567538-3824697247325310)/1000000000
$16 = 0
Okay, that's not stuck.
crash> dis -rl ffffffffa0239933
...
0xffffffffa0239923 <xfsaild+0x63>: mov -0x48(%rbp),%edi
0xffffffffa0239926 <xfsaild+0x66>: callq 0xffffffff81074a20 <msecs_to_jiffies>
0xffffffffa023992b <xfsaild+0x6b>: mov %rax,%rdi
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/linux-2.6/xfs_super.c: 803
0xffffffffa023992e <xfsaild+0x6e>: callq 0xffffffff8150e7f0 <schedule_timeout>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/arch/x86/include/asm/bitops.h: 359
0xffffffffa0239933 <xfsaild+0x73>: mov 0x8(%r12),%rax
'tout' is in %rbp-0x48
crash> dis -rl ffffffff8150e982
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/kernel/timer.c: 1357
0xffffffff8150e7f0 <schedule_timeout>: push %rbp
...
crash> bt -f
PID: 1169 TASK: ffff8801f78f4aa0 CPU: 1 COMMAND: "xfsaild/dm-30"
...
#1 [ffff8801fb527dd8] schedule_timeout at ffffffff8150e982
ffff8801fb527de0: 0000000000000084 0000000000000286
ffff8801fb527df0: ffff8801fc8e8ac0 ffff8801fbbbbdf0
ffff8801fb527e00: 00000001e40120aa ffffffff810810e0
ffff8801fb527e10: ffff8801f78f4aa0 ffff8801fc8e8000
ffff8801fb527e20: 0000000000000000 0000000000000000
ffff8801fb527e30: 0000000000000000 ffff8801ffffffff
ffff8801fb527e40: 00001846000022cc 0000000042b70c2d
ffff8801fb527e50: 0000000000000282 ffff8801f89ae8c0
ffff8801fb527e60: ffff8801f78f4aa0 ffff8801f78f4aa0
ffff8801fb527e70: ffff8801f78f4aa0 ffff8801fb527ea8
ffff8801fb527e80: ffff8801fb527ee0 ffffffffa0239933
#2 [ffff8801fb527e88] xfsaild at ffffffffa0239933 [xfs]
...
crash> px 0xffff8801fb527ee0-0x48
$11 = 0xffff8801fb527e98
crash> rd 0xffff8801fb527e98
ffff8801fb527e98: 000000000000000a ........
So tout is 0xa (10 ms).
crash> dis xfsaild
...
0xffffffffa02398df <xfsaild+31>: lea -0x38(%rbp),%r15
...
0xffffffffa023994f <xfsaild+143>: mov %r15,%rsi
0xffffffffa0239952 <xfsaild+146>: mov %rbx,%rdi
0xffffffffa0239955 <xfsaild+149>: callq 0xffffffffa0225c70 <xfsaild_push>
0xffffffffa023995a <xfsaild+154>: mov %rax,-0x48(%rbp)
'last_pushed_lsn' is in %rbp-0x38
crash> px 0xffff8801fb527ee0-0x38
$17 = 0xffff8801fb527ea8
crash> rd 0xffff8801fb527ea8
ffff8801fb527ea8: 00001846000022cc ."..F...
last_cycle = 0x1846
last_bytes = 0x459800
This is the same as log->l_tail_lsn so xfsaild has pushed the tail up to a point and doesn't seem to need to push it any further.
crash> struct log.l_reserve_head ffff8801f7934c00
l_reserve_head = {
lock = {
raw_lock = {
slock = 3543192368
}
},
waiters = {
next = 0xffff88015e60e140,
prev = 0xffff88017ca1ed78
},
grant = {
counter = 26693225856316
}
}
crash> list -H ffff8801f7934d48 -o xlog_ticket.t_queue -s xlog_ticket.t_task,t_curr_res
ffff88015e60e140
t_task = 0xffff880062592aa0
t_curr_res = 224820
ffff88017ca1e9e0
t_task = 0xffff8801f78f5500
t_curr_res = 2740
ffff8801408291f8
t_task = 0xffff8801eb794ae0
t_curr_res = 112144
ffff880135b3eb50
t_task = 0xffff880146f48080
t_curr_res = 2740
ffff8801325fecc0
t_task = 0xffff8801f8ccc080
t_curr_res = 162588
ffff88012a782368
t_task = 0xffff88011d5daaa0
t_curr_res = 162588
ffff88012a782cc0
t_task = 0xffff88011154eae0
t_curr_res = 2740
ffff8801408294d8
t_task = 0xffff8801b38b4040
t_curr_res = 2740
ffff88012749fc08
t_task = 0xffff880111574040
t_curr_res = 2740
ffff8801a5f0ce30
t_task = 0xffff880110da0ae0
t_curr_res = 2740
ffff88017ca1e648
t_task = 0xffff88010e82c040
t_curr_res = 2740
ffff88017ca1ed78
t_task = 0xffff880090928ae0
t_curr_res = 224820
Pick the first task off the queue:
crash> set 0xffff880062592aa0
PID: 18869
COMMAND: "java"
TASK: ffff880062592aa0 [THREAD_INFO: ffff8801f9182000]
CPU: 0
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 18869 TASK: ffff880062592aa0 CPU: 0 COMMAND: "java"
#0 [ffff8801f9183b98] schedule at ffffffff8150db42
#1 [ffff8801f9183c60] xlog_grant_head_wait at ffffffffa0216b99 [xfs]
#2 [ffff8801f9183cc0] xlog_grant_head_check at ffffffffa0216d5e [xfs]
#3 [ffff8801f9183d00] xfs_log_reserve at ffffffffa0216e50 [xfs]
#4 [ffff8801f9183d40] xfs_trans_reserve at ffffffffa02249d7 [xfs]
#5 [ffff8801f9183da0] xfs_inactive at ffffffffa022a2d3 [xfs]
#6 [ffff8801f9183df0] xfs_fs_clear_inode at ffffffffa0237930 [xfs]
#7 [ffff8801f9183e10] clear_inode at ffffffff8119cfac
#8 [ffff8801f9183e30] generic_delete_inode at ffffffff8119d766
#9 [ffff8801f9183e60] generic_drop_inode at ffffffff8119d805
#10 [ffff8801f9183e80] iput at ffffffff8119c652
#11 [ffff8801f9183ea0] do_unlinkat at ffffffff81191eb4
#12 [ffff8801f9183f70] sys_unlink at ffffffff81191f56
#13 [ffff8801f9183f80] system_call_fastpath at ffffffff8100b072
RIP: 00000034240dc7d7 RSP: 00007f2881a796b8 RFLAGS: 00210246
RAX: 0000000000000057 RBX: ffffffff8100b072 RCX: 0000000003cf5848
RDX: 00007f28839e8580 RSI: 0000000000000000 RDI: 00007f28839e8580
RBP: 00007f2881a79600 R8: 00007f28839e8580 R9: 000000001bf6e2a0
R10: 00007f288489db60 R11: 0000000000200202 R12: ffffffff81191f56
R13: ffff8801f9183f78 R14: 00007f28848a13c0 R15: 0000000000000000
ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b
crash> bt -f
PID: 18869 TASK: ffff880062592aa0 CPU: 0 COMMAND: "java"
...
#1 [ffff8801f9183c60] xlog_grant_head_wait at ffffffffa0216b99 [xfs]
ffff8801f9183c68: ffff8801fb072b40 00000000000359b8
ffff8801f9183c78: ffff8801f7934d58 ffff880062592aa0
ffff8801f9183c88: ffff8801f9183ca8 ffff8801f7934d40
ffff8801f9183c98: ffff8801f7934c00 ffff88015e60e140
ffff8801f9183ca8: ffff8801f9183d14 00000000000359b8
ffff8801f9183cb8: ffff8801f9183cf8 ffffffffa0216d5e
#2 [ffff8801f9183cc0] xlog_grant_head_check at ffffffffa0216d5e [xfs]
...
crash> rd ffff8801f9183d14
ffff8801f9183d14: 1d4b5ea00006dc68 h....^K.
crash> pd 0x6dc68
$25 = 449640
Ah, this task wants 449640 bytes and there is only 449220 free in the log. All the other tasks blocked in xlog_grant_head_wait() are waiting for this one.
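Those two numbers can be checked against the raw counters above (same cycle/byte packing as decoded earlier; plain Python):

```python
BBSHIFT = 9                    # XFS basic blocks are 512 bytes
grant = 26693225856316         # l_reserve_head.grant.counter for this log
tail  = 0x1846000022cc         # tail LSN (cycle:block) that xfsaild pushed to

head_cycle, head_bytes = grant >> 32, grant & 0xffffffff
tail_cycle, tail_bytes = tail >> 32, (tail & 0xffffffff) << BBSHIFT

assert tail_cycle + 1 == head_cycle  # head has wrapped one cycle past the tail
free_bytes = tail_bytes - head_bytes
print(free_bytes)                    # 449220
print(449640 <= free_bytes)          # False: the 449640-byte reservation won't fit
```

So the head-of-queue reservation genuinely cannot be granted, and since waiters are satisfied in order, every ticket behind it is stuck too.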
Lachlan,
Your analysis matches what I found late yesterday. A problem with my laptop meant it took all day to get back online to post the information.
I had also started walking the AIL to find out what is pinning the tail of the log. It is an inode of no particular note:
p /x *(struct xfs_inode_log_item *)0xffff880d246436d8
$16 = {
ili_item = {
li_ail = {
next = 0xffff881fe4a16518,
prev = 0xffff882068b74cc8
},
li_lsn = 0xfc00004e4f,
li_desc = 0x0,
li_mountp = 0xffff8820694fd800,
li_ailp = 0xffff882068b74cc0,
li_type = 0x123b,
li_flags = 0x1,
li_bio_list = 0x0,
li_cb = 0xffffffffa02f27b0,
li_ops = 0xffffffffa0339d60,
li_cil = {
next = 0xffff880d24643728,
prev = 0xffff880d24643728
},
li_lv = 0x0,
li_seq = 0xbf5
},
ili_inode = 0xffff880ae69e4000,
ili_flush_lsn = 0xfc00004e4f,
ili_last_lsn = 0xc3d,
ili_lock_flags = 0x0,
ili_logged = 0x1,
ili_last_fields = 0x5,
ili_fields = 0x0,
ili_extents_buf = 0x0,
ili_aextents_buf = 0x0,
ili_format = {
ilf_type = 0x123b,
ilf_size = 0x3,
ilf_fields = 0x5,
ilf_asize = 0x2c,
ilf_dsize = 0x10,
ilf_ino = 0x436029,
ilf_u = {
ilfu_rdev = 0x0,
ilfu_uuid = {
__u_bits = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
},
ilf_blkno = 0x436020,
ilf_len = 0x10,
ilf_boffset = 0x1200
}
}
The only thing really interesting about this inode is the ili_flush_lsn, which matches the log tail_lsn. That implies the inode has been flushed to its backing buffer, but the backing buffer has not yet been flushed. That ili_logged = 1 and ili_fields = 0 strongly implies that the inode has been flushed since the last time it was logged.
Anyway, the next thing to look at is the inode itself, to determine whether the flush lock is held. If it is held, then the inode is waiting for buffer I/O completion to occur; that completion will remove the inode from the AIL, freeing up log space. In that case, we need to find the buffer containing the inode.
If the flush lock is not held, then this inode should not be in the AIL....
I started looking for the inode buffer in the cache, but at this point my laptop battery died and wouldn't take charge, and I've only just got it back, so I have no further information to add yet.
-Dave.
The inode flush lock:
p /x *(struct xfs_inode *)0xffff880ae69e4000
$2 = {
....
i_flush = {
done = 0x0,
wait = {
lock = {
raw_lock = {
slock = 0x1a001a
}
},
task_list = {
next = 0xffff880ae69e40d8,
prev = 0xffff880ae69e40d8
}
}
},
....
indicates that it is currently held (done = 0). So now I need to find the buffer that it's been written to....
-Dave.
Ok, I just walked AG 1's buffer cache rbtree index to find the buffer:
p *(struct radix_tree_node *)0xffff88206744b6e0
$29 = {
height = 1,
count = 4,
rcu_head = {
next = 0xffff88206744cb48,
func = 0xffffffff8127c5c0 <radix_tree_node_rcu_free>
},
slots = {0xffff8820680df780, 0xffff8820680df6c0, 0xffff8820680df600, 0xffff8820680df540,
crash> p *((struct xfs_perag *)0xffff8820680df540)->pag_buf_tree.rb_node
$36 = {
rb_parent_color = 1,
rb_right = 0xffff882043310e00,
rb_left = 0xffff882057028080
}
crash> p /x *(struct xfs_buf *)0xffff882043310e00 |grep b_bn
b_bn = 0xc3c3f0,
crash> p *((struct xfs_perag *)0xffff8820680df6c0)->pag_buf_tree.rb_node
$38 = {
rb_parent_color = 1,
rb_right = 0xffff881045670980,
rb_left = 0xffff8810638106c0
}
crash> p /x *(struct xfs_buf *)0xffff881045670980 |grep b_bn
b_bn = 0x415098,
crash> p /x *(struct xfs_buf *)0xffff881045670980 |grep right
rb_right = 0xffff881eda7bf540,
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep b_bn
b_bn = 0x44c798,
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep right
rb_right = 0xffff881cbea63200,
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep left
rb_left = 0xffff880ccd8ff500
crash> p /x *(struct xfs_buf *)0xffff880ccd8ff500 |grep b_bn
b_bn = 0x41e700,
crash> p /x *(struct xfs_buf *)0xffff880ccd8ff500 |grep right
rb_right = 0xffff88101f3bd3c0,
crash> p /x *(struct xfs_buf *)0xffff88101f3bd3c0 |grep b_bn
b_bn = 0x41f7b0,
crash> p /x *(struct xfs_buf *)0xffff88101f3bd3c0 |grep right
rb_right = 0xffff880d30db3e00,
crash> p /x *(struct xfs_buf *)0xffff880d30db3e00 |grep b_bn
b_bn = 0x41ff80,
crash> p /x *(struct xfs_buf *)0xffff880d30db3e00 |grep right
rb_right = 0xffff880ae4962c80,
crash> p /x *(struct xfs_buf *)0xffff880ae4962c80 |grep b_bn
b_bn = 0x434070,
crash> p /x *(struct xfs_buf *)0xffff880ae4962c80 |grep right
rb_right = 0xffff880d2458ee00,
crash> p /x *(struct xfs_buf *)0xffff880d2458ee00 |grep b_bn
b_bn = 0x435e60,
crash> p /x *(struct xfs_buf *)0xffff880d2458ee00 |grep right
rb_right = 0xffff880d2464d6c0,
crash> p /x *(struct xfs_buf *)0xffff880d2464d6c0 |grep b_bn
b_bn = 0x436060,
crash> p /x *(struct xfs_buf *)0xffff880d2464d6c0 |grep left
rb_left = 0xffff881ed62c2200
crash> p /x *(struct xfs_buf *)0xffff881ed62c2200 |grep b_bn
b_bn = 0x435ea8,
crash> p /x *(struct xfs_buf *)0xffff881ed62c2200 |grep right
rb_right = 0xffff880d24624800,
crash> p /x *(struct xfs_buf *)0xffff880d24624800 |grep b_bn
b_bn = 0x436010,
crash> p /x *(struct xfs_buf *)0xffff880d24624800 |grep right
rb_right = 0xffff880d24624680,
crash> p /x *(struct xfs_buf *)0xffff880d24624680 |grep b_bn
b_bn = 0x436020,
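The manual walk above is an ordinary binary-search-tree descent keyed on b_bn, looking for the buffer covering ilf_blkno = 0x436020. A quick sketch (the node values are exactly the ones visited above) reproduces the left/right decisions:

```python
target = 0x436020  # ilf_blkno of the inode pinning the tail of the log
# b_bn of each rbtree node visited in the crash session, in order
path = [0x415098, 0x44c798, 0x41e700, 0x41f7b0, 0x41ff80,
        0x434070, 0x435e60, 0x436060, 0x435ea8, 0x436010, 0x436020]

for bn in path:
    if target == bn:
        print("found", hex(bn))
        break
    # go right when the target block is past this node, left otherwise
    print("go", "right" if target > bn else "left", "at", hex(bn))
```

The descent ends at b_bn = 0x436020, the buffer examined next.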
crash> p /x *(struct xfs_buf *)0xffff880d24624680
$61 = {
b_rbnode = {
rb_parent_color = 0xffff880d24624801,
rb_right = 0x0,
rb_left = 0x0
},
b_file_offset = 0x86c04000,
b_buffer_length = 0x2000,
b_hold = {
counter = 0x3
},
b_lru_ref = {
counter = 0x2
},
b_flags = 0x240070,
#define XBF_ASYNC (1 << 4) /* initiator will not wait for completion */
#define XBF_DONE (1 << 5) /* all pages in the buffer uptodate */
#define XBF_DELWRI (1 << 6) /* buffer has dirty pages */
#define _XBF_PAGES (1 << 18)/* backed by refcounted pages */
#define _XBF_DELWRI_Q (1 << 21)/* buffer on delwri queue */
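As a sanity check, the defines quoted above can be applied to b_flags = 0x240070 with a quick Python sketch (the flag values are the ones shown, from this kernel's xfs_buf.h; nothing else is assumed):

```python
# Flag bits quoted from the defines above
FLAGS = {
    1 << 4:  "XBF_ASYNC",
    1 << 5:  "XBF_DONE",
    1 << 6:  "XBF_DELWRI",
    1 << 18: "_XBF_PAGES",
    1 << 21: "_XBF_DELWRI_Q",
}

b_flags = 0x240070
decoded = [name for bit, name in sorted(FLAGS.items()) if b_flags & bit]
print(decoded)
# ['XBF_ASYNC', 'XBF_DONE', 'XBF_DELWRI', '_XBF_PAGES', '_XBF_DELWRI_Q']
```

All five bits are set and together account for every bit in 0x240070: the buffer is dirty, on the delwri queue, and marked for async writeback.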
And it's as I suspected. The inode buffer is on a delayed write queue, and has not been written to disk yet. The callback is xfs_buf_iodone_callbacks(), and
the xfs_buf_log_item is:
p /x *(struct xfs_buf_log_item *)0xffff8801097d95b8
$64 = {
bli_item = {
li_ail = {
next = 0xffff881d3fc1f518,
prev = 0xffff881d3fc1f5f8
},
li_lsn = 0xfd000030b7,
li_desc = 0x0,
li_mountp = 0xffff8820694fd800,
li_ailp = 0xffff882068b74cc0,
li_type = 0x123c,
li_flags = 0x1,
li_bio_list = 0xffff881d3fc1f518,
li_cb = 0xffffffffa02d90a0,
li_ops = 0xffffffffa0339c20,
li_cil = {
next = 0xffff8801097d9608,
prev = 0xffff8801097d9608
},
li_lv = 0x0,
li_seq = 0xc42
},
bli_buf = 0xffff880d24624680,
bli_flags = 0x2,
bli_recur = 0x1,
bli_refcount = {
counter = 0x1
},
bli_format = {
blf_type = 0x123c,
blf_size = 0xe,
blf_flags = 0x1,
blf_len = 0x10,
blf_blkno = 0x436020,
blf_map_size = 0x3,
blf_data_map = {0x11111111}
}
here's the interesting bit:
li_bio_list = 0xffff881d3fc1f518
That points to the second inode in the AIL - another inode in the same cluster as teh one at the tail above. and if I walk the li_bio_list, I find the inode log item for the inode at the tail of the log that I printed above.
So, that confirms that we have an inode buffer that has not been written to disk, but it is queued for IO (i.e. on the delwri queue). Next step is to work out why it hasn't been flushed.
-Dave.
}
crash> p jiffies jiffies = $80 = 4401938293 crash> p *(struct xfs_buf *)0xffff880d24624680 |grep queuetime b_queuetime = 4393177144, crash> p 4401938293-4393177144 $82 = 8761149 so it was queued 8761149 jiffies ago. crash> p xfs_params.xfs_buf_age.val $83 = 1500 So the buffer flush age hasn't been fiddled with. the xfs_bufd is not stuck, so that means the buffer must be locked or pinned and so the delwri code is skipping it. The b_sema count is 1, which means that the buffer is unlocked. However, the pin count is 1, which means the buffer is pinned and cannot be flushed until the log is forced. And that's the reason for the deadlock. So, short story is: - inode pins the tail of the log - inode flushed to backing buffer - backing buffer marked for write - xfsbufd can't flush the buffer because the buffer is pinned - xfssyncd can't force the log because it is stuck waiting for log space Time to think about a solution.... -Dave. Looks like Dave posted a patch for review http://oss.sgi.com/pipermail/xfs/2013-October/030641.html The is no workaround for the problem. The log size cannot be changed after the filesystem has been made, and logbsize does not affect the the amount of space transactions use. Even if you increase the size of the log, that doesn't prevent the problem from occurring, and probably won't even change the chances of hitting the deadlock significantly if the workload is sustained for any period of time.... Patch posted: http://post-office.corp.redhat.com/archives/rhkernel-list/2013-October/msg01087.html Brew build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=6394089 This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. 
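The buffer-age arithmetic in Dave's comment above can be rechecked with a small sketch. Two labeled assumptions: HZ = 1000 on this RHEL 6 x86_64 kernel, and xfs_buf_age is the age_buffer_centisecs sysctl, i.e. hundredths of a second:

```python
HZ = 1000                 # assumption: CONFIG_HZ=1000 on this kernel
jiffies     = 4401938293  # from the crash session above
b_queuetime = 4393177144

# assumption: xfs_buf_age.val = 1500 is in centiseconds (default 15 seconds)
age_jiffies = 1500 * HZ // 100   # 15000 jiffies

queued_for = jiffies - b_queuetime
print(queued_for, queued_for // HZ)  # 8761149 jiffies, ~8761 seconds
print(queued_for > age_jiffies)      # True: far past the normal flush age
```

So the buffer has been sitting on the delwri queue for roughly two and a half hours against a 15-second flush age, which is only possible if xfsbufd keeps skipping it, consistent with the pinned-buffer explanation above.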
Patch(es) available on kernel-2.6.32-424.el6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2013-1645.html
Description of problem:
On an xfs file system made with the following command, many threads, including xfssyncd and flush, hang in xlog_grant_head_wait.
# mkfs.xfs -i size=512 -l lazy-count=0 -l version=1 -i attr=1 -f /dev/VolGroup0O/root
First hung task message is the following.
INFO: task xfssyncd/dm-26:3207 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfssyncd/dm-2 D 0000000000000011 0 3207 2 0x00000080
ffff881064e8fcd0 0000000000000046 0000000000000000 0000000000000002
0000000000000000 0000000000000000 0000000000000000 0000000000000000
ffff88106684b098 ffff881064e8ffd8 000000000000fb88 ffff88106684b098
Call Trace:
[<ffffffffa02f7b99>] xlog_grant_head_wait+0xa9/0x1a0 [xfs]
[<ffffffffa02f7d5e>] xlog_grant_head_check+0xce/0xf0 [xfs]
[<ffffffffa02f7e50>] xfs_log_reserve+0xd0/0x230 [xfs]
[<ffffffffa03059d7>] xfs_trans_reserve+0x227/0x240 [xfs]
[<ffffffffa030ce3a>] ? kmem_zone_zalloc+0x3a/0x50 [xfs]
[<ffffffffa02e9273>] xfs_fs_log_dummy+0x43/0x90 [xfs]
[<ffffffffa02f6f54>] ? xfs_log_need_covered+0x94/0xd0 [xfs]
[<ffffffffa031b8d0>] xfs_sync_worker+0x40/0xa0 [xfs]
[<ffffffffa031b7fe>] xfssyncd+0x17e/0x210 [xfs]
[<ffffffffa031b680>] ? xfssyncd+0x0/0x210 [xfs]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
crash> dis -lr ffffffffa02f7b99|tail -5
0xffffffffa02f7b8e <xlog_grant_head_wait+158>: jne 0xffffffffa02f7c4f <xlog_grant_head_wait+351>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 228 <------==================== HERE
0xffffffffa02f7b94 <xlog_grant_head_wait+164>: callq 0xffffffff8150d790 <schedule>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/linux-2.6/xfs_trace.h: 845
0xffffffffa02f7b99 <xlog_grant_head_wait+169>: mov 0x44908(%rip),%r11d # 0xffffffffa033c4a8 <__tracepoint_xfs_log_grant_wake+8>
fs/xfs/xfs_log.c:
208 STATIC int
209 xlog_grant_head_wait(
210 	struct log		*log,
211 	struct xlog_grant_head	*head,
212 	struct xlog_ticket	*tic,
213 	int			need_bytes)
214 {
215 	list_add_tail(&tic->t_queue, &head->waiters);
216
217 	do {
218 		if (XLOG_FORCED_SHUTDOWN(log))
219 			goto shutdown;
220 		xlog_grant_push_ail(log, need_bytes);
221
222 		__set_current_state(TASK_UNINTERRUPTIBLE);
223 		spin_unlock(&head->lock);
224
225 		XFS_STATS_INC(xs_sleep_logspace);
226
227 		trace_xfs_log_grant_sleep(log, tic);
228 		schedule();   <------==================== HERE
229 		trace_xfs_log_grant_wake(log, tic);
230
231 		spin_lock(&head->lock);
232 		if (XLOG_FORCED_SHUTDOWN(log))
233 			goto shutdown;
234 	} while (xlog_space_left(log, &head->grant) < need_bytes);
235
236 	list_del_init(&tic->t_queue);
237 	return 0;
238 shutdown:
239 	list_del_init(&tic->t_queue);
240 	return XFS_ERROR(EIO);
241 }
xfs_info:
         = sectsz=512 attr=1
data     = bsize=4096 blocks=2097152, imaxpct=25
         = sunit=0 swidth=0 blks
naming   =version 2 bsize=4096 ascii-ci=0
log      =internal bsize=4096 blocks=2560, version=1
         = sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0
Vmcore task ID is 418041799 and sosreport available at https://api.access.redhat.com/rs/cases/00953932/attachments/d78e6866-f126-3aa4-b947-f1d2b766773a
Niels demonstrated this is not bz883905 by verifying the block sizes are different and the AIL is not empty.