Bug 1014867

Summary: xfssyncd and flush device threads hang in xlog_grant_head_wait
Product: Red Hat Enterprise Linux 6
Reporter: Brad Hubbard <bhubbard>
Component: kernel
Assignee: Dave Chinner <dchinner>
Status: CLOSED ERRATA
QA Contact: Boris Ranto <branto>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 6.4
CC: aquini, branto, clober, dchinner, dhoward, djeffery, dwoodruf, dwysocha, eguan, esandeen, hamiller, jcpunk, lmcilroy, mgoodwin, msvoboda, ndevos, pdemmers, rwheeler, sauchter, tlavigne, vkaigoro
Target Milestone: rc
Keywords: ZStream
Target Release: 6.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-2.6.32-432.el6
Doc Type: Bug Fix
Doc Text:
When creating an XFS file system, an attempt to cover an active XFS log could, under certain circumstances, result in a deadlock between the xfssyncd and xfsbufd daemons. Consequently, several kernel threads became unresponsive and the XFS file system could not be created successfully, leading to a kernel oops. A patch has been applied to prevent this situation by forcing the active XFS log to disk.
Story Points: ---
Clone Of:
: 1032144 (view as bug list)
Environment:
Last Closed: 2013-11-21 20:32:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1025439, 1032147, 1032150
Bug Blocks: 1021813, 1032144, 1032688

Description Brad Hubbard 2013-10-03 00:31:36 UTC
Description of problem:

On an XFS file system created with the following command, many threads, including xfssyncd and flush, hang in xlog_grant_head_wait.

# mkfs.xfs -i size=512 -l lazy-count=0 -l version=1 -i attr=1 -f /dev/VolGroup0O/root

The first hung task message is the following:

INFO: task xfssyncd/dm-26:3207 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfssyncd/dm-2 D 0000000000000011     0  3207      2 0x00000080
 ffff881064e8fcd0 0000000000000046 0000000000000000 0000000000000002
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 ffff88106684b098 ffff881064e8ffd8 000000000000fb88 ffff88106684b098
Call Trace:
 [<ffffffffa02f7b99>] xlog_grant_head_wait+0xa9/0x1a0 [xfs]
 [<ffffffffa02f7d5e>] xlog_grant_head_check+0xce/0xf0 [xfs]
 [<ffffffffa02f7e50>] xfs_log_reserve+0xd0/0x230 [xfs]
 [<ffffffffa03059d7>] xfs_trans_reserve+0x227/0x240 [xfs]
 [<ffffffffa030ce3a>] ? kmem_zone_zalloc+0x3a/0x50 [xfs]
 [<ffffffffa02e9273>] xfs_fs_log_dummy+0x43/0x90 [xfs]
 [<ffffffffa02f6f54>] ? xfs_log_need_covered+0x94/0xd0 [xfs]
 [<ffffffffa031b8d0>] xfs_sync_worker+0x40/0xa0 [xfs]
 [<ffffffffa031b7fe>] xfssyncd+0x17e/0x210 [xfs]
 [<ffffffffa031b680>] ? xfssyncd+0x0/0x210 [xfs]
 [<ffffffff81096936>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

crash> dis -lr ffffffffa02f7b99|tail -5
0xffffffffa02f7b8e <xlog_grant_head_wait+158>:  jne    0xffffffffa02f7c4f <xlog_grant_head_wait+351>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 228   <------==================== HERE
0xffffffffa02f7b94 <xlog_grant_head_wait+164>:  callq  0xffffffff8150d790 <schedule>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/linux-2.6/xfs_trace.h: 845
0xffffffffa02f7b99 <xlog_grant_head_wait+169>:  mov    0x44908(%rip),%r11d        # 0xffffffffa033c4a8 <__tracepoint_xfs_log_grant_wake+8>

fs/xfs/xfs_log.c:

    208 STATIC int
    209 xlog_grant_head_wait(
    210         struct log              *log,
    211         struct xlog_grant_head  *head,
    212         struct xlog_ticket      *tic,
    213         int                     need_bytes)
    214 {
    215         list_add_tail(&tic->t_queue, &head->waiters);
    216 
    217         do {
    218                 if (XLOG_FORCED_SHUTDOWN(log))
    219                         goto shutdown;
    220                 xlog_grant_push_ail(log, need_bytes);
    221 
    222                 __set_current_state(TASK_UNINTERRUPTIBLE);
    223                 spin_unlock(&head->lock);
    224 
    225                 XFS_STATS_INC(xs_sleep_logspace);
    226 
    227                 trace_xfs_log_grant_sleep(log, tic);
    228                 schedule();   <------==================== HERE
    229                 trace_xfs_log_grant_wake(log, tic);
    230 
    231                 spin_lock(&head->lock);
    232                 if (XLOG_FORCED_SHUTDOWN(log))
    233                         goto shutdown;
    234         } while (xlog_space_left(log, &head->grant) < need_bytes);
    235 
    236         list_del_init(&tic->t_queue);
    237         return 0;
    238 shutdown:
    239         list_del_init(&tic->t_queue);
    240         return XFS_ERROR(EIO);
    241 }

xfs_info

         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=2097152, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Vmcore task ID is 418041799 and a sosreport is available at https://api.access.redhat.com/rs/cases/00953932/attachments/d78e6866-f126-3aa4-b947-f1d2b766773a

Niels demonstrated this is not bz883905 by verifying the block sizes are different and the AIL is not empty.

Comment 1 Brad Hubbard 2013-10-03 02:05:14 UTC
No IO is reported on any device by 'dev -d', and the kernel version is 2.6.32-358.6.1.el6.x86_64.

Comment 4 Lachlan McIlroy 2013-10-03 06:06:05 UTC
Looking at the vmcore:

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.6.1.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/418041799/crash/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Tue Oct  1 11:36:52 2013
      UPTIME: 1 days, 05:47:50
LOAD AVERAGE: 64.17, 63.52, 59.00
       TASKS: 2768
    NODENAME: <nodename>
     RELEASE: 2.6.32-358.6.1.el6.x86_64
     VERSION: #1 SMP Fri Mar 29 16:51:51 EDT 2013
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 128 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)

crash> ps | grep UN | wc -l
64

The load is made up entirely of uninterruptible processes.

crash> ps | grep UN | head
   1308   1282  21  ffff88200b50a040  UN   0.0   15400   1024  find
   1389   1365  25  ffff881d3fea4aa0  UN   0.0   15400   1024  find
   1482   1481  26  ffff880d30c7e080  UN   0.0   15408   1036  find
   1557   1556  28  ffff881a780ca040  UN   0.0   15408   1032  find
   2775   2774  30  ffff880d246f1540  UN   0.0  108456   1876  bash
   3207      2  17  ffff88106684aae0  UN   0.0       0      0  [xfssyncd/dm-26]
   3208      2  24  ffff88106390b500  UN   0.0       0      0  [xfsaild/dm-26]
   3445   3427  27  ffff882068454ae0  UN   0.0   15400   1028  find
   3670   3656  17  ffff881d3fea6ae0  UN   0.0   15400   1024  find
   5860   5834  17  ffff880ffa126040  UN   0.0   15400   1024  find

crash> set 3207
    PID: 3207
COMMAND: "xfssyncd/dm-26"
   TASK: ffff88106684aae0  [THREAD_INFO: ffff881064e8e000]
    CPU: 17
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt

PID: 3207   TASK: ffff88106684aae0  CPU: 17  COMMAND: "xfssyncd/dm-26"
 #0 [ffff881064e8fc10] schedule at ffffffff8150db42
 #1 [ffff881064e8fcd8] xlog_grant_head_wait at ffffffffa02f7b99 [xfs]
 #2 [ffff881064e8fd38] xlog_grant_head_check at ffffffffa02f7d5e [xfs]
 #3 [ffff881064e8fd78] xfs_log_reserve at ffffffffa02f7e50 [xfs]
 #4 [ffff881064e8fdb8] xfs_trans_reserve at ffffffffa03059d7 [xfs]
 #5 [ffff881064e8fe18] xfs_fs_log_dummy at ffffffffa02e9273 [xfs]
 #6 [ffff881064e8fe48] xfs_sync_worker at ffffffffa031b8d0 [xfs]
 #7 [ffff881064e8fe68] xfssyncd at ffffffffa031b7fe [xfs]
 #8 [ffff881064e8fee8] kthread at ffffffff81096936
 #9 [ffff881064e8ff48] kernel_thread at ffffffff8100c0ca

It's blocked on the xfs log waiting for space to become available.  For how long?

crash> task -R sched_info
PID: 3207   TASK: ffff88106684aae0  CPU: 17  COMMAND: "xfssyncd/dm-26"
  sched_info = {
    pcount = 6362, 
    run_delay = 22708276, 
    last_arrival = 98453000519172, 
    last_queued = 0, 
    bkl_count = 0
  }, 

crash> runq | grep "CPU 17"
CPU 17 RUNQUEUE: ffff880028336700
crash> rq.clock ffff880028336700
  clock = 107200042291679
crash> pd (107200042291679-98453000519172)
$5 = 8747041772507
crash> pd (107200042291679-98453000519172)/1000000000
$6 = 8747
crash> pd 8747/60
$7 = 145

145 minutes.  That's definitely stuck.
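
(For reference, the same arithmetic as a small standalone sketch; it assumes, as the crash session above does, that both rq.clock and sched_info.last_arrival are sched_clock timestamps in nanoseconds. The numbers are copied from the output above.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t rq_clock     = 107200042291679ULL; /* rq.clock for CPU 17 (ns) */
	uint64_t last_arrival = 98453000519172ULL;  /* sched_info.last_arrival (ns) */
	uint64_t delta_s = (rq_clock - last_arrival) / 1000000000ULL;

	printf("blocked for %llu seconds (~%llu minutes)\n",
	       (unsigned long long)delta_s, (unsigned long long)(delta_s / 60));
	return 0;
}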

First thing we need is how many bytes it is asking for.

crash> whatis xlog_grant_head_wait
int xlog_grant_head_wait(struct log *, struct xlog_grant_head *, struct xlog_ticket *, int);

It's in the last argument to the function.

crash> dis -rl ffffffffa02f7d5e
...
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 290
0xffffffffa02f7d4d <xlog_grant_head_check+189>: mov    (%r14),%ecx
0xffffffffa02f7d50 <xlog_grant_head_check+192>: mov    %r13,%rdx
0xffffffffa02f7d53 <xlog_grant_head_check+195>: mov    %rbx,%rsi
0xffffffffa02f7d56 <xlog_grant_head_check+198>: mov    %r12,%rdi
0xffffffffa02f7d59 <xlog_grant_head_check+201>: callq  0xffffffffa02f7af0 <xlog_grant_head_wait>
0xffffffffa02f7d5e <xlog_grant_head_check+206>: mov    %eax,%edx

The 1st argument is passed in %rdi, 2nd in %rsi, 3rd in %rdx and 4th in %rcx.

crash> dis -rl ffffffffa02f7b99
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/xfs_log.c: 214
0xffffffffa02f7af0 <xlog_grant_head_wait>:      push   %rbp
0xffffffffa02f7af1 <xlog_grant_head_wait+1>:    mov    %rsp,%rbp
0xffffffffa02f7af4 <xlog_grant_head_wait+4>:    push   %r15
0xffffffffa02f7af6 <xlog_grant_head_wait+6>:    push   %r14
0xffffffffa02f7af8 <xlog_grant_head_wait+8>:    push   %r13
0xffffffffa02f7afa <xlog_grant_head_wait+10>:   push   %r12
0xffffffffa02f7afc <xlog_grant_head_wait+12>:   push   %rbx
...

crash> bt -f
PID: 3207   TASK: ffff88106684aae0  CPU: 17  COMMAND: "xfssyncd/dm-26"
...
 #1 [ffff881064e8fcd8] xlog_grant_head_wait at ffffffffa02f7b99 [xfs]
    ffff881064e8fce0: ffff8810586a1a00 0000000000000280 
    ffff881064e8fcf0: ffff8820681aef58 ffff88106684aae0 
    ffff881064e8fd00: ffff881064e8fd20 ffff8820681aef40 
    ffff881064e8fd10: ffff8820681aee00 ffff880f6ce57b50 
    ffff881064e8fd20: ffff881064e8fd8c 0000000000000280 
    ffff881064e8fd30: ffff881064e8fd70 ffffffffa02f7d5e 
 #2 [ffff881064e8fd38] xlog_grant_head_check at ffffffffa02f7d5e [xfs]
...

So the log is at:

crash> struct log ffff8820681aee00
struct log {
  l_mp = 0xffff8820694fd800, 
  l_ailp = 0xffff882068b74cc0, 
  l_cilp = 0xffff882068b74dc0, 
  l_xbuf = 0xffff882065f41680, 
  l_targ = 0xffff8820680df9c0, 
...

And need_bytes is at:

crash> rd ffff881064e8fd8c
ffff881064e8fd8c:  c621b5e000000ab4                    ......!.

crash> pd 0xab4
$4 = 2740

So it needs 2740 bytes.

Comment 5 Lachlan McIlroy 2013-10-03 06:24:40 UTC
The function arguments to xlog_space_left are the log and &head->grant:

crash> xlog_grant_head.grant ffff8820681aef40
  grant = {
    counter = 0xfd0095c2f0
  }

head_cycle = 0xfd
head_bytes = 0x95c2f0

crash> struct log.l_tail_lsn ffff8820681aee00
  l_tail_lsn = {
    counter = 0xfc00004e4f
  }

tail_cycle = 0xfc
tail_block = 0x4e4f -> tail_bytes = 0x4e4f * 512 = 0x9c9e00

free_bytes = tail_bytes - head_bytes = 449296 bytes.

So there is plenty of free space.  Did the xfssyncd thread not get woken up?
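
(A small standalone sketch of the same free-space calculation; it assumes the packing used in the decode above: the grant head carries the cycle in the high 32 bits and a byte count in the low 32 bits, while the tail LSN carries the cycle in the high 32 bits and a 512-byte basic-block number in the low 32 bits. The values are copied from the vmcore output.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t grant    = 0xfd0095c2f0ULL;  /* xlog_grant_head.grant counter */
	uint64_t tail_lsn = 0xfc00004e4fULL;  /* log->l_tail_lsn counter */

	uint32_t head_cycle = grant >> 32;
	uint32_t head_bytes = (uint32_t)grant;
	uint32_t tail_cycle = tail_lsn >> 32;
	uint32_t tail_bytes = (uint32_t)tail_lsn * 512;  /* blocks -> bytes */

	/* the head is one cycle ahead of the tail, so the free space is the
	 * gap from the head back around to the tail */
	printf("head %x:%u tail %x:%u free %u bytes\n",
	       head_cycle, head_bytes, tail_cycle, tail_bytes,
	       tail_bytes - head_bytes);
	return 0;
}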

Comment 9 Lachlan McIlroy 2013-10-04 06:35:56 UTC
crash> set 1169
    PID: 1169
COMMAND: "xfsaild/dm-30"
   TASK: ffff8801f78f4aa0  [THREAD_INFO: ffff8801fb526000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE|TASK_WAKEKILL 

crash> bt
PID: 1169   TASK: ffff8801f78f4aa0  CPU: 1   COMMAND: "xfsaild/dm-30"
 #0 [ffff8801fb527d10] schedule at ffffffff8150db42
 #1 [ffff8801fb527dd8] schedule_timeout at ffffffff8150e982
 #2 [ffff8801fb527e88] xfsaild at ffffffffa0239933 [xfs]
 #3 [ffff8801fb527ee8] kthread at ffffffff81096936
 #4 [ffff8801fb527f48] kernel_thread at ffffffff8100c0ca

crash> task -R sched_info
PID: 1169   TASK: ffff8801f78f4aa0  CPU: 1   COMMAND: "xfsaild/dm-30"
  sched_info = {
    pcount = 77183915, 
    run_delay = 973329148430, 
    last_arrival = 3824697247325310, 
    last_queued = 0, 
    bkl_count = 0
  }, 

crash> runq | grep "CPU 1"
CPU 1 RUNQUEUE: ffff880028316700

crash> rq.clock ffff880028316700
  clock = 3824697253567538

crash> pd (3824697253567538-3824697247325310)
$15 = 6242228

crash> pd (3824697253567538-3824697247325310)/1000000000
$16 = 0

Okay, that's not stuck.

crash> dis -rl ffffffffa0239933
...
0xffffffffa0239923 <xfsaild+0x63>:      mov    -0x48(%rbp),%edi
0xffffffffa0239926 <xfsaild+0x66>:      callq  0xffffffff81074a20 <msecs_to_jiffies>
0xffffffffa023992b <xfsaild+0x6b>:      mov    %rax,%rdi
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/fs/xfs/linux-2.6/xfs_super.c: 803
0xffffffffa023992e <xfsaild+0x6e>:      callq  0xffffffff8150e7f0 <schedule_timeout>
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/arch/x86/include/asm/bitops.h: 359
0xffffffffa0239933 <xfsaild+0x73>:      mov    0x8(%r12),%rax

'tout' is in %rbp-0x48

crash> dis -rl ffffffff8150e982
/usr/src/debug/kernel-2.6.32-358.6.1.el6/linux-2.6.32-358.6.1.el6.x86_64/kernel/timer.c: 1357
0xffffffff8150e7f0 <schedule_timeout>:  push   %rbp
...

crash> bt -f
PID: 1169   TASK: ffff8801f78f4aa0  CPU: 1   COMMAND: "xfsaild/dm-30"
...
 #1 [ffff8801fb527dd8] schedule_timeout at ffffffff8150e982
    ffff8801fb527de0: 0000000000000084 0000000000000286 
    ffff8801fb527df0: ffff8801fc8e8ac0 ffff8801fbbbbdf0 
    ffff8801fb527e00: 00000001e40120aa ffffffff810810e0 
    ffff8801fb527e10: ffff8801f78f4aa0 ffff8801fc8e8000 
    ffff8801fb527e20: 0000000000000000 0000000000000000 
    ffff8801fb527e30: 0000000000000000 ffff8801ffffffff 
    ffff8801fb527e40: 00001846000022cc 0000000042b70c2d 
    ffff8801fb527e50: 0000000000000282 ffff8801f89ae8c0 
    ffff8801fb527e60: ffff8801f78f4aa0 ffff8801f78f4aa0 
    ffff8801fb527e70: ffff8801f78f4aa0 ffff8801fb527ea8 
    ffff8801fb527e80: ffff8801fb527ee0 ffffffffa0239933 
 #2 [ffff8801fb527e88] xfsaild at ffffffffa0239933 [xfs]
...

crash> px 0xffff8801fb527ee0-0x48
$11 = 0xffff8801fb527e98

crash> rd 0xffff8801fb527e98
ffff8801fb527e98:  000000000000000a                    ........

So tout is 10 (0xa).

crash> dis xfsaild
...
0xffffffffa02398df <xfsaild+31>:        lea    -0x38(%rbp),%r15
...
0xffffffffa023994f <xfsaild+143>:       mov    %r15,%rsi
0xffffffffa0239952 <xfsaild+146>:       mov    %rbx,%rdi
0xffffffffa0239955 <xfsaild+149>:       callq  0xffffffffa0225c70 <xfsaild_push>
0xffffffffa023995a <xfsaild+154>:       mov    %rax,-0x48(%rbp)

'last_pushed_lsn' is in %rbp-0x38

crash> px 0xffff8801fb527ee0-0x38
$17 = 0xffff8801fb527ea8

crash> rd 0xffff8801fb527ea8
ffff8801fb527ea8:  00001846000022cc                    ."..F...

last_cycle = 0x1846
last_block = 0x22cc -> last_bytes = 0x22cc * 512 = 0x459800

This is the same as log->l_tail_lsn, so xfsaild has pushed the tail up to that point and doesn't seem to need to push it any further.

Comment 10 Lachlan McIlroy 2013-10-04 07:05:30 UTC
crash> struct log.l_reserve_head ffff8801f7934c00
  l_reserve_head = {
    lock = {
      raw_lock = {
        slock = 3543192368
      }
    }, 
    waiters = {
      next = 0xffff88015e60e140, 
      prev = 0xffff88017ca1ed78
    }, 
    grant = {
      counter = 26693225856316
    }
  }

crash> list -H ffff8801f7934d48 -o xlog_ticket.t_queue -s xlog_ticket.t_task,t_curr_res
ffff88015e60e140
  t_task = 0xffff880062592aa0
  t_curr_res = 224820
ffff88017ca1e9e0
  t_task = 0xffff8801f78f5500
  t_curr_res = 2740
ffff8801408291f8
  t_task = 0xffff8801eb794ae0
  t_curr_res = 112144
ffff880135b3eb50
  t_task = 0xffff880146f48080
  t_curr_res = 2740
ffff8801325fecc0
  t_task = 0xffff8801f8ccc080
  t_curr_res = 162588
ffff88012a782368
  t_task = 0xffff88011d5daaa0
  t_curr_res = 162588
ffff88012a782cc0
  t_task = 0xffff88011154eae0
  t_curr_res = 2740
ffff8801408294d8
  t_task = 0xffff8801b38b4040
  t_curr_res = 2740
ffff88012749fc08
  t_task = 0xffff880111574040
  t_curr_res = 2740
ffff8801a5f0ce30
  t_task = 0xffff880110da0ae0
  t_curr_res = 2740
ffff88017ca1e648
  t_task = 0xffff88010e82c040
  t_curr_res = 2740
ffff88017ca1ed78
  t_task = 0xffff880090928ae0
  t_curr_res = 224820

Pick the first task off the queue:

crash> set 0xffff880062592aa0
    PID: 18869
COMMAND: "java"
   TASK: ffff880062592aa0  [THREAD_INFO: ffff8801f9182000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 

crash> bt
PID: 18869  TASK: ffff880062592aa0  CPU: 0   COMMAND: "java"
 #0 [ffff8801f9183b98] schedule at ffffffff8150db42
 #1 [ffff8801f9183c60] xlog_grant_head_wait at ffffffffa0216b99 [xfs]
 #2 [ffff8801f9183cc0] xlog_grant_head_check at ffffffffa0216d5e [xfs]
 #3 [ffff8801f9183d00] xfs_log_reserve at ffffffffa0216e50 [xfs]
 #4 [ffff8801f9183d40] xfs_trans_reserve at ffffffffa02249d7 [xfs]
 #5 [ffff8801f9183da0] xfs_inactive at ffffffffa022a2d3 [xfs]
 #6 [ffff8801f9183df0] xfs_fs_clear_inode at ffffffffa0237930 [xfs]
 #7 [ffff8801f9183e10] clear_inode at ffffffff8119cfac
 #8 [ffff8801f9183e30] generic_delete_inode at ffffffff8119d766
 #9 [ffff8801f9183e60] generic_drop_inode at ffffffff8119d805
#10 [ffff8801f9183e80] iput at ffffffff8119c652
#11 [ffff8801f9183ea0] do_unlinkat at ffffffff81191eb4
#12 [ffff8801f9183f70] sys_unlink at ffffffff81191f56
#13 [ffff8801f9183f80] system_call_fastpath at ffffffff8100b072
    RIP: 00000034240dc7d7  RSP: 00007f2881a796b8  RFLAGS: 00210246
    RAX: 0000000000000057  RBX: ffffffff8100b072  RCX: 0000000003cf5848
    RDX: 00007f28839e8580  RSI: 0000000000000000  RDI: 00007f28839e8580
    RBP: 00007f2881a79600   R8: 00007f28839e8580   R9: 000000001bf6e2a0
    R10: 00007f288489db60  R11: 0000000000200202  R12: ffffffff81191f56
    R13: ffff8801f9183f78  R14: 00007f28848a13c0  R15: 0000000000000000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b

crash> bt -f
PID: 18869  TASK: ffff880062592aa0  CPU: 0   COMMAND: "java"
...
 #1 [ffff8801f9183c60] xlog_grant_head_wait at ffffffffa0216b99 [xfs]
    ffff8801f9183c68: ffff8801fb072b40 00000000000359b8 
    ffff8801f9183c78: ffff8801f7934d58 ffff880062592aa0 
    ffff8801f9183c88: ffff8801f9183ca8 ffff8801f7934d40 
    ffff8801f9183c98: ffff8801f7934c00 ffff88015e60e140 
    ffff8801f9183ca8: ffff8801f9183d14 00000000000359b8 
    ffff8801f9183cb8: ffff8801f9183cf8 ffffffffa0216d5e 
 #2 [ffff8801f9183cc0] xlog_grant_head_check at ffffffffa0216d5e [xfs]
...

crash> rd ffff8801f9183d14
ffff8801f9183d14:  1d4b5ea00006dc68                    h....^K.

crash> pd 0x6dc68
$25 = 449640

Ah, this task wants 449640 bytes and there is only 449220 free in the log.  All the other tasks blocked in xlog_grant_head_wait() are waiting for this one.

Comment 11 Dave Chinner 2013-10-04 07:46:06 UTC
Lachlan,

Your analysis matches what I found late yesterday. A problem with my laptop meant it took all day to get back online to post the information.

I'd also started walking the AIL to find out what is pinning the tail of the log - it is an inode of no particular note:

p /x *(struct xfs_inode_log_item *)0xffff880d246436d8
$16 = {
  ili_item = {
    li_ail = {
      next = 0xffff881fe4a16518, 
      prev = 0xffff882068b74cc8
    }, 
    li_lsn = 0xfc00004e4f, 
    li_desc = 0x0, 
    li_mountp = 0xffff8820694fd800, 
    li_ailp = 0xffff882068b74cc0, 
    li_type = 0x123b, 
    li_flags = 0x1, 
    li_bio_list = 0x0, 
    li_cb = 0xffffffffa02f27b0, 
    li_ops = 0xffffffffa0339d60, 
    li_cil = {
      next = 0xffff880d24643728, 
      prev = 0xffff880d24643728
    }, 
    li_lv = 0x0, 
    li_seq = 0xbf5
  }, 
  ili_inode = 0xffff880ae69e4000, 
  ili_flush_lsn = 0xfc00004e4f, 
  ili_last_lsn = 0xc3d, 
  ili_lock_flags = 0x0, 
  ili_logged = 0x1, 
  ili_last_fields = 0x5, 
  ili_fields = 0x0, 
  ili_extents_buf = 0x0, 
  ili_aextents_buf = 0x0, 
  ili_format = {
    ilf_type = 0x123b, 
    ilf_size = 0x3, 
    ilf_fields = 0x5, 
    ilf_asize = 0x2c, 
    ilf_dsize = 0x10, 
    ilf_ino = 0x436029, 
    ilf_u = {
      ilfu_rdev = 0x0, 
      ilfu_uuid = {
        __u_bits = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
      }
    }, 
    ilf_blkno = 0x436020, 
    ilf_len = 0x10, 
    ilf_boffset = 0x1200
  }
}

The only thing really interesting about this inode is the ili_flush_lsn, which matches the log tail_lsn. That implies the inode has been flushed to its backing buffer, but the backing buffer has not yet been flushed. That ili_logged = 1 and ili_fields = 0 strongly implies that the inode has been flushed since the last time it was logged.

Anyway, the next thing to look at is the inode itself to determine if the flush lock is held - if it is held, then the inode is waiting for buffer IO completion to occur, which will remove the inode from the AIL and free up log space. In that case, we need to find the buffer containing the inode.

If the flush lock is not held, then this inode should not be in the AIL....

I started looking for the inode buffer in the cache, but at this point my laptop battery died and wouldn't take charge, and I've only just got it back, so I have no further information to add yet.

-Dave.

Comment 13 Dave Chinner 2013-10-04 08:25:35 UTC
The inode flush lock:

p /x *(struct xfs_inode *)0xffff880ae69e4000
$2 = {
....
  i_flush = {
    done = 0x0, 
    wait = {
      lock = {
        raw_lock = {
          slock = 0x1a001a
        }
      }, 
      task_list = {
        next = 0xffff880ae69e40d8, 
        prev = 0xffff880ae69e40d8
      }
    }
  },
....

indicates that it is currently held (done = 0). So now I need to find the buffer that it's been written to....

-Dave.

Comment 14 Dave Chinner 2013-10-04 09:32:39 UTC
Ok, I just walked AG 1's buffer cache rbtree index to find the buffer:

p *(struct radix_tree_node *)0xffff88206744b6e0
$29 = {
  height = 1, 
  count = 4, 
  rcu_head = {
    next = 0xffff88206744cb48, 
    func = 0xffffffff8127c5c0 <radix_tree_node_rcu_free>
  }, 
  slots = {0xffff8820680df780, 0xffff8820680df6c0, 0xffff8820680df600, 0xffff8820680df540,

crash> p *((struct xfs_perag *)0xffff8820680df540)->pag_buf_tree.rb_node
$36 = {
  rb_parent_color = 1, 
  rb_right = 0xffff882043310e00, 
  rb_left = 0xffff882057028080
}
crash> p /x *(struct xfs_buf *)0xffff882043310e00 |grep b_bn
  b_bn = 0xc3c3f0, 
crash> p *((struct xfs_perag *)0xffff8820680df6c0)->pag_buf_tree.rb_node
$38 = {
  rb_parent_color = 1, 
  rb_right = 0xffff881045670980, 
  rb_left = 0xffff8810638106c0
}
crash> p /x *(struct xfs_buf *)0xffff881045670980 |grep b_bn
  b_bn = 0x415098, 
crash> p /x *(struct xfs_buf *)0xffff881045670980 |grep right
    rb_right = 0xffff881eda7bf540, 
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep b_bn
  b_bn = 0x44c798, 
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep right
    rb_right = 0xffff881cbea63200, 
crash> p /x *(struct xfs_buf *)0xffff881eda7bf540 |grep left
    rb_left = 0xffff880ccd8ff500
crash> p /x *(struct xfs_buf *)0xffff880ccd8ff500 |grep b_bn
  b_bn = 0x41e700, 
crash> p /x *(struct xfs_buf *)0xffff880ccd8ff500 |grep right
    rb_right = 0xffff88101f3bd3c0, 
crash> p /x *(struct xfs_buf *)0xffff88101f3bd3c0 |grep b_bn
  b_bn = 0x41f7b0, 
crash> p /x *(struct xfs_buf *)0xffff88101f3bd3c0 |grep right
    rb_right = 0xffff880d30db3e00, 
crash> p /x *(struct xfs_buf *)0xffff880d30db3e00 |grep b_bn
  b_bn = 0x41ff80, 
crash> p /x *(struct xfs_buf *)0xffff880d30db3e00 |grep right
    rb_right = 0xffff880ae4962c80, 
crash> p /x *(struct xfs_buf *)0xffff880ae4962c80 |grep b_bn
  b_bn = 0x434070, 
crash> p /x *(struct xfs_buf *)0xffff880ae4962c80 |grep right
    rb_right = 0xffff880d2458ee00, 
crash> p /x *(struct xfs_buf *)0xffff880d2458ee00 |grep b_bn
  b_bn = 0x435e60, 
crash> p /x *(struct xfs_buf *)0xffff880d2458ee00 |grep right
    rb_right = 0xffff880d2464d6c0, 
crash> p /x *(struct xfs_buf *)0xffff880d2464d6c0 |grep b_bn
  b_bn = 0x436060, 
crash> p /x *(struct xfs_buf *)0xffff880d2464d6c0 |grep left
    rb_left = 0xffff881ed62c2200
crash> p /x *(struct xfs_buf *)0xffff881ed62c2200 |grep b_bn
  b_bn = 0x435ea8, 
crash> p /x *(struct xfs_buf *)0xffff881ed62c2200 |grep right
    rb_right = 0xffff880d24624800,
crash> p /x *(struct xfs_buf *)0xffff880d24624800 |grep b_bn
  b_bn = 0x436010, 
crash> p /x *(struct xfs_buf *)0xffff880d24624800 |grep right
    rb_right = 0xffff880d24624680, 
crash> p /x *(struct xfs_buf *)0xffff880d24624680 |grep b_bn
  b_bn = 0x436020, 
crash> p /x *(struct xfs_buf *)0xffff880d24624680
$61 = {
  b_rbnode = {
    rb_parent_color = 0xffff880d24624801, 
    rb_right = 0x0, 
    rb_left = 0x0
  }, 
  b_file_offset = 0x86c04000, 
  b_buffer_length = 0x2000, 
  b_hold = {
    counter = 0x3
  }, 
  b_lru_ref = {
    counter = 0x2
  }, 
  b_flags = 0x240070, 

#define XBF_ASYNC        (1 << 4) /* initiator will not wait for completion */
#define XBF_DONE         (1 << 5) /* all pages in the buffer uptodate */
#define XBF_DELWRI       (1 << 6) /* buffer has dirty pages */
#define _XBF_PAGES       (1 << 18)/* backed by refcounted pages */
#define _XBF_DELWRI_Q    (1 << 21)/* buffer on delwri queue */
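
(Decoding b_flags = 0x240070 against those definitions, as a small standalone sketch using only the values shown above:)

#include <stdio.h>

/* flag values copied from the xfs_buf.h excerpt above */
#define XBF_ASYNC      (1 << 4)
#define XBF_DONE       (1 << 5)
#define XBF_DELWRI     (1 << 6)
#define _XBF_PAGES     (1 << 18)
#define _XBF_DELWRI_Q  (1 << 21)

int main(void)
{
	unsigned int b_flags = 0x240070;  /* from the xfs_buf dump above */

	printf("ASYNC=%d DONE=%d DELWRI=%d PAGES=%d DELWRI_Q=%d other=0x%x\n",
	       !!(b_flags & XBF_ASYNC), !!(b_flags & XBF_DONE),
	       !!(b_flags & XBF_DELWRI), !!(b_flags & _XBF_PAGES),
	       !!(b_flags & _XBF_DELWRI_Q),
	       b_flags & ~(XBF_ASYNC | XBF_DONE | XBF_DELWRI |
	                   _XBF_PAGES | _XBF_DELWRI_Q));
	return 0;
}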


And it's as I suspected. The inode buffer is on a delayed write queue, and has not been written to disk yet. The callback is xfs_buf_iodone_callbacks(), and
the xfs_buf_log_item is:

p /x *(struct xfs_buf_log_item *)0xffff8801097d95b8
$64 = {
  bli_item = {
    li_ail = {
      next = 0xffff881d3fc1f518, 
      prev = 0xffff881d3fc1f5f8
    }, 
    li_lsn = 0xfd000030b7, 
    li_desc = 0x0, 
    li_mountp = 0xffff8820694fd800, 
    li_ailp = 0xffff882068b74cc0, 
    li_type = 0x123c, 
    li_flags = 0x1, 
    li_bio_list = 0xffff881d3fc1f518, 
    li_cb = 0xffffffffa02d90a0, 
    li_ops = 0xffffffffa0339c20, 
    li_cil = { 
      next = 0xffff8801097d9608, 
      prev = 0xffff8801097d9608
    }, 
    li_lv = 0x0, 
    li_seq = 0xc42
  }, 
  bli_buf = 0xffff880d24624680,
  bli_flags = 0x2, 
  bli_recur = 0x1, 
  bli_refcount = {
    counter = 0x1
  }, 
  bli_format = { 
    blf_type = 0x123c, 
    blf_size = 0xe, 
    blf_flags = 0x1, 
    blf_len = 0x10, 
    blf_blkno = 0x436020, 
    blf_map_size = 0x3, 
    blf_data_map = {0x11111111}
  }
}

here's the interesting bit:

 li_bio_list = 0xffff881d3fc1f518

That points to the second inode in the AIL - another inode in the same cluster as the one at the tail above. And if I walk the li_bio_list, I find the inode log item for the inode at the tail of the log that I printed above.

So, that confirms that we have an inode buffer that has not been written to disk, but it is queued for IO (i.e. on the delwri queue). Next step is to work out why it hasn't been flushed.

-Dave.

Comment 15 Dave Chinner 2013-10-04 09:53:10 UTC
crash> p jiffies
jiffies = $80 = 4401938293
crash> p *(struct xfs_buf *)0xffff880d24624680 |grep queuetime
  b_queuetime = 4393177144,
crash> p 4401938293-4393177144
$82 = 8761149

so it was queued 8761149 jiffies ago.

crash> p xfs_params.xfs_buf_age.val
$83 = 1500

So the buffer flush age hasn't been fiddled with. The xfsbufd is not stuck, so that means the buffer must be locked or pinned and the delwri code is skipping it. The b_sema count is 1, which means the buffer is unlocked. However, the pin count is 1, which means the buffer is pinned and cannot be flushed until the log is forced. And that's the reason for the deadlock.
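
(Putting the queue time and the flush age on the same scale; a quick sketch assuming HZ=1000 on this kernel, so one jiffy is a millisecond, and that xfs_buf_age is the fs.xfs.age_buffer_centisecs value, i.e. in centiseconds:)

#include <stdio.h>

int main(void)
{
	unsigned long queued_jiffies = 8761149; /* jiffies - b_queuetime, from above */
	unsigned long buf_age_cs     = 1500;    /* xfs_params.xfs_buf_age.val */

	printf("buffer queued for ~%lu s; the delwri flush age is only %lu s\n",
	       queued_jiffies / 1000, buf_age_cs / 100);
	return 0;
}

That is roughly the same ~145 minutes that xfssyncd has been stuck (comment 4), which fits the picture.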

So, short story is:
- inode pins the tail of the log
- inode flushed to backing buffer
- backing buffer marked for write
- xfsbufd can't flush the buffer because the buffer is pinned
- xfssyncd can't force the log because it is stuck waiting for log space

Time to think about a solution....
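
(Purely as an illustration of the circular wait described above - my own paraphrase, not XFS code - a toy wait-for loop over those five steps:)

#include <stdio.h>

/* Hypothetical model of the wait-for chain; the strings are labels for this
 * sketch only, not kernel symbols. Index i waits on waits_for[i]. */
int main(void)
{
	const char *who[] = {
		"xfssyncd: needs log space",
		"log space: needs the AIL tail to move past the pinned inode",
		"inode at the tail: flushed, waiting on its backing buffer",
		"inode buffer: on the delwri queue but pinned, needs a log force",
		"log force: would normally come from xfssyncd",
	};
	int waits_for[] = { 1, 2, 3, 4, 0 };
	int i = 0, steps;

	for (steps = 0; steps < 5; steps++) {
		printf("%s\n", who[i]);
		i = waits_for[i];
	}
	printf("-> back to \"%s\": circular wait, i.e. deadlock\n", who[i]);
	return 0;
}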

-Dave.

Comment 19 Mark Goodwin 2013-10-09 01:38:11 UTC
Looks like Dave posted a patch for review
http://oss.sgi.com/pipermail/xfs/2013-October/030641.html

Comment 22 Dave Chinner 2013-10-09 03:15:31 UTC
There is no workaround for the problem. The log size cannot be changed after the filesystem has been made, and logbsize does not affect the amount of space transactions use. Even if you increase the size of the log, that doesn't prevent the problem from occurring, and probably won't even change the chances of hitting the deadlock significantly if the workload is sustained for any period of time....

Comment 24 RHEL Program Management 2013-10-10 10:32:12 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 31 Rafael Aquini 2013-10-15 23:45:13 UTC
Patch(es) available on kernel-2.6.32-424.el6

Comment 36 Rafael Aquini 2013-11-04 11:42:56 UTC
Patch(es)

Comment 42 errata-xmlrpc 2013-11-21 20:32:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1645.html