Bug 451819

Summary: process hangs in async direct IO / possible race between dio_bio_end_aio() and dio_await_one() ?
Product: Red Hat Enterprise Linux 4
Reporter: Issue Tracker <tao>
Component: kernel
Assignee: Jeff Moyer <jmoyer>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Priority: urgent
Version: 4.6
CC: bmr, john.daley, jplans, rlerch, shantikatta, sputhenp, tao
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
Previously, in the kernel, a race condition could be encountered between dio_bio_end_aio() and dio_await_one(). This could lead to a situation where a process performing asynchronous direct I/O was left waiting indefinitely on an I/O that had already completed. With this update, the reference counting operations are locked so that the submission and completion paths see a unified state, which resolves this issue.
Last Closed: 2009-05-18 19:08:11 UTC
Bug Blocks: 391511, 455750, 456083, 458752, 461297
Attachments:
  dio: lock refcount operations

Comment 2 Ulrich Obergfell 2008-06-17 16:42:12 UTC
analysis pertaining to 1822328-vmcore
-------------------------------------


The "oracle" process PID 8960 is executing an async direct I/O request (write)
to a file in /data1. The process has incremented the "t_updates" count in the
"running transaction" of /data1 during the following flow of execution ...


  ext3_direct_IO
    ext3_journal_start
      ext3_journal_start_sb
        journal = EXT3_SB(sb)->s_journal

        journal_start
          start_this_handle
            transaction = journal->j_running_transaction
            transaction->t_updates++

    blockdev_direct_IO


The "kjournald" PID 30348 associated with /data1 (journal->j_task) is blocked
in journal_commit_transaction() because the "t_updates" count is not zero ...


  journal_commit_transaction
    commit_transaction = journal->j_running_transaction
    commit_transaction->t_state = T_LOCKED

    while (commit_transaction->t_updates)
           prepare_to_wait(&journal->j_wait_updates, ...,
                            TASK_UNINTERRUPTIBLE)
           if (commit_transaction->t_updates)
               schedule


"kjournald" locked the "running transaction" before it blocked. Therefore, many
other processes (191 "oracle" and one "ls") are waiting on that lock ...


  journal_start
    start_this_handle
      transaction = journal->j_running_transaction
      if (transaction->t_state == T_LOCKED)
          prepare_to_wait(&journal->j_wait_transaction_locked, ...,
                           TASK_UNINTERRUPTIBLE)
          schedule


So, there is a tree of dependencies ...


  waiting on      oracle      oracle     ...     ls       ...    oracle
   T_LOCKED      PID 25604   PID 25542        PID 24278         PID 22651
                     |           |                |                 |
                     +-----------+--...---+--...--+-------...-------+
                                          |
  waiting on                          kjournald
   t_updates                          PID 30348
                                          |
                                        oracle
                                       PID 8960


The above tree indicates that the progress of transactions in /data1 depends
on "oracle" PID 8960. Hence, we need to take a closer look at that process.
After returning from blockdev_direct_IO(), the process would decrement the
"t_updates" count in the "running transaction" of /data1 and it would also
wake up "kjournald". "kjournald" could then proceed and eventually unlock
the transaction ...


  ext3_direct_IO
    blockdev_direct_IO

    __ext3_journal_stop
      journal_stop
        transaction->t_updates--
        if (!transaction->t_updates)
            wake_up(&journal->j_wait_updates)  // wake up "kjournald"


However, blockdev_direct_IO() does not return because the process is blocked
indefinitely in the following code path ...


    blockdev_direct_IO
      __blockdev_direct_IO
        direct_io_worker
          dio_await_completion
            dio_await_one
              set_current_state(TASK_UNINTERRUPTIBLE)
              io_schedule


Even though the process is executing an *asynchronous* request, it is waiting
on the completion of an I/O. This is because the process is writing to a hole
in a sparse file as illustrated in the following picture ...


     vaddr    address space             file data  offset   fs block
             +-------------+           +---------+
             |             |           |         |
             |             |           |         |
             | I/O buffer  |           |         | b00000   01290c27
  2a974a3e00 |/////////////|           |/////////| b00200
  2a974a4000 |/////////////|           |/////////| b00400
             |/////////////|           |/////////| b00600
             |/////////////|           |/////////| b00800
             |/////////////|           |/////////| b00a00
             |/////////////|           |/////////| b00c00
             |/////////////|           |/////////| b00e00
             |/////////////|-- write ->|/////////| b01000   00000000 (hole)
             |/////////////|           |/////////| b01200
  2a974a5000 |/////////////|           |/////////| b01400
             |/////////////|           |/////////| b01600
             |/////////////|           |/////////| b01800
             |/////////////|           |/////////| b01a00
             |/////////////|           |/////////| b01c00
             |/////////////|           |/////////| b01e00
             |/////////////|           |/////////| b02000   00000000 (hole)
             |      :      |           |    :    |             :
             |      :      |           |    :    |             :
             |      :      |           |    :    |             :
             |/////////////|           |/////////| c00000   00000000 (hole)
  2a975a3e00 |/////////////|           |/////////| c00200
             |             |           |         | c00400
             |             |           |         | c00600
             |             |           |         | c00800
             |             |           |         | c00c00
             |             |           |         | c00e00
             |             |           |         |
             |             |           |         |
             +-------------+           +---------+


Writing to a hole in a sparse file involves the following code in do_direct_IO()
and direct_io_worker() ...


  do_direct_IO(...)
  {
          ...
  do_holes:
                          /* Handle holes */
                          if (!buffer_mapped(map_bh)) {
                                  ...
                                  if (dio->rw == WRITE) {
                                          ...
                                          return -ENOTBLK;
                                  }
          ...
  }

  direct_io_worker(...)
  {
          ...
          for (seg = 0; seg < nr_segs; seg++) {
                  ...
                  ret = do_direct_IO(dio);
                  ...
                  if (ret) {
                          ...
                          break;
                  }
          }
          ...
          if (ret == -ENOTBLK && rw == WRITE) {
                  ...
                  ret = 0;
          }
          ...
          if (dio->bio)
                dio_bio_submit(dio);

          /* All IO is now issued, send it on its way */
          blk_run_address_space(inode->i_mapping);
          ...

          Note: dio->is_async is true, ret is zero, dio->result is non-zero.
                However, dio->result is not equal to dio->size, because only
                the buffer content from 2a974a3e00 to 2a974a4bff is being
                written to file offset b00200 to b00fff (a "partial write").
                dio->result is 0xe00, dio->size is 0x100000. ret remains
                *unchanged*.

          if (dio->is_async && ret == 0 && dio->result &&
              ((rw & READ) || (dio->result == dio->size)))
                  ret = -EIOCBQUEUED;

          Note: In this case ret is still zero, so dio_await_completion() is called.

          if (ret != -EIOCBQUEUED)
                  dio_await_completion(dio);
          ...
}


Since dio_await_completion() is called in this case, there appears to be an
opportunity for a race between dio_await_one() and dio_bio_end_aio() in which
a process can go to sleep waiting for the completion of an I/O even though
that I/O has already completed ...


  static int wait_for_more_bios(struct dio *dio)
  {
          ...
          return (atomic_read(&dio->refcount) > 1) && (dio->bio_list == NULL);
  }

  static struct bio *dio_await_one(struct dio *dio)
  {
          ...
          unsigned long flags;
          ...
          spin_lock_irqsave(&dio->bio_lock, flags);
          while (wait_for_more_bios(dio)) {
                  set_current_state(TASK_UNINTERRUPTIBLE);
                  if (wait_for_more_bios(dio)) {
                          dio->waiter = current;
                          spin_unlock_irqrestore(&dio->bio_lock, flags);
                          io_schedule();
                          spin_lock_irqsave(&dio->bio_lock, flags);
                          dio->waiter = NULL;
                  }
                  set_current_state(TASK_RUNNING);
          }
          ...
  }

  static int dio_bio_end_aio(struct bio *bio,unsigned int bytes_done,int error)
  {
          struct dio *dio = bio->bi_private;
          int waiter_holds_ref = 0;
          int remaining;
          ...

          Note: dio->waiter is accessed *without* protection via dio->bio_lock

          waiter_holds_ref = !!dio->waiter;
          remaining = atomic_sub_return(1, (&dio->refcount));
          if (remaining == 1 && waiter_holds_ref)
                  wake_up_process(dio->waiter);
          ...
  }


The following picture illustrates a race scenario which appears to lead to the
problem ...


  dio_await_one() on CPU X             |  dio_bio_end_aio() on CPU Y
---------------------------------------+---------------------------------------
                                       |
  spin_lock_irqsave                    |
    (&dio->bio_lock, flags);           |
                                       |
  Note: dio->refcount still > 1        |  Note: dio->waiter still NULL
                                       |
  if (wait_for_more_bios(dio)) {       |  waiter_holds_ref = !!dio->waiter;
                                       |
      dio->waiter = current;           |  remaining = atomic_sub_return(1,
                                       |                 (&dio->refcount));
      spin_unlock_irqrestore           |
        (&dio->bio_lock, flags);       |  Note: remaining is 1 but
                                       |        waiter_holds_ref is 0
      io_schedule();                   |
      ^^^^^^^^^^^                      |  if (remaining == 1 &&
      wait indefinitely                |      waiter_holds_ref)
                                       |      wake_up_process(dio->waiter);
                                       |      ^^^^^^^^^^^^^^^
                                       |      not called


dio->refcount is incremented by dio_bio_submit() before the I/O is issued and
it is decremented by dio_bio_end_aio() to indicate that the I/O has completed.
However, in case of the race shown above, dio_bio_end_aio() is not aware that
a process is about to wait for the I/O that is just being completed. So, the
process is not awakened and waits indefinitely on an already completed I/O.

The symptoms in the vmcore suggest that the race has actually occurred:

 -  The process is blocked even though dio->refcount is 1 (no ongoing I/O).

 -  None of the allocated entries in the "bio" slab cache is pointing to the
    dio structure of the blocked process. If an I/O was ongoing, one would
    expect to find a bio->bi_private pointing to the dio.
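

The failure mode is the classic "lost wakeup". The following user-space
analogue (illustrative only, not kernel code) shows the same pattern made
safe by performing the counter update and the waiter check under a single
lock:


  /* build with: cc -o lost_wakeup lost_wakeup.c -lpthread */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
  static int in_flight = 1;             /* analogue of dio->refcount - 1 */

  static void *completion(void *arg)    /* analogue of dio_bio_end_aio() */
  {
          (void)arg;
          pthread_mutex_lock(&lock);
          in_flight--;                  /* decrement under the lock, so the
                                           waiter either has not checked yet
                                           (and will see 0) or is already
                                           sleeping in pthread_cond_wait() */
          pthread_cond_signal(&done);
          pthread_mutex_unlock(&lock);
          return NULL;
  }

  int main(void)                        /* analogue of dio_await_one() */
  {
          pthread_t t;

          pthread_create(&t, NULL, completion, NULL);
          pthread_mutex_lock(&lock);
          while (in_flight)             /* analogue of wait_for_more_bios() */
                  pthread_cond_wait(&done, &lock);
          pthread_mutex_unlock(&lock);
          pthread_join(t, NULL);
          printf("I/O completed, no lost wakeup\n");
          return 0;
  }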


Interestingly, the upstream source of dio_bio_end_aio() uses the dio->bio_lock
to synchronize the access to dio->waiter. This is an example from 2.6.20 ...

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=blob;f=fs/direct-io.c;h=028699395e6c62aed13febdc6d849f1c2fed30cb;hb=HEAD


  static int dio_bio_end_aio(struct bio *bio,unsigned int bytes_done,int error)
  {
          struct dio *dio = bio->bi_private;
          unsigned long remaining;
          unsigned long flags;
          ...
          spin_lock_irqsave(&dio->bio_lock, flags);
          remaining = --dio->refcount;
          if (remaining == 1 && dio->waiter)
                  wake_up_process(dio->waiter);
          spin_unlock_irqrestore(&dio->bio_lock, flags);
          ...
  }
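

For completeness, the same tree also takes dio->bio_lock around the refcount
increment on the submission side. The following is a condensed sketch (see
the 2.6.20 link above for the exact code; the elisions and comments here are
mine):


  static void dio_bio_submit(struct dio *dio)
  {
          struct bio *bio = dio->bio;
          unsigned long flags;

          bio->bi_private = dio;
          spin_lock_irqsave(&dio->bio_lock, flags);
          dio->refcount++;               /* one reference per in-flight bio */
          spin_unlock_irqrestore(&dio->bio_lock, flags);
          ...
          submit_bio(dio->rw, bio);
          ...
  }


With the refcount operations and the accesses to dio->waiter all serialized
by dio->bio_lock, the completion path can no longer decrement the count
between the waiter's final wait_for_more_bios() check and its call to
io_schedule(), so the interleaving shown in the race picture above becomes
impossible.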


Comment 6 Jeff Moyer 2008-06-20 15:12:25 UTC
I'll take a look.

Comment 7 Jeff Moyer 2008-06-20 19:32:06 UTC
I'll backport:

commit 5eb6c7a2ab413dea1ee6c08dd58263a1c2c2efa3
Author: Zach Brown <zach.brown>
Date:   Sun Dec 10 02:21:07 2006 -0800

    [PATCH] dio: lock refcount operations

and

commit fcb82f8835c1d71b4fe5de1d9894f45370f80dab
Author: Zach Brown <zach.brown>
Date:   Tue Jul 3 15:28:55 2007 -0700

    dio: remove bogus refcounting BUG_ON

Thanks for the detailed problem report.

Comment 8 Jeff Moyer 2008-06-20 20:45:23 UTC
Created attachment 309947 [details]
dio: lock refcount operations

This is a patch combining and backporting the two aforementioned upstream
commits.

Comment 17 RHEL Program Management 2008-07-17 21:01:19 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 RHEL Program Management 2008-07-21 13:37:52 UTC
This bug has been copied as 4.7 z-stream (EUS) bug #456083 and must now be
resolved in the current update release; the blocker flag has been set.

Comment 21 Vivek Goyal 2008-07-22 20:26:29 UTC
Committed in 78.1.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 24 RHEL Program Management 2008-09-03 12:54:42 UTC
Updating PM score.

Comment 27 Robert Peterson 2008-12-10 16:56:59 UTC
*** Bug 467906 has been marked as a duplicate of this bug. ***

Comment 32 Jeff Moyer 2009-02-10 17:45:44 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
There is a race between dio_bio_end_aio() and dio_await_one() which can lead to a situation where a process performing asynchronous, direct I/O is left waiting indefinitely on an I/O that has already completed.  The solution is to lock the reference counting operations so that the submission and completion paths see a unified state.

Comment 36 Bryn M. Reeves 2009-02-20 17:00:10 UTC
re-post of analysis from comment #3

crash> sys
      KERNEL: /cores/20080425054953/work/vmlinux
    DUMPFILE: /cores/20080425054953/work/1822328-vmcore
        CPUS: 16
        DATE: Thu Apr 24 03:14:27 2008
      UPTIME: 03:18:57
LOAD AVERAGE: 268.20, 166.55, 80.32
       TASKS: 3527
    NODENAME: host.example.com
     RELEASE: 2.6.9-67.0.4.ELlargesmp
     VERSION: #1 SMP Fri Jan 18 05:10:01 EST 2008
     MACHINE: x86_64  (2800 Mhz)
      MEMORY: 31.5 GB
       PANIC: "Oops: 0002 [1] SMP " (check log for details)



crash> bt 8960
PID: 8960   TASK: 102edef7030       CPU: 4   COMMAND: "oracle"
 #0 [100b5397ac8] schedule at ffffffff8030e7e8
 #1 [100b5397ba0] io_schedule at ffffffff8030f15e
 #2 [100b5397bc0] __blockdev_direct_IO at ffffffff8019c9b2
 #3 [100b5397c70] ext3_direct_IO at ffffffffa011253a
 #4 [100b5397cf0] generic_file_direct_IO at ffffffff8015c8fa
 #5 [100b5397d30] generic_file_direct_write at ffffffff8015c980
 #6 [100b5397d80] __generic_file_aio_write_nolock at ffffffff8015ccbc
 #7 [100b5397e00] generic_file_aio_write_nolock at ffffffff8015cf9f
 #8 [100b5397e30] generic_file_aio_write at ffffffff8015d069
 #9 [100b5397e80] ext3_file_write at ffffffffa010df01
#10 [100b5397ea0] aio_pwrite at ffffffff8019e1ab
#11 [100b5397eb0] aio_run_iocb at ffffffff8019d53f
#12 [100b5397ed0] io_submit_one at ffffffff8019e560
#13 [100b5397f00] sys_io_submit at ffffffff8019e665
#14 [100b5397f80] tracesys at ffffffff801103ce (via system_call)
    RIP: 0000002a96f5f6a7  RSP: 0000007fbfff67b8  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff801103ce  RCX: ffffffffffffffff
    RDX: 0000007fbfff67c0  RSI: 0000000000000001  RDI: 0000002a971ce000
    RBP: 0000000006003750   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000007fbfffb3b8  R11: 0000000000000202  R12: 0000007fbfff8900
    R13: 0000000000000001  R14: 0000007fbfff8900  R15: 0000007fbfff88e0
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b



the process has been blocked for 7 minutes and 7 seconds ...


crash> px per_cpu__runqueues | grep "\[4\]"
  [4]: 104000616e0

crash> px ((struct runqueue *)0x104000616e0)->timestamp_last_tick
$1 = 0xac6c8e38db3

crash> px ((struct task_struct *)0x102edef7030)->last_ran
$2 = 0xa6346839bb6

crash> pd (0xac6c8e38db3 - 0xa6346839bb6) / 1000000000
$3 = 427 <- seconds blocked



the "running transaction" in /data1 ...


crash> px ((struct task_struct *)0x102edef7030)->journal_info
$4 = (void *) 0x10276615640

crash> px ((struct handle_s *)0x10276615640)->h_transaction
$5 = (transaction_t *) 0x103fb8952c0 <---------------------------------+
                                                                       |
                                                                       |
crash> mount | grep /data1                                             |
     10300025500      104000fe800 ext3   /dev/md10          /data1     |
                                                                       |
crash> px ((struct super_block *)0x104000fe800)->s_fs_info             |
$6 = (void *) 0x100d1d68000                                            |
                                                                       |
crash> px ((struct ext3_sb_info *)0x100d1d68000)->s_journal            |
$7 = (struct journal_s *) 0x1040016bc00                                |
                                                                       |
crash> px ((struct journal_s *)0x1040016bc00)->j_running_transaction   |
$8 = (transaction_t *) 0x103fb8952c0 <---------------------------------+



"kjournald" PID 30348 is waiting on "t_updates" ...


crash> pd ((transaction_t *)0x103fb8952c0)->t_updates
$9 = 1

crash> px &((struct journal_s *)0x1040016bc00)->j_wait_updates
$10 = (wait_queue_head_t *) 0x1040016bce0

crash> waitq 0x1040016bce0
PID: 30348  TASK: 101fa8727f0       CPU: 8   COMMAND: "kjournald"


crash> bt 30348
PID: 30348  TASK: 101fa8727f0       CPU: 8   COMMAND: "kjournald"
 #0 [100d1ca1ad8] schedule at ffffffff8030e7e8
 #1 [100d1ca1bb0] journal_commit_transaction at ffffffffa00fa872
 #2 [100d1ca1e80] kjournald at ffffffffa00fda5c
 #3 [100d1ca1f50] kernel_thread at ffffffff80110f47



192 processes are waiting on "T_LOCKED" ...


crash> pd ((transaction_t *)0x103fb8952c0)->t_state
$11 = T_LOCKED

crash> px &((struct journal_s *)0x1040016bc00)->j_wait_transaction_locked
$12 = (wait_queue_head_t *) 0x1040016bc68

crash> waitq 0x1040016bc68 | wc -l
192


crash> waitq 0x1040016bc68
PID: 25604  TASK: 10444f1f030       CPU: 4   COMMAND: "oracle"
PID: 25542  TASK: 10216e75030       CPU: 15  COMMAND: "oracle"
...
PID: 24278  TASK: 100120d1030       CPU: 1   COMMAND: "ls"
...
PID: 22651  TASK: 1044b3b67f0       CPU: 3   COMMAND: "oracle"


crash> bt 24278
PID: 24278  TASK: 100120d1030       CPU: 1   COMMAND: "ls"
 #0 [1000b047ca8] schedule at ffffffff8030e7e8
 #1 [1000b047d80] start_this_handle at ffffffffa00f8359
 #2 [1000b047e60] journal_start at ffffffffa00f84c6
 #3 [1000b047e80] ext3_dirty_inode at ffffffffa0112797
 #4 [1000b047ea0] __mark_inode_dirty at ffffffff80199b0c
 #5 [1000b047ed0] update_atime at ffffffff80192f5e
 #6 [1000b047f00] vfs_readdir at ffffffff8018c3a8
 #7 [1000b047f30] sys_getdents64 at ffffffff8018c784
 #8 [1000b047f80] tracesys at ffffffff801103ce (via system_call)
    RIP: 0000003369d8c68e  RSP: 0000007fbffff320  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff801103ce  RCX: ffffffffffffffff
    RDX: 0000000000001000  RSI: 000000000051a598  RDI: 0000000000000003
    RBP: 0000002a957836a0   R8: 0000003369f31ce8   R9: 0000000000000010
    R10: 0000007fbffff201  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000007fbffff320  R14: 000000000051a560  R15: 000000000051a540
    ORIG_RAX: 00000000000000d9  CS: 0033  SS: 002b



the copy of the user iocb on the kernel stack of "oracle" PID 8960 ...


crash> dis sys_io_submit
...
asmlinkage long sys_io_submit(aio_context_t ctx_id, long nr,
                              struct iocb __user * __user *iocbpp)
{
        ... 
        for (i=0; i<nr; i++) {
                ...
                struct iocb tmp;
                ...
                if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {

                                        // dl = sizeof(tmp)
0xffffffff8019e645 <sys_io_submit+149>: mov    $0x40,%dl
0xffffffff8019e647 <sys_io_submit+151>: mov    %r12,%rsi
                                        // rdi = rsp (&tmp)
0xffffffff8019e64a <sys_io_submit+154>: mov    %rsp,%rdi
                                        // call copy_from_user
0xffffffff8019e64d <sys_io_submit+157>: callq  0xffffffff801ee960
...


crash> bt -f 8960
PID: 8960   TASK: 102edef7030       CPU: 4   COMMAND: "oracle"
...
#13 [100b5397f00] sys_io_submit at ffffffff8019e665
                                                   -+
    100b5397f08: 0000002a97ec2ea0 0000000000000000  |
    100b5397f18: 0000001600000001 0000002a974a3e00  |  copy of
    100b5397f28: 0000000000100000 0000000000b00200  |  user iocb
    100b5397f38: 0000000000000000 0000000000000000  |
                                                   -+
    100b5397f48: 0000000000000001 0000000000000001
    100b5397f58: 0000007fbfff88e0 0000007fbfff8900
    100b5397f68: 0000000000000001 0000007fbfff8900
    100b5397f78: 0000000006003750 ffffffff801103ce
...


crash> px *(struct iocb *)0x100b5397f08
$13 = {
  aio_data = 0x2a97ec2ea0,
  aio_key = 0x0,
  aio_reserved1 = 0x0,
  aio_lio_opcode = 0x1 = IOCB_CMD_PWRITE,
  aio_reqprio = 0x0,
  aio_fildes = 0x16 = 22, file descriptor
  aio_buf = 0x2a974a3e00,  |
  aio_nbytes = 0x100000,   |
  aio_offset = 0xb00200,   |
  aio_reserved2 = 0x0,     |
  aio_reserved3 = 0x0      |
}                          |
                           v
crash> files 8960 | grep " 22 "
 22    1024869c880      1044fde6228      1056b3fcb18   REG 
/data1/oracle/archive/CC/1_59673_629046150.dbf
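

For reference, a request of this shape can be issued from user space with
libaio roughly as follows. This is an illustrative sketch only: the file
name is hypothetical, error handling is omitted, and the target range must
cross a hole in a sparse file to exercise the partial-write path described
in comment #2. The length (0x100000) and offset (0xb00200) match the iocb
above.


  /* build with: cc -o adio adio.c -laio */
  #define _GNU_SOURCE                    /* for O_DIRECT */
  #include <libaio.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
          io_context_t ctx = 0;
          struct iocb cb, *cbs[1] = { &cb };
          struct io_event ev;
          void *buf;
          int fd = open("/data1/testfile.dbf",         /* hypothetical */
                        O_WRONLY | O_CREAT | O_DIRECT, 0644);

          /* O_DIRECT requires an aligned buffer; 512 matches blkbits = 9 */
          posix_memalign(&buf, 512, 0x100000);
          memset(buf, 0, 0x100000);

          io_setup(1, &ctx);
          /* async 1 MB write at offset 0xb00200, as in the iocb above */
          io_prep_pwrite(&cb, fd, buf, 0x100000, 0xb00200);
          io_submit(ctx, 1, cbs);
          io_getevents(ctx, 1, 1, &ev, NULL);          /* reap completion */
          io_destroy(ctx);
          return 0;
  }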



the dio structure pertaining to the above iocb ...


crash> dis __blockdev_direct_IO
...
static struct bio *dio_await_one(struct dio *dio)
{
        unsigned long flags;
        struct bio *bio = NULL;

        spin_lock_irqsave(&dio->bio_lock, flags);
        while (wait_for_more_bios(dio)) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (wait_for_more_bios(dio)) {
                        dio->waiter = current;
                        spin_unlock_irqrestore(&dio->bio_lock, flags);
                        io_schedule();
...
                                                // rax = current
0xffffffff8019c98b <__blockdev_direct_IO+2780>: mov    %gs:0x0,%rax
                                                // rbx = rbp->bio_lock
0xffffffff8019c994 <__blockdev_direct_IO+2789>: lea    0x318(%rbp),%rbx
0xffffffff8019c99b <__blockdev_direct_IO+2796>: mov    %r12,%rsi
                                                // rbp->waiter = rax
0xffffffff8019c99e <__blockdev_direct_IO+2799>: mov    %rax,0x328(%rbp)
0xffffffff8019c9a5 <__blockdev_direct_IO+2806>: mov    %rbx,%rdi
                                                // call _spin_unlock_irqrestore
0xffffffff8019c9a8 <__blockdev_direct_IO+2809>: callq  0xffffffff8030f7ca
                                                // call io_schedule
0xffffffff8019c9ad <__blockdev_direct_IO+2814>: callq  0xffffffff8030f138
0xffffffff8019c9b2 <__blockdev_direct_IO+2819>: mov    %rbx,%rdi


crash> dis io_schedule 1                        // save rbp (dio)
0xffffffff8030f138 <io_schedule>:               push   %rbp


crash> bt -f 8960
PID: 8960   TASK: 102edef7030       CPU: 4   COMMAND: "oracle"
...
 #1 [100b5397ba0] io_schedule at ffffffff8030f15e
    100b5397ba8: 00000100083a8e80 000001010012bb18 
                 saved rbp (dio)
    100b5397bb8: 000001010012b800 ffffffff8019c9b2 
... 


crash> px *(struct dio *)0x1010012b800
$14 = {
  bio = 0x0, 
  inode = 0x1056b3fcb18, 
  rw = 0x1 = WRITE, 
  i_size = 0x11b49800, 
  lock_type = 0x1 = DIO_LOCKING,
  blkbits = 0x9, 
  blkfactor = 0x3, 
  start_zero_done = 0x1, 
  pages_in_io = 0x102, 
  size = 0x100000, 
  block_in_file = 0x5808, 
  blocks_available = 0x0, 
  final_block_in_request = 0x6001, 
  first_block_in_page = 0x7, 
  boundary = 0x0, 
  reap_counter = 0x1, 
  get_blocks = 0xffffffffa01114c8 <ext3_direct_io_get_blocks>, 
  end_io = 0, 
  final_block_in_bio = 0x9486140, 
  next_block_for_io = 0x9486140, 
  map_bh = {
    b_state = 0x0, 
    b_this_page = 0x5245504f2e533253, 
    b_page = 0x2c504f4320454341, 
    b_count = {
      counter = 0x53345320
    }, 
    b_size = 0x1000, 
    b_blocknr = 0x1290c27, 
    b_data = 0x204552454857204,
    b_bdev = 0x100db0e7cc0, 
    b_end_io = 0x2044495f414e491e, 
    b_private = 0x0, 
    b_assoc_buffers = {
      next = 0x492e4b5320444e41, 
      prev = 0x333431203d2044
    }
  }, 
  cur_page = 0x0, 
  cur_page_offset = 0x0, 
  cur_page_len = 0xc00, 
  cur_page_block = 0x948613a, 
  curr_page = 0x40, 
  total_pages = 0x101, 
  curr_user_address = 0x2a974e3e00, 
  pages = {0x10002cea048, 0x102ff4e21a0, 0x102ff4e21d8, 0x102ff4e2210,
           0x102ff4e2248, 0x102ff4e2280, 0x102ff4e22b8, 0x102ff4e22f0,
           0x102ff4e2328, 0x102ff4e2360, 0x102ff4e2398, 0x102ff4e23d0,
           0x102ff4e2408, 0x102ff4e2440, 0x102ff4e2478, 0x102ff4e24b0,
           0x102ff4e24e8, 0x102ff4e2520, 0x102ff4e2558, 0x102ff4e2590,
           0x102ff4e25c8, 0x102ff4e2600, 0x102ff4e2638, 0x102ff4e2670,
           0x102ff4e26a8, 0x102ff4e26e0, 0x102ff4e2718, 0x102ff4e2750,
           0x102ff4e2788, 0x102ff4e27c0, 0x102ff4e27f8, 0x102ff4e2830,
           0x102ff4e2868, 0x102ff4e28a0, 0x102ff4e28d8, 0x102ff4e2910,
           0x102ff4e2948, 0x102ff4e2980, 0x102ff4e29b8, 0x102ff4e29f0,
           0x102ff4e2a28, 0x102ff4e2a60, 0x102ff4e2a98, 0x102ff4e2ad0,
           0x102ff4e2b08, 0x102ff4e2b40, 0x102ff4e2b78, 0x102ff4e2bb0,
           0x102ff4e2be8, 0x102ff4e2c20, 0x102ff4e2c58, 0x102ff4e2c90,
           0x102ff4e2cc8, 0x102ff4e2d00, 0x102ff4e2d38, 0x102ff4e2d70,
           0x102ff4e2da8, 0x102ff4e2de0, 0x102ff4e2e18, 0x102ff4e2e50,
           0x102ff4e2e88, 0x102ff4e2ec0, 0x102ff4e2ef8, 0x102ff4e2f30},
  head = 0x40, 
  tail = 0x40, 
  page_errors = 0x0, 
  refcount = {
    counter = 0x1 <----------+  reference count is 1 (no ongoing I/O)
  },                         |
  bio_lock = {               |
    lock = 0x1,              |
    magic = 0xdead4ead       |
  },                         |
  bio_list = 0x0,            |
  waiter = 0x102edef7030, <--+  but waiter is not NULL
  iocb = 0x107b7a3c780,      |
  is_async = 0x1,            |
  io_error = 0x0,            |
  result = 0xe00             |
}                            |
                             |
crash> ps | grep 102edef7030 v
   8960      1   4     102edef7030    UN   0.1 10634640  32276  oracle

Comment 38 Bryn M. Reeves 2009-02-23 18:55:57 UTC
repost of analysis from comment #5:

details of direct I/O request
-----------------------------


     vaddr    address space             file data  offset    fs block
             +-------------+           +---------+
             |             |           |         |
             |             |           |         |
             | I/O buffer  |           |         | 3900000   1698f91e
  2a97ce3e00 |/////////////|           |/////////| 3900200
  2a97ce4000 |/////////////|           |/////////| 3900400
             |/////////////|           |/////////| 3900600
             |/////////////|           |/////////| 3900800
             |/////////////|           |/////////| 3900a00
             |/////////////|           |/////////| 3900c00
             |/////////////|           |/////////| 3900e00
             |/////////////|-- write ->|/////////| 3901000   00000000 (hole)
             |/////////////|           |/////////| 3901200
  2a97ce5000 |/////////////|           |/////////| 3901400
             |/////////////|           |/////////| 3901600
             |/////////////|           |/////////| 3901800
             |/////////////|           |/////////| 3901a00
             |/////////////|           |/////////| 3901c00
             |/////////////|           |/////////| 3901e00
             |/////////////|           |/////////| 3902000   00000000 (hole)
             |      :      |           |    :    |             :
             |      :      |           |    :    |             :
             |      :      |           |    :    |             :
             |/////////////|           |/////////| 3a00000   00000000 (hole)
  2a97be3e00 |/////////////|           |/////////| 3a00200
             |             |           |         | 3a00400
             |             |           |         | 3a00600
             |             |           |         | 3a00800
             |             |           |         | 3a00a00
             |             |           |         | 3a00c00
             |             |           |         |
             |             |           |         |
             +-------------+           +---------+



details from 1822328_vmcore_20080616
------------------------------------


crash> sys
      KERNEL: /cores/20080425054953/work/vmlinux
    DUMPFILE: /cores/20080617043355/work/1822328_vmcore_20080616
        CPUS: 16
        DATE: Mon Jun 16 15:19:16 2008
      UPTIME: 01:14:21
LOAD AVERAGE: 29.24, 34.39, 30.91
       TASKS: 275
    NODENAME: host.example.com
     RELEASE: 2.6.9-67.0.4.ELlargesmp
     VERSION: #1 SMP Fri Jan 18 05:10:01 EST 2008
     MACHINE: x86_64  (3000 Mhz)
      MEMORY: 16 GB
       PANIC: "Oops: 0002 [1] SMP " (check log for details)


crash> bt 21356
PID: 21356  TASK: 103f8f97030       CPU: 10  COMMAND: "oracle"
 #0 [100ba6ddac8] schedule at ffffffff8030e7e8
 #1 [100ba6ddba0] io_schedule at ffffffff8030f15e
 #2 [100ba6ddbc0] __blockdev_direct_IO at ffffffff8019c9b2
 #3 [100ba6ddc70] ext3_direct_IO at ffffffffa010d53a
 #4 [100ba6ddcf0] generic_file_direct_IO at ffffffff8015c8fa
 #5 [100ba6ddd30] generic_file_direct_write at ffffffff8015c980
 #6 [100ba6ddd80] __generic_file_aio_write_nolock at ffffffff8015ccbc
 #7 [100ba6dde00] generic_file_aio_write_nolock at ffffffff8015cf9f
 #8 [100ba6dde30] generic_file_aio_write at ffffffff8015d069
 #9 [100ba6dde80] ext3_file_write at ffffffffa0108f01
#10 [100ba6ddea0] aio_pwrite at ffffffff8019e1ab
#11 [100ba6ddeb0] aio_run_iocb at ffffffff8019d53f
#12 [100ba6dded0] io_submit_one at ffffffff8019e560
#13 [100ba6ddf00] sys_io_submit at ffffffff8019e665
#14 [100ba6ddf80] system_call at ffffffff8011026a
    RIP: 0000002a96f5f6a7  RSP: 0000007fbfff8890  RFLAGS: 00010246
    RAX: 00000000000000d1  RBX: ffffffff8011026a  RCX: 0000000000100000
    RDX: 0000007fbfff6890  RSI: 0000000000000001  RDI: 0000002a971ce000
    RBP: 0000000006003570   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000007fbfffb488  R11: 0000000000000202  R12: 0000007fbfff89d0
    R13: 0000000000000001  R14: 0000007fbfff89d0  R15: 0000007fbfff89b0
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b



the process has been blocked for 24 minutes and 52 seconds ...


crash> px per_cpu__runqueues | grep "\[10\]"
  [10]: 1040001d760

crash> px ((struct runqueue *)0x1040001d760)->timestamp_last_tick
$1 = 0x420917ddc2d

crash> px ((struct task_struct *)0x103f8f97030)->last_ran
$2 = 0x2c50250db37

crash> pd (0x420917ddc2d - 0x2c50250db37) / 1000000000
$3 = 1492 <- seconds blocked



the "running transaction" in /data ...


crash> px ((struct task_struct *)0x103f8f97030)->journal_info
$4 = (void *) 0x103f38461f0

crash> px ((struct handle_s *)0x103f38461f0)->h_transaction
$5 = (transaction_t *) 0x100012f11c0


crash> mount | grep /data
     10200042500      10037c5f800 ext3   /dev/md0                /data     

crash> px ((struct super_block *)0x10037c5f800)->s_fs_info
$6 = (void *) 0x101e8de0000

crash> px ((struct ext3_sb_info *)0x101e8de0000)->s_journal
$7 = (struct journal_s *) 0x10037d58c00

crash> px ((struct journal_s *)0x10037d58c00)->j_running_transaction
$8 = (transaction_t *) 0x100012f11c0



"kjournald" PID 17538 is waiting on "t_updates" ...


crash> pd ((transaction_t *)0x100012f11c0)->t_updates
$9 = 1

crash> px &((struct journal_s *)0x10037d58c00)->j_wait_updates
$10 = (wait_queue_head_t *) 0x10037d58ce0

crash> waitq 0x10037d58ce0
PID: 17538  TASK: 100dffb4030       CPU: 13  COMMAND: "kjournald"


crash> bt 17538
PID: 17538  TASK: 100dffb4030       CPU: 13  COMMAND: "kjournald"
 #0 [101f0891ad8] schedule at ffffffff8030e7e8
 #1 [101f0891bb0] journal_commit_transaction at ffffffffa00f5872
 #2 [101f0891e80] kjournald at ffffffffa00f8a5c
 #3 [101f0891f50] kernel_thread at ffffffff80110f47



24 processes are waiting on "T_LOCKED" ...


crash> pd ((transaction_t *)0x100012f11c0)->t_state
$11 = T_LOCKED

crash> px &((struct journal_s *)0x10037d58c00)->j_wait_transaction_locked
$12 = (wait_queue_head_t *) 0x10037d58c68

crash> waitq 0x10037d58c68 | wc -l
24


crash> waitq 0x10037d58c68
PID: 21578  TASK: 101eb004030       CPU: 7   COMMAND: "ls"
PID: 21329  TASK: 103f7782030       CPU: 5   COMMAND: "oracle"
PID: 196    TASK: 103f8e93030       CPU: 14  COMMAND: "pdflush"
PID: 21321  TASK: 101e7b077f0       CPU: 10  COMMAND: "oracle"
PID: 21337  TASK: 101f8561030       CPU: 4   COMMAND: "oracle"
PID: 21341  TASK: 103f4fbb030       CPU: 4   COMMAND: "oracle"
PID: 21315  TASK: 101f03cc030       CPU: 4   COMMAND: "oracle"
PID: 21319  TASK: 101f29877f0       CPU: 8   COMMAND: "oracle"
PID: 21391  TASK: 10183645030       CPU: 2   COMMAND: "iozone"
PID: 21386  TASK: 1018377f030       CPU: 7   COMMAND: "iozone"
PID: 21384  TASK: 1018372a030       CPU: 15  COMMAND: "iozone"
PID: 21387  TASK: 101837f97f0       CPU: 9   COMMAND: "iozone"
PID: 21397  TASK: 10183fce030       CPU: 12  COMMAND: "iozone"
PID: 21383  TASK: 1018372a7f0       CPU: 10  COMMAND: "iozone"
PID: 21392  TASK: 10183efd7f0       CPU: 1   COMMAND: "iozone"
PID: 21390  TASK: 101836457f0       CPU: 0   COMMAND: "iozone"
PID: 21393  TASK: 10183efd030       CPU: 8   COMMAND: "iozone"
PID: 21389  TASK: 10183e607f0       CPU: 14  COMMAND: "iozone"
PID: 21394  TASK: 10183f647f0       CPU: 3   COMMAND: "iozone"
PID: 21396  TASK: 10183fce7f0       CPU: 6   COMMAND: "iozone"
PID: 21382  TASK: 101f4cfe030       CPU: 11  COMMAND: "iozone"
PID: 21395  TASK: 10183f64030       CPU: 4   COMMAND: "iozone"
PID: 21388  TASK: 101837f9030       CPU: 5   COMMAND: "iozone"
PID: 21385  TASK: 1018377f7f0       CPU: 13  COMMAND: "iozone"


crash> bt 21578
PID: 21578  TASK: 101eb004030       CPU: 7   COMMAND: "ls"
 #0 [101aaea3ca8] schedule at ffffffff8030e7e8
 #1 [101aaea3d80] start_this_handle at ffffffffa00f3359
 #2 [101aaea3e60] journal_start at ffffffffa00f34c6
 #3 [101aaea3e80] ext3_dirty_inode at ffffffffa010d797
 #4 [101aaea3ea0] __mark_inode_dirty at ffffffff80199b0c
 #5 [101aaea3ed0] update_atime at ffffffff80192f5e
 #6 [101aaea3f00] vfs_readdir at ffffffff8018c3a8
 #7 [101aaea3f30] sys_getdents64 at ffffffff8018c784
 #8 [101aaea3f80] system_call at ffffffff8011026a
    RIP: 0000003369d8c68e  RSP: 0000007fbffff318  RFLAGS: 00010246
    RAX: 00000000000000d9  RBX: ffffffff8011026a  RCX: 0000000000000000
    RDX: 0000000000001000  RSI: 000000000051ce18  RDI: 0000000000000003
    RBP: 0000002a957836a0   R8: 0000003369f31ce8   R9: 0000000000000010
    R10: 0000007fbffff201  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000007fbffff310  R14: 000000000051cde0  R15: 000000000051cdc0
    ORIG_RAX: 00000000000000d9  CS: 0033  SS: 002b



the copy of the user iocb on the kernel stack of "oracle" PID 21356 ...


crash> bt -f 21356
PID: 21356  TASK: 103f8f97030       CPU: 10  COMMAND: "oracle"
...
#13 [100ba6ddf00] sys_io_submit at ffffffff8019e665
                                                   -+
    100ba6ddf08: 0000002a97ec39b0 0000000000000000  |
    100ba6ddf18: 0000001600000001 0000002a97ce3e00  |  copy of
    100ba6ddf28: 0000000000100000 0000000003900200  |  user iocb
    100ba6ddf38: 0000000000000000 0000000000000000  |
                                                   -+
    100ba6ddf48: 0000000000000000 0000000000000001 
    100ba6ddf58: 0000007fbfff89b0 0000007fbfff89d0 
    100ba6ddf68: 0000000000000001 0000007fbfff89d0 
    100ba6ddf78: 0000000006003570 ffffffff8011026a 
...


crash> px *(struct iocb *)0x100ba6ddf08
$13 = {
  aio_data = 0x2a97ec39b0, 
  aio_key = 0x0, 
  aio_reserved1 = 0x0, 
  aio_lio_opcode = 0x1 = IOCB_CMD_PWRITE, 
  aio_reqprio = 0x0, 
  aio_fildes = 0x16 = 22, file descriptor
  aio_buf = 0x2a97ce3e00,   |
  aio_nbytes = 0x100000,    |
  aio_offset = 0x3900200,   |
  aio_reserved2 = 0x0,      |
  aio_reserved3 = 0x0       |
}                           |
                            v
crash> files 21356 | grep " 22 "
 22    103f64a2980      10037cc6138      101d1a0a810   REG 
/data/oracle/archive/DBTEST/1_472_657217547.dbf



the dio structure pertaining to the above iocb ...


crash> bt -f 21356
PID: 21356  TASK: 103f8f97030       CPU: 10  COMMAND: "oracle"
...
 #1 [100ba6ddba0] io_schedule at ffffffff8030f15e
    100ba6ddba8: 00000101f83af210 00000101f3c8bb18 
                 saved rbp (dio)
    100ba6ddbb8: 00000101f3c8b800 ffffffff8019c9b2 
...


crash> px *(struct dio *)0x101f3c8b800
$7 = {
  bio = 0x0, 
  inode = 0x101d1a0a810, 
  rw = 0x1 = WRITE, 
  i_size = 0xbce0200, 
  lock_type = 0x1 = DIO_LOCKING, 
  blkbits = 0x9, 
  blkfactor = 0x3, 
  start_zero_done = 0x1, 
  pages_in_io = 0x102, 
  size = 0x100000, 
  block_in_file = 0x1c808, 
  blocks_available = 0x0, 
  final_block_in_request = 0x1d001, 
  first_block_in_page = 0x7, 
  boundary = 0x0, 
  reap_counter = 0x1, 
  get_blocks = 0xffffffffa010c4c8 <ext3_direct_io_get_blocks>, 
  end_io = 0, 
  final_block_in_bio = 0xb4c7c8f8, 
  next_block_for_io = 0xb4c7c8f8, 
  map_bh = {
    b_state = 0x0, 
    b_this_page = 0xfffffffffffffffc, 
    b_page = 0x5d2, 
    b_count = {
      counter = 0x2
    }, 
    b_size = 0x1000, 
    b_blocknr = 0x1698f91e, 
    b_data = 0x47b,
    b_bdev = 0x103f8f263c0, 
    b_end_io = 0xfffffffffffffffc, 
    b_private = 0x0, 
    b_assoc_buffers = {
      next = 0x9c00000002, 
      prev = 0xfffffffffffffffc
    }
  }, 
  cur_page = 0x0, 
  cur_page_offset = 0x0, 
  cur_page_len = 0xc00, 
  cur_page_block = 0xb4c7c8f2, 
  curr_page = 0x40, 
  total_pages = 0x101, 
  curr_user_address = 0x2a97d23e00, 
  pages = {0x101fb401cb8, 0x101fe55fcd8, 0x101fe55fd10, 0x101fe55fd48,
           0x101fe55fd80, 0x101fe55fdb8, 0x101fe55fdf0, 0x101fe55fe28,
           0x101fe5513e0, 0x101fe551418, 0x101fe551450, 0x101fe551488,
           0x101fe5514c0, 0x101fe5514f8, 0x101fe551530, 0x101fe551568,
           0x101fe5515a0, 0x101fe5515d8, 0x101fe551610, 0x101fe551648,
           0x101fe551680, 0x101fe5516b8, 0x101fe5516f0, 0x101fe551728,
           0x101fe551060, 0x101fe551098, 0x101fe5510d0, 0x101fe551108,
           0x101fe551140, 0x101fe551178, 0x101fe5511b0, 0x101fe5511e8,
           0x101fe551220, 0x101fe551258, 0x101fe551290, 0x101fe5512c8,
           0x101fe551a00, 0x101fe551a38, 0x101fe551a70, 0x101fe551aa8,
           0x101fe53f4e0, 0x101fe53f518, 0x101fe53f550, 0x101fe53f588,
           0x101fe53f5c0, 0x101fe53f5f8, 0x101fe53f630, 0x101fe53f668,
           0x101fe53f6a0, 0x101fe53f6d8, 0x101fe53f710, 0x101fe53f748,
           0x101fe53f780, 0x101fe53f7b8, 0x101fe53f7f0, 0x101fe53f828,
           0x101fe53f860, 0x101fe53f898, 0x101fe53f8d0, 0x101fe53f908,
           0x101fe53f940, 0x101fe53f978, 0x101fe53f9b0, 0x101fe53f9e8},
  head = 0x40, 
  tail = 0x40, 
  page_errors = 0x0, 
  refcount = {
    counter = 0x1 <----------+  reference count is 1 (no ongoing I/O)
  },                         |
  bio_lock = {               |
    lock = 0x1,              |
    magic = 0xdead4ead       |
  },                         |
  bio_list = 0x0,            |
  waiter = 0x103f8f97030, <--+  but waiter is not NULL
  iocb = 0x10198315480,      |
  is_async = 0x1,            |
  io_error = 0x0,            |
  result = 0xe00             |
}                            |
                             |
crash> ps | grep 103f8f97030 v
  21356      1  10     103f8f97030    UN   0.2 10634640  30764  oracle

Comment 45 Ryan Lerch 2009-04-09 01:35:12 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-There is a race between dio_bio_end_aio() and dio_await_one() which can lead to a situation where a process performing asynchronous, direct I/O is left waiting indefinitely on an I/O that has already completed.  The solution is to lock the reference counting operations so that the submission and completion paths see a unified state.
+Previously, in the kernel, a race condition could be encountered between dio_bio_end_aio() and dio_await_one(). This could lead to a situation where a process performing asynchronous direct I/O was left waiting indefinitely on an I/O that had already completed. With this update, the reference counting operations are locked so that the submission and completion paths see a unified state, which resolves this issue.

Comment 47 errata-xmlrpc 2009-05-18 19:08:11 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html