Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to the Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 546700

Summary: Deadlock in aio
Product: Red Hat Enterprise Linux 5
Reporter: Matt Cross <matt.cross>
Component: kernel
Assignee: Jeff Moyer <jmoyer>
Status: CLOSED ERRATA
QA Contact: Igor Zhang <yugzhang>
Severity: medium
Docs Contact:
Priority: low
Version: 5.4
CC: cward, jarod, yugzhang
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 587402 (view as bug list)
Environment:
Last Closed: 2011-01-13 20:57:32 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 587402
Attachments:
  test case to reproduce the problem (flags: none)
  do not call flush_workqueue from within a workqueue (flags: none)
  improved testcase (flags: none)
  Updated patch that fixes the logic inversion (flags: none)

Description Matt Cross 2009-12-11 18:08:07 UTC
Created attachment 377776 [details]
test case to reproduce the problem

Description of problem:

If many threads call io_destroy() with outstanding I/O, the aio subsystem can deadlock. The aio workqueue handler gets stuck in flush_workqueue() waiting for the workqueue_mutex, while another thread in io_destroy() is also in flush_workqueue(), holding the workqueue_mutex and waiting for the workqueue to complete. The attached code demonstrates the issue and a workaround; it usually fails within a few seconds.


Version-Release number of selected component (if applicable):

2.6.18-164.6.1

How reproducible:

Very

Steps to Reproduce:
1. Compile attached aio_test.c "cc -o aio_test aio_test.c -lpthread -laio"
2. Create a testfile for it to read: "dd if=/dev/zero of=/tmp/testfile bs=1M count=10"
3. Run test code like this: "./aio_test 0 100 /tmp/testfile"
  
Actual results:

In less than one minute, the test will hang.  There will be 100 threads of aio_test, most stuck waiting for aio completion or waiting for the 'workqueue_mutex'.  One will hold the workqueue_mutex waiting for the aio workqueue to flush, and an AIO workqueue thread will be stuck trying to acquire the 'workqueue_mutex'.

Stack traces of the two deadlocked threads:

crash> foreach aio/0 bt
PID: 265 TASK: ffff8100179d0800 CPU: 0 COMMAND: "aio/0"
 #0 [ffff810017c5bc70] schedule at ffffffff80066027
 #1 [ffff810017c5bd58] mutex_lock_nested at ffffffff8006729f
 #2 [ffff810017c5bdf8] __put_ioctx at ffffffff800facc9
 #3 [ffff810017c5be18] aio_fput_routine at ffffffff800fb259
 #4 [ffff810017c5be38] run_workqueue at ffffffff80050050
 #5 [ffff810017c5be78] worker_thread at ffffffff8004c933
 #6 [ffff810017c5bee8] kthread at ffffffff80034a6a
 #7 [ffff810017c5bf48] kernel_thread at ffffffff80061079

crash> bt 3602
PID: 3602 TASK: ffff81000d0c2240 CPU: 1 COMMAND: "aio_test"
 #0 [ffff81000d0c5df8] schedule at ffffffff80066027
 #1 [ffff81000d0c5ee0] flush_cpu_workqueue at ffffffff800a1c05
 #2 [ffff81000d0c5f30] flush_workqueue at ffffffff800a1caa
 #3 [ffff81000d0c5f50] __put_ioctx at ffffffff800facc9
 #4 [ffff81000d0c5f70] sys_io_destroy at ffffffff800fb18b
 #5 [ffff81000d0c5f80] tracesys at ffffffff800602a6 (via system_call)
    RIP: 00002b02b54e6637 RSP: 0000000078ffdf48 RFLAGS: 00000202
    RAX: ffffffffffffffda RBX: ffffffff800602a6 RCX: ffffffffffffffff
    RDX: 0000000000000002 RSI: 0000000000000000 RDI: 00002aaaaab03000
    RBP: 0000000078ffe130 R8: 000000365ab50064 R9: 000000365ab500e0
    R10: 0000000078ffe9d0 R11: 0000000000000202 R12: ffffffff800fb18b
    R13: 0000000078ffe130 R14: 000000000000eda9 R15: 0000000078ffe130
    ORIG_RAX: 00000000000000cf CS: 0033 SS: 002b

Expected results:

Test should run forever.

Additional info:

This has been reproduced on x86_64 SMP systems, including under VMware with 2 CPUs. It has not been tested on UP systems.

This seems to be fixed in 2.6.22 - running this test against stock 2.6.21 produces the same deadlock, while 2.6.22 shows no problems. I believe the fix is in this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a9df62c7585e6caa1e7d2425b2b14460ec3afc20

Comment 1 Matt Cross 2009-12-11 18:12:49 UTC
Forgot to mention that there is a workaround - if the application waits for all I/O to complete before calling io_destroy() then the deadlock does not occur.  If you run the test app as './aio_test 1 100 /tmp/testfile' it enables the workaround and does not deadlock.

Comment 2 Jeff Moyer 2010-02-02 20:17:41 UTC
Thanks for the nice write-up, Matt.  I agree that the commit you referenced would indeed solve the problem.

What happens is that your program closes the file descriptor before calling io_destroy:

  if (fd >= 0)
    close(fd);
  if (ioctx_initted)
    io_destroy(ioctx);

Because of this, the last reference to the filp will be dropped in interrupt (I/O completion) context, and the final put on the filp must be scheduled for process context.

Then, any one of the other aio_test threads can call io_destroy and it will trigger the deadlock, since io_destroy calls flush_workqueue, which takes the workqueue_mutex, and that will in turn wait for the aio_fput_routine to run, which does the last put on the ioctx, which in turn calls flush_workqueue for the aio_wq.

Comment 3 Matt Cross 2010-02-02 21:35:38 UTC
I agree, Jeff.  Implementing the workaround in my real application showed this - my application was closing the fd and calling io_destroy in two different pthread thread-specific data destructors, and it turned out that the one that did the close was running before the io_destroy.  Once I modified my application to always do the close() after the io_destroy(), the hang went away.

Let me know if you'd like me to test a fix.

Comment 4 Jeff Moyer 2010-02-03 16:55:23 UTC
I can't actually trigger this hang.  Did you test on bare metal, or just in a VM?  If the former, what is the I/O performance of your device?  While running your test, I'm getting between 150 and 200 MB/s.  Maybe my device is too fast to trigger the race?  It's also an 8 cpu system, so I tried using the same multiplier as you, 50 threads per cpu.  Still, no hang.

Comment 5 Matt Cross 2010-02-03 17:37:41 UTC
I know that I've seen the hang on bare metal with my real application, I can't recall if I reproduced it with this test code on bare metal or just on a VM.  I'll try it again on bare metal with this test code and I'll let you know what I find.

Comment 6 Jeff Moyer 2010-02-03 18:31:58 UTC
Created attachment 388587 [details]
do not call flush_workqueue from within a workqueue

A kernel with this patch applied can be found here:
  http://people.redhat.com/jmoyer/aio/rhel5/kernel-2.6.18-186.el5.jmoyer.aio.1.x86_64.rpm

Since I can't reproduce this, would you mind giving it a try?

Thanks!

Comment 7 Matt Cross 2010-02-03 21:01:13 UTC
Created attachment 388642 [details]
improved testcase

I was not able to reproduce the problem on real hardware using my original test case.  I tweaked the test case and am now able to reproduce the problem on real hardware.

Comment 8 Matt Cross 2010-02-03 21:07:28 UTC
Jeff, I am testing your kernel and I can't reproduce the problem. However, on reviewing the patch I'm not sure it's correct. Shouldn't the check in aio_fput_routine() be "if (!wq_context)"? As currently implemented, it looks like it only does the flush_workqueue() when running in the context of a workqueue handler; I think the right thing is to do the opposite. I think your patch removes the race condition by never calling flush_workqueue() from the mainline code, which is bad because items related to the io context being removed could still be on the workqueue.

Comment 9 Jeff Moyer 2010-02-03 21:21:46 UTC
Boy, how did I miss that?  Thanks for pointing out that flaw, you are definitely right.  Are you able to build kernels for testing?  If so, could you invert that logic and see if the problem is still addressed?

Thanks!

Comment 10 Matt Cross 2010-02-03 21:32:53 UTC
Sure, I'll apply that patch with the corrected logic to 164.6.1 (which is what I have handy) and let you know how it goes.

Comment 11 Jeff Moyer 2010-02-03 21:39:57 UTC
Created attachment 388656 [details]
Updated patch that fixes the logic inversion

Comment 12 Jeff Moyer 2010-02-03 21:41:25 UTC
I kicked off a build with that patch, but it will be a while before I'm able to upload the kernel to my people page. In the meantime, I'd love to hear your results. Thanks for all of the help!

Comment 13 Matt Cross 2010-02-03 22:17:04 UTC
I built a kernel by correcting the original patch (I just added the ! as I described).  My test hung after 29 seconds...  I'll take a look at it tomorrow and let you know what I find.

Comment 14 Matt Cross 2010-02-04 16:56:54 UTC
Never mind, I managed to compile the kernel but not build the RPM, so I was running the old code without your patch.

I just reran the test and it looks good.  I ran this kernel: http://people.redhat.com/jmoyer/aio/rhel5/kernel-2.6.18-186.el5.jmoyer.aio.2.x86_64.rpm

Comment 16 RHEL Program Management 2010-05-20 12:42:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Jarod Wilson 2010-09-27 19:11:20 UTC
in kernel-2.6.18-225.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 22 errata-xmlrpc 2011-01-13 20:57:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html